* [PATCH V2 0/3] slub: introducing detached freelist
@ 2015-08-24  0:58 ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2015-08-24  0:58 UTC (permalink / raw)
  To: linux-mm, Christoph Lameter, akpm
  Cc: aravinda, iamjoonsoo.kim, Paul E. McKenney, linux-kernel,
	Jesper Dangaard Brouer

REPOST:
 * Only updated comment in patch01 per request of Christoph Lameter.
 * No other objections have been made
 * Prev post: http://thread.gmane.org/gmane.linux.kernel.mm/135704

New use-cases for this API are RCU free (and still network NICs).

Introducing what I call a detached freelist, for improving the
performance of object freeing in the "slowpath" of kmem_cache_free_bulk,
which calls __slab_free().

The benchmarking tools are available here:
 https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm
 See: slab_bulk_test0{1,2,3}.c
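
For orientation, a minimal sketch of the call pattern these test
modules exercise (the cache, the NR_OBJS count and the function name
are made up here, and error handling is omitted; the real tests live
in the repository above):

  #include <linux/slab.h>

  #define NR_OBJS 64		/* hypothetical bulk size */

  static void example_bulk_free(struct kmem_cache *cache)
  {
  	void *objs[NR_OBJS];
  	int i;

  	/* Allocate a batch of objects from the cache */
  	for (i = 0; i < NR_OBJS; i++)
  		objs[i] = kmem_cache_alloc(cache, GFP_KERNEL);

  	/* ... use the objects ... */

  	/* Free the whole batch in one call; it is the slowpath of
  	 * this function that the detached freelist speeds up.
  	 */
  	kmem_cache_free_bulk(cache, NR_OBJS, objs);
  }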

Compared against the existing bulk API (in AKPM's tree), we see a small
regression for small-size bulking (between 2-5 cycles), but a huge
improvement for the slowpath.

bulk- Bulk-API-before           - Bulk-API with patchset
  1 -  42 cycles(tsc) 10.520 ns - 47 cycles(tsc) 11.931 ns - improved -11.9%
  2 -  26 cycles(tsc)  6.697 ns - 29 cycles(tsc)  7.368 ns - improved -11.5%
  3 -  22 cycles(tsc)  5.589 ns - 24 cycles(tsc)  6.003 ns - improved -9.1%
  4 -  19 cycles(tsc)  4.921 ns - 22 cycles(tsc)  5.543 ns - improved -15.8%
  8 -  17 cycles(tsc)  4.499 ns - 20 cycles(tsc)  5.047 ns - improved -17.6%
 16 -  69 cycles(tsc) 17.424 ns - 20 cycles(tsc)  5.015 ns - improved 71.0%
 30 -  88 cycles(tsc) 22.075 ns - 20 cycles(tsc)  5.062 ns - improved 77.3%
 32 -  83 cycles(tsc) 20.965 ns - 20 cycles(tsc)  5.089 ns - improved 75.9%
 34 -  80 cycles(tsc) 20.039 ns - 28 cycles(tsc)  7.006 ns - improved 65.0%
 48 -  76 cycles(tsc) 19.252 ns - 31 cycles(tsc)  7.755 ns - improved 59.2%
 64 -  86 cycles(tsc) 21.523 ns - 68 cycles(tsc) 17.203 ns - improved 20.9%
128 -  97 cycles(tsc) 24.444 ns - 72 cycles(tsc) 18.195 ns - improved 25.8%
158 -  96 cycles(tsc) 24.036 ns - 73 cycles(tsc) 18.372 ns - improved 24.0%
250 - 100 cycles(tsc) 25.007 ns - 73 cycles(tsc) 18.430 ns - improved 27.0%

The patchset is based on top of commit aefbef10e3ae with the previously
accepted bulk patchset (V2) applied (available in AKPM's quilt).

Small note: the benchmark was run with a kernel compiled with
CONFIG_FTRACE, in order to use perf probes to measure the amount of
page bulking into __slab_free(), while running the "worst-case"
testing module slab_bulk_test03.c.

---

Jesper Dangaard Brouer (3):
      slub: extend slowpath __slab_free() to handle bulk free
      slub: optimize bulk slowpath free by detached freelist
      slub: build detached freelist with look-ahead


 mm/slub.c |  142 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 112 insertions(+), 30 deletions(-)

--

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH V2 1/3] slub: extend slowpath __slab_free() to handle bulk free
  2015-08-24  0:58 ` Jesper Dangaard Brouer
@ 2015-08-24  0:58   ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2015-08-24  0:58 UTC (permalink / raw)
  To: linux-mm, Christoph Lameter, akpm
  Cc: aravinda, iamjoonsoo.kim, Paul E. McKenney, linux-kernel,
	Jesper Dangaard Brouer

Make it possible to free a freelist with several objects by extending
__slab_free() with two arguments: a freelist_head pointer and an object
counter (cnt).  If the freelist_head pointer is set, then the object
argument must be the freelist tail pointer.

This allows a freelist with several objects (all within the same
slab-page) to be freed using a single locked cmpxchg_double.
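
To make the calling convention concrete, here is a schematic sketch
(not standalone code; set_freepointer() and __slab_free() are internal
to mm/slub.c, and o1..o3 are hypothetical objects from the same slab
page):

  /* Build a freelist o3 -> o2 -> o1 through the objects themselves;
   * o1 is the tail, o3 the head.
   */
  set_freepointer(s, o1, NULL);
  set_freepointer(s, o2, o1);
  set_freepointer(s, o3, o2);

  /* Pass the tail as the object argument and the head as
   * freelist_head; all three objects are released with one locked
   * cmpxchg_double.
   */
  __slab_free(s, page, o1, _RET_IP_, o3, 3);

  /* A single-object free keeps the old behaviour: */
  __slab_free(s, page, o1, _RET_IP_, NULL, 1);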

Micro benchmarking showed no performance reduction due to this change.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

---
V2: Per request of Christoph Lameter
 * Made it more clear that freelist objs must exist within same page

 mm/slub.c |   16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index c9305f525004..10b57a3bb895 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2573,9 +2573,14 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace);
  * So we still attempt to reduce cache line usage. Just take the slab
  * lock and free the item. If there is no additional partial page
  * handling required then we can return immediately.
+ *
+ * Bulk free of a freelist with several objects (all pointing to the
+ * same page) possible by specifying freelist_head ptr and object as
+ * tail ptr, plus objects count (cnt).
  */
 static void __slab_free(struct kmem_cache *s, struct page *page,
-			void *x, unsigned long addr)
+			void *x, unsigned long addr,
+			void *freelist_head, int cnt)
 {
 	void *prior;
 	void **object = (void *)x;
@@ -2584,6 +2589,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 	unsigned long counters;
 	struct kmem_cache_node *n = NULL;
 	unsigned long uninitialized_var(flags);
+	void *new_freelist = (!freelist_head) ? object : freelist_head;
 
 	stat(s, FREE_SLOWPATH);
 
@@ -2601,7 +2607,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 		set_freepointer(s, object, prior);
 		new.counters = counters;
 		was_frozen = new.frozen;
-		new.inuse--;
+		new.inuse -= cnt;
 		if ((!new.inuse || !prior) && !was_frozen) {
 
 			if (kmem_cache_has_cpu_partial(s) && !prior) {
@@ -2632,7 +2638,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 
 	} while (!cmpxchg_double_slab(s, page,
 		prior, counters,
-		object, new.counters,
+		new_freelist, new.counters,
 		"__slab_free"));
 
 	if (likely(!n)) {
@@ -2736,7 +2742,7 @@ redo:
 		}
 		stat(s, FREE_FASTPATH);
 	} else
-		__slab_free(s, page, x, addr);
+		__slab_free(s, page, x, addr, NULL, 1);
 
 }
 
@@ -2780,7 +2786,7 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 			c->tid = next_tid(c->tid);
 			local_irq_enable();
 			/* Slowpath: overhead locked cmpxchg_double_slab */
-			__slab_free(s, page, object, _RET_IP_);
+			__slab_free(s, page, object, _RET_IP_, NULL, 1);
 			local_irq_disable();
 			c = this_cpu_ptr(s->cpu_slab);
 		}


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH V2 2/3] slub: optimize bulk slowpath free by detached freelist
  2015-08-24  0:58 ` Jesper Dangaard Brouer
@ 2015-08-24  0:59   ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2015-08-24  0:59 UTC (permalink / raw)
  To: linux-mm, Christoph Lameter, akpm
  Cc: aravinda, iamjoonsoo.kim, Paul E. McKenney, linux-kernel,
	Jesper Dangaard Brouer

This change focuses on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk.

The slowpath call __slab_free() has been extended with support for
bulk free, which amortizes the overhead of the locked cmpxchg_double_slab.

To use the new bulking feature of __slab_free(), we build what I call
a detached freelist.  The detached freelist takes advantage of three
properties:

 1) the free function call owns the object that is about to be freed,
    thus writing into this memory is synchronization-free.

 2) many freelists can co-exist side-by-side in the same page, each
    with a separate head pointer.

 3) it is the visibility of the head pointer that needs synchronization.

Given these properties, the brilliant part is that the detached
freelist can be constructed without any need for synchronization: it
is built directly in the page's objects, while the freelist head
pointer is kept on the stack of the kmem_cache_free_bulk call and is
therefore not visible to other CPUs.

This implementation is fairly simple, as it only builds the detached
freelist while consecutive objects belong to the same page.  When it
detects that an object's page does not match, it simply flushes the
local freelist and starts a new local detached freelist.  It does not
look ahead to see whether further opportunities exist later in the array.
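
As an illustration (A1..A3 and B1 are hypothetical objects residing in
slab pages A and B, neither page being the CPU's active page):

  p[] = { A1, A2, B1, A3 }

  A1, A2 : same page, appended to the local detached freelist
  B1     : page changed -> flush {A2,A1} via __slab_free(), start new list
  A3     : page changed -> flush {B1} via __slab_free(), start new list
  end    : flush {A3}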

The next patch has a more advanced look-ahead approach, but it is also
more complicated.  They are split up because I want to be able to
benchmark the simple approach against the advanced one.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

---
bulk- Fallback                  - Bulk API
  1 -  64 cycles(tsc) 16.109 ns - 47 cycles(tsc) 11.894 - improved 26.6%
  2 -  56 cycles(tsc) 14.158 ns - 45 cycles(tsc) 11.274 - improved 19.6%
  3 -  54 cycles(tsc) 13.650 ns - 23 cycles(tsc)  6.001 - improved 57.4%
  4 -  53 cycles(tsc) 13.268 ns - 21 cycles(tsc)  5.262 - improved 60.4%
  8 -  51 cycles(tsc) 12.841 ns - 18 cycles(tsc)  4.718 - improved 64.7%
 16 -  50 cycles(tsc) 12.583 ns - 19 cycles(tsc)  4.896 - improved 62.0%
 30 -  85 cycles(tsc) 21.357 ns - 26 cycles(tsc)  6.549 - improved 69.4%
 32 -  82 cycles(tsc) 20.690 ns - 25 cycles(tsc)  6.412 - improved 69.5%
 34 -  81 cycles(tsc) 20.322 ns - 25 cycles(tsc)  6.365 - improved 69.1%
 48 -  93 cycles(tsc) 23.332 ns - 28 cycles(tsc)  7.139 - improved 69.9%
 64 -  98 cycles(tsc) 24.544 ns - 62 cycles(tsc) 15.543 - improved 36.7%
128 -  96 cycles(tsc) 24.219 ns - 68 cycles(tsc) 17.143 - improved 29.2%
158 - 107 cycles(tsc) 26.817 ns - 69 cycles(tsc) 17.431 - improved 35.5%
250 - 107 cycles(tsc) 26.824 ns - 70 cycles(tsc) 17.730 - improved 34.6%
---
 mm/slub.c |   48 +++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 41 insertions(+), 7 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 10b57a3bb895..40e4b5926311 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2756,12 +2756,26 @@ void kmem_cache_free(struct kmem_cache *s, void *x)
 }
 EXPORT_SYMBOL(kmem_cache_free);
 
+struct detached_freelist {
+	struct page *page;
+	void *freelist;
+	void *tail_object;
+	int cnt;
+};
+
 /* Note that interrupts must be enabled when calling this function. */
 void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 {
 	struct kmem_cache_cpu *c;
 	struct page *page;
 	int i;
+	/* Opportunistically delay updating page->freelist, hoping
+	 * next free happen to same page.  Start building the freelist
+	 * in the page, but keep local stack ptr to freelist.  If
+	 * successful several object can be transferred to page with a
+	 * single cmpxchg_double.
+	 */
+	struct detached_freelist df = {0};
 
 	local_irq_disable();
 	c = this_cpu_ptr(s->cpu_slab);
@@ -2778,22 +2792,42 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 
 		page = virt_to_head_page(object);
 
-		if (c->page == page) {
+		if (page == df.page) {
+			/* Oppotunity to delay real free */
+			set_freepointer(s, object, df.freelist);
+			df.freelist = object;
+			df.cnt++;
+		} else if (c->page == page) {
 			/* Fastpath: local CPU free */
 			set_freepointer(s, object, c->freelist);
 			c->freelist = object;
 		} else {
-			c->tid = next_tid(c->tid);
-			local_irq_enable();
-			/* Slowpath: overhead locked cmpxchg_double_slab */
-			__slab_free(s, page, object, _RET_IP_, NULL, 1);
-			local_irq_disable();
-			c = this_cpu_ptr(s->cpu_slab);
+			/* Slowpath: Flush delayed free */
+			if (df.page) {
+				c->tid = next_tid(c->tid);
+				local_irq_enable();
+				__slab_free(s, df.page, df.tail_object,
+					    _RET_IP_, df.freelist, df.cnt);
+				local_irq_disable();
+				c = this_cpu_ptr(s->cpu_slab);
+			}
+			/* Start new round of delayed free */
+			df.page = page;
+			df.tail_object = object;
+			set_freepointer(s, object, NULL);
+			df.freelist = object;
+			df.cnt = 1;
 		}
 	}
 exit:
 	c->tid = next_tid(c->tid);
 	local_irq_enable();
+
+	/* Flush detached freelist */
+	if (df.page) {
+		__slab_free(s, df.page, df.tail_object,
+			    _RET_IP_, df.freelist, df.cnt);
+	}
 }
 EXPORT_SYMBOL(kmem_cache_free_bulk);
 


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH V2 3/3] slub: build detached freelist with look-ahead
  2015-08-24  0:58 ` Jesper Dangaard Brouer
@ 2015-08-24  0:59   ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2015-08-24  0:59 UTC (permalink / raw)
  To: linux-mm, Christoph Lameter, akpm
  Cc: aravinda, iamjoonsoo.kim, Paul E. McKenney, linux-kernel,
	Jesper Dangaard Brouer

This change is a more advanced use of the detached freelist.  The bulk
free array is scanned in a progressive manner with a limited look-ahead
facility.

To maintain the same performance level as the previous simple
implementation, the look-ahead has been limited to only 3 objects.
This number was determined by experimental micro benchmarking.
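
A small worked example of the scan (A1..A3 and B1..B2 are hypothetical
objects in slab pages A and B; look-ahead limit of 3):

  p[] = { A1, B1, A2, A3, B2 }

  Round 1: start at A1, skip B1 (look-ahead), pick up A2 and A3
           -> detached freelist {A3,A2,A1} for page A
  Round 2: start at B1, skip the already-cleared slots, pick up B2
           -> detached freelist {B2,B1} for page B

This needs two synchronization operations, where the simple approach
in the previous patch would have flushed four times for the same array.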

For performance, the free loop in kmem_cache_free_bulk has been
significantly reorganized, with a focus on making the branches more
predictable for the compiler.  E.g. the per-CPU c->freelist is also
built as a detached freelist, even though freeing directly to it would
be just as fast, because this saves creating an unpredictable branch.

Another benefit of this change is that kmem_cache_free_bulk() runs
mostly with IRQs enabled.  The local IRQs are only disabled when
updating the per CPU c->freelist.  This should please Thomas Gleixner.

Pitfall(1): Removed kmem debug support.

Pitfall(2): No BUG_ON() freeing NULL pointers, but the algorithm
            handles and skips these NULL pointers.

Comparison against the previous patch:
 There is some fluctuation in the benchmarks between runs.  To counter
this I've run some specific[1] bulk sizes, repeated 100 times, and run
dmesg through Rusty's "stats"[2] tool.

Command line:
  sudo dmesg -c ;\
  for x in `seq 100`; do \
    modprobe slab_bulk_test02 bulksz=48 loops=100000 && rmmod slab_bulk_test02; \
    echo $x; \
    sleep 0.${RANDOM} ;\
  done; \
  dmesg | stats

Results:

bulk size:16, average: +2.01 cycles
 Prev: between 19-52 (average: 22.65 stddev:+/-6.9)
 This: between 19-67 (average: 24.67 stddev:+/-9.9)

bulk size:48, average: +1.54 cycles
 Prev: between 23-45 (average: 27.88 stddev:+/-4)
 This: between 24-41 (average: 29.42 stddev:+/-3.7)

bulk size:144, average: +1.73 cycles
 Prev: between 44-76 (average: 60.31 stddev:+/-7.7)
 This: between 49-80 (average: 62.04 stddev:+/-7.3)

bulk size:512, average: +8.94 cycles
 Prev: between 50-68 (average: 60.11 stddev: +/-4.3)
 This: between 56-80 (average: 69.05 stddev: +/-5.2)

bulk size:2048, average: +26.81 cycles
 Prev: between 61-73 (average: 68.10 stddev:+/-2.9)
 This: between 90-104(average: 94.91 stddev:+/-2.1)

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test02.c
[2] https://github.com/rustyrussell/stats

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

---
bulk- Fallback                  - Bulk API
  1 -  64 cycles(tsc) 16.144 ns - 47 cycles(tsc) 11.931 - improved 26.6%
  2 -  57 cycles(tsc) 14.397 ns - 29 cycles(tsc)  7.368 - improved 49.1%
  3 -  55 cycles(tsc) 13.797 ns - 24 cycles(tsc)  6.003 - improved 56.4%
  4 -  53 cycles(tsc) 13.500 ns - 22 cycles(tsc)  5.543 - improved 58.5%
  8 -  52 cycles(tsc) 13.008 ns - 20 cycles(tsc)  5.047 - improved 61.5%
 16 -  51 cycles(tsc) 12.763 ns - 20 cycles(tsc)  5.015 - improved 60.8%
 30 -  50 cycles(tsc) 12.743 ns - 20 cycles(tsc)  5.062 - improved 60.0%
 32 -  51 cycles(tsc) 12.908 ns - 20 cycles(tsc)  5.089 - improved 60.8%
 34 -  87 cycles(tsc) 21.936 ns - 28 cycles(tsc)  7.006 - improved 67.8%
 48 -  79 cycles(tsc) 19.840 ns - 31 cycles(tsc)  7.755 - improved 60.8%
 64 -  86 cycles(tsc) 21.669 ns - 68 cycles(tsc) 17.203 - improved 20.9%
128 - 101 cycles(tsc) 25.340 ns - 72 cycles(tsc) 18.195 - improved 28.7%
158 - 112 cycles(tsc) 28.152 ns - 73 cycles(tsc) 18.372 - improved 34.8%
250 - 110 cycles(tsc) 27.727 ns - 73 cycles(tsc) 18.430 - improved 33.6%
---
 mm/slub.c |  138 ++++++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 90 insertions(+), 48 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 40e4b5926311..49ae96f45670 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2763,71 +2763,113 @@ struct detached_freelist {
 	int cnt;
 };
 
-/* Note that interrupts must be enabled when calling this function. */
-void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
+/*
+ * This function extract objects belonging to the same page, and
+ * builds a detached freelist directly within the given page/objects.
+ * This can happen without any need for synchronization, because the
+ * objects are owned by running process.  The freelist is build up as
+ * a single linked list in the objects.  The idea is, that this
+ * detached freelist can then be bulk transferred to the real
+ * freelist(s), but only requiring a single synchronization primitive.
+ */
+static inline int build_detached_freelist(
+	struct kmem_cache *s, size_t size, void **p,
+	struct detached_freelist *df, int start_index)
 {
-	struct kmem_cache_cpu *c;
 	struct page *page;
 	int i;
-	/* Opportunistically delay updating page->freelist, hoping
-	 * next free happen to same page.  Start building the freelist
-	 * in the page, but keep local stack ptr to freelist.  If
-	 * successful several object can be transferred to page with a
-	 * single cmpxchg_double.
-	 */
-	struct detached_freelist df = {0};
+	int lookahead = 0;
+	void *object;
 
-	local_irq_disable();
-	c = this_cpu_ptr(s->cpu_slab);
+	/* Always re-init detached_freelist */
+	do {
+		object = p[start_index];
+		if (object) {
+			/* Start new delayed freelist */
+			df->page = virt_to_head_page(object);
+			df->tail_object = object;
+			set_freepointer(s, object, NULL);
+			df->freelist = object;
+			df->cnt = 1;
+			p[start_index] = NULL; /* mark object processed */
+		} else {
+			df->page = NULL; /* Handle NULL ptr in array */
+		}
+		start_index++;
+	} while (!object && start_index < size);
 
-	for (i = 0; i < size; i++) {
-		void *object = p[i];
+	for (i = start_index; i < size; i++) {
+		object = p[i];
 
-		BUG_ON(!object);
-		/* kmem cache debug support */
-		s = cache_from_obj(s, object);
-		if (unlikely(!s))
-			goto exit;
-		slab_free_hook(s, object);
+		if (!object)
+			continue; /* Skip processed objects */
 
 		page = virt_to_head_page(object);
 
-		if (page == df.page) {
-			/* Oppotunity to delay real free */
-			set_freepointer(s, object, df.freelist);
-			df.freelist = object;
-			df.cnt++;
-		} else if (c->page == page) {
-			/* Fastpath: local CPU free */
-			set_freepointer(s, object, c->freelist);
-			c->freelist = object;
+		/* df->page is always set at this point */
+		if (page == df->page) {
+			/* Oppotunity build freelist */
+			set_freepointer(s, object, df->freelist);
+			df->freelist = object;
+			df->cnt++;
+			p[i] = NULL; /* mark object processed */
+			if (!lookahead)
+				start_index++;
 		} else {
-			/* Slowpath: Flush delayed free */
-			if (df.page) {
+			/* Limit look ahead search */
+			if (++lookahead >= 3)
+				return start_index;
+			continue;
+		}
+	}
+	return start_index;
+}
+
+/* Note that interrupts must be enabled when calling this function. */
+void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
+{
+	struct kmem_cache_cpu *c;
+	int iterator = 0;
+	struct detached_freelist df;
+
+	BUG_ON(!size);
+
+	/* Per CPU ptr may change afterwards */
+	c = this_cpu_ptr(s->cpu_slab);
+
+	while (likely(iterator < size)) {
+		iterator = build_detached_freelist(s, size, p, &df, iterator);
+		if (likely(df.page)) {
+		redo:
+			if (c->page == df.page) {
+				/*
+				 * Local CPU free require disabling
+				 * IRQs.  It is possible to miss the
+				 * oppotunity and instead free to
+				 * page->freelist, but it does not
+				 * matter as page->freelist will
+				 * eventually be transferred to
+				 * c->freelist
+				 */
+				local_irq_disable();
+				c = this_cpu_ptr(s->cpu_slab); /* reload */
+				if (c->page != df.page) {
+					local_irq_enable();
+					goto redo;
+				}
+				/* Bulk transfer to CPU c->freelist */
+				set_freepointer(s, df.tail_object, c->freelist);
+				c->freelist = df.freelist;
+
 				c->tid = next_tid(c->tid);
 				local_irq_enable();
+			} else {
+				/* Bulk transfer to page->freelist */
 				__slab_free(s, df.page, df.tail_object,
 					    _RET_IP_, df.freelist, df.cnt);
-				local_irq_disable();
-				c = this_cpu_ptr(s->cpu_slab);
 			}
-			/* Start new round of delayed free */
-			df.page = page;
-			df.tail_object = object;
-			set_freepointer(s, object, NULL);
-			df.freelist = object;
-			df.cnt = 1;
 		}
 	}
-exit:
-	c->tid = next_tid(c->tid);
-	local_irq_enable();
-
-	/* Flush detached freelist */
-	if (df.page) {
-		__slab_free(s, df.page, df.tail_object,
-			    _RET_IP_, df.freelist, df.cnt);
-	}
 }
 EXPORT_SYMBOL(kmem_cache_free_bulk);
 


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
  2015-08-24  0:58 ` Jesper Dangaard Brouer
@ 2015-09-04 17:00   ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-04 17:00 UTC (permalink / raw)
  To: netdev, akpm
  Cc: linux-mm, Jesper Dangaard Brouer, aravinda, Christoph Lameter,
	Paul E. McKenney, iamjoonsoo.kim

During TX DMA completion cleanup there exists an opportunity in the NIC
drivers to perform bulk free, without introducing additional latency.

For an IPv4 forwarding workload the network stack is hitting the
slowpath of the kmem_cache "slub" allocator.  This slowpath can be
mitigated by bulk free via the detached freelists patchset.

Depends on patchset:
 http://thread.gmane.org/gmane.linux.kernel.mm/137469

Kernel based on MMOTM tag 2015-08-24-16-12 from git repo:
 git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
 Also contains Christoph's patch "slub: Avoid irqoff/on in bulk allocation"


Benchmarking: Single CPU IPv4 forwarding UDP (generator pktgen):
 * Before: 2043575 pps
 * After : 2090522 pps
 * Improvements: +46947 pps and -10.99 ns

In the before case, perf report shows slub free hits the slowpath:
 1.98%  ksoftirqd/6  [kernel.vmlinux]  [k] __slab_free.isra.72
 1.29%  ksoftirqd/6  [kernel.vmlinux]  [k] cmpxchg_double_slab.isra.71
 0.95%  ksoftirqd/6  [kernel.vmlinux]  [k] kmem_cache_free
 0.95%  ksoftirqd/6  [kernel.vmlinux]  [k] kmem_cache_alloc
 0.20%  ksoftirqd/6  [kernel.vmlinux]  [k] __cmpxchg_double_slab.isra.60
 0.17%  ksoftirqd/6  [kernel.vmlinux]  [k] ___slab_alloc.isra.68
 0.09%  ksoftirqd/6  [kernel.vmlinux]  [k] __slab_alloc.isra.69

After, the slowpath calls are almost gone:
 0.22%  ksoftirqd/6  [kernel.vmlinux]  [k] __cmpxchg_double_slab.isra.60
 0.18%  ksoftirqd/6  [kernel.vmlinux]  [k] ___slab_alloc.isra.68
 0.14%  ksoftirqd/6  [kernel.vmlinux]  [k] __slab_free.isra.72
 0.14%  ksoftirqd/6  [kernel.vmlinux]  [k] cmpxchg_double_slab.isra.71
 0.08%  ksoftirqd/6  [kernel.vmlinux]  [k] __slab_alloc.isra.69


Extra info, tuning SLUB per CPU structures gives further improvements:
 * slub-tuned: 2124217 pps
 * patched increase: +33695 pps and  -7.59 ns
 * before  increase: +80642 pps and -18.58 ns

Tuning done:
 echo 256 > /sys/kernel/slab/skbuff_head_cache/cpu_partial
 echo 9   > /sys/kernel/slab/skbuff_head_cache/min_partial

Without SLUB tuning, the same performance comes with the kernel cmdline "slab_nomerge":
 * slab_nomerge: 2121824 pps

Test notes:
 * Notice very fast CPU i7-4790K CPU @ 4.00GHz
 * gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC)
 * kernel 4.1.0-mmotm-2015-08-24-16-12+ #271 SMP
 * Generator pktgen UDP single flow (pktgen_sample03_burst_single_flow.sh)
 * Tuned for forwarding:
  - unloaded netfilter modules
  - Sysctl settings:
  - net/ipv4/conf/default/rp_filter = 0
  - net/ipv4/conf/all/rp_filter = 0
  - (Forwarding performance is affected by early demux)
  - net/ipv4/ip_early_demux = 0
  - net.ipv4.ip_forward = 1
  - Disabled GRO on NICs
  - ethtool -K ixgbe3 gro off tso off gso off

---

Jesper Dangaard Brouer (3):
      net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk()
      net: NIC helper API for building array of skbs to free
      ixgbe: bulk free SKBs during TX completion cleanup cycle


 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   13 +++-
 include/linux/netdevice.h                     |   62 ++++++++++++++++++
 include/linux/skbuff.h                        |    1 
 net/core/skbuff.c                             |   87 ++++++++++++++++++++-----
 4 files changed, 144 insertions(+), 19 deletions(-)

--

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk()
  2015-09-04 17:00   ` Jesper Dangaard Brouer
  (?)
@ 2015-09-04 17:00   ` Jesper Dangaard Brouer
  2015-09-04 18:47     ` Tom Herbert
  2015-09-08 21:01     ` David Miller
  -1 siblings, 2 replies; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-04 17:00 UTC (permalink / raw)
  To: netdev, akpm
  Cc: linux-mm, Jesper Dangaard Brouer, aravinda, Christoph Lameter,
	Paul E. McKenney, iamjoonsoo.kim

Introduce the first user of the SLAB bulk free API kmem_cache_free_bulk()
in the network stack, in the form of the function kfree_skb_bulk(), which
bulk frees SKBs (not skb clones or skb->head, yet).

As this is the third user of SKB reference decrementing, split out the
refcnt decrement into a helper function and use it at all call points.
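
A minimal sketch of the intended call pattern (the batch size and the
example_next_done_skb() helper are hypothetical; note that interrupts
must be enabled when calling kfree_skb_bulk()):

  struct sk_buff *skbs[32];	/* hypothetical batch size */
  struct sk_buff *skb;
  unsigned int cnt = 0;

  /* Collect SKBs this CPU is done with */
  while (cnt < ARRAY_SIZE(skbs) && (skb = example_next_done_skb()))
  	skbs[cnt++] = skb;

  /* Drop one reference per SKB and bulk free those that reached
   * zero; clones fall back to __kfree_skb() internally.
   */
  kfree_skb_bulk(skbs, cnt);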

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/skbuff.h |    1 +
 net/core/skbuff.c      |   87 +++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 71 insertions(+), 17 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b97597970ce7..e5f1e007723b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -762,6 +762,7 @@ static inline struct rtable *skb_rtable(const struct sk_buff *skb)
 }
 
 void kfree_skb(struct sk_buff *skb);
+void kfree_skb_bulk(struct sk_buff **skbs, unsigned int size);
 void kfree_skb_list(struct sk_buff *segs);
 void skb_tx_error(struct sk_buff *skb);
 void consume_skb(struct sk_buff *skb);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 429b407b4fe6..034545934158 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -661,26 +661,83 @@ void __kfree_skb(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(__kfree_skb);
 
+/*
+ *	skb_dec_and_test - Helper to drop ref to SKB and see is ready to free
+ *	@skb: buffer to decrement reference
+ *
+ *	Drop a reference to the buffer, and return true if it is ready
+ *	to free. Which is if the usage count has hit zero or is equal to 1.
+ *
+ *	This is performance critical code that should be inlined.
+ */
+static inline bool skb_dec_and_test(struct sk_buff *skb)
+{
+	if (unlikely(!skb))
+		return false;
+	if (likely(atomic_read(&skb->users) == 1))
+		smp_rmb();
+	else if (likely(!atomic_dec_and_test(&skb->users)))
+		return false;
+	/* If reaching here SKB is ready to free */
+	return true;
+}
+
 /**
  *	kfree_skb - free an sk_buff
  *	@skb: buffer to free
  *
  *	Drop a reference to the buffer and free it if the usage count has
- *	hit zero.
+ *	hit zero or is equal to 1.
  */
 void kfree_skb(struct sk_buff *skb)
 {
-	if (unlikely(!skb))
-		return;
-	if (likely(atomic_read(&skb->users) == 1))
-		smp_rmb();
-	else if (likely(!atomic_dec_and_test(&skb->users)))
-		return;
-	trace_kfree_skb(skb, __builtin_return_address(0));
-	__kfree_skb(skb);
+	if (skb_dec_and_test(skb)) {
+		trace_kfree_skb(skb, __builtin_return_address(0));
+		__kfree_skb(skb);
+	}
 }
 EXPORT_SYMBOL(kfree_skb);
 
+/**
+ *	kfree_skb_bulk - bulk free SKBs when refcnt allows to
+ *	@skbs: array of SKBs to free
+ *	@size: number of SKBs in array
+ *
+ *	If SKB refcnt allows for free, then release any auxiliary data
+ *	and then bulk free SKBs to the SLAB allocator.
+ *
+ *	Note that interrupts must be enabled when calling this function.
+ */
+void kfree_skb_bulk(struct sk_buff **skbs, unsigned int size)
+{
+	int i;
+	size_t cnt = 0;
+
+	for (i = 0; i < size; i++) {
+		struct sk_buff *skb = skbs[i];
+
+		if (!skb_dec_and_test(skb))
+			continue; /* skip skb, not ready to free */
+
+		/* Construct an array of SKBs, ready to be free'ed and
+		 * cleanup all auxiliary, before bulk free to SLAB.
+		 * For now, only handle non-cloned SKBs, related to
+		 * SLAB skbuff_head_cache
+		 */
+		if (skb->fclone == SKB_FCLONE_UNAVAILABLE) {
+			skb_release_all(skb);
+			skbs[cnt++] = skb;
+		} else {
+			/* SKB was a clone, don't handle this case */
+			__kfree_skb(skb);
+		}
+	}
+	if (likely(cnt)) {
+		kmem_cache_free_bulk(skbuff_head_cache, cnt, (void **) skbs);
+	}
+}
+EXPORT_SYMBOL(kfree_skb_bulk);
+
 void kfree_skb_list(struct sk_buff *segs)
 {
 	while (segs) {
@@ -722,14 +779,10 @@ EXPORT_SYMBOL(skb_tx_error);
  */
 void consume_skb(struct sk_buff *skb)
 {
-	if (unlikely(!skb))
-		return;
-	if (likely(atomic_read(&skb->users) == 1))
-		smp_rmb();
-	else if (likely(!atomic_dec_and_test(&skb->users)))
-		return;
-	trace_consume_skb(skb);
-	__kfree_skb(skb);
+	if (skb_dec_and_test(skb)) {
+		trace_consume_skb(skb);
+		__kfree_skb(skb);
+	}
 }
 EXPORT_SYMBOL(consume_skb);
 


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC PATCH 2/3] net: NIC helper API for building array of skbs to free
  2015-09-04 17:00   ` Jesper Dangaard Brouer
@ 2015-09-04 17:01     ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-04 17:01 UTC (permalink / raw)
  To: netdev, akpm
  Cc: linux-mm, Jesper Dangaard Brouer, aravinda, Christoph Lameter,
	Paul E. McKenney, iamjoonsoo.kim

The NIC device drivers are expected to use this small helper API when
building up an array of objects/skbs to bulk free, while (loop)
processing objects to free.  Objects to be freed later are added
(dev_free_waitlist_add) to an array, which is flushed if it runs
full.  After processing, the array is flushed (dev_free_waitlist_flush).
The array should be stored on the local stack.

Usage example: during the TX completion loop the NIC driver can replace
dev_consume_skb_any() with an "add", and after the loop do a "flush".
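
A rough sketch of that usage in a TX completion routine (BULK_FREE_MAX,
the ring structure and the completion iteration are made up for
illustration):

  #define BULK_FREE_MAX 32	/* preferably a builtin constant */

  static void example_clean_tx_irq(struct example_tx_ring *ring)
  {
  	struct sk_buff *skbs[BULK_FREE_MAX];
  	struct dev_free_waitlist wl = { .skbs = skbs, .skb_cnt = 0 };
  	struct example_tx_buffer *buf;

  	while ((buf = example_next_completed(ring))) {
  		/* was: dev_consume_skb_any(buf->skb); */
  		dev_free_waitlist_add(&wl, buf->skb, BULK_FREE_MAX);
  		buf->skb = NULL;
  	}

  	/* Release whatever is still pending on the waitlist */
  	dev_free_waitlist_flush(&wl);
  }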

For performance reasons the compiler should inline most of these
functions.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/netdevice.h |   62 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 05b9a694e213..d0133e778314 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2935,6 +2935,68 @@ static inline void dev_consume_skb_any(struct sk_buff *skb)
 	__dev_kfree_skb_any(skb, SKB_REASON_CONSUMED);
 }
 
+/* The NIC device drivers are expected to use this small helper API,
+ * when building up an array of objects/skbs to bulk free, while
+ * (loop) processing objects to free.  Objects to be free'ed later is
+ * added (dev_free_waitlist_add) to an array and flushed if the array
+ * runs full.  After processing the array is flushed (dev_free_waitlist_flush).
+ * The array should be stored on the local stack.
+ *
+ * Usage e.g. during TX completion loop the NIC driver can replace
+ * dev_consume_skb_any() with an "add" and after the loop a "flush".
+ *
+ * For performance reasons the compiler should inline most of these
+ * functions.
+ */
+struct dev_free_waitlist {
+	struct sk_buff **skbs;
+	unsigned int skb_cnt;
+};
+
+static void __dev_free_waitlist_bulkfree(struct dev_free_waitlist *wl)
+{
+	/* Cannot bulk free from interrupt context or with IRQs
+	 * disabled, due to how SLAB bulk API works (and gain it's
+	 * speedup).  This can e.g. happen due to invocation from
+	 * netconsole/netpoll.
+	 */
+	if (unlikely(in_irq() || irqs_disabled())) {
+		int i;
+
+		for (i = 0; i < wl->skb_cnt; i++)
+			dev_consume_skb_irq(wl->skbs[i]);
+	} else {
+		/* Likely fastpath, don't call with cnt == 0 */
+		kfree_skb_bulk(wl->skbs, wl->skb_cnt);
+	}
+}
+
+static inline void dev_free_waitlist_flush(struct dev_free_waitlist *wl)
+{
+	/* Flush the waitlist, but only if any objects remain, as bulk
+	 * freeing "zero" objects is not supported and plus it avoids
+	 * pointless function calls.
+	 */
+	if (likely(wl->skb_cnt))
+		__dev_free_waitlist_bulkfree(wl);
+}
+
+static __always_inline void dev_free_waitlist_add(struct dev_free_waitlist *wl,
+						  struct sk_buff *skb,
+						  unsigned int max)
+{
+	/* It is recommended that max is a builtin constant, as this
+	 * saves one register when inlined. Catch offenders with:
+	 * BUILD_BUG_ON(!__builtin_constant_p(max));
+	 */
+	wl->skbs[wl->skb_cnt++] = skb;
+	if (wl->skb_cnt == max) {
+		/* Detect when waitlist array is full, then flush and reset */
+		__dev_free_waitlist_bulkfree(wl);
+		wl->skb_cnt = 0;
+	}
+}
+
 int netif_rx(struct sk_buff *skb);
 int netif_rx_ni(struct sk_buff *skb);
 int netif_receive_skb_sk(struct sock *sk, struct sk_buff *skb);

^ permalink raw reply related	[flat|nested] 39+ messages in thread


* [RFC PATCH 3/3] ixgbe: bulk free SKBs during TX completion cleanup cycle
  2015-09-04 17:00   ` Jesper Dangaard Brouer
                     ` (2 preceding siblings ...)
  (?)
@ 2015-09-04 17:01   ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-04 17:01 UTC (permalink / raw)
  To: netdev, akpm
  Cc: linux-mm, Jesper Dangaard Brouer, aravinda, Christoph Lameter,
	Paul E. McKenney, iamjoonsoo.kim

First user of the SKB bulk free API, namely kfree_skb_bulk(), via the
waitlist add-and-flush helper API.

There is an opportunity to bulk free SKBs while reclaiming resources
after DMA transmit completes in ixgbe_clean_tx_irq.  Thus, bulk freeing
at this point does not introduce any added latency.  Bulk size 32 is
chosen even though the budget usually is 64, (1) to limit the stack
usage (32 pointers take 256 bytes on a 64-bit kernel) and (2) because
the SLAB behind SKBs holds 32 objects per slab.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 463ff47200f1..d35d6b47bae2 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1075,6 +1075,7 @@ static void ixgbe_tx_timeout_reset(struct ixgbe_adapter *adapter)
  * @q_vector: structure containing interrupt and ring information
  * @tx_ring: tx ring to clean
  **/
+#define BULK_FREE_SIZE 32
 static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 			       struct ixgbe_ring *tx_ring)
 {
@@ -1084,6 +1085,11 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 	unsigned int total_bytes = 0, total_packets = 0;
 	unsigned int budget = q_vector->tx.work_limit;
 	unsigned int i = tx_ring->next_to_clean;
+	struct sk_buff *skbs[BULK_FREE_SIZE];
+	struct dev_free_waitlist wl;
+
+	wl.skb_cnt = 0;
+	wl.skbs = skbs;
 
 	if (test_bit(__IXGBE_DOWN, &adapter->state))
 		return true;
@@ -1113,8 +1119,8 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 		total_bytes += tx_buffer->bytecount;
 		total_packets += tx_buffer->gso_segs;
 
-		/* free the skb */
-		dev_consume_skb_any(tx_buffer->skb);
+		/* delay skb free and bulk free later */
+		dev_free_waitlist_add(&wl, tx_buffer->skb, BULK_FREE_SIZE);
 
 		/* unmap skb header data */
 		dma_unmap_single(tx_ring->dev,
@@ -1164,6 +1170,8 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 		budget--;
 	} while (likely(budget));
 
+	dev_free_waitlist_flush(&wl); /* free remaining SKBs on waitlist */
+
 	i += tx_ring->count;
 	tx_ring->next_to_clean = i;
 	u64_stats_update_begin(&tx_ring->syncp);
@@ -1224,6 +1232,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 
 	return !!budget;
 }
+#undef BULK_FREE_SIZE
 
 #ifdef CONFIG_IXGBE_DCA
 static void ixgbe_update_tx_dca(struct ixgbe_adapter *adapter,


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
  2015-09-04 17:00   ` Jesper Dangaard Brouer
                     ` (3 preceding siblings ...)
  (?)
@ 2015-09-04 18:09   ` Alexander Duyck
  2015-09-04 18:55     ` Christoph Lameter
  2015-09-07  8:16     ` Jesper Dangaard Brouer
  -1 siblings, 2 replies; 39+ messages in thread
From: Alexander Duyck @ 2015-09-04 18:09 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, akpm
  Cc: linux-mm, aravinda, Christoph Lameter, Paul E. McKenney, iamjoonsoo.kim

On 09/04/2015 10:00 AM, Jesper Dangaard Brouer wrote:
> During TX DMA completion cleanup there exist an opportunity in the NIC
> drivers to perform bulk free, without introducing additional latency.
>
> For an IPv4 forwarding workload the network stack is hitting the
> slowpath of the kmem_cache "slub" allocator.  This slowpath can be
> mitigated by bulk free via the detached freelists patchset.
>
> Depend on patchset:
>   http://thread.gmane.org/gmane.linux.kernel.mm/137469
>
> Kernel based on MMOTM tag 2015-08-24-16-12 from git repo:
>   git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
>   Also contains Christoph's patch "slub: Avoid irqoff/on in bulk allocation"
>
>
> Benchmarking: Single CPU IPv4 forwarding UDP (generator pktgen):
>   * Before: 2043575 pps
>   * After : 2090522 pps
>   * Improvements: +46947 pps and -10.99 ns
>
> In the before case, perf report shows slub free hits the slowpath:
>   1.98%  ksoftirqd/6  [kernel.vmlinux]  [k] __slab_free.isra.72
>   1.29%  ksoftirqd/6  [kernel.vmlinux]  [k] cmpxchg_double_slab.isra.71
>   0.95%  ksoftirqd/6  [kernel.vmlinux]  [k] kmem_cache_free
>   0.95%  ksoftirqd/6  [kernel.vmlinux]  [k] kmem_cache_alloc
>   0.20%  ksoftirqd/6  [kernel.vmlinux]  [k] __cmpxchg_double_slab.isra.60
>   0.17%  ksoftirqd/6  [kernel.vmlinux]  [k] ___slab_alloc.isra.68
>   0.09%  ksoftirqd/6  [kernel.vmlinux]  [k] __slab_alloc.isra.69
>
> After the slowpath calls are almost gone:
>   0.22%  ksoftirqd/6  [kernel.vmlinux]  [k] __cmpxchg_double_slab.isra.60
>   0.18%  ksoftirqd/6  [kernel.vmlinux]  [k] ___slab_alloc.isra.68
>   0.14%  ksoftirqd/6  [kernel.vmlinux]  [k] __slab_free.isra.72
>   0.14%  ksoftirqd/6  [kernel.vmlinux]  [k] cmpxchg_double_slab.isra.71
>   0.08%  ksoftirqd/6  [kernel.vmlinux]  [k] __slab_alloc.isra.69
>
>
> Extra info, tuning SLUB per CPU structures gives further improvements:
>   * slub-tuned: 2124217 pps
>   * patched increase: +33695 pps and  -7.59 ns
>   * before  increase: +80642 pps and -18.58 ns
>
> Tuning done:
>   echo 256 > /sys/kernel/slab/skbuff_head_cache/cpu_partial
>   echo 9   > /sys/kernel/slab/skbuff_head_cache/min_partial
>
> Without SLUB tuning, same performance comes with kernel cmdline "slab_nomerge":
>   * slab_nomerge: 2121824 pps
>
> Test notes:
>   * Notice very fast CPU i7-4790K CPU @ 4.00GHz
>   * gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC)
>   * kernel 4.1.0-mmotm-2015-08-24-16-12+ #271 SMP
>   * Generator pktgen UDP single flow (pktgen_sample03_burst_single_flow.sh)
>   * Tuned for forwarding:
>    - unloaded netfilter modules
>    - Sysctl settings:
>    - net/ipv4/conf/default/rp_filter = 0
>    - net/ipv4/conf/all/rp_filter = 0
>    - (Forwarding performance is affected by early demux)
>    - net/ipv4/ip_early_demux = 0
>    - net.ipv4.ip_forward = 1
>    - Disabled GRO on NICs
>    - ethtool -K ixgbe3 gro off tso off gso off
>
> ---

This is an interesting start.  However I feel like it might work better 
if you were to create a per-cpu pool for skbs that could be freed and 
allocated in NAPI context.  So for example we already have 
napi_alloc_skb, why not just add a napi_free_skb and then make the array 
of objects to be freed part of a pool that could be used for either 
allocation or freeing?  If the pool runs empty you just allocate 
something like 8 or 16 new skb heads, and if you fill it you just free 
half of the list?
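
Something like this is what I have in mind; napi_free_skb() and the
pool below are made up to illustrate (only the free side is sketched,
it assumes NAPI/softirq context, and the use of skb_release_all() and
skbuff_head_cache means it would have to live in net/core/skbuff.c
next to napi_alloc_skb):

	#define NAPI_SKB_POOL_SIZE 32

	struct napi_skb_pool {
		unsigned int	count;
		struct sk_buff	*skbs[NAPI_SKB_POOL_SIZE];
	};
	static DEFINE_PER_CPU(struct napi_skb_pool, napi_skb_pool);

	/* Hypothetical: stash the skb in the per-cpu pool; when the pool
	 * fills up, bulk free half of it back to the slab allocator.
	 */
	static void napi_free_skb(struct sk_buff *skb)
	{
		struct napi_skb_pool *pool = this_cpu_ptr(&napi_skb_pool);

		skb_release_all(skb);
		pool->skbs[pool->count++] = skb;
		if (pool->count == NAPI_SKB_POOL_SIZE) {
			pool->count = NAPI_SKB_POOL_SIZE / 2;
			kmem_cache_free_bulk(skbuff_head_cache,
					     NAPI_SKB_POOL_SIZE / 2,
					     (void **)&pool->skbs[pool->count]);
		}
	}

The same pool could then feed the allocation side, refilling with a
bulk alloc of 8 or 16 heads when it runs empty.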

- Alex


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk()
  2015-09-04 17:00   ` [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk() Jesper Dangaard Brouer
@ 2015-09-04 18:47     ` Tom Herbert
  2015-09-07  8:41         ` Jesper Dangaard Brouer
  2015-09-08 21:01     ` David Miller
  1 sibling, 1 reply; 39+ messages in thread
From: Tom Herbert @ 2015-09-04 18:47 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Linux Kernel Network Developers, akpm, linux-mm, aravinda,
	Christoph Lameter, Paul E. McKenney, iamjoonsoo.kim

On Fri, Sep 4, 2015 at 10:00 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> Introduce the first user of SLAB bulk free API kmem_cache_free_bulk(),
> in the network stack in form of function kfree_skb_bulk() which bulk
> free SKBs (not skb clones or skb->head, yet).
>
> As this is the third user of SKB reference decrementing, split out
> refcnt decrement into helper function and use this in all call points.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/linux/skbuff.h |    1 +
>  net/core/skbuff.c      |   87 +++++++++++++++++++++++++++++++++++++++---------
>  2 files changed, 71 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index b97597970ce7..e5f1e007723b 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -762,6 +762,7 @@ static inline struct rtable *skb_rtable(const struct sk_buff *skb)
>  }
>
>  void kfree_skb(struct sk_buff *skb);
> +void kfree_skb_bulk(struct sk_buff **skbs, unsigned int size);
>  void kfree_skb_list(struct sk_buff *segs);
>  void skb_tx_error(struct sk_buff *skb);
>  void consume_skb(struct sk_buff *skb);
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 429b407b4fe6..034545934158 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -661,26 +661,83 @@ void __kfree_skb(struct sk_buff *skb)
>  }
>  EXPORT_SYMBOL(__kfree_skb);
>
> +/*
> + *     skb_dec_and_test - Helper to drop ref to SKB and see is ready to free
> + *     @skb: buffer to decrement reference
> + *
> + *     Drop a reference to the buffer, and return true if it is ready
> + *     to free. Which is if the usage count has hit zero or is equal to 1.
> + *
> + *     This is performance critical code that should be inlined.
> + */
> +static inline bool skb_dec_and_test(struct sk_buff *skb)
> +{
> +       if (unlikely(!skb))
> +               return false;
> +       if (likely(atomic_read(&skb->users) == 1))
> +               smp_rmb();
> +       else if (likely(!atomic_dec_and_test(&skb->users)))
> +               return false;
> +       /* If reaching here SKB is ready to free */
> +       return true;
> +}
> +
>  /**
>   *     kfree_skb - free an sk_buff
>   *     @skb: buffer to free
>   *
>   *     Drop a reference to the buffer and free it if the usage count has
> - *     hit zero.
> + *     hit zero or is equal to 1.
>   */
>  void kfree_skb(struct sk_buff *skb)
>  {
> -       if (unlikely(!skb))
> -               return;
> -       if (likely(atomic_read(&skb->users) == 1))
> -               smp_rmb();
> -       else if (likely(!atomic_dec_and_test(&skb->users)))
> -               return;
> -       trace_kfree_skb(skb, __builtin_return_address(0));
> -       __kfree_skb(skb);
> +       if (skb_dec_and_test(skb)) {
> +               trace_kfree_skb(skb, __builtin_return_address(0));
> +               __kfree_skb(skb);
> +       }
>  }
>  EXPORT_SYMBOL(kfree_skb);
>
> +/**
> + *     kfree_skb_bulk - bulk free SKBs when refcnt allows to
> + *     @skbs: array of SKBs to free
> + *     @size: number of SKBs in array
> + *
> + *     If SKB refcnt allows for free, then release any auxiliary data
> + *     and then bulk free SKBs to the SLAB allocator.
> + *
> + *     Note that interrupts must be enabled when calling this function.
> + */
> +void kfree_skb_bulk(struct sk_buff **skbs, unsigned int size)
> +{
What not pass a list of skbs (e.g. using skb->next)?

> +       int i;
> +       size_t cnt = 0;
> +
> +       for (i = 0; i < size; i++) {
> +               struct sk_buff *skb = skbs[i];
> +
> +               if (!skb_dec_and_test(skb))
> +                       continue; /* skip skb, not ready to free */
> +
> +               /* Construct an array of SKBs, ready to be free'ed and
> +                * cleanup all auxiliary, before bulk free to SLAB.
> +                * For now, only handle non-cloned SKBs, related to
> +                * SLAB skbuff_head_cache
> +                */
> +               if (skb->fclone == SKB_FCLONE_UNAVAILABLE) {
> +                       skb_release_all(skb);
> +                       skbs[cnt++] = skb;
> +               } else {
> +                       /* SKB was a clone, don't handle this case */
> +                       __kfree_skb(skb);
> +               }
> +       }
> +       if (likely(cnt)) {
> +               kmem_cache_free_bulk(skbuff_head_cache, cnt, (void **) skbs);
> +       }
> +}
> +EXPORT_SYMBOL(kfree_skb_bulk);
> +
>  void kfree_skb_list(struct sk_buff *segs)
>  {
>         while (segs) {
> @@ -722,14 +779,10 @@ EXPORT_SYMBOL(skb_tx_error);
>   */
>  void consume_skb(struct sk_buff *skb)
>  {
> -       if (unlikely(!skb))
> -               return;
> -       if (likely(atomic_read(&skb->users) == 1))
> -               smp_rmb();
> -       else if (likely(!atomic_dec_and_test(&skb->users)))
> -               return;
> -       trace_consume_skb(skb);
> -       __kfree_skb(skb);
> +       if (skb_dec_and_test(skb)) {
> +               trace_consume_skb(skb);
> +               __kfree_skb(skb);
> +       }
>  }
>  EXPORT_SYMBOL(consume_skb);
>
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
  2015-09-04 18:09   ` [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API Alexander Duyck
@ 2015-09-04 18:55     ` Christoph Lameter
  2015-09-04 20:39       ` Alexander Duyck
  2015-09-07  8:16     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 39+ messages in thread
From: Christoph Lameter @ 2015-09-04 18:55 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jesper Dangaard Brouer, netdev, akpm, linux-mm, aravinda,
	Paul E. McKenney, iamjoonsoo.kim

On Fri, 4 Sep 2015, Alexander Duyck wrote:

> were to create a per-cpu pool for skbs that could be freed and allocated in
> NAPI context.  So for example we already have napi_alloc_skb, why not just add
> a napi_free_skb and then make the array of objects to be freed part of a pool
> that could be used for either allocation or freeing?  If the pool runs empty
> you just allocate something like 8 or 16 new skb heads, and if you fill it you
> just free half of the list?

The slab allocators provide something like a per cpu pool for you to
optimize object alloc and free.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
  2015-09-04 18:55     ` Christoph Lameter
@ 2015-09-04 20:39       ` Alexander Duyck
  2015-09-04 23:45         ` Christoph Lameter
  0 siblings, 1 reply; 39+ messages in thread
From: Alexander Duyck @ 2015-09-04 20:39 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Jesper Dangaard Brouer, netdev, akpm, linux-mm, aravinda,
	Paul E. McKenney, iamjoonsoo.kim

On 09/04/2015 11:55 AM, Christoph Lameter wrote:
> On Fri, 4 Sep 2015, Alexander Duyck wrote:
>
>> were to create a per-cpu pool for skbs that could be freed and allocated in
>> NAPI context.  So for example we already have napi_alloc_skb, why not just add
>> a napi_free_skb and then make the array of objects to be freed part of a pool
>> that could be used for either allocation or freeing?  If the pool runs empty
>> you just allocate something like 8 or 16 new skb heads, and if you fill it you
>> just free half of the list?
> The slab allocators provide something like a per cpu pool for you to
> optimize object alloc and free.

Right, but one of the reasons for Jesper to implement the bulk 
alloc/free is to avoid the cmpxchg that is being used to get stuff into 
or off of the per cpu lists.

In the case of network drivers they are running in softirq context 
almost exclusively.  As such it is useful to have a set of buffers that 
can be acquired or freed from this context without the need to use any 
synchronization primitives.  Then once the softirq context ends then we 
can free up some or all of the resources back to the slab allocator.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
  2015-09-04 20:39       ` Alexander Duyck
@ 2015-09-04 23:45         ` Christoph Lameter
  2015-09-05 11:18           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Lameter @ 2015-09-04 23:45 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jesper Dangaard Brouer, netdev, akpm, linux-mm, aravinda,
	Paul E. McKenney, iamjoonsoo.kim

On Fri, 4 Sep 2015, Alexander Duyck wrote:
> Right, but one of the reasons for Jesper to implement the bulk alloc/free is
> to avoid the cmpxchg that is being used to get stuff into or off of the per
> cpu lists.

There is no full cmpxchg used for the per cpu lists. It's a cmpxchg without
lock semantics which is very cheap.

> In the case of network drivers they are running in softirq context almost
> exclusively.  As such it is useful to have a set of buffers that can be
> acquired or freed from this context without the need to use any
> synchronization primitives.  Then once the softirq context ends then we can
> free up some or all of the resources back to the slab allocator.

That is the case in the slab allocators.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
  2015-09-04 23:45         ` Christoph Lameter
@ 2015-09-05 11:18           ` Jesper Dangaard Brouer
  2015-09-08 17:32               ` Christoph Lameter
  0 siblings, 1 reply; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-05 11:18 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Alexander Duyck, netdev, akpm, linux-mm, aravinda,
	Paul E. McKenney, iamjoonsoo.kim, brouer

On Fri, 4 Sep 2015 18:45:13 -0500 (CDT)
Christoph Lameter <cl@linux.com> wrote:

> On Fri, 4 Sep 2015, Alexander Duyck wrote:
> > Right, but one of the reasons for Jesper to implement the bulk alloc/free is
> > to avoid the cmpxchg that is being used to get stuff into or off of the per
> > cpu lists.
> 
> There is no full cmpxchg used for the per cpu lists. Its a cmpxchg without
> lock semantics which is very cheap.

The double_cmpxchg without lock prefix still costs 9 cycles, which is
very fast but still a cost (add approx 19 cycles for a lock prefix).

It is slower than local_irq_disable + local_irq_enable, which together
cost only 7 cycles and are what the bulking call uses.  (That is the
reason bulk calls with 1 object can almost compete with the fastpath.)


> > In the case of network drivers they are running in softirq context almost
> > exclusively.  As such it is useful to have a set of buffers that can be
> > acquired or freed from this context without the need to use any
> > synchronization primitives.  Then once the softirq context ends then we can
> > free up some or all of the resources back to the slab allocator.
> 
> That is the case in the slab allocators.

There is a potential for taking advantage of this softirq context,
which is basically what my qmempool implementation did.

But we have now optimized the slub allocator to an extent that (in case
of slab-tuning or slab_nomerge) it is faster than my qmempool implementation.

Thus, I would like a smaller/slimmer layer than qmempool.  We do need
some per CPU cache for allocations, like Alex suggests, but I'm not
sure we need that for the free side.  For now I'm returning
objects/skbs directly to slub, and am hoping enough objects can be
merged in a detached freelist, which allows me to return several objects
with a single locked double_cmpxchg.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
  2015-09-04 18:09   ` [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API Alexander Duyck
  2015-09-04 18:55     ` Christoph Lameter
@ 2015-09-07  8:16     ` Jesper Dangaard Brouer
  2015-09-07 21:23         ` Alexander Duyck
  1 sibling, 1 reply; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-07  8:16 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: netdev, akpm, linux-mm, aravinda, Christoph Lameter,
	Paul E. McKenney, iamjoonsoo.kim, brouer

On Fri, 4 Sep 2015 11:09:21 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> This is an interesting start.  However I feel like it might work better 
> if you were to create a per-cpu pool for skbs that could be freed and 
> allocated in NAPI context.  So for example we already have 
> napi_alloc_skb, why not just add a napi_free_skb

I do like the idea...

> and then make the array 
> of objects to be freed part of a pool that could be used for either 
> allocation or freeing?  If the pool runs empty you just allocate 
> something like 8 or 16 new skb heads, and if you fill it you just free 
> half of the list?

But I worry that this algorithm will "randomize" the (skb) objects.
And the SLUB bulk optimization only works if we have many objects
belonging to the same page.

It would likely be fastest to implement a simple stack (for these
per-cpu pools), but I again worry that it would randomize the
object-pages.  A simple queue might be better, but slightly slower.
Guess I could just reuse part of qmempool / alf_queue as a quick test.

Having a per-cpu pool in networking would solve the problem that the slub
per-cpu pool isn't large enough for our use-case.  On the other hand,
maybe we should fix slub to dynamically adjust the size of its per-cpu
resources?


Some prerequisite knowledge (for people not familiar with slub's internals):
the slub alloc path will pick up a page and hand out all objects from that
page before proceeding to the next page.  Thus, slub bulk alloc will give
many objects belonging to the same page.  I'm trying to keep these objects
grouped together until they can be free'ed in a bulk.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk()
  2015-09-04 18:47     ` Tom Herbert
@ 2015-09-07  8:41         ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-07  8:41 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Linux Kernel Network Developers, akpm, linux-mm, aravinda,
	Christoph Lameter, Paul E. McKenney, iamjoonsoo.kim, brouer

On Fri, 4 Sep 2015 11:47:17 -0700 Tom Herbert <tom@herbertland.com> wrote:

> On Fri, Sep 4, 2015 at 10:00 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > Introduce the first user of SLAB bulk free API kmem_cache_free_bulk(),
> > in the network stack in form of function kfree_skb_bulk() which bulk
> > free SKBs (not skb clones or skb->head, yet).
> >
[...]
> > +/**
> > + *     kfree_skb_bulk - bulk free SKBs when refcnt allows to
> > + *     @skbs: array of SKBs to free
> > + *     @size: number of SKBs in array
> > + *
> > + *     If SKB refcnt allows for free, then release any auxiliary data
> > + *     and then bulk free SKBs to the SLAB allocator.
> > + *
> > + *     Note that interrupts must be enabled when calling this function.
> > + */
> > +void kfree_skb_bulk(struct sk_buff **skbs, unsigned int size)
> > +{
>
> What not pass a list of skbs (e.g. using skb->next)?

Because the next layer, the slab API needs an array:
  kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)

Look at the patch:
 [PATCH V2 3/3] slub: build detached freelist with look-ahead
 http://thread.gmane.org/gmane.linux.kernel.mm/137469/focus=137472

Where I use this array to progressively scan for objects belonging to
the same page.  (A subtle detail is I manage to zero out the array,
which is good from a security/error-handling point of view, as pointers
to the objects are not left dangling on the stack).


I cannot argue that writing skb->next comes as an additional cost,
because the slUb free also writes into this cacheline.  Perhaps the
slAb allocator does not?

[...]
> > +       if (likely(cnt)) {
> > +               kmem_cache_free_bulk(skbuff_head_cache, cnt, (void **) skbs);
> > +       }
> > +}
> > +EXPORT_SYMBOL(kfree_skb_bulk);

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 39+ messages in thread


* Re: [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk()
  2015-09-07  8:41         ` Jesper Dangaard Brouer
  (?)
@ 2015-09-07 16:25         ` Tom Herbert
  2015-09-07 20:14             ` Jesper Dangaard Brouer
  -1 siblings, 1 reply; 39+ messages in thread
From: Tom Herbert @ 2015-09-07 16:25 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Linux Kernel Network Developers, akpm, linux-mm, aravinda,
	Christoph Lameter, Paul E. McKenney, iamjoonsoo.kim

>> What not pass a list of skbs (e.g. using skb->next)?
>
> Because the next layer, the slab API needs an array:
>   kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
>

I suppose we could ask the same question of that function. IMO
encouraging drivers to define arrays of pointers on the stack like
you're doing in the ixgbe patch is a bad direction.

In any case I believe it would be simpler on the networking side
just to maintain a list of skb's to free. Then the dev_free_waitlist
structure might not be needed, since we could just use an
sk_buff_head for that.
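
Roughly (kfree_skb_list_bulk() is made up here, just to show the shape
of the list-based variant):

	struct sk_buff_head free_list;

	__skb_queue_head_init(&free_list);

	/* in the TX clean loop, instead of dev_free_waitlist_add() */
	__skb_queue_tail(&free_list, skb);

	/* after the loop, instead of dev_free_waitlist_flush() */
	kfree_skb_list_bulk(&free_list);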


Tom

> Look at the patch:
>  [PATCH V2 3/3] slub: build detached freelist with look-ahead
>  http://thread.gmane.org/gmane.linux.kernel.mm/137469/focus=137472
>
> Where I use this array to progressively scan for objects belonging to
> the same page.  (A subtle detail is I manage to zero out the array,
> which is good from a security/error-handling point of view, as pointers
> to the objects are not left dangling on the stack).
>
>
> I cannot argue that, writing skb->next comes as an additional cost,
> because the slUb free also writes into this cacheline.  Perhaps the
> slAb allocator does not?
>
> [...]
>> > +       if (likely(cnt)) {
>> > +               kmem_cache_free_bulk(skbuff_head_cache, cnt, (void **) skbs);
>> > +       }
>> > +}
>> > +EXPORT_SYMBOL(kfree_skb_bulk);
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Sr. Network Kernel Developer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk()
  2015-09-07 16:25         ` Tom Herbert
@ 2015-09-07 20:14             ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-07 20:14 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Linux Kernel Network Developers, akpm, linux-mm, aravinda,
	Christoph Lameter, Paul E. McKenney, iamjoonsoo.kim, brouer


On Mon, 7 Sep 2015 09:25:49 -0700 Tom Herbert <tom@herbertland.com> wrote:

> >> What not pass a list of skbs (e.g. using skb->next)?
> >
> > Because the next layer, the slab API needs an array:
> >   kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
> >
> 
> I suppose we could ask the same question of that function. IMO
> encouraging drivers to define arrays of pointers on the stack like
> you're doing in the ixgbe patch is a bad direction.
> 
> In any case I believe this would be simpler in the networking side
> just to maintain a list of skb's to free. Then the dev_free_waitlist
> structure might not be needed then since we could just use a
> skb_buf_head for that.

I guess it is more natural for the network side to work with skb lists.
But I'm keeping it for slab/slub as we cannot assume/enforce objects of a
specific data type.

I worried about how large a bulk free we should allow, due to the
interaction with skb->destructor, which for sockets affects their memory
accounting. E.g. we have seen issues with hypervisor network drivers
(Xen and HyperV) that are so slow to clean up their TX completion queue
that their TCP bandwidth gets limited by tcp_limit_output_bytes.
I capped it at 32, and the NAPI budget will cap it at 64.


By the following argument, bulk free of 64 objects/skb's is not a problem.
The delay I'm introducing, before the first real kfree_skb is called
(which invokes the destructor that frees up socket memory accounting),
is very small.

Assume measured packet rate of: 2105011 pps
Time between packets (1/2105011*10^9): 475 ns

Perf shows ixgbe_clean_tx_irq() takes: 1.23%
Extrapolating the per-packet cost of the function: 5.84 ns (475*(1.23/100))

Processing 64 packets in ixgbe_clean_tx_irq() thus takes 373 ns (64*5.84).
At 10Gbit/s only 466 bytes can arrive in this period
((373/10^9)*(10000*10^6)/8).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 39+ messages in thread


* Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
  2015-09-07  8:16     ` Jesper Dangaard Brouer
@ 2015-09-07 21:23         ` Alexander Duyck
  0 siblings, 0 replies; 39+ messages in thread
From: Alexander Duyck @ 2015-09-07 21:23 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, akpm, linux-mm, aravinda, Christoph Lameter,
	Paul E. McKenney, iamjoonsoo.kim

On 09/07/2015 01:16 AM, Jesper Dangaard Brouer wrote:
> On Fri, 4 Sep 2015 11:09:21 -0700
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> This is an interesting start.  However I feel like it might work better
>> if you were to create a per-cpu pool for skbs that could be freed and
>> allocated in NAPI context.  So for example we already have
>> napi_alloc_skb, why not just add a napi_free_skb
> I do like the idea...

If nothing else you want to avoid having to redo this code for every 
driver.  If you can just replace dev_kfree_skb with some other freeing 
call it will make it much easier to convert other drivers.

>> and then make the array
>> of objects to be freed part of a pool that could be used for either
>> allocation or freeing?  If the pool runs empty you just allocate
>> something like 8 or 16 new skb heads, and if you fill it you just free
>> half of the list?
> But I worry that this algorithm will "randomize" the (skb) objects.
> And the SLUB bulk optimization only works if we have many objects
> belonging to the same page.

Agreed to some extent, however at the same time what this does is allow 
for a certain amount of skb recycling.  So instead of freeing the 
buffers received from the socket you would likely be recycling them and 
sending them back as Rx skbs.  In the case of a heavy routing workload 
you would likely just be cycling through the same set of buffers and 
cleaning them off of transmit and placing them back on receive.  The 
general idea is to keep the memory footprint small so recycling Tx 
buffers to use for Rx can have its advantages in terms of keeping things 
confined to limits of the L1/L2 cache.

> It would likely be fastest to implement a simple stack (for these
> per-cpu pools), but I again worry that it would randomize the
> object-pages.  A simple queue might be better, but slightly slower.
> Guess I could just reuse part of qmempool / alf_queue as a quick test.

I would say don't over engineer it.  A stack is the simplest.  The 
qmempool / alf_queue is just going to add extra overhead.

The added advantage to the stack is that you are working with pointers
and you are guaranteed that the list of pointers is going to be
linear.  If you use a queue, clean-up will require up to 2 blocks of
freeing in case the ring has wrapped.
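
I.e. for a ring of SIZE entries, roughly (sketch only; ring is an
assumed void * array, and tail, cnt and cache are assumed driver-side
variables):

	if (tail + cnt <= SIZE) {
		kmem_cache_free_bulk(cache, cnt, &ring[tail]);
	} else {
		/* range wraps around: needs two bulk-free calls */
		unsigned int first = SIZE - tail;

		kmem_cache_free_bulk(cache, first, &ring[tail]);
		kmem_cache_free_bulk(cache, cnt - first, &ring[0]);
	}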

> Having a per-cpu pool in networking would solve the problem of the slub
> per-cpu pool isn't large enough for our use-case.  On the other hand,
> maybe we should fix slub to dynamically adjust the size of it's per-cpu
> resources?

The per-cpu pool is just meant to replace the per-driver pool you 
were using.  By using a per-cpu pool you would get better aggregation 
and can just flush the freed buffers at the end of the Rx softirq or 
when the pool is full instead of having to flush smaller lists per call 
to napi->poll.

> A pre-req knowledge (for people not knowing slub's internal details):
> Slub alloc path will pickup a page, and empty all objects for that page
> before proceeding to the next page.  Thus, slub bulk alloc will give
> many objects belonging to the page.  I'm trying to keep these objects
> grouped together until they can be free'ed in a bulk.

The problem is you aren't going to be able to keep them together very 
easily.  Yes they might be allocated all from one spot on Rx but they 
can very easily end up scattered to multiple locations. The same applies 
to Tx where you will have multiple flows all outgoing on one port.  That 
is why I was thinking adding some skb recycling via a per-cpu stack 
might be useful especially since you have to either fill or empty the 
stack when you allocate or free multiple skbs anyway.  In addition it 
provides an easy way for a bulk alloc and a bulk free to share data 
structures without adding additional overhead by keeping them separate.

If you managed it with some sort of high-water/low-water mark type setup 
you could very well keep the bulk-alloc/free busy without too much 
fragmentation.  For the socket transmit/receive case the thing you have 
to keep in mind is that if you reuse the buffers you are just going to 
be throwing them back at the sockets which are likely not using 
bulk-free anyway.  So in that case reuse could actually improve things 
by simply reducing the number of calls to bulk-alloc you will need to 
make since things like TSO allow you to send 64K using a single sk_buff, 
while you will likely be receiving one or more acks on the receive 
side which will require allocations.

- Alex

^ permalink raw reply	[flat|nested] 39+ messages in thread


* Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
  2015-09-05 11:18           ` Jesper Dangaard Brouer
@ 2015-09-08 17:32               ` Christoph Lameter
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2015-09-08 17:32 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexander Duyck, netdev, akpm, linux-mm, aravinda,
	Paul E. McKenney, iamjoonsoo.kim

On Sat, 5 Sep 2015, Jesper Dangaard Brouer wrote:

> The double_cmpxchg without lock prefix still cost 9 cycles, which is
> very fast but still a cost (add approx 19 cycles for a lock prefix).
>
> It is slower than local_irq_disable + local_irq_enable that only cost
> 7 cycles, which the bulking call uses.  (That is the reason bulk calls
> with 1 object can almost compete with fastpath).

Hmmm... Guess we need to come up with distinct versions of kmalloc() for
irq and non-irq contexts to take advantage of that.  Most are in non-irq
context anyways.

^ permalink raw reply	[flat|nested] 39+ messages in thread


* Re: [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk()
  2015-09-04 17:00   ` [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk() Jesper Dangaard Brouer
  2015-09-04 18:47     ` Tom Herbert
@ 2015-09-08 21:01     ` David Miller
  1 sibling, 0 replies; 39+ messages in thread
From: David Miller @ 2015-09-08 21:01 UTC (permalink / raw)
  To: brouer; +Cc: netdev, akpm, linux-mm, aravinda, cl, paulmck, iamjoonsoo.kim

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Fri, 04 Sep 2015 19:00:53 +0200

> +/**
> + *	kfree_skb_bulk - bulk free SKBs when refcnt allows to
> + *	@skbs: array of SKBs to free
> + *	@size: number of SKBs in array
> + *
> + *	If SKB refcnt allows for free, then release any auxiliary data
> + *	and then bulk free SKBs to the SLAB allocator.
> + *
> + *	Note that interrupts must be enabled when calling this function.
> + */
> +void kfree_skb_bulk(struct sk_buff **skbs, unsigned int size)
> +{
> +	int i;
> +	size_t cnt = 0;
> +
> +	for (i = 0; i < size; i++) {
> +		struct sk_buff *skb = skbs[i];
> +
> +		if (!skb_dec_and_test(skb))
> +			continue; /* skip skb, not ready to free */
> +
> +		/* Construct an array of SKBs, ready to be free'ed and
> +		 * cleanup all auxiliary, before bulk free to SLAB.
> +		 * For now, only handle non-cloned SKBs, related to
> +		 * SLAB skbuff_head_cache
> +		 */
> +		if (skb->fclone == SKB_FCLONE_UNAVAILABLE) {
> +			skb_release_all(skb);
> +			skbs[cnt++] = skb;
> +		} else {
> +			/* SKB was a clone, don't handle this case */
> +			__kfree_skb(skb);
> +		}
> +	}
> +	if (likely(cnt)) {
> +		kmem_cache_free_bulk(skbuff_head_cache, cnt, (void **) skbs);
> +	}
> +}

You're going to have to do a trace_kfree_skb() or trace_consume_skb() for
these things.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
  2015-09-08 17:32               ` Christoph Lameter
  (?)
@ 2015-09-09 12:59               ` Jesper Dangaard Brouer
  2015-09-09 14:08                   ` Christoph Lameter
  -1 siblings, 1 reply; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-09 12:59 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Alexander Duyck, netdev, akpm, linux-mm, aravinda,
	Paul E. McKenney, iamjoonsoo.kim, brouer

On Tue, 8 Sep 2015 12:32:40 -0500 (CDT)
Christoph Lameter <cl@linux.com> wrote:

> On Sat, 5 Sep 2015, Jesper Dangaard Brouer wrote:
> 
> > The double_cmpxchg without lock prefix still cost 9 cycles, which is
> > very fast but still a cost (add approx 19 cycles for a lock prefix).
> >
> > It is slower than local_irq_disable + local_irq_enable that only cost
> > 7 cycles, which the bulking call uses.  (That is the reason bulk calls
> > with 1 object can almost compete with fastpath).
> 
> Hmmm... Guess we need to come up with distinct version of kmalloc() for
> irq and non irq contexts to take advantage of that . Most at non irq
> context anyways.

I agree, it would be an easy win.  Do notice this will have the most
impact for the slAb allocator.

For the combined alloc + free cost, I estimate the saving at:
 * slAb would save approx 60 cycles
 * slUb would save approx  4 cycles

We might consider keeping the slUb approach, as it would be more
RT-friendly with less IRQ disabling.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
  2015-09-09 12:59               ` Jesper Dangaard Brouer
@ 2015-09-09 14:08                   ` Christoph Lameter
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2015-09-09 14:08 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexander Duyck, netdev, akpm, linux-mm, aravinda,
	Paul E. McKenney, iamjoonsoo.kim

On Wed, 9 Sep 2015, Jesper Dangaard Brouer wrote:

> > Hmmm... Guess we need to come up with distinct version of kmalloc() for
> > irq and non irq contexts to take advantage of that . Most at non irq
> > context anyways.
>
> I agree, it would be an easy win.  Do notice this will have the most
> impact for the slAb allocator.
>
> I estimate alloc + free cost would save:
>  * slAb would save approx 60 cycles
>  * slUb would save approx  4 cycles
>
> We might consider keeping the slUb approach as it would be more
> friendly for RT with less IRQ disabling.

IRQ disabling is a mixed bag.  Older CPUs have higher latencies there,
and virtualized contexts may require that the hypervisor track the
interrupt state.

For recent Intel CPUs this is certainly a workable approach.

^ permalink raw reply	[flat|nested] 39+ messages in thread


* Experiences with slub bulk use-case for network stack
  2015-09-04 17:00   ` Jesper Dangaard Brouer
@ 2015-09-16 10:02     ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-16 10:02 UTC (permalink / raw)
  To: linux-mm, Christoph Lameter; +Cc: netdev, akpm, Alexander Duyck, iamjoonsoo.kim


Hint, this leads up to discussing whether the current bulk *ALLOC* API
needs to be changed...

Alex and I have been working hard on a practical use-case for SLAB
bulking (mostly slUb) in the network stack.  Here is a summary of
what we have learned so far.

Bulk free'ing SKBs during TX completion is a big and easy win.

Specifically for slUb, the normal path for freeing these objects
(which are not on c->freelist) requires a locked double_cmpxchg per
object.  The bulk free (via the detached freelist patch) allows all
objects belonging to the same slab-page to be freed with a single
locked double_cmpxchg.  Thus, the bulk free speedup is quite an
improvement.
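
As a minimal usage sketch of the free side (my_cache and
next_completed_object() are made-up placeholders; only the
kmem_cache_free_bulk() call itself is the real API), the pattern is:

	void *objs[64];
	size_t cnt = 0;

	/* gather objects that all belong to the same kmem_cache */
	while (cnt < ARRAY_SIZE(objs) && (objs[cnt] = next_completed_object()))
		cnt++;

	/* frees hitting the same slab-page share one locked double_cmpxchg */
	if (cnt)
		kmem_cache_free_bulk(my_cache, cnt, objs);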

The slUb alloc is hard to beat on speed:
 * accessing c->freelist, local cmpxchg 9 cycles (38% of cost)
 * c->freelist is refilled with a single locked cmpxchg

In micro-benchmarking it looks like we can beat alloc, because we do a
local_irq_{disable,enable} (costing 7 cycles) and then pull out all
objects in c->freelist, thus saving 9 cycles per object (counting
from the 2nd object).
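
Roughly, that measured pattern looks like the following (a stripped
down version of the existing bulk alloc loop; it ignores the refill
slowpath, and get_freepointer() is SLUB's internal helper for
following the link stored inside each object):

	local_irq_disable();
	while (count < want && c->freelist) {
		void *object = c->freelist;

		c->freelist = get_freepointer(s, object);
		p[count++] = object;
	}
	local_irq_enable();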

However, in practical use-cases we are seeing single object alloc win
over bulk alloc; we believe this is due to prefetching.  When
c->freelist gets (semi) cache-cold, it becomes more expensive to walk
the freelist (which is a basic singly linked list to the next free
object).

For bulk alloc the full freelist is walked (right away) and the
objects are pulled out into the array.  For normal single object alloc
only a single object is returned, but it issues a prefetch on the next
object pointer.  Thus, the next time single alloc is called the object
will already have been prefetched.  Doing a prefetch in bulk alloc
only helps a little, as there is not enough "time" between
accessing/walking the freelist entries.
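
For reference, the prefetch trick on the single-object path boils down
to something like this (simplified: the real code keeps the link at
s->offset inside the object and uses SLUB's prefetch_freepointer()):

	#include <linux/prefetch.h>

	static inline void *pop_one_and_prefetch(void **freelist_head)
	{
		void *object = *freelist_head;

		if (object) {
			void *next = *(void **)object;	/* next free object */

			*freelist_head = next;
			prefetch(next);	/* warm it up for the following alloc */
		}
		return object;
	}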

So, how can we solve this and make bulk alloc faster?


Alex and I had the idea of having bulk alloc return an "allocator
specific cache" data-structure (and adding some helpers to access it).

In the slUb case, the freelist is a singly linked pointer list.  In
the network stack the skb objects have a skb->next pointer, which is
located at the same position as the freelist pointer.  Thus, simply
returning the freelist directly could be interpreted as an skb-list.
The helper API would then do the prefetching when pulling out
objects.
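
A hypothetical shape for such a helper (none of these names exist
anywhere; the point is only that the caller never needs to know it is
holding a SLUB freelist, and prefetch() comes from <linux/prefetch.h>):

	struct kmem_bulk_cache {		/* opaque detached freelist */
		void *head;
	};

	static inline void *kmem_bulk_cache_next(struct kmem_bulk_cache *bc)
	{
		void *object = bc->head;

		if (object) {
			bc->head = *(void **)object;	/* follow freelist link */
			prefetch(bc->head);	/* overlap with work on 'object' */
		}
		return object;
	}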

For the slUb case, we would simply cmpxchg either c->freelist or
page->freelist with a NULL ptr, and then own all objects on the
freelist.  This also reduces the time we keep IRQs disabled.

API-wise, we don't (necessarily) know how many objects are on the
freelist (without first walking the list, which would cause exactly
the data stalls we are trying to avoid).

Thus, an API that always returns the exact number of requested
objects will not work...
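
One hypothetical way to express that in an API (signature invented for
illustration) is an "up to" contract, where the return value tells the
caller how many objects were actually grabbed:

	/* fill p[] with up to 'size' objects; returns the number delivered */
	int kmem_cache_alloc_bulk_upto(struct kmem_cache *s, gfp_t flags,
				       size_t size, void **p);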

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

(related to http://thread.gmane.org/gmane.linux.kernel.mm/137469)

^ permalink raw reply	[flat|nested] 39+ messages in thread


* Re: Experiences with slub bulk use-case for network stack
  2015-09-16 10:02     ` Jesper Dangaard Brouer
  (?)
@ 2015-09-16 15:13     ` Christoph Lameter
  2015-09-17 20:17       ` Jesper Dangaard Brouer
  -1 siblings, 1 reply; 39+ messages in thread
From: Christoph Lameter @ 2015-09-16 15:13 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: linux-mm, netdev, akpm, Alexander Duyck, iamjoonsoo.kim

On Wed, 16 Sep 2015, Jesper Dangaard Brouer wrote:

>
> Hint, this leads up to discussing if current bulk *ALLOC* API need to
> be changed...
>
> Alex and I have been working hard on practical use-case for SLAB
> bulking (mostly slUb), in the network stack.  Here is a summary of
> what we have learned so far.

SLAB refers to the SLAB allocator, which is one slab allocator, and
SLUB is another slab allocator.

Please keep that consistent, otherwise things get confusing.

> Bulk free'ing SKBs during TX completion is a big and easy win.
>
> Specifically for slUb, normal path for freeing these objects (which
> are not on c->freelist) require a locked double_cmpxchg per object.
> The bulk free (via detached freelist patch) allow to free all objects
> belonging to the same slab-page, to be free'ed with a single locked
> double_cmpxchg. Thus, the bulk free speedup is quite an improvement.

Yep.

> Alex and I had the idea of bulk alloc returns an "allocator specific
> cache" data-structure (and we add some helpers to access this).

Maybe add some macros to handle this?

> In the slUb case, the freelist is a single linked pointer list.  In
> the network stack the skb objects have a skb->next pointer, which is
> located at the same position as freelist pointer.  Thus, simply
> returning the freelist directly, could be interpreted as a skb-list.
> The helper API would then do the prefetching, when pulling out
> objects.

The problem with the SLUB case is that the objects must be on the same
slab page.

> For the slUb case, we would simply cmpxchg either c->freelist or
> page->freelist with a NULL ptr, and then own all objects on the
> freelist. This also reduce the time we keep IRQs disabled.

You don't need to disable interrupts for the cmpxchges.  There is
additional state in the page struct though, so the updates must be
done carefully.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Experiences with slub bulk use-case for network stack
  2015-09-16 15:13     ` Christoph Lameter
@ 2015-09-17 20:17       ` Jesper Dangaard Brouer
  2015-09-17 23:57         ` Christoph Lameter
  0 siblings, 1 reply; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-17 20:17 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, netdev, akpm, Alexander Duyck, iamjoonsoo.kim, brouer

On Wed, 16 Sep 2015 10:13:25 -0500 (CDT)
Christoph Lameter <cl@linux.com> wrote:

> On Wed, 16 Sep 2015, Jesper Dangaard Brouer wrote:
> 
> >
> > Hint, this leads up to discussing if current bulk *ALLOC* API need to
> > be changed...
> >
> > Alex and I have been working hard on practical use-case for SLAB
> > bulking (mostly slUb), in the network stack.  Here is a summary of
> > what we have learned so far.
> 
> SLAB refers to the SLAB allocator which is one slab allocator and SLUB is
> another slab allocator.
> 
> Please keep that consistent otherwise things get confusing

This naming scheme is really confusing.  I'll try to be more
consistent.  So, you want capital letters SLAB and SLUB when talking
about a specific slab allocator implementation.


> > Bulk free'ing SKBs during TX completion is a big and easy win.
> >
> > Specifically for slUb, normal path for freeing these objects (which
> > are not on c->freelist) require a locked double_cmpxchg per object.
> > The bulk free (via detached freelist patch) allow to free all objects
> > belonging to the same slab-page, to be free'ed with a single locked
> > double_cmpxchg. Thus, the bulk free speedup is quite an improvement.
> 
> Yep.
> 
> > Alex and I had the idea of bulk alloc returns an "allocator specific
> > cache" data-structure (and we add some helpers to access this).
> 
> Maybe add some Macros to handle this?

Yes, helpers will likely turn out to be macros.


> > In the slUb case, the freelist is a single linked pointer list.  In
> > the network stack the skb objects have a skb->next pointer, which is
> > located at the same position as freelist pointer.  Thus, simply
> > returning the freelist directly, could be interpreted as a skb-list.
> > The helper API would then do the prefetching, when pulling out
> > objects.
> 
> The problem with the SLUB case is that the objects must be on the same
> slab page.

Yes, I'm aware of that; it is what we are trying to take advantage of.


> > For the slUb case, we would simply cmpxchg either c->freelist or
> > page->freelist with a NULL ptr, and then own all objects on the
> > freelist. This also reduce the time we keep IRQs disabled.
> 
> You dont need to disable interrupts for the cmpxchges. There is
> additional state in the page struct though so the updates must be
> done carefully.

Yes, I'm aware that cmpxchg does not need interrupts disabled, and I
plan to take advantage of this in the new approach for bulk alloc.

Our current bulk alloc disables interrupts for the full period (of
collecting the requested number of objects).

What I'm proposing is keeping interrupts on, and then simply
cmpxchg'ing e.g. 2 slab-pages out of the SLUB allocator (what the SLUB
code calls freelists).  The bulk call now owns these freelists, and
returns them to the caller.  The API caller gets some helpers/macros
to access the objects, shielding them from the details (of SLUB
freelists).
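
Very roughly, and ignoring the transaction id (c->tid) and preemption
handling that real SLUB pairs with c->freelist, the "take the whole
per-cpu freelist" step could look like:

	struct kmem_cache_cpu *c = raw_cpu_ptr(s->cpu_slab);
	void **freelist;

	do {
		freelist = c->freelist;
	} while (freelist &&
		 cmpxchg(&c->freelist, freelist, NULL) != freelist);

	/* we now own every object linked from 'freelist' */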

The pitfall with this API is that we don't know how many objects are
on a SLUB freelist.  And we cannot walk the freelist and count them,
because then we hit the memory/cache stalls (that we are trying so
hard to avoid).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Experiences with slub bulk use-case for network stack
  2015-09-17 20:17       ` Jesper Dangaard Brouer
@ 2015-09-17 23:57         ` Christoph Lameter
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Lameter @ 2015-09-17 23:57 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: linux-mm, netdev, akpm, Alexander Duyck, iamjoonsoo.kim

On Thu, 17 Sep 2015, Jesper Dangaard Brouer wrote:

> What I'm proposing is keeping interrupts on, and then simply cmpxchg
> e.g 2 slab-pages out of the SLUB allocator (which the SLUB code calls
> freelist's). The bulk call now owns these freelists, and returns them
> to the caller.  The API caller gets some helpers/macros to access
> objects, to shield him from the details (of SLUB freelist's).
>
> The pitfall with this API is we don't know how many objects are on a
> SLUB freelist.  And we cannot walk the freelist and count them, because
> then we hit the problem of memory/cache stalls (that we are trying so
> hard to avoid).

If you get a fresh page from the page allocator, then you know how
many objects are available in that slab page.

There is also a counter in each slab page for the objects allocated.
The number of free objects is page->objects - page->inuse.

This is only true for a locked cmpxchg.  The unlocked cmpxchg used for
the per-cpu freelist does not use the counters in the page struct.
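
In code terms (the helper name is invented; page->objects and
page->inuse are the real struct page bit-fields):

	static inline unsigned int slab_page_free_objects(struct page *page)
	{
		return page->objects - page->inuse;
	}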



^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2015-09-17 23:57 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-24  0:58 [PATCH V2 0/3] slub: introducing detached freelist Jesper Dangaard Brouer
2015-08-24  0:58 ` Jesper Dangaard Brouer
2015-08-24  0:58 ` [PATCH V2 1/3] slub: extend slowpath __slab_free() to handle bulk free Jesper Dangaard Brouer
2015-08-24  0:58   ` Jesper Dangaard Brouer
2015-08-24  0:59 ` [PATCH V2 2/3] slub: optimize bulk slowpath free by detached freelist Jesper Dangaard Brouer
2015-08-24  0:59   ` Jesper Dangaard Brouer
2015-08-24  0:59 ` [PATCH V2 3/3] slub: build detached freelist with look-ahead Jesper Dangaard Brouer
2015-08-24  0:59   ` Jesper Dangaard Brouer
2015-09-04 17:00 ` [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API Jesper Dangaard Brouer
2015-09-04 17:00   ` Jesper Dangaard Brouer
2015-09-04 17:00   ` [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk() Jesper Dangaard Brouer
2015-09-04 18:47     ` Tom Herbert
2015-09-07  8:41       ` Jesper Dangaard Brouer
2015-09-07  8:41         ` Jesper Dangaard Brouer
2015-09-07 16:25         ` Tom Herbert
2015-09-07 20:14           ` Jesper Dangaard Brouer
2015-09-07 20:14             ` Jesper Dangaard Brouer
2015-09-08 21:01     ` David Miller
2015-09-04 17:01   ` [RFC PATCH 2/3] net: NIC helper API for building array of skbs to free Jesper Dangaard Brouer
2015-09-04 17:01     ` Jesper Dangaard Brouer
2015-09-04 17:01   ` [RFC PATCH 3/3] ixgbe: bulk free SKBs during TX completion cleanup cycle Jesper Dangaard Brouer
2015-09-04 18:09   ` [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API Alexander Duyck
2015-09-04 18:55     ` Christoph Lameter
2015-09-04 20:39       ` Alexander Duyck
2015-09-04 23:45         ` Christoph Lameter
2015-09-05 11:18           ` Jesper Dangaard Brouer
2015-09-08 17:32             ` Christoph Lameter
2015-09-08 17:32               ` Christoph Lameter
2015-09-09 12:59               ` Jesper Dangaard Brouer
2015-09-09 14:08                 ` Christoph Lameter
2015-09-09 14:08                   ` Christoph Lameter
2015-09-07  8:16     ` Jesper Dangaard Brouer
2015-09-07 21:23       ` Alexander Duyck
2015-09-07 21:23         ` Alexander Duyck
2015-09-16 10:02   ` Experiences with slub bulk use-case for network stack Jesper Dangaard Brouer
2015-09-16 10:02     ` Jesper Dangaard Brouer
2015-09-16 15:13     ` Christoph Lameter
2015-09-17 20:17       ` Jesper Dangaard Brouer
2015-09-17 23:57         ` Christoph Lameter
