* [PATCH 0/3] slub: introducing detached freelist
@ 2015-07-15 16:01 Jesper Dangaard Brouer
  2015-07-15 16:01 ` [PATCH 1/3] slub: extend slowpath __slab_free() to handle bulk free Jesper Dangaard Brouer
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2015-07-15 16:01 UTC (permalink / raw)
  To: linux-mm, Christoph Lameter, Andrew Morton
  Cc: Joonsoo Kim, Alexander Duyck, Hannes Frederic Sowa,
	Jesper Dangaard Brouer

This patchset introduces what I call a detached freelist, for
improving the performance of object freeing in the "slowpath" of
kmem_cache_free_bulk(), which calls __slab_free().

The benchmarking tools are available here:
 https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm
 See: slab_bulk_test0{1,2,3}.c

Compared against the existing bulk API (in AKPM's tree), we see a
small regression for the fastpath (between 2-5 cycles), but a huge
improvement for the slowpath.

bulk- Bulk-API-before           - Bulk-API with patchset
  1 -  42 cycles(tsc) 10.520 ns - 47 cycles(tsc) 11.931 ns - improved -11.9%
  2 -  26 cycles(tsc)  6.697 ns - 29 cycles(tsc)  7.368 ns - improved -11.5%
  3 -  22 cycles(tsc)  5.589 ns - 24 cycles(tsc)  6.003 ns - improved -9.1%
  4 -  19 cycles(tsc)  4.921 ns - 22 cycles(tsc)  5.543 ns - improved -15.8%
  8 -  17 cycles(tsc)  4.499 ns - 20 cycles(tsc)  5.047 ns - improved -17.6%
 16 -  69 cycles(tsc) 17.424 ns - 20 cycles(tsc)  5.015 ns - improved 71.0%
 30 -  88 cycles(tsc) 22.075 ns - 20 cycles(tsc)  5.062 ns - improved 77.3%
 32 -  83 cycles(tsc) 20.965 ns - 20 cycles(tsc)  5.089 ns - improved 75.9%
 34 -  80 cycles(tsc) 20.039 ns - 28 cycles(tsc)  7.006 ns - improved 65.0%
 48 -  76 cycles(tsc) 19.252 ns - 31 cycles(tsc)  7.755 ns - improved 59.2%
 64 -  86 cycles(tsc) 21.523 ns - 68 cycles(tsc) 17.203 ns - improved 20.9%
128 -  97 cycles(tsc) 24.444 ns - 72 cycles(tsc) 18.195 ns - improved 25.8%
158 -  96 cycles(tsc) 24.036 ns - 73 cycles(tsc) 18.372 ns - improved 24.0%
250 - 100 cycles(tsc) 25.007 ns - 73 cycles(tsc) 18.430 ns - improved 27.0%

The patchset is based on top of commit aefbef10e3ae, with the
previously accepted bulk patchset (V2) applied (available in AKPM's
quilt).

A small note: the benchmarks were run with a kernel compiled with
CONFIG_FTRACE, in order to use the perf probes to measure the amount
of page bulking into __slab_free(), while running the "worst-case"
testing module slab_bulk_test03.c.
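
For reference, a minimal usage sketch of the bulk free API being
benchmarked (illustrative only: the cache is assumed to be created
elsewhere, the object count is made up, and allocation error handling
is omitted):

  #include <linux/slab.h>

  #define NR_OBJS 16

  static void example_bulk_free(struct kmem_cache *cache)
  {
          void *objs[NR_OBJS];
          int i;

          /* Allocate via the normal per-object fastpath */
          for (i = 0; i < NR_OBJS; i++)
                  objs[i] = kmem_cache_alloc(cache, GFP_KERNEL);

          /* ... use the objects ... */

          /* Free everything in one call; this is the path the
           * detached freelist optimizes when objects share a page.
           */
          kmem_cache_free_bulk(cache, NR_OBJS, objs);
  }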

---

Jesper Dangaard Brouer (3):
      slub: extend slowpath __slab_free() to handle bulk free
      slub: optimize bulk slowpath free by detached freelist
      slub: build detached freelist with look-ahead


 mm/slub.c |  141 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 111 insertions(+), 30 deletions(-)

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* [PATCH 1/3] slub: extend slowpath __slab_free() to handle bulk free
  2015-07-15 16:01 [PATCH 0/3] slub: introducing detached freelist Jesper Dangaard Brouer
@ 2015-07-15 16:01 ` Jesper Dangaard Brouer
  2015-07-15 16:54   ` Christoph Lameter
  2015-07-15 16:02 ` [PATCH 2/3] slub: optimize bulk slowpath free by detached freelist Jesper Dangaard Brouer
  2015-07-15 16:02 ` [PATCH 3/3] slub: build detached freelist with look-ahead Jesper Dangaard Brouer
  2 siblings, 1 reply; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2015-07-15 16:01 UTC (permalink / raw)
  To: linux-mm, Christoph Lameter, Andrew Morton
  Cc: Joonsoo Kim, Alexander Duyck, Hannes Frederic Sowa,
	Jesper Dangaard Brouer

Make it possible to free a freelist with several objects by extending
__slab_free() with two arguments: a freelist_head pointer and an
object counter (cnt).  If the freelist_head pointer is set, then the
object argument must be the freelist tail pointer.

This allows a list of objects to be freed using a single locked
cmpxchg_double.

Micro benchmarking showed no performance reduction due to this change.
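
To illustrate the new calling convention, a small sketch (illustration
only; __slab_free() is static to mm/slub.c, and obj1/obj2/obj3 are
made-up objects assumed to belong to the same slab page):

  struct page *page = virt_to_head_page(obj1);

  /* Link the objects into a local freelist: obj1 -> obj2 -> obj3 */
  set_freepointer(s, obj3, NULL);         /* obj3 is the tail */
  set_freepointer(s, obj2, obj3);
  set_freepointer(s, obj1, obj2);

  /* One locked cmpxchg_double frees all three objects.  The "object"
   * argument is the tail (obj3), while freelist_head is the head
   * (obj1), per the rule described above.
   */
  __slab_free(s, page, obj3, _RET_IP_, obj1, 3);

  /* Existing single-object callers keep the old behavior by
   * passing a NULL freelist_head and cnt == 1.
   */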

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 mm/slub.c |   15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index c9305f525004..d0841a4c61ea 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2573,9 +2573,13 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace);
  * So we still attempt to reduce cache line usage. Just take the slab
  * lock and free the item. If there is no additional partial page
  * handling required then we can return immediately.
+ *
+ * Bulk free of a freelist with several objects possible by specifying
+ * freelist_head ptr and object as tail ptr, plus objects count (cnt).
  */
 static void __slab_free(struct kmem_cache *s, struct page *page,
-			void *x, unsigned long addr)
+			void *x, unsigned long addr,
+			void *freelist_head, int cnt)
 {
 	void *prior;
 	void **object = (void *)x;
@@ -2584,6 +2588,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 	unsigned long counters;
 	struct kmem_cache_node *n = NULL;
 	unsigned long uninitialized_var(flags);
+	void *new_freelist = (!freelist_head) ? object : freelist_head;
 
 	stat(s, FREE_SLOWPATH);
 
@@ -2601,7 +2606,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 		set_freepointer(s, object, prior);
 		new.counters = counters;
 		was_frozen = new.frozen;
-		new.inuse--;
+		new.inuse -= cnt;
 		if ((!new.inuse || !prior) && !was_frozen) {
 
 			if (kmem_cache_has_cpu_partial(s) && !prior) {
@@ -2632,7 +2637,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 
 	} while (!cmpxchg_double_slab(s, page,
 		prior, counters,
-		object, new.counters,
+		new_freelist, new.counters,
 		"__slab_free"));
 
 	if (likely(!n)) {
@@ -2736,7 +2741,7 @@ redo:
 		}
 		stat(s, FREE_FASTPATH);
 	} else
-		__slab_free(s, page, x, addr);
+		__slab_free(s, page, x, addr, NULL, 1);
 
 }
 
@@ -2780,7 +2785,7 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 			c->tid = next_tid(c->tid);
 			local_irq_enable();
 			/* Slowpath: overhead locked cmpxchg_double_slab */
-			__slab_free(s, page, object, _RET_IP_);
+			__slab_free(s, page, object, _RET_IP_, NULL, 1);
 			local_irq_disable();
 			c = this_cpu_ptr(s->cpu_slab);
 		}


* [PATCH 2/3] slub: optimize bulk slowpath free by detached freelist
  2015-07-15 16:01 [PATCH 0/3] slub: introducing detached freelist Jesper Dangaard Brouer
  2015-07-15 16:01 ` [PATCH 1/3] slub: extend slowpath __slab_free() to handle bulk free Jesper Dangaard Brouer
@ 2015-07-15 16:02 ` Jesper Dangaard Brouer
  2015-07-15 16:56   ` Christoph Lameter
  2015-07-15 16:02 ` [PATCH 3/3] slub: build detached freelist with look-ahead Jesper Dangaard Brouer
  2 siblings, 1 reply; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2015-07-15 16:02 UTC (permalink / raw)
  To: linux-mm, Christoph Lameter, Andrew Morton
  Cc: Joonsoo Kim, Alexander Duyck, Hannes Frederic Sowa,
	Jesper Dangaard Brouer

This change focuses on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk().

The slowpath call __slab_free() has been extended with support for
bulk free, which amortizes the overhead of the locked
cmpxchg_double_slab.

To use the new bulking feature of __slab_free(), we build what I call
a detached freelist.  The detached freelist takes advantage of three
properties:

 1) the free function call owns the object that is about to be freed,
    thus writing into this memory is synchronization-free.

 2) many freelists can co-exist side-by-side in the same page, each
    with a separate head pointer.

 3) it is the visibility of the head pointer that needs synchronization.

Given these properties, the brilliant part is that the detached
freelist can be constructed without any need for synchronization.  The
freelist is built directly in the page objects, while the detached
freelist itself is kept on the stack of the function call
kmem_cache_free_bulk().  Thus, the freelist head pointer is not
visible to other CPUs.
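
As a sketch of that construction (simplified: all objects in p[0..nr-1]
are assumed to come from the same page, and "s" is the kmem_cache;
unlike the real code below, no page-change handling is shown):

  struct detached_freelist df = {0};
  int i;

  for (i = 0; i < nr; i++) {
          void *object = p[i];

          if (!df.page) {
                  /* First object starts the list and becomes the tail */
                  df.page = virt_to_head_page(object);
                  df.tail_object = object;
                  set_freepointer(s, object, NULL);
          } else {
                  /* Later objects are pushed in front of the head */
                  set_freepointer(s, object, df.freelist);
          }
          df.freelist = object;   /* head pointer stays on the stack */
          df.cnt++;
  }

  /* A single synchronization point transfers the whole list */
  __slab_free(s, df.page, df.tail_object, _RET_IP_, df.freelist, df.cnt);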

This implementation is fairly simple, as it only builds the detached
freelist if two consecutive objects belong to the same page.  When it
detects that an object's page does not match, it simply flushes the
local freelist and starts a new local detached freelist.  It does not
look ahead to see if further opportunities exist in the rest of the
array.

The next patch has a more advanced look-ahead approach, but is also
more complicated.  They are split up because I want to be able to
benchmark the simple approach against the advanced one.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

---
bulk- Fallback                  - Bulk API
  1 -  64 cycles(tsc) 16.109 ns - 47 cycles(tsc) 11.894 - improved 26.6%
  2 -  56 cycles(tsc) 14.158 ns - 45 cycles(tsc) 11.274 - improved 19.6%
  3 -  54 cycles(tsc) 13.650 ns - 23 cycles(tsc)  6.001 - improved 57.4%
  4 -  53 cycles(tsc) 13.268 ns - 21 cycles(tsc)  5.262 - improved 60.4%
  8 -  51 cycles(tsc) 12.841 ns - 18 cycles(tsc)  4.718 - improved 64.7%
 16 -  50 cycles(tsc) 12.583 ns - 19 cycles(tsc)  4.896 - improved 62.0%
 30 -  85 cycles(tsc) 21.357 ns - 26 cycles(tsc)  6.549 - improved 69.4%
 32 -  82 cycles(tsc) 20.690 ns - 25 cycles(tsc)  6.412 - improved 69.5%
 34 -  81 cycles(tsc) 20.322 ns - 25 cycles(tsc)  6.365 - improved 69.1%
 48 -  93 cycles(tsc) 23.332 ns - 28 cycles(tsc)  7.139 - improved 69.9%
 64 -  98 cycles(tsc) 24.544 ns - 62 cycles(tsc) 15.543 - improved 36.7%
128 -  96 cycles(tsc) 24.219 ns - 68 cycles(tsc) 17.143 - improved 29.2%
158 - 107 cycles(tsc) 26.817 ns - 69 cycles(tsc) 17.431 - improved 35.5%
250 - 107 cycles(tsc) 26.824 ns - 70 cycles(tsc) 17.730 - improved 34.6%

 mm/slub.c |   48 +++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 41 insertions(+), 7 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index d0841a4c61ea..ce4118566761 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2755,12 +2755,26 @@ void kmem_cache_free(struct kmem_cache *s, void *x)
 }
 EXPORT_SYMBOL(kmem_cache_free);
 
+struct detached_freelist {
+	struct page *page;
+	void *freelist;
+	void *tail_object;
+	int cnt;
+};
+
 /* Note that interrupts must be enabled when calling this function. */
 void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 {
 	struct kmem_cache_cpu *c;
 	struct page *page;
 	int i;
+	/* Opportunistically delay updating page->freelist, hoping
+	 * next free happen to same page.  Start building the freelist
+	 * in the page, but keep local stack ptr to freelist.  If
+	 * successful several object can be transferred to page with a
+	 * single cmpxchg_double.
+	 */
+	struct detached_freelist df = {0};
 
 	local_irq_disable();
 	c = this_cpu_ptr(s->cpu_slab);
@@ -2777,22 +2791,42 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 
 		page = virt_to_head_page(object);
 
-		if (c->page == page) {
+		if (page == df.page) {
+			/* Opportunity to delay real free */
+			set_freepointer(s, object, df.freelist);
+			df.freelist = object;
+			df.cnt++;
+		} else if (c->page == page) {
 			/* Fastpath: local CPU free */
 			set_freepointer(s, object, c->freelist);
 			c->freelist = object;
 		} else {
-			c->tid = next_tid(c->tid);
-			local_irq_enable();
-			/* Slowpath: overhead locked cmpxchg_double_slab */
-			__slab_free(s, page, object, _RET_IP_, NULL, 1);
-			local_irq_disable();
-			c = this_cpu_ptr(s->cpu_slab);
+			/* Slowpath: Flush delayed free */
+			if (df.page) {
+				c->tid = next_tid(c->tid);
+				local_irq_enable();
+				__slab_free(s, df.page, df.tail_object,
+					    _RET_IP_, df.freelist, df.cnt);
+				local_irq_disable();
+				c = this_cpu_ptr(s->cpu_slab);
+			}
+			/* Start new round of delayed free */
+			df.page = page;
+			df.tail_object = object;
+			set_freepointer(s, object, NULL);
+			df.freelist = object;
+			df.cnt = 1;
 		}
 	}
 exit:
 	c->tid = next_tid(c->tid);
 	local_irq_enable();
+
+	/* Flush detached freelist */
+	if (df.page) {
+		__slab_free(s, df.page, df.tail_object,
+			    _RET_IP_, df.freelist, df.cnt);
+	}
 }
 EXPORT_SYMBOL(kmem_cache_free_bulk);
 


* [PATCH 3/3] slub: build detached freelist with look-ahead
  2015-07-15 16:01 [PATCH 0/3] slub: introducing detached freelist Jesper Dangaard Brouer
  2015-07-15 16:01 ` [PATCH 1/3] slub: extend slowpath __slab_free() to handle bulk free Jesper Dangaard Brouer
  2015-07-15 16:02 ` [PATCH 2/3] slub: optimize bulk slowpath free by detached freelist Jesper Dangaard Brouer
@ 2015-07-15 16:02 ` Jesper Dangaard Brouer
  2015-07-16  9:57   ` Jesper Dangaard Brouer
  2 siblings, 1 reply; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2015-07-15 16:02 UTC (permalink / raw)
  To: linux-mm, Christoph Lameter, Andrew Morton
  Cc: Joonsoo Kim, Alexander Duyck, Hannes Frederic Sowa,
	Jesper Dangaard Brouer

This change is a more advanced use of the detached freelist.  The
bulk free array is scanned in a progressive manner with a limited
look-ahead facility.

To maintain the same performance level as the previous simple
implementation, the look-ahead has been limited to only 3 objects.
This number was determined by experimental micro benchmarking.
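
The core of the look-ahead scan looks roughly like this (condensed from
build_detached_freelist() in the diff below; the 3-object limit and the
NULL-marking of processed array entries match the patch, the
surrounding bookkeeping is omitted):

  int lookahead = 0;
  int i;

  for (i = start_index; i < size; i++) {
          void *object = p[i];

          if (!object)
                  continue;       /* already processed in an earlier pass */

          if (virt_to_head_page(object) == df->page) {
                  /* Same page: push onto the detached freelist */
                  set_freepointer(s, object, df->freelist);
                  df->freelist = object;
                  df->cnt++;
                  p[i] = NULL;    /* mark object processed */
          } else if (++lookahead >= 3) {
                  break;          /* limit the look-ahead search */
          }
  }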

For performance, the free loop in kmem_cache_free_bulk() has been
significantly reorganized, with a focus on making the branches more
predictable for the compiler.  E.g. the per-CPU c->freelist is also
built as a detached freelist, even though freeing directly to it would
be just as fast, because this saves creating an unpredictable branch.

Another benefit of this change is that kmem_cache_free_bulk() runs
mostly with IRQs enabled.  The local IRQs are only disabled when
updating the per CPU c->freelist.  This should please Thomas Gleixner.

Pitfall(1): Removed kmem debug support.

Pitfall(2): No BUG_ON() when freeing NULL pointers; the algorithm
            handles and skips these NULL pointers.

Compared against the previous patch:
 There is some fluctuation in the benchmarks between runs.  To counter
this I've run some specific[1] bulk sizes, repeated 100 times, and run
dmesg through Rusty's "stats"[2] tool.

Command line:
  sudo dmesg -c ;\
  for x in `seq 100`; do \
    modprobe slab_bulk_test02 bulksz=48 loops=100000 && rmmod slab_bulk_test02; \
    echo $x; \
    sleep 0.${RANDOM} ;\
  done; \
  dmesg | stats

Results:

bulk size:16, average: +2.01 cycles
 Prev: between 19-52 (average: 22.65 stddev:+/-6.9)
 This: between 19-67 (average: 24.67 stddev:+/-9.9)

bulk size:48, average: +1.54 cycles
 Prev: between 23-45 (average: 27.88 stddev:+/-4)
 This: between 24-41 (average: 29.42 stddev:+/-3.7)

bulk size:144, average: +1.73 cycles
 Prev: between 44-76 (average: 60.31 stddev:+/-7.7)
 This: between 49-80 (average: 62.04 stddev:+/-7.3)

bulk size:512, average: +8.94 cycles
 Prev: between 50-68 (average: 60.11 stddev: +/-4.3)
 This: between 56-80 (average: 69.05 stddev: +/-5.2)

bulk size:2048, average: +26.81 cycles
 Prev: between 61-73 (average: 68.10 stddev:+/-2.9)
 This: between 90-104(average: 94.91 stddev:+/-2.1)

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test02.c
[2] https://github.com/rustyrussell/stats

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

---
bulk- Fallback                  - Bulk API
  1 -  64 cycles(tsc) 16.144 ns - 47 cycles(tsc) 11.931 - improved 26.6%
  2 -  57 cycles(tsc) 14.397 ns - 29 cycles(tsc)  7.368 - improved 49.1%
  3 -  55 cycles(tsc) 13.797 ns - 24 cycles(tsc)  6.003 - improved 56.4%
  4 -  53 cycles(tsc) 13.500 ns - 22 cycles(tsc)  5.543 - improved 58.5%
  8 -  52 cycles(tsc) 13.008 ns - 20 cycles(tsc)  5.047 - improved 61.5%
 16 -  51 cycles(tsc) 12.763 ns - 20 cycles(tsc)  5.015 - improved 60.8%
 30 -  50 cycles(tsc) 12.743 ns - 20 cycles(tsc)  5.062 - improved 60.0%
 32 -  51 cycles(tsc) 12.908 ns - 20 cycles(tsc)  5.089 - improved 60.8%
 34 -  87 cycles(tsc) 21.936 ns - 28 cycles(tsc)  7.006 - improved 67.8%
 48 -  79 cycles(tsc) 19.840 ns - 31 cycles(tsc)  7.755 - improved 60.8%
 64 -  86 cycles(tsc) 21.669 ns - 68 cycles(tsc) 17.203 - improved 20.9%
128 - 101 cycles(tsc) 25.340 ns - 72 cycles(tsc) 18.195 - improved 28.7%
158 - 112 cycles(tsc) 28.152 ns - 73 cycles(tsc) 18.372 - improved 34.8%
250 - 110 cycles(tsc) 27.727 ns - 73 cycles(tsc) 18.430 - improved 33.6%

 mm/slub.c |  138 ++++++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 90 insertions(+), 48 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index ce4118566761..06fef8f503a1 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2762,71 +2762,113 @@ struct detached_freelist {
 	int cnt;
 };
 
-/* Note that interrupts must be enabled when calling this function. */
-void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
+/*
+ * This function extract objects belonging to the same page, and
+ * builds a detached freelist directly within the given page/objects.
+ * This can happen without any need for synchronization, because the
+ * objects are owned by running process.  The freelist is build up as
+ * a single linked list in the objects.  The idea is, that this
+ * detached freelist can then be bulk transferred to the real
+ * freelist(s), but only requiring a single synchronization primitive.
+ */
+static inline int build_detached_freelist(
+	struct kmem_cache *s, size_t size, void **p,
+	struct detached_freelist *df, int start_index)
 {
-	struct kmem_cache_cpu *c;
 	struct page *page;
 	int i;
-	/* Opportunistically delay updating page->freelist, hoping
-	 * next free happen to same page.  Start building the freelist
-	 * in the page, but keep local stack ptr to freelist.  If
-	 * successful several object can be transferred to page with a
-	 * single cmpxchg_double.
-	 */
-	struct detached_freelist df = {0};
+	int lookahead = 0;
+	void *object;
 
-	local_irq_disable();
-	c = this_cpu_ptr(s->cpu_slab);
+	/* Always re-init detached_freelist */
+	do {
+		object = p[start_index];
+		if (object) {
+			/* Start new delayed freelist */
+			df->page = virt_to_head_page(object);
+			df->tail_object = object;
+			set_freepointer(s, object, NULL);
+			df->freelist = object;
+			df->cnt = 1;
+			p[start_index] = NULL; /* mark object processed */
+		} else {
+			df->page = NULL; /* Handle NULL ptr in array */
+		}
+		start_index++;
+	} while (!object && start_index < size);
 
-	for (i = 0; i < size; i++) {
-		void *object = p[i];
+	for (i = start_index; i < size; i++) {
+		object = p[i];
 
-		BUG_ON(!object);
-		/* kmem cache debug support */
-		s = cache_from_obj(s, object);
-		if (unlikely(!s))
-			goto exit;
-		slab_free_hook(s, object);
+		if (!object)
+			continue; /* Skip processed objects */
 
 		page = virt_to_head_page(object);
 
-		if (page == df.page) {
-			/* Opportunity to delay real free */
-			set_freepointer(s, object, df.freelist);
-			df.freelist = object;
-			df.cnt++;
-		} else if (c->page == page) {
-			/* Fastpath: local CPU free */
-			set_freepointer(s, object, c->freelist);
-			c->freelist = object;
+		/* df->page is always set at this point */
+		if (page == df->page) {
+			/* Opportunity to build freelist */
+			set_freepointer(s, object, df->freelist);
+			df->freelist = object;
+			df->cnt++;
+			p[i] = NULL; /* mark object processed */
+			if (!lookahead)
+				start_index++;
 		} else {
-			/* Slowpath: Flush delayed free */
-			if (df.page) {
+			/* Limit look ahead search */
+			if (++lookahead >= 3)
+				return start_index;
+			continue;
+		}
+	}
+	return start_index;
+}
+
+/* Note that interrupts must be enabled when calling this function. */
+void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
+{
+	struct kmem_cache_cpu *c;
+	int iterator = 0;
+	struct detached_freelist df;
+
+	BUG_ON(!size);
+
+	/* Per CPU ptr may change afterwards */
+	c = this_cpu_ptr(s->cpu_slab);
+
+	while (likely(iterator < size)) {
+		iterator = build_detached_freelist(s, size, p, &df, iterator);
+		if (likely(df.page)) {
+		redo:
+			if (c->page == df.page) {
+				/*
+				 * Local CPU free require disabling
+				 * IRQs.  It is possible to miss the
+				 * opportunity and instead free to
+				 * page->freelist, but it does not
+				 * matter as page->freelist will
+				 * eventually be transferred to
+				 * c->freelist
+				 */
+				local_irq_disable();
+				c = this_cpu_ptr(s->cpu_slab); /* reload */
+				if (c->page != df.page) {
+					local_irq_enable();
+					goto redo;
+				}
+				/* Bulk transfer to CPU c->freelist */
+				set_freepointer(s, df.tail_object, c->freelist);
+				c->freelist = df.freelist;
+
 				c->tid = next_tid(c->tid);
 				local_irq_enable();
+			} else {
+				/* Bulk transfer to page->freelist */
 				__slab_free(s, df.page, df.tail_object,
 					    _RET_IP_, df.freelist, df.cnt);
-				local_irq_disable();
-				c = this_cpu_ptr(s->cpu_slab);
 			}
-			/* Start new round of delayed free */
-			df.page = page;
-			df.tail_object = object;
-			set_freepointer(s, object, NULL);
-			df.freelist = object;
-			df.cnt = 1;
 		}
 	}
-exit:
-	c->tid = next_tid(c->tid);
-	local_irq_enable();
-
-	/* Flush detached freelist */
-	if (df.page) {
-		__slab_free(s, df.page, df.tail_object,
-			    _RET_IP_, df.freelist, df.cnt);
-	}
 }
 EXPORT_SYMBOL(kmem_cache_free_bulk);
 


* Re: [PATCH 1/3] slub: extend slowpath __slab_free() to handle bulk free
  2015-07-15 16:01 ` [PATCH 1/3] slub: extend slowpath __slab_free() to handle bulk free Jesper Dangaard Brouer
@ 2015-07-15 16:54   ` Christoph Lameter
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2015-07-15 16:54 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Alexander Duyck,
	Hannes Frederic Sowa

On Wed, 15 Jul 2015, Jesper Dangaard Brouer wrote:

> This allows a list of object to be free'ed using a single locked
> cmpxchg_double.

Well not really. The objects that are to be freed on the list have
additional requirements. They must all be objects from the *same* slab
page. This needs to be pointed out everywhere otherwise people will try to free
random objects via this function and we will have weird failure cases.



* Re: [PATCH 2/3] slub: optimize bulk slowpath free by detached freelist
  2015-07-15 16:02 ` [PATCH 2/3] slub: optimize bulk slowpath free by detached freelist Jesper Dangaard Brouer
@ 2015-07-15 16:56   ` Christoph Lameter
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2015-07-15 16:56 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Alexander Duyck,
	Hannes Frederic Sowa

On Wed, 15 Jul 2015, Jesper Dangaard Brouer wrote:

> Given these properties, the brilliant part is that the detached
> freelist can be constructed without any need for synchronization.
> The freelist is constructed directly in the page objects, without any
> synchronization needed.  The detached freelist is allocated on the
> stack of the function call kmem_cache_free_bulk.  Thus, the freelist
> head pointer is not visible to other CPUs.

Good idea.


* Re: [PATCH 3/3] slub: build detached freelist with look-ahead
  2015-07-15 16:02 ` [PATCH 3/3] slub: build detached freelist with look-ahead Jesper Dangaard Brouer
@ 2015-07-16  9:57   ` Jesper Dangaard Brouer
  2015-07-20  2:54     ` Joonsoo Kim
  0 siblings, 1 reply; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2015-07-16  9:57 UTC (permalink / raw)
  To: linux-mm, Christoph Lameter, Andrew Morton
  Cc: Joonsoo Kim, Alexander Duyck, Hannes Frederic Sowa, brouer


On Wed, 15 Jul 2015 18:02:39 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> Results:
[...]
> bulk- Fallback                  - Bulk API
>   1 -  64 cycles(tsc) 16.144 ns - 47 cycles(tsc) 11.931 - improved 26.6%
>   2 -  57 cycles(tsc) 14.397 ns - 29 cycles(tsc)  7.368 - improved 49.1%
>   3 -  55 cycles(tsc) 13.797 ns - 24 cycles(tsc)  6.003 - improved 56.4%
>   4 -  53 cycles(tsc) 13.500 ns - 22 cycles(tsc)  5.543 - improved 58.5%
>   8 -  52 cycles(tsc) 13.008 ns - 20 cycles(tsc)  5.047 - improved 61.5%
>  16 -  51 cycles(tsc) 12.763 ns - 20 cycles(tsc)  5.015 - improved 60.8%
>  30 -  50 cycles(tsc) 12.743 ns - 20 cycles(tsc)  5.062 - improved 60.0%
>  32 -  51 cycles(tsc) 12.908 ns - 20 cycles(tsc)  5.089 - improved 60.8%
>  34 -  87 cycles(tsc) 21.936 ns - 28 cycles(tsc)  7.006 - improved 67.8%
>  48 -  79 cycles(tsc) 19.840 ns - 31 cycles(tsc)  7.755 - improved 60.8%
>  64 -  86 cycles(tsc) 21.669 ns - 68 cycles(tsc) 17.203 - improved 20.9%
> 128 - 101 cycles(tsc) 25.340 ns - 72 cycles(tsc) 18.195 - improved 28.7%
> 158 - 112 cycles(tsc) 28.152 ns - 73 cycles(tsc) 18.372 - improved 34.8%
> 250 - 110 cycles(tsc) 27.727 ns - 73 cycles(tsc) 18.430 - improved 33.6%


Something interesting happens when I'm tuning the SLAB/SLUB cache...

I was wondering what happens if I "give" SLUB more per-CPU partial
pages.  In my benchmark, 250 is my "max" bulk working set.

Tuning SLUB for the 256-byte object size, by telling SLUB that each
CPU partial list should be allowed to contain 256 objects (cpu_partial):

 sudo sh -c 'echo 256 > /sys/kernel/slab/:t-0000256/cpu_partial'

And adjusting 'min_partial' affects __slab_free() by avoiding removal
of a partial page if node->nr_partial >= s->min_partial.  Thus, in our
test, min_partial=9 results in keeping 9 pages (32 * 9 = 288 objects)
on the node partial list:

 sudo sh -c 'echo 9   > /sys/kernel/slab/:t-0000256/min_partial'
 sudo grep -H . /sys/kernel/slab/:t-0000256/*

First notice the normal fastpath is: 47 cycles(tsc) 11.894 ns

Patch03-TUNED-run01:
bulk-  Fallback                 - Bulk-API
  1 -  63 cycles(tsc) 15.866 ns - 46 cycles(tsc) 11.653 ns - improved 27.0%
  2 -  56 cycles(tsc) 14.137 ns - 28 cycles(tsc)  7.106 ns - improved 50.0%
  3 -  54 cycles(tsc) 13.623 ns - 23 cycles(tsc)  5.845 ns - improved 57.4%
  4 -  53 cycles(tsc) 13.345 ns - 21 cycles(tsc)  5.316 ns - improved 60.4%
  8 -  51 cycles(tsc) 12.960 ns - 20 cycles(tsc)  5.187 ns - improved 60.8%
 16 -  50 cycles(tsc) 12.743 ns - 20 cycles(tsc)  5.091 ns - improved 60.0%
 30 -  80 cycles(tsc) 20.153 ns - 28 cycles(tsc)  7.054 ns - improved 65.0%
 32 -  82 cycles(tsc) 20.621 ns - 33 cycles(tsc)  8.392 ns - improved 59.8%
 34 -  80 cycles(tsc) 20.125 ns - 32 cycles(tsc)  8.046 ns - improved 60.0%
 48 -  91 cycles(tsc) 22.887 ns - 30 cycles(tsc)  7.655 ns - improved 67.0%
 64 -  85 cycles(tsc) 21.362 ns - 36 cycles(tsc)  9.141 ns - improved 57.6%
128 - 101 cycles(tsc) 25.481 ns - 33 cycles(tsc)  8.286 ns - improved 67.3%
158 - 103 cycles(tsc) 25.909 ns - 36 cycles(tsc)  9.179 ns - improved 65.0%
250 - 105 cycles(tsc) 26.481 ns - 39 cycles(tsc)  9.994 ns - improved 62.9%

Notice how ALL of the bulk sizes now are faster than the 47 cycles of
the normal slub fastpath.  This is amazing!

A little strangely, the tuning didn't seem to help the fallback version.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer



On Wed, 15 Jul 2015 18:02:39 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> Results:
> 
> bulk size:16, average: +2.01 cycles
>  Prev: between 19-52 (average: 22.65 stddev:+/-6.9)
>  This: between 19-67 (average: 24.67 stddev:+/-9.9)

bulk16:  19-39(average: 21.68+/-4.5) cycles(tsc)
 
> bulk size:48, average: +1.54 cycles
>  Prev: between 23-45 (average: 27.88 stddev:+/-4)
>  This: between 24-41 (average: 29.42 stddev:+/-3.7)

bulk48:  25-38(average: 28.4+/-2.3) cycles(tsc)
 
> bulk size:144, average: +1.73 cycles
>  Prev: between 44-76 (average: 60.31 stddev:+/-7.7)
>  This: between 49-80 (average: 62.04 stddev:+/-7.3)

bulk144: 31-45(average: 34.54+/-3.4) cycles(tsc)

> bulk size:512, average: +8.94 cycles
>  Prev: between 50-68 (average: 60.11 stddev: +/-4.3)
>  This: between 56-80 (average: 69.05 stddev: +/-5.2)

bulk512: 38-68(average: 44.48+/-7.1) cycles(tsc)
(quite good given working set tuned for is 256)

> bulk size:2048, average: +26.81 cycles
>  Prev: between 61-73 (average: 68.10 stddev:+/-2.9)
>  This: between 90-104(average: 94.91 stddev:+/-2.1)

bulk2048: 80-87(average: 83.19+/-1.1)
 
> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test02.c
> [2] https://github.com/rustyrussell/stats


* Re: [PATCH 3/3] slub: build detached freelist with look-ahead
  2015-07-16  9:57   ` Jesper Dangaard Brouer
@ 2015-07-20  2:54     ` Joonsoo Kim
  2015-07-20 21:28       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 14+ messages in thread
From: Joonsoo Kim @ 2015-07-20  2:54 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: linux-mm, Christoph Lameter, Andrew Morton, Alexander Duyck,
	Hannes Frederic Sowa

On Thu, Jul 16, 2015 at 11:57:56AM +0200, Jesper Dangaard Brouer wrote:
> 
> On Wed, 15 Jul 2015 18:02:39 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> 
> > Results:
> [...]
> > bulk- Fallback                  - Bulk API
> >   1 -  64 cycles(tsc) 16.144 ns - 47 cycles(tsc) 11.931 - improved 26.6%
> >   2 -  57 cycles(tsc) 14.397 ns - 29 cycles(tsc)  7.368 - improved 49.1%
> >   3 -  55 cycles(tsc) 13.797 ns - 24 cycles(tsc)  6.003 - improved 56.4%
> >   4 -  53 cycles(tsc) 13.500 ns - 22 cycles(tsc)  5.543 - improved 58.5%
> >   8 -  52 cycles(tsc) 13.008 ns - 20 cycles(tsc)  5.047 - improved 61.5%
> >  16 -  51 cycles(tsc) 12.763 ns - 20 cycles(tsc)  5.015 - improved 60.8%
> >  30 -  50 cycles(tsc) 12.743 ns - 20 cycles(tsc)  5.062 - improved 60.0%
> >  32 -  51 cycles(tsc) 12.908 ns - 20 cycles(tsc)  5.089 - improved 60.8%
> >  34 -  87 cycles(tsc) 21.936 ns - 28 cycles(tsc)  7.006 - improved 67.8%
> >  48 -  79 cycles(tsc) 19.840 ns - 31 cycles(tsc)  7.755 - improved 60.8%
> >  64 -  86 cycles(tsc) 21.669 ns - 68 cycles(tsc) 17.203 - improved 20.9%
> > 128 - 101 cycles(tsc) 25.340 ns - 72 cycles(tsc) 18.195 - improved 28.7%
> > 158 - 112 cycles(tsc) 28.152 ns - 73 cycles(tsc) 18.372 - improved 34.8%
> > 250 - 110 cycles(tsc) 27.727 ns - 73 cycles(tsc) 18.430 - improved 33.6%
> 
> 
> Something interesting happens, when I'm tuning the SLAB/slub cache...
> 
> I was thinking what happens if I "give" the slub more per CPU partial
> pages.  In my benchmark 250 is my "max" bulk working set.
> 
> Tuning SLAB/slub for 256 bytes object size, by tuning SLUB saying each
> CPU partial should be allowed to contain 256 objects (cpu_partial).
> 
>  sudo sh -c 'echo 256 > /sys/kernel/slab/:t-0000256/cpu_partial'
> 
> And adjusting 'min_partial' affects __slab_free() by avoiding removing
> partial if node->nr_partial >= s->min_partial.  Thus, in our test
> min_partial=9 result in keeping 9 pages 32 * 9 = 288 objects in the
> 
>  sudo sh -c 'echo 9   > /sys/kernel/slab/:t-0000256/min_partial'
>  sudo grep -H . /sys/kernel/slab/:t-0000256/*
> 
> First notice the normal fastpath is: 47 cycles(tsc) 11.894 ns
> 
> Patch03-TUNED-run01:
> bulk-  Fallback                 - Bulk-API
>   1 -  63 cycles(tsc) 15.866 ns - 46 cycles(tsc) 11.653 ns - improved 27.0%
>   2 -  56 cycles(tsc) 14.137 ns - 28 cycles(tsc)  7.106 ns - improved 50.0%
>   3 -  54 cycles(tsc) 13.623 ns - 23 cycles(tsc)  5.845 ns - improved 57.4%
>   4 -  53 cycles(tsc) 13.345 ns - 21 cycles(tsc)  5.316 ns - improved 60.4%
>   8 -  51 cycles(tsc) 12.960 ns - 20 cycles(tsc)  5.187 ns - improved 60.8%
>  16 -  50 cycles(tsc) 12.743 ns - 20 cycles(tsc)  5.091 ns - improved 60.0%
>  30 -  80 cycles(tsc) 20.153 ns - 28 cycles(tsc)  7.054 ns - improved 65.0%
>  32 -  82 cycles(tsc) 20.621 ns - 33 cycles(tsc)  8.392 ns - improved 59.8%
>  34 -  80 cycles(tsc) 20.125 ns - 32 cycles(tsc)  8.046 ns - improved 60.0%
>  48 -  91 cycles(tsc) 22.887 ns - 30 cycles(tsc)  7.655 ns - improved 67.0%
>  64 -  85 cycles(tsc) 21.362 ns - 36 cycles(tsc)  9.141 ns - improved 57.6%
> 128 - 101 cycles(tsc) 25.481 ns - 33 cycles(tsc)  8.286 ns - improved 67.3%
> 158 - 103 cycles(tsc) 25.909 ns - 36 cycles(tsc)  9.179 ns - improved 65.0%
> 250 - 105 cycles(tsc) 26.481 ns - 39 cycles(tsc)  9.994 ns - improved 62.9%
> 
> Notice how ALL of the bulk sizes now are faster than the 47 cycles of
> the normal slub fastpath.  This is amazing!
> 
> A little strangely, the tuning didn't seem to help the fallback version.

Hello,

Looks very nice.

I have some questions about your benchmark and result.

1. Is the slab merged?
- Your above result shows that fallback bulk for 30, 32 takes longer
  than fallback bulk for 16. This is a strange result because fallback
  bulk allocation/free for 16, 30, 32 should happen only on the cpu
  cache.  If the slab is merged, you should turn off merging to get a
  precise result.

2. Could you show the result with only min_partial tuned?
- I guess that much of the improvement for the Bulk API comes from the
  disappearing slab page allocation/free cost rather than from tuning
  cpu_partial.

3. For a more precise test setup, how about setting cpu affinity?

Thanks.


* Re: [PATCH 3/3] slub: build detached freelist with look-ahead
  2015-07-20  2:54     ` Joonsoo Kim
@ 2015-07-20 21:28       ` Jesper Dangaard Brouer
  2015-07-21 13:50         ` Christoph Lameter
  0 siblings, 1 reply; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2015-07-20 21:28 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: linux-mm, Christoph Lameter, Andrew Morton, Alexander Duyck,
	Hannes Frederic Sowa, brouer

On Mon, 20 Jul 2015 11:54:15 +0900
Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:

> On Thu, Jul 16, 2015 at 11:57:56AM +0200, Jesper Dangaard Brouer wrote:
> > 
> > On Wed, 15 Jul 2015 18:02:39 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > 
> > > Results:
> > [...]
> > > bulk- Fallback                  - Bulk API
> > >   1 -  64 cycles(tsc) 16.144 ns - 47 cycles(tsc) 11.931 - improved 26.6%
> > >   2 -  57 cycles(tsc) 14.397 ns - 29 cycles(tsc)  7.368 - improved 49.1%
> > >   3 -  55 cycles(tsc) 13.797 ns - 24 cycles(tsc)  6.003 - improved 56.4%
> > >   4 -  53 cycles(tsc) 13.500 ns - 22 cycles(tsc)  5.543 - improved 58.5%
> > >   8 -  52 cycles(tsc) 13.008 ns - 20 cycles(tsc)  5.047 - improved 61.5%
> > >  16 -  51 cycles(tsc) 12.763 ns - 20 cycles(tsc)  5.015 - improved 60.8%
> > >  30 -  50 cycles(tsc) 12.743 ns - 20 cycles(tsc)  5.062 - improved 60.0%
> > >  32 -  51 cycles(tsc) 12.908 ns - 20 cycles(tsc)  5.089 - improved 60.8%
> > >  34 -  87 cycles(tsc) 21.936 ns - 28 cycles(tsc)  7.006 - improved 67.8%
> > >  48 -  79 cycles(tsc) 19.840 ns - 31 cycles(tsc)  7.755 - improved 60.8%
> > >  64 -  86 cycles(tsc) 21.669 ns - 68 cycles(tsc) 17.203 - improved 20.9%
> > > 128 - 101 cycles(tsc) 25.340 ns - 72 cycles(tsc) 18.195 - improved 28.7%
> > > 158 - 112 cycles(tsc) 28.152 ns - 73 cycles(tsc) 18.372 - improved 34.8%
> > > 250 - 110 cycles(tsc) 27.727 ns - 73 cycles(tsc) 18.430 - improved 33.6%
> > 
> > 
> > Something interesting happens, when I'm tuning the SLAB/slub cache...
> > 
> > I was thinking what happens if I "give" the slub more per CPU partial
> > pages.  In my benchmark 250 is my "max" bulk working set.
> > 
> > Tuning SLAB/slub for 256 bytes object size, by tuning SLUB saying each
> > CPU partial should be allowed to contain 256 objects (cpu_partial).
> > 
> >  sudo sh -c 'echo 256 > /sys/kernel/slab/:t-0000256/cpu_partial'
> > 
> > And adjusting 'min_partial' affects __slab_free() by avoiding removing
> > partial if node->nr_partial >= s->min_partial.  Thus, in our test
> > min_partial=9 result in keeping 9 pages 32 * 9 = 288 objects in the
> > 
> >  sudo sh -c 'echo 9   > /sys/kernel/slab/:t-0000256/min_partial'
> >  sudo grep -H . /sys/kernel/slab/:t-0000256/*
> > 
> > First notice the normal fastpath is: 47 cycles(tsc) 11.894 ns
> > 
> > Patch03-TUNED-run01:
> > bulk-  Fallback                 - Bulk-API
> >   1 -  63 cycles(tsc) 15.866 ns - 46 cycles(tsc) 11.653 ns - improved 27.0%
> >   2 -  56 cycles(tsc) 14.137 ns - 28 cycles(tsc)  7.106 ns - improved 50.0%
> >   3 -  54 cycles(tsc) 13.623 ns - 23 cycles(tsc)  5.845 ns - improved 57.4%
> >   4 -  53 cycles(tsc) 13.345 ns - 21 cycles(tsc)  5.316 ns - improved 60.4%
> >   8 -  51 cycles(tsc) 12.960 ns - 20 cycles(tsc)  5.187 ns - improved 60.8%
> >  16 -  50 cycles(tsc) 12.743 ns - 20 cycles(tsc)  5.091 ns - improved 60.0%
> >  30 -  80 cycles(tsc) 20.153 ns - 28 cycles(tsc)  7.054 ns - improved 65.0%
> >  32 -  82 cycles(tsc) 20.621 ns - 33 cycles(tsc)  8.392 ns - improved 59.8%
> >  34 -  80 cycles(tsc) 20.125 ns - 32 cycles(tsc)  8.046 ns - improved 60.0%
> >  48 -  91 cycles(tsc) 22.887 ns - 30 cycles(tsc)  7.655 ns - improved 67.0%
> >  64 -  85 cycles(tsc) 21.362 ns - 36 cycles(tsc)  9.141 ns - improved 57.6%
> > 128 - 101 cycles(tsc) 25.481 ns - 33 cycles(tsc)  8.286 ns - improved 67.3%
> > 158 - 103 cycles(tsc) 25.909 ns - 36 cycles(tsc)  9.179 ns - improved 65.0%
> > 250 - 105 cycles(tsc) 26.481 ns - 39 cycles(tsc)  9.994 ns - improved 62.9%
> > 
> > Notice how ALL of the bulk sizes now are faster than the 47 cycles of
> > the normal slub fastpath.  This is amazing!
> > 
> > A little strangely, the tuning didn't seem to help the fallback version.
> 
> Hello,
> 
> Looks very nice.

Thanks :-)

> I have some questions about your benchmark and result.
> 
> 1. Does the slab is merged?
> - Your above result shows that fallback bulk for 30, 32 takes longer
>   than fallback bulk for 16. This is strange result because fallback
>   bulk allocation/free for 16, 30, 32 should happens only on cpu cache.

I guess it depends on how "used/full" the page is... some other
subsystem can hold on to objects...

>   If the slab is merged, you should turn off merging to get precise
>   result.

Yes, I think it is merged... how do I turn off merging?

Before adjusting/tuning the SLAB.

$ sudo grep -H . /sys/kernel/slab/:t-0000256/{cpu_partial,min_partial,order,objs_per_slab}
/sys/kernel/slab/:t-0000256/cpu_partial:13
/sys/kernel/slab/:t-0000256/min_partial:5
/sys/kernel/slab/:t-0000256/order:1
/sys/kernel/slab/:t-0000256/objs_per_slab:32

Run01: non-tuned
1 - 64 cycles(tsc) 16.092 ns -  47 cycles(tsc) 11.886 ns
2 - 57 cycles(tsc) 14.258 ns -  28 cycles(tsc) 7.226 ns
3 - 54 cycles(tsc) 13.626 ns -  23 cycles(tsc) 5.822 ns
4 - 53 cycles(tsc) 13.328 ns -  20 cycles(tsc) 5.185 ns
8 - 93 cycles(tsc) 23.301 ns -  49 cycles(tsc) 12.406 ns
16 - 83 cycles(tsc) 20.902 ns -  37 cycles(tsc) 9.418 ns
30 - 77 cycles(tsc) 19.400 ns -  30 cycles(tsc) 7.748 ns
32 - 79 cycles(tsc) 19.938 ns -  30 cycles(tsc) 7.751 ns
34 - 80 cycles(tsc) 20.215 ns -  35 cycles(tsc) 8.907 ns
48 - 85 cycles(tsc) 21.391 ns -  24 cycles(tsc) 6.219 ns
64 - 93 cycles(tsc) 23.272 ns -  67 cycles(tsc) 16.874 ns
128 - 101 cycles(tsc) 25.407 ns -  72 cycles(tsc) 18.097 ns
158 - 105 cycles(tsc) 26.319 ns -  72 cycles(tsc) 18.164 ns
250 - 107 cycles(tsc) 26.783 ns -  72 cycles(tsc) 18.246 ns

Run02: non-tuned
1 - 63 cycles(tsc) 15.864 ns -  46 cycles(tsc) 11.672 ns
2 - 56 cycles(tsc) 14.153 ns -  28 cycles(tsc) 7.119 ns
3 - 54 cycles(tsc) 13.681 ns -  23 cycles(tsc) 5.846 ns
4 - 53 cycles(tsc) 13.354 ns -  20 cycles(tsc) 5.141 ns
8 - 51 cycles(tsc) 12.970 ns -  19 cycles(tsc) 4.954 ns
16 - 51 cycles(tsc) 12.763 ns -  20 cycles(tsc) 5.003 ns
30 - 51 cycles(tsc) 12.760 ns -  20 cycles(tsc) 5.065 ns
32 - 80 cycles(tsc) 20.045 ns -  37 cycles(tsc) 9.311 ns
34 - 73 cycles(tsc) 18.454 ns -  27 cycles(tsc) 6.773 ns
48 - 82 cycles(tsc) 20.544 ns -  35 cycles(tsc) 8.973 ns
64 - 87 cycles(tsc) 21.809 ns -  60 cycles(tsc) 15.167 ns
128 - 103 cycles(tsc) 25.772 ns -  63 cycles(tsc) 15.874 ns
158 - 104 cycles(tsc) 26.215 ns -  61 cycles(tsc) 15.433 ns
250 - 107 cycles(tsc) 26.926 ns -  60 cycles(tsc) 15.058 ns

Notice the variation is fairly high between runs... :-(

> 3. For more precise test setup, how about setting cpu affinity?

Sure, starting to use test cmd:
 sudo taskset -c 1 modprobe slab_bulk_test01 && rmmod slab_bulk_test01 && sudo dmesg

Code:
 https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c

For these runs I've also disabled HT (Hyper-Threading) in the BIOS, as
this turned out to be a big disturbance for my network testing use-case.
(ps. I've hacked together a use-case in ixgbe/skbuff.c, but only
TX-completion bulk-free, which shows improvements of 3 ns and 16 ns with
this slab tuning; once I also implement bulk alloc I should get a better
boost.)


> 2. Could you show result with only tuning min_partial?
> - I guess that much improvement for Bulk-API comes from disappearing
>   slab page allocation/free cost rather than tuning cpu_partial.

Sure, here are some more runs:

  sudo sh -c 'echo 9   > /sys/kernel/slab/:t-0000256/min_partial'

Run03: tuned min_partial=9
1 - 63 cycles(tsc) 15.910 ns -  46 cycles(tsc) 11.720 ns
2 - 57 cycles(tsc) 14.318 ns -  29 cycles(tsc) 7.266 ns
3 - 55 cycles(tsc) 13.762 ns -  23 cycles(tsc) 5.937 ns
4 - 53 cycles(tsc) 13.459 ns -  20 cycles(tsc) 5.211 ns
8 - 51 cycles(tsc) 13.001 ns -  19 cycles(tsc) 4.821 ns
16 - 51 cycles(tsc) 12.772 ns -  20 cycles(tsc) 5.016 ns
30 - 84 cycles(tsc) 21.135 ns -  28 cycles(tsc) 7.047 ns
32 - 83 cycles(tsc) 20.887 ns -  28 cycles(tsc) 7.133 ns
34 - 81 cycles(tsc) 20.454 ns -  28 cycles(tsc) 7.024 ns
48 - 86 cycles(tsc) 21.662 ns -  32 cycles(tsc) 8.121 ns
64 - 92 cycles(tsc) 23.027 ns -  52 cycles(tsc) 13.033 ns
128 - 97 cycles(tsc) 24.270 ns -  51 cycles(tsc) 12.865 ns
158 - 105 cycles(tsc) 26.290 ns -  53 cycles(tsc) 13.435 ns
250 - 106 cycles(tsc) 26.545 ns -  54 cycles(tsc) 13.607 ns

Run04: tuned min_partial=9
1 - 64 cycles(tsc) 16.123 ns -  47 cycles(tsc) 11.906 ns
2 - 57 cycles(tsc) 14.267 ns -  28 cycles(tsc) 7.235 ns
3 - 54 cycles(tsc) 13.691 ns -  23 cycles(tsc) 5.916 ns
4 - 53 cycles(tsc) 13.470 ns -  21 cycles(tsc) 5.278 ns
8 - 51 cycles(tsc) 12.991 ns -  19 cycles(tsc) 4.815 ns
16 - 50 cycles(tsc) 12.651 ns -  19 cycles(tsc) 4.840 ns
30 - 81 cycles(tsc) 20.282 ns -  35 cycles(tsc) 8.835 ns
32 - 77 cycles(tsc) 19.327 ns -  29 cycles(tsc) 7.403 ns
34 - 77 cycles(tsc) 19.438 ns -  31 cycles(tsc) 7.879 ns
48 - 85 cycles(tsc) 21.367 ns -  34 cycles(tsc) 8.563 ns
64 - 87 cycles(tsc) 21.830 ns -  55 cycles(tsc) 13.820 ns
128 - 109 cycles(tsc) 27.445 ns -  56 cycles(tsc) 14.152 ns
158 - 102 cycles(tsc) 25.576 ns -  60 cycles(tsc) 15.120 ns
250 - 108 cycles(tsc) 27.069 ns -  58 cycles(tsc) 14.534 ns

Looking at Run04 the win was not so big...

Also adjust cpu_partial:
 sudo sh -c 'echo 256 > /sys/kernel/slab/:t-0000256/cpu_partial'

$ sudo grep -H . /sys/kernel/slab/:t-0000256/{cpu_partial,min_partial,order,objs_per_slab}
/sys/kernel/slab/:t-0000256/cpu_partial:256
/sys/kernel/slab/:t-0000256/min_partial:9
/sys/kernel/slab/:t-0000256/order:1
/sys/kernel/slab/:t-0000256/objs_per_slab:32

Run05: also tuned cpu_partial=256
1 - 63 cycles(tsc) 15.867 ns -  46 cycles(tsc) 11.656 ns
2 - 56 cycles(tsc) 14.229 ns -  28 cycles(tsc) 7.131 ns
3 - 54 cycles(tsc) 13.587 ns -  23 cycles(tsc) 5.760 ns
4 - 53 cycles(tsc) 13.287 ns -  20 cycles(tsc) 5.081 ns
8 - 51 cycles(tsc) 12.935 ns -  19 cycles(tsc) 4.953 ns
16 - 50 cycles(tsc) 12.707 ns -  20 cycles(tsc) 5.074 ns
30 - 79 cycles(tsc) 19.927 ns -  28 cycles(tsc) 7.057 ns
32 - 79 cycles(tsc) 19.977 ns -  31 cycles(tsc) 7.762 ns
34 - 79 cycles(tsc) 19.800 ns -  33 cycles(tsc) 8.392 ns
48 - 93 cycles(tsc) 23.316 ns -  35 cycles(tsc) 8.777 ns
64 - 92 cycles(tsc) 23.144 ns -  33 cycles(tsc) 8.449 ns
128 - 97 cycles(tsc) 24.268 ns -  35 cycles(tsc) 8.943 ns
158 - 106 cycles(tsc) 26.606 ns -  40 cycles(tsc) 10.067 ns
250 - 109 cycles(tsc) 27.385 ns -  51 cycles(tsc) 12.957 ns

Run06: also tuned cpu_partial=256
1 - 63 cycles(tsc) 15.952 ns -  46 cycles(tsc) 11.710 ns
2 - 57 cycles(tsc) 14.309 ns -  29 cycles(tsc) 7.261 ns
3 - 54 cycles(tsc) 13.703 ns -  23 cycles(tsc) 5.858 ns
4 - 53 cycles(tsc) 13.394 ns -  20 cycles(tsc) 5.161 ns
8 - 52 cycles(tsc) 13.013 ns -  19 cycles(tsc) 4.809 ns
16 - 94 cycles(tsc) 23.734 ns -  49 cycles(tsc) 12.376 ns
30 - 88 cycles(tsc) 22.221 ns -  35 cycles(tsc) 8.933 ns
32 - 101 cycles(tsc) 25.319 ns -  41 cycles(tsc) 10.437 ns
34 - 98 cycles(tsc) 24.711 ns -  41 cycles(tsc) 10.485 ns
48 - 96 cycles(tsc) 24.119 ns -  41 cycles(tsc) 10.479 ns
64 - 100 cycles(tsc) 25.223 ns -  39 cycles(tsc) 9.766 ns
128 - 100 cycles(tsc) 25.078 ns -  34 cycles(tsc) 8.602 ns
158 - 102 cycles(tsc) 25.673 ns -  38 cycles(tsc) 9.645 ns
250 - 110 cycles(tsc) 27.560 ns -  40 cycles(tsc) 10.046 ns

(p.s. I'm currently on vacation for 3 weeks...)
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH 3/3] slub: build detached freelist with look-ahead
  2015-07-20 21:28       ` Jesper Dangaard Brouer
@ 2015-07-21 13:50         ` Christoph Lameter
  2015-07-21 23:28           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Lameter @ 2015-07-21 13:50 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Joonsoo Kim, linux-mm, Andrew Morton, Alexander Duyck,
	Hannes Frederic Sowa

On Mon, 20 Jul 2015, Jesper Dangaard Brouer wrote:

> Yes, I think it is merged... how do I turn off merging?

linux/Documentation/kernel-parameters.txt

        slab_nomerge    [MM]
                        Disable merging of slabs with similar size. May be
                        necessary if there is some reason to distinguish
                        allocs to different slabs. Debug options disable
                        merging on their own.
                        For more information see Documentation/vm/slub.txt.

        slab_max_order= [MM, SLAB]
                        Determines the maximum allowed order for slabs.
                        A high setting may cause OOMs due to memory
                        fragmentation.  Defaults to 1 for systems with
                        more than 32MB of RAM, 0 otherwise.


       slub_debug[=options[,slabs]]    [MM, SLUB]
                        Enabling slub_debug allows one to determine the
                        culprit if slab objects become corrupted. Enabling
                        slub_debug can create guard zones around objects and
                        may poison objects when not in use. Also tracks the
                        last alloc / free. For more information see
                        Documentation/vm/slub.txt.

        slub_max_order= [MM, SLUB]
                        Determines the maximum allowed order for slabs.
                        A high setting may cause OOMs due to memory
                        fragmentation. For more information see
                        Documentation/vm/slub.txt.

        slub_min_objects=       [MM, SLUB]
                        The minimum number of objects per slab. SLUB will
                        increase the slab order up to slub_max_order to
                        generate a sufficiently large slab able to contain
                        the number of objects indicated. The higher the number
                        of objects the smaller the overhead of tracking slabs
                        and the less frequently locks need to be acquired.
                        For more information see Documentation/vm/slub.txt.

        slub_min_order= [MM, SLUB]
                        Determines the minimum page order for slabs. Must be
                        lower than slub_max_order.
                        For more information see Documentation/vm/slub.txt.

        slub_nomerge    [MM, SLUB]
                        Same with slab_nomerge. This is supported for legacy.
                        See slab_nomerge for more information.


* Re: [PATCH 3/3] slub: build detached freelist with look-ahead
  2015-07-21 13:50         ` Christoph Lameter
@ 2015-07-21 23:28           ` Jesper Dangaard Brouer
  2015-07-23  6:34             ` Joonsoo Kim
  0 siblings, 1 reply; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2015-07-21 23:28 UTC (permalink / raw)
  To: Christoph Lameter, brouer
  Cc: Joonsoo Kim, linux-mm, Andrew Morton, Alexander Duyck,
	Hannes Frederic Sowa

On Tue, 21 Jul 2015 08:50:36 -0500 (CDT)
Christoph Lameter <cl@linux.com> wrote:

> On Mon, 20 Jul 2015, Jesper Dangaard Brouer wrote:
> 
> > Yes, I think it is merged... how do I turn off merging?
> 
> linux/Documentation/kernel-parameters.txt
> 
>         slab_nomerge    [MM]
>                         Disable merging of slabs with similar size. May be
>                         necessary if there is some reason to distinguish
>                         allocs to different slabs. Debug options disable
>                         merging on their own.
>                         For more information see Documentation/vm/slub.txt.

I was hoping I could define this per slab cache at runtime.  Any
chance this could be made possible?

Setting boot param "slab_nomerge" made my benchmarking VERY stable
between runs (obj size 256).


Run01: slab_nomerge
1 - 63 cycles(tsc) 15.927 ns -  46 cycles(tsc) 11.707 ns
2 - 56 cycles(tsc) 14.185 ns -  28 cycles(tsc) 7.129 ns
3 - 54 cycles(tsc) 13.588 ns -  23 cycles(tsc) 5.762 ns
4 - 53 cycles(tsc) 13.291 ns -  20 cycles(tsc) 5.085 ns
8 - 51 cycles(tsc) 12.918 ns -  19 cycles(tsc) 4.886 ns
16 - 50 cycles(tsc) 12.607 ns -  19 cycles(tsc) 4.858 ns
30 - 51 cycles(tsc) 12.759 ns -  19 cycles(tsc) 4.980 ns
32 - 51 cycles(tsc) 12.930 ns -  19 cycles(tsc) 4.975 ns
34 - 93 cycles(tsc) 23.410 ns -  27 cycles(tsc) 6.924 ns
48 - 80 cycles(tsc) 20.193 ns -  25 cycles(tsc) 6.279 ns
64 - 73 cycles(tsc) 18.322 ns -  23 cycles(tsc) 5.939 ns
128 - 88 cycles(tsc) 22.083 ns -  29 cycles(tsc) 7.413 ns
158 - 97 cycles(tsc) 24.274 ns -  34 cycles(tsc) 8.696 ns
250 - 102 cycles(tsc) 25.556 ns -  40 cycles(tsc) 10.100 ns

Run02: slab_nomerge
1 - 63 cycles(tsc) 15.879 ns -  46 cycles(tsc) 11.701 ns
2 - 56 cycles(tsc) 14.222 ns -  28 cycles(tsc) 7.140 ns
3 - 54 cycles(tsc) 13.586 ns -  23 cycles(tsc) 5.783 ns
4 - 53 cycles(tsc) 13.339 ns -  20 cycles(tsc) 5.095 ns
8 - 51 cycles(tsc) 12.899 ns -  19 cycles(tsc) 4.905 ns
16 - 50 cycles(tsc) 12.624 ns -  19 cycles(tsc) 4.853 ns
30 - 51 cycles(tsc) 12.781 ns -  19 cycles(tsc) 4.984 ns
32 - 51 cycles(tsc) 12.933 ns -  19 cycles(tsc) 4.997 ns
34 - 93 cycles(tsc) 23.421 ns -  27 cycles(tsc) 6.909 ns
48 - 80 cycles(tsc) 20.241 ns -  25 cycles(tsc) 6.267 ns
64 - 73 cycles(tsc) 18.346 ns -  23 cycles(tsc) 5.947 ns
128 - 88 cycles(tsc) 22.192 ns -  29 cycles(tsc) 7.415 ns
158 - 97 cycles(tsc) 24.358 ns -  34 cycles(tsc) 8.693 ns
250 - 102 cycles(tsc) 25.597 ns -  40 cycles(tsc) 10.144 ns

Run03: slab_nomerge
1 - 63 cycles(tsc) 15.897 ns -  46 cycles(tsc) 11.685 ns
2 - 56 cycles(tsc) 14.178 ns -  28 cycles(tsc) 7.132 ns
3 - 54 cycles(tsc) 13.590 ns -  23 cycles(tsc) 5.774 ns
4 - 53 cycles(tsc) 13.314 ns -  20 cycles(tsc) 5.092 ns
8 - 51 cycles(tsc) 12.872 ns -  19 cycles(tsc) 4.886 ns
16 - 50 cycles(tsc) 12.603 ns -  19 cycles(tsc) 4.840 ns
30 - 50 cycles(tsc) 12.750 ns -  19 cycles(tsc) 4.966 ns
32 - 51 cycles(tsc) 12.910 ns -  19 cycles(tsc) 4.977 ns
34 - 93 cycles(tsc) 23.372 ns -  27 cycles(tsc) 6.929 ns
48 - 80 cycles(tsc) 20.205 ns -  25 cycles(tsc) 6.276 ns
64 - 73 cycles(tsc) 18.292 ns -  23 cycles(tsc) 5.929 ns
128 - 90 cycles(tsc) 22.516 ns -  29 cycles(tsc) 7.425 ns
158 - 99 cycles(tsc) 24.825 ns -  34 cycles(tsc) 8.668 ns
250 - 102 cycles(tsc) 25.652 ns -  40 cycles(tsc) 10.129 ns


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH 3/3] slub: build detached freelist with look-ahead
  2015-07-21 23:28           ` Jesper Dangaard Brouer
@ 2015-07-23  6:34             ` Joonsoo Kim
  2015-07-23 11:09               ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 14+ messages in thread
From: Joonsoo Kim @ 2015-07-23  6:34 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Christoph Lameter, linux-mm, Andrew Morton, Alexander Duyck,
	Hannes Frederic Sowa

On Wed, Jul 22, 2015 at 01:28:19AM +0200, Jesper Dangaard Brouer wrote:
> On Tue, 21 Jul 2015 08:50:36 -0500 (CDT)
> Christoph Lameter <cl@linux.com> wrote:
> 
> > On Mon, 20 Jul 2015, Jesper Dangaard Brouer wrote:
> > 
> > > Yes, I think it is merged... how do I turn off merging?
> > 
> > linux/Documentation/kernel-parameters.txt
> > 
> >         slab_nomerge    [MM]
> >                         Disable merging of slabs with similar size. May be
> >                         necessary if there is some reason to distinguish
> >                         allocs to different slabs. Debug options disable
> >                         merging on their own.
> >                         For more information see Documentation/vm/slub.txt.
> 
> I was hoping I could define this per slub runtime.  Any chance this
> would be made possible?

It's not possible to set/reset slab merging at runtime. Once merging
happens, one slab could have objects from different kmem_caches, so we
can't separate them cleanly. The current best approach is to prevent
merging when creating a new kmem_cache, by introducing a new slab flag
such as SLAB_NO_MERGE.
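
As a sketch of that idea (SLAB_NO_MERGE is hypothetical here and does
not exist yet):

  struct kmem_cache *cache;

  /* A cache created with the proposed flag would never be merged,
   * so per-cache sysfs tuning and benchmarking would stay isolated.
   */
  cache = kmem_cache_create("bulk_bench_256", 256, 0,
                            SLAB_NO_MERGE /* proposed flag */, NULL);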

> 
> Setting boot param "slab_nomerge" made my benchmarking VERY stable
> between runs (obj size 256).
> 
> 
> Run01: slab_nomerge
> 1 - 63 cycles(tsc) 15.927 ns -  46 cycles(tsc) 11.707 ns
> 2 - 56 cycles(tsc) 14.185 ns -  28 cycles(tsc) 7.129 ns
> 3 - 54 cycles(tsc) 13.588 ns -  23 cycles(tsc) 5.762 ns
> 4 - 53 cycles(tsc) 13.291 ns -  20 cycles(tsc) 5.085 ns
> 8 - 51 cycles(tsc) 12.918 ns -  19 cycles(tsc) 4.886 ns
> 16 - 50 cycles(tsc) 12.607 ns -  19 cycles(tsc) 4.858 ns
> 30 - 51 cycles(tsc) 12.759 ns -  19 cycles(tsc) 4.980 ns
> 32 - 51 cycles(tsc) 12.930 ns -  19 cycles(tsc) 4.975 ns
> 34 - 93 cycles(tsc) 23.410 ns -  27 cycles(tsc) 6.924 ns
> 48 - 80 cycles(tsc) 20.193 ns -  25 cycles(tsc) 6.279 ns
> 64 - 73 cycles(tsc) 18.322 ns -  23 cycles(tsc) 5.939 ns
> 128 - 88 cycles(tsc) 22.083 ns -  29 cycles(tsc) 7.413 ns
> 158 - 97 cycles(tsc) 24.274 ns -  34 cycles(tsc) 8.696 ns
> 250 - 102 cycles(tsc) 25.556 ns -  40 cycles(tsc) 10.100 ns
> 
> Run02: slab_nomerge
> 1 - 63 cycles(tsc) 15.879 ns -  46 cycles(tsc) 11.701 ns
> 2 - 56 cycles(tsc) 14.222 ns -  28 cycles(tsc) 7.140 ns
> 3 - 54 cycles(tsc) 13.586 ns -  23 cycles(tsc) 5.783 ns
> 4 - 53 cycles(tsc) 13.339 ns -  20 cycles(tsc) 5.095 ns
> 8 - 51 cycles(tsc) 12.899 ns -  19 cycles(tsc) 4.905 ns
> 16 - 50 cycles(tsc) 12.624 ns -  19 cycles(tsc) 4.853 ns
> 30 - 51 cycles(tsc) 12.781 ns -  19 cycles(tsc) 4.984 ns
> 32 - 51 cycles(tsc) 12.933 ns -  19 cycles(tsc) 4.997 ns
> 34 - 93 cycles(tsc) 23.421 ns -  27 cycles(tsc) 6.909 ns
> 48 - 80 cycles(tsc) 20.241 ns -  25 cycles(tsc) 6.267 ns
> 64 - 73 cycles(tsc) 18.346 ns -  23 cycles(tsc) 5.947 ns
> 128 - 88 cycles(tsc) 22.192 ns -  29 cycles(tsc) 7.415 ns
> 158 - 97 cycles(tsc) 24.358 ns -  34 cycles(tsc) 8.693 ns
> 250 - 102 cycles(tsc) 25.597 ns -  40 cycles(tsc) 10.144 ns
> 
> Run03: slab_nomerge
> 1 - 63 cycles(tsc) 15.897 ns -  46 cycles(tsc) 11.685 ns
> 2 - 56 cycles(tsc) 14.178 ns -  28 cycles(tsc) 7.132 ns
> 3 - 54 cycles(tsc) 13.590 ns -  23 cycles(tsc) 5.774 ns
> 4 - 53 cycles(tsc) 13.314 ns -  20 cycles(tsc) 5.092 ns
> 8 - 51 cycles(tsc) 12.872 ns -  19 cycles(tsc) 4.886 ns
> 16 - 50 cycles(tsc) 12.603 ns -  19 cycles(tsc) 4.840 ns
> 30 - 50 cycles(tsc) 12.750 ns -  19 cycles(tsc) 4.966 ns
> 32 - 51 cycles(tsc) 12.910 ns -  19 cycles(tsc) 4.977 ns
> 34 - 93 cycles(tsc) 23.372 ns -  27 cycles(tsc) 6.929 ns
> 48 - 80 cycles(tsc) 20.205 ns -  25 cycles(tsc) 6.276 ns
> 64 - 73 cycles(tsc) 18.292 ns -  23 cycles(tsc) 5.929 ns
> 128 - 90 cycles(tsc) 22.516 ns -  29 cycles(tsc) 7.425 ns
> 158 - 99 cycles(tsc) 24.825 ns -  34 cycles(tsc) 8.668 ns
> 250 - 102 cycles(tsc) 25.652 ns -  40 cycles(tsc) 10.129 ns

Really looks stable!

Thanks.


* Re: [PATCH 3/3] slub: build detached freelist with look-ahead
  2015-07-23  6:34             ` Joonsoo Kim
@ 2015-07-23 11:09               ` Jesper Dangaard Brouer
  2015-07-23 14:14                 ` Christoph Lameter
  0 siblings, 1 reply; 14+ messages in thread
From: Jesper Dangaard Brouer @ 2015-07-23 11:09 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Christoph Lameter, linux-mm, Andrew Morton, Alexander Duyck,
	Hannes Frederic Sowa, brouer


On Thu, 23 Jul 2015 15:34:24 +0900 Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:

> On Wed, Jul 22, 2015 at 01:28:19AM +0200, Jesper Dangaard Brouer wrote:
> > On Tue, 21 Jul 2015 08:50:36 -0500 (CDT)
> > Christoph Lameter <cl@linux.com> wrote:
> > 
> > > On Mon, 20 Jul 2015, Jesper Dangaard Brouer wrote:
> > > 
> > > > Yes, I think it is merged... how do I turn off merging?
> > > 
> > > linux/Documentation/kernel-parameters.txt
> > > 
> > >         slab_nomerge    [MM]
> > >                         Disable merging of slabs with similar size. May be
> > >                         necessary if there is some reason to distinguish
> > >                         allocs to different slabs. Debug options disable
> > >                         merging on their own.
> > >                         For more information see Documentation/vm/slub.txt.
> > 
> > I was hoping I could control this per kmem_cache at runtime.  Any chance
> > this could be made possible?
> 
> It's not possible to set/reset slab merging at runtime. Once merging
> has happened, a single slab page can hold objects from different
> kmem_caches, so it cannot be separated cleanly afterwards. The best
> approach currently available is to prevent merging when the kmem_cache
> is created, by introducing a new slab flag such as SLAB_NO_MERGE.

Yes, the best option would be a new flag (e.g. SLAB_NO_MERGE) when
creating the kmem_cache.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH 3/3] slub: build detached freelist with look-ahead
  2015-07-23 11:09               ` Jesper Dangaard Brouer
@ 2015-07-23 14:14                 ` Christoph Lameter
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2015-07-23 14:14 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Joonsoo Kim, linux-mm, Andrew Morton, Alexander Duyck,
	Hannes Frederic Sowa

On Thu, 23 Jul 2015, Jesper Dangaard Brouer wrote:

> > > I was hoping I could control this per kmem_cache at runtime.  Any chance
> > > this could be made possible?
> >
> > It's not possible to set/reset slab merging at runtime. Once merging
> > has happened, a single slab page can hold objects from different
> > kmem_caches, so it cannot be separated cleanly afterwards. The best
> > approach currently available is to prevent merging when the kmem_cache
> > is created, by introducing a new slab flag such as SLAB_NO_MERGE.
>
> Yes, the best option would be a new flag (e.g. SLAB_NO_MERGE) when
> creating the kmem_cache.

If this is only needed to stabilize artificial test loads, then the
current kernel parameter is fine afaict. Such a flag has been proposed
numerous times, but we never did anything about those requests.
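
For completeness: whether a given cache has actually been merged can be
checked from userspace, since SLUB exposes merged caches as symlinks in
/sys/kernel/slab pointing at a shared alias (a sketch; cache and alias
names differ per system):

    # Merged caches appear as symlinks to a shared ":t-..." style alias;
    # unmerged caches appear as plain directories.
    ls -l /sys/kernel/slab | grep -- '->'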


end of thread, other threads:[~2015-07-23 14:14 UTC | newest]

Thread overview: 14+ messages
2015-07-15 16:01 [PATCH 0/3] slub: introducing detached freelist Jesper Dangaard Brouer
2015-07-15 16:01 ` [PATCH 1/3] slub: extend slowpath __slab_free() to handle bulk free Jesper Dangaard Brouer
2015-07-15 16:54   ` Christoph Lameter
2015-07-15 16:02 ` [PATCH 2/3] slub: optimize bulk slowpath free by detached freelist Jesper Dangaard Brouer
2015-07-15 16:56   ` Christoph Lameter
2015-07-15 16:02 ` [PATCH 3/3] slub: build detached freelist with look-ahead Jesper Dangaard Brouer
2015-07-16  9:57   ` Jesper Dangaard Brouer
2015-07-20  2:54     ` Joonsoo Kim
2015-07-20 21:28       ` Jesper Dangaard Brouer
2015-07-21 13:50         ` Christoph Lameter
2015-07-21 23:28           ` Jesper Dangaard Brouer
2015-07-23  6:34             ` Joonsoo Kim
2015-07-23 11:09               ` Jesper Dangaard Brouer
2015-07-23 14:14                 ` Christoph Lameter
