linux-mm.kvack.org archive mirror
* [PATCH 0/7] Further optimizing SLAB/SLUB bulking
@ 2015-09-28 12:26 Jesper Dangaard Brouer
  2015-09-28 12:26 ` [PATCH 1/7] slub: create new ___slab_alloc function that can be called with irqs disabled Jesper Dangaard Brouer
                   ` (7 more replies)
  0 siblings, 8 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-28 12:26 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: netdev, Jesper Dangaard Brouer, Alexander Duyck, Pekka Enberg,
	David Rientjes, Christoph Lameter, Joonsoo Kim

The most important part of this patchset is the introduction of what I
call a detached freelist, which improves SLUB performance of object
freeing in the "slowpath" of kmem_cache_free_bulk.

Previous patchset V2 thread:
  http://thread.gmane.org/gmane.linux.kernel.mm/137469

I'm not using a V3 tag as the patch titles have changed and I've merged
some patches. This was joint work with Alexander Duyck while he was still
at Red Hat.

Notes for patches:
 * First two patches (from Christoph) are already in AKPM MMOTS.
 * Patch 3 is trivial
 * Patch 4 is a repost; it implements bulking for SLAB.
  - http://thread.gmane.org/gmane.linux.kernel.mm/138220
 * Patches 5 and 6 are the important ones
  - Patch 5 handles "freelists" in slab_free() and __slab_free().
  - Patch 6 introduces detached freelists, giving a significant performance improvement

Patches should be ready for the MM-tree, as I'm now handling kmem
debug support.


Based on top of commit 519f526d39 in net-next, but I've tested that it
also applies on top of mmotm-2015-09-18-16-08.

The benchmarking tools are available here:
 https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm
 See: slab_bulk_test0{1,2,3}.c
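
For reference, a minimal usage sketch of the bulk API pair exercised by
these tests (the cache pointer and array size below are made-up
examples, and error handling is kept to the bare minimum):

	void *objs[16];

	/* Both calls must be made with interrupts enabled */
	if (!kmem_cache_alloc_bulk(my_cache, GFP_KERNEL, ARRAY_SIZE(objs), objs))
		return -ENOMEM;

	/* ... use the objects ... */

	/* Objects from different slab pages may be freed in one call */
	kmem_cache_free_bulk(my_cache, ARRAY_SIZE(objs), objs);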


This patchset is part of my network stack use-case.  I'll post the
network side of the patchset as soon as I've cleaned it up, rebased it
on net-next and re-run all the benchmarks.

---

Christoph Lameter (2):
      slub: create new ___slab_alloc function that can be called with irqs disabled
      slub: Avoid irqoff/on in bulk allocation

Jesper Dangaard Brouer (5):
      slub: mark the dangling ifdef #else of CONFIG_SLUB_DEBUG
      slab: implement bulking for SLAB allocator
      slub: support for bulk free with SLUB freelists
      slub: optimize bulk slowpath free by detached freelist
      slub: do prefetching in kmem_cache_alloc_bulk()


 mm/slab.c |   87 ++++++++++++++-----
 mm/slub.c |  276 +++++++++++++++++++++++++++++++++++++++++++++----------------
 2 files changed, 267 insertions(+), 96 deletions(-)

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 1/7] slub: create new ___slab_alloc function that can be called with irqs disabled
  2015-09-28 12:26 [PATCH 0/7] Further optimizing SLAB/SLUB bulking Jesper Dangaard Brouer
@ 2015-09-28 12:26 ` Jesper Dangaard Brouer
  2015-09-28 12:26 ` [PATCH 2/7] slub: Avoid irqoff/on in bulk allocation Jesper Dangaard Brouer
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-28 12:26 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: netdev, Jesper Dangaard Brouer, Alexander Duyck, Pekka Enberg,
	David Rientjes, Christoph Lameter, Joonsoo Kim

From: Christoph Lameter <cl@linux.com>

NOTICE: Accepted by AKPM
 http://ozlabs.org/~akpm/mmots/broken-out/slub-create-new-___slab_alloc-function-that-can-be-called-with-irqs-disabled.patch

Bulk alloc needs a function like that because it enables interrupts before
calling __slab_alloc which promptly disables them again using the expensive
local_irq_save().

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 mm/slub.c |   44 +++++++++++++++++++++++++++++---------------
 1 file changed, 29 insertions(+), 15 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index f614b5dc396b..02cfb3a5983e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2298,23 +2298,15 @@ static inline void *get_freelist(struct kmem_cache *s, struct page *page)
  * And if we were unable to get a new slab from the partial slab lists then
  * we need to allocate a new slab. This is the slowest path since it involves
  * a call to the page allocator and the setup of a new slab.
+ *
+ * Version of __slab_alloc to use when we know that interrupts are
+ * already disabled (which is the case for bulk allocation).
  */
-static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
+static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 			  unsigned long addr, struct kmem_cache_cpu *c)
 {
 	void *freelist;
 	struct page *page;
-	unsigned long flags;
-
-	local_irq_save(flags);
-#ifdef CONFIG_PREEMPT
-	/*
-	 * We may have been preempted and rescheduled on a different
-	 * cpu before disabling interrupts. Need to reload cpu area
-	 * pointer.
-	 */
-	c = this_cpu_ptr(s->cpu_slab);
-#endif
 
 	page = c->page;
 	if (!page)
@@ -2372,7 +2364,6 @@ load_freelist:
 	VM_BUG_ON(!c->page->frozen);
 	c->freelist = get_freepointer(s, freelist);
 	c->tid = next_tid(c->tid);
-	local_irq_restore(flags);
 	return freelist;
 
 new_slab:
@@ -2389,7 +2380,6 @@ new_slab:
 
 	if (unlikely(!freelist)) {
 		slab_out_of_memory(s, gfpflags, node);
-		local_irq_restore(flags);
 		return NULL;
 	}
 
@@ -2405,11 +2395,35 @@ new_slab:
 	deactivate_slab(s, page, get_freepointer(s, freelist));
 	c->page = NULL;
 	c->freelist = NULL;
-	local_irq_restore(flags);
 	return freelist;
 }
 
 /*
+ * Another one that disables interrupts and compensates for possible
+ * cpu changes by refetching the per cpu area pointer.
+ */
+static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
+			  unsigned long addr, struct kmem_cache_cpu *c)
+{
+	void *p;
+	unsigned long flags;
+
+	local_irq_save(flags);
+#ifdef CONFIG_PREEMPT
+	/*
+	 * We may have been preempted and rescheduled on a different
+	 * cpu before disabling interrupts. Need to reload cpu area
+	 * pointer.
+	 */
+	c = this_cpu_ptr(s->cpu_slab);
+#endif
+
+	p = ___slab_alloc(s, gfpflags, node, addr, c);
+	local_irq_restore(flags);
+	return p;
+}
+
+/*
  * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
  * have the fastpath folded into their functions. So no function call
  * overhead for requests that can be satisfied on the fastpath.


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 2/7] slub: Avoid irqoff/on in bulk allocation
  2015-09-28 12:26 [PATCH 0/7] Further optimizing SLAB/SLUB bulking Jesper Dangaard Brouer
  2015-09-28 12:26 ` [PATCH 1/7] slub: create new ___slab_alloc function that can be called with irqs disabled Jesper Dangaard Brouer
@ 2015-09-28 12:26 ` Jesper Dangaard Brouer
  2015-09-28 12:26 ` [PATCH 3/7] slub: mark the dangling ifdef #else of CONFIG_SLUB_DEBUG Jesper Dangaard Brouer
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-28 12:26 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: netdev, Jesper Dangaard Brouer, Alexander Duyck, Pekka Enberg,
	David Rientjes, Christoph Lameter, Joonsoo Kim

From: Christoph Lameter <cl@linux.com>

NOTICE: Accepted by AKPM
 http://ozlabs.org/~akpm/mmots/broken-out/slub-avoid-irqoff-on-in-bulk-allocation.patch

Use the new function that can do allocation while
interrupts are disabled.  Avoids irq on/off sequences.
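
With this patch the bulk allocation loop ends up looking roughly as
follows (an editor's condensed paraphrase of the resulting code, not a
verbatim excerpt):

	local_irq_disable();
	for (i = 0; i < size; i++) {
		void *object = c->freelist;

		if (unlikely(!object)) {
			/* Slowpath: IRQs stay disabled across the call */
			p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE,
					     _RET_IP_, c);
			if (unlikely(!p[i]))
				goto error; /* frees p[0..i-1], re-enables IRQs */

			c = this_cpu_ptr(s->cpu_slab);
			continue;
		}
		c->freelist = get_freepointer(s, object);
		p[i] = object;
	}
	c->tid = next_tid(c->tid);
	local_irq_enable();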

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 mm/slub.c |   24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 02cfb3a5983e..024eed32da2c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2821,30 +2821,23 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 		void *object = c->freelist;
 
 		if (unlikely(!object)) {
-			local_irq_enable();
 			/*
 			 * Invoking slow path likely have side-effect
 			 * of re-populating per CPU c->freelist
 			 */
-			p[i] = __slab_alloc(s, flags, NUMA_NO_NODE,
+			p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE,
 					    _RET_IP_, c);
-			if (unlikely(!p[i])) {
-				__kmem_cache_free_bulk(s, i, p);
-				return false;
-			}
-			local_irq_disable();
+			if (unlikely(!p[i]))
+				goto error;
+
 			c = this_cpu_ptr(s->cpu_slab);
 			continue; /* goto for-loop */
 		}
 
 		/* kmem_cache debug support */
 		s = slab_pre_alloc_hook(s, flags);
-		if (unlikely(!s)) {
-			__kmem_cache_free_bulk(s, i, p);
-			c->tid = next_tid(c->tid);
-			local_irq_enable();
-			return false;
-		}
+		if (unlikely(!s))
+			goto error;
 
 		c->freelist = get_freepointer(s, object);
 		p[i] = object;
@@ -2864,6 +2857,11 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 	}
 
 	return true;
+
+error:
+	__kmem_cache_free_bulk(s, i, p);
+	local_irq_enable();
+	return false;
 }
 EXPORT_SYMBOL(kmem_cache_alloc_bulk);
 


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 3/7] slub: mark the dangling ifdef #else of CONFIG_SLUB_DEBUG
  2015-09-28 12:26 [PATCH 0/7] Further optimizing SLAB/SLUB bulking Jesper Dangaard Brouer
  2015-09-28 12:26 ` [PATCH 1/7] slub: create new ___slab_alloc function that can be called with irqs disabled Jesper Dangaard Brouer
  2015-09-28 12:26 ` [PATCH 2/7] slub: Avoid irqoff/on in bulk allocation Jesper Dangaard Brouer
@ 2015-09-28 12:26 ` Jesper Dangaard Brouer
  2015-09-28 13:49   ` Christoph Lameter
  2015-09-28 12:26 ` [PATCH 4/7] slab: implement bulking for SLAB allocator Jesper Dangaard Brouer
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-28 12:26 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: netdev, Jesper Dangaard Brouer, Alexander Duyck, Pekka Enberg,
	David Rientjes, Christoph Lameter, Joonsoo Kim

The #ifdef of CONFIG_SLUB_DEBUG is located very far from
the associated #else.  For readability mark it with a comment.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 mm/slub.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index 024eed32da2c..1cf98d89546d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1202,7 +1202,7 @@ unsigned long kmem_cache_flags(unsigned long object_size,
 
 	return flags;
 }
-#else
+#else /* !CONFIG_SLUB_DEBUG */
 static inline void setup_object_debug(struct kmem_cache *s,
 			struct page *page, void *object) {}
 


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 4/7] slab: implement bulking for SLAB allocator
  2015-09-28 12:26 [PATCH 0/7] Further optimizing SLAB/SLUB bulking Jesper Dangaard Brouer
                   ` (2 preceding siblings ...)
  2015-09-28 12:26 ` [PATCH 3/7] slub: mark the dangling ifdef #else of CONFIG_SLUB_DEBUG Jesper Dangaard Brouer
@ 2015-09-28 12:26 ` Jesper Dangaard Brouer
  2015-09-28 15:11   ` Christoph Lameter
  2015-09-28 12:26 ` [PATCH 5/7] slub: support for bulk free with SLUB freelists Jesper Dangaard Brouer
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-28 12:26 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: netdev, Jesper Dangaard Brouer, Alexander Duyck, Pekka Enberg,
	David Rientjes, Christoph Lameter, Joonsoo Kim

Implement a basic approach to bulking in the SLAB allocator. Simply
use local_irq_{disable,enable} and call the single-object alloc/free
functions in a loop. This simple implementation approach is
surprisingly fast.

Notice the normal SLAB fastpath is: 96 cycles (24.119 ns). The table
below shows that single-object bulking only takes 42 cycles.  This can
be explained by the bulk API's requirement to be called from a known
interrupt context, that is, with interrupts enabled.  This allows us
to avoid the expensive (37 cycles) local_irq_{save,restore}, and
instead use the much faster (7 cycles) local_irq_{disable,enable}.
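
The difference boils down to the following two patterns (an editor's
sketch; obj[], n and the per-call costs quoted are illustrative of the
numbers above, not separately measured):

	unsigned long flags;
	size_t i;

	/* Fallback bulking: pays local_irq_save/restore per object */
	for (i = 0; i < n; i++) {
		local_irq_save(flags);		/* ~37 cycles per round trip */
		obj[i] = __do_cache_alloc(cachep, gfpflags);
		local_irq_restore(flags);
	}

	/* Bulk API: the caller guarantees IRQs are enabled, so one
	 * cheap disable/enable pair brackets the whole loop */
	local_irq_disable();			/* ~7 cycles for the pair */
	for (i = 0; i < n; i++)
		obj[i] = __do_cache_alloc(cachep, gfpflags);
	local_irq_enable();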

Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz:

bulk - Current                  - simple SLAB bulk implementation
  1 - 115 cycles(tsc) 28.812 ns - 42 cycles(tsc) 10.715 ns - improved 63.5%
  2 - 103 cycles(tsc) 25.956 ns - 27 cycles(tsc)  6.985 ns - improved 73.8%
  3 - 101 cycles(tsc) 25.336 ns - 22 cycles(tsc)  5.733 ns - improved 78.2%
  4 - 100 cycles(tsc) 25.147 ns - 21 cycles(tsc)  5.319 ns - improved 79.0%
  8 -  98 cycles(tsc) 24.616 ns - 18 cycles(tsc)  4.620 ns - improved 81.6%
 16 -  97 cycles(tsc) 24.408 ns - 17 cycles(tsc)  4.344 ns - improved 82.5%
 30 -  98 cycles(tsc) 24.641 ns - 16 cycles(tsc)  4.202 ns - improved 83.7%
 32 -  98 cycles(tsc) 24.607 ns - 16 cycles(tsc)  4.199 ns - improved 83.7%
 34 -  98 cycles(tsc) 24.605 ns - 18 cycles(tsc)  4.579 ns - improved 81.6%
 48 -  97 cycles(tsc) 24.463 ns - 17 cycles(tsc)  4.405 ns - improved 82.5%
 64 -  97 cycles(tsc) 24.370 ns - 17 cycles(tsc)  4.384 ns - improved 82.5%
128 -  99 cycles(tsc) 24.763 ns - 19 cycles(tsc)  4.755 ns - improved 80.8%
158 -  98 cycles(tsc) 24.708 ns - 18 cycles(tsc)  4.723 ns - improved 81.6%
250 - 101 cycles(tsc) 25.342 ns - 20 cycles(tsc)  5.035 ns - improved 80.2%

Also notice how well bulking maintains the performance when the bulk
size increases (which is a sore spot for the SLUB allocator).

Increasing the bulk size further:
 20 cycles(tsc)  5.214 ns (bulk: 512)
 30 cycles(tsc)  7.734 ns (bulk: 768)
 40 cycles(tsc) 10.244 ns (bulk:1024)
 72 cycles(tsc) 18.049 ns (bulk:2048)
 90 cycles(tsc) 22.585 ns (bulk:4096)

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 mm/slab.c |   87 +++++++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 62 insertions(+), 25 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index c77ebe6cc87c..21da6b1ccae3 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3234,11 +3234,15 @@ __do_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 #endif /* CONFIG_NUMA */
 
 static __always_inline void *
-slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
+slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller,
+	   bool irq_off_needed)
 {
 	unsigned long save_flags;
 	void *objp;
 
+	/* Compiler needs to remove irq_off_needed branch statements */
+	BUILD_BUG_ON(!__builtin_constant_p(irq_off_needed));
+
 	flags &= gfp_allowed_mask;
 
 	lockdep_trace_alloc(flags);
@@ -3249,9 +3253,11 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
 	cachep = memcg_kmem_get_cache(cachep, flags);
 
 	cache_alloc_debugcheck_before(cachep, flags);
-	local_irq_save(save_flags);
+	if (irq_off_needed)
+		local_irq_save(save_flags);
 	objp = __do_cache_alloc(cachep, flags);
-	local_irq_restore(save_flags);
+	if (irq_off_needed)
+		local_irq_restore(save_flags);
 	objp = cache_alloc_debugcheck_after(cachep, flags, objp, caller);
 	kmemleak_alloc_recursive(objp, cachep->object_size, 1, cachep->flags,
 				 flags);
@@ -3407,7 +3413,7 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp,
  */
 void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 {
-	void *ret = slab_alloc(cachep, flags, _RET_IP_);
+	void *ret = slab_alloc(cachep, flags, _RET_IP_, true);
 
 	trace_kmem_cache_alloc(_RET_IP_, ret,
 			       cachep->object_size, cachep->size, flags);
@@ -3416,16 +3422,24 @@ void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 }
 EXPORT_SYMBOL(kmem_cache_alloc);
 
-void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
-{
-	__kmem_cache_free_bulk(s, size, p);
-}
-EXPORT_SYMBOL(kmem_cache_free_bulk);
-
+/* Note that interrupts must be enabled when calling this function. */
 bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
-								void **p)
+			   void **p)
 {
-	return __kmem_cache_alloc_bulk(s, flags, size, p);
+	size_t i;
+
+	local_irq_disable();
+	for (i = 0; i < size; i++) {
+		void *x = p[i] = slab_alloc(s, flags, _RET_IP_, false);
+
+		if (!x) {
+			local_irq_enable();
+			__kmem_cache_free_bulk(s, i, p);
+			return false;
+		}
+	}
+	local_irq_enable();
+	return true;
 }
 EXPORT_SYMBOL(kmem_cache_alloc_bulk);
 
@@ -3435,7 +3448,7 @@ kmem_cache_alloc_trace(struct kmem_cache *cachep, gfp_t flags, size_t size)
 {
 	void *ret;
 
-	ret = slab_alloc(cachep, flags, _RET_IP_);
+	ret = slab_alloc(cachep, flags, _RET_IP_, true);
 
 	trace_kmalloc(_RET_IP_, ret,
 		      size, cachep->size, flags);
@@ -3526,7 +3539,7 @@ static __always_inline void *__do_kmalloc(size_t size, gfp_t flags,
 	cachep = kmalloc_slab(size, flags);
 	if (unlikely(ZERO_OR_NULL_PTR(cachep)))
 		return cachep;
-	ret = slab_alloc(cachep, flags, caller);
+	ret = slab_alloc(cachep, flags, caller, true);
 
 	trace_kmalloc(caller, ret,
 		      size, cachep->size, flags);
@@ -3546,32 +3559,56 @@ void *__kmalloc_track_caller(size_t size, gfp_t flags, unsigned long caller)
 }
 EXPORT_SYMBOL(__kmalloc_track_caller);
 
-/**
- * kmem_cache_free - Deallocate an object
- * @cachep: The cache the allocation was from.
- * @objp: The previously allocated object.
- *
- * Free an object which was previously allocated from this
- * cache.
- */
-void kmem_cache_free(struct kmem_cache *cachep, void *objp)
+/* Caller must disable local IRQs when irq_off_needed is false */
+static __always_inline void __kmem_cache_free(struct kmem_cache *cachep,
+					      void *objp, bool irq_off_needed)
 {
 	unsigned long flags;
+
+	/* Compiler needs to remove irq_off_needed branch statements */
+	BUILD_BUG_ON(!__builtin_constant_p(irq_off_needed));
+
 	cachep = cache_from_obj(cachep, objp);
 	if (!cachep)
 		return;
 
-	local_irq_save(flags);
+	if (irq_off_needed)
+		local_irq_save(flags);
 	debug_check_no_locks_freed(objp, cachep->object_size);
 	if (!(cachep->flags & SLAB_DEBUG_OBJECTS))
 		debug_check_no_obj_freed(objp, cachep->object_size);
 	__cache_free(cachep, objp, _RET_IP_);
-	local_irq_restore(flags);
+	if (irq_off_needed)
+		local_irq_restore(flags);
+}
 
+/**
+ * kmem_cache_free - Deallocate an object
+ * @cachep: The cache the allocation was from.
+ * @objp: The previously allocated object.
+ *
+ * Free an object which was previously allocated from this
+ * cache.
+ */
+void kmem_cache_free(struct kmem_cache *cachep, void *objp)
+{
+	__kmem_cache_free(cachep, objp, true);
 	trace_kmem_cache_free(_RET_IP_, objp);
 }
 EXPORT_SYMBOL(kmem_cache_free);
 
+/* Note that interrupts must be enabled when calling this function. */
+void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
+{
+	size_t i;
+
+	local_irq_disable();
+	for (i = 0; i < size; i++)
+		__kmem_cache_free(s, p[i], false);
+	local_irq_enable();
+}
+EXPORT_SYMBOL(kmem_cache_free_bulk);
+
 /**
  * kfree - free previously allocated memory
  * @objp: pointer returned by kmalloc.


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 5/7] slub: support for bulk free with SLUB freelists
  2015-09-28 12:26 [PATCH 0/7] Further optimizing SLAB/SLUB bulking Jesper Dangaard Brouer
                   ` (3 preceding siblings ...)
  2015-09-28 12:26 ` [PATCH 4/7] slab: implement bulking for SLAB allocator Jesper Dangaard Brouer
@ 2015-09-28 12:26 ` Jesper Dangaard Brouer
  2015-09-28 15:16   ` Christoph Lameter
  2015-09-28 12:26 ` [PATCH 6/7] slub: optimize bulk slowpath free by detached freelist Jesper Dangaard Brouer
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-28 12:26 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: netdev, Jesper Dangaard Brouer, Alexander Duyck, Pekka Enberg,
	David Rientjes, Christoph Lameter, Joonsoo Kim

Make it possible to free a freelist with several objects by extending
__slab_free() and slab_free() with two arguments: a freelist_head
pointer and objects counter (cnt).  If freelist_head pointer is set,
then the object is the freelist tail pointer.

This allows a freelist with several objects (all within the same
slub-page) to be free'ed using a single locked cmpxchg_double in
__slab_free() and with an unlocked cmpxchg_double in slab_free().
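
As a concrete (editor-invented) example, freeing three objects o1, o2
and o3 that all sit on the same slab page could be expressed with the
extended interface as:

	/* Build the freelist in the objects themselves: o1 is the
	 * head, o3 the tail.  The tail's freepointer gets linked to
	 * the existing freelist inside slab_free()/__slab_free(). */
	set_freepointer(s, o1, o2);
	set_freepointer(s, o2, o3);

	/* A single (unlocked fastpath or locked slowpath)
	 * cmpxchg_double splices all three objects at once. */
	slab_free(s, virt_to_head_page(o1), o3 /* tail */, _RET_IP_,
		  o1 /* freelist_head */, 3 /* cnt */);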

Object debugging on the free path is also extended to handle these
freelists.  When CONFIG_SLUB_DEBUG is enabled it will also detect if
objects don't belong to the same slub-page.

These changes are needed for the next patch to bulk free the detached
freelists it introduces and constructs.

Micro benchmarking showed no performance reduction due to this change
when debugging is not enabled at runtime (kernel compiled with
CONFIG_SLUB_DEBUG).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
 mm/slub.c |   97 +++++++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 84 insertions(+), 13 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 1cf98d89546d..13b5f53e4840 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -675,11 +675,18 @@ static void init_object(struct kmem_cache *s, void *object, u8 val)
 {
 	u8 *p = object;
 
+	/* Freepointer not overwritten as SLAB_POISON moved it after object */
 	if (s->flags & __OBJECT_POISON) {
 		memset(p, POISON_FREE, s->object_size - 1);
 		p[s->object_size - 1] = POISON_END;
 	}
 
+	/*
+	 * If both SLAB_RED_ZONE and SLAB_POISON are enabled, the
+	 * freepointer is still safe: s->offset then equals s->inuse,
+	 * and the redzone written below is only the area between
+	 * s->object_size and s->inuse.
+	 */
 	if (s->flags & SLAB_RED_ZONE)
 		memset(p + s->object_size, val, s->inuse - s->object_size);
 }
@@ -1063,18 +1070,32 @@ bad:
 	return 0;
 }
 
+/* Supports checking bulk free of a constructed freelist */
 static noinline struct kmem_cache_node *free_debug_processing(
-	struct kmem_cache *s, struct page *page, void *object,
+	struct kmem_cache *s, struct page *page,
+	void *obj_tail, void *freelist_head, int bulk_cnt,
 	unsigned long addr, unsigned long *flags)
 {
 	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
+	void *object;
+	int cnt = 0;
 
 	spin_lock_irqsave(&n->list_lock, *flags);
 	slab_lock(page);
 
+	/*
+	 * Bulk free of a constructed freelist is indicated by the
+	 * freelist_head pointer being set; otherwise obj_tail is the
+	 * single object being freed.
+	 */
+	object = freelist_head ? : obj_tail;
+
 	if (!check_slab(s, page))
 		goto fail;
 
+next_object:
+	cnt++;
+
 	if (!check_valid_pointer(s, page, object)) {
 		slab_err(s, page, "Invalid object pointer 0x%p", object);
 		goto fail;
@@ -1105,8 +1126,19 @@ static noinline struct kmem_cache_node *free_debug_processing(
 	if (s->flags & SLAB_STORE_USER)
 		set_track(s, object, TRACK_FREE, addr);
 	trace(s, page, object, 0);
+	/* Freepointer not overwritten by init_object(), SLAB_POISON moved it */
 	init_object(s, object, SLUB_RED_INACTIVE);
+
+	/* Reached end of constructed freelist yet? */
+	if (object != obj_tail) {
+		object = get_freepointer(s, object);
+		goto next_object;
+	}
 out:
+	if (cnt != bulk_cnt)
+		slab_err(s, page, "Bulk freelist count(%d) invalid(%d)\n",
+			 bulk_cnt, cnt);
+
 	slab_unlock(page);
 	/*
 	 * Keep node_lock to preserve integrity
@@ -1210,7 +1242,8 @@ static inline int alloc_debug_processing(struct kmem_cache *s,
 	struct page *page, void *object, unsigned long addr) { return 0; }
 
 static inline struct kmem_cache_node *free_debug_processing(
-	struct kmem_cache *s, struct page *page, void *object,
+	struct kmem_cache *s, struct page *page,
+	void *obj_tail, void *freelist_head, int bulk_cnt,
 	unsigned long addr, unsigned long *flags) { return NULL; }
 
 static inline int slab_pad_check(struct kmem_cache *s, struct page *page)
@@ -1306,6 +1339,35 @@ static inline void slab_free_hook(struct kmem_cache *s, void *x)
 	kasan_slab_free(s, x);
 }
 
+/* Compiler cannot detect that slab_free_freelist_hook() can be
+ * removed if slab_free_hook() evaluates to nothing.  Thus, we need to
+ * catch all relevant config debug options here.
+ */
+#if defined(CONFIG_KMEMCHECK) ||		\
+	defined(CONFIG_LOCKDEP)	||		\
+	defined(CONFIG_DEBUG_KMEMLEAK) ||	\
+	defined(CONFIG_DEBUG_OBJECTS_FREE) ||	\
+	defined(CONFIG_KASAN)
+static inline void slab_free_freelist_hook(struct kmem_cache *s, void *obj_tail,
+					   void *freelist_head)
+{
+	/*
+	 * Bulk free of a constructed freelist is indicated by the
+	 * freelist_head pointer being set; otherwise obj_tail is the
+	 * single object being freed.
+	 */
+	void *object = freelist_head ? : obj_tail;
+
+	do {
+		slab_free_hook(s, object);
+	} while ((object != obj_tail) &&
+		 (object = get_freepointer(s, object)));
+}
+#else
+static inline void slab_free_freelist_hook(struct kmem_cache *s, void *obj_tail,
+					   void *freelist_head) {}
+#endif
+
 static void setup_object(struct kmem_cache *s, struct page *page,
 				void *object)
 {
@@ -2584,9 +2646,14 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace);
  * So we still attempt to reduce cache line usage. Just take the slab
  * lock and free the item. If there is no additional partial page
  * handling required then we can return immediately.
+ *
+ * Bulk free of a freelist with several objects (all belonging to the
+ * same page) is possible by specifying the freelist_head pointer and
+ * the object as the tail pointer, plus the object count (cnt).
  */
 static void __slab_free(struct kmem_cache *s, struct page *page,
-			void *x, unsigned long addr)
+			void *x, unsigned long addr,
+			void *freelist_head, int cnt)
 {
 	void *prior;
 	void **object = (void *)x;
@@ -2595,11 +2662,13 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 	unsigned long counters;
 	struct kmem_cache_node *n = NULL;
 	unsigned long uninitialized_var(flags);
+	void *new_freelist = freelist_head ? : x;
 
 	stat(s, FREE_SLOWPATH);
 
 	if (kmem_cache_debug(s) &&
-		!(n = free_debug_processing(s, page, x, addr, &flags)))
+	    !(n = free_debug_processing(s, page, x, freelist_head, cnt,
+					addr, &flags)))
 		return;
 
 	do {
@@ -2612,7 +2681,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 		set_freepointer(s, object, prior);
 		new.counters = counters;
 		was_frozen = new.frozen;
-		new.inuse--;
+		new.inuse -= cnt;
 		if ((!new.inuse || !prior) && !was_frozen) {
 
 			if (kmem_cache_has_cpu_partial(s) && !prior) {
@@ -2643,7 +2712,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 
 	} while (!cmpxchg_double_slab(s, page,
 		prior, counters,
-		object, new.counters,
+		new_freelist, new.counters,
 		"__slab_free"));
 
 	if (likely(!n)) {
@@ -2710,13 +2779,15 @@ slab_empty:
  * with all sorts of special processing.
  */
 static __always_inline void slab_free(struct kmem_cache *s,
-			struct page *page, void *x, unsigned long addr)
+			struct page *page, void *x, unsigned long addr,
+			void *freelist_head, int cnt)
 {
 	void **object = (void *)x;
+	void *new_freelist = freelist_head ? : x;
 	struct kmem_cache_cpu *c;
 	unsigned long tid;
 
-	slab_free_hook(s, x);
+	slab_free_freelist_hook(s, x, freelist_head);
 
 redo:
 	/*
@@ -2740,14 +2811,14 @@ redo:
 		if (unlikely(!this_cpu_cmpxchg_double(
 				s->cpu_slab->freelist, s->cpu_slab->tid,
 				c->freelist, tid,
-				object, next_tid(tid)))) {
+				new_freelist, next_tid(tid)))) {
 
 			note_cmpxchg_failure("slab_free", s, tid);
 			goto redo;
 		}
 		stat(s, FREE_FASTPATH);
 	} else
-		__slab_free(s, page, x, addr);
+		__slab_free(s, page, x, addr, freelist_head, cnt);
 
 }
 
@@ -2756,7 +2827,7 @@ void kmem_cache_free(struct kmem_cache *s, void *x)
 	s = cache_from_obj(s, x);
 	if (!s)
 		return;
-	slab_free(s, virt_to_head_page(x), x, _RET_IP_);
+	slab_free(s, virt_to_head_page(x), x, _RET_IP_, NULL, 1);
 	trace_kmem_cache_free(_RET_IP_, x);
 }
 EXPORT_SYMBOL(kmem_cache_free);
@@ -2791,7 +2862,7 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 			c->tid = next_tid(c->tid);
 			local_irq_enable();
 			/* Slowpath: overhead locked cmpxchg_double_slab */
-			__slab_free(s, page, object, _RET_IP_);
+			__slab_free(s, page, object, _RET_IP_, NULL, 1);
 			local_irq_disable();
 			c = this_cpu_ptr(s->cpu_slab);
 		}
@@ -3531,7 +3602,7 @@ void kfree(const void *x)
 		__free_kmem_pages(page, compound_order(page));
 		return;
 	}
-	slab_free(page->slab_cache, page, object, _RET_IP_);
+	slab_free(page->slab_cache, page, object, _RET_IP_, NULL, 1);
 }
 EXPORT_SYMBOL(kfree);
 


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 6/7] slub: optimize bulk slowpath free by detached freelist
  2015-09-28 12:26 [PATCH 0/7] Further optimizing SLAB/SLUB bulking Jesper Dangaard Brouer
                   ` (4 preceding siblings ...)
  2015-09-28 12:26 ` [PATCH 5/7] slub: support for bulk free with SLUB freelists Jesper Dangaard Brouer
@ 2015-09-28 12:26 ` Jesper Dangaard Brouer
  2015-09-28 15:22   ` Christoph Lameter
  2015-09-28 12:26 ` [PATCH 7/7] slub: do prefetching in kmem_cache_alloc_bulk() Jesper Dangaard Brouer
  2015-09-29 15:46 ` [MM PATCH V4 0/6] Further optimizing SLAB/SLUB bulking Jesper Dangaard Brouer
  7 siblings, 1 reply; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-28 12:26 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: netdev, Jesper Dangaard Brouer, Alexander Duyck, Pekka Enberg,
	David Rientjes, Christoph Lameter, Joonsoo Kim

This change focuses on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk.

The calls slab_free (fastpath) and __slab_free (slowpath) have been
extended with support for bulk free, which amortizes the overhead of
the (locked) cmpxchg_double.

To use the new bulking feature, we build what I call a detached
freelist.  The detached freelist takes advantage of three properties:

 1) the free function call owns the object that is about to be freed,
    thus writing into this memory is synchronization-free.

 2) many freelists can co-exist side-by-side in the same slab-page,
    each with a separate head pointer.

 3) it is the visibility of the head pointer that needs synchronization.

Given these properties, the brilliant part is that the detached
freelist can be constructed without any need for synchronization: it
is built directly in the page's objects.  The detached freelist
descriptor is allocated on the stack of the kmem_cache_free_bulk
call, so the freelist head pointer is not visible to other CPUs.

All objects in a SLUB freelist must belong to the same slab-page.
Thus, constructing the detached freelist is about matching objects
that belong to the same slab-page.  The bulk free array is scanned in
a progressive manner with a limited look-ahead facility, as
illustrated below.
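
To make the scan concrete, here is an editor's illustration (not part
of the patch; object and page names are invented).  Assume
p[] = { A1, B1, A2, B2 }, where A* sit on slab page A and B* on page B:

	/*
	 * Pass 1 scans from the end of p[]: B2 starts a detached
	 * freelist for page B, A2 is skipped (look-ahead), B1 is
	 * linked in front of B2, A1 is skipped.  The page B group is
	 * then freed with one call:
	 */
	slab_free(s, page_B, B2 /* tail */, _RET_IP_, B1 /* head */, 2);

	/* Pass 2 only sees the skipped objects and builds A1 -> A2: */
	slab_free(s, page_A, A2 /* tail */, _RET_IP_, A1 /* head */, 2);

That is two synchronization primitives for four objects, instead of
one per object.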

Kmem debug support is handled in the call to slab_free().

Notice that kmem_cache_free_bulk no longer needs to disable IRQs. This
only slowed down single-object bulk free by approx 3 cycles.


Performance data:
 Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz

SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns

To get stable and comparable numbers, the kernel has been booted with
"slab_nomerge" (this also improves performance for larger bulk sizes).

Performance data, compared against fallback bulking:

bulk -  fallback bulk            - improvement with this patch
   1 -  62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
   2 -  55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
   3 -  53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
   4 -  52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
   8 -  50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
  16 -  49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
  30 -  49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
  32 -  50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
  34 -  96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
  48 -  83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
  64 -  74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
 128 -  90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
 158 -  99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
 250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%

Performance data, compared against the current in-kernel bulking:

bulk - curr in-kernel  - improvement with this patch
   1 -  46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
   2 -  27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
   3 -  21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
   4 -  18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
   8 -  17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
  16 -  18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1)  5.6%
  30 -  18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0)  0.0%
  32 -  18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0)  0.0%
  34 -  78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
  48 -  60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
  64 -  49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
 128 -  69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
 158 -  79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
 250 -  86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%

Performance with normal SLUB merging is significantly slower for
larger bulk sizes.  The advantage of slab_nomerge is believed to be
(primarily) an effect of not having to share the per-CPU data
structures, as tuning the per-CPU size can achieve similar performance.

bulk - slab_nomerge   -  normal SLUB merge
   1 -  49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
   2 -  30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
   3 -  23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
   4 -  20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
   8 -  18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
  16 -  17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
  30 -  18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
  32 -  18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
  34 -  23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
  48 -  21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
  64 -  20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
 128 -  27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
 158 -  30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
 250 -  37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19

Joint work with Alexander Duyck.

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
 mm/slub.c |  109 ++++++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 79 insertions(+), 30 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 13b5f53e4840..c25717ab3b5a 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2832,44 +2832,93 @@ void kmem_cache_free(struct kmem_cache *s, void *x)
 }
 EXPORT_SYMBOL(kmem_cache_free);
 
-/* Note that interrupts must be enabled when calling this function. */
-void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
-{
-	struct kmem_cache_cpu *c;
+struct detached_freelist {
 	struct page *page;
-	int i;
+	void *tail_object;
+	void *freelist;
+	int cnt;
+};
 
-	local_irq_disable();
-	c = this_cpu_ptr(s->cpu_slab);
+/*
+ * This function progressively scans the array of free objects (with
+ * a limited look-ahead) and extracts objects belonging to the same
+ * page.  It builds a detached freelist directly within the given
+ * page/objects.  This can happen without any need for
+ * synchronization, because the objects are owned by the running
+ * process.  The freelist is built up as a singly linked list in the
+ * objects.  The idea is that this detached freelist can then be bulk
+ * transferred to the real freelist(s), with only a single
+ * synchronization primitive required.  Look-ahead in the array is
+ * limited for performance reasons.
+ */
+static int build_detached_freelist(struct kmem_cache *s, size_t size,
+				   void **p, struct detached_freelist *df)
+{
+	size_t first_skipped_index = 0;
+	int lookahead = 3;
+	void *object;
 
-	for (i = 0; i < size; i++) {
-		void *object = p[i];
+	/* Always re-init detached_freelist */
+	df->page = NULL;
 
-		BUG_ON(!object);
-		/* kmem cache debug support */
-		s = cache_from_obj(s, object);
-		if (unlikely(!s))
-			goto exit;
-		slab_free_hook(s, object);
+	do {
+		object = p[--size];
+	} while (!object && size);
 
-		page = virt_to_head_page(object);
+	if (!object)
+		return 0;
 
-		if (c->page == page) {
-			/* Fastpath: local CPU free */
-			set_freepointer(s, object, c->freelist);
-			c->freelist = object;
-		} else {
-			c->tid = next_tid(c->tid);
-			local_irq_enable();
-			/* Slowpath: overhead locked cmpxchg_double_slab */
-			__slab_free(s, page, object, _RET_IP_, NULL, 1);
-			local_irq_disable();
-			c = this_cpu_ptr(s->cpu_slab);
+	/* Start new detached freelist */
+	set_freepointer(s, object, NULL);
+	df->page = virt_to_head_page(object);
+	df->tail_object = object;
+	df->freelist = object;
+	p[size] = NULL; /* mark object processed */
+	df->cnt = 1;
+
+	while (size) {
+		object = p[--size];
+		if (!object)
+			continue; /* Skip processed objects */
+
+		/* df->page is always set at this point */
+		if (df->page == virt_to_head_page(object)) {
+			/* Opportunistically build freelist */
+			set_freepointer(s, object, df->freelist);
+			df->freelist = object;
+			df->cnt++;
+			p[size] = NULL; /* mark object processed */
+
+			continue;
 		}
+
+		/* Limit look ahead search */
+		if (!--lookahead)
+			break;
+
+		if (!first_skipped_index)
+			first_skipped_index = size + 1;
 	}
-exit:
-	c->tid = next_tid(c->tid);
-	local_irq_enable();
+
+	return first_skipped_index;
+}
+
+
+/* Note that interrupts must be enabled when calling this function. */
+void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
+{
+	BUG_ON(!size);
+
+	do {
+		struct detached_freelist df;
+
+		size = build_detached_freelist(s, size, p, &df);
+		if (unlikely(!df.page))
+			continue;
+
+		slab_free(s, df.page, df.tail_object, _RET_IP_,
+			  df.freelist, df.cnt);
+	} while (likely(size));
 }
 EXPORT_SYMBOL(kmem_cache_free_bulk);
 


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 7/7] slub: do prefetching in kmem_cache_alloc_bulk()
  2015-09-28 12:26 [PATCH 0/7] Further optimizing SLAB/SLUB bulking Jesper Dangaard Brouer
                   ` (5 preceding siblings ...)
  2015-09-28 12:26 ` [PATCH 6/7] slub: optimize bulk slowpath free by detached freelist Jesper Dangaard Brouer
@ 2015-09-28 12:26 ` Jesper Dangaard Brouer
  2015-09-28 14:53   ` Alexander Duyck
  2015-09-29 15:46 ` [MM PATCH V4 0/6] Further optimizing SLAB/SLUB bulking Jesper Dangaard Brouer
  7 siblings, 1 reply; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-28 12:26 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: netdev, Jesper Dangaard Brouer, Alexander Duyck, Pekka Enberg,
	David Rientjes, Christoph Lameter, Joonsoo Kim

For practical use-cases it is beneficial to prefetch the next freelist
object in bulk allocation loop.

Micro benchmarking shows an approx 1 cycle change:

bulk -  prev-patch     -  this patch
   1 -  49 cycles(tsc) - 49 cycles(tsc) - increase in cycles:0
   2 -  30 cycles(tsc) - 31 cycles(tsc) - increase in cycles:1
   3 -  23 cycles(tsc) - 25 cycles(tsc) - increase in cycles:2
   4 -  20 cycles(tsc) - 22 cycles(tsc) - increase in cycles:2
   8 -  18 cycles(tsc) - 19 cycles(tsc) - increase in cycles:1
  16 -  17 cycles(tsc) - 18 cycles(tsc) - increase in cycles:1
  30 -  18 cycles(tsc) - 17 cycles(tsc) - increase in cycles:-1
  32 -  18 cycles(tsc) - 19 cycles(tsc) - increase in cycles:1
  34 -  23 cycles(tsc) - 24 cycles(tsc) - increase in cycles:1
  48 -  21 cycles(tsc) - 22 cycles(tsc) - increase in cycles:1
  64 -  20 cycles(tsc) - 21 cycles(tsc) - increase in cycles:1
 128 -  27 cycles(tsc) - 27 cycles(tsc) - increase in cycles:0
 158 -  30 cycles(tsc) - 30 cycles(tsc) - increase in cycles:0
 250 -  37 cycles(tsc) - 37 cycles(tsc) - increase in cycles:0

Note, benchmark done with slab_nomerge to keep it stable enough
for accurate comparison.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 mm/slub.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index c25717ab3b5a..5af75a618b91 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2951,6 +2951,7 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 				goto error;
 
 			c = this_cpu_ptr(s->cpu_slab);
+			prefetch_freepointer(s, c->freelist);
 			continue; /* goto for-loop */
 		}
 
@@ -2960,6 +2961,7 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 			goto error;
 
 		c->freelist = get_freepointer(s, object);
+		prefetch_freepointer(s, c->freelist);
 		p[i] = object;
 
 		/* kmem_cache debug support */


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 3/7] slub: mark the dangling ifdef #else of CONFIG_SLUB_DEBUG
  2015-09-28 12:26 ` [PATCH 3/7] slub: mark the dangling ifdef #else of CONFIG_SLUB_DEBUG Jesper Dangaard Brouer
@ 2015-09-28 13:49   ` Christoph Lameter
  0 siblings, 0 replies; 50+ messages in thread
From: Christoph Lameter @ 2015-09-28 13:49 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: linux-mm, Andrew Morton, netdev, Alexander Duyck, Pekka Enberg,
	David Rientjes, Joonsoo Kim

On Mon, 28 Sep 2015, Jesper Dangaard Brouer wrote:

> The #ifdef of CONFIG_SLUB_DEBUG is located very far from
> the associated #else.  For readability mark it with a comment.

Acked-by: Christoph Lameter <cl@linux.com>


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/7] slub: do prefetching in kmem_cache_alloc_bulk()
  2015-09-28 12:26 ` [PATCH 7/7] slub: do prefetching in kmem_cache_alloc_bulk() Jesper Dangaard Brouer
@ 2015-09-28 14:53   ` Alexander Duyck
  2015-09-28 15:59     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 50+ messages in thread
From: Alexander Duyck @ 2015-09-28 14:53 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, linux-mm, Andrew Morton
  Cc: netdev, Pekka Enberg, David Rientjes, Christoph Lameter, Joonsoo Kim

On 09/28/2015 05:26 AM, Jesper Dangaard Brouer wrote:
> For practical use-cases it is beneficial to prefetch the next freelist
> object in bulk allocation loop.
>
> Micro benchmarking show approx 1 cycle change:
>
> bulk -  prev-patch     -  this patch
>     1 -  49 cycles(tsc) - 49 cycles(tsc) - increase in cycles:0
>     2 -  30 cycles(tsc) - 31 cycles(tsc) - increase in cycles:1
>     3 -  23 cycles(tsc) - 25 cycles(tsc) - increase in cycles:2
>     4 -  20 cycles(tsc) - 22 cycles(tsc) - increase in cycles:2
>     8 -  18 cycles(tsc) - 19 cycles(tsc) - increase in cycles:1
>    16 -  17 cycles(tsc) - 18 cycles(tsc) - increase in cycles:1
>    30 -  18 cycles(tsc) - 17 cycles(tsc) - increase in cycles:-1
>    32 -  18 cycles(tsc) - 19 cycles(tsc) - increase in cycles:1
>    34 -  23 cycles(tsc) - 24 cycles(tsc) - increase in cycles:1
>    48 -  21 cycles(tsc) - 22 cycles(tsc) - increase in cycles:1
>    64 -  20 cycles(tsc) - 21 cycles(tsc) - increase in cycles:1
>   128 -  27 cycles(tsc) - 27 cycles(tsc) - increase in cycles:0
>   158 -  30 cycles(tsc) - 30 cycles(tsc) - increase in cycles:0
>   250 -  37 cycles(tsc) - 37 cycles(tsc) - increase in cycles:0
>
> Note, benchmark done with slab_nomerge to keep it stable enough
> for accurate comparison.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>   mm/slub.c |    2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index c25717ab3b5a..5af75a618b91 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2951,6 +2951,7 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
>   				goto error;
>   
>   			c = this_cpu_ptr(s->cpu_slab);
> +			prefetch_freepointer(s, c->freelist);
>   			continue; /* goto for-loop */
>   		}
>   
> @@ -2960,6 +2961,7 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
>   			goto error;
>   
>   		c->freelist = get_freepointer(s, object);
> +		prefetch_freepointer(s, c->freelist);
>   		p[i] = object;
>   
>   		/* kmem_cache debug support */
>

I can see the prefetch in the last item case being possibly useful since 
you have time between when you call the prefetch and when you are 
accessing the next object.  However, is there any actual benefit to 
prefetching inside the loop itself?  Based on your data above it doesn't 
seem like that is the case since you are now adding one additional cycle 
to the allocation and I am not seeing any actual gain reported here.

- Alex


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 4/7] slab: implement bulking for SLAB allocator
  2015-09-28 12:26 ` [PATCH 4/7] slab: implement bulking for SLAB allocator Jesper Dangaard Brouer
@ 2015-09-28 15:11   ` Christoph Lameter
  0 siblings, 0 replies; 50+ messages in thread
From: Christoph Lameter @ 2015-09-28 15:11 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: linux-mm, Andrew Morton, netdev, Alexander Duyck, Pekka Enberg,
	David Rientjes, Joonsoo Kim

On Mon, 28 Sep 2015, Jesper Dangaard Brouer wrote:

> +/* Note that interrupts must be enabled when calling this function. */
>  bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> -								void **p)
> +			   void **p)
>  {
> -	return __kmem_cache_alloc_bulk(s, flags, size, p);
> +	size_t i;
> +
> +	local_irq_disable();
> +	for (i = 0; i < size; i++) {
> +		void *x = p[i] = slab_alloc(s, flags, _RET_IP_, false);
> +
> +		if (!x) {
> +			__kmem_cache_free_bulk(s, i, p);
> +			return false;
> +		}
> +	}
> +	local_irq_enable();
> +	return true;
>  }
>  EXPORT_SYMBOL(kmem_cache_alloc_bulk);
>

Ok, the above could result in excessive times when the interrupts are
kept off.  Let's say someone is freeing 1000 objects?

> +/* Note that interrupts must be enabled when calling this function. */
> +void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
> +{
> +	size_t i;
> +
> +	local_irq_disable();
> +	for (i = 0; i < size; i++)
> +		__kmem_cache_free(s, p[i], false);
> +	local_irq_enable();
> +}
> +EXPORT_SYMBOL(kmem_cache_free_bulk);

Same concern here. We may just have to accept this for now.

Acked-by: Christoph Lameter <cl@linux.com>


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 5/7] slub: support for bulk free with SLUB freelists
  2015-09-28 12:26 ` [PATCH 5/7] slub: support for bulk free with SLUB freelists Jesper Dangaard Brouer
@ 2015-09-28 15:16   ` Christoph Lameter
  2015-09-28 15:51     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 50+ messages in thread
From: Christoph Lameter @ 2015-09-28 15:16 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: linux-mm, Andrew Morton, netdev, Alexander Duyck, Pekka Enberg,
	David Rientjes, Joonsoo Kim

On Mon, 28 Sep 2015, Jesper Dangaard Brouer wrote:

> diff --git a/mm/slub.c b/mm/slub.c
> index 1cf98d89546d..13b5f53e4840 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -675,11 +675,18 @@ static void init_object(struct kmem_cache *s, void *object, u8 val)
>  {
>  	u8 *p = object;
>
> +	/* Freepointer not overwritten as SLAB_POISON moved it after object */
>  	if (s->flags & __OBJECT_POISON) {
>  		memset(p, POISON_FREE, s->object_size - 1);
>  		p[s->object_size - 1] = POISON_END;
>  	}
>
> +	/*
> +	 * If both SLAB_RED_ZONE and SLAB_POISON are enabled, then
> +	 * freepointer is still safe, as then s->offset equals
> +	 * s->inuse and below redzone is after s->object_size and only
> +	 * area between s->object_size and s->inuse.
> +	 */
>  	if (s->flags & SLAB_RED_ZONE)
>  		memset(p + s->object_size, val, s->inuse - s->object_size);
>  }

Are these comments really adding something? This is basic metadata
handling for SLUB that is commented on elsewhere.

> @@ -2584,9 +2646,14 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace);
>   * So we still attempt to reduce cache line usage. Just take the slab
>   * lock and free the item. If there is no additional partial page
>   * handling required then we can return immediately.
> + *
> + * Bulk free of a freelist with several objects (all pointing to the
> + * same page) possible by specifying freelist_head ptr and object as
> + * tail ptr, plus objects count (cnt).
>   */
>  static void __slab_free(struct kmem_cache *s, struct page *page,
> -			void *x, unsigned long addr)
> +			void *x, unsigned long addr,
> +			void *freelist_head, int cnt)

Do you really need separate parameters for freelist_head? If you just want
to deal with one object pass it as freelist_head and set cnt = 1?


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 6/7] slub: optimize bulk slowpath free by detached freelist
  2015-09-28 12:26 ` [PATCH 6/7] slub: optimize bulk slowpath free by detached freelist Jesper Dangaard Brouer
@ 2015-09-28 15:22   ` Christoph Lameter
  0 siblings, 0 replies; 50+ messages in thread
From: Christoph Lameter @ 2015-09-28 15:22 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: linux-mm, Andrew Morton, netdev, Alexander Duyck, Pekka Enberg,
	David Rientjes, Joonsoo Kim


Acked-by: Christoph Lameter <cl@linux.com>


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 5/7] slub: support for bulk free with SLUB freelists
  2015-09-28 15:16   ` Christoph Lameter
@ 2015-09-28 15:51     ` Jesper Dangaard Brouer
  2015-09-28 16:28       ` Christoph Lameter
  2015-09-28 16:30       ` Christoph Lameter
  0 siblings, 2 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-28 15:51 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Andrew Morton, netdev, Alexander Duyck, Pekka Enberg,
	David Rientjes, Joonsoo Kim, brouer

On Mon, 28 Sep 2015 10:16:49 -0500 (CDT)
Christoph Lameter <cl@linux.com> wrote:

> On Mon, 28 Sep 2015, Jesper Dangaard Brouer wrote:
> 
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 1cf98d89546d..13b5f53e4840 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -675,11 +675,18 @@ static void init_object(struct kmem_cache *s, void *object, u8 val)
> >  {
> >  	u8 *p = object;
> >
> > +	/* Freepointer not overwritten as SLAB_POISON moved it after object */
> >  	if (s->flags & __OBJECT_POISON) {
> >  		memset(p, POISON_FREE, s->object_size - 1);
> >  		p[s->object_size - 1] = POISON_END;
> >  	}
> >
> > +	/*
> > +	 * If both SLAB_RED_ZONE and SLAB_POISON are enabled, then
> > +	 * freepointer is still safe, as then s->offset equals
> > +	 * s->inuse and below redzone is after s->object_size and only
> > +	 * area between s->object_size and s->inuse.
> > +	 */
> >  	if (s->flags & SLAB_RED_ZONE)
> >  		memset(p + s->object_size, val, s->inuse - s->object_size);
> >  }
> 
> Are these comments really adding something? This is basic metadata
> handling for SLUB that is commented on elsehwere.

Not knowing SLUB as well as you, it took me several hours to realize
init_object() didn't overwrite the freepointer in the object.  Thus, I
think these comments make the reader aware of not-so-obvious
side-effects of SLAB_POISON and SLAB_RED_ZONE.


> > @@ -2584,9 +2646,14 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace);
> >   * So we still attempt to reduce cache line usage. Just take the slab
> >   * lock and free the item. If there is no additional partial page
> >   * handling required then we can return immediately.
> > + *
> > + * Bulk free of a freelist with several objects (all pointing to the
> > + * same page) possible by specifying freelist_head ptr and object as
> > + * tail ptr, plus objects count (cnt).
> >   */
> >  static void __slab_free(struct kmem_cache *s, struct page *page,
> > -			void *x, unsigned long addr)
> > +			void *x, unsigned long addr,
> > +			void *freelist_head, int cnt)
> 
> Do you really need separate parameters for freelist_head? If you just want
> to deal with one object pass it as freelist_head and set cnt = 1?

Yes, I need it.  We need to know both the head and tail of the list to
splice it.

See:

> @@ -2612,7 +2681,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
                prior = page->freelist;
		counters = page->counters;
>  		set_freepointer(s, object, prior);
                                   ^^^^^^ 
Here we update the tail ptr (object) to point to "prior" (page->freelist).

>  		new.counters = counters;
>  		was_frozen = new.frozen;
> -		new.inuse--;
> +		new.inuse -= cnt;
>  		if ((!new.inuse || !prior) && !was_frozen) {
>  
>  			if (kmem_cache_has_cpu_partial(s) && !prior) {
> @@ -2643,7 +2712,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
>  
>  	} while (!cmpxchg_double_slab(s, page,
>  		prior, counters,
> -		object, new.counters,
> +		new_freelist, new.counters,
>  		"__slab_free"));

Here we update page->freelist ("prior") to point to the head. Thus,
splicing the list.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/7] slub: do prefetching in kmem_cache_alloc_bulk()
  2015-09-28 14:53   ` Alexander Duyck
@ 2015-09-28 15:59     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-28 15:59 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: linux-mm, Andrew Morton, netdev, Pekka Enberg, David Rientjes,
	Christoph Lameter, Joonsoo Kim, brouer


On Mon, 28 Sep 2015 07:53:16 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On 09/28/2015 05:26 AM, Jesper Dangaard Brouer wrote:
> > For practical use-cases it is beneficial to prefetch the next freelist
> > object in bulk allocation loop.
> >
> > Micro benchmarking show approx 1 cycle change:
> >
> > bulk -  prev-patch     -  this patch
> >     1 -  49 cycles(tsc) - 49 cycles(tsc) - increase in cycles:0
> >     2 -  30 cycles(tsc) - 31 cycles(tsc) - increase in cycles:1
> >     3 -  23 cycles(tsc) - 25 cycles(tsc) - increase in cycles:2
> >     4 -  20 cycles(tsc) - 22 cycles(tsc) - increase in cycles:2
> >     8 -  18 cycles(tsc) - 19 cycles(tsc) - increase in cycles:1
> >    16 -  17 cycles(tsc) - 18 cycles(tsc) - increase in cycles:1
> >    30 -  18 cycles(tsc) - 17 cycles(tsc) - increase in cycles:-1
> >    32 -  18 cycles(tsc) - 19 cycles(tsc) - increase in cycles:1
> >    34 -  23 cycles(tsc) - 24 cycles(tsc) - increase in cycles:1
> >    48 -  21 cycles(tsc) - 22 cycles(tsc) - increase in cycles:1
> >    64 -  20 cycles(tsc) - 21 cycles(tsc) - increase in cycles:1
> >   128 -  27 cycles(tsc) - 27 cycles(tsc) - increase in cycles:0
> >   158 -  30 cycles(tsc) - 30 cycles(tsc) - increase in cycles:0
> >   250 -  37 cycles(tsc) - 37 cycles(tsc) - increase in cycles:0
> >
> > Note, benchmark done with slab_nomerge to keep it stable enough
> > for accurate comparison.
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >   mm/slub.c |    2 ++
> >   1 file changed, 2 insertions(+)
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index c25717ab3b5a..5af75a618b91 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2951,6 +2951,7 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> >   				goto error;
> >   
> >   			c = this_cpu_ptr(s->cpu_slab);
> > +			prefetch_freepointer(s, c->freelist);
> >   			continue; /* goto for-loop */
> >   		}
> >   
> > @@ -2960,6 +2961,7 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> >   			goto error;
> >   
> >   		c->freelist = get_freepointer(s, object);
> > +		prefetch_freepointer(s, c->freelist);
> >   		p[i] = object;
> >   
> >   		/* kmem_cache debug support */
> >
> 
> I can see the prefetch in the last item case being possibly useful since 
> you have time between when you call the prefetch and when you are 
> accessing the next object.  However, is there any actual benefit to 
> prefetching inside the loop itself?  Based on your data above it doesn't 
> seem like that is the case since you are now adding one additional cycle 
> to the allocation and I am not seeing any actual gain reported here.

The gain will only show up when using bulk alloc in real use-cases.

As you know, bulk alloc on the RX path doesn't show any improvement.
I did measure (with perf-mem-record) L1 misses there, and I could
reduce those L1 misses by adding the prefetch.  But I cannot remember
if I measured any PPS improvement from this.

As you hint, the time I have between the prefetch and the use is very
small, so the question is whether this will show any benefit for real
use-cases.

We can drop this patch, and I'll instead include it in my network
use-case and measure the effect there? (Although I'll likely be
wasting my time, as we should probably redesign the alloc API
instead).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 5/7] slub: support for bulk free with SLUB freelists
  2015-09-28 15:51     ` Jesper Dangaard Brouer
@ 2015-09-28 16:28       ` Christoph Lameter
  2015-09-29  7:32         ` Jesper Dangaard Brouer
  2015-09-28 16:30       ` Christoph Lameter
  1 sibling, 1 reply; 50+ messages in thread
From: Christoph Lameter @ 2015-09-28 16:28 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: linux-mm, Andrew Morton, netdev, Alexander Duyck, Pekka Enberg,
	David Rientjes, Joonsoo Kim

On Mon, 28 Sep 2015, Jesper Dangaard Brouer wrote:

> > Do you really need separate parameters for freelist_head? If you just want
> > to deal with one object pass it as freelist_head and set cnt = 1?
>
> Yes, I need it.  We need to know both the head and tail of the list to
> splice it.

Ok so this is to avoid having to scan the list to its end? x is the end
of the list and freelist_head the beginning. That is weird.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 5/7] slub: support for bulk free with SLUB freelists
  2015-09-28 15:51     ` Jesper Dangaard Brouer
  2015-09-28 16:28       ` Christoph Lameter
@ 2015-09-28 16:30       ` Christoph Lameter
  2015-09-29  7:12         ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 50+ messages in thread
From: Christoph Lameter @ 2015-09-28 16:30 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: linux-mm, Andrew Morton, netdev, Alexander Duyck, Pekka Enberg,
	David Rientjes, Joonsoo Kim

On Mon, 28 Sep 2015, Jesper Dangaard Brouer wrote:

> Not knowing SLUB as well as you, it took me several hours to realize
> init_object() didn't overwrite the freepointer in the object.  Thus, I
> think these comments make the reader aware of not-so-obvious
> side-effects of SLAB_POISON and SLAB_RED_ZONE.

From the source:

/*
 * Object layout:
 *
 * object address
 *      Bytes of the object to be managed.
 *      If the freepointer may overlay the object then the free
 *      pointer is the first word of the object.
 *
 *      Poisoning uses 0x6b (POISON_FREE) and the last byte is
 *      0xa5 (POISON_END)
 *
 * object + s->object_size
 *      Padding to reach word boundary. This is also used for Redzoning.
 *      Padding is extended by another word if Redzoning is enabled and
 *      object_size == inuse.
 *
 *      We fill with 0xbb (RED_INACTIVE) for inactive objects and with
 *      0xcc (RED_ACTIVE) for objects in use.
 *
 * object + s->inuse
 *      Meta data starts here.
 *
 *      A. Free pointer (if we cannot overwrite object on free)
 *      B. Tracking data for SLAB_STORE_USER
 *      C. Padding to reach required alignment boundary or at mininum
 *              one word if debugging is on to be able to detect writes
 *              before the word boundary.
 *
 *      Padding is done using 0x5a (POISON_INUSE)
 *
 * object + s->size
 *      Nothing is used beyond s->size.
 *
 * If slabcaches are merged then the object_size and inuse boundaries are mostly
 * ignored. And therefore no slab options that rely on these boundaries
 * may be used with merged slabcaches.
 */


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 5/7] slub: support for bulk free with SLUB freelists
  2015-09-28 16:30       ` Christoph Lameter
@ 2015-09-29  7:12         ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-29  7:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Andrew Morton, netdev, Alexander Duyck, Pekka Enberg,
	David Rientjes, Joonsoo Kim, brouer


On Mon, 28 Sep 2015 11:30:00 -0500 (CDT) Christoph Lameter <cl@linux.com> wrote:

> On Mon, 28 Sep 2015, Jesper Dangaard Brouer wrote:
> 
> > Not knowing SLUB as well as you, it took me several hours to realize
> > init_object() didn't overwrite the freepointer in the object.  Thus, I
> > think these comments make the reader aware of not-so-obvious
> > side-effects of SLAB_POISON and SLAB_RED_ZONE.
> 
> From the source:
> 
> /*
>  * Object layout:
>  *
>  * object address
>  *      Bytes of the object to be managed.
>  *      If the freepointer may overlay the object then the free
>  *      pointer is the first word of the object.
>  *
>  *      Poisoning uses 0x6b (POISON_FREE) and the last byte is
>  *      0xa5 (POISON_END)
>  *
>  * object + s->object_size
>  *      Padding to reach word boundary. This is also used for Redzoning.
>  *      Padding is extended by another word if Redzoning is enabled and
>  *      object_size == inuse.
>  *
>  *      We fill with 0xbb (RED_INACTIVE) for inactive objects and with
>  *      0xcc (RED_ACTIVE) for objects in use.
>  *
>  * object + s->inuse
>  *      Meta data starts here.
>  *
>  *      A. Free pointer (if we cannot overwrite object on free)
>  *      B. Tracking data for SLAB_STORE_USER
>  *      C. Padding to reach required alignment boundary or at mininum
>  *              one word if debugging is on to be able to detect writes
>  *              before the word boundary.

Okay, I will remove the comment.

The best doc on SLUB and SLAB layout comes from your slides titled:
 "Slab allocators in the Linux Kernel: SLAB, SLOB, SLUB"

Let's gracefully add a link to the slides here:
 http://events.linuxfoundation.org/sites/events/files/slides/slaballocators.pdf

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 5/7] slub: support for bulk free with SLUB freelists
  2015-09-28 16:28       ` Christoph Lameter
@ 2015-09-29  7:32         ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-29  7:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Andrew Morton, netdev, Alexander Duyck, Pekka Enberg,
	David Rientjes, Joonsoo Kim, brouer

On Mon, 28 Sep 2015 11:28:15 -0500 (CDT)
Christoph Lameter <cl@linux.com> wrote:

> On Mon, 28 Sep 2015, Jesper Dangaard Brouer wrote:
> 
> > > Do you really need separate parameters for freelist_head? If you just want
> > > to deal with one object pass it as freelist_head and set cnt = 1?
> >
> > Yes, I need it.  We need to know both the head and tail of the list to
> > splice it.
> 
> Ok so this is to avoid having to scan the list to its end?

True.

> x is the end
> of the list and freelist_head the beginning. That is weird.

Yes, it is a bit weird... the bulk free of freelists comes out as a
second-class citizen.

Okay, I'll try to change the slab_free() and __slab_free() calls to
take a "head" and "tail", and let tail be NULL on single object free,
to allow the compiler to do constant propagation (thus keeping the
existing fastpath unaffected).  (The same code should be generated,
but we get a more intuitive API).
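
Roughly what I have in mind (sketch only, the actual patch follows in
the V4 posting below):

	static __always_inline void slab_free(struct kmem_cache *s,
			struct page *page, void *head, void *tail,
			int cnt, unsigned long addr)
	{
		void *tail_obj = tail ? : head;	/* NULL tail == single object free */
		...
	}

	/* existing single-object callers pass a constant NULL tail: */
	slab_free(s, virt_to_head_page(x), x, NULL, 1, _RET_IP_);

With tail being a compile-time NULL the compiler can constant-propagate
it and generate the same fastpath code as today.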

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [MM PATCH V4 0/6] Further optimizing SLAB/SLUB bulking
  2015-09-28 12:26 [PATCH 0/7] Further optimizing SLAB/SLUB bulking Jesper Dangaard Brouer
                   ` (6 preceding siblings ...)
  2015-09-28 12:26 ` [PATCH 7/7] slub: do prefetching in kmem_cache_alloc_bulk() Jesper Dangaard Brouer
@ 2015-09-29 15:46 ` Jesper Dangaard Brouer
  2015-09-29 15:47   ` [MM PATCH V4 1/6] slub: create new ___slab_alloc function that can be called with irqs disabled Jesper Dangaard Brouer
                     ` (5 more replies)
  7 siblings, 6 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-29 15:46 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Christoph Lameter
  Cc: netdev, Jesper Dangaard Brouer, Alexander Duyck, Pekka Enberg,
	David Rientjes, Joonsoo Kim

Most important part of this patchset is the introducing of what I call
detached freelist, for improving SLUB performance of object freeing in
the "slowpath" of kmem_cache_free_bulk.

Tagging patchset with "V4" to avoid confusion with "V2":
 (V2) http://thread.gmane.org/gmane.linux.kernel.mm/137469

Addressing comments from:
 ("V3") http://thread.gmane.org/gmane.linux.kernel.mm/139268

I've added Christoph Lameter's ACKs from prev review.
 * Only patch 5 is changed significantly and needs review.
 * Benchmarked, performance is the same

Notes for patches:
 * First two patches (from Christoph) are already in AKPM MMOTS.
 * Patch 3 is trivial
 * Patch 4 is a repost, implements bulking for SLAB.
  - http://thread.gmane.org/gmane.linux.kernel.mm/138220
 * Patch 5 and 6 are the important patches
  - Patch 5 handle "freelists" in slab_free() and __slab_free().
  - Patch 6 intro detached freelists, and significant performance improvement

Patches should be ready for the MM-tree, as I'm now handling kmem
debug support.


Based on top of commit 519f526d39 in net-next, but I've tested it
applies on top of mmotm-2015-09-18-16-08.

The benchmarking tools are avail here:
 https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm
 See: slab_bulk_test0{1,2,3}.c

This was joint work with Alexander Duyck while still at Red Hat.

This patchset is part of my network stack use-case.  I'll post the
network side of the patchset as soon as I've cleaned it up, rebased it
on net-next and re-run all the benchmarks.

---

Christoph Lameter (2):
      slub: create new ___slab_alloc function that can be called with irqs disabled
      slub: Avoid irqoff/on in bulk allocation

Jesper Dangaard Brouer (4):
      slub: mark the dangling ifdef #else of CONFIG_SLUB_DEBUG
      slab: implement bulking for SLAB allocator
      slub: support for bulk free with SLUB freelists
      slub: optimize bulk slowpath free by detached freelist


 mm/slab.c |   87 ++++++++++++++------
 mm/slub.c |  263 +++++++++++++++++++++++++++++++++++++++++++------------------
 2 files changed, 249 insertions(+), 101 deletions(-)

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [MM PATCH V4 1/6] slub: create new ___slab_alloc function that can be called with irqs disabled
  2015-09-29 15:46 ` [MM PATCH V4 0/6] Further optimizing SLAB/SLUB bulking Jesper Dangaard Brouer
@ 2015-09-29 15:47   ` Jesper Dangaard Brouer
  2015-09-29 15:47   ` [MM PATCH V4 2/6] slub: Avoid irqoff/on in bulk allocation Jesper Dangaard Brouer
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-29 15:47 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Christoph Lameter
  Cc: netdev, Jesper Dangaard Brouer, Alexander Duyck, Pekka Enberg,
	David Rientjes, Joonsoo Kim

From: Christoph Lameter <cl@linux.com>

NOTICE: Accepted by AKPM
 http://ozlabs.org/~akpm/mmots/broken-out/slub-create-new-___slab_alloc-function-that-can-be-called-with-irqs-disabled.patch

Bulk alloc needs a function like that because it enables interrupts before
calling __slab_alloc which promptly disables them again using the expensive
local_irq_save().

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 mm/slub.c |   44 +++++++++++++++++++++++++++++---------------
 1 file changed, 29 insertions(+), 15 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index f614b5dc396b..02cfb3a5983e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2298,23 +2298,15 @@ static inline void *get_freelist(struct kmem_cache *s, struct page *page)
  * And if we were unable to get a new slab from the partial slab lists then
  * we need to allocate a new slab. This is the slowest path since it involves
  * a call to the page allocator and the setup of a new slab.
+ *
+ * Version of __slab_alloc to use when we know that interrupts are
+ * already disabled (which is the case for bulk allocation).
  */
-static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
+static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 			  unsigned long addr, struct kmem_cache_cpu *c)
 {
 	void *freelist;
 	struct page *page;
-	unsigned long flags;
-
-	local_irq_save(flags);
-#ifdef CONFIG_PREEMPT
-	/*
-	 * We may have been preempted and rescheduled on a different
-	 * cpu before disabling interrupts. Need to reload cpu area
-	 * pointer.
-	 */
-	c = this_cpu_ptr(s->cpu_slab);
-#endif
 
 	page = c->page;
 	if (!page)
@@ -2372,7 +2364,6 @@ load_freelist:
 	VM_BUG_ON(!c->page->frozen);
 	c->freelist = get_freepointer(s, freelist);
 	c->tid = next_tid(c->tid);
-	local_irq_restore(flags);
 	return freelist;
 
 new_slab:
@@ -2389,7 +2380,6 @@ new_slab:
 
 	if (unlikely(!freelist)) {
 		slab_out_of_memory(s, gfpflags, node);
-		local_irq_restore(flags);
 		return NULL;
 	}
 
@@ -2405,11 +2395,35 @@ new_slab:
 	deactivate_slab(s, page, get_freepointer(s, freelist));
 	c->page = NULL;
 	c->freelist = NULL;
-	local_irq_restore(flags);
 	return freelist;
 }
 
 /*
+ * Another one that disabled interrupt and compensates for possible
+ * cpu changes by refetching the per cpu area pointer.
+ */
+static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
+			  unsigned long addr, struct kmem_cache_cpu *c)
+{
+	void *p;
+	unsigned long flags;
+
+	local_irq_save(flags);
+#ifdef CONFIG_PREEMPT
+	/*
+	 * We may have been preempted and rescheduled on a different
+	 * cpu before disabling interrupts. Need to reload cpu area
+	 * pointer.
+	 */
+	c = this_cpu_ptr(s->cpu_slab);
+#endif
+
+	p = ___slab_alloc(s, gfpflags, node, addr, c);
+	local_irq_restore(flags);
+	return p;
+}
+
+/*
  * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
  * have the fastpath folded into their functions. So no function call
  * overhead for requests that can be satisfied on the fastpath.


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [MM PATCH V4 2/6] slub: Avoid irqoff/on in bulk allocation
  2015-09-29 15:46 ` [MM PATCH V4 0/6] Further optimizing SLAB/SLUB bulking Jesper Dangaard Brouer
  2015-09-29 15:47   ` [MM PATCH V4 1/6] slub: create new ___slab_alloc function that can be called with irqs disabled Jesper Dangaard Brouer
@ 2015-09-29 15:47   ` Jesper Dangaard Brouer
  2015-09-29 15:47   ` [MM PATCH V4 3/6] slub: mark the dangling ifdef #else of CONFIG_SLUB_DEBUG Jesper Dangaard Brouer
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-29 15:47 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Christoph Lameter
  Cc: netdev, Jesper Dangaard Brouer, Alexander Duyck, Pekka Enberg,
	David Rientjes, Joonsoo Kim

From: Christoph Lameter <cl@linux.com>

NOTICE: Accepted by AKPM
 http://ozlabs.org/~akpm/mmots/broken-out/slub-avoid-irqoff-on-in-bulk-allocation.patch

Use the new function that can do allocation while
interrupts are disabled.  Avoids irq on/off sequences.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 mm/slub.c |   24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 02cfb3a5983e..024eed32da2c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2821,30 +2821,23 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 		void *object = c->freelist;
 
 		if (unlikely(!object)) {
-			local_irq_enable();
 			/*
 			 * Invoking slow path likely have side-effect
 			 * of re-populating per CPU c->freelist
 			 */
-			p[i] = __slab_alloc(s, flags, NUMA_NO_NODE,
+			p[i] = ___slab_alloc(s, flags, NUMA_NO_NODE,
 					    _RET_IP_, c);
-			if (unlikely(!p[i])) {
-				__kmem_cache_free_bulk(s, i, p);
-				return false;
-			}
-			local_irq_disable();
+			if (unlikely(!p[i]))
+				goto error;
+
 			c = this_cpu_ptr(s->cpu_slab);
 			continue; /* goto for-loop */
 		}
 
 		/* kmem_cache debug support */
 		s = slab_pre_alloc_hook(s, flags);
-		if (unlikely(!s)) {
-			__kmem_cache_free_bulk(s, i, p);
-			c->tid = next_tid(c->tid);
-			local_irq_enable();
-			return false;
-		}
+		if (unlikely(!s))
+			goto error;
 
 		c->freelist = get_freepointer(s, object);
 		p[i] = object;
@@ -2864,6 +2857,11 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 	}
 
 	return true;
+
+error:
+	__kmem_cache_free_bulk(s, i, p);
+	local_irq_enable();
+	return false;
 }
 EXPORT_SYMBOL(kmem_cache_alloc_bulk);
 


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [MM PATCH V4 3/6] slub: mark the dangling ifdef #else of CONFIG_SLUB_DEBUG
  2015-09-29 15:46 ` [MM PATCH V4 0/6] Further optimizing SLAB/SLUB bulking Jesper Dangaard Brouer
  2015-09-29 15:47   ` [MM PATCH V4 1/6] slub: create new ___slab_alloc function that can be called with irqs disabled Jesper Dangaard Brouer
  2015-09-29 15:47   ` [MM PATCH V4 2/6] slub: Avoid irqoff/on in bulk allocation Jesper Dangaard Brouer
@ 2015-09-29 15:47   ` Jesper Dangaard Brouer
  2015-09-29 15:48   ` [MM PATCH V4 4/6] slab: implement bulking for SLAB allocator Jesper Dangaard Brouer
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-29 15:47 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Christoph Lameter
  Cc: netdev, Jesper Dangaard Brouer, Alexander Duyck, Pekka Enberg,
	David Rientjes, Joonsoo Kim

The #ifdef of CONFIG_SLUB_DEBUG is located very far from
the associated #else.  For readability mark it with a comment.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
---
 mm/slub.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index 024eed32da2c..1cf98d89546d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1202,7 +1202,7 @@ unsigned long kmem_cache_flags(unsigned long object_size,
 
 	return flags;
 }
-#else
+#else /* !CONFIG_SLUB_DEBUG */
 static inline void setup_object_debug(struct kmem_cache *s,
 			struct page *page, void *object) {}
 


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [MM PATCH V4 4/6] slab: implement bulking for SLAB allocator
  2015-09-29 15:46 ` [MM PATCH V4 0/6] Further optimizing SLAB/SLUB bulking Jesper Dangaard Brouer
                     ` (2 preceding siblings ...)
  2015-09-29 15:47   ` [MM PATCH V4 3/6] slub: mark the dangling ifdef #else of CONFIG_SLUB_DEBUG Jesper Dangaard Brouer
@ 2015-09-29 15:48   ` Jesper Dangaard Brouer
  2015-09-29 15:48   ` [MM PATCH V4 5/6] slub: support for bulk free with SLUB freelists Jesper Dangaard Brouer
  2015-09-29 15:48   ` [MM PATCH V4 6/6] slub: optimize bulk slowpath free by detached freelist Jesper Dangaard Brouer
  5 siblings, 0 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-29 15:48 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Christoph Lameter
  Cc: netdev, Jesper Dangaard Brouer, Alexander Duyck, Pekka Enberg,
	David Rientjes, Joonsoo Kim

Implement a basic approach to bulking in the SLAB allocator: simply
use local_irq_{disable,enable} and call the single alloc/free path in
a loop.  This simple implementation approach is surprisingly fast.

Notice the normal SLAB fastpath is 96 cycles (24.119 ns). The table
below shows that single object bulking only takes 42 cycles.  This can
be explained by the bulk API's requirement to be called from a known
interrupt context, that is with interrupts enabled.  This allows us to
avoid the expensive (37 cycles) local_irq_{save,restore}, and instead
use the much faster (7 cycles) local_irq_{disable,enable}.

Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz:

bulk - Current                  - simple SLAB bulk implementation
  1 - 115 cycles(tsc) 28.812 ns - 42 cycles(tsc) 10.715 ns - improved 63.5%
  2 - 103 cycles(tsc) 25.956 ns - 27 cycles(tsc)  6.985 ns - improved 73.8%
  3 - 101 cycles(tsc) 25.336 ns - 22 cycles(tsc)  5.733 ns - improved 78.2%
  4 - 100 cycles(tsc) 25.147 ns - 21 cycles(tsc)  5.319 ns - improved 79.0%
  8 -  98 cycles(tsc) 24.616 ns - 18 cycles(tsc)  4.620 ns - improved 81.6%
 16 -  97 cycles(tsc) 24.408 ns - 17 cycles(tsc)  4.344 ns - improved 82.5%
 30 -  98 cycles(tsc) 24.641 ns - 16 cycles(tsc)  4.202 ns - improved 83.7%
 32 -  98 cycles(tsc) 24.607 ns - 16 cycles(tsc)  4.199 ns - improved 83.7%
 34 -  98 cycles(tsc) 24.605 ns - 18 cycles(tsc)  4.579 ns - improved 81.6%
 48 -  97 cycles(tsc) 24.463 ns - 17 cycles(tsc)  4.405 ns - improved 82.5%
 64 -  97 cycles(tsc) 24.370 ns - 17 cycles(tsc)  4.384 ns - improved 82.5%
128 -  99 cycles(tsc) 24.763 ns - 19 cycles(tsc)  4.755 ns - improved 80.8%
158 -  98 cycles(tsc) 24.708 ns - 18 cycles(tsc)  4.723 ns - improved 81.6%
250 - 101 cycles(tsc) 25.342 ns - 20 cycles(tsc)  5.035 ns - improved 80.2%

Also notice how well bulking maintains the performance when the bulk
size increases (which is a sore spot for the SLUB allocator).

Increasing the bulk size further:
 20 cycles(tsc)  5.214 ns (bulk: 512)
 30 cycles(tsc)  7.734 ns (bulk: 768)
 40 cycles(tsc) 10.244 ns (bulk:1024)
 72 cycles(tsc) 18.049 ns (bulk:2048)
 90 cycles(tsc) 22.585 ns (bulk:4096)

It is not recommended to perform large bulking with SLAB, as
local interrupts are disabled for the entire period.  If these
kinds of use-cases evolve, this interface should be adjusted to
mitigate/reduce the interrupts-off period.

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
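
For illustration, a minimal caller-side sketch of how the bulk API is
meant to be used (example only, not part of this patch; the function
name and sizes are made up):

	void example_bulk_usage(struct kmem_cache *cache)
	{
		void *objs[16];

		/* Both bulk calls must be made with interrupts enabled */
		if (!kmem_cache_alloc_bulk(cache, GFP_KERNEL, ARRAY_SIZE(objs), objs))
			return;	/* on failure the partial allocation is freed for us */

		/* ... use objs[0..15] ... */

		kmem_cache_free_bulk(cache, ARRAY_SIZE(objs), objs);
	}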

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
---
 mm/slab.c |   87 +++++++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 62 insertions(+), 25 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index c77ebe6cc87c..21da6b1ccae3 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3234,11 +3234,15 @@ __do_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 #endif /* CONFIG_NUMA */
 
 static __always_inline void *
-slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
+slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller,
+	   bool irq_off_needed)
 {
 	unsigned long save_flags;
 	void *objp;
 
+	/* Compiler need to remove irq_off_needed branch statements */
+	BUILD_BUG_ON(!__builtin_constant_p(irq_off_needed));
+
 	flags &= gfp_allowed_mask;
 
 	lockdep_trace_alloc(flags);
@@ -3249,9 +3253,11 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
 	cachep = memcg_kmem_get_cache(cachep, flags);
 
 	cache_alloc_debugcheck_before(cachep, flags);
-	local_irq_save(save_flags);
+	if (irq_off_needed)
+		local_irq_save(save_flags);
 	objp = __do_cache_alloc(cachep, flags);
-	local_irq_restore(save_flags);
+	if (irq_off_needed)
+		local_irq_restore(save_flags);
 	objp = cache_alloc_debugcheck_after(cachep, flags, objp, caller);
 	kmemleak_alloc_recursive(objp, cachep->object_size, 1, cachep->flags,
 				 flags);
@@ -3407,7 +3413,7 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp,
  */
 void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 {
-	void *ret = slab_alloc(cachep, flags, _RET_IP_);
+	void *ret = slab_alloc(cachep, flags, _RET_IP_, true);
 
 	trace_kmem_cache_alloc(_RET_IP_, ret,
 			       cachep->object_size, cachep->size, flags);
@@ -3416,16 +3422,23 @@ void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 }
 EXPORT_SYMBOL(kmem_cache_alloc);
 
-void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
-{
-	__kmem_cache_free_bulk(s, size, p);
-}
-EXPORT_SYMBOL(kmem_cache_free_bulk);
-
+/* Note that interrupts must be enabled when calling this function. */
 bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
-								void **p)
+			   void **p)
 {
-	return __kmem_cache_alloc_bulk(s, flags, size, p);
+	size_t i;
+
+	local_irq_disable();
+	for (i = 0; i < size; i++) {
+		void *x = p[i] = slab_alloc(s, flags, _RET_IP_, false);
+
+		if (!x) {
+			__kmem_cache_free_bulk(s, i, p);
+			return false;
+		}
+	}
+	local_irq_enable();
+	return true;
 }
 EXPORT_SYMBOL(kmem_cache_alloc_bulk);
 
@@ -3435,7 +3448,7 @@ kmem_cache_alloc_trace(struct kmem_cache *cachep, gfp_t flags, size_t size)
 {
 	void *ret;
 
-	ret = slab_alloc(cachep, flags, _RET_IP_);
+	ret = slab_alloc(cachep, flags, _RET_IP_, true);
 
 	trace_kmalloc(_RET_IP_, ret,
 		      size, cachep->size, flags);
@@ -3526,7 +3539,7 @@ static __always_inline void *__do_kmalloc(size_t size, gfp_t flags,
 	cachep = kmalloc_slab(size, flags);
 	if (unlikely(ZERO_OR_NULL_PTR(cachep)))
 		return cachep;
-	ret = slab_alloc(cachep, flags, caller);
+	ret = slab_alloc(cachep, flags, caller, true);
 
 	trace_kmalloc(caller, ret,
 		      size, cachep->size, flags);
@@ -3546,32 +3559,56 @@ void *__kmalloc_track_caller(size_t size, gfp_t flags, unsigned long caller)
 }
 EXPORT_SYMBOL(__kmalloc_track_caller);
 
-/**
- * kmem_cache_free - Deallocate an object
- * @cachep: The cache the allocation was from.
- * @objp: The previously allocated object.
- *
- * Free an object which was previously allocated from this
- * cache.
- */
-void kmem_cache_free(struct kmem_cache *cachep, void *objp)
+/* Caller is responsible for disabling local IRQs */
+static __always_inline void __kmem_cache_free(struct kmem_cache *cachep,
+					      void *objp, bool irq_off_needed)
 {
 	unsigned long flags;
+
+	/* Compiler need to remove irq_off_needed branch statements */
+	BUILD_BUG_ON(!__builtin_constant_p(irq_off_needed));
+
 	cachep = cache_from_obj(cachep, objp);
 	if (!cachep)
 		return;
 
-	local_irq_save(flags);
+	if (irq_off_needed)
+		local_irq_save(flags);
 	debug_check_no_locks_freed(objp, cachep->object_size);
 	if (!(cachep->flags & SLAB_DEBUG_OBJECTS))
 		debug_check_no_obj_freed(objp, cachep->object_size);
 	__cache_free(cachep, objp, _RET_IP_);
-	local_irq_restore(flags);
+	if (irq_off_needed)
+		local_irq_restore(flags);
+}
 
+/**
+ * kmem_cache_free - Deallocate an object
+ * @cachep: The cache the allocation was from.
+ * @objp: The previously allocated object.
+ *
+ * Free an object which was previously allocated from this
+ * cache.
+ */
+void kmem_cache_free(struct kmem_cache *cachep, void *objp)
+{
+	__kmem_cache_free(cachep, objp, true);
 	trace_kmem_cache_free(_RET_IP_, objp);
 }
 EXPORT_SYMBOL(kmem_cache_free);
 
+/* Note that interrupts must be enabled when calling this function. */
+void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
+{
+	size_t i;
+
+	local_irq_disable();
+	for (i = 0; i < size; i++)
+		__kmem_cache_free(s, p[i], false);
+	local_irq_enable();
+}
+EXPORT_SYMBOL(kmem_cache_free_bulk);
+
 /**
  * kfree - free previously allocated memory
  * @objp: pointer returned by kmalloc.


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [MM PATCH V4 5/6] slub: support for bulk free with SLUB freelists
  2015-09-29 15:46 ` [MM PATCH V4 0/6] Further optimizing SLAB/SLUB bulking Jesper Dangaard Brouer
                     ` (3 preceding siblings ...)
  2015-09-29 15:48   ` [MM PATCH V4 4/6] slab: implement bulking for SLAB allocator Jesper Dangaard Brouer
@ 2015-09-29 15:48   ` Jesper Dangaard Brouer
  2015-09-29 16:38     ` Alexander Duyck
  2015-09-29 15:48   ` [MM PATCH V4 6/6] slub: optimize bulk slowpath free by detached freelist Jesper Dangaard Brouer
  5 siblings, 1 reply; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-29 15:48 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Christoph Lameter
  Cc: netdev, Jesper Dangaard Brouer, Alexander Duyck, Pekka Enberg,
	David Rientjes, Joonsoo Kim

Make it possible to free a freelist with several objects by adjusting
the API of slab_free() and __slab_free() to take head, tail and an
object counter (cnt).

Tail being NULL indicates single object free of the head object.  This
allows compiler inline constant propagation in slab_free() and
slab_free_freelist_hook() to avoid adding any overhead in case of
single object free.

This allows a freelist with several objects (all within the same
slab-page) to be free'ed using a single locked cmpxchg_double in
__slab_free() and with an unlocked cmpxchg_double in slab_free().

Object debugging on the free path is also extended to handle these
freelists.  When CONFIG_SLUB_DEBUG is enabled it will also detect if
objects don't belong to the same slab-page.

These changes are needed for the next patch to bulk free the detached
freelists it introduces and constructs.

Micro benchmarking showed no performance reduction due to this change,
when debugging is not active (kernel compiled with CONFIG_SLUB_DEBUG).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>

---
V4:
 - Change API per req of Christoph Lameter
 - Remove comments in init_object.

 mm/slub.c |   87 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 69 insertions(+), 18 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 1cf98d89546d..7c2abc33fd4e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1063,11 +1063,15 @@ bad:
 	return 0;
 }
 
+/* Supports checking bulk free of a constructed freelist */
 static noinline struct kmem_cache_node *free_debug_processing(
-	struct kmem_cache *s, struct page *page, void *object,
+	struct kmem_cache *s, struct page *page,
+	void *head, void *tail, int bulk_cnt,
 	unsigned long addr, unsigned long *flags)
 {
 	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
+	void *object = head;
+	int cnt = 0;
 
 	spin_lock_irqsave(&n->list_lock, *flags);
 	slab_lock(page);
@@ -1075,6 +1079,9 @@ static noinline struct kmem_cache_node *free_debug_processing(
 	if (!check_slab(s, page))
 		goto fail;
 
+next_object:
+	cnt++;
+
 	if (!check_valid_pointer(s, page, object)) {
 		slab_err(s, page, "Invalid object pointer 0x%p", object);
 		goto fail;
@@ -1105,8 +1112,19 @@ static noinline struct kmem_cache_node *free_debug_processing(
 	if (s->flags & SLAB_STORE_USER)
 		set_track(s, object, TRACK_FREE, addr);
 	trace(s, page, object, 0);
+	/* Freepointer not overwritten by init_object(), SLAB_POISON moved it */
 	init_object(s, object, SLUB_RED_INACTIVE);
+
+	/* Reached end of constructed freelist yet? */
+	if (object != tail) {
+		object = get_freepointer(s, object);
+		goto next_object;
+	}
 out:
+	if (cnt != bulk_cnt)
+		slab_err(s, page, "Bulk freelist count(%d) invalid(%d)\n",
+			 bulk_cnt, cnt);
+
 	slab_unlock(page);
 	/*
 	 * Keep node_lock to preserve integrity
@@ -1210,7 +1228,8 @@ static inline int alloc_debug_processing(struct kmem_cache *s,
 	struct page *page, void *object, unsigned long addr) { return 0; }
 
 static inline struct kmem_cache_node *free_debug_processing(
-	struct kmem_cache *s, struct page *page, void *object,
+	struct kmem_cache *s, struct page *page,
+	void *head, void *tail, int bulk_cnt,
 	unsigned long addr, unsigned long *flags) { return NULL; }
 
 static inline int slab_pad_check(struct kmem_cache *s, struct page *page)
@@ -1306,6 +1325,31 @@ static inline void slab_free_hook(struct kmem_cache *s, void *x)
 	kasan_slab_free(s, x);
 }
 
+/* Compiler cannot detect that slab_free_freelist_hook() can be
+ * removed if slab_free_hook() evaluates to nothing.  Thus, we need to
+ * catch all relevant config debug options here.
+ */
+#if defined(CONFIG_KMEMCHECK) ||		\
+	defined(CONFIG_LOCKDEP)	||		\
+	defined(CONFIG_DEBUG_KMEMLEAK) ||	\
+	defined(CONFIG_DEBUG_OBJECTS_FREE) ||	\
+	defined(CONFIG_KASAN)
+static inline void slab_free_freelist_hook(struct kmem_cache *s,
+					   void *head, void *tail)
+{
+	void *object = head;
+	void *tail_obj = tail ? : head;
+
+	do {
+		slab_free_hook(s, object);
+	} while ((object != tail_obj) &&
+		 (object = get_freepointer(s, object)));
+}
+#else
+static inline void slab_free_freelist_hook(struct kmem_cache *s, void *obj_tail,
+					   void *freelist_head) {}
+#endif
+
 static void setup_object(struct kmem_cache *s, struct page *page,
 				void *object)
 {
@@ -2586,10 +2630,11 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace);
  * handling required then we can return immediately.
  */
 static void __slab_free(struct kmem_cache *s, struct page *page,
-			void *x, unsigned long addr)
+			void *head, void *tail, int cnt,
+			unsigned long addr)
+
 {
 	void *prior;
-	void **object = (void *)x;
 	int was_frozen;
 	struct page new;
 	unsigned long counters;
@@ -2599,7 +2644,8 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 	stat(s, FREE_SLOWPATH);
 
 	if (kmem_cache_debug(s) &&
-		!(n = free_debug_processing(s, page, x, addr, &flags)))
+	    !(n = free_debug_processing(s, page, head, tail, cnt,
+					addr, &flags)))
 		return;
 
 	do {
@@ -2609,10 +2655,10 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 		}
 		prior = page->freelist;
 		counters = page->counters;
-		set_freepointer(s, object, prior);
+		set_freepointer(s, tail, prior);
 		new.counters = counters;
 		was_frozen = new.frozen;
-		new.inuse--;
+		new.inuse -= cnt;
 		if ((!new.inuse || !prior) && !was_frozen) {
 
 			if (kmem_cache_has_cpu_partial(s) && !prior) {
@@ -2643,7 +2689,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 
 	} while (!cmpxchg_double_slab(s, page,
 		prior, counters,
-		object, new.counters,
+		head, new.counters,
 		"__slab_free"));
 
 	if (likely(!n)) {
@@ -2708,15 +2754,20 @@ slab_empty:
  *
  * If fastpath is not possible then fall back to __slab_free where we deal
  * with all sorts of special processing.
+ *
+ * Bulk free of a freelist with several objects (all pointing to the
+ * same page) possible by specifying head and tail ptr, plus objects
+ * count (cnt). Bulk free indicated by tail pointer being set.
  */
-static __always_inline void slab_free(struct kmem_cache *s,
-			struct page *page, void *x, unsigned long addr)
+static __always_inline void slab_free(struct kmem_cache *s, struct page *page,
+				      void *head, void *tail, int cnt,
+				      unsigned long addr)
 {
-	void **object = (void *)x;
+	void *tail_obj = tail ? : head;
 	struct kmem_cache_cpu *c;
 	unsigned long tid;
 
-	slab_free_hook(s, x);
+	slab_free_freelist_hook(s, head, tail);
 
 redo:
 	/*
@@ -2735,19 +2786,19 @@ redo:
 	barrier();
 
 	if (likely(page == c->page)) {
-		set_freepointer(s, object, c->freelist);
+		set_freepointer(s, tail_obj, c->freelist);
 
 		if (unlikely(!this_cpu_cmpxchg_double(
 				s->cpu_slab->freelist, s->cpu_slab->tid,
 				c->freelist, tid,
-				object, next_tid(tid)))) {
+				head, next_tid(tid)))) {
 
 			note_cmpxchg_failure("slab_free", s, tid);
 			goto redo;
 		}
 		stat(s, FREE_FASTPATH);
 	} else
-		__slab_free(s, page, x, addr);
+		__slab_free(s, page, head, tail_obj, cnt, addr);
 
 }
 
@@ -2756,7 +2807,7 @@ void kmem_cache_free(struct kmem_cache *s, void *x)
 	s = cache_from_obj(s, x);
 	if (!s)
 		return;
-	slab_free(s, virt_to_head_page(x), x, _RET_IP_);
+	slab_free(s, virt_to_head_page(x), x, NULL, 1, _RET_IP_);
 	trace_kmem_cache_free(_RET_IP_, x);
 }
 EXPORT_SYMBOL(kmem_cache_free);
@@ -2791,7 +2842,7 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 			c->tid = next_tid(c->tid);
 			local_irq_enable();
 			/* Slowpath: overhead locked cmpxchg_double_slab */
-			__slab_free(s, page, object, _RET_IP_);
+			__slab_free(s, page, object, object, 1, _RET_IP_);
 			local_irq_disable();
 			c = this_cpu_ptr(s->cpu_slab);
 		}
@@ -3531,7 +3582,7 @@ void kfree(const void *x)
 		__free_kmem_pages(page, compound_order(page));
 		return;
 	}
-	slab_free(page->slab_cache, page, object, _RET_IP_);
+	slab_free(page->slab_cache, page, object, NULL, 1, _RET_IP_);
 }
 EXPORT_SYMBOL(kfree);
 


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [MM PATCH V4 6/6] slub: optimize bulk slowpath free by detached freelist
  2015-09-29 15:46 ` [MM PATCH V4 0/6] Further optimizing SLAB/SLUB bulking Jesper Dangaard Brouer
                     ` (4 preceding siblings ...)
  2015-09-29 15:48   ` [MM PATCH V4 5/6] slub: support for bulk free with SLUB freelists Jesper Dangaard Brouer
@ 2015-09-29 15:48   ` Jesper Dangaard Brouer
  2015-10-14  5:15     ` Joonsoo Kim
  5 siblings, 1 reply; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-29 15:48 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Christoph Lameter
  Cc: netdev, Jesper Dangaard Brouer, Alexander Duyck, Pekka Enberg,
	David Rientjes, Joonsoo Kim

This change focuses on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk.

The calls slab_free (fastpath) and __slab_free (slowpath) have been
extended with support for bulk free, which amortizes the overhead of
the (locked) cmpxchg_double.

To use the new bulking feature, we build what I call a detached
freelist.  The detached freelist takes advantage of three properties:

 1) the free function call owns the object that is about to be freed,
    thus writing into this memory is synchronization-free.

 2) many freelists can co-exist side-by-side in the same slab-page,
    each with a separate head pointer.

 3) it is the visibility of the head pointer that needs synchronization.

Given these properties, the brilliant part is that the detached
freelist can be constructed without any need for synchronization.  The
freelist is built directly in the page objects, and the detached
freelist descriptor is allocated on the stack of the function call
kmem_cache_free_bulk.  Thus, the freelist head pointer is not visible
to other CPUs.

All objects in a SLUB freelist must belong to the same slab-page.
Thus, constructing the detached freelist is about matching objects
that belong to the same slab-page.  The bulk free array is scanned in
a progressive manner with a limited look-ahead facility (see the
worked example below).
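
A worked example of the look-ahead scan (illustration only, not part of
the patch).  Assume the bulk array maps to slab-pages as follows
(letters denote pages):

	p[] = { A1, A2, B1, A3, B2, C1 }

Scanning from the end, the first pass starts a detached freelist for
page C with C1, then sees B2, A3 and B1 without a match and stops once
the look-ahead (3) is exhausted, so the C-list (one object) is freed
with a single cmpxchg.  The next pass restarts at the first skipped
index and collects B2+B1, and the final pass collects A3+A2+A1: three
slab-pages, three synchronization primitives in this example.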

Kmem debug support is handled in the call to slab_free().

Notice kmem_cache_free_bulk no longer needs to disable IRQs. This
only slowed down single object bulk free by approx 3 cycles.


Performance data:
 Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz

SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns

To get stable and comparable numbers, the kernel has been booted with
"slab_nomerge" (this also improves performance for larger bulk sizes).

Performance data, compared against fallback bulking:

bulk -  fallback bulk            - improvement with this patch
   1 -  62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
   2 -  55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
   3 -  53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
   4 -  52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
   8 -  50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
  16 -  49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
  30 -  49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
  32 -  50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
  34 -  96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
  48 -  83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
  64 -  74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
 128 -  90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
 158 -  99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
 250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%

Performance data, compared against current in-kernel bulking:

bulk - curr in-kernel  - improvement with this patch
   1 -  46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
   2 -  27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
   3 -  21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
   4 -  18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
   8 -  17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
  16 -  18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1)  5.6%
  30 -  18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0)  0.0%
  32 -  18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0)  0.0%
  34 -  78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
  48 -  60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
  64 -  49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
 128 -  69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
 158 -  79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
 250 -  86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%

Performance with normal SLUB merging is significantly slower for
larger bulking.  The speedup with slab_nomerge is believed to
(primarily) be an effect of not having to share the per-CPU
data-structures, as tuning the per-CPU size can achieve similar
performance.

bulk - slab_nomerge   -  normal SLUB merge
   1 -  49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
   2 -  30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
   3 -  23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
   4 -  20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
   8 -  18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
  16 -  17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
  30 -  18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
  32 -  18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
  34 -  23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
  48 -  21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
  64 -  20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
 128 -  27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
 158 -  30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
 250 -  37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19

Joint work with Alexander Duyck.

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
---
 mm/slub.c |  108 ++++++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 78 insertions(+), 30 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 7c2abc33fd4e..53500f3b70ab 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2812,44 +2812,92 @@ void kmem_cache_free(struct kmem_cache *s, void *x)
 }
 EXPORT_SYMBOL(kmem_cache_free);
 
-/* Note that interrupts must be enabled when calling this function. */
-void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
-{
-	struct kmem_cache_cpu *c;
+struct detached_freelist {
 	struct page *page;
-	int i;
+	void *tail;
+	void *freelist;
+	int cnt;
+};
 
-	local_irq_disable();
-	c = this_cpu_ptr(s->cpu_slab);
+/*
+ * This function progressively scans the array with free objects (with
+ * a limited look ahead) and extract objects belonging to the same
+ * page.  It builds a detached freelist directly within the given
+ * page/objects.  This can happen without any need for
+ * synchronization, because the objects are owned by running process.
+ * The freelist is build up as a single linked list in the objects.
+ * The idea is, that this detached freelist can then be bulk
+ * transferred to the real freelist(s), but only requiring a single
+ * synchronization primitive.  Look ahead in the array is limited due
+ * to performance reasons.
+ */
+static int build_detached_freelist(struct kmem_cache *s, size_t size,
+				   void **p, struct detached_freelist *df)
+{
+	size_t first_skipped_index = 0;
+	int lookahead = 3;
+	void *object;
 
-	for (i = 0; i < size; i++) {
-		void *object = p[i];
+	/* Always re-init detached_freelist */
+	df->page = NULL;
 
-		BUG_ON(!object);
-		/* kmem cache debug support */
-		s = cache_from_obj(s, object);
-		if (unlikely(!s))
-			goto exit;
-		slab_free_hook(s, object);
+	do {
+		object = p[--size];
+	} while (!object && size);
 
-		page = virt_to_head_page(object);
+	if (!object)
+		return 0;
 
-		if (c->page == page) {
-			/* Fastpath: local CPU free */
-			set_freepointer(s, object, c->freelist);
-			c->freelist = object;
-		} else {
-			c->tid = next_tid(c->tid);
-			local_irq_enable();
-			/* Slowpath: overhead locked cmpxchg_double_slab */
-			__slab_free(s, page, object, object, 1, _RET_IP_);
-			local_irq_disable();
-			c = this_cpu_ptr(s->cpu_slab);
+	/* Start new detached freelist */
+	set_freepointer(s, object, NULL);
+	df->page = virt_to_head_page(object);
+	df->tail = object;
+	df->freelist = object;
+	p[size] = NULL; /* mark object processed */
+	df->cnt = 1;
+
+	while (size) {
+		object = p[--size];
+		if (!object)
+			continue; /* Skip processed objects */
+
+		/* df->page is always set at this point */
+		if (df->page == virt_to_head_page(object)) {
+			/* Opportunity build freelist */
+			set_freepointer(s, object, df->freelist);
+			df->freelist = object;
+			df->cnt++;
+			p[size] = NULL; /* mark object processed */
+
+			continue;
 		}
+
+		/* Limit look ahead search */
+		if (!--lookahead)
+			break;
+
+		if (!first_skipped_index)
+			first_skipped_index = size + 1;
 	}
-exit:
-	c->tid = next_tid(c->tid);
-	local_irq_enable();
+
+	return first_skipped_index;
+}
+
+
+/* Note that interrupts must be enabled when calling this function. */
+void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
+{
+	BUG_ON(!size);
+
+	do {
+		struct detached_freelist df;
+
+		size = build_detached_freelist(s, size, p, &df);
+		if (unlikely(!df.page))
+			continue;
+
+		slab_free(s, df.page, df.freelist, df.tail, df.cnt, _RET_IP_);
+	} while (likely(size));
 }
 EXPORT_SYMBOL(kmem_cache_free_bulk);
 


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [MM PATCH V4 5/6] slub: support for bulk free with SLUB freelists
  2015-09-29 15:48   ` [MM PATCH V4 5/6] slub: support for bulk free with SLUB freelists Jesper Dangaard Brouer
@ 2015-09-29 16:38     ` Alexander Duyck
  2015-09-29 17:00       ` Jesper Dangaard Brouer
  2015-09-30 11:44       ` [MM PATCH V4.1 " Jesper Dangaard Brouer
  0 siblings, 2 replies; 50+ messages in thread
From: Alexander Duyck @ 2015-09-29 16:38 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, linux-mm, Andrew Morton, Christoph Lameter
  Cc: netdev, Pekka Enberg, David Rientjes, Joonsoo Kim

On 09/29/2015 08:48 AM, Jesper Dangaard Brouer wrote:
> Make it possible to free a freelist with several objects by adjusting
> the API of slab_free() and __slab_free() to take head, tail and an
> object counter (cnt).
>
> Tail being NULL indicates single object free of the head object.  This
> allows compiler inline constant propagation in slab_free() and
> slab_free_freelist_hook() to avoid adding any overhead in case of
> single object free.
>
> This allows a freelist with several objects (all within the same
> slab-page) to be free'ed using a single locked cmpxchg_double in
> __slab_free() and with an unlocked cmpxchg_double in slab_free().
>
> Object debugging on the free path is also extended to handle these
> freelists.  When CONFIG_SLUB_DEBUG is enabled it will also detect if
> objects don't belong to the same slab-page.
>
> These changes are needed for the next patch to bulk free the detached
> freelists it introduces and constructs.
>
> Micro benchmarking showed no performance reduction due to this change,
> when debugging is turned off (compiled with CONFIG_SLUB_DEBUG).
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
>
> ---
> V4:
>   - Change API per req of Christoph Lameter
>   - Remove comments in init_object.
>
>   mm/slub.c |   87 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
>   1 file changed, 69 insertions(+), 18 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 1cf98d89546d..7c2abc33fd4e 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1063,11 +1063,15 @@ bad:
>   	return 0;
>   }
>
> +/* Supports checking bulk free of a constructed freelist */
>   static noinline struct kmem_cache_node *free_debug_processing(
> -	struct kmem_cache *s, struct page *page, void *object,
> +	struct kmem_cache *s, struct page *page,
> +	void *head, void *tail, int bulk_cnt,
>   	unsigned long addr, unsigned long *flags)
>   {
>   	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
> +	void *object = head;
> +	int cnt = 0;
>
>   	spin_lock_irqsave(&n->list_lock, *flags);
>   	slab_lock(page);
> @@ -1075,6 +1079,9 @@ static noinline struct kmem_cache_node *free_debug_processing(
>   	if (!check_slab(s, page))
>   		goto fail;
>
> +next_object:
> +	cnt++;
> +
>   	if (!check_valid_pointer(s, page, object)) {
>   		slab_err(s, page, "Invalid object pointer 0x%p", object);
>   		goto fail;
> @@ -1105,8 +1112,19 @@ static noinline struct kmem_cache_node *free_debug_processing(
>   	if (s->flags & SLAB_STORE_USER)
>   		set_track(s, object, TRACK_FREE, addr);
>   	trace(s, page, object, 0);
> +	/* Freepointer not overwritten by init_object(), SLAB_POISON moved it */
>   	init_object(s, object, SLUB_RED_INACTIVE);
> +
> +	/* Reached end of constructed freelist yet? */
> +	if (object != tail) {
> +		object = get_freepointer(s, object);
> +		goto next_object;
> +	}
>   out:
> +	if (cnt != bulk_cnt)
> +		slab_err(s, page, "Bulk freelist count(%d) invalid(%d)\n",
> +			 bulk_cnt, cnt);
> +
>   	slab_unlock(page);
>   	/*
>   	 * Keep node_lock to preserve integrity
> @@ -1210,7 +1228,8 @@ static inline int alloc_debug_processing(struct kmem_cache *s,
>   	struct page *page, void *object, unsigned long addr) { return 0; }
>
>   static inline struct kmem_cache_node *free_debug_processing(
> -	struct kmem_cache *s, struct page *page, void *object,
> +	struct kmem_cache *s, struct page *page,
> +	void *head, void *tail, int bulk_cnt,
>   	unsigned long addr, unsigned long *flags) { return NULL; }
>
>   static inline int slab_pad_check(struct kmem_cache *s, struct page *page)
> @@ -1306,6 +1325,31 @@ static inline void slab_free_hook(struct kmem_cache *s, void *x)
>   	kasan_slab_free(s, x);
>   }
>
> +/* Compiler cannot detect that slab_free_freelist_hook() can be
> + * removed if slab_free_hook() evaluates to nothing.  Thus, we need to
> + * catch all relevant config debug options here.
> + */

Is it actually generating nothing but a pointer walking loop or is there 
a bit of code cruft that is being evaluated inside the loop?

> +#if defined(CONFIG_KMEMCHECK) ||		\
> +	defined(CONFIG_LOCKDEP)	||		\
> +	defined(CONFIG_DEBUG_KMEMLEAK) ||	\
> +	defined(CONFIG_DEBUG_OBJECTS_FREE) ||	\
> +	defined(CONFIG_KASAN)
> +static inline void slab_free_freelist_hook(struct kmem_cache *s,
> +					   void *head, void *tail)
> +{
> +	void *object = head;
> +	void *tail_obj = tail ? : head;
> +
> +	do {
> +		slab_free_hook(s, object);
> +	} while ((object != tail_obj) &&
> +		 (object = get_freepointer(s, object)));
> +}
> +#else
> +static inline void slab_free_freelist_hook(struct kmem_cache *s, void *obj_tail,
> +					   void *freelist_head) {}
> +#endif
> +

Instead of messing around with an #else you might just wrap the contents 
of slab_free_freelist_hook in the #if/#endif instead of the entire 
function declaration.
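
I.e. keep one definition and guard just the walk, roughly like this
(a sketch of the suggestion only, not a tested patch):

static inline void slab_free_freelist_hook(struct kmem_cache *s,
					   void *head, void *tail)
{
#if defined(CONFIG_KMEMCHECK) ||		\
	defined(CONFIG_LOCKDEP)	||		\
	defined(CONFIG_DEBUG_KMEMLEAK) ||	\
	defined(CONFIG_DEBUG_OBJECTS_FREE) ||	\
	defined(CONFIG_KASAN)
	void *object = head;
	void *tail_obj = tail ? : head;

	do {
		slab_free_hook(s, object);
	} while ((object != tail_obj) &&
		 (object = get_freepointer(s, object)));
#endif
}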

>   static void setup_object(struct kmem_cache *s, struct page *page,
>   				void *object)
>   {
> @@ -2586,10 +2630,11 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace);
>    * handling required then we can return immediately.
>    */
>   static void __slab_free(struct kmem_cache *s, struct page *page,
> -			void *x, unsigned long addr)
> +			void *head, void *tail, int cnt,
> +			unsigned long addr)
> +
>   {
>   	void *prior;
> -	void **object = (void *)x;
>   	int was_frozen;
>   	struct page new;
>   	unsigned long counters;
> @@ -2599,7 +2644,8 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
>   	stat(s, FREE_SLOWPATH);
>
>   	if (kmem_cache_debug(s) &&
> -		!(n = free_debug_processing(s, page, x, addr, &flags)))
> +	    !(n = free_debug_processing(s, page, head, tail, cnt,
> +					addr, &flags)))
>   		return;
>
>   	do {
> @@ -2609,10 +2655,10 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
>   		}
>   		prior = page->freelist;
>   		counters = page->counters;
> -		set_freepointer(s, object, prior);
> +		set_freepointer(s, tail, prior);
>   		new.counters = counters;
>   		was_frozen = new.frozen;
> -		new.inuse--;
> +		new.inuse -= cnt;
>   		if ((!new.inuse || !prior) && !was_frozen) {
>
>   			if (kmem_cache_has_cpu_partial(s) && !prior) {
> @@ -2643,7 +2689,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
>
>   	} while (!cmpxchg_double_slab(s, page,
>   		prior, counters,
> -		object, new.counters,
> +		head, new.counters,
>   		"__slab_free"));
>
>   	if (likely(!n)) {
> @@ -2708,15 +2754,20 @@ slab_empty:
>    *
>    * If fastpath is not possible then fall back to __slab_free where we deal
>    * with all sorts of special processing.
> + *
> + * Bulk free of a freelist with several objects (all pointing to the
> + * same page) possible by specifying head and tail ptr, plus objects
> + * count (cnt). Bulk free indicated by tail pointer being set.
>    */
> -static __always_inline void slab_free(struct kmem_cache *s,
> -			struct page *page, void *x, unsigned long addr)
> +static __always_inline void slab_free(struct kmem_cache *s, struct page *page,
> +				      void *head, void *tail, int cnt,
> +				      unsigned long addr)
>   {
> -	void **object = (void *)x;
> +	void *tail_obj = tail ? : head;
>   	struct kmem_cache_cpu *c;
>   	unsigned long tid;
>
> -	slab_free_hook(s, x);
> +	slab_free_freelist_hook(s, head, tail);
>
>   redo:
>   	/*
> @@ -2735,19 +2786,19 @@ redo:
>   	barrier();
>
>   	if (likely(page == c->page)) {
> -		set_freepointer(s, object, c->freelist);
> +		set_freepointer(s, tail_obj, c->freelist);
>
>   		if (unlikely(!this_cpu_cmpxchg_double(
>   				s->cpu_slab->freelist, s->cpu_slab->tid,
>   				c->freelist, tid,
> -				object, next_tid(tid)))) {
> +				head, next_tid(tid)))) {
>
>   			note_cmpxchg_failure("slab_free", s, tid);
>   			goto redo;
>   		}
>   		stat(s, FREE_FASTPATH);
>   	} else
> -		__slab_free(s, page, x, addr);
> +		__slab_free(s, page, head, tail_obj, cnt, addr);
>
>   }
>
> @@ -2756,7 +2807,7 @@ void kmem_cache_free(struct kmem_cache *s, void *x)
>   	s = cache_from_obj(s, x);
>   	if (!s)
>   		return;
> -	slab_free(s, virt_to_head_page(x), x, _RET_IP_);
> +	slab_free(s, virt_to_head_page(x), x, NULL, 1, _RET_IP_);
>   	trace_kmem_cache_free(_RET_IP_, x);
>   }
>   EXPORT_SYMBOL(kmem_cache_free);
> @@ -2791,7 +2842,7 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
>   			c->tid = next_tid(c->tid);
>   			local_irq_enable();
>   			/* Slowpath: overhead locked cmpxchg_double_slab */
> -			__slab_free(s, page, object, _RET_IP_);
> +			__slab_free(s, page, object, object, 1, _RET_IP_);
>   			local_irq_disable();
>   			c = this_cpu_ptr(s->cpu_slab);
>   		}
> @@ -3531,7 +3582,7 @@ void kfree(const void *x)
>   		__free_kmem_pages(page, compound_order(page));
>   		return;
>   	}
> -	slab_free(page->slab_cache, page, object, _RET_IP_);
> +	slab_free(page->slab_cache, page, object, NULL, 1, _RET_IP_);
>   }
>   EXPORT_SYMBOL(kfree);
>
>


* Re: [MM PATCH V4 5/6] slub: support for bulk free with SLUB freelists
  2015-09-29 16:38     ` Alexander Duyck
@ 2015-09-29 17:00       ` Jesper Dangaard Brouer
  2015-09-29 17:20         ` Alexander Duyck
  2015-09-30 11:44       ` [MM PATCH V4.1 " Jesper Dangaard Brouer
  1 sibling, 1 reply; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-29 17:00 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: linux-mm, Andrew Morton, Christoph Lameter, netdev, Pekka Enberg,
	David Rientjes, Joonsoo Kim, brouer

On Tue, 29 Sep 2015 09:38:30 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On 09/29/2015 08:48 AM, Jesper Dangaard Brouer wrote:
> > Make it possible to free a freelist with several objects by adjusting
> > API of slab_free() and __slab_free() to have head, tail and an objects
> > counter (cnt).
> >
> > Tail being NULL indicate single object free of head object.  This
> > allow compiler inline constant propagation in slab_free() and
> > slab_free_freelist_hook() to avoid adding any overhead in case of
> > single object free.
> >
> > This allows a freelist with several objects (all within the same
> > slab-page) to be free'ed using a single locked cmpxchg_double in
> > __slab_free() and with an unlocked cmpxchg_double in slab_free().
> >
> > Object debugging on the free path is also extended to handle these
> > freelists.  When CONFIG_SLUB_DEBUG is enabled it will also detect if
> > objects don't belong to the same slab-page.
> >
> > These changes are needed for the next patch to bulk free the detached
> > freelists it introduces and constructs.
> >
> > Micro benchmarking showed no performance reduction due to this change,
> > when debugging is turned off (compiled with CONFIG_SLUB_DEBUG).
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
> >
> > ---
> > V4:
> >   - Change API per req of Christoph Lameter
> >   - Remove comments in init_object.
> >
[...]
> >
> > +/* Compiler cannot detect that slab_free_freelist_hook() can be
> > + * removed if slab_free_hook() evaluates to nothing.  Thus, we need to
> > + * catch all relevant config debug options here.
> > + */
> 
> Is it actually generating nothing but a pointer walking loop or is there 
> a bit of code cruft that is being evaluated inside the loop?

If any of the defines are activated, then slab_free_hook(s, object)
will generate some code.

In the single object free case, the compiler sees that it can remove
the loop, and also notices if slab_free_hook() evaluates to nothing.

The compiler is not smart enough to remove the loop in the multi-object
case, even though it can see that slab_free_hook() evaluates to nothing
(in that case it does a pointer walk without evaluating any code).
Thus, I need this construct.
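
As a stand-alone illustration of the point (just a sketch, not the
kernel code; compile with gcc -O2 and compare the two callers):

struct obj { struct obj *next; };

static inline void free_hook(struct obj *o) { /* empty unless debugging */ }

static inline void freelist_hook(struct obj *head, struct obj *tail)
{
	struct obj *o = head;
	struct obj *tail_obj = tail ? : head;

	do {
		free_hook(o);
	} while ((o != tail_obj) && (o = o->next));
}

void free_single(struct obj *o)
{
	freelist_hook(o, NULL);		/* constant NULL tail: loop folds away */
}

void free_list(struct obj *head, struct obj *tail)
{
	freelist_hook(head, tail);	/* unknown tail: the pointer walk stays */
}

With free_hook() empty, free_single() compiles down to a plain return,
while free_list() keeps the pointer-chasing loop, matching what is
described above; that is why the whole body needs the config guard.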

> > +#if defined(CONFIG_KMEMCHECK) ||		\
> > +	defined(CONFIG_LOCKDEP)	||		\
> > +	defined(CONFIG_DEBUG_KMEMLEAK) ||	\
> > +	defined(CONFIG_DEBUG_OBJECTS_FREE) ||	\
> > +	defined(CONFIG_KASAN)
> > +static inline void slab_free_freelist_hook(struct kmem_cache *s,
> > +					   void *head, void *tail)
> > +{
> > +	void *object = head;
> > +	void *tail_obj = tail ? : head;
> > +
> > +	do {
> > +		slab_free_hook(s, object);
> > +	} while ((object != tail_obj) &&
> > +		 (object = get_freepointer(s, object)));
> > +}
> > +#else
> > +static inline void slab_free_freelist_hook(struct kmem_cache *s, void *obj_tail,
> > +					   void *freelist_head) {}
> > +#endif
> > +
> 
> Instead of messing around with an #else you might just wrap the contents 
> of slab_free_freelist_hook in the #if/#endif instead of the entire 
> function declaration.

I had it that way in an earlier version of the patch, but I liked it
better this way.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [MM PATCH V4 5/6] slub: support for bulk free with SLUB freelists
  2015-09-29 17:00       ` Jesper Dangaard Brouer
@ 2015-09-29 17:20         ` Alexander Duyck
  2015-09-29 18:16           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 50+ messages in thread
From: Alexander Duyck @ 2015-09-29 17:20 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: linux-mm, Andrew Morton, Christoph Lameter, netdev, Pekka Enberg,
	David Rientjes, Joonsoo Kim

On 09/29/2015 10:00 AM, Jesper Dangaard Brouer wrote:
> On Tue, 29 Sep 2015 09:38:30 -0700
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> On 09/29/2015 08:48 AM, Jesper Dangaard Brouer wrote:
>>> +#if defined(CONFIG_KMEMCHECK) ||		\
>>> +	defined(CONFIG_LOCKDEP)	||		\
>>> +	defined(CONFIG_DEBUG_KMEMLEAK) ||	\
>>> +	defined(CONFIG_DEBUG_OBJECTS_FREE) ||	\
>>> +	defined(CONFIG_KASAN)
>>> +static inline void slab_free_freelist_hook(struct kmem_cache *s,
>>> +					   void *head, void *tail)
>>> +{
>>> +	void *object = head;
>>> +	void *tail_obj = tail ? : head;
>>> +
>>> +	do {
>>> +		slab_free_hook(s, object);
>>> +	} while ((object != tail_obj) &&
>>> +		 (object = get_freepointer(s, object)));
>>> +}
>>> +#else
>>> +static inline void slab_free_freelist_hook(struct kmem_cache *s, void *obj_tail,
>>> +					   void *freelist_head) {}
>>> +#endif
>>> +
>> Instead of messing around with an #else you might just wrap the contents
>> of slab_free_freelist_hook in the #if/#endif instead of the entire
>> function declaration.
> I had it that way in an earlier version of the patch, but I liked
> better this way.

It would be nice if the argument names were the same for both cases.  
Having the names differ will make it more difficult to maintain when 
changes need to be made to the function.

- Alex


* Re: [MM PATCH V4 5/6] slub: support for bulk free with SLUB freelists
  2015-09-29 17:20         ` Alexander Duyck
@ 2015-09-29 18:16           ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-29 18:16 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: linux-mm, Andrew Morton, Christoph Lameter, netdev, Pekka Enberg,
	David Rientjes, Joonsoo Kim, brouer

On Tue, 29 Sep 2015 10:20:20 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On 09/29/2015 10:00 AM, Jesper Dangaard Brouer wrote:
> > On Tue, 29 Sep 2015 09:38:30 -0700
> > Alexander Duyck <alexander.duyck@gmail.com> wrote:
> >
> >> On 09/29/2015 08:48 AM, Jesper Dangaard Brouer wrote:
> >>> +#if defined(CONFIG_KMEMCHECK) ||		\
> >>> +	defined(CONFIG_LOCKDEP)	||		\
> >>> +	defined(CONFIG_DEBUG_KMEMLEAK) ||	\
> >>> +	defined(CONFIG_DEBUG_OBJECTS_FREE) ||	\
> >>> +	defined(CONFIG_KASAN)
> >>> +static inline void slab_free_freelist_hook(struct kmem_cache *s,
> >>> +					   void *head, void *tail)
> >>> +{
> >>> +	void *object = head;
> >>> +	void *tail_obj = tail ? : head;
> >>> +
> >>> +	do {
> >>> +		slab_free_hook(s, object);
> >>> +	} while ((object != tail_obj) &&
> >>> +		 (object = get_freepointer(s, object)));
> >>> +}
> >>> +#else
> >>> +static inline void slab_free_freelist_hook(struct kmem_cache *s, void *obj_tail,
> >>> +					   void *freelist_head) {}
> >>> +#endif
> >>> +
> >> Instead of messing around with an #else you might just wrap the contents
> >> of slab_free_freelist_hook in the #if/#endif instead of the entire
> >> function declaration.
> >
> > I had it that way in an earlier version of the patch, but I liked
> > better this way.
> 
> It would be nice if the argument names were the same for both cases.  
> Having the names differ will make it more difficult to maintain when 
> changes need to be made to the function.

Nicely spotted. I forgot to change the arg names of the empty function
when I updated the patch. I guess that is an argument for moving the "if
defined()" into the function body.

It just looked strange to have such a big ifdef block inside the
function.  I also had an earlier version that defined another define and
used that inside the function, but then the code reader would not know
whether this new define was/could be used later (nitpicking alert...)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists
  2015-09-29 16:38     ` Alexander Duyck
  2015-09-29 17:00       ` Jesper Dangaard Brouer
@ 2015-09-30 11:44       ` Jesper Dangaard Brouer
  2015-09-30 16:03         ` Christoph Lameter
  2015-10-01 22:10         ` Andrew Morton
  1 sibling, 2 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-30 11:44 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Christoph Lameter, Alexander Duyck
  Cc: Pekka Enberg, netdev, Joonsoo Kim, David Rientjes,
	Jesper Dangaard Brouer

Make it possible to free a freelist with several objects by adjusting
API of slab_free() and __slab_free() to have head, tail and an objects
counter (cnt).

Tail being NULL indicates single object free of the head object.  This
allows compiler inline constant propagation in slab_free() and
slab_free_freelist_hook() to avoid adding any overhead in case of
single object free.

This allows a freelist with several objects (all within the same
slab-page) to be free'ed using a single locked cmpxchg_double in
__slab_free() and with an unlocked cmpxchg_double in slab_free().

Object debugging on the free path is also extended to handle these
freelists.  When CONFIG_SLUB_DEBUG is enabled it will also detect if
objects don't belong to the same slab-page.

These changes are needed for the next patch to bulk free the detached
freelists it introduces and constructs.

Micro benchmarking showed no performance reduction due to this change,
when debugging is turned off (compiled with CONFIG_SLUB_DEBUG).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>

---
V4:
 - Change API per req of Christoph Lameter
 - Remove comments in init_object.

V4.1:
 - Took Alex'es approach on defines inside slab_free_freelist_hook()

 mm/slub.c |   85 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 67 insertions(+), 18 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 1cf98d89546d..99fcfa8ed0c7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1063,11 +1063,15 @@ bad:
 	return 0;
 }
 
+/* Supports checking bulk free of a constructed freelist */
 static noinline struct kmem_cache_node *free_debug_processing(
-	struct kmem_cache *s, struct page *page, void *object,
+	struct kmem_cache *s, struct page *page,
+	void *head, void *tail, int bulk_cnt,
 	unsigned long addr, unsigned long *flags)
 {
 	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
+	void *object = head;
+	int cnt = 0;
 
 	spin_lock_irqsave(&n->list_lock, *flags);
 	slab_lock(page);
@@ -1075,6 +1079,9 @@ static noinline struct kmem_cache_node *free_debug_processing(
 	if (!check_slab(s, page))
 		goto fail;
 
+next_object:
+	cnt++;
+
 	if (!check_valid_pointer(s, page, object)) {
 		slab_err(s, page, "Invalid object pointer 0x%p", object);
 		goto fail;
@@ -1105,8 +1112,19 @@ static noinline struct kmem_cache_node *free_debug_processing(
 	if (s->flags & SLAB_STORE_USER)
 		set_track(s, object, TRACK_FREE, addr);
 	trace(s, page, object, 0);
+	/* Freepointer not overwritten by init_object(), SLAB_POISON moved it */
 	init_object(s, object, SLUB_RED_INACTIVE);
+
+	/* Reached end of constructed freelist yet? */
+	if (object != tail) {
+		object = get_freepointer(s, object);
+		goto next_object;
+	}
 out:
+	if (cnt != bulk_cnt)
+		slab_err(s, page, "Bulk freelist count(%d) invalid(%d)\n",
+			 bulk_cnt, cnt);
+
 	slab_unlock(page);
 	/*
 	 * Keep node_lock to preserve integrity
@@ -1210,7 +1228,8 @@ static inline int alloc_debug_processing(struct kmem_cache *s,
 	struct page *page, void *object, unsigned long addr) { return 0; }
 
 static inline struct kmem_cache_node *free_debug_processing(
-	struct kmem_cache *s, struct page *page, void *object,
+	struct kmem_cache *s, struct page *page,
+	void *head, void *tail, int bulk_cnt,
 	unsigned long addr, unsigned long *flags) { return NULL; }
 
 static inline int slab_pad_check(struct kmem_cache *s, struct page *page)
@@ -1306,6 +1325,29 @@ static inline void slab_free_hook(struct kmem_cache *s, void *x)
 	kasan_slab_free(s, x);
 }
 
+static inline void slab_free_freelist_hook(struct kmem_cache *s,
+					   void *head, void *tail)
+{
+/*
+ * Compiler cannot detect this function can be removed if slab_free_hook()
+ * evaluates to nothing.  Thus, catch all relevant config debug options here.
+ */
+#if defined(CONFIG_KMEMCHECK) ||		\
+	defined(CONFIG_LOCKDEP)	||		\
+	defined(CONFIG_DEBUG_KMEMLEAK) ||	\
+	defined(CONFIG_DEBUG_OBJECTS_FREE) ||	\
+	defined(CONFIG_KASAN)
+
+	void *object = head;
+	void *tail_obj = tail ? : head;
+
+	do {
+		slab_free_hook(s, object);
+	} while ((object != tail_obj) &&
+		 (object = get_freepointer(s, object)));
+#endif
+}
+
 static void setup_object(struct kmem_cache *s, struct page *page,
 				void *object)
 {
@@ -2586,10 +2628,11 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace);
  * handling required then we can return immediately.
  */
 static void __slab_free(struct kmem_cache *s, struct page *page,
-			void *x, unsigned long addr)
+			void *head, void *tail, int cnt,
+			unsigned long addr)
+
 {
 	void *prior;
-	void **object = (void *)x;
 	int was_frozen;
 	struct page new;
 	unsigned long counters;
@@ -2599,7 +2642,8 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 	stat(s, FREE_SLOWPATH);
 
 	if (kmem_cache_debug(s) &&
-		!(n = free_debug_processing(s, page, x, addr, &flags)))
+	    !(n = free_debug_processing(s, page, head, tail, cnt,
+					addr, &flags)))
 		return;
 
 	do {
@@ -2609,10 +2653,10 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 		}
 		prior = page->freelist;
 		counters = page->counters;
-		set_freepointer(s, object, prior);
+		set_freepointer(s, tail, prior);
 		new.counters = counters;
 		was_frozen = new.frozen;
-		new.inuse--;
+		new.inuse -= cnt;
 		if ((!new.inuse || !prior) && !was_frozen) {
 
 			if (kmem_cache_has_cpu_partial(s) && !prior) {
@@ -2643,7 +2687,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 
 	} while (!cmpxchg_double_slab(s, page,
 		prior, counters,
-		object, new.counters,
+		head, new.counters,
 		"__slab_free"));
 
 	if (likely(!n)) {
@@ -2708,15 +2752,20 @@ slab_empty:
  *
  * If fastpath is not possible then fall back to __slab_free where we deal
  * with all sorts of special processing.
+ *
+ * Bulk free of a freelist with several objects (all pointing to the
+ * same page) possible by specifying head and tail ptr, plus objects
+ * count (cnt). Bulk free indicated by tail pointer being set.
  */
-static __always_inline void slab_free(struct kmem_cache *s,
-			struct page *page, void *x, unsigned long addr)
+static __always_inline void slab_free(struct kmem_cache *s, struct page *page,
+				      void *head, void *tail, int cnt,
+				      unsigned long addr)
 {
-	void **object = (void *)x;
+	void *tail_obj = tail ? : head;
 	struct kmem_cache_cpu *c;
 	unsigned long tid;
 
-	slab_free_hook(s, x);
+	slab_free_freelist_hook(s, head, tail);
 
 redo:
 	/*
@@ -2735,19 +2784,19 @@ redo:
 	barrier();
 
 	if (likely(page == c->page)) {
-		set_freepointer(s, object, c->freelist);
+		set_freepointer(s, tail_obj, c->freelist);
 
 		if (unlikely(!this_cpu_cmpxchg_double(
 				s->cpu_slab->freelist, s->cpu_slab->tid,
 				c->freelist, tid,
-				object, next_tid(tid)))) {
+				head, next_tid(tid)))) {
 
 			note_cmpxchg_failure("slab_free", s, tid);
 			goto redo;
 		}
 		stat(s, FREE_FASTPATH);
 	} else
-		__slab_free(s, page, x, addr);
+		__slab_free(s, page, head, tail_obj, cnt, addr);
 
 }
 
@@ -2756,7 +2805,7 @@ void kmem_cache_free(struct kmem_cache *s, void *x)
 	s = cache_from_obj(s, x);
 	if (!s)
 		return;
-	slab_free(s, virt_to_head_page(x), x, _RET_IP_);
+	slab_free(s, virt_to_head_page(x), x, NULL, 1, _RET_IP_);
 	trace_kmem_cache_free(_RET_IP_, x);
 }
 EXPORT_SYMBOL(kmem_cache_free);
@@ -2791,7 +2840,7 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 			c->tid = next_tid(c->tid);
 			local_irq_enable();
 			/* Slowpath: overhead locked cmpxchg_double_slab */
-			__slab_free(s, page, object, _RET_IP_);
+			__slab_free(s, page, object, object, 1, _RET_IP_);
 			local_irq_disable();
 			c = this_cpu_ptr(s->cpu_slab);
 		}
@@ -3531,7 +3580,7 @@ void kfree(const void *x)
 		__free_kmem_pages(page, compound_order(page));
 		return;
 	}
-	slab_free(page->slab_cache, page, object, _RET_IP_);
+	slab_free(page->slab_cache, page, object, NULL, 1, _RET_IP_);
 }
 EXPORT_SYMBOL(kfree);
 


* Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists
  2015-09-30 11:44       ` [MM PATCH V4.1 " Jesper Dangaard Brouer
@ 2015-09-30 16:03         ` Christoph Lameter
  2015-10-01 22:10         ` Andrew Morton
  1 sibling, 0 replies; 50+ messages in thread
From: Christoph Lameter @ 2015-09-30 16:03 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: linux-mm, Andrew Morton, Alexander Duyck, Pekka Enberg, netdev,
	Joonsoo Kim, David Rientjes

On Wed, 30 Sep 2015, Jesper Dangaard Brouer wrote:

> Make it possible to free a freelist with several objects by adjusting
> API of slab_free() and __slab_free() to have head, tail and an objects
> counter (cnt).

Acked-by: Christoph Lameter <cl@linux.com>


* Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists
  2015-09-30 11:44       ` [MM PATCH V4.1 " Jesper Dangaard Brouer
  2015-09-30 16:03         ` Christoph Lameter
@ 2015-10-01 22:10         ` Andrew Morton
  2015-10-02  9:41           ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 50+ messages in thread
From: Andrew Morton @ 2015-10-01 22:10 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: linux-mm, Christoph Lameter, Alexander Duyck, Pekka Enberg,
	netdev, Joonsoo Kim, David Rientjes

On Wed, 30 Sep 2015 13:44:19 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> Make it possible to free a freelist with several objects by adjusting
> API of slab_free() and __slab_free() to have head, tail and an objects
> counter (cnt).
> 
> Tail being NULL indicate single object free of head object.  This
> allow compiler inline constant propagation in slab_free() and
> slab_free_freelist_hook() to avoid adding any overhead in case of
> single object free.
> 
> This allows a freelist with several objects (all within the same
> slab-page) to be free'ed using a single locked cmpxchg_double in
> __slab_free() and with an unlocked cmpxchg_double in slab_free().
> 
> Object debugging on the free path is also extended to handle these
> freelists.  When CONFIG_SLUB_DEBUG is enabled it will also detect if
> objects don't belong to the same slab-page.
> 
> These changes are needed for the next patch to bulk free the detached
> freelists it introduces and constructs.
> 
> Micro benchmarking showed no performance reduction due to this change,
> when debugging is turned off (compiled with CONFIG_SLUB_DEBUG).
> 

checkpatch says

WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()
#205: FILE: mm/slub.c:2888:
+       BUG_ON(!size);


Linus will get mad at you if he finds out, and we wouldn't want that.

--- a/mm/slub.c~slub-optimize-bulk-slowpath-free-by-detached-freelist-fix
+++ a/mm/slub.c
@@ -2885,7 +2885,8 @@ static int build_detached_freelist(struc
 /* Note that interrupts must be enabled when calling this function. */
 void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 {
-	BUG_ON(!size);
+	if (WARN_ON(!size))
+		return;
 
 	do {
 		struct detached_freelist df;
_



* Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists
  2015-10-01 22:10         ` Andrew Morton
@ 2015-10-02  9:41           ` Jesper Dangaard Brouer
  2015-10-02 10:10             ` Christoph Lameter
  2015-10-02 13:40             ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-10-02  9:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Christoph Lameter, Alexander Duyck, Pekka Enberg,
	netdev, Joonsoo Kim, David Rientjes, brouer

On Thu, 1 Oct 2015 15:10:15 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed, 30 Sep 2015 13:44:19 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> 
> > Make it possible to free a freelist with several objects by adjusting
> > API of slab_free() and __slab_free() to have head, tail and an objects
> > counter (cnt).
> > 
> > Tail being NULL indicate single object free of head object.  This
> > allow compiler inline constant propagation in slab_free() and
> > slab_free_freelist_hook() to avoid adding any overhead in case of
> > single object free.
> > 
> > This allows a freelist with several objects (all within the same
> > slab-page) to be free'ed using a single locked cmpxchg_double in
> > __slab_free() and with an unlocked cmpxchg_double in slab_free().
> > 
> > Object debugging on the free path is also extended to handle these
> > freelists.  When CONFIG_SLUB_DEBUG is enabled it will also detect if
> > objects don't belong to the same slab-page.
> > 
> > These changes are needed for the next patch to bulk free the detached
> > freelists it introduces and constructs.
> > 
> > Micro benchmarking showed no performance reduction due to this change,
> > when debugging is turned off (compiled with CONFIG_SLUB_DEBUG).
> > 
> 
> checkpatch says
> 
> WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()
> #205: FILE: mm/slub.c:2888:
> +       BUG_ON(!size);
> 
> 
> Linus will get mad at you if he finds out, and we wouldn't want that.
> 
> --- a/mm/slub.c~slub-optimize-bulk-slowpath-free-by-detached-freelist-fix
> +++ a/mm/slub.c
> @@ -2885,7 +2885,8 @@ static int build_detached_freelist(struc
>  /* Note that interrupts must be enabled when calling this function. */
>  void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
>  {
> -	BUG_ON(!size);
> +	if (WARN_ON(!size))
> +		return;
>  
>  	do {
>  		struct detached_freelist df;
> _

My problem with this change is that WARN_ON generates (slightly) larger
code size, which is critical for instruction-cache usage...

 [net-next-mm]$ ./scripts/bloat-o-meter vmlinux-with_BUG_ON vmlinux-with_WARN_ON 
 add/remove: 0/0 grow/shrink: 1/0 up/down: 17/0 (17)
 function                                     old     new   delta
 kmem_cache_free_bulk                         438     455     +17

My IP-forwarding benchmark is actually a very challenging use-case,
because the size of the code path a packet has to travel is larger than
the instruction cache of the CPU.

Thus, I need to introduce new code like this patch and at the same time
reduce the number of instruction-cache misses/usage.  In this case we
solve the problem by kmem_cache_free_bulk() not getting called too
often.  So, +17 bytes will hopefully not matter too much... but on the
other hand we sort-of know that calling kmem_cache_free_bulk() will
cause icache misses.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists
  2015-10-02  9:41           ` Jesper Dangaard Brouer
@ 2015-10-02 10:10             ` Christoph Lameter
  2015-10-02 10:40               ` Jesper Dangaard Brouer
  2015-10-02 13:40             ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 50+ messages in thread
From: Christoph Lameter @ 2015-10-02 10:10 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Andrew Morton, linux-mm, Alexander Duyck, Pekka Enberg, netdev,
	Joonsoo Kim, David Rientjes

On Fri, 2 Oct 2015, Jesper Dangaard Brouer wrote:

> Thus, I need introducing new code like this patch and at the same time
> have to reduce the number of instruction-cache misses/usage.  In this
> case we solve the problem by kmem_cache_free_bulk() not getting called
> too often. Thus, +17 bytes will hopefully not matter too much... but on
> the other hand we sort-of know that calling kmem_cache_free_bulk() will
> cause icache misses.

Can we just drop the WARN/BUG here? Nothing untoward happens if size == 0
right?


* Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists
  2015-10-02 10:10             ` Christoph Lameter
@ 2015-10-02 10:40               ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-10-02 10:40 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, linux-mm, Alexander Duyck, Pekka Enberg, netdev,
	Joonsoo Kim, David Rientjes, brouer

On Fri, 2 Oct 2015 05:10:02 -0500 (CDT)
Christoph Lameter <cl@linux.com> wrote:

> On Fri, 2 Oct 2015, Jesper Dangaard Brouer wrote:
> 
> > Thus, I need introducing new code like this patch and at the same time
> > have to reduce the number of instruction-cache misses/usage.  In this
> > case we solve the problem by kmem_cache_free_bulk() not getting called
> > too often. Thus, +17 bytes will hopefully not matter too much... but on
> > the other hand we sort-of know that calling kmem_cache_free_bulk() will
> > cause icache misses.
> 
> Can we just drop the WARN/BUG here? Nothing untoward happens if size == 0
> right?

I think we crash if size == 0, as we deref p[--size] in build_detached_freelist().


(aside note: the code does handle NULL pointers in the p array)
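
For illustration, the first-object scan in build_detached_freelist()
is roughly of this shape (a simplified sketch, not the exact patch code):

	do {
		object = p[--size];	/* with size == 0 the size_t wraps to
					 * SIZE_MAX and this reads far outside
					 * the p[] array */
	} while (!object && size);	/* the !object test is what skips
					 * NULL entries */

So a size == 0 call turns into a wild read rather than anything
recoverable.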
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists
  2015-10-02  9:41           ` Jesper Dangaard Brouer
  2015-10-02 10:10             ` Christoph Lameter
@ 2015-10-02 13:40             ` Jesper Dangaard Brouer
  2015-10-02 21:50               ` Andrew Morton
  1 sibling, 1 reply; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-10-02 13:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Christoph Lameter, Alexander Duyck, Pekka Enberg,
	netdev, Joonsoo Kim, David Rientjes, brouer,
	Hannes Frederic Sowa

On Fri, 2 Oct 2015 11:41:18 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Thu, 1 Oct 2015 15:10:15 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Wed, 30 Sep 2015 13:44:19 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > 
> > > Make it possible to free a freelist with several objects by adjusting
> > > API of slab_free() and __slab_free() to have head, tail and an objects
> > > counter (cnt).
> > > 
> > > Tail being NULL indicate single object free of head object.  This
> > > allow compiler inline constant propagation in slab_free() and
> > > slab_free_freelist_hook() to avoid adding any overhead in case of
> > > single object free.
> > > 
> > > This allows a freelist with several objects (all within the same
> > > slab-page) to be free'ed using a single locked cmpxchg_double in
> > > __slab_free() and with an unlocked cmpxchg_double in slab_free().
> > > 
> > > Object debugging on the free path is also extended to handle these
> > > freelists.  When CONFIG_SLUB_DEBUG is enabled it will also detect if
> > > objects don't belong to the same slab-page.
> > > 
> > > These changes are needed for the next patch to bulk free the detached
> > > freelists it introduces and constructs.
> > > 
> > > Micro benchmarking showed no performance reduction due to this change,
> > > when debugging is turned off (compiled with CONFIG_SLUB_DEBUG).
> > > 
> > 
> > checkpatch says
> > 
> > WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()
> > #205: FILE: mm/slub.c:2888:
> > +       BUG_ON(!size);
> > 
> > 
> > Linus will get mad at you if he finds out, and we wouldn't want that.
> > 
> > --- a/mm/slub.c~slub-optimize-bulk-slowpath-free-by-detached-freelist-fix
> > +++ a/mm/slub.c
> > @@ -2885,7 +2885,8 @@ static int build_detached_freelist(struc
> >  /* Note that interrupts must be enabled when calling this function. */
> >  void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
> >  {
> > -	BUG_ON(!size);
> > +	if (WARN_ON(!size))
> > +		return;
> >  
> >  	do {
> >  		struct detached_freelist df;
> > _
> 
> My problem with this change is that WARN_ON generates (slightly) larger
> code size, which is critical for instruction-cache usage...
> 
>  [net-next-mm]$ ./scripts/bloat-o-meter vmlinux-with_BUG_ON vmlinux-with_WARN_ON 
>  add/remove: 0/0 grow/shrink: 1/0 up/down: 17/0 (17)
>  function                                     old     new   delta
>  kmem_cache_free_bulk                         438     455     +17
> 
> My IP-forwarding benchmark is actually a very challenging use-case,
> because the code path "size" a packet have to travel is larger than the
> instruction-cache of the CPU.
> 
> Thus, I need introducing new code like this patch and at the same time
> have to reduce the number of instruction-cache misses/usage.  In this
> case we solve the problem by kmem_cache_free_bulk() not getting called
> too often. Thus, +17 bytes will hopefully not matter too much... but on
> the other hand we sort-of know that calling kmem_cache_free_bulk() will
> cause icache misses.

I just tested this change on top of my net-use-case patchset... and for
some strange reason the code with this WARN_ON is faster and has far
fewer icache misses (1,278,276 vs 2,719,158 L1-icache-load-misses).

Thus, I think we should keep your fix.

I cannot explain why using WARN_ON() is better and causes fewer icache
misses.  And I hate it when I don't understand every detail.

My theory, after reading the assembler code, is that the UD2
instruction (from BUG_ON) causes some kind of icache decoder stall
(Intel experts???).  Normally that should not be a problem, as the UD2
is placed in an unlikely branch and left at the end of the generated
function.  But the call to __slab_free() is also placed at the end of
the function (slab_free() gets inlined and its __slab_free() call is
treated as unlikely), and it is actually fairly likely that bulking
ends up calling __slab_free() (the slub slowpath).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists
  2015-10-02 13:40             ` Jesper Dangaard Brouer
@ 2015-10-02 21:50               ` Andrew Morton
  2015-10-05 19:26                 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 50+ messages in thread
From: Andrew Morton @ 2015-10-02 21:50 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: linux-mm, Christoph Lameter, Alexander Duyck, Pekka Enberg,
	netdev, Joonsoo Kim, David Rientjes, Hannes Frederic Sowa

On Fri, 2 Oct 2015 15:40:39 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> > Thus, I need introducing new code like this patch and at the same time
> > have to reduce the number of instruction-cache misses/usage.  In this
> > case we solve the problem by kmem_cache_free_bulk() not getting called
> > too often. Thus, +17 bytes will hopefully not matter too much... but on
> > the other hand we sort-of know that calling kmem_cache_free_bulk() will
> > cause icache misses.
> 
> I just tested this change on top of my net-use-case patchset... and for
> some strange reason the code with this WARN_ON is faster and have much
> less icache-misses (1,278,276 vs 2,719,158 L1-icache-load-misses).
> 
> Thus, I think we should keep your fix.
> 
> I cannot explain why using WARN_ON() is better and cause less icache
> misses.  And I hate when I don't understand every detail.
> 
>  My theory is, after reading the assembler code, that the UD2
> instruction (from BUG_ON) cause some kind of icache decoder stall
> (Intel experts???).  Now that should not be a problem, as UD2 is
> obviously placed as an unlikely branch and left at the end of the asm
> function call.  But the call to __slab_free() is also placed at the end
> of the asm function (gets inlined from slab_free() as unlikely).  And
> it is actually fairly likely that bulking is calling __slab_free (slub
> slowpath call).

Yes, I was looking at the asm code and the difference is pretty small:
a not-taken ud2 versus a not-taken "call warn_slowpath_null", mainly.

But I wouldn't assume that the microbenchmarking is meaningful.  I've
seen shockingly large (and quite repeatable) microbenchmarking
differences from small changes in code which isn't even executed (and
this is one such case, actually).  You add or remove just one byte of
text and half the kernel (or half the .o file?) gets a different
alignment and this seems to change everything.

Deleting the BUG altogether sounds like the best solution.  As long as
the kernel crashes in some manner, we'll be able to work out what
happened.  And it's can't-happen anyway, isn't it?


* Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists
  2015-10-02 21:50               ` Andrew Morton
@ 2015-10-05 19:26                 ` Jesper Dangaard Brouer
  2015-10-05 21:20                   ` Andi Kleen
                                     ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-10-05 19:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Christoph Lameter, Alexander Duyck, Pekka Enberg,
	netdev, Joonsoo Kim, David Rientjes, Hannes Frederic Sowa,
	brouer, Andi Kleen, Arnaldo Carvalho de Melo

On Fri, 2 Oct 2015 14:50:44 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 2 Oct 2015 15:40:39 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> 
> > > Thus, I need introducing new code like this patch and at the same time
> > > have to reduce the number of instruction-cache misses/usage.  In this
> > > case we solve the problem by kmem_cache_free_bulk() not getting called
> > > too often. Thus, +17 bytes will hopefully not matter too much... but on
> > > the other hand we sort-of know that calling kmem_cache_free_bulk() will
> > > cause icache misses.
> > 
> > I just tested this change on top of my net-use-case patchset... and for
> > some strange reason the code with this WARN_ON is faster and have much
> > less icache-misses (1,278,276 vs 2,719,158 L1-icache-load-misses).
> > 
> > Thus, I think we should keep your fix.
> > 
> > I cannot explain why using WARN_ON() is better and cause less icache
> > misses.  And I hate when I don't understand every detail.
> > 
> >  My theory is, after reading the assembler code, that the UD2
> > instruction (from BUG_ON) cause some kind of icache decoder stall
> > (Intel experts???).  Now that should not be a problem, as UD2 is
> > obviously placed as an unlikely branch and left at the end of the asm
> > function call.  But the call to __slab_free() is also placed at the end
> > of the asm function (gets inlined from slab_free() as unlikely).  And
> > it is actually fairly likely that bulking is calling __slab_free (slub
> > slowpath call).
> 
> Yes, I was looking at the asm code and the difference is pretty small:
> a not-taken ud2 versus a not-taken "call warn_slowpath_null", mainly.

I actually managed to find documentation that the UD2 instruction does
stop the instruction prefetcher.

From "Intel® 64 and IA-32 Architectures Optimization Reference Manual"
 Section 3.4.1.6 "Branch Type Selection":
 Cite:
  "[...] follow the indirect branch with a UD2 instruction, which will
   stop the processor from decoding down the fall-through path."

And from LWN article: https://lwn.net/Articles/255364/
 Cite: "[...] like using the ud2 instruction [...]. This instruction,
 which cannot be executed itself, is used after an indirect jump
 instruction; it is used as a signal to the instruction fetcher that
 the processor should not waste efforts decoding the following memory
 since the execution will continue at a different location."

Thus, we were hitting a very unfortunate combination of code, where the
UD2 instruction was located just before the call to __slab_free() in
the asm code, causing instruction prefetching to stop.

Now I understand all the details again :-)
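
For reference, a simplified sketch of the two alternatives, with
comments summarizing what the disassembly showed on this build:

	/* BUG_ON variant: the unlikely branch becomes a ud2, emitted in the
	 * out-of-line tail of the function just before the __slab_free()
	 * call site, and ud2 stops the decoder from fetching past it. */
	BUG_ON(!size);

	/* WARN_ON variant: the unlikely branch is an ordinary
	 * "call warn_slowpath_null", so decoding continues across it. */
	if (WARN_ON(!size))
		return;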

My only remaining problem is that I want a perf measurement that
pinpoints these kinds of spots.  The difference in L1-icache-load-misses
was significant (1,278,276 vs 2,719,158).  I tried to perf record this
with different perf events without being able to pinpoint the location
(even though I know the spot now).  I even tried Andi's ocperf.py...
maybe he will know what event I should try?


> But I wouldn't assume that the microbenchmarking is meaningful.  I've
> seen shockingly large (and quite repeatable) microbenchmarking
> differences from small changes in code which isn't even executed (and
> this is one such case, actually).  You add or remove just one byte of
> text and half the kernel (or half the .o file?) gets a different
> alignment and this seems to change everything.

I do know alignment of code is important, but this performance impact
was just too large.  And I think I found documentation to back my theory.

 
> Deleting the BUG altogether sounds the best solution.  As long as the
> kernel crashes in some manner, we'll be able to work out what happened.
> And it's cant-happen anyway, isn't it?

To me WARN_ON() seems like a good "documentation" if it does not hurt
performance.  I don't think removing the WARN_ON() will improve
performance, but I'm willing to actually test if it matters.

Below is how it will crash; it is fairly obvious (if you know the
x86_64 calling convention).

API is: void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
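
For reference, per the x86_64 SysV calling convention the three
arguments arrive in registers as follows, so %RSI is the one to watch
in the register dump:

	s    -> %rdi
	size -> %rsi
	p    -> %rdx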

Below, it clearly shows the second arg "size" in %RSI == 0.

[ 2986.977708] BUG: unable to handle kernel paging request at ffffffffa03c4d58
[ 2986.977760] IP: [<ffffffff8116d044>] kmem_cache_free_bulk+0x54/0x1b0
[ 2986.977805] PGD 19a4067 PUD 19a5063 PMD 408a66067 PTE 3f6407161
[ 2986.977879] Oops: 0003 [#1] SMP 
[ 2986.977936] Modules linked in: slab_bulk_test03(O+) time_bench(O) netconsole tun coretemp kvm_intel kvm mxm_wmi microcode sg i2c_i801 i2c_core pcspkr wmi video shpchp acpi_pad nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc serio_raw sd_mod hid_generic ixgbe mdio vxlan ip6_udp_tunnel e1000e udp_tunnel ptp pps_core mlx5_core [last unloaded: slab_bulk_test03]
[ 2986.978190] CPU: 1 PID: 12495 Comm: modprobe Tainted: G           O    4.3.0-rc2-net-next-mm+ #442
[ 2986.978265] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Extreme4, BIOS P2.10 05/12/2015
[ 2986.978341] task: ffff88040a89d200 ti: ffff8803e3c50000 task.ti: ffff8803e3c50000
[ 2986.978414] RIP: 0010:[<ffffffff8116d044>]  [<ffffffff8116d044>] kmem_cache_free_bulk+0x54/0x1b0
[ 2986.978490] RSP: 0018:ffff8803e3c53aa8  EFLAGS: 00010282
[ 2986.978528] RAX: 0000000000000000 RBX: ffff8803e3c53c30 RCX: 0000000100200016
[ 2986.978571] RDX: ffff8803e3c53ad8 RSI: 0000000000000000 RDI: 000077ff80000000
[ 2986.978614] RBP: ffff8803e3c53ad0 R08: ffff8803e3db3600 R09: 00000000e3db3b01
[ 2986.978655] R10: ffffffffa03c4d58 R11: 0000000000000001 R12: ffffffffffffffff
[ 2986.978698] R13: ffff8803e3c53ae0 R14: 000077ff80000000 R15: ffff88040d774c00
[ 2986.978740] FS:  00007fbf31190740(0000) GS:ffff88041fa40000(0000) knlGS:0000000000000000
[ 2986.978814] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2986.978853] CR2: ffffffffa03c4d58 CR3: 00000003f6406000 CR4: 00000000001406e0
[ 2986.978895] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2986.978938] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2986.978979] Stack:
[ 2986.979014]  ffff8803e3c53c30 0000000000000009 0000000000000000 ffffffffa002e000
[ 2986.979099]  ffff8803fc40db40 ffff8803e3c53cf0 ffffffffa03c4d58 0000000000000000
[ 2986.979184]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 2986.979259] Call Trace:
[ 2986.979296]  [<ffffffffa002e000>] ? 0xffffffffa002e000
[ 2986.979335]  [<ffffffffa03c4d58>] run_try_crash_tests+0x378/0x3a6 [slab_bulk_test03]
[ 2986.979409]  [<ffffffffa002e12d>] slab_bulk_test03_module_init+0x12d/0x1000 [slab_bulk_test03]
[ 2986.979485]  [<ffffffff810002ed>] do_one_initcall+0xad/0x1d0
[ 2986.979526]  [<ffffffff8111c117>] do_init_module+0x60/0x1e8
[ 2986.979567]  [<ffffffff810c2778>] load_module+0x1bb8/0x2420
[ 2986.979608]  [<ffffffff810bf1b0>] ? __symbol_put+0x40/0x40
[ 2986.979647]  [<ffffffff810c31b0>] SyS_finit_module+0x80/0xb0
[ 2986.979690]  [<ffffffff8166b697>] entry_SYSCALL_64_fastpath+0x12/0x6a
[ 2986.979732] Code: f8 eb 05 4d 85 e4 74 13 4c 8b 10 49 83 ec 01 48 89 c2 48 83 e8 08 4d 85 d2 74 e8 4d 85 d2 0f 84 2e 01 00 00 49 63 47 20 4c 89 f7 <49> c7 04 02 00 00 00 00 b8 00 00 00 80 4c 01 d0 48 0f 42 3d b4 
[ 2986.979904] RIP  [<ffffffff8116d044>] kmem_cache_free_bulk+0x54/0x1b0
[ 2986.979945]  RSP <ffff8803e3c53aa8>
[ 2986.979982] CR2: ffffffffa03c4d58
[ 2986.980130] ---[ end trace 64f42b02f4347220 ]---


(gdb) list *(kmem_cache_free_bulk)+0x54
0xffffffff8116d044 is in kmem_cache_free_bulk (mm/slub.c:269).
264		return p;
265	}
266	
267	static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
268	{
269		*(void **)(object + s->offset) = fp;
270	}
271	
272	/* Loop over all objects in a slab */
273	#define for_each_object(__p, __s, __addr, __objects) \
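
A short note on reading the dump: CR2 equals %R10 (0xffffffffa03c4d58),
i.e. the garbage "object" value that the size == 0 underflow picked up
from outside the p[] array, and the faulting statement shown by gdb
above is the freepointer store through that pointer.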


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists
  2015-10-05 19:26                 ` Jesper Dangaard Brouer
@ 2015-10-05 21:20                   ` Andi Kleen
  2015-10-05 23:07                     ` Jesper Dangaard Brouer
  2015-10-05 23:53                   ` Jesper Dangaard Brouer
  2015-10-07 10:39                   ` Jesper Dangaard Brouer
  2 siblings, 1 reply; 50+ messages in thread
From: Andi Kleen @ 2015-10-05 21:20 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Andrew Morton, linux-mm, Christoph Lameter, Alexander Duyck,
	Pekka Enberg, netdev, Joonsoo Kim, David Rientjes,
	Hannes Frederic Sowa, Arnaldo Carvalho de Melo

> My only problem left, is I want a perf measurement that pinpoint these
> kind of spots.  The difference in L1-icache-load-misses were significant
> (1,278,276 vs 2,719,158).  I tried to somehow perf record this with
> different perf events without being able to pinpoint the location (even
> though I know the spot now).  Even tried Andi's ocperf.py... maybe he
> will know what event I should try?

Run pmu-tools toplev.py -l3 with --show-sample. It tells you what the
bottleneck is and what to sample for, if there is a suitable event, and
even prints the command line.

https://github.com/andikleen/pmu-tools/wiki/toplev-manual#sampling-with-toplev

However frontend issues are difficult to sample, as they happen very far
away from instruction retirement where the sampling happens. So you may
have large skid and the sampling points may be far away. Skylake has new
special FRONTEND_* PEBS events for this, but before it was often difficult. 

BTW, if your main goal is icache, I wrote a gcc patch to help the kernel
by enabling function splitting: Apply the patch in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66890 to gcc 5,
make sure 9bebe9e5b0f (now in mainline) is applied and build with
-freorder-blocks-and-partition. That will split all functions into
statically predicted hot and cold parts and generally relieves
icache pressure. Any testing of this on your workload welcome.

-Andi


* Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists
  2015-10-05 21:20                   ` Andi Kleen
@ 2015-10-05 23:07                     ` Jesper Dangaard Brouer
  2015-10-07 12:31                       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-10-05 23:07 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-mm, Christoph Lameter, netdev, Arnaldo Carvalho de Melo, brouer

(trimmed Cc list a little)

On Mon, 5 Oct 2015 14:20:45 -0700 Andi Kleen <ak@linux.intel.com> wrote:

> > My only problem left, is I want a perf measurement that pinpoint these
> > kind of spots.  The difference in L1-icache-load-misses were significant
> > (1,278,276 vs 2,719,158).  I tried to somehow perf record this with
> > different perf events without being able to pinpoint the location (even
> > though I know the spot now).  Even tried Andi's ocperf.py... maybe he
> > will know what event I should try?
> 
> Run pmu-tools toplev.py -l3 with --show-sample. It tells you what the
> bottle neck is and what to sample for if there is a suitable event and
> even prints the command line.
> 
> https://github.com/andikleen/pmu-tools/wiki/toplev-manual#sampling-with-toplev
> 

My result (from an IP-forward flow hitting CPU 0):
 $ sudo ./toplev.py -I 1000 -l3 -a --show-sample --core C0

So, what does this tell me?

 C0    BAD     Bad_Speculation:                                 0.00 % [  5.50%]
 C0    BE      Backend_Bound:                                 100.00 % [  5.50%]
 C0    BE/Mem  Backend_Bound.Memory_Bound:                     53.06 % [  5.50%]
 C0    BE/Core Backend_Bound.Core_Bound:                       46.94 % [  5.50%]
 C0-T0 FE      Frontend_Bound.Frontend_Latency.Branch_Resteers: 5.42 % [  5.50%]
 C0-T0 BE/Mem  Backend_Bound.Memory_Bound.L1_Bound:            54.51 % [  5.50%]
 C0-T0 BE/Core Backend_Bound.Core_Bound.Ports_Utilization:     20.99 % [  5.60%]
 C0-T0         CPU utilization: 1.00 CPUs   	[100.00%]
 C0-T1 FE      Frontend_Bound.Frontend_Latency.Branch_Resteers: 6.04 % [  5.50%]
 C0-T1         CPU utilization: 1.00 CPUs   	[100.00%]

Unfortunately the perf command it gives me fails with:
 "invalid or unsupported event".

Perf command:

 perf record -g -e cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES:pp,period=400009/pp,cpu/event=0xd,umask=0x3,cmask=1,name=Bad_Speculation_INT_MISC_RECOVERY_CYCLES,period=2000003/,cpu/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_L1_HIT:pp,period=2000003/pp,cpu/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_HIT_LFB:pp,period=100003/pp -C 0,4 -a


> However frontend issues are difficult to sample, as they happen very far
> away from instruction retirement where the sampling happens. So you may
> have large skid and the sampling points may be far away. Skylake has new
> special FRONTEND_* PEBS events for this, but before it was often difficult. 

This testlab CPU is an i7-4790K @ 4.00GHz.  Maybe I should get a Skylake...


p.s. thanks for your pmu-tools[1], even though I don't know how to use
most of them ;-)
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

[1] https://github.com/andikleen/pmu-tools


* Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists
  2015-10-05 19:26                 ` Jesper Dangaard Brouer
  2015-10-05 21:20                   ` Andi Kleen
@ 2015-10-05 23:53                   ` Jesper Dangaard Brouer
  2015-10-07 10:39                   ` Jesper Dangaard Brouer
  2 siblings, 0 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-10-05 23:53 UTC (permalink / raw)
  Cc: linux-mm, Christoph Lameter, netdev, Andi Kleen,
	Arnaldo Carvalho de Melo, brouer

On Mon, 5 Oct 2015 21:26:39 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> My only problem left is that I want a perf measurement that pinpoints
> these kinds of spots.  The difference in L1-icache-load-misses was
> significant (1,278,276 vs 2,719,158).  I tried to somehow perf record
> this with different perf events without being able to pinpoint the
> location (even though I know the spot now).  I even tried Andi's
> ocperf.py... maybe he will know what event I should try?

Using 'ocperf.py -e icache_misses', looking closer at the perf annotate
output, and taking "skid" into account, I think I can see the icache
misses happening at the end of the function, due to the UD2 instruction.

Annotation of kmem_cache_free_bulk (last/end of func):

       │17b:   test   %r12,%r12
       │     ↑ jne    2e
       │184:   pop    %rbx
       │       pop    %r12
       │       pop    %r13
       │       pop    %r14
       │       pop    %r15
       │       pop    %rbp
       │     ← retq
  8.57 │18f:   mov    0x30(%rdx),%rdx
  5.71 │     ↑ jmp    116
       │195:   ud2
  2.86 │197:   mov    %rdi,%rsi
       │       mov    %r11d,%r8d
       │       mov    %r10,%rcx
       │       mov    %rbx,%rdx
       │       mov    %r15,%rdi
       │     → callq  __slab_free
       │     ↑ jmp    17b
  2.86 │1ad:   mov    0x30(%rdi),%rdi
       │     ↑ jmpq   99

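The ud2 at offset 195 is what BUG()/BUG_ON() expands to on x86, so a
BUG_ON() in this path leaves that cold trap (plus its register set-up)
in the tail of the function's text.  A rough sketch of the expansion
(illustrative only; MY_-prefixed so it is not mistaken for the kernel's
real macros, which additionally record file/line info in a bug table):

	/* Sketch: why a BUG_ON() shows up as a ud2 in the annotation above */
	#define MY_BUG()						\
	do {								\
		asm volatile("ud2");	/* the trap seen at 195 */	\
		unreachable();		/* as in <linux/compiler.h> */	\
	} while (0)

	#define MY_BUG_ON(cond)						\
	do {								\
		if (unlikely(cond))					\
			MY_BUG();					\
	} while (0)
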
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists
  2015-10-05 19:26                 ` Jesper Dangaard Brouer
  2015-10-05 21:20                   ` Andi Kleen
  2015-10-05 23:53                   ` Jesper Dangaard Brouer
@ 2015-10-07 10:39                   ` Jesper Dangaard Brouer
  2 siblings, 0 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-10-07 10:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Christoph Lameter, Alexander Duyck, netdev,
	Arnaldo Carvalho de Melo, brouer


On Mon, 5 Oct 2015 21:26:39 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> On Fri, 2 Oct 2015 14:50:44 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
> 
[...] 
>  
> > Deleting the BUG altogether sounds like the best solution.  As long as
> > the kernel crashes in some manner, we'll be able to work out what
> > happened.  And it's a can't-happen case anyway, isn't it?
> 
> To me WARN_ON() seems like good "documentation", if it does not hurt
> performance.  I don't think removing the WARN_ON() will improve
> performance, but I'm willing to actually test whether it matters.

I tested removing the BUG/WARN_ON altogether, and it gives slightly worse
performance.  The icache misses only increase by approx 14% (not 112% as
before), which I'm willing to attribute to some code alignment issue.

Thus, let us just keep the WARN_ON() and move along.
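
For the archive, a rough sketch of the shape the bulk free entry point
ends up with when the check is kept as a WARN_ON() (based on the V4
detached-freelist version; the exact merged code may differ in details):

	/* Note that interrupts must be enabled when calling this function. */
	void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
	{
		/* Cheap "documentation" of the can't-happen case, no BUG()/ud2 */
		if (WARN_ON(!size))
			return;

		do {
			struct detached_freelist df;

			size = build_detached_freelist(s, size, p, &df);
			if (unlikely(!df.page))
				continue;

			/* Free the whole detached freelist with one slab_free() call */
			slab_free(s, df.page, df.freelist, df.tail, df.cnt, _RET_IP_);
		} while (likely(size));
	}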

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists
  2015-10-05 23:07                     ` Jesper Dangaard Brouer
@ 2015-10-07 12:31                       ` Jesper Dangaard Brouer
  2015-10-07 13:36                         ` Arnaldo Carvalho de Melo
  2015-10-07 16:06                         ` Andi Kleen
  0 siblings, 2 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-10-07 12:31 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-mm, netdev, Arnaldo Carvalho de Melo, brouer

On Tue, 6 Oct 2015 01:07:03 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> (trimmed Cc list a little)
> 
> On Mon, 5 Oct 2015 14:20:45 -0700 Andi Kleen <ak@linux.intel.com> wrote:
> 
> > > My only problem left is that I want a perf measurement that pinpoints
> > > these kinds of spots.  The difference in L1-icache-load-misses was
> > > significant (1,278,276 vs 2,719,158).  I tried to somehow perf record
> > > this with different perf events without being able to pinpoint the
> > > location (even though I know the spot now).  I even tried Andi's
> > > ocperf.py... maybe he will know what event I should try?
> > 
> > Run pmu-tools toplev.py -l3 with --show-sample. It tells you what the
> > bottleneck is and what to sample for if there is a suitable event, and
> > it even prints the command line.
> > 
> > https://github.com/andikleen/pmu-tools/wiki/toplev-manual#sampling-with-toplev
> > 
> 
> My result from (IP-forward flow hitting CPU 0):
>  $ sudo ./toplev.py -I 1000 -l3 -a --show-sample --core C0
> 
> So, what does this tell me?:
>
>  C0    BAD     Bad_Speculation:                                 0.00 % [  5.50%]
>  C0    BE      Backend_Bound:                                 100.00 % [  5.50%]
>  C0    BE/Mem  Backend_Bound.Memory_Bound:                     53.06 % [  5.50%]
>  C0    BE/Core Backend_Bound.Core_Bound:                       46.94 % [  5.50%]
>  C0-T0 FE      Frontend_Bound.Frontend_Latency.Branch_Resteers: 5.42 % [  5.50%]
>  C0-T0 BE/Mem  Backend_Bound.Memory_Bound.L1_Bound:            54.51 % [  5.50%]
>  C0-T0 BE/Core Backend_Bound.Core_Bound.Ports_Utilization:     20.99 % [  5.60%]
>  C0-T0         CPU utilization: 1.00 CPUs   	[100.00%]
>  C0-T1 FE      Frontend_Bound.Frontend_Latency.Branch_Resteers: 6.04 % [  5.50%]
>  C0-T1         CPU utilization: 1.00 CPUs   	[100.00%]

Reading https://github.com/andikleen/pmu-tools/wiki/toplev-manual
helped me understand most of the above.

My specific CPU (i7-4790K @ 4.00GHz) unfortunately seems to have
limited "Frontend" support. E.g. 

 # perf record -g -a -e stalled-cycles-frontend
 Error:
 The stalled-cycles-frontend event is not supported.

And AFAIK icache misses are part of "frontend".


> Unfortunately the perf command it gives me fails with:
>  "invalid or unsupported event".
> 
> Perf command:
> 
>  sudo ./ocperf.py record -g -e \
  cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES:pp,period=400009/pp,\
  cpu/event=0xd,umask=0x3,cmask=1,name=Bad_Speculation_INT_MISC_RECOVERY_CYCLES,period=2000003/,\
  cpu/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_L1_HIT:pp,period=2000003/pp,\
  cpu/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_HIT_LFB:pp,period=100003/pp \
  -C 0,4 -a

I fixed the problem with this perf command by removing the ":pp" part.
Perhaps your tool needs to fix that?

A working command line looks like this:

 sudo ./ocperf.py record -g -e \
cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/pp,\
cpu/event=0xd,umask=0x3,cmask=1,name=Bad_Speculation_INT_MISC_RECOVERY_CYCLES,period=2000003/,\
cpu/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_L1_HIT,period=2000003/pp,\
cpu/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_HIT_LFB,period=100003/pp \
  -C 0,4 -a

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists
  2015-10-07 12:31                       ` Jesper Dangaard Brouer
@ 2015-10-07 13:36                         ` Arnaldo Carvalho de Melo
  2015-10-07 15:44                           ` Andi Kleen
  2015-10-07 16:06                         ` Andi Kleen
  1 sibling, 1 reply; 50+ messages in thread
From: Arnaldo Carvalho de Melo @ 2015-10-07 13:36 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Andi Kleen, linux-mm, netdev

Em Wed, Oct 07, 2015 at 02:31:20PM +0200, Jesper Dangaard Brouer escreveu:
> On Tue, 6 Oct 2015 01:07:03 +0200
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > (trimmed Cc list a little)
> > 
> > On Mon, 5 Oct 2015 14:20:45 -0700 Andi Kleen <ak@linux.intel.com> wrote:
> > 
> > > > My only problem left is that I want a perf measurement that pinpoints
> > > > these kinds of spots.  The difference in L1-icache-load-misses was
> > > > significant (1,278,276 vs 2,719,158).  I tried to somehow perf record
> > > > this with different perf events without being able to pinpoint the
> > > > location (even though I know the spot now).  I even tried Andi's
> > > > ocperf.py... maybe he will know what event I should try?
> > > 
> > > Run pmu-tools toplev.py -l3 with --show-sample. It tells you what the
> > > bottleneck is and what to sample for if there is a suitable event, and
> > > it even prints the command line.
> > > 
> > > https://github.com/andikleen/pmu-tools/wiki/toplev-manual#sampling-with-toplev
> > > 
> > 
> > My result from (IP-forward flow hitting CPU 0):
> >  $ sudo ./toplev.py -I 1000 -l3 -a --show-sample --core C0
> > 
> > So, what does this tell me?:
> >
> >  C0    BAD     Bad_Speculation:                                 0.00 % [  5.50%]
> >  C0    BE      Backend_Bound:                                 100.00 % [  5.50%]
> >  C0    BE/Mem  Backend_Bound.Memory_Bound:                     53.06 % [  5.50%]
> >  C0    BE/Core Backend_Bound.Core_Bound:                       46.94 % [  5.50%]
> >  C0-T0 FE      Frontend_Bound.Frontend_Latency.Branch_Resteers: 5.42 % [  5.50%]
> >  C0-T0 BE/Mem  Backend_Bound.Memory_Bound.L1_Bound:            54.51 % [  5.50%]
> >  C0-T0 BE/Core Backend_Bound.Core_Bound.Ports_Utilization:     20.99 % [  5.60%]
> >  C0-T0         CPU utilization: 1.00 CPUs   	[100.00%]
> >  C0-T1 FE      Frontend_Bound.Frontend_Latency.Branch_Resteers: 6.04 % [  5.50%]
> >  C0-T1         CPU utilization: 1.00 CPUs   	[100.00%]
> 
> Reading https://github.com/andikleen/pmu-tools/wiki/toplev-manual
> helped me understand most of the above.
> 
> My specific CPU (i7-4790K @ 4.00GHz) unfortunately seems to have
> limited "Frontend" support. E.g. 
> 
>  # perf record -g -a -e stalled-cycles-frontend
>  Error:
>  The stalled-cycles-frontend event is not supported.
> 
> And AFAIK icache misses are part of "frontend".
> 
> 
> > Unfortunately the perf command it gives me fails with:
> >  "invalid or unsupported event".
> > 
> > Perf command:
> > 
> >  sudo ./ocperf.py record -g -e \
>   cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES:pp,period=400009/pp,\
>   cpu/event=0xd,umask=0x3,cmask=1,name=Bad_Speculation_INT_MISC_RECOVERY_CYCLES,period=2000003/,\
>   cpu/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_L1_HIT:pp,period=2000003/pp,\
>   cpu/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_HIT_LFB:pp,period=100003/pp \
>   -C 0,4 -a
> 
> I fixed the problem with this perf command by removing the ":pp" part.
> Perhaps your tool needs to fix that?
> 
> A working command line looks like this:
> 
>  sudo ./ocperf.py record -g -e \
> cpu/event=0xc5,umask=0x0,name=Branch_Resteers_BR_MISP_RETIRED_ALL_BRANCHES,period=400009/pp,\
> cpu/event=0xd,umask=0x3,cmask=1,name=Bad_Speculation_INT_MISC_RECOVERY_CYCLES,period=2000003/,\
> cpu/event=0xd1,umask=0x1,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_L1_HIT,period=2000003/pp,\
> cpu/event=0xd1,umask=0x40,name=L1_Bound_MEM_LOAD_UOPS_RETIRED_HIT_LFB,period=100003/pp \
>   -C 0,4 -a

There is a recent patch that may help here, see below, but maybe it's
just a matter of removing that :pp, as it ends with a /pp anyway; no
need to state that twice :)

With the patch below all those /pp would be replaced with /P.

- Arnaldo


https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/tools/perf?id=7f94af7a489fada17d28cc60e8f4409ce216bd6d

----------------------------------------------------------------------
perf tools: Introduce 'P' modifier to request max precision
The 'P' will cause the event to get maximum possible detected precise
level.

Following record:
  $ perf record -e cycles:P ...

will detect maximum precise level for 'cycles' event and use it.
----------------------------------------------------------------------

- Arnaldo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists
  2015-10-07 13:36                         ` Arnaldo Carvalho de Melo
@ 2015-10-07 15:44                           ` Andi Kleen
  0 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2015-10-07 15:44 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo; +Cc: Jesper Dangaard Brouer, linux-mm, netdev

> There is a recent patch that may help here, see below, but maybe it's
> just a matter of removing that :pp, as it ends with a /pp anyway; no
> need to state that twice :)

Yes, the extra :pp was a regression in toplev. I've fixed it now. Thanks.

-Andi

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists
  2015-10-07 12:31                       ` Jesper Dangaard Brouer
  2015-10-07 13:36                         ` Arnaldo Carvalho de Melo
@ 2015-10-07 16:06                         ` Andi Kleen
  1 sibling, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2015-10-07 16:06 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: linux-mm, netdev, Arnaldo Carvalho de Melo

> My specific CPU (i7-4790K @ 4.00GHz) unfortunately seems to have
> limited "Frontend" support. E.g. 
> 
>  # perf record -g -a -e stalled-cycles-frontend
>  Error:
>  The stalled-cycles-frontend event is not supported.
> 
> And AFAIK icache misses are part of "frontend".

Ignore stalled-cycles-frontend. It is very unreliable and has never worked right.
toplev gives you much more reliable output.


-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [MM PATCH V4 6/6] slub: optimize bulk slowpath free by detached freelist
  2015-09-29 15:48   ` [MM PATCH V4 6/6] slub: optimize bulk slowpath free by detached freelist Jesper Dangaard Brouer
@ 2015-10-14  5:15     ` Joonsoo Kim
  2015-10-21  7:57       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 50+ messages in thread
From: Joonsoo Kim @ 2015-10-14  5:15 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: linux-mm, Andrew Morton, Christoph Lameter, netdev,
	Alexander Duyck, Pekka Enberg, David Rientjes

On Tue, Sep 29, 2015 at 05:48:26PM +0200, Jesper Dangaard Brouer wrote:
> This change focuses on improving the speed of object freeing in the
> "slowpath" of kmem_cache_free_bulk.
> 
> The calls slab_free (fastpath) and __slab_free (slowpath) have been
> extended with support for bulk free, which amortizes the overhead of
> the (locked) cmpxchg_double.
> 
> To use the new bulking feature, we build what I call a detached
> freelist.  The detached freelist takes advantage of three properties:
> 
>  1) the free function call owns the object that is about to be freed,
>     thus writing into this memory is synchronization-free.
> 
>  2) many freelists can co-exist side-by-side in the same slab-page,
>     each with a separate head pointer.
> 
>  3) it is the visibility of the head pointer that needs synchronization.
> 
> Given these properties, the brilliant part is that the detached
> freelist can be constructed without any need for synchronization.  The
> freelist is constructed directly in the page objects, without any
> synchronization needed.  The detached freelist is allocated on the
> stack of the function call kmem_cache_free_bulk.  Thus, the freelist
> head pointer is not visible to other CPUs.
> 
> All objects in a SLUB freelist must belong to the same slab-page.
> Thus, constructing the detached freelist is about matching objects
> that belong to the same slab-page.  The bulk free array is scanned in
> a progressive manner with a limited look-ahead facility.
> 
> Kmem debug support is handled in the call to slab_free().
> 
> Notice kmem_cache_free_bulk no longer needs to disable IRQs.  This
> only slowed down bulk free of a single object by approx 3 cycles.
> 
> 
> Performance data:
>  Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz
> 
> SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns
> 
> To get stable and comparable numbers, the kernel has been booted with
> "slab_nomerge" (this also improves performance for larger bulk sizes).
> 
> Performance data, compared against fallback bulking:
> 
> bulk -  fallback bulk            - improvement with this patch
>    1 -  62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
>    2 -  55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
>    3 -  53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
>    4 -  52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
>    8 -  50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
>   16 -  49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
>   30 -  49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
>   32 -  50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
>   34 -  96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
>   48 -  83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
>   64 -  74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
>  128 -  90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
>  158 -  99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
>  250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%
> 
> Performance data, compared against the current in-kernel bulking:
> 
> bulk - curr in-kernel  - improvement with this patch
>    1 -  46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
>    2 -  27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
>    3 -  21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
>    4 -  18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
>    8 -  17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
>   16 -  18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1)  5.6%
>   30 -  18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0)  0.0%
>   32 -  18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0)  0.0%
>   34 -  78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
>   48 -  60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
>   64 -  49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
>  128 -  69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
>  158 -  79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
>  250 -  86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%
> 
> Performance with normal SLUB merging is significantly slower for
> larger bulking.  This is believed to (primarily) be an effect of not
> having to share the per-CPU data-structures, as tuning per-CPU size
> can achieve similar performance.
> 
> bulk - slab_nomerge   -  normal SLUB merge
>    1 -  49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
>    2 -  30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
>    3 -  23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
>    4 -  20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
>    8 -  18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
>   16 -  17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
>   30 -  18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
>   32 -  18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
>   34 -  23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
>   48 -  21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
>   64 -  20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
>  128 -  27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
>  158 -  30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
>  250 -  37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19
> 
> Joint work with Alexander Duyck.
> 
> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
> Acked-by: Christoph Lameter <cl@linux.com>
> ---
>  mm/slub.c |  108 ++++++++++++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 78 insertions(+), 30 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 7c2abc33fd4e..53500f3b70ab 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2812,44 +2812,92 @@ void kmem_cache_free(struct kmem_cache *s, void *x)
>  }
>  EXPORT_SYMBOL(kmem_cache_free);
>  
> -/* Note that interrupts must be enabled when calling this function. */
> -void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
> -{
> -	struct kmem_cache_cpu *c;
> +struct detached_freelist {
>  	struct page *page;
> -	int i;
> +	void *tail;
> +	void *freelist;
> +	int cnt;
> +};
>  
> -	local_irq_disable();
> -	c = this_cpu_ptr(s->cpu_slab);
> +/*
> > + * This function progressively scans the array with free objects (with
> > + * a limited look ahead) and extracts objects belonging to the same
> > + * page.  It builds a detached freelist directly within the given
> > + * page/objects.  This can happen without any need for
> > + * synchronization, because the objects are owned by the running
> > + * process.  The freelist is built up as a singly linked list in the
> > + * objects.  The idea is that this detached freelist can then be bulk
> > + * transferred to the real freelist(s), requiring only a single
> > + * synchronization primitive.  Look ahead in the array is limited due
> > + * to performance reasons.
> + */
> +static int build_detached_freelist(struct kmem_cache *s, size_t size,
> +				   void **p, struct detached_freelist *df)
> +{
> +	size_t first_skipped_index = 0;
> +	int lookahead = 3;
> +	void *object;
>  
> -	for (i = 0; i < size; i++) {
> -		void *object = p[i];
> +	/* Always re-init detached_freelist */
> +	df->page = NULL;
>  
> -		BUG_ON(!object);
> -		/* kmem cache debug support */
> -		s = cache_from_obj(s, object);
> -		if (unlikely(!s))
> -			goto exit;
> -		slab_free_hook(s, object);
> +	do {
> +		object = p[--size];
> +	} while (!object && size);
>  
> -		page = virt_to_head_page(object);
> +	if (!object)
> +		return 0;
>  
> -		if (c->page == page) {
> -			/* Fastpath: local CPU free */
> -			set_freepointer(s, object, c->freelist);
> -			c->freelist = object;
> -		} else {
> -			c->tid = next_tid(c->tid);
> -			local_irq_enable();
> -			/* Slowpath: overhead locked cmpxchg_double_slab */
> -			__slab_free(s, page, object, object, 1, _RET_IP_);
> -			local_irq_disable();
> -			c = this_cpu_ptr(s->cpu_slab);
> +	/* Start new detached freelist */
> +	set_freepointer(s, object, NULL);
> +	df->page = virt_to_head_page(object);
> +	df->tail = object;
> +	df->freelist = object;
> +	p[size] = NULL; /* mark object processed */
> +	df->cnt = 1;

Hello, Jesper.

AFAIK, it is uncommon to clear the pointers to objects in the argument array.
At least, it would be better to comment on it somewhere. Or, how about removing
the lookahead facility? Does it have a real benefit?

Thanks.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [MM PATCH V4 6/6] slub: optimize bulk slowpath free by detached freelist
  2015-10-14  5:15     ` Joonsoo Kim
@ 2015-10-21  7:57       ` Jesper Dangaard Brouer
  2015-11-05  5:09         ` Joonsoo Kim
  0 siblings, 1 reply; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2015-10-21  7:57 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: linux-mm, Andrew Morton, Christoph Lameter, netdev,
	Alexander Duyck, Pekka Enberg, David Rientjes, brouer

On Wed, 14 Oct 2015 14:15:25 +0900
Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:

> On Tue, Sep 29, 2015 at 05:48:26PM +0200, Jesper Dangaard Brouer wrote:
> > This change focuses on improving the speed of object freeing in the
> > "slowpath" of kmem_cache_free_bulk.
> > 
> > The calls slab_free (fastpath) and __slab_free (slowpath) have been
> > extended with support for bulk free, which amortizes the overhead of
> > the (locked) cmpxchg_double.
> > 
> > To use the new bulking feature, we build what I call a detached
> > freelist.  The detached freelist takes advantage of three properties:
> > 
> >  1) the free function call owns the object that is about to be freed,
> >     thus writing into this memory is synchronization-free.
> > 
> >  2) many freelists can co-exist side-by-side in the same slab-page,
> >     each with a separate head pointer.
> > 
> >  3) it is the visibility of the head pointer that needs synchronization.
> > 
> > Given these properties, the brilliant part is that the detached
> > freelist can be constructed without any need for synchronization.  The
> > freelist is constructed directly in the page objects, without any
> > synchronization needed.  The detached freelist is allocated on the
> > stack of the function call kmem_cache_free_bulk.  Thus, the freelist
> > head pointer is not visible to other CPUs.
> > 
> > All objects in a SLUB freelist must belong to the same slab-page.
> > Thus, constructing the detached freelist is about matching objects
> > that belong to the same slab-page.  The bulk free array is scanned in
> > a progressive manner with a limited look-ahead facility.
[...]


> Hello, Jesper.
> 
> AFAIK, it is uncommon to clear the pointers to objects in the argument array.
> At least, it would be better to comment on it somewhere.

In this case, I think clearing the array is a good thing, as
using/referencing objects after they have been freed is a bug (which
can be hard to detect).

> Or, how about removing the lookahead facility? Does it have a real benefit?

In my earlier patch series I had a version with and without the lookahead
facility, just so I could benchmark the difference.  With Alex's help
we tuned the code with the lookahead feature to be just as fast.
Thus, I merged the two patches. (I also did testing for the worst case [1].)

I do wonder if the lookahead has any real benefit.  In micro
benchmarking it might be "just as fast", but I do suspect (just from the
code size increase) it can affect real use-cases... Should we remove it?

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test03.c
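
For reference, the look-ahead we are talking about is roughly this tail
of the scan loop in build_detached_freelist() (a sketch based on the V4
patch; exact details may differ):

	while (size) {
		object = p[--size];
		if (!object)
			continue;	/* already part of a detached freelist */

		if (df->page == virt_to_head_page(object)) {
			/* Same slab-page: link object into the detached freelist */
			set_freepointer(s, object, df->freelist);
			df->freelist = object;
			df->cnt++;
			p[size] = NULL;	/* mark object processed */
			continue;
		}

		/* Object from another page: only search a limited distance ahead */
		if (!--lookahead)
			break;

		if (!first_skipped_index)
			first_skipped_index = size + 1;
	}

	return first_skipped_index;
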
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [MM PATCH V4 6/6] slub: optimize bulk slowpath free by detached freelist
  2015-10-21  7:57       ` Jesper Dangaard Brouer
@ 2015-11-05  5:09         ` Joonsoo Kim
  0 siblings, 0 replies; 50+ messages in thread
From: Joonsoo Kim @ 2015-11-05  5:09 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: linux-mm, Andrew Morton, Christoph Lameter, netdev,
	Alexander Duyck, Pekka Enberg, David Rientjes

On Wed, Oct 21, 2015 at 09:57:09AM +0200, Jesper Dangaard Brouer wrote:
> On Wed, 14 Oct 2015 14:15:25 +0900
> Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> 
> > On Tue, Sep 29, 2015 at 05:48:26PM +0200, Jesper Dangaard Brouer wrote:
> > > This change focuses on improving the speed of object freeing in the
> > > "slowpath" of kmem_cache_free_bulk.
> > > 
> > > The calls slab_free (fastpath) and __slab_free (slowpath) have been
> > > extended with support for bulk free, which amortizes the overhead of
> > > the (locked) cmpxchg_double.
> > > 
> > > To use the new bulking feature, we build what I call a detached
> > > freelist.  The detached freelist takes advantage of three properties:
> > > 
> > >  1) the free function call owns the object that is about to be freed,
> > >     thus writing into this memory is synchronization-free.
> > > 
> > >  2) many freelists can co-exist side-by-side in the same slab-page,
> > >     each with a separate head pointer.
> > > 
> > >  3) it is the visibility of the head pointer that needs synchronization.
> > > 
> > > Given these properties, the brilliant part is that the detached
> > > freelist can be constructed without any need for synchronization.  The
> > > freelist is constructed directly in the page objects, without any
> > > synchronization needed.  The detached freelist is allocated on the
> > > stack of the function call kmem_cache_free_bulk.  Thus, the freelist
> > > head pointer is not visible to other CPUs.
> > > 
> > > All objects in a SLUB freelist must belong to the same slab-page.
> > > Thus, constructing the detached freelist is about matching objects
> > > that belong to the same slab-page.  The bulk free array is scanned in
> > > a progressive manner with a limited look-ahead facility.
> [...]
> 
> 
> > Hello, Jesper.
> > 
> > AFAIK, it is uncommon to clear the pointers to objects in the argument array.
> > At least, it would be better to comment on it somewhere.
> 
> In this case, I think clearing the array is a good thing, as
> using/referencing objects after they have been freed is a bug (which
> can be hard to detect).

Okay.

> 
> > Or, how about removing the lookahead facility? Does it have a real benefit?
> 
> In my earlier patch series I had a version with and without the lookahead
> facility, just so I could benchmark the difference.  With Alex's help
> we tuned the code with the lookahead feature to be just as fast.
> Thus, I merged the two patches. (I also did testing for the worst case [1].)
> 
> I do wonder if the lookahead has any real benefit.  In micro
> benchmarking it might be "just as fast", but I do suspect (just from the
> code size increase) it can affect real use-cases... Should we remove it?
> 
> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test03.c

If it weren't implemented yet, I would say to start with the simple
one first. But now we already have a well-implemented one, so we don't
need to remove it. :)

Thanks.

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2015-11-05  5:09 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-28 12:26 [PATCH 0/7] Further optimizing SLAB/SLUB bulking Jesper Dangaard Brouer
2015-09-28 12:26 ` [PATCH 1/7] slub: create new ___slab_alloc function that can be called with irqs disabled Jesper Dangaard Brouer
2015-09-28 12:26 ` [PATCH 2/7] slub: Avoid irqoff/on in bulk allocation Jesper Dangaard Brouer
2015-09-28 12:26 ` [PATCH 3/7] slub: mark the dangling ifdef #else of CONFIG_SLUB_DEBUG Jesper Dangaard Brouer
2015-09-28 13:49   ` Christoph Lameter
2015-09-28 12:26 ` [PATCH 4/7] slab: implement bulking for SLAB allocator Jesper Dangaard Brouer
2015-09-28 15:11   ` Christoph Lameter
2015-09-28 12:26 ` [PATCH 5/7] slub: support for bulk free with SLUB freelists Jesper Dangaard Brouer
2015-09-28 15:16   ` Christoph Lameter
2015-09-28 15:51     ` Jesper Dangaard Brouer
2015-09-28 16:28       ` Christoph Lameter
2015-09-29  7:32         ` Jesper Dangaard Brouer
2015-09-28 16:30       ` Christoph Lameter
2015-09-29  7:12         ` Jesper Dangaard Brouer
2015-09-28 12:26 ` [PATCH 6/7] slub: optimize bulk slowpath free by detached freelist Jesper Dangaard Brouer
2015-09-28 15:22   ` Christoph Lameter
2015-09-28 12:26 ` [PATCH 7/7] slub: do prefetching in kmem_cache_alloc_bulk() Jesper Dangaard Brouer
2015-09-28 14:53   ` Alexander Duyck
2015-09-28 15:59     ` Jesper Dangaard Brouer
2015-09-29 15:46 ` [MM PATCH V4 0/6] Further optimizing SLAB/SLUB bulking Jesper Dangaard Brouer
2015-09-29 15:47   ` [MM PATCH V4 1/6] slub: create new ___slab_alloc function that can be called with irqs disabled Jesper Dangaard Brouer
2015-09-29 15:47   ` [MM PATCH V4 2/6] slub: Avoid irqoff/on in bulk allocation Jesper Dangaard Brouer
2015-09-29 15:47   ` [MM PATCH V4 3/6] slub: mark the dangling ifdef #else of CONFIG_SLUB_DEBUG Jesper Dangaard Brouer
2015-09-29 15:48   ` [MM PATCH V4 4/6] slab: implement bulking for SLAB allocator Jesper Dangaard Brouer
2015-09-29 15:48   ` [MM PATCH V4 5/6] slub: support for bulk free with SLUB freelists Jesper Dangaard Brouer
2015-09-29 16:38     ` Alexander Duyck
2015-09-29 17:00       ` Jesper Dangaard Brouer
2015-09-29 17:20         ` Alexander Duyck
2015-09-29 18:16           ` Jesper Dangaard Brouer
2015-09-30 11:44       ` [MM PATCH V4.1 " Jesper Dangaard Brouer
2015-09-30 16:03         ` Christoph Lameter
2015-10-01 22:10         ` Andrew Morton
2015-10-02  9:41           ` Jesper Dangaard Brouer
2015-10-02 10:10             ` Christoph Lameter
2015-10-02 10:40               ` Jesper Dangaard Brouer
2015-10-02 13:40             ` Jesper Dangaard Brouer
2015-10-02 21:50               ` Andrew Morton
2015-10-05 19:26                 ` Jesper Dangaard Brouer
2015-10-05 21:20                   ` Andi Kleen
2015-10-05 23:07                     ` Jesper Dangaard Brouer
2015-10-07 12:31                       ` Jesper Dangaard Brouer
2015-10-07 13:36                         ` Arnaldo Carvalho de Melo
2015-10-07 15:44                           ` Andi Kleen
2015-10-07 16:06                         ` Andi Kleen
2015-10-05 23:53                   ` Jesper Dangaard Brouer
2015-10-07 10:39                   ` Jesper Dangaard Brouer
2015-09-29 15:48   ` [MM PATCH V4 6/6] slub: optimize bulk slowpath free by detached freelist Jesper Dangaard Brouer
2015-10-14  5:15     ` Joonsoo Kim
2015-10-21  7:57       ` Jesper Dangaard Brouer
2015-11-05  5:09         ` Joonsoo Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).