* [RFC][PATCH] slub: Introduce 'alternate' per cpu partial lists
@ 2017-06-08 23:53 Laura Abbott
  2017-06-14  4:45 ` Joonsoo Kim
  2017-07-12 14:54 ` Christopher Lameter
  0 siblings, 2 replies; 5+ messages in thread
From: Laura Abbott @ 2017-06-08 23:53 UTC (permalink / raw)
  To: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim
  Cc: Laura Abbott, Andrew Morton, linux-mm, linux-kernel, Kees Cook

SLUB debugging features (poisoning, red zoning etc.) skip the fast path
completely. This ensures there is a single place to do all checks and
take any locks that may be necessary for debugging. The overhead of some
of the debugging features (e.g. poisoning) ends up being comparatively
small vs the overhead of not using the fast path.

We don't want to impose any kind of overhead on the fast path so
introduce the notion of an alternate fast path. This is essentially the
same idea as the existing fast path (store partially used pages on the
per-cpu list) but it happens after the real fast path. Debugging that
doesn't require locks (poisoning/red zoning) can happen on this path to
avoid the penalty of always needing to go for the slow path.

Signed-off-by: Laura Abbott <labbott@redhat.com>
---
This is a follow up to my previous proposal to speed up slub_debug=P
https://marc.info/?l=linux-mm&m=145920558822958&w=2 . The current approach
is hopelessly slow and can't really be used outside of limited debugging.
The goal is to make slub_debug=P more usable for general use.

Joonsoo Kim pointed out that my previous attempt still wouldn't scale
as it still involved taking the list_lock for every allocation. He suggested
adding per-cpu support, as did Christoph Lameter in a separate thread. This
proposal adds a separate per-cpu list for use when poisoning is enabled.
For this version, I'm mostly looking for general feedback about how reasonable
this approach is before trying to clean it up more.

- Some of this code is redundant and can probably be combined.
- The fast path is very sensitive and it was suggested I leave it alone. The
approach I took means the fastpath cmpxchg always fails before trying the
alternate cmpxchg. From some of my profiling, the cmpxchg seemed to be fairly
expensive.
- The flow for ___slab_free is ugly and should really be reworked.
- The poisoning is now scattered around in a few places. Ideally there would
be a single path for this to happen. If the fast path free fails we end up
doing an extra initialization of the object.
- I don't care about the name, feel free to suggest something better.

For some performance numbers, I used hackbench -g 20 -l 1000.
There does not seem to be an impact with slub_debug=- with this series.
For slub_debug=P, I see a ~2 second drop on a UP system and almost a 10
second drop on a 4-core system, as run under QEMU.
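
For reviewers, here is the resulting allocation flow as a rough,
condensed sketch (regular_fastpath() stands in for the existing,
unchanged c->freelist cmpxchg path and is not a real function; the
other names are from the diff below):

	/* Condensed sketch only -- see the diff for the actual code. */
	static void *alloc_flow_sketch(struct kmem_cache *s, gfp_t gfpflags,
				       int node, unsigned long addr,
				       struct kmem_cache_cpu *c)
	{
		void *object;

		/* 1. Existing lockless fastpath, untouched by this patch. */
		object = regular_fastpath(s, c, node);
		if (object)
			return object;

		/* 2. New lockless alternate fastpath for poisoned/red zoned
		 * caches: pops c->alt_freelist and runs init_object().
		 */
		if (kmem_cache_use_alt_partial(s))
			object = __slab_alloc_alt_path(s);

		/* 3. Fall back to the slow path as before. */
		if (!object)
			object = __slab_alloc(s, gfpflags, node, addr, c);

		return object;
	}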
---
 include/linux/slub_def.h |   8 ++
 mm/slub.c                | 263 ++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 244 insertions(+), 27 deletions(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 07ef550..d582101 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -10,8 +10,10 @@
 
 enum stat_item {
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
+	ALLOC_ALT_FASTPATH,	/* Allocation from alternate cpu slab */
 	ALLOC_SLOWPATH,		/* Allocation by getting a new cpu slab */
 	FREE_FASTPATH,		/* Free to cpu slab */
+	FREE_ALT_FASTPATH,	/* Free to alternate cpu slab */
 	FREE_SLOWPATH,		/* Freeing not to cpu slab */
 	FREE_FROZEN,		/* Freeing to frozen slab */
 	FREE_ADD_PARTIAL,	/* Freeing moves slab to partial list */
@@ -42,6 +44,12 @@ struct kmem_cache_cpu {
 	unsigned long tid;	/* Globally unique transaction id */
 	struct page *page;	/* The slab from which we are allocating */
 	struct page *partial;	/* Partially allocated frozen slabs */
+	/* The following fields have identical uses to those above,
+	 * but serve the alternate fast path. */
+	void **alt_freelist;
+	unsigned long alt_tid;
+	struct page *alt_partial;
+	struct page *alt_page;
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
diff --git a/mm/slub.c b/mm/slub.c
index 7449593..b1fc4c6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -132,10 +132,24 @@ void *fixup_red_left(struct kmem_cache *s, void *p)
 	return p;
 }
 
+#define SLAB_NO_PARTIAL (SLAB_CONSISTENCY_CHECKS | SLAB_STORE_USER | \
+                               SLAB_TRACE)
+
+
+static inline bool kmem_cache_use_alt_partial(struct kmem_cache *s)
+{
+#ifdef CONFIG_SLUB_CPU_PARTIAL
+	return s->flags & (SLAB_RED_ZONE | SLAB_POISON) &&
+		!(s->flags & SLAB_NO_PARTIAL);
+#else
+	return false;
+#endif
+}
+
 static inline bool kmem_cache_has_cpu_partial(struct kmem_cache *s)
 {
 #ifdef CONFIG_SLUB_CPU_PARTIAL
-	return !kmem_cache_debug(s);
+	return !(s->flags & SLAB_NO_PARTIAL);
 #else
 	return false;
 #endif
@@ -1786,6 +1800,7 @@ static inline void *acquire_slab(struct kmem_cache *s,
 }
 
 static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain);
+static void put_cpu_partial_alt(struct kmem_cache *s, struct page *page, int drain);
 static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags);
 
 /*
@@ -1821,11 +1836,17 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 
 		available += objects;
 		if (!object) {
-			c->page = page;
+			if (kmem_cache_use_alt_partial(s))
+				c->alt_page = page;
+			else
+				c->page = page;
 			stat(s, ALLOC_FROM_PARTIAL);
 			object = t;
 		} else {
-			put_cpu_partial(s, page, 0);
+			if (kmem_cache_use_alt_partial(s))
+				put_cpu_partial_alt(s, page, 0);
+			else
+				put_cpu_partial(s, page, 0);
 			stat(s, CPU_PARTIAL_NODE);
 		}
 		if (!kmem_cache_has_cpu_partial(s)
@@ -2147,12 +2168,16 @@ static void unfreeze_partials(struct kmem_cache *s,
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	struct kmem_cache_node *n = NULL, *n2 = NULL;
 	struct page *page, *discard_page = NULL;
+	bool alt = kmem_cache_use_alt_partial(s);
 
-	while ((page = c->partial)) {
+	while ((page = alt ? c->alt_partial : c->partial)) {
 		struct page new;
 		struct page old;
 
-		c->partial = page->next;
+		if (alt)
+			c->alt_partial = page->next;
+		else
+			c->partial = page->next;
 
 		n2 = get_node(s, page_to_nid(page));
 		if (n != n2) {
@@ -2263,6 +2288,58 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 #endif
 }
 
+static void put_cpu_partial_alt(struct kmem_cache *s, struct page *page, int drain)
+{
+#ifdef CONFIG_SLUB_CPU_PARTIAL
+	struct page *oldpage;
+	int pages;
+	int pobjects;
+
+	preempt_disable();
+	do {
+		pages = 0;
+		pobjects = 0;
+		oldpage = this_cpu_read(s->cpu_slab->alt_partial);
+
+		if (oldpage) {
+			pobjects = oldpage->pobjects;
+			pages = oldpage->pages;
+			if (drain && pobjects > s->cpu_partial) {
+				unsigned long flags;
+				/*
+				 * partial array is full. Move the existing
+				 * set to the per node partial list.
+				 */
+				local_irq_save(flags);
+				unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
+				local_irq_restore(flags);
+				oldpage = NULL;
+				pobjects = 0;
+				pages = 0;
+				stat(s, CPU_PARTIAL_DRAIN);
+			}
+		}
+
+		pages++;
+		pobjects += page->objects - page->inuse;
+
+		page->pages = pages;
+		page->pobjects = pobjects;
+		page->next = oldpage;
+
+	} while (this_cpu_cmpxchg(s->cpu_slab->alt_partial, oldpage, page)
+								!= oldpage);
+	if (unlikely(!s->cpu_partial)) {
+		unsigned long flags;
+
+		local_irq_save(flags);
+		unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
+		local_irq_restore(flags);
+	}
+	preempt_enable();
+#endif
+}
+
 static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
 	stat(s, CPUSLAB_FLUSH);
@@ -2273,6 +2350,16 @@ static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 	c->freelist = NULL;
 }
 
+static inline void flush_slab_alt(struct kmem_cache *s, struct kmem_cache_cpu *c)
+{
+	stat(s, CPUSLAB_FLUSH);
+	deactivate_slab(s, c->alt_page, c->alt_freelist);
+
+	c->alt_tid = next_tid(c->alt_tid);
+	c->alt_page = NULL;
+	c->alt_freelist = NULL;
+}
+
 /*
  * Flush cpu slab.
  *
@@ -2285,6 +2372,8 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 	if (likely(c)) {
 		if (c->page)
 			flush_slab(s, c);
+		if (c->alt_page)
+			flush_slab_alt(s, c);
 
 		unfreeze_partials(s, c);
 	}
@@ -2302,7 +2391,7 @@ static bool has_cpu_slab(int cpu, void *info)
 	struct kmem_cache *s = info;
 	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
 
-	return c->page || c->partial;
+	return c->page || c->partial || c->alt_page || c->alt_partial;
 }
 
 static void flush_all(struct kmem_cache *s)
@@ -2425,6 +2514,8 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
 		if (c->page)
 			flush_slab(s, c);
 
+		if (c->alt_page)
+			flush_slab_alt(s, c);
 		/*
 		 * No other reference to the page yet so we can
 		 * muck around with it freely without cmpxchg
@@ -2433,7 +2524,10 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
 		page->freelist = NULL;
 
 		stat(s, ALLOC_SLAB);
-		c->page = page;
+		if (kmem_cache_use_alt_partial(s))
+			c->alt_page = page;
+		else
+			c->page = page;
 		*pc = c;
 	} else
 		freelist = NULL;
@@ -2507,10 +2601,14 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 {
 	void *freelist;
 	struct page *page;
+	bool alt = kmem_cache_use_alt_partial(s);
 
 	page = c->page;
-	if (!page)
-		goto new_slab;
+	if (!page) {
+		page = c->alt_page;
+		if (!page)
+			goto new_slab;
+	}
 redo:
 
 	if (unlikely(!node_match(page, node))) {
@@ -2541,14 +2639,18 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	}
 
 	/* must check again c->freelist in case of cpu migration or IRQ */
-	freelist = c->freelist;
+	freelist = alt ? c->alt_freelist : c->freelist;
 	if (freelist)
 		goto load_freelist;
 
 	freelist = get_freelist(s, page);
 
+
 	if (!freelist) {
-		c->page = NULL;
+		if (alt)
+			c->alt_page = NULL;
+		else
+			c->page = NULL;
 		stat(s, DEACTIVATE_BYPASS);
 		goto new_slab;
 	}
@@ -2561,9 +2663,16 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	 * page is pointing to the page from which the objects are obtained.
 	 * That page must be frozen for per cpu allocations to work.
 	 */
-	VM_BUG_ON(!c->page->frozen);
-	c->freelist = get_freepointer(s, freelist);
-	c->tid = next_tid(c->tid);
+	if (alt) {
+		VM_BUG_ON(!c->alt_page->frozen);
+		c->alt_freelist = get_freepointer(s, freelist);
+		c->alt_tid = next_tid(c->alt_tid);
+		init_object(s, freelist, SLUB_RED_ACTIVE);
+	} else {
+		VM_BUG_ON(!c->page->frozen);
+		c->freelist = get_freepointer(s, freelist);
+		c->tid = next_tid(c->tid);
+	}
 	return freelist;
 
 new_slab:
@@ -2576,6 +2685,14 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		goto redo;
 	}
 
+	if (c->alt_partial) {
+		page = c->alt_page = c->alt_partial;
+		c->alt_partial = page->next;
+		stat(s, CPU_PARTIAL_ALLOC);
+		c->alt_freelist = NULL;
+		goto redo;
+	}
+
 	freelist = new_slab_objects(s, gfpflags, node, &c);
 
 	if (unlikely(!freelist)) {
@@ -2583,19 +2700,21 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		return NULL;
 	}
 
-	page = c->page;
-	if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
-		goto load_freelist;
-
+	page = alt ? c->alt_page : c->page;
 	/* Only entered in the debug case */
-	if (kmem_cache_debug(s) &&
-			!alloc_debug_processing(s, page, freelist, addr))
-		goto new_slab;	/* Slab failed checks. Next slab needed */
+	if (kmem_cache_debug(s)) {
+		if (!alloc_debug_processing(s, page, freelist, addr))
+			goto new_slab;	/* Slab failed checks. Next slab needed */
 
-	deactivate_slab(s, page, get_freepointer(s, freelist));
-	c->page = NULL;
-	c->freelist = NULL;
-	return freelist;
+		if (!kmem_cache_use_alt_partial(s)) {
+			deactivate_slab(s, page, get_freepointer(s, freelist));
+			c->page = NULL;
+			c->freelist = NULL;
+			return freelist;
+		}
+	}
+	/* XXX Fix this flow */
+	goto load_freelist;
 }
 
 /*
@@ -2623,6 +2742,39 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	return p;
 }
 
+static void *__slab_alloc_alt_path(struct kmem_cache *s)
+{
+	void *object;
+	void *next_object;
+	struct kmem_cache_cpu *c;
+	unsigned long tid;
+
+	do {
+		tid = this_cpu_read(s->cpu_slab->alt_tid);
+		c = raw_cpu_ptr(s->cpu_slab);
+	} while (IS_ENABLED(CONFIG_PREEMPT) &&
+		 unlikely(tid != READ_ONCE(c->alt_tid)));
+
+	barrier();
+
+	object = c->alt_freelist;
+
+	if (!object)
+		return NULL;
+
+	next_object = get_freepointer_safe(s, object);
+
+	if (unlikely(!this_cpu_cmpxchg_double(
+			s->cpu_slab->alt_freelist, s->cpu_slab->alt_tid,
+			object, tid,
+			next_object, next_tid(tid))))
+		return NULL;
+
+	init_object(s, object, SLUB_RED_ACTIVE);
+	stat(s, ALLOC_ALT_FASTPATH);
+	return object;
+}
+
 /*
  * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
  * have the fastpath folded into their functions. So no function call
@@ -2681,7 +2833,10 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
 	object = c->freelist;
 	page = c->page;
 	if (unlikely(!object || !node_match(page, node))) {
-		object = __slab_alloc(s, gfpflags, node, addr, c);
+		if (kmem_cache_use_alt_partial(s))
+			object = __slab_alloc_alt_path(s);
+		if (!object)
+			object = __slab_alloc(s, gfpflags, node, addr, c);
 		stat(s, ALLOC_SLOWPATH);
 	} else {
 		void *next_object = get_freepointer_safe(s, object);
@@ -2777,6 +2932,50 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trace);
 #endif
 #endif
 
+static bool __slab_free_alt_path(struct kmem_cache *s, struct page *page,
+				void *head, void *tail)
+{
+	unsigned long tid;
+	struct kmem_cache_cpu *c;
+	void *object = head;
+
+	do {
+		tid = this_cpu_read(s->cpu_slab->alt_tid);
+		c = raw_cpu_ptr(s->cpu_slab);
+	} while (IS_ENABLED(CONFIG_PREEMPT) &&
+		 unlikely(tid != READ_ONCE(c->alt_tid)));
+
+	barrier();
+
+	/*
+	 * XXX How to avoid duplicating the initialization?
+	 */
+next_object:
+	init_object(s, object, SLUB_RED_INACTIVE);
+	if (object != tail) {
+		object = get_freepointer(s, object);
+		goto next_object;
+	}
+
+	if (likely(page == c->alt_page)) {
+		set_freepointer(s, tail, c->alt_freelist);
+
+		if (unlikely(!this_cpu_cmpxchg_double(
+				s->cpu_slab->alt_freelist, s->cpu_slab->alt_tid,
+				c->alt_freelist, tid,
+				head, next_tid(tid)))) {
+
+			note_cmpxchg_failure("slab_free", s, tid);
+			return false;
+		}
+
+		stat(s, FREE_ALT_FASTPATH);
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * Slow path handling. This may still be called frequently since objects
  * have a longer lifetime than the cpu slabs in most processing loads.
@@ -2799,6 +2998,9 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 
 	stat(s, FREE_SLOWPATH);
 
+	if (kmem_cache_use_alt_partial(s) && __slab_free_alt_path(s, page, head, tail))
+		return;
+
 	if (kmem_cache_debug(s) &&
 	    !free_debug_processing(s, page, head, tail, cnt, addr))
 		return;
@@ -2854,7 +3056,10 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 		 * per cpu partial list.
 		 */
 		if (new.frozen && !was_frozen) {
-			put_cpu_partial(s, page, 1);
+			if (kmem_cache_use_alt_partial(s))
+				put_cpu_partial_alt(s, page, 1);
+			else
+				put_cpu_partial(s, page, 1);
 			stat(s, CPU_PARTIAL_FREE);
 		}
 		/*
@@ -5317,9 +5522,11 @@ static ssize_t text##_store(struct kmem_cache *s,		\
 }								\
 SLAB_ATTR(text);						\
 
+STAT_ATTR(ALLOC_ALT_FASTPATH, alloc_alt_fastpath);
 STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
 STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
 STAT_ATTR(FREE_FASTPATH, free_fastpath);
+STAT_ATTR(FREE_ALT_FASTPATH, free_alt_fastpath);
 STAT_ATTR(FREE_SLOWPATH, free_slowpath);
 STAT_ATTR(FREE_FROZEN, free_frozen);
 STAT_ATTR(FREE_ADD_PARTIAL, free_add_partial);
@@ -5385,8 +5592,10 @@ static struct attribute *slab_attrs[] = {
 #endif
 #ifdef CONFIG_SLUB_STATS
 	&alloc_fastpath_attr.attr,
+	&alloc_alt_fastpath_attr.attr,
 	&alloc_slowpath_attr.attr,
 	&free_fastpath_attr.attr,
+	&free_alt_fastpath_attr.attr,
 	&free_slowpath_attr.attr,
 	&free_frozen_attr.attr,
 	&free_add_partial_attr.attr,
-- 
2.7.5


* Re: [RFC][PATCH] slub: Introduce 'alternate' per cpu partial lists
  2017-06-08 23:53 [RFC][PATCH] slub: Introduce 'alternate' per cpu partial lists Laura Abbott
@ 2017-06-14  4:45 ` Joonsoo Kim
  2017-06-14 20:31   ` Laura Abbott
  2017-07-12 14:49   ` Christopher Lameter
  2017-07-12 14:54 ` Christopher Lameter
  1 sibling, 2 replies; 5+ messages in thread
From: Joonsoo Kim @ 2017-06-14  4:45 UTC (permalink / raw)
  To: Laura Abbott
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Andrew Morton,
	linux-mm, linux-kernel, Kees Cook

On Thu, Jun 08, 2017 at 04:53:04PM -0700, Laura Abbott wrote:
> SLUB debugging features (poisoning, red zoning etc.) skip the fast path
> completely. This ensures there is a single place to do all checks and
> take any locks that may be necessary for debugging. The overhead of some
> of the debugging features (e.g. poisoning) ends up being comparatively
> small vs the overhead of not using the fast path.
> 
> We don't want to impose any kind of overhead on the fast path so
> introduce the notion of an alternate fast path. This is essentially the
> same idea as the existing fast path (store partially used pages on the
> per-cpu list) but it happens after the real fast path. Debugging that
> doesn't require locks (poisoning/red zoning) can happen on this path to
> avoid the penalty of always needing to go for the slow path.
> 
> Signed-off-by: Laura Abbott <labbott@redhat.com>
> ---
> This is a follow up to my previous proposal to speed up slub_debug=P
> https://marc.info/?l=linux-mm&m=145920558822958&w=2 . The current approach
> is hopelessly slow and can't really be used outside of limited debugging.
> The goal is to make slub_debug=P more usable for general use.
> 
> Joonsoo Kim pointed out that my previous attempt still wouldn't scale
> as it still involved taking the list_lock for every allocation. He suggested
> adding per-cpu support, as did Christoph Lameter in a separate thread. This
> proposal adds a separate per-cpu list for use when poisoning is enabled.
> For this version, I'm mostly looking for general feedback about how reasonable
> this approach is before trying to clean it up more.
> 
> - Some of this code is redundant and can probably be combined.
> - The fast path is very sensitive and it was suggested I leave it alone. The
> approach I took means the fastpath cmpxchg always fails before trying the
> alternate cmpxchg. From some of my profiling, the cmpxchg seemed to be fairly
> expensive.

It looks better to modify the fastpath itself for the poisoning-only
(non-debug) case. If we use a jump label, it doesn't cause any overhead
in the fastpath for users who don't use this feature. It really makes
things simpler. Only a few more lines would be needed in the fastpath.
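
Roughly what I have in mind, as an untested sketch (the static key name
is made up; the helpers are the standard jump label and existing SLUB
ones):

	/* Untested sketch; the key name is made up. */
	DEFINE_STATIC_KEY_FALSE(slub_fast_poison);

	/* Enable it only when some cache actually wants fast poisoning,
	 * e.g. from kmem_cache_open():
	 *
	 *	if (s->flags & SLAB_POISON)
	 *		static_branch_enable(&slub_fast_poison);
	 */

	/* Then, in slab_alloc_node(), right after the successful
	 * this_cpu_cmpxchg_double() that pops the object off c->freelist:
	 */
	if (static_branch_unlikely(&slub_fast_poison) &&
	    unlikely(s->flags & SLAB_POISON))
		init_object(s, object, SLUB_RED_ACTIVE);

The free side would need the matching init_object(s, object,
SLUB_RED_INACTIVE) behind the same key. When the key is off, the branch
is patched out, so the existing fastpath is unchanged.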

Christoph, any opinion?

Thanks.


* Re: [RFC][PATCH] slub: Introduce 'alternate' per cpu partial lists
  2017-06-14  4:45 ` Joonsoo Kim
@ 2017-06-14 20:31   ` Laura Abbott
  2017-07-12 14:49   ` Christopher Lameter
  1 sibling, 0 replies; 5+ messages in thread
From: Laura Abbott @ 2017-06-14 20:31 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Andrew Morton,
	linux-mm, linux-kernel, Kees Cook

On 06/13/2017 09:45 PM, Joonsoo Kim wrote:
> On Thu, Jun 08, 2017 at 04:53:04PM -0700, Laura Abbott wrote:
>> SLUB debugging features (poisoning, red zoning etc.) skip the fast path
>> completely. This ensures there is a single place to do all checks and
>> take any locks that may be necessary for debugging. The overhead of some
>> of the debugging features (e.g. poisoning) ends up being comparatively
>> small vs the overhead of not using the fast path.
>>
>> We don't want to impose any kind of overhead on the fast path so
>> introduce the notion of an alternate fast path. This is essentially the
>> same idea as the existing fast path (store partially used pages on the
>> per-cpu list) but it happens after the real fast path. Debugging that
>> doesn't require locks (poisoning/red zoning) can happen on this path to
>> avoid the penalty of always needing to go for the slow path.
>>
>> Signed-off-by: Laura Abbott <labbott@redhat.com>
>> ---
>> This is a follow up to my previous proposal to speed up slub_debug=P
>> https://marc.info/?l=linux-mm&m=145920558822958&w=2 . The current approach
>> is hopelessly slow and can't really be used outside of limited debugging.
>> The goal is to make slub_debug=P more usable for general use.
>>
>> Joonsoo Kim pointed out that my previous attempt still wouldn't scale
>> as it still involved taking the list_lock for every allocation. He suggested
>> adding per-cpu support, as did Christoph Lameter in a separate thread. This
>> proposal adds a separate per-cpu list for use when poisoning is enabled.
>> For this version, I'm mostly looking for general feedback about how reasonable
>> this approach is before trying to clean it up more.
>>
>> - Some of this code is redundant and can probably be combined.
>> - The fast path is very sensitive and it was suggested I leave it alone. The
>> approach I took means the fastpath cmpxchg always fails before trying the
>> alternate cmpxchg. From some of my profiling, the cmpxchg seemed to be fairly
>> expensive.
> 
> It looks better to modify the fastpath itself for the poisoning-only
> (non-debug) case. If we use a jump label, it doesn't cause any overhead
> in the fastpath for users who don't use this feature. It really makes
> things simpler. Only a few more lines would be needed in the fastpath.
> 

When I initially tried something like that with the first version,
I still saw an increase in fast path overhead even with slub_debug=-.
I might not have been testing with jump labels, though, and I didn't
spend much time narrowing down the issue.

Thanks,
Laura

> Christoph, any opinion?
> 
> Thanks.
> 


* Re: [RFC][PATCH] slub: Introduce 'alternate' per cpu partial lists
  2017-06-14  4:45 ` Joonsoo Kim
  2017-06-14 20:31   ` Laura Abbott
@ 2017-07-12 14:49   ` Christopher Lameter
  1 sibling, 0 replies; 5+ messages in thread
From: Christopher Lameter @ 2017-07-12 14:49 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Laura Abbott, Pekka Enberg, David Rientjes, Andrew Morton,
	linux-mm, linux-kernel, Kees Cook

On Wed, 14 Jun 2017, Joonsoo Kim wrote:

> > - Some of this code is redundant and can probably be combined.
> > - The fast path is very sensitive and it was suggested I leave it alone. The
> > approach I took means the fastpath cmpxchg always fails before trying the
> > alternate cmpxchg. From some of my profiling, the cmpxchg seemed to be fairly
> > expensive.
>
> It looks better to modify the fastpath itself for the poisoning-only
> (non-debug) case. If we use a jump label, it doesn't cause any overhead
> in the fastpath for users who don't use this feature. It really makes
> things simpler. Only a few more lines would be needed in the fastpath.
>
> Christoph, any opinion?

Just looked through it. Sorry, I was on vacation in Europe for a while.

The duplication in kmem_cache_cpu is not good performance-wise. Maybe
just keep the single per cpu partial list and, depending on a kmem_cache
flag, change the locking semantics to allow for faster debugging?
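
Something along these lines, i.e. essentially the predicate this patch
already adds, but used to switch behavior on the existing fields instead
of introducing the alt_* duplicates (rough sketch, name made up):

	/* Poisoning/red zoning only: no consistency checks, no tracking. */
	static inline bool kmem_cache_fast_debug(struct kmem_cache *s)
	{
		return (s->flags & (SLAB_POISON | SLAB_RED_ZONE)) &&
		       !(s->flags & (SLAB_CONSISTENCY_CHECKS |
				     SLAB_STORE_USER | SLAB_TRACE));
	}

The alloc/free paths would then take list_lock and do the full debug
processing only when this returns false, and otherwise keep the existing
lockless c->freelist/c->partial handling and just add the init_object()
calls.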


* Re: [RFC][PATCH] slub: Introduce 'alternate' per cpu partial lists
  2017-06-08 23:53 [RFC][PATCH] slub: Introduce 'alternate' per cpu partial lists Laura Abbott
  2017-06-14  4:45 ` Joonsoo Kim
@ 2017-07-12 14:54 ` Christopher Lameter
  1 sibling, 0 replies; 5+ messages in thread
From: Christopher Lameter @ 2017-07-12 14:54 UTC (permalink / raw)
  To: Laura Abbott
  Cc: Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton,
	linux-mm, linux-kernel, Kees Cook

On Thu, 8 Jun 2017, Laura Abbott wrote:

> - Some of this code is redundant and can probably be combined.
> - The fast path is very sensitive and it was suggested I leave it alone. The
> approach I took means the fastpath cmpxchg always fails before trying the
> alternate cmpxchg. From some of my profiling, the cmpxchg seemed to be fairly
> expensive.

I think it's better to change the fast path. Just make sure that the hot
path stays as unencumbered as possible. There are already slow pieces in
the hot path. If your modifications are similar, then it would work.

> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index 07ef550..d582101 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h
> @@ -42,6 +44,12 @@ struct kmem_cache_cpu {
>  	unsigned long tid;	/* Globally unique transaction id */
>  	struct page *page;	/* The slab from which we are allocating */
>  	struct page *partial;	/* Partially allocated frozen slabs */
> +	/* The following fields have identical uses to those above,
> +	 * but serve the alternate fast path. */
> +	void **alt_freelist;
> +	unsigned long alt_tid;
> +	struct page *alt_partial;
> +	struct page *alt_page;
>  #ifdef CONFIG_SLUB_STATS
>  	unsigned stat[NR_SLUB_STAT_ITEMS];
>  #endif

I would rather avoid duplication here. Use the regular entries and modify
the flow depending on a flag.

> diff --git a/mm/slub.c b/mm/slub.c
> index 7449593..b1fc4c6 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -132,10 +132,24 @@ void *fixup_red_left(struct kmem_cache *s, void *p)
>  	return p;
>  }
>
> +#define SLAB_NO_PARTIAL (SLAB_CONSISTENCY_CHECKS | SLAB_STORE_USER | \
> +                               SLAB_TRACE)
> +
> +
> +static inline bool kmem_cache_use_alt_partial(struct kmem_cache *s)
> +{
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
> +	return s->flags & (SLAB_RED_ZONE | SLAB_POISON) &&
> +		!(s->flags & SLAB_NO_PARTIAL);
> +#else
> +	return false;
> +#endif
> +}
> +
>  static inline bool kmem_cache_has_cpu_partial(struct kmem_cache *s)
>  {
>  #ifdef CONFIG_SLUB_CPU_PARTIAL
> -	return !kmem_cache_debug(s);
> +	return !(s->flags & SLAB_NO_PARTIAL);
>  #else
>  	return false;
>  #endif
> @@ -1786,6 +1800,7 @@ static inline void *acquire_slab(struct kmem_cache *s,
>  }

Hmmm... Looks like the inversion would be better

SLAB_PARTIAL?


...

Lots of duplication. I think that can be avoided by rearranging the fast
path depending on a flag.

Maybe make the fast poisoning the default? If you can keep the performance
of the fast path for regular use then this may be best. You can then avoid
adding the additional flag as well as the additional debug counters.

