* slub bulk alloc: Extract objects from the per cpu slab
From: Christoph Lameter @ 2015-04-08 18:13 UTC
  To: akpm; +Cc: brouer, Joonsoo Kim, Pekka Enberg, David Rientjes, linux-mm

First piece: acceleration of retrieval of per cpu objects


If we are allocating lots of objects then it is advantageous to
disable interrupts and avoid the this_cpu_cmpxchg() operation to
get these objects faster. Note that we cannot do the fast operation
if debugging is enabled. Note also that the requirement of having
interrupts disabled avoids having to do processor flag operations.

Allocate as many objects as possible in the fast way and then fall
back to the generic implementation for the rest of the objects.
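
As a caller-side illustration, usage is intended to look roughly like
this (a hypothetical sketch; only kmem_cache_alloc_bulk() and
kmem_cache_free_bulk() are from this series, "example_cache" and the
error handling are made up):

	void *objects[16];

	/*
	 * True means all 16 slots of objects[] were filled, first from
	 * the per cpu freelist, then via the generic fallback.
	 */
	if (!kmem_cache_alloc_bulk(example_cache, GFP_KERNEL,
					ARRAY_SIZE(objects), objects))
		return -ENOMEM;

	/* ... use the objects ... */

	kmem_cache_free_bulk(example_cache, ARRAY_SIZE(objects), objects);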

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c
+++ linux/mm/slub.c
@@ -2761,7 +2761,32 @@ EXPORT_SYMBOL(kmem_cache_free_bulk);
 bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 								void **p)
 {
-	return kmem_cache_alloc_bulk(s, flags, size, p);
+	if (!kmem_cache_debug(s)) {
+		struct kmem_cache_cpu *c;
+
+		/* Drain objects in the per cpu slab */
+		local_irq_disable();
+		c = this_cpu_ptr(s->cpu_slab);
+
+		while (size) {
+			void *object = c->freelist;
+
+			if (!object)
+				break;
+
+			c->freelist = get_freepointer(s, object);
+			*p++ = object;
+			size--;
+
+			if (unlikely(flags & __GFP_ZERO))
+				memset(object, 0, s->object_size);
+		}
+		c->tid = next_tid(c->tid);
+
+		local_irq_enable();
+	}
+
+	return __kmem_cache_alloc_bulk(s, flags, size, p);
 }
 EXPORT_SYMBOL(kmem_cache_alloc_bulk);


* Re: slub bulk alloc: Extract objects from the per cpu slab
From: Andrew Morton @ 2015-04-08 22:53 UTC
  To: Christoph Lameter
  Cc: brouer, Joonsoo Kim, Pekka Enberg, David Rientjes, linux-mm

On Wed, 8 Apr 2015 13:13:29 -0500 (CDT) Christoph Lameter <cl@linux.com> wrote:

> First piece: acceleration of retrieval of per cpu objects
> 
> 
> If we are allocating lots of objects then it is advantageous to
> disable interrupts and avoid the this_cpu_cmpxchg() operation to
> get these objects faster. Note that we cannot do the fast operation
> if debugging is enabled.

Why can't we do it if debugging is enabled?

> Note also that the requirement of having
> interrupts disabled avoids having to do processor flag operations.
> 
> Allocate as many objects as possible in the fast way and then fall
> back to the generic implementation for the rest of the objects.

Seems sane.  What's the expected success rate of the initial bulk
allocation attempt?

> --- linux.orig/mm/slub.c
> +++ linux/mm/slub.c
> @@ -2761,7 +2761,32 @@ EXPORT_SYMBOL(kmem_cache_free_bulk);
>  bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
>  								void **p)
>  {
> -	return kmem_cache_alloc_bulk(s, flags, size, p);
> +	if (!kmem_cache_debug(s)) {
> +		struct kmem_cache_cpu *c;
> +
> +		/* Drain objects in the per cpu slab */
> +		local_irq_disable();
> +		c = this_cpu_ptr(s->cpu_slab);
> +
> +		while (size) {
> +			void *object = c->freelist;
> +
> +			if (!object)
> +				break;
> +
> +			c->freelist = get_freepointer(s, object);
> +			*p++ = object;
> +			size--;
> +
> +			if (unlikely(flags & __GFP_ZERO))
> +				memset(object, 0, s->object_size);
> +		}
> +		c->tid = next_tid(c->tid);
> +
> +		local_irq_enable();
> +	}
> +
> +	return __kmem_cache_alloc_bulk(s, flags, size, p);

This kmem_cache_cpu.tid logic is a bit opaque.  The low-level
operations seem reasonably well documented but I couldn't find anywhere
which tells me how it all actually works - what is "disambiguation
during cmpxchg" and how do we achieve it?


I'm in two minds about putting
slab-infrastructure-for-bulk-object-allocation-and-freeing-v3.patch and
slub-bulk-alloc-extract-objects-from-the-per-cpu-slab.patch into 4.1. 
They're standalone (ie: no in-kernel callers!) hence harmless, and
merging them will make Jesper's life a bit easier.  But otoh they are
unproven and have no in-kernel callers, so formally they shouldn't be
merged yet.  I suppose we can throw them away again if things don't
work out.


* Re: slub bulk alloc: Extract objects from the per cpu slab
From: Christoph Lameter @ 2015-04-09 14:03 UTC
  To: Andrew Morton; +Cc: brouer, Joonsoo Kim, Pekka Enberg, David Rientjes, linux-mm

On Wed, 8 Apr 2015, Andrew Morton wrote:

> On Wed, 8 Apr 2015 13:13:29 -0500 (CDT) Christoph Lameter <cl@linux.com> wrote:
>
> > First piece: acceleration of retrieval of per cpu objects
> >
> >
> > If we are allocating lots of objects then it is advantageous to
> > disable interrupts and avoid the this_cpu_cmpxchg() operation to
> > get these objects faster. Note that we cannot do the fast operation
> > if debugging is enabled.
>
> Why can't we do it if debugging is enabled?

We would have to add extra code to do all the debugging checks. And it
would not be fast anyway.

> > Allocate as many objects as possible in the fast way and then fall
> > back to the generic implementation for the rest of the objects.
>
> Seems sane.  What's the expected success rate of the initial bulk
> allocation attempt?

This is going to increase as we add more capabilities. I have a second
patch here that extends the fast allocation to the per cpu partial pages.

> > +		c->tid = next_tid(c->tid);
> > +
> > +		local_irq_enable();
> > +	}
> > +
> > +	return __kmem_cache_alloc_bulk(s, flags, size, p);
>
> This kmem_cache_cpu.tid logic is a bit opaque.  The low-level
> operations seem reasonably well documented but I couldn't find anywhere
> which tells me how it all actually works - what is "disambiguation
> during cmpxchg" and how do we achieve it?

This is used to force a retry in slab_alloc_node() if preemption occurs
there. We are modifying the per cpu state, thus a retry must be forced.
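
To sketch the idea (simplified from the fast path in mm/slub.c; details
and error handling elided):

redo:
	tid = this_cpu_read(s->cpu_slab->tid);
	c = raw_cpu_ptr(s->cpu_slab);
	object = c->freelist;
	...
	/* Succeeds only if both freelist and tid are still unchanged */
	if (unlikely(!this_cpu_cmpxchg_double(
			s->cpu_slab->freelist, s->cpu_slab->tid,
			object, tid,
			get_freepointer(s, object), next_tid(tid))))
		goto redo;

Advancing c->tid in the bulk path makes any such in-flight speculation
fail the cmpxchg and retry.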

> I'm in two minds about putting
> slab-infrastructure-for-bulk-object-allocation-and-freeing-v3.patch and
> slub-bulk-alloc-extract-objects-from-the-per-cpu-slab.patch into 4.1.
> They're standalone (ie: no in-kernel callers!) hence harmless, and
> merging them will make Jesper's life a bit easier.  But otoh they are
> unproven and have no in-kernel callers, so formally they shouldn't be
> merged yet.  I suppose we can throw them away again if things don't
> work out.

Can we keep them in -next and I will add patches as we go forward? There
was already a lot of discussion before and I would like to go
incrementally adding methods to do bulk extraction from the various
control structures that we have holding objects.


* slub: bulk allocation from per cpu partial pages
From: Christoph Lameter @ 2015-04-09 17:16 UTC
  To: Andrew Morton; +Cc: brouer, Joonsoo Kim, Pekka Enberg, David Rientjes, linux-mm

Next step: cover all of the per cpu objects available.


Expand the bulk allocation support to drain the per cpu partial
pages while interrupts are off.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c
+++ linux/mm/slub.c
@@ -2771,15 +2771,45 @@ bool kmem_cache_alloc_bulk(struct kmem_c
 		while (size) {
 			void *object = c->freelist;

-			if (!object)
-				break;
+			if (unlikely(!object)) {
+				/*
+				 * Check if there are remotely freed
+				 * objects available in the page.
+				 */
+				object = get_freelist(s, c->page);
+
+				if (!object) {
+					/*
+					 * All objects in use lets check if
+					 * we have other per cpu partial
+					 * pages that have available
+					 * objects.
+					 */
+					c->page = c->partial;
+					if (!c->page) {
+						/* No per cpu objects left */
+						c->freelist = NULL;
+						break;
+					}
+
+					/* Next per cpu partial page */
+					c->partial = c->page->next;
+					c->freelist = get_freelist(s,
+							c->page);
+					continue;
+				}
+
+			}
+

-			c->freelist = get_freepointer(s, object);
 			*p++ = object;
 			size--;

 			if (unlikely(flags & __GFP_ZERO))
 				memset(object, 0, s->object_size);
+
+			c->freelist = get_freepointer(s, object);
+
 		}
 		c->tid = next_tid(c->tid);


* Re: slub bulk alloc: Extract objects from the per cpu slab
From: Andrew Morton @ 2015-04-09 20:19 UTC
  To: Christoph Lameter
  Cc: brouer, Joonsoo Kim, Pekka Enberg, David Rientjes, linux-mm

On Thu, 9 Apr 2015 09:03:24 -0500 (CDT) Christoph Lameter <cl@linux.com> wrote:

> On Wed, 8 Apr 2015, Andrew Morton wrote:
> 
> > On Wed, 8 Apr 2015 13:13:29 -0500 (CDT) Christoph Lameter <cl@linux.com> wrote:
> >
> > > First piece: acceleration of retrieval of per cpu objects
> > >
> > >
> > > If we are allocating lots of objects then it is advantageous to
> > > disable interrupts and avoid the this_cpu_cmpxchg() operation to
> > > get these objects faster. Note that we cannot do the fast operation
> > > if debugging is enabled.
> >
> > Why can't we do it if debugging is enabled?
> 
> We would have to add extra code to do all the debugging checks. And it
> would not be fast anyway.

I updated the changelog to reflect this.

> > > Allocate as many objects as possible in the fast way and then fall
> > > back to the generic implementation for the rest of the objects.
> >
> > Seems sane.  What's the expected success rate of the initial bulk
> > allocation attempt?
> 
> This is going to increase as we add more capabilities. I have a second
> patch here that extends the fast allocation to the per cpu partial pages.

Yes, but what is the expected success rate of the initial bulk
allocation attempt?  If it's 1% then perhaps there's no point in doing
it.

> > > +		c->tid = next_tid(c->tid);
> > > +
> > > +		local_irq_enable();
> > > +	}
> > > +
> > > +	return __kmem_cache_alloc_bulk(s, flags, size, p);
> >
> > This kmem_cache_cpu.tid logic is a bit opaque.  The low-level
> > operations seem reasonably well documented but I couldn't find anywhere
> > which tells me how it all actually works - what is "disambiguation
> > during cmpxchg" and how do we achieve it?
> 
> This is used to force a retry in slab_alloc_node() if preemption occurs
> there. We are modifying the per cpu state, thus a retry must be forced.

No, I'm not referring to this patch.  I'm referring to the overall
design concept behind kmem_cache_cpu.tid.  This patch made me go and
look, and it's a bit of a head-scratcher.  It's unobvious and doesn't
appear to be documented in any central place.  Perhaps it's in a
changelog, but who has time for that?

A comment somewhere which describes the concept is needed.

> > I'm in two minds about putting
> > slab-infrastructure-for-bulk-object-allocation-and-freeing-v3.patch and
> > slub-bulk-alloc-extract-objects-from-the-per-cpu-slab.patch into 4.1.
> > They're standalone (ie: no in-kernel callers!) hence harmless, and
> > merging them will make Jesper's life a bit easier.  But otoh they are
> > unproven and have no in-kernel callers, so formally they shouldn't be
> > merged yet.  I suppose we can throw them away again if things don't
> > work out.
> 
> Can we keep them in -next and I will add patches as we go forward? There
> was already a lot of discussion before and I would like to go
> incrementally adding methods to do bulk extraction from the various
> control structures that we have holding objects.

Keeping them in -next is not a problem - I was wondering about when to
start moving the code into mainline.  


* Re: slub bulk alloc: Extract objects from the per cpu slab
From: Christoph Lameter @ 2015-04-11  2:19 UTC
  To: Andrew Morton; +Cc: brouer, Joonsoo Kim, Pekka Enberg, David Rientjes, linux-mm

On Thu, 9 Apr 2015, Andrew Morton wrote:

> > This is going to increase as we add more capabilities. I have a second
> > patch here that extends the fast allocation to the per cpu partial pages.
>
> Yes, but what is the expected success rate of the initial bulk
> allocation attempt?  If it's 1% then perhaps there's no point in doing
> it.

After we have extracted objects from all the structures around, we can
also go directly to the page allocator if we want and bypass lots of the
metadata processing. So we will ultimately end up with a 100% success
rate.

> > > This kmem_cache_cpu.tid logic is a bit opaque.  The low-level
> > > operations seem reasonably well documented but I couldn't find anywhere
> > > which tells me how it all actually works - what is "disambiguation
> > > during cmpxchg" and how do we achieve it?
> >
> > This is used to force a retry in slab_alloc_node() if preemption occurs
> > there. We are modifying the per cpu state, thus a retry must be forced.
>
> No, I'm not referring to this patch.  I'm referring to the overall
> design concept behind kmem_cache_cpu.tid.  This patch made me go and
> look, and it's a bit of a head-scratcher.  It's unobvious and doesn't
> appear to be documented in any central place.  Perhaps it's in a
> changelog, but who has time for that?

The tid logic is documented somewhat in mm/slub.c. Line 1749 and
following.

> Keeping them in -next is not a problem - I was wondering about when to
> start moving the code into mainline.

When Mr. Brouer has confirmed that the stuff actually does some good for
his issue.


* Re: slub bulk alloc: Extract objects from the per cpu slab
From: Jesper Dangaard Brouer @ 2015-04-11  7:25 UTC
  To: Christoph Lameter
  Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes,
	linux-mm, brouer


On Fri, 10 Apr 2015 21:19:06 -0500 (CDT) Christoph Lameter <cl@linux.com> wrote:
> On Thu, 9 Apr 2015, Andrew Morton wrote:
> 
[...]
> > Keeping them in -next is not a problem - I was wondering about when to
> > start moving the code into mainline.
> 
> When Mr. Brouer has confirmed that the stuff actually does some good for
> his issue.

I plan to pick up work on this from Monday. (As Christoph already
knows, I've just moved back to Denmark from New Zealand.)

I'll start with micro benchmarking, to make sure bulk-alloc is faster
than normal-alloc.  Once we/I have some framework, we can more easily
compare the different optimizations that Christoph is planning.

The interesting step for me is using this in the networking stack.

For real use-cases, like IP-forwarding, my experience tells me that the
added code size can easily reduce the performance gain, because of more
instruction-cache misses.  Fortunately bulk-alloc is called fewer times,
which amortizes these icache misses, but it is still something we need
to be aware of, as it will not show up in micro benchmarking.

ps. Thanks for the work guys! :-)
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: slub: bulk allocation from per cpu partial pages
From: Jesper Dangaard Brouer @ 2015-04-16 12:06 UTC
  To: Christoph Lameter
  Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes,
	linux-mm, brouer

On Thu, 9 Apr 2015 12:16:23 -0500 (CDT)
Christoph Lameter <cl@linux.com> wrote:

> Next step: cover all of the per cpu objects available.
> 
> 
> Expand the bulk allocation support to drain the per cpu partial
> pages while interrupts are off.

Started my micro benchmarking.

On a CPU E5-2630 @ 2.30GHz, the cost of kmem_cache_alloc +
kmem_cache_free in a tight loop (the most optimal fast-path) is 22ns,
with an elem size of 256 bytes, where slab chooses to make 32 obj-per-slab.

With this patch, testing different bulk sizes, the cost of alloc+free
per element is improved for small bulk sizes (which I guess is the
expected outcome).

To have something to compare against, I also ran the bulk sizes through
the fallback versions __kmem_cache_alloc_bulk() and
__kmem_cache_free_bulk(), i.e. the non-optimized versions.

 size    --  optimized -- fallback
 bulk  8 --  15ns      --  22ns
 bulk 16 --  15ns      --  22ns
 bulk 30 --  44ns      --  48ns
 bulk 32 --  47ns      --  50ns
 bulk 64 --  52ns      --  54ns

For smaller bulk sizes 8 and 16, this is actually a significant
improvement, especially considering the free side is not optimized.

Thus, the 7ns improvement must come from the alloc side only.
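
The core of the measurement is essentially this (a simplified sketch of
my test module; the timing instrumentation around the loop is elided,
and "cachep", "bulk" and "objs" are local test variables):

	for (i = 0; i < loops; i++) {
		/* Fill objs[0..bulk-1], stop measuring on failure */
		if (!kmem_cache_alloc_bulk(cachep, GFP_KERNEL, bulk, objs))
			break;
		kmem_cache_free_bulk(cachep, bulk, objs);
	}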


> Signed-off-by: Christoph Lameter <cl@linux.com>
> 
> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c
> +++ linux/mm/slub.c
> @@ -2771,15 +2771,45 @@ bool kmem_cache_alloc_bulk(struct kmem_c
>  		while (size) {
>  			void *object = c->freelist;
> 
> -			if (!object)
> -				break;
> +			if (unlikely(!object)) {
> +				/*
> +				 * Check if there are remotely freed
> +				 * objects available in the page.
> +				 */
> +				object = get_freelist(s, c->page);
> +
> +				if (!object) {
> +					/*
> +					 * All objects in use, let's check if
> +					 * we have other per cpu partial
> +					 * pages that have available
> +					 * objects.
> +					 */
> +					c->page = c->partial;
> +					if (!c->page) {
> +						/* No per cpu objects left */
> +						c->freelist = NULL;
> +						break;
> +					}
> +
> +					/* Next per cpu partial page */
> +					c->partial = c->page->next;
> +					c->freelist = get_freelist(s,
> +							c->page);
> +					continue;
> +				}
> +
> +			}
> +
> 
> -			c->freelist = get_freepointer(s, object);
>  			*p++ = object;
>  			size--;
> 
>  			if (unlikely(flags & __GFP_ZERO))
>  				memset(object, 0, s->object_size);
> +
> +			c->freelist = get_freepointer(s, object);
> +
>  		}
>  		c->tid = next_tid(c->tid);
> 



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: slub: bulk allocation from per cpu partial pages
From: Christoph Lameter @ 2015-04-16 15:54 UTC
  To: Jesper Dangaard Brouer
  Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes, linux-mm

On Thu, 16 Apr 2015, Jesper Dangaard Brouer wrote:

> On a CPU E5-2630 @ 2.30GHz, the cost of kmem_cache_alloc +
> kmem_cache_free in a tight loop (the most optimal fast-path) is 22ns,
> with an elem size of 256 bytes, where slab chooses to make 32 obj-per-slab.
>
> With this patch, testing different bulk sizes, the cost of alloc+free
> per element is improved for small bulk sizes (which I guess is the
> expected outcome).
>
> To have something to compare against, I also ran the bulk sizes through
> the fallback versions __kmem_cache_alloc_bulk() and
> __kmem_cache_free_bulk(), i.e. the non-optimized versions.
>
>  size    --  optimized -- fallback
>  bulk  8 --  15ns      --  22ns
>  bulk 16 --  15ns      --  22ns

Good.

>  bulk 30 --  44ns      --  48ns
>  bulk 32 --  47ns      --  50ns
>  bulk 64 --  52ns      --  54ns

Hmm.... We are hitting the atomics I guess... What you got so far is only
using the per cpu data. I wonder how many partial pages are available
there and how much is satisfied from which per cpu structure. There are a
couple of cmpxchg_doubles in the optimized patch to squeeze even the last
object out of the pages before going to the next. I could avoid those
and simply rotate to another per cpu partial page instead.
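
(For context, get_freelist() detaches a page's freelist with one of
those atomics; roughly, simplified from mm/slub.c:

	do {
		freelist = page->freelist;
		counters = page->counters;
		new.counters = counters;
		new.inuse = page->objects;
		new.frozen = freelist != NULL;
	} while (!__cmpxchg_double_slab(s, page,
			freelist, counters,
			NULL, new.counters,
			"get_freelist"));

so every refill from a page costs at least one double-word cmpxchg.)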

I have got some more code here that deals with per node partials, but at
that point we will be taking spinlocks.

> For smaller bulk sizes 8 and 16, this is actually a significant
> improvement, especially considering the free side is not optimized.

I have some draft code here to do the same for the free side. But I
thought we better get to some working code on the free side first.


* Re: slub: bulk allocation from per cpu partial pages
From: Jesper Dangaard Brouer @ 2015-04-17  5:44 UTC
  To: Christoph Lameter
  Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes,
	linux-mm, brouer

On Thu, 16 Apr 2015 10:54:07 -0500 (CDT)
Christoph Lameter <cl@linux.com> wrote:

> On Thu, 16 Apr 2015, Jesper Dangaard Brouer wrote:
> 
> > On a CPU E5-2630 @ 2.30GHz, the cost of kmem_cache_alloc +
> > kmem_cache_free in a tight loop (the most optimal fast-path) is 22ns,
> > with an elem size of 256 bytes, where slab chooses to make 32 obj-per-slab.
> >
> > With this patch, testing different bulk sizes, the cost of alloc+free
> > per element is improved for small bulk sizes (which I guess is the
> > expected outcome).
> >
> > To have something to compare against, I also ran the bulk sizes through
> > the fallback versions __kmem_cache_alloc_bulk() and
> > __kmem_cache_free_bulk(), i.e. the non-optimized versions.
> >
> >  size    --  optimized -- fallback
> >  bulk  8 --  15ns      --  22ns
> >  bulk 16 --  15ns      --  22ns
> 
> Good.
> 
> >  bulk 30 --  44ns      --  48ns
> >  bulk 32 --  47ns      --  50ns
> >  bulk 64 --  52ns      --  54ns
> 
> Hmm.... We are hitting the atomics I guess... What you got so far is only
> using the per cpu data. I wonder how many partial pages are available

Oops, I can see that this kernel doesn't have CONFIG_SLUB_CPU_PARTIAL;
I'll re-run tests with this enabled.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: slub: bulk allocation from per cpu partial pages
From: Jesper Dangaard Brouer @ 2015-04-17  6:06 UTC
  To: Christoph Lameter
  Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes,
	linux-mm, brouer

On Fri, 17 Apr 2015 07:44:46 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Thu, 16 Apr 2015 10:54:07 -0500 (CDT)
> Christoph Lameter <cl@linux.com> wrote:
> 
> > On Thu, 16 Apr 2015, Jesper Dangaard Brouer wrote:
> > 
> > > On a CPU E5-2630 @ 2.30GHz, the cost of kmem_cache_alloc +
> > > kmem_cache_free in a tight loop (the most optimal fast-path) is 22ns,
> > > with an elem size of 256 bytes, where slab chooses to make 32 obj-per-slab.
> > >
> > > With this patch, testing different bulk sizes, the cost of alloc+free
> > > per element is improved for small bulk sizes (which I guess is the
> > > expected outcome).
> > >
> > > To have something to compare against, I also ran the bulk sizes through
> > > the fallback versions __kmem_cache_alloc_bulk() and
> > > __kmem_cache_free_bulk(), i.e. the non-optimized versions.
> > >
> > >  size    --  optimized -- fallback
> > >  bulk  8 --  15ns      --  22ns
> > >  bulk 16 --  15ns      --  22ns
> > 
> > Good.
> > 
> > >  bulk 30 --  44ns      --  48ns
> > >  bulk 32 --  47ns      --  50ns
> > >  bulk 64 --  52ns      --  54ns
> > 
> > Hmm.... We are hitting the atomics I guess... What you got so far is only
> > using the per cpu data. I wonder how many partial pages are available
> 
> Oops, I can see that this kernel doesn't have CONFIG_SLUB_CPU_PARTIAL;
> I'll re-run tests with this enabled.

Results with CONFIG_SLUB_CPU_PARTIAL.

 size    --  optimized -- fallback
 bulk  8 --  16ns      -- 22ns
 bulk 16 --  16ns      -- 22ns
 bulk 30 --  16ns      -- 22ns
 bulk 32 --  16ns      -- 22ns
 bulk 64 --  30ns      -- 38ns

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: slub: bulk allocation from per cpu partial pages
From: Christoph Lameter @ 2015-04-30 18:40 UTC
  To: Jesper Dangaard Brouer
  Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes, linux-mm

On Fri, 17 Apr 2015, Jesper Dangaard Brouer wrote:

> > Oops, I can see that this kernel doesn't have CONFIG_SLUB_CPU_PARTIAL;
> > I'll re-run tests with this enabled.
>
> Results with CONFIG_SLUB_CPU_PARTIAL.
>
>  size    --  optimized -- fallback
>  bulk  8 --  16ns      -- 22ns
>  bulk 16 --  16ns      -- 22ns
>  bulk 30 --  16ns      -- 22ns
>  bulk 32 --  16ns      -- 22ns
>  bulk 64 --  30ns      -- 38ns

That looks better. Can I get the code for testing? Then I could vary the
approach a bit before posting patches. I still want to add a fast path for
allocation from the per node partial list.
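
That would be something along these lines (an unposted sketch, not an
actual patch; it assumes interrupts are already off in the bulk path, so
a plain spin_lock on the node's list_lock suffices):

	struct kmem_cache_node *n = get_node(s, numa_mem_id());
	struct page *page, *page2;

	spin_lock(&n->list_lock);
	list_for_each_entry_safe(page, page2, &n->partial, lru) {
		/* detach objects from each partial page into p[] ... */
		if (!size)
			break;
	}
	spin_unlock(&n->list_lock);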


* Re: slub: bulk allocation from per cpu partial pages
From: Jesper Dangaard Brouer @ 2015-04-30 19:20 UTC
  To: Christoph Lameter
  Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes,
	linux-mm, brouer

On Thu, 30 Apr 2015 13:40:58 -0500 (CDT)
Christoph Lameter <cl@linux.com> wrote:

> On Fri, 17 Apr 2015, Jesper Dangaard Brouer wrote:
> 
> > > Oops, I can see that this kernel doesn't have CONFIG_SLUB_CPU_PARTIAL;
> > > I'll re-run tests with this enabled.
> >
> > Results with CONFIG_SLUB_CPU_PARTIAL.
> >
> >  size    --  optimized -- fallback
> >  bulk  8 --  16ns      -- 22ns
> >  bulk 16 --  16ns      -- 22ns
> >  bulk 30 --  16ns      -- 22ns
> >  bulk 32 --  16ns      -- 22ns
> >  bulk 64 --  30ns      -- 38ns
> 
> That looks better. Can I get the code for testing? Then I could vary the
> approach a bit before posting patches. I still want to add a fast path for
> allocation from the per node partial list.

Sure, you can get the code.  For now the test is fairly simple; I will
expand it later. I have made a branch "mm_bulk_api" to avoid people
using my repo getting compile errors (due to the API not being merged yet).

Git repo [1], branch "mm_bulk_api":
 [1] https://github.com/netoptimizer/prototype-kernel/

The test kernel module is called "slab_bulk_test01", located under
kernel/mm/slab_bulk_test01.c [2].


[2] https://github.com/netoptimizer/prototype-kernel/blob/mm_bulk_api/kernel/mm/slab_bulk_test01.c

How to use the repo [3]:
[3] http://netoptimizer.blogspot.dk/2014/11/announce-github-repo-prototype-kernel.html

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
