From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Christoph Lameter <cl@linux.com>
Cc: linux-mm@kvack.org, Vladimir Davydov <vdavydov@virtuozzo.com>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	brouer@redhat.com
Subject: Re: [RFC PATCH V2 8/9] slab: implement bulk free in SLAB allocator
Date: Tue, 15 Dec 2015 13:02:35 +0100	[thread overview]
Message-ID: <20151215130235.10e92367@redhat.com> (raw)
In-Reply-To: <20151214161958.1a8edf79@redhat.com>


On Mon, 14 Dec 2015 16:19:58 +0100 Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Wed, 9 Dec 2015 10:06:39 -0600 (CST) Christoph Lameter <cl@linux.com> wrote:
> 
> > From: Christoph Lameter <cl@linux.com>
> > Subject: slab bulk api: Remove the kmem_cache parameter from kmem_cache_bulk_free()
> > 
> > It is desirable and necessary to free objects from different kmem_caches.
> > It is required in order to support memcg object freeing across different
> > cgroups.
> > 
> > So drop the pointless parameter and allow freeing of arbitrary lists
> > of slab allocated objects.
> > 
> > This patch also does the proper compound page handling so that
> > arbitrary objects allocated via kmalloc() can be handled by
> > kmem_cache_bulk_free().
> > 
> > Signed-off-by: Christoph Lameter <cl@linux.com>
> 
> I've modified this patch to instead introduce a kfree_bulk() and keep
> the old behavior of kmem_cache_free_bulk().  This allows us to more
> easily compare the two implementation approaches.
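
For reference, a minimal usage sketch of the proposed kfree_bulk() API
(kernel context; the example function and the sizes are made up purely
for illustration):

#include <linux/kernel.h>
#include <linux/slab.h>

static void example_bulk_free(void)
{
	void *objs[3];

	objs[0] = kmalloc(64, GFP_KERNEL);
	objs[1] = kmalloc(256, GFP_KERNEL);
	objs[2] = kmalloc(16384, GFP_KERNEL);	/* may be a compound-page alloc */

	/* A real user would check the allocations for NULL first */
	kfree_bulk(ARRAY_SIZE(objs), objs);
}

The point is that the objects can come from different kmalloc size
classes (or even the page allocator), which kmem_cache_free_bulk()
with a fixed kmem_cache pointer cannot handle.
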
> 
> [...]
> > Index: linux/mm/slub.c
> > ===================================================================
> > --- linux.orig/mm/slub.c
> > +++ linux/mm/slub.c
> > @@ -2887,23 +2887,30 @@ static int build_detached_freelist(struc
> > 
> > 
> >  /* Note that interrupts must be enabled when calling this function. */
> > -void kmem_cache_free_bulk(struct kmem_cache *orig_s, size_t size, void **p) 
> > +void kmem_cache_free_bulk(size_t size, void **p)
> 
> Renamed to kfree_bulk(size_t size, void **p)
> 
> >  {
> >  	if (WARN_ON(!size))
> >  		return;
> > 
> >  	do {
> >  		struct detached_freelist df;
> > -		struct kmem_cache *s;
> > +		struct page *page;
> > 
> > -		/* Support for memcg */
> > -		s = cache_from_obj(orig_s, p[size - 1]);
> > +		page = virt_to_head_page(p[size - 1]);
> > 
> > -		size = build_detached_freelist(s, size, p, &df);
> > +		if (unlikely(!PageSlab(page))) {
> > +			BUG_ON(!PageCompound(page));
> > +			kfree_hook(p[size - 1]);
> > +			__free_kmem_pages(page, compound_order(page));
> > +			p[--size] = NULL;
> > +			continue;
> > +		}
> > +
> > +		size = build_detached_freelist(page->slab_cache, size, p, &df);
> > 		if (unlikely(!df.page))
> >  			continue;
> > 
> > -		slab_free(s, df.page, df.freelist, df.tail, df.cnt, _RET_IP_);
> > +		slab_free(page->slab_cache, df.page, df.freelist,
> > + 			df.tail, df.cnt, _RET_IP_);
> > 	} while (likely(size));
> >  }
> >  EXPORT_SYMBOL(kmem_cache_free_bulk);
> 
> This specific implementation was too slow, mostly because we call
> virt_to_head_page() both in this function and inside build_detached_freelist().
> 
> After integrating this directly into build_detached_freelist() I'm
> getting more comparable results.
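
To make the redundancy concrete, a hedged illustration (not the actual
code; the helper below is invented, virt_to_head_page() and PageSlab()
are the real primitives):

#include <linux/mm.h>

/* The free loop above resolved the head page once, to tell kmalloc'ed
 * compound pages from slab pages ...
 */
static bool object_is_slab_backed(const void *object)
{
	struct page *page = virt_to_head_page(object);

	return PageSlab(page);
}
/* ... and build_detached_freelist() then resolved the same object's
 * head page a second time for df->page.  Folding the check into
 * build_detached_freelist() (as in the patch below the signature)
 * leaves a single virt_to_head_page() per object.
 */
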
> 
> Results with disabled CONFIG_MEMCG_KMEM, and SLUB (and slab_nomerge for
> more accurate results between runs):
> 
>  bulk- fallback          - kmem_cache_free_bulk - kfree_bulk
>  1 - 58 cycles 14.735 ns -  55 cycles 13.880 ns -  59 cycles 14.843 ns
>  2 - 53 cycles 13.298 ns -  32 cycles  8.037 ns -  34 cycles 8.592 ns
>  3 - 51 cycles 12.837 ns -  25 cycles  6.442 ns -  27 cycles 6.794 ns
>  4 - 50 cycles 12.514 ns -  23 cycles  5.952 ns -  23 cycles 5.958 ns
>  8 - 48 cycles 12.097 ns -  20 cycles  5.160 ns -  22 cycles 5.505 ns
>  16 - 47 cycles 11.888 ns -  19 cycles 4.900 ns -  19 cycles 4.969 ns
>  30 - 47 cycles 11.793 ns -  18 cycles 4.688 ns -  18 cycles 4.682 ns
>  32 - 47 cycles 11.926 ns -  18 cycles 4.674 ns -  18 cycles 4.702 ns
>  34 - 95 cycles 23.823 ns -  24 cycles 6.068 ns -  24 cycles 6.058 ns
>  48 - 81 cycles 20.258 ns -  21 cycles 5.360 ns -  21 cycles 5.338 ns
>  64 - 73 cycles 18.414 ns -  20 cycles 5.160 ns -  20 cycles 5.140 ns
>  128 - 90 cycles 22.563 ns -  27 cycles 6.765 ns -  27 cycles 6.801 ns
>  158 - 99 cycles 24.831 ns -  30 cycles 7.625 ns -  30 cycles 7.720 ns
>  250 - 104 cycles 26.173 ns -  37 cycles 9.271 ns -  37 cycles 9.371 ns
> 
> As can be seen, the old kmem_cache_free_bulk() is faster than the new
> kfree_bulk() (which omits the kmem_cache pointer and needs to derive it
> from page->slab_cache). The base (bulk=1) extra cost is 4 cycles,
> which then gets amortized as build_detached_freelist() combines objects
> belonging to the same page.
> 
> This is likely because the compiler, with CONFIG_MEMCG_KMEM=n, can
> optimize out the lookup of the kmem_cache structure.
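
For context, a simplified sketch of the cache_from_obj() logic referred
to above (the name cache_from_obj_sketch is hypothetical and debug-flag
handling is omitted; this is not the exact mm/slab.h code):

#include <linux/memcontrol.h>
#include <linux/mm.h>

static inline struct kmem_cache *cache_from_obj_sketch(struct kmem_cache *s,
							void *x)
{
	if (!memcg_kmem_enabled())	/* constant false with CONFIG_MEMCG_KMEM=n */
		return s;		/* page lookup below is compiled away */

	return virt_to_head_page(x)->slab_cache;
}

With the memcg path compiled out, kmem_cache_free_bulk() can simply use
the caller-supplied pointer, while kfree_bulk() always has to go through
page->slab_cache.
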
> 
> I'll start doing testing with CONFIG_MEMCG_KMEM enabled...

Enabling CONFIG_MEMCG_KMEM didn't change the performance, likely
because memcg_kmem_enabled() uses a static_key_false() construct.
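
For reference, the gating pattern looks like the sketch below (the key
and helper names are hypothetical, not the memcg code; only the
static_key_false() mechanism is the point):

#include <linux/jump_label.h>

static struct static_key demo_key = STATIC_KEY_INIT_FALSE;

static inline bool demo_feature_enabled(void)
{
	/* Compiles to a NOP in the fastpath; runtime-patched into a
	 * jump only after static_key_slow_inc(&demo_key) enables it.
	 */
	return static_key_false(&demo_key);
}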

Implemented kfree_bulk() as an inline function that just calls
kmem_cache_free_bulk(NULL, size, p), and ran with CONFIG_MEMCG_KMEM=y
(but no "user" of memcg):

 bulk- fallback          - kmem_cache_free_bulk - kmem_cache_free_bulk(s=NULL)
 1 - 60 cycles 15.232 ns -  52 cycles 13.124 ns -  58 cycles 14.562 ns
 2 - 54 cycles 13.678 ns -  31 cycles 7.875 ns -  34 cycles 8.509 ns
 3 - 52 cycles 13.112 ns -  25 cycles 6.406 ns -  28 cycles 7.045 ns
 4 - 51 cycles 12.833 ns -  24 cycles 6.036 ns -  24 cycles 6.076 ns
 8 - 49 cycles 12.367 ns -  21 cycles 5.404 ns -  22 cycles 5.522 ns
 16 - 48 cycles 12.144 ns -  19 cycles 4.934 ns -  19 cycles 4.967 ns
 30 - 50 cycles 12.542 ns -  18 cycles 4.685 ns -  18 cycles 4.687 ns
 32 - 48 cycles 12.234 ns -  18 cycles 4.661 ns -  18 cycles 4.662 ns
 34 - 96 cycles 24.204 ns -  24 cycles 6.068 ns -  24 cycles 6.067 ns
 48 - 82 cycles 20.553 ns -  21 cycles 5.410 ns -  21 cycles 5.433 ns
 64 - 74 cycles 18.743 ns -  20 cycles 5.171 ns -  20 cycles 5.179 ns
 128 - 91 cycles 22.791 ns -  26 cycles 6.601 ns -  26 cycles 6.600 ns
 158 - 100 cycles 25.191 ns -  30 cycles 7.569 ns -  30 cycles 7.617 ns
 250 - 106 cycles 26.535 ns -  37 cycles 9.426 ns -  37 cycles 9.427 ns

As can be seen, it is still more costly to derive the kmem_cache pointer
from the object's page->slab_cache, even when compiled with
CONFIG_MEMCG_KMEM=y.


But what happens when enabling a "user" of kmemcg, which flips the
static_key_false() branch in memcg_kmem_enabled()?

First, notice that the normal kmem fastpath cost increased: 70 cycles(tsc) 17.640 ns

 bulk- fallback          - kmem_cache_free_bulk - kmem_cache_free_bulk(s=NULL)
 1 - 85 cycles 21.351 ns -  76 cycles 19.086 ns -  74 cycles 18.701 ns
 2 - 79 cycles 19.907 ns -  43 cycles 10.848 ns -  41 cycles 10.467 ns
 3 - 77 cycles 19.310 ns -  32 cycles 8.132 ns -  31 cycles 7.880 ns
 4 - 76 cycles 19.017 ns -  28 cycles 7.195 ns -  28 cycles 7.050 ns
 8 - 77 cycles 19.443 ns -  23 cycles 5.838 ns -  23 cycles 5.782 ns
 16 - 75 cycles 18.815 ns -  20 cycles 5.174 ns -  20 cycles 5.127 ns
 30 - 74 cycles 18.719 ns -  19 cycles 4.870 ns -  19 cycles 4.816 ns
 32 - 75 cycles 18.917 ns -  19 cycles 4.850 ns -  19 cycles 4.800 ns
 34 - 119 cycles 29.878 ns -  25 cycles 6.312 ns -  25 cycles 6.282 ns
 48 - 106 cycles 26.536 ns -  21 cycles 5.471 ns -  21 cycles 5.448 ns
 64 - 98 cycles 24.635 ns -  20 cycles 5.239 ns -  20 cycles 5.224 ns
 128 - 114 cycles 28.586 ns -  27 cycles 6.768 ns -  26 cycles 6.736 ns
 158 - 124 cycles 31.173 ns -  30 cycles 7.668 ns -  30 cycles 7.611 ns
 250 - 129 cycles 32.336 ns -  37 cycles 9.460 ns -  37 cycles 9.468 ns

With kmemcg enabled at runtime, I was expecting to see a bigger win for
the case of kmem_cache_free_bulk(s=NULL).

Implemented a function kfree_bulk_export(), which inlines build_detached_freelist()
and avoids including the memcg code path.  But the result is almost the same:

 bulk- fallback          - kmem_cache_free_bulk - kfree_bulk_export
 1 - 84 cycles 21.221 ns -  77 cycles 19.324 ns -  74 cycles 18.515 ns
 2 - 79 cycles 19.963 ns -  44 cycles 11.120 ns -  41 cycles 10.329 ns
 3 - 77 cycles 19.471 ns -  33 cycles 8.291 ns -  31 cycles 7.771 ns
 4 - 76 cycles 19.191 ns -  28 cycles 7.100 ns -  27 cycles 6.765 ns
 8 - 78 cycles 19.525 ns -  23 cycles 5.781 ns -  22 cycles 5.737 ns
 16 - 75 cycles 18.922 ns -  20 cycles 5.152 ns -  20 cycles 5.092 ns
 30 - 75 cycles 18.812 ns -  19 cycles 4.864 ns -  19 cycles 4.833 ns
 32 - 75 cycles 18.807 ns -  19 cycles 4.852 ns -  23 cycles 5.896 ns
 34 - 119 cycles 29.989 ns -  25 cycles 6.258 ns -  25 cycles 6.411 ns
 48 - 106 cycles 26.677 ns -  28 cycles 7.182 ns -  22 cycles 5.651 ns
 64 - 98 cycles 24.716 ns -  20 cycles 5.245 ns -  21 cycles 5.425 ns
 128 - 115 cycles 28.905 ns -  26 cycles 6.748 ns -  27 cycles 6.787 ns
 158 - 124 cycles 31.039 ns -  30 cycles 7.604 ns -  30 cycles 7.629 ns
 250 - 129 cycles 32.372 ns -  37 cycles 9.475 ns -  37 cycles 9.325 ns

I took a closer look with perf, and it looks like the extra cost from
enabling kmemcg originates from the alloc code path.  Specifically, the
call to get_mem_cgroup_from_mm() (9.95%) is the costly part, plus
__memcg_kmem_get_cache() (4.84%) and __memcg_kmem_put_cache() (1.35%).
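
Roughly, the alloc-side hook behaves like the stub below once a kmemcg
user exists (the function name is invented and the body is deliberately
stubbed out; the calls named in the comments are the ones from the
profile):

#include <linux/memcontrol.h>

static inline struct kmem_cache *memcg_alloc_hook_sketch(struct kmem_cache *s)
{
	if (!memcg_kmem_enabled())
		return s;	/* static-key NOP until a kmemcg user exists */

	/* With the key enabled, the real hook goes on to call
	 * get_mem_cgroup_from_mm() (the 9.95% above) and
	 * __memcg_kmem_get_cache() (4.84%) to pick the per-memcg cache,
	 * with __memcg_kmem_put_cache() (1.35%) after the allocation.
	 */
	return s;	/* per-memcg cache selection elided in this sketch */
}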


(Added the current code as a patch below the signature)
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

[PATCH] New API kfree_bulk()

---
 include/linux/slab.h |   14 +++++++++++++
 mm/slab_common.c     |    8 ++++++--
 mm/slub.c            |   52 +++++++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 67 insertions(+), 7 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 2037a861e367..9dd353215c2b 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -317,6 +317,20 @@ void kmem_cache_free(struct kmem_cache *, void *);
  */
 void kmem_cache_free_bulk(struct kmem_cache *, size_t, void **);
 int kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, size_t, void **);
+void kfree_bulk_export(size_t, void **);
+
+static __always_inline void kfree_bulk(size_t size, void **p)
+{
+	/* Reusing the call to kmem_cache_free_bulk() allows kfree_bulk()
+	 * to share the same instruction cache (icache).
+	 */
+	kmem_cache_free_bulk(NULL, size, p);
+	/*
+	 * Inlining in kfree_bulk_export() generates smaller code
+	 * than the more generic kmem_cache_free_bulk().
+	 */
+	// kfree_bulk_export(size, p);
+}
 
 #ifdef CONFIG_NUMA
 void *__kmalloc_node(size_t size, gfp_t flags, int node) __assume_kmalloc_alignment;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 3c6a86b4ec25..963c25589949 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -108,8 +108,12 @@ void __kmem_cache_free_bulk(struct kmem_cache *s, size_t nr, void **p)
 {
 	size_t i;
 
-	for (i = 0; i < nr; i++)
-		kmem_cache_free(s, p[i]);
+	for (i = 0; i < nr; i++) {
+		if (s)
+			kmem_cache_free(s, p[i]);
+		else
+			kfree(p[i]);
+	}
 }
 
 int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
diff --git a/mm/slub.c b/mm/slub.c
index 4e4dfc4cdd0c..1675f42f17b5 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2833,29 +2833,46 @@ struct detached_freelist {
  * synchronization primitive.  Look ahead in the array is limited due
  * to performance reasons.
  */
-static int build_detached_freelist(struct kmem_cache **s, size_t size,
-				   void **p, struct detached_freelist *df)
+static inline
+int build_detached_freelist(struct kmem_cache **s, size_t size,
+			    void **p, struct detached_freelist *df)
 {
 	size_t first_skipped_index = 0;
 	int lookahead = 3;
 	void *object;
+	struct page *page;
 
 	/* Always re-init detached_freelist */
 	df->page = NULL;
 
 	do {
 		object = p[--size];
+		/* Do we need !ZERO_OR_NULL_PTR(object) here? (for kfree) */
 	} while (!object && size);
 
 	if (!object)
 		return 0;
 
-	/* Support for memcg, compiler can optimize this out */
-	*s = cache_from_obj(*s, object);
+	page = virt_to_head_page(object);
+	if (!*s) {
+		/* Handle kalloc'ed objects */
+		if (unlikely(!PageSlab(page))) {
+			BUG_ON(!PageCompound(page));
+			kfree_hook(object);
+			__free_kmem_pages(page, compound_order(page));
+			p[size] = NULL; /* mark object processed */
+			return size;
+		}
+		/* Derive kmem_cache from object */
+		*s = page->slab_cache;
+	} else {
+		/* Support for memcg, compiler can optimize this out */
+		*s = cache_from_obj(*s, object);
+	}
 
 	/* Start new detached freelist */
+	df->page = page;
 	set_freepointer(*s, object, NULL);
-	df->page = virt_to_head_page(object);
 	df->tail = object;
 	df->freelist = object;
 	p[size] = NULL; /* mark object processed */
@@ -2906,6 +2923,31 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 }
 EXPORT_SYMBOL(kmem_cache_free_bulk);
 
+/* Note that interrupts must be enabled when calling this function.
+ * This function can also handle kmem_cache-allocated object arrays.
+ */
+// Will likely remove this function
+void kfree_bulk_export(size_t size, void **p)
+{
+	if (WARN_ON(!size))
+		return;
+
+	do {
+		struct detached_freelist df;
+		/* NULL here removes code in inlined build_detached_freelist */
+		struct kmem_cache *s = NULL;
+
+		size = build_detached_freelist(&s, size, p, &df);
+		if (unlikely(!df.page))
+			continue;
+
+		slab_free(s, df.page, df.freelist, df.tail, df.cnt, _RET_IP_);
+//		slab_free(df.page->slab_cache, df.page, df.freelist, df.tail,
+//			  df.cnt, _RET_IP_);
+	} while (likely(size));
+}
+EXPORT_SYMBOL(kfree_bulk_export);
+
 /* Note that interrupts must be enabled when calling this function. */
 int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 			  void **p)

--
