linux-kernel.vger.kernel.org archive mirror
* mm/slab: reduce lock contention in alloc path
@ 2016-03-28  5:26 js1304
  2016-03-28  5:26 ` [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink() js1304
                   ` (10 more replies)
  0 siblings, 11 replies; 29+ messages in thread
From: js1304 @ 2016-03-28  5:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Under concurrent allocations, SLAB can suffer heavy lock contention
because it does a lot of work while holding the node lock. This
patchset tries to shrink the critical sections to reduce that
contention. The major changes are a lockless decision to allocate
more slabs and a lockless cpu cache refill from the newly allocated slab.
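
As a rough, self-contained C sketch of the idea (all names and types
below are invented for illustration; this is not the kernel code), the
expensive work of getting a new slab and refilling the per-cpu cache
happens with no lock held, and the shared node lock is only taken for
the short step that publishes the slab:

#include <pthread.h>
#include <stdlib.h>

#define OBJS_PER_SLAB 64

struct slab { struct slab *next; void *objs[OBJS_PER_SLAB]; int active; };
struct cpu_cache { void *entry[OBJS_PER_SLAB]; int avail; };
struct node { pthread_mutex_t list_lock; struct slab *slabs_free; };

/* Get a fresh slab; no shared state is touched, so no lock is needed. */
static struct slab *grow_begin(void)
{
	struct slab *s = calloc(1, sizeof(*s));
	int i;

	if (!s)
		return NULL;
	for (i = 0; i < OBJS_PER_SLAB; i++)
		s->objs[i] = malloc(32);
	return s;
}

/* Refill the cpu cache from a slab only we can see: still no lock. */
static void refill_from_private_slab(struct cpu_cache *ac, struct slab *s,
				     int batch)
{
	while (s->active < OBJS_PER_SLAB && batch-- > 0)
		ac->entry[ac->avail++] = s->objs[s->active++];
}

/* Only the short "attach to the shared list" step takes the node lock. */
static void grow_end(struct node *n, struct slab *s)
{
	pthread_mutex_lock(&n->list_lock);
	s->next = n->slabs_free;
	n->slabs_free = s;
	pthread_mutex_unlock(&n->list_lock);
}

Before this series, the SLAB equivalents of all three steps ran under
the node lock; with the series applied, only the attach step does.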

Below are the results of the concurrent allocation/free test in the
slab allocation benchmark Christoph made a long time ago. I simplified
the output. The numbers are cycle counts for alloc/free respectively,
so lower is better.

* Before
Kmalloc N*alloc N*free(32): Average=365/806
Kmalloc N*alloc N*free(64): Average=452/690
Kmalloc N*alloc N*free(128): Average=736/886
Kmalloc N*alloc N*free(256): Average=1167/985
Kmalloc N*alloc N*free(512): Average=2088/1125
Kmalloc N*alloc N*free(1024): Average=4115/1184
Kmalloc N*alloc N*free(2048): Average=8451/1748
Kmalloc N*alloc N*free(4096): Average=16024/2048

* After
Kmalloc N*alloc N*free(32): Average=344/792
Kmalloc N*alloc N*free(64): Average=347/882
Kmalloc N*alloc N*free(128): Average=390/959
Kmalloc N*alloc N*free(256): Average=393/1067
Kmalloc N*alloc N*free(512): Average=683/1229
Kmalloc N*alloc N*free(1024): Average=1295/1325
Kmalloc N*alloc N*free(2048): Average=2513/1664
Kmalloc N*alloc N*free(4096): Average=4742/2172

It shows that performance improves greatly (roughly more than 50%)
for object classes larger than 128 bytes.

Thanks.

Joonsoo Kim (11):
  mm/slab: hold a slab_mutex when calling __kmem_cache_shrink()
  mm/slab: remove BAD_ALIEN_MAGIC again
  mm/slab: drain the free slab as much as possible
  mm/slab: factor out kmem_cache_node initialization code
  mm/slab: clean-up kmem_cache_node setup
  mm/slab: don't keep free slabs if free_objects exceeds free_limit
  mm/slab: racy access/modify the slab color
  mm/slab: make cache_grow() handle the page allocated on arbitrary node
  mm/slab: separate cache_grow() to two parts
  mm/slab: refill cpu cache through a new slab without holding a node
    lock
  mm/slab: lockless decision to grow cache

 mm/slab.c        | 495 ++++++++++++++++++++++++++++---------------------------
 mm/slab_common.c |   4 +
 2 files changed, 255 insertions(+), 244 deletions(-)

-- 
1.9.1

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink()
  2016-03-28  5:26 mm/slab: reduce lock contention in alloc path js1304
@ 2016-03-28  5:26 ` js1304
  2016-03-29  0:50   ` Christoph Lameter
  2016-03-31 10:53   ` Nikolay Borisov
  2016-03-28  5:26 ` [PATCH 02/11] mm/slab: remove BAD_ALIEN_MAGIC again js1304
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 29+ messages in thread
From: js1304 @ 2016-03-28  5:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Major kmem_cache metadata in the slab subsystem is synchronized with
the slab_mutex. In SLAB, if some of it is changed, the node's shared
array cache is freed and re-populated. If __kmem_cache_shrink() is
called at the same time, it will call drain_array() with n->shared
without holding the node lock, so a use-after-free can happen.

We could fix this small theoretical race condition by holding the node
lock in drain_array(), but holding the slab_mutex in kmem_cache_shrink()
looks like the more appropriate solution: a stable state makes things
less error-prone, and this is not a performance-critical path.

In addition, annotate the affected SLAB functions.
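
For illustration, roughly the interleaving being closed (a sketch of
the window, not a reproduced failure):

  CPU A (slabinfo write,               CPU B (kmem_cache_shrink(),
         holds slab_mutex)                    no slab_mutex today)
  ---------------------------------    ---------------------------------
                                       drain_array(cachep, n, n->shared, ...)
  free_block() old n->shared entries
  n->shared = new_shared
  kfree(old shared array)
                                       ... keeps using the old shared
                                           array after it has been freed

With kmem_cache_shrink() also taking the slab_mutex, the two paths are
serialized and n->shared stays stable while it is being drained.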

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/slab.c        | 2 ++
 mm/slab_common.c | 4 ++++
 2 files changed, 6 insertions(+)

diff --git a/mm/slab.c b/mm/slab.c
index a53a0f6..043606a 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2218,6 +2218,7 @@ static void do_drain(void *arg)
 	ac->avail = 0;
 }
 
+/* Should be called with slab_mutex to prevent from freeing shared array */
 static void drain_cpu_caches(struct kmem_cache *cachep)
 {
 	struct kmem_cache_node *n;
@@ -3871,6 +3872,7 @@ skip_setup:
  * Drain an array if it contains any elements taking the node lock only if
  * necessary. Note that the node listlock also protects the array_cache
  * if drain_array() is used on the shared array.
+ * Should be called with slab_mutex to prevent from freeing shared array.
  */
 static void drain_array(struct kmem_cache *cachep, struct kmem_cache_node *n,
 			 struct array_cache *ac, int force, int node)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index a65dad7..5bed565 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -755,7 +755,11 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
 	get_online_cpus();
 	get_online_mems();
 	kasan_cache_shrink(cachep);
+
+	mutex_lock(&slab_mutex);
 	ret = __kmem_cache_shrink(cachep, false);
+	mutex_unlock(&slab_mutex);
+
 	put_online_mems();
 	put_online_cpus();
 	return ret;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 02/11] mm/slab: remove BAD_ALIEN_MAGIC again
  2016-03-28  5:26 mm/slab: reduce lock contention in alloc path js1304
  2016-03-28  5:26 ` [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink() js1304
@ 2016-03-28  5:26 ` js1304
  2016-03-28  8:58   ` Geert Uytterhoeven
  2016-03-28 21:19   ` Andrew Morton
  2016-03-28  5:26 ` [PATCH 03/11] mm/slab: drain the free slab as much as possible js1304
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 29+ messages in thread
From: js1304 @ 2016-03-28  5:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

The initial attempt to remove BAD_ALIEN_MAGIC was reverted by
commit edcad2509550 ("Revert "slab: remove BAD_ALIEN_MAGIC"")
because it caused a problem on m68k, which has many nodes but
!CONFIG_NUMA. In that case the alien cache isn't used at all, but a
garbage value, BAD_ALIEN_MAGIC, is needed to cope with some
initialization paths. Now that this patch sets use_alien_caches to 0
when !CONFIG_NUMA, there is no initialization path problem, so we
don't need BAD_ALIEN_MAGIC at all. Remove it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/slab.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 043606a..a5a205b 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -421,8 +421,6 @@ static struct kmem_cache kmem_cache_boot = {
 	.name = "kmem_cache",
 };
 
-#define BAD_ALIEN_MAGIC 0x01020304ul
-
 static DEFINE_PER_CPU(struct delayed_work, slab_reap_work);
 
 static inline struct array_cache *cpu_cache_get(struct kmem_cache *cachep)
@@ -637,7 +635,7 @@ static int transfer_objects(struct array_cache *to,
 static inline struct alien_cache **alloc_alien_cache(int node,
 						int limit, gfp_t gfp)
 {
-	return (struct alien_cache **)BAD_ALIEN_MAGIC;
+	return NULL;
 }
 
 static inline void free_alien_cache(struct alien_cache **ac_ptr)
@@ -1205,7 +1203,7 @@ void __init kmem_cache_init(void)
 					sizeof(struct rcu_head));
 	kmem_cache = &kmem_cache_boot;
 
-	if (num_possible_nodes() == 1)
+	if (!IS_ENABLED(CONFIG_NUMA) || num_possible_nodes() == 1)
 		use_alien_caches = 0;
 
 	for (i = 0; i < NUM_INIT_LISTS; i++)
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 03/11] mm/slab: drain the free slab as much as possible
  2016-03-28  5:26 mm/slab: reduce lock contention in alloc path js1304
  2016-03-28  5:26 ` [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink() js1304
  2016-03-28  5:26 ` [PATCH 02/11] mm/slab: remove BAD_ALIEN_MAGIC again js1304
@ 2016-03-28  5:26 ` js1304
  2016-03-29  0:54   ` Christoph Lameter
  2016-03-28  5:26 ` [PATCH 04/11] mm/slab: factor out kmem_cache_node initialization code js1304
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 29+ messages in thread
From: js1304 @ 2016-03-28  5:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

slabs_tofree() just computes "enough to free every free slab". We can
get the same effect by simply passing INT_MAX to drain_freelist().

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/slab.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index a5a205b..ba2eacf 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -888,12 +888,6 @@ static int init_cache_node_node(int node)
 	return 0;
 }
 
-static inline int slabs_tofree(struct kmem_cache *cachep,
-						struct kmem_cache_node *n)
-{
-	return (n->free_objects + cachep->num - 1) / cachep->num;
-}
-
 static void cpuup_canceled(long cpu)
 {
 	struct kmem_cache *cachep;
@@ -958,7 +952,7 @@ free_slab:
 		n = get_node(cachep, node);
 		if (!n)
 			continue;
-		drain_freelist(cachep, n, slabs_tofree(cachep, n));
+		drain_freelist(cachep, n, INT_MAX);
 	}
 }
 
@@ -1110,7 +1104,7 @@ static int __meminit drain_cache_node_node(int node)
 		if (!n)
 			continue;
 
-		drain_freelist(cachep, n, slabs_tofree(cachep, n));
+		drain_freelist(cachep, n, INT_MAX);
 
 		if (!list_empty(&n->slabs_full) ||
 		    !list_empty(&n->slabs_partial)) {
@@ -2280,7 +2274,7 @@ int __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate)
 
 	check_irq_on();
 	for_each_kmem_cache_node(cachep, node, n) {
-		drain_freelist(cachep, n, slabs_tofree(cachep, n));
+		drain_freelist(cachep, n, INT_MAX);
 
 		ret += !list_empty(&n->slabs_full) ||
 			!list_empty(&n->slabs_partial);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 04/11] mm/slab: factor out kmem_cache_node initialization code
  2016-03-28  5:26 mm/slab: reduce lock contention in alloc path js1304
                   ` (2 preceding siblings ...)
  2016-03-28  5:26 ` [PATCH 03/11] mm/slab: drain the free slab as much as possible js1304
@ 2016-03-28  5:26 ` js1304
  2016-03-29  0:56   ` Christoph Lameter
  2016-03-28  5:26 ` [PATCH 05/11] mm/slab: clean-up kmem_cache_node setup js1304
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 29+ messages in thread
From: js1304 @ 2016-03-28  5:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

The kmem_cache_node initialization code can be reused elsewhere, so
factor it out. A following patch will use it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/slab.c | 68 ++++++++++++++++++++++++++++++++++++---------------------------
 1 file changed, 39 insertions(+), 29 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index ba2eacf..569d7db 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -841,6 +841,40 @@ static inline gfp_t gfp_exact_node(gfp_t flags)
 }
 #endif
 
+static int init_cache_node(struct kmem_cache *cachep, int node, gfp_t gfp)
+{
+	struct kmem_cache_node *n;
+
+	/*
+	 * Set up the kmem_cache_node for cpu before we can
+	 * begin anything. Make sure some other cpu on this
+	 * node has not already allocated this
+	 */
+	n = get_node(cachep, node);
+	if (n)
+		return 0;
+
+	n = kmalloc_node(sizeof(struct kmem_cache_node), gfp, node);
+	if (!n)
+		return -ENOMEM;
+
+	kmem_cache_node_init(n);
+	n->next_reap = jiffies + REAPTIMEOUT_NODE +
+		    ((unsigned long)cachep) % REAPTIMEOUT_NODE;
+
+	n->free_limit =
+		(1 + nr_cpus_node(node)) * cachep->batchcount + cachep->num;
+
+	/*
+	 * The kmem_cache_nodes don't come and go as CPUs
+	 * come and go.  slab_mutex is sufficient
+	 * protection here.
+	 */
+	cachep->node[node] = n;
+
+	return 0;
+}
+
 /*
  * Allocates and initializes node for a node on each slab cache, used for
  * either memory or cpu hotplug.  If memory is being hot-added, the kmem_cache_node
@@ -852,39 +886,15 @@ static inline gfp_t gfp_exact_node(gfp_t flags)
  */
 static int init_cache_node_node(int node)
 {
+	int ret;
 	struct kmem_cache *cachep;
-	struct kmem_cache_node *n;
-	const size_t memsize = sizeof(struct kmem_cache_node);
 
 	list_for_each_entry(cachep, &slab_caches, list) {
-		/*
-		 * Set up the kmem_cache_node for cpu before we can
-		 * begin anything. Make sure some other cpu on this
-		 * node has not already allocated this
-		 */
-		n = get_node(cachep, node);
-		if (!n) {
-			n = kmalloc_node(memsize, GFP_KERNEL, node);
-			if (!n)
-				return -ENOMEM;
-			kmem_cache_node_init(n);
-			n->next_reap = jiffies + REAPTIMEOUT_NODE +
-			    ((unsigned long)cachep) % REAPTIMEOUT_NODE;
-
-			/*
-			 * The kmem_cache_nodes don't come and go as CPUs
-			 * come and go.  slab_mutex is sufficient
-			 * protection here.
-			 */
-			cachep->node[node] = n;
-		}
-
-		spin_lock_irq(&n->list_lock);
-		n->free_limit =
-			(1 + nr_cpus_node(node)) *
-			cachep->batchcount + cachep->num;
-		spin_unlock_irq(&n->list_lock);
+		ret = init_cache_node(cachep, node, GFP_KERNEL);
+		if (ret)
+			return ret;
 	}
+
 	return 0;
 }
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 05/11] mm/slab: clean-up kmem_cache_node setup
  2016-03-28  5:26 mm/slab: reduce lock contention in alloc path js1304
                   ` (3 preceding siblings ...)
  2016-03-28  5:26 ` [PATCH 04/11] mm/slab: factor out kmem_cache_node initialization code js1304
@ 2016-03-28  5:26 ` js1304
  2016-03-29  0:58   ` Christoph Lameter
  2016-03-28  5:26 ` [PATCH 06/11] mm/slab: don't keep free slabs if free_objects exceeds free_limit js1304
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 29+ messages in thread
From: js1304 @ 2016-03-28  5:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

cpuup_prepare() and alloc_kmem_cache_node() contain mostly the same
code for setting up a kmem_cache_node. Factor it out and clean it up.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/slab.c | 167 +++++++++++++++++++++++++-------------------------------------
 1 file changed, 67 insertions(+), 100 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 569d7db..b96f381 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -898,6 +898,62 @@ static int init_cache_node_node(int node)
 	return 0;
 }
 
+static int setup_kmem_cache_node(struct kmem_cache *cachep,
+				int node, gfp_t gfp, bool force_change)
+{
+	int ret = -ENOMEM;
+	struct kmem_cache_node *n;
+	struct array_cache *old_shared = NULL;
+	struct array_cache *new_shared = NULL;
+	struct alien_cache **new_alien = NULL;
+	LIST_HEAD(list);
+
+	if (use_alien_caches) {
+		new_alien = alloc_alien_cache(node, cachep->limit, gfp);
+		if (!new_alien)
+			goto fail;
+	}
+
+	if (cachep->shared) {
+		new_shared = alloc_arraycache(node,
+			cachep->shared * cachep->batchcount, 0xbaadf00d, gfp);
+		if (!new_shared)
+			goto fail;
+	}
+
+	ret = init_cache_node(cachep, node, gfp);
+	if (ret)
+		goto fail;
+
+	n = get_node(cachep, node);
+	spin_lock_irq(&n->list_lock);
+	if (n->shared) {
+		free_block(cachep, n->shared->entry,
+				n->shared->avail, node, &list);
+	}
+
+	if (!n->shared || force_change) {
+		old_shared = n->shared;
+		n->shared = new_shared;
+		new_shared = NULL;
+	}
+
+	if (!n->alien) {
+		n->alien = new_alien;
+		new_alien = NULL;
+	}
+
+	spin_unlock_irq(&n->list_lock);
+	slabs_destroy(cachep, &list);
+
+fail:
+	kfree(old_shared);
+	kfree(new_shared);
+	free_alien_cache(new_alien);
+
+	return ret;
+}
+
 static void cpuup_canceled(long cpu)
 {
 	struct kmem_cache *cachep;
@@ -969,7 +1025,6 @@ free_slab:
 static int cpuup_prepare(long cpu)
 {
 	struct kmem_cache *cachep;
-	struct kmem_cache_node *n = NULL;
 	int node = cpu_to_mem(cpu);
 	int err;
 
@@ -988,44 +1043,9 @@ static int cpuup_prepare(long cpu)
 	 * array caches
 	 */
 	list_for_each_entry(cachep, &slab_caches, list) {
-		struct array_cache *shared = NULL;
-		struct alien_cache **alien = NULL;
-
-		if (cachep->shared) {
-			shared = alloc_arraycache(node,
-				cachep->shared * cachep->batchcount,
-				0xbaadf00d, GFP_KERNEL);
-			if (!shared)
-				goto bad;
-		}
-		if (use_alien_caches) {
-			alien = alloc_alien_cache(node, cachep->limit, GFP_KERNEL);
-			if (!alien) {
-				kfree(shared);
-				goto bad;
-			}
-		}
-		n = get_node(cachep, node);
-		BUG_ON(!n);
-
-		spin_lock_irq(&n->list_lock);
-		if (!n->shared) {
-			/*
-			 * We are serialised from CPU_DEAD or
-			 * CPU_UP_CANCELLED by the cpucontrol lock
-			 */
-			n->shared = shared;
-			shared = NULL;
-		}
-#ifdef CONFIG_NUMA
-		if (!n->alien) {
-			n->alien = alien;
-			alien = NULL;
-		}
-#endif
-		spin_unlock_irq(&n->list_lock);
-		kfree(shared);
-		free_alien_cache(alien);
+		err = setup_kmem_cache_node(cachep, node, GFP_KERNEL, false);
+		if (err)
+			goto bad;
 	}
 
 	return 0;
@@ -3652,72 +3672,19 @@ EXPORT_SYMBOL(kfree);
 /*
  * This initializes kmem_cache_node or resizes various caches for all nodes.
  */
-static int alloc_kmem_cache_node(struct kmem_cache *cachep, gfp_t gfp)
+static int setup_kmem_cache_node_node(struct kmem_cache *cachep, gfp_t gfp)
 {
+	int ret;
 	int node;
 	struct kmem_cache_node *n;
-	struct array_cache *new_shared;
-	struct alien_cache **new_alien = NULL;
 
 	for_each_online_node(node) {
-
-		if (use_alien_caches) {
-			new_alien = alloc_alien_cache(node, cachep->limit, gfp);
-			if (!new_alien)
-				goto fail;
-		}
-
-		new_shared = NULL;
-		if (cachep->shared) {
-			new_shared = alloc_arraycache(node,
-				cachep->shared*cachep->batchcount,
-					0xbaadf00d, gfp);
-			if (!new_shared) {
-				free_alien_cache(new_alien);
-				goto fail;
-			}
-		}
-
-		n = get_node(cachep, node);
-		if (n) {
-			struct array_cache *shared = n->shared;
-			LIST_HEAD(list);
-
-			spin_lock_irq(&n->list_lock);
-
-			if (shared)
-				free_block(cachep, shared->entry,
-						shared->avail, node, &list);
-
-			n->shared = new_shared;
-			if (!n->alien) {
-				n->alien = new_alien;
-				new_alien = NULL;
-			}
-			n->free_limit = (1 + nr_cpus_node(node)) *
-					cachep->batchcount + cachep->num;
-			spin_unlock_irq(&n->list_lock);
-			slabs_destroy(cachep, &list);
-			kfree(shared);
-			free_alien_cache(new_alien);
-			continue;
-		}
-		n = kmalloc_node(sizeof(struct kmem_cache_node), gfp, node);
-		if (!n) {
-			free_alien_cache(new_alien);
-			kfree(new_shared);
+		ret = setup_kmem_cache_node(cachep, node, gfp, true);
+		if (ret)
 			goto fail;
-		}
 
-		kmem_cache_node_init(n);
-		n->next_reap = jiffies + REAPTIMEOUT_NODE +
-				((unsigned long)cachep) % REAPTIMEOUT_NODE;
-		n->shared = new_shared;
-		n->alien = new_alien;
-		n->free_limit = (1 + nr_cpus_node(node)) *
-					cachep->batchcount + cachep->num;
-		cachep->node[node] = n;
 	}
+
 	return 0;
 
 fail:
@@ -3759,7 +3726,7 @@ static int __do_tune_cpucache(struct kmem_cache *cachep, int limit,
 	cachep->shared = shared;
 
 	if (!prev)
-		goto alloc_node;
+		goto setup_node;
 
 	for_each_online_cpu(cpu) {
 		LIST_HEAD(list);
@@ -3776,8 +3743,8 @@ static int __do_tune_cpucache(struct kmem_cache *cachep, int limit,
 	}
 	free_percpu(prev);
 
-alloc_node:
-	return alloc_kmem_cache_node(cachep, gfp);
+setup_node:
+	return setup_kmem_cache_node_node(cachep, gfp);
 }
 
 static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 06/11] mm/slab: don't keep free slabs if free_objects exceeds free_limit
  2016-03-28  5:26 mm/slab: reduce lock contention in alloc path js1304
                   ` (4 preceding siblings ...)
  2016-03-28  5:26 ` [PATCH 05/11] mm/slab: clean-up kmem_cache_node setup js1304
@ 2016-03-28  5:26 ` js1304
  2016-03-29  1:03   ` Christoph Lameter
  2016-03-28  5:26 ` [PATCH 07/11] mm/slab: racy access/modify the slab color js1304
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 29+ messages in thread
From: js1304 @ 2016-03-28  5:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Currently, the decision to free a slab is made each time a free object
is put back into a slab. The problem is that completely free slabs are
not freed, even when free_objects exceeds free_limit, if the slab being
processed at that moment isn't itself free. This can keep too much
memory in the slab subsystem. This patch tries to fix it by checking
the number of free objects once, after all the free work is done. If
there are completely free slabs at that point and we are over the
limit, we can release them, keeping the number of free slabs as small
as possible.
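
A minimal, self-contained sketch of the new policy (types and names
invented for illustration): instead of deciding per freed object,
decide once after the whole batch whether completely free slabs should
be released:

/* Invented stand-in for the per-node book-keeping in mm/slab.c. */
struct node_counts {
	int free_objects;	/* objects sitting in free/partial slabs */
	int free_limit;		/* how many free objects we tolerate */
	int nr_free_slabs;	/* completely free slabs on the list */
};

/*
 * Called once after a whole batch of objects has been freed; returns
 * how many completely free slabs should go back to the page allocator.
 */
static int slabs_to_release(struct node_counts *n, int objs_per_slab)
{
	int nr = 0;

	while (n->free_objects > n->free_limit && n->nr_free_slabs > 0) {
		n->free_objects -= objs_per_slab;
		n->nr_free_slabs--;
		nr++;
	}
	return nr;
}

The hunk below implements this directly inside free_block(), using the
real n->free_objects / n->free_limit counters and the slabs_free list.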

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/slab.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index b96f381..df11757 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3258,6 +3258,9 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
 {
 	int i;
 	struct kmem_cache_node *n = get_node(cachep, node);
+	struct page *page;
+
+	n->free_objects += nr_objects;
 
 	for (i = 0; i < nr_objects; i++) {
 		void *objp;
@@ -3270,17 +3273,11 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
 		check_spinlock_acquired_node(cachep, node);
 		slab_put_obj(cachep, page, objp);
 		STATS_DEC_ACTIVE(cachep);
-		n->free_objects++;
 
 		/* fixup slab chains */
-		if (page->active == 0) {
-			if (n->free_objects > n->free_limit) {
-				n->free_objects -= cachep->num;
-				list_add_tail(&page->lru, list);
-			} else {
-				list_add(&page->lru, &n->slabs_free);
-			}
-		} else {
+		if (page->active == 0)
+			list_add(&page->lru, &n->slabs_free);
+		else {
 			/* Unconditionally move a slab to the end of the
 			 * partial list on free - maximum time for the
 			 * other objects to be freed, too.
@@ -3288,6 +3285,14 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
 			list_add_tail(&page->lru, &n->slabs_partial);
 		}
 	}
+
+	while (n->free_objects > n->free_limit && !list_empty(&n->slabs_free)) {
+		n->free_objects -= cachep->num;
+
+		page = list_last_entry(&n->slabs_free, struct page, lru);
+		list_del(&page->lru);
+		list_add(&page->lru, list);
+	}
 }
 
 static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 07/11] mm/slab: racy access/modify the slab color
  2016-03-28  5:26 mm/slab: reduce lock contention in alloc path js1304
                   ` (5 preceding siblings ...)
  2016-03-28  5:26 ` [PATCH 06/11] mm/slab: don't keep free slabs if free_objects exceeds free_limit js1304
@ 2016-03-28  5:26 ` js1304
  2016-03-29  1:05   ` Christoph Lameter
  2016-03-28  5:26 ` [PATCH 08/11] mm/slab: make cache_grow() handle the page allocated on arbitrary node js1304
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 29+ messages in thread
From: js1304 @ 2016-03-28  5:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

The slab colour doesn't need to be updated strictly; it only spreads
slabs across cache lines. Since taking the node lock just to advance
the colour causes more lock contention, this patch makes the colour
access/update racy. This is a preparation step towards a lockless
allocation path for when there are no free objects in the kmem_cache.

Below are the results of the concurrent allocation/free test in the
slab allocation benchmark Christoph made a long time ago. I simplified
the output. The numbers are cycle counts for alloc/free respectively,
so lower is better.

* Before
Kmalloc N*alloc N*free(32): Average=365/806
Kmalloc N*alloc N*free(64): Average=452/690
Kmalloc N*alloc N*free(128): Average=736/886
Kmalloc N*alloc N*free(256): Average=1167/985
Kmalloc N*alloc N*free(512): Average=2088/1125
Kmalloc N*alloc N*free(1024): Average=4115/1184
Kmalloc N*alloc N*free(2048): Average=8451/1748
Kmalloc N*alloc N*free(4096): Average=16024/2048

* After
Kmalloc N*alloc N*free(32): Average=355/750
Kmalloc N*alloc N*free(64): Average=452/812
Kmalloc N*alloc N*free(128): Average=559/1070
Kmalloc N*alloc N*free(256): Average=1176/980
Kmalloc N*alloc N*free(512): Average=1939/1189
Kmalloc N*alloc N*free(1024): Average=3521/1278
Kmalloc N*alloc N*free(2048): Average=7152/1838
Kmalloc N*alloc N*free(4096): Average=13438/2013

It shows that contention is reduced for object sizes >= 1024 bytes
and performance increases by roughly 15%.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/slab.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index df11757..52fc5e3 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2536,20 +2536,7 @@ static int cache_grow(struct kmem_cache *cachep,
 	}
 	local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
 
-	/* Take the node list lock to change the colour_next on this node */
 	check_irq_off();
-	n = get_node(cachep, nodeid);
-	spin_lock(&n->list_lock);
-
-	/* Get colour for the slab, and cal the next value. */
-	offset = n->colour_next;
-	n->colour_next++;
-	if (n->colour_next >= cachep->colour)
-		n->colour_next = 0;
-	spin_unlock(&n->list_lock);
-
-	offset *= cachep->colour_off;
-
 	if (gfpflags_allow_blocking(local_flags))
 		local_irq_enable();
 
@@ -2570,6 +2557,19 @@ static int cache_grow(struct kmem_cache *cachep,
 	if (!page)
 		goto failed;
 
+	n = get_node(cachep, nodeid);
+
+	/* Get colour for the slab, and cal the next value. */
+	n->colour_next++;
+	if (n->colour_next >= cachep->colour)
+		n->colour_next = 0;
+
+	offset = n->colour_next;
+	if (offset >= cachep->colour)
+		offset = 0;
+
+	offset *= cachep->colour_off;
+
 	/* Get slab management. */
 	freelist = alloc_slabmgmt(cachep, page, offset,
 			local_flags & ~GFP_CONSTRAINT_MASK, nodeid);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 08/11] mm/slab: make cache_grow() handle the page allocated on arbitrary node
  2016-03-28  5:26 mm/slab: reduce lock contention in alloc path js1304
                   ` (6 preceding siblings ...)
  2016-03-28  5:26 ` [PATCH 07/11] mm/slab: racy access/modify the slab color js1304
@ 2016-03-28  5:26 ` js1304
  2016-03-28  5:26 ` [PATCH 09/11] mm/slab: separate cache_grow() to two parts js1304
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 29+ messages in thread
From: js1304 @ 2016-03-28  5:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Currently, cache_grow() assumes that the allocated page's node is the
same as the nodeid parameter used for the allocation request. If we
drop this assumption, we can handle the fallback_alloc() case
gracefully. So this patch makes cache_grow() handle a page allocated
on an arbitrary node and cleans up the relevant code.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/slab.c | 60 +++++++++++++++++++++---------------------------------------
 1 file changed, 21 insertions(+), 39 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 52fc5e3..ce8ed65 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2518,13 +2518,14 @@ static void slab_map_pages(struct kmem_cache *cache, struct page *page,
  * Grow (by 1) the number of slabs within a cache.  This is called by
  * kmem_cache_alloc() when there are no active objs left in a cache.
  */
-static int cache_grow(struct kmem_cache *cachep,
-		gfp_t flags, int nodeid, struct page *page)
+static int cache_grow(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 {
 	void *freelist;
 	size_t offset;
 	gfp_t local_flags;
+	int page_node;
 	struct kmem_cache_node *n;
+	struct page *page;
 
 	/*
 	 * Be lazy and only check for valid flags here,  keeping it out of the
@@ -2552,12 +2553,12 @@ static int cache_grow(struct kmem_cache *cachep,
 	 * Get mem for the objs.  Attempt to allocate a physical page from
 	 * 'nodeid'.
 	 */
-	if (!page)
-		page = kmem_getpages(cachep, local_flags, nodeid);
+	page = kmem_getpages(cachep, local_flags, nodeid);
 	if (!page)
 		goto failed;
 
-	n = get_node(cachep, nodeid);
+	page_node = page_to_nid(page);
+	n = get_node(cachep, page_node);
 
 	/* Get colour for the slab, and cal the next value. */
 	n->colour_next++;
@@ -2572,7 +2573,7 @@ static int cache_grow(struct kmem_cache *cachep,
 
 	/* Get slab management. */
 	freelist = alloc_slabmgmt(cachep, page, offset,
-			local_flags & ~GFP_CONSTRAINT_MASK, nodeid);
+			local_flags & ~GFP_CONSTRAINT_MASK, page_node);
 	if (OFF_SLAB(cachep) && !freelist)
 		goto opps1;
 
@@ -2591,13 +2592,13 @@ static int cache_grow(struct kmem_cache *cachep,
 	STATS_INC_GROWN(cachep);
 	n->free_objects += cachep->num;
 	spin_unlock(&n->list_lock);
-	return 1;
+	return page_node;
 opps1:
 	kmem_freepages(cachep, page);
 failed:
 	if (gfpflags_allow_blocking(local_flags))
 		local_irq_disable();
-	return 0;
+	return -1;
 }
 
 #if DEBUG
@@ -2878,14 +2879,14 @@ alloc_done:
 				return obj;
 		}
 
-		x = cache_grow(cachep, gfp_exact_node(flags), node, NULL);
+		x = cache_grow(cachep, gfp_exact_node(flags), node);
 
 		/* cache_grow can reenable interrupts, then ac could change. */
 		ac = cpu_cache_get(cachep);
 		node = numa_mem_id();
 
 		/* no objects in sight? abort */
-		if (!x && ac->avail == 0)
+		if (x < 0 && ac->avail == 0)
 			return NULL;
 
 		if (!ac->avail)		/* objects refilled by interrupt? */
@@ -3014,7 +3015,6 @@ static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags)
 static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
 {
 	struct zonelist *zonelist;
-	gfp_t local_flags;
 	struct zoneref *z;
 	struct zone *zone;
 	enum zone_type high_zoneidx = gfp_zone(flags);
@@ -3025,8 +3025,6 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
 	if (flags & __GFP_THISNODE)
 		return NULL;
 
-	local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
-
 retry_cpuset:
 	cpuset_mems_cookie = read_mems_allowed_begin();
 	zonelist = node_zonelist(mempolicy_slab_node(), flags);
@@ -3056,33 +3054,17 @@ retry:
 		 * We may trigger various forms of reclaim on the allowed
 		 * set and go into memory reserves if necessary.
 		 */
-		struct page *page;
+		nid = cache_grow(cache, flags, numa_mem_id());
+		if (nid >= 0) {
+			obj = ____cache_alloc_node(cache,
+				gfp_exact_node(flags), nid);
 
-		if (gfpflags_allow_blocking(local_flags))
-			local_irq_enable();
-		kmem_flagcheck(cache, flags);
-		page = kmem_getpages(cache, local_flags, numa_mem_id());
-		if (gfpflags_allow_blocking(local_flags))
-			local_irq_disable();
-		if (page) {
 			/*
-			 * Insert into the appropriate per node queues
+			 * Another processor may allocate the objects in
+			 * the slab since we are not holding any locks.
 			 */
-			nid = page_to_nid(page);
-			if (cache_grow(cache, flags, nid, page)) {
-				obj = ____cache_alloc_node(cache,
-					gfp_exact_node(flags), nid);
-				if (!obj)
-					/*
-					 * Another processor may allocate the
-					 * objects in the slab since we are
-					 * not holding any locks.
-					 */
-					goto retry;
-			} else {
-				/* cache_grow already freed obj */
-				obj = NULL;
-			}
+			if (!obj)
+				goto retry;
 		}
 	}
 
@@ -3133,8 +3115,8 @@ retry:
 
 must_grow:
 	spin_unlock(&n->list_lock);
-	x = cache_grow(cachep, gfp_exact_node(flags), nodeid, NULL);
-	if (x)
+	x = cache_grow(cachep, gfp_exact_node(flags), nodeid);
+	if (x >= 0)
 		goto retry;
 
 	return fallback_alloc(cachep, flags);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 09/11] mm/slab: separate cache_grow() to two parts
  2016-03-28  5:26 mm/slab: reduce lock contention in alloc path js1304
                   ` (7 preceding siblings ...)
  2016-03-28  5:26 ` [PATCH 08/11] mm/slab: make cache_grow() handle the page allocated on arbitrary node js1304
@ 2016-03-28  5:26 ` js1304
  2016-03-28  5:27 ` [PATCH 10/11] mm/slab: refill cpu cache through a new slab without holding a node lock js1304
  2016-03-28  5:27 ` [PATCH 11/11] mm/slab: lockless decision to grow cache js1304
  10 siblings, 0 replies; 29+ messages in thread
From: js1304 @ 2016-03-28  5:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

This is a preparation step for a lockless allocation path when there
are no free objects in the kmem_cache. What we'd like to do here is
refill the cpu cache without holding the node lock. To accomplish
that, the refill must happen after the new slab is allocated but
before the slab is attached to the management list. So this patch
separates cache_grow() into two parts, allocation and attaching to
the list, in order to add some code in between them in the following
patch, as sketched below.
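
Schematically, the code the following patch slots in between the two
halves looks like this (an outline built from names this series
introduces; error handling and surrounding context omitted):

	/* ac and batchcount are cache_alloc_refill()'s local state */
	page = cache_grow_begin(cachep, gfp_exact_node(flags), node);
	if (page) {
		/*
		 * The new slab is not on any node list yet, so only this
		 * task can see it: the cpu cache can be refilled from it
		 * without taking the node lock.
		 */
		alloc_block(cachep, ac, page, batchcount);
	}
	cache_grow_end(cachep, page);	/* attach under a short node lock */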

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/slab.c | 74 ++++++++++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 52 insertions(+), 22 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index ce8ed65..401e60c 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -213,6 +213,11 @@ static void slabs_destroy(struct kmem_cache *cachep, struct list_head *list);
 static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp);
 static void cache_reap(struct work_struct *unused);
 
+static inline void fixup_objfreelist_debug(struct kmem_cache *cachep,
+						void **list);
+static inline void fixup_slab_list(struct kmem_cache *cachep,
+				struct kmem_cache_node *n, struct page *page,
+				void **list);
 static int slab_early_init = 1;
 
 #define INDEX_NODE kmalloc_index(sizeof(struct kmem_cache_node))
@@ -1796,7 +1801,7 @@ static size_t calculate_slab_order(struct kmem_cache *cachep,
 
 			/*
 			 * Needed to avoid possible looping condition
-			 * in cache_grow()
+			 * in cache_grow_begin()
 			 */
 			if (OFF_SLAB(freelist_cache))
 				continue;
@@ -2518,7 +2523,8 @@ static void slab_map_pages(struct kmem_cache *cache, struct page *page,
  * Grow (by 1) the number of slabs within a cache.  This is called by
  * kmem_cache_alloc() when there are no active objs left in a cache.
  */
-static int cache_grow(struct kmem_cache *cachep, gfp_t flags, int nodeid)
+static struct page *cache_grow_begin(struct kmem_cache *cachep,
+				gfp_t flags, int nodeid)
 {
 	void *freelist;
 	size_t offset;
@@ -2584,21 +2590,40 @@ static int cache_grow(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 
 	if (gfpflags_allow_blocking(local_flags))
 		local_irq_disable();
-	check_irq_off();
-	spin_lock(&n->list_lock);
 
-	/* Make slab active. */
-	list_add_tail(&page->lru, &(n->slabs_free));
-	STATS_INC_GROWN(cachep);
-	n->free_objects += cachep->num;
-	spin_unlock(&n->list_lock);
-	return page_node;
+	return page;
+
 opps1:
 	kmem_freepages(cachep, page);
 failed:
 	if (gfpflags_allow_blocking(local_flags))
 		local_irq_disable();
-	return -1;
+	return NULL;
+}
+
+static void cache_grow_end(struct kmem_cache *cachep, struct page *page)
+{
+	struct kmem_cache_node *n;
+	void *list = NULL;
+
+	check_irq_off();
+
+	if (!page)
+		return;
+
+	INIT_LIST_HEAD(&page->lru);
+	n = get_node(cachep, page_to_nid(page));
+
+	spin_lock(&n->list_lock);
+	if (!page->active)
+		list_add_tail(&page->lru, &(n->slabs_free));
+	else
+		fixup_slab_list(cachep, n, page, &list);
+	STATS_INC_GROWN(cachep);
+	n->free_objects += cachep->num - page->active;
+	spin_unlock(&n->list_lock);
+
+	fixup_objfreelist_debug(cachep, &list);
 }
 
 #if DEBUG
@@ -2809,6 +2834,7 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
 	struct array_cache *ac;
 	int node;
 	void *list = NULL;
+	struct page *page;
 
 	check_irq_off();
 	node = numa_mem_id();
@@ -2836,7 +2862,6 @@ retry:
 	}
 
 	while (batchcount > 0) {
-		struct page *page;
 		/* Get slab alloc is to come from. */
 		page = get_first_slab(n, false);
 		if (!page)
@@ -2869,8 +2894,6 @@ alloc_done:
 	fixup_objfreelist_debug(cachep, &list);
 
 	if (unlikely(!ac->avail)) {
-		int x;
-
 		/* Check if we can use obj in pfmemalloc slab */
 		if (sk_memalloc_socks()) {
 			void *obj = cache_alloc_pfmemalloc(cachep, n, flags);
@@ -2879,14 +2902,18 @@ alloc_done:
 				return obj;
 		}
 
-		x = cache_grow(cachep, gfp_exact_node(flags), node);
+		page = cache_grow_begin(cachep, gfp_exact_node(flags), node);
+		cache_grow_end(cachep, page);
 
-		/* cache_grow can reenable interrupts, then ac could change. */
+		/*
+		 * cache_grow_begin() can reenable interrupts,
+		 * then ac could change.
+		 */
 		ac = cpu_cache_get(cachep);
 		node = numa_mem_id();
 
 		/* no objects in sight? abort */
-		if (x < 0 && ac->avail == 0)
+		if (!page && ac->avail == 0)
 			return NULL;
 
 		if (!ac->avail)		/* objects refilled by interrupt? */
@@ -3019,6 +3046,7 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
 	struct zone *zone;
 	enum zone_type high_zoneidx = gfp_zone(flags);
 	void *obj = NULL;
+	struct page *page;
 	int nid;
 	unsigned int cpuset_mems_cookie;
 
@@ -3054,8 +3082,10 @@ retry:
 		 * We may trigger various forms of reclaim on the allowed
 		 * set and go into memory reserves if necessary.
 		 */
-		nid = cache_grow(cache, flags, numa_mem_id());
-		if (nid >= 0) {
+		page = cache_grow_begin(cache, flags, numa_mem_id());
+		cache_grow_end(cache, page);
+		if (page) {
+			nid = page_to_nid(page);
 			obj = ____cache_alloc_node(cache,
 				gfp_exact_node(flags), nid);
 
@@ -3083,7 +3113,6 @@ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
 	struct kmem_cache_node *n;
 	void *obj;
 	void *list = NULL;
-	int x;
 
 	VM_BUG_ON(nodeid < 0 || nodeid >= MAX_NUMNODES);
 	n = get_node(cachep, nodeid);
@@ -3115,8 +3144,9 @@ retry:
 
 must_grow:
 	spin_unlock(&n->list_lock);
-	x = cache_grow(cachep, gfp_exact_node(flags), nodeid);
-	if (x >= 0)
+	page = cache_grow_begin(cachep, gfp_exact_node(flags), nodeid);
+	cache_grow_end(cachep, page);
+	if (page)
 		goto retry;
 
 	return fallback_alloc(cachep, flags);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 10/11] mm/slab: refill cpu cache through a new slab without holding a node lock
  2016-03-28  5:26 mm/slab: reduce lock contention in alloc path js1304
                   ` (8 preceding siblings ...)
  2016-03-28  5:26 ` [PATCH 09/11] mm/slab: separate cache_grow() to two parts js1304
@ 2016-03-28  5:27 ` js1304
  2016-03-28  5:27 ` [PATCH 11/11] mm/slab: lockless decision to grow cache js1304
  10 siblings, 0 replies; 29+ messages in thread
From: js1304 @ 2016-03-28  5:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Until now, growing the cache puts a free slab on the node's slab list,
and only then can we allocate objects from it. That necessarily
requires holding the node lock, which is heavily contended. If we
refill the cpu cache from the new slab before attaching it to the
node's slab list, we can avoid holding the node lock almost entirely,
because the newly allocated slab is visible only to the current task.
This reduces lock contention.

Below are the results of the concurrent allocation/free test in the
slab allocation benchmark Christoph made a long time ago. I simplified
the output. The numbers are cycle counts for alloc/free respectively,
so lower is better.

* Before
Kmalloc N*alloc N*free(32): Average=355/750
Kmalloc N*alloc N*free(64): Average=452/812
Kmalloc N*alloc N*free(128): Average=559/1070
Kmalloc N*alloc N*free(256): Average=1176/980
Kmalloc N*alloc N*free(512): Average=1939/1189
Kmalloc N*alloc N*free(1024): Average=3521/1278
Kmalloc N*alloc N*free(2048): Average=7152/1838
Kmalloc N*alloc N*free(4096): Average=13438/2013

* After
Kmalloc N*alloc N*free(32): Average=248/966
Kmalloc N*alloc N*free(64): Average=261/949
Kmalloc N*alloc N*free(128): Average=314/1016
Kmalloc N*alloc N*free(256): Average=741/1061
Kmalloc N*alloc N*free(512): Average=1246/1152
Kmalloc N*alloc N*free(1024): Average=2437/1259
Kmalloc N*alloc N*free(2048): Average=4980/1800
Kmalloc N*alloc N*free(4096): Average=9000/2078

It shows that contention is reduced for all object sizes and
performance increases by 30 ~ 40%.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/slab.c | 68 +++++++++++++++++++++++++++++++++------------------------------
 1 file changed, 36 insertions(+), 32 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 401e60c..029d6b3 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2827,6 +2827,30 @@ static noinline void *cache_alloc_pfmemalloc(struct kmem_cache *cachep,
 	return obj;
 }
 
+/*
+ * Slab list should be fixed up by fixup_slab_list() for existing slab
+ * or cache_grow_end() for new slab
+ */
+static __always_inline int alloc_block(struct kmem_cache *cachep,
+		struct array_cache *ac, struct page *page, int batchcount)
+{
+	/*
+	 * There must be at least one object available for
+	 * allocation.
+	 */
+	BUG_ON(page->active >= cachep->num);
+
+	while (page->active < cachep->num && batchcount--) {
+		STATS_INC_ALLOCED(cachep);
+		STATS_INC_ACTIVE(cachep);
+		STATS_SET_HIGH(cachep);
+
+		ac->entry[ac->avail++] = slab_get_obj(cachep, page);
+	}
+
+	return batchcount;
+}
+
 static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
 {
 	int batchcount;
@@ -2839,7 +2863,6 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
 	check_irq_off();
 	node = numa_mem_id();
 
-retry:
 	ac = cpu_cache_get(cachep);
 	batchcount = ac->batchcount;
 	if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
@@ -2869,21 +2892,7 @@ retry:
 
 		check_spinlock_acquired(cachep);
 
-		/*
-		 * The slab was either on partial or free list so
-		 * there must be at least one object available for
-		 * allocation.
-		 */
-		BUG_ON(page->active >= cachep->num);
-
-		while (page->active < cachep->num && batchcount--) {
-			STATS_INC_ALLOCED(cachep);
-			STATS_INC_ACTIVE(cachep);
-			STATS_SET_HIGH(cachep);
-
-			ac->entry[ac->avail++] = slab_get_obj(cachep, page);
-		}
-
+		batchcount = alloc_block(cachep, ac, page, batchcount);
 		fixup_slab_list(cachep, n, page, &list);
 	}
 
@@ -2903,21 +2912,18 @@ alloc_done:
 		}
 
 		page = cache_grow_begin(cachep, gfp_exact_node(flags), node);
-		cache_grow_end(cachep, page);
 
 		/*
 		 * cache_grow_begin() can reenable interrupts,
 		 * then ac could change.
 		 */
 		ac = cpu_cache_get(cachep);
-		node = numa_mem_id();
+		if (!ac->avail && page)
+			alloc_block(cachep, ac, page, batchcount);
+		cache_grow_end(cachep, page);
 
-		/* no objects in sight? abort */
-		if (!page && ac->avail == 0)
+		if (!ac->avail)
 			return NULL;
-
-		if (!ac->avail)		/* objects refilled by interrupt? */
-			goto retry;
 	}
 	ac->touched = 1;
 
@@ -3111,14 +3117,13 @@ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
 {
 	struct page *page;
 	struct kmem_cache_node *n;
-	void *obj;
+	void *obj = NULL;
 	void *list = NULL;
 
 	VM_BUG_ON(nodeid < 0 || nodeid >= MAX_NUMNODES);
 	n = get_node(cachep, nodeid);
 	BUG_ON(!n);
 
-retry:
 	check_irq_off();
 	spin_lock(&n->list_lock);
 	page = get_first_slab(n, false);
@@ -3140,19 +3145,18 @@ retry:
 
 	spin_unlock(&n->list_lock);
 	fixup_objfreelist_debug(cachep, &list);
-	goto done;
+	return obj;
 
 must_grow:
 	spin_unlock(&n->list_lock);
 	page = cache_grow_begin(cachep, gfp_exact_node(flags), nodeid);
+	if (page) {
+		/* This slab isn't counted yet so don't update free_objects */
+		obj = slab_get_obj(cachep, page);
+	}
 	cache_grow_end(cachep, page);
-	if (page)
-		goto retry;
 
-	return fallback_alloc(cachep, flags);
-
-done:
-	return obj;
+	return obj ? obj : fallback_alloc(cachep, flags);
 }
 
 static __always_inline void *
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 11/11] mm/slab: lockless decision to grow cache
  2016-03-28  5:26 mm/slab: reduce lock contention in alloc path js1304
                   ` (9 preceding siblings ...)
  2016-03-28  5:27 ` [PATCH 10/11] mm/slab: refill cpu cache through a new slab without holding a node lock js1304
@ 2016-03-28  5:27 ` js1304
  10 siblings, 0 replies; 29+ messages in thread
From: js1304 @ 2016-03-28  5:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel, Joonsoo Kim

From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

To check precisely whether free objects exist, we need to grab a lock.
But accuracy isn't that important here: the race window is small, and
if too many free objects pile up, the cache reaper will reap them. So
this patch makes the check for free object existence lockless. This
reduces lock contention in the heavy-allocation case.

Note that, until now, n->shared could be freed underneath us by a
write to slabinfo, but with a small trick in this patch we can safely
access it within an interrupt-disabled section.
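
A sketch of why the unlocked access stays safe (illustrative timeline
based on the comment this patch adds to setup_kmem_cache_node()):

  allocator (irqs disabled)            tuner (slabinfo write, force_change)
  ---------------------------------    -------------------------------------
  shared = READ_ONCE(n->shared);       old_shared = n->shared;
  ... may dereference shared ...       n->shared = new_shared;
                                       kick_all_cpus_sync();
                                         -> waits until every cpu has taken
                                            an IPI, which cannot happen while
                                            our irq-disabled section runs
  local_irq_enable();                  kfree(old_shared);   /* only now */

So a shared array observed inside an irq-disabled section cannot be
freed before that section ends.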

Below are the results of the concurrent allocation/free test in the
slab allocation benchmark Christoph made a long time ago. I simplified
the output. The numbers are cycle counts for alloc/free respectively,
so lower is better.

* Before
Kmalloc N*alloc N*free(32): Average=248/966
Kmalloc N*alloc N*free(64): Average=261/949
Kmalloc N*alloc N*free(128): Average=314/1016
Kmalloc N*alloc N*free(256): Average=741/1061
Kmalloc N*alloc N*free(512): Average=1246/1152
Kmalloc N*alloc N*free(1024): Average=2437/1259
Kmalloc N*alloc N*free(2048): Average=4980/1800
Kmalloc N*alloc N*free(4096): Average=9000/2078

* After
Kmalloc N*alloc N*free(32): Average=344/792
Kmalloc N*alloc N*free(64): Average=347/882
Kmalloc N*alloc N*free(128): Average=390/959
Kmalloc N*alloc N*free(256): Average=393/1067
Kmalloc N*alloc N*free(512): Average=683/1229
Kmalloc N*alloc N*free(1024): Average=1295/1325
Kmalloc N*alloc N*free(2048): Average=2513/1664
Kmalloc N*alloc N*free(4096): Average=4742/2172

It shows that allocation performance decreases for object sizes up to
128 bytes, which may be due to the extra checks in
cache_alloc_refill(). But considering the improvement in free
performance, the net result looks about the same. The results for the
other size classes look very promising: roughly a 50% performance
improvement.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/slab.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 029d6b3..b70aabf 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -951,6 +951,15 @@ static int setup_kmem_cache_node(struct kmem_cache *cachep,
 	spin_unlock_irq(&n->list_lock);
 	slabs_destroy(cachep, &list);
 
+	/*
+	 * To protect lockless access to n->shared during irq disabled context.
+	 * If n->shared isn't NULL in irq disabled context, accessing to it is
+	 * guaranteed to be valid until irq is re-enabled, because it will be
+	 * freed after kick_all_cpus_sync().
+	 */
+	if (force_change)
+		kick_all_cpus_sync();
+
 fail:
 	kfree(old_shared);
 	kfree(new_shared);
@@ -2855,7 +2864,7 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
 {
 	int batchcount;
 	struct kmem_cache_node *n;
-	struct array_cache *ac;
+	struct array_cache *ac, *shared;
 	int node;
 	void *list = NULL;
 	struct page *page;
@@ -2876,11 +2885,16 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
 	n = get_node(cachep, node);
 
 	BUG_ON(ac->avail > 0 || !n);
+	shared = READ_ONCE(n->shared);
+	if (!n->free_objects && (!shared || !shared->avail))
+		goto direct_grow;
+
 	spin_lock(&n->list_lock);
+	shared = READ_ONCE(n->shared);
 
 	/* See if we can refill from the shared array */
-	if (n->shared && transfer_objects(ac, n->shared, batchcount)) {
-		n->shared->touched = 1;
+	if (shared && transfer_objects(ac, shared, batchcount)) {
+		shared->touched = 1;
 		goto alloc_done;
 	}
 
@@ -2902,6 +2916,7 @@ alloc_done:
 	spin_unlock(&n->list_lock);
 	fixup_objfreelist_debug(cachep, &list);
 
+direct_grow:
 	if (unlikely(!ac->avail)) {
 		/* Check if we can use obj in pfmemalloc slab */
 		if (sk_memalloc_socks()) {
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 02/11] mm/slab: remove BAD_ALIEN_MAGIC again
  2016-03-28  5:26 ` [PATCH 02/11] mm/slab: remove BAD_ALIEN_MAGIC again js1304
@ 2016-03-28  8:58   ` Geert Uytterhoeven
  2016-03-30  8:11     ` Joonsoo Kim
  2016-03-28 21:19   ` Andrew Morton
  1 sibling, 1 reply; 29+ messages in thread
From: Geert Uytterhoeven @ 2016-03-28  8:58 UTC (permalink / raw)
  To: JoonSoo Kim
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, Linux MM, linux-kernel, Joonsoo Kim

Hi Joonsoo,

On Mon, Mar 28, 2016 at 7:26 AM,  <js1304@gmail.com> wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>
> Initial attemp to remove BAD_ALIEN_MAGIC is once reverted by
> 'commit edcad2509550 ("Revert "slab: remove BAD_ALIEN_MAGIC"")'
> because it causes a problem on m68k which has many node
> but !CONFIG_NUMA. In this case, although alien cache isn't used
> at all but to cope with some initialization path, garbage value
> is used and that is BAD_ALIEN_MAGIC. Now, this patch set
> use_alien_caches to 0 when !CONFIG_NUMA, there is no initialization
> path problem so we don't need BAD_ALIEN_MAGIC at all. So remove it.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

I gave this a try on m68k/ARAnyM, and it didn't crash, unlike the previous
version that was reverted, so
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 02/11] mm/slab: remove BAD_ALIEN_MAGIC again
  2016-03-28  5:26 ` [PATCH 02/11] mm/slab: remove BAD_ALIEN_MAGIC again js1304
  2016-03-28  8:58   ` Geert Uytterhoeven
@ 2016-03-28 21:19   ` Andrew Morton
  2016-03-29  0:53     ` Christoph Lameter
  1 sibling, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2016-03-28 21:19 UTC (permalink / raw)
  To: js1304
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel, Joonsoo Kim

On Mon, 28 Mar 2016 14:26:52 +0900 js1304@gmail.com wrote:

> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> Initial attemp to remove BAD_ALIEN_MAGIC is once reverted by
> 'commit edcad2509550 ("Revert "slab: remove BAD_ALIEN_MAGIC"")'
> because it causes a problem on m68k which has many node
> but !CONFIG_NUMA.

Whaaa?  How is that even possible?  I'd have thought that everything
would break at compile time (at least) with such a setup.

> In this case, although alien cache isn't used
> at all but to cope with some initialization path, garbage value
> is used and that is BAD_ALIEN_MAGIC. Now, this patch set
> use_alien_caches to 0 when !CONFIG_NUMA, there is no initialization
> path problem so we don't need BAD_ALIEN_MAGIC at all. So remove it.
> 
> ...
>
> @@ -1205,7 +1203,7 @@ void __init kmem_cache_init(void)
>  					sizeof(struct rcu_head));
>  	kmem_cache = &kmem_cache_boot;
>  
> -	if (num_possible_nodes() == 1)
> +	if (!IS_ENABLED(CONFIG_NUMA) || num_possible_nodes() == 1)
>  		use_alien_caches = 0;
>  
>  	for (i = 0; i < NUM_INIT_LISTS; i++)

This does look screwy.  How can num_possible_nodes() possibly return
anything but "1" if CONFIG_NUMA=n.

Can we please get a code comment in here to explain things to the poor
old reader and to prevent people from trying to "fix" it?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink()
  2016-03-28  5:26 ` [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink() js1304
@ 2016-03-29  0:50   ` Christoph Lameter
  2016-03-30  8:11     ` Joonsoo Kim
  2016-03-31 10:53   ` Nikolay Borisov
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2016-03-29  0:50 UTC (permalink / raw)
  To: js1304
  Cc: Andrew Morton, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel, Joonsoo Kim

On Mon, 28 Mar 2016, js1304@gmail.com wrote:

> Major kmem_cache metadata in slab subsystem is synchronized with
> the slab_mutex. In SLAB, if some of them is changed, node's shared
> array cache would be freed and re-populated. If __kmem_cache_shrink()
> is called at the same time, it will call drain_array() with n->shared
> without holding node lock so problem can happen.
>
> We can fix this small theoretical race condition by holding node lock
> in drain_array(), but, holding a slab_mutex in kmem_cache_shrink()
> looks more appropriate solution because stable state would make things
> less error-prone and this is not performance critical path.

Ummm.. The mutex taking is added to common code. So this will also affect
SLUB.  The patch needs to consider this. Do we want to force all
allocators to run shrinking only when holding the lock? SLUB does not
need to hold the mutex. And frankly the mutex is for reconfiguration of
metadata which is *not* occurring here. A shrink operation does not do
that. Can we figure out a slab specific way of handling synchronization
in the strange free/realloc cycle?

It seems that taking the node lock is the appropriate level of
synchronization since the concern is with the contents of a shared cache
at that level. There is no change of metadata which would require the
mutex.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 02/11] mm/slab: remove BAD_ALIEN_MAGIC again
  2016-03-28 21:19   ` Andrew Morton
@ 2016-03-29  0:53     ` Christoph Lameter
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2016-03-29  0:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: js1304, Pekka Enberg, David Rientjes, Jesper Dangaard Brouer,
	linux-mm, linux-kernel, Joonsoo Kim

On Mon, 28 Mar 2016, Andrew Morton wrote:

> On Mon, 28 Mar 2016 14:26:52 +0900 js1304@gmail.com wrote:
>
> > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >
> > Initial attemp to remove BAD_ALIEN_MAGIC is once reverted by
> > 'commit edcad2509550 ("Revert "slab: remove BAD_ALIEN_MAGIC"")'
> > because it causes a problem on m68k which has many node
> > but !CONFIG_NUMA.
>
> Whaaa?  How is that even possible?  I'd have thought that everything
> would break at compile time (at least) with such a setup.

Yes we have that and the support for this caused numerous issues. Can we
stop supporting such a configuration?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 03/11] mm/slab: drain the free slab as much as possible
  2016-03-28  5:26 ` [PATCH 03/11] mm/slab: drain the free slab as much as possible js1304
@ 2016-03-29  0:54   ` Christoph Lameter
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2016-03-29  0:54 UTC (permalink / raw)
  To: js1304
  Cc: Andrew Morton, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel, Joonsoo Kim


Acked-by: Christoph Lameter <cl@linux.com>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 04/11] mm/slab: factor out kmem_cache_node initialization code
  2016-03-28  5:26 ` [PATCH 04/11] mm/slab: factor out kmem_cache_node initialization code js1304
@ 2016-03-29  0:56   ` Christoph Lameter
  2016-03-30  8:12     ` Joonsoo Kim
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2016-03-29  0:56 UTC (permalink / raw)
  To: js1304
  Cc: Andrew Morton, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel, Joonsoo Kim

On Mon, 28 Mar 2016, js1304@gmail.com wrote:

> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> -		spin_lock_irq(&n->list_lock);
> -		n->free_limit =
> -			(1 + nr_cpus_node(node)) *
> -			cachep->batchcount + cachep->num;
> -		spin_unlock_irq(&n->list_lock);
> +		ret = init_cache_node(cachep, node, GFP_KERNEL);
> +		if (ret)
> +			return ret;

Drop ret and do a

	return init_cache_node(...);

instead?
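
That is, assuming the call is the last real work in the surrounding
function (the quoted hunk does not show its tail), the hunk could shrink
to something like:

-	ret = init_cache_node(cachep, node, GFP_KERNEL);
-	if (ret)
-		return ret;
+	return init_cache_node(cachep, node, GFP_KERNEL);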

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 05/11] mm/slab: clean-up kmem_cache_node setup
  2016-03-28  5:26 ` [PATCH 05/11] mm/slab: clean-up kmem_cache_node setup js1304
@ 2016-03-29  0:58   ` Christoph Lameter
  2016-03-30  8:15     ` Joonsoo Kim
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2016-03-29  0:58 UTC (permalink / raw)
  To: js1304
  Cc: Andrew Morton, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel, Joonsoo Kim

On Mon, 28 Mar 2016, js1304@gmail.com wrote:

>   * This initializes kmem_cache_node or resizes various caches for all nodes.
>   */
> -static int alloc_kmem_cache_node(struct kmem_cache *cachep, gfp_t gfp)
> +static int setup_kmem_cache_node_node(struct kmem_cache *cachep, gfp_t gfp)

... _node_node? Isn't there a better name for it?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 06/11] mm/slab: don't keep free slabs if free_objects exceeds free_limit
  2016-03-28  5:26 ` [PATCH 06/11] mm/slab: don't keep free slabs if free_objects exceeds free_limit js1304
@ 2016-03-29  1:03   ` Christoph Lameter
  2016-03-30  8:25     ` Joonsoo Kim
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2016-03-29  1:03 UTC (permalink / raw)
  To: js1304
  Cc: Andrew Morton, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel, Joonsoo Kim

On Mon, 28 Mar 2016, js1304@gmail.com wrote:

> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>
> Currently, the decision to free a slab is made every time a free object
> is put back into the slab. This has a problem: free slabs are not freed
> even if we have free slabs and more free_objects than free_limit

There needs to be a better explanation here since I do not get why there
is an issue with checking after free if a slab is actually free.

> when the slab being processed isn't a free slab. This can cause the
> slab subsystem to keep too much memory. This patch tries to fix it by
> checking the number of free objects after all free work is done. If
> there is a free slab at that time, we can free it, so we keep the
> number of free slabs as small as possible.

OK, so if we check after the free work is done, the number of free slabs
may be higher than the limit that was set, and then we free the
additional slabs to get back down to that limit?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 07/11] mm/slab: racy access/modify the slab color
  2016-03-28  5:26 ` [PATCH 07/11] mm/slab: racy access/modify the slab color js1304
@ 2016-03-29  1:05   ` Christoph Lameter
  2016-03-30  8:25     ` Joonsoo Kim
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2016-03-29  1:05 UTC (permalink / raw)
  To: js1304
  Cc: Andrew Morton, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel, Joonsoo Kim

On Mon, 28 Mar 2016, js1304@gmail.com wrote:

> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>
> The slab color doesn't need to be updated strictly. Locking just to
> change the slab color could cause more lock contention, so this patch
> makes access to and modification of the slab color racy. This is a
> preparation step for implementing a lockless allocation path when
> there are no free objects in the kmem_cache.

Acked-by: Christoph Lameter <cl@linux.com>

The rest of the description does not relate to this patch and does not
actually reflect the improvement of applying this patch. Remove the rest?


> Below is the result of concurrent allocation/free in slab allocation
> benchmark made by Christoph a long time ago. I make the output simpler.
> The number shows cycle count during alloc/free respectively so less
> is better.
>
> * Before
> Kmalloc N*alloc N*free(32): Average=365/806
> Kmalloc N*alloc N*free(64): Average=452/690
> Kmalloc N*alloc N*free(128): Average=736/886
> Kmalloc N*alloc N*free(256): Average=1167/985
> Kmalloc N*alloc N*free(512): Average=2088/1125
> Kmalloc N*alloc N*free(1024): Average=4115/1184
> Kmalloc N*alloc N*free(2048): Average=8451/1748
> Kmalloc N*alloc N*free(4096): Average=16024/2048
>
> * After
> Kmalloc N*alloc N*free(32): Average=355/750
> Kmalloc N*alloc N*free(64): Average=452/812
> Kmalloc N*alloc N*free(128): Average=559/1070
> Kmalloc N*alloc N*free(256): Average=1176/980
> Kmalloc N*alloc N*free(512): Average=1939/1189
> Kmalloc N*alloc N*free(1024): Average=3521/1278
> Kmalloc N*alloc N*free(2048): Average=7152/1838
> Kmalloc N*alloc N*free(4096): Average=13438/2013
>
> It shows that contention is reduced for object size >= 1024
> and performance increases by roughly 15%.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
>  mm/slab.c | 26 +++++++++++++-------------
>  1 file changed, 13 insertions(+), 13 deletions(-)
>
> diff --git a/mm/slab.c b/mm/slab.c
> index df11757..52fc5e3 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -2536,20 +2536,7 @@ static int cache_grow(struct kmem_cache *cachep,
>  	}
>  	local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
>
> -	/* Take the node list lock to change the colour_next on this node */
>  	check_irq_off();
> -	n = get_node(cachep, nodeid);
> -	spin_lock(&n->list_lock);
> -
> -	/* Get colour for the slab, and cal the next value. */
> -	offset = n->colour_next;
> -	n->colour_next++;
> -	if (n->colour_next >= cachep->colour)
> -		n->colour_next = 0;
> -	spin_unlock(&n->list_lock);
> -
> -	offset *= cachep->colour_off;
> -
>  	if (gfpflags_allow_blocking(local_flags))
>  		local_irq_enable();
>
> @@ -2570,6 +2557,19 @@ static int cache_grow(struct kmem_cache *cachep,
>  	if (!page)
>  		goto failed;
>
> +	n = get_node(cachep, nodeid);
> +
> +	/* Get colour for the slab, and cal the next value. */
> +	n->colour_next++;
> +	if (n->colour_next >= cachep->colour)
> +		n->colour_next = 0;
> +
> +	offset = n->colour_next;
> +	if (offset >= cachep->colour)
> +		offset = 0;
> +
> +	offset *= cachep->colour_off;
> +
>  	/* Get slab management. */
>  	freelist = alloc_slabmgmt(cachep, page, offset,
>  			local_flags & ~GFP_CONSTRAINT_MASK, nodeid);
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink()
  2016-03-29  0:50   ` Christoph Lameter
@ 2016-03-30  8:11     ` Joonsoo Kim
  0 siblings, 0 replies; 29+ messages in thread
From: Joonsoo Kim @ 2016-03-30  8:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel

On Mon, Mar 28, 2016 at 07:50:36PM -0500, Christoph Lameter wrote:
> On Mon, 28 Mar 2016, js1304@gmail.com wrote:
> 
> > Major kmem_cache metadata in the slab subsystem is synchronized with
> > the slab_mutex. In SLAB, if any of it is changed, the node's shared
> > array cache is freed and re-populated. If __kmem_cache_shrink() is
> > called at the same time, it will call drain_array() with n->shared
> > without holding the node lock, so a problem can happen.
> >
> > We can fix this small theoretical race condition by holding the node
> > lock in drain_array(), but holding the slab_mutex in kmem_cache_shrink()
> > looks like a more appropriate solution because a stable state makes
> > things less error-prone and this is not a performance-critical path.
> 
> Ummm.. Taking the mutex is added to common code, so this will also affect
> SLUB. The patch needs to consider this. Do we want to force all
> allocators to run shrinking only when holding the lock? SLUB does not
> need to hold the mutex. And frankly, the mutex is for reconfiguration of
> metadata, which is *not* occurring here. A shrink operation does not do
> that. Can we figure out a slab-specific way of handling synchronization
> in the strange free/realloc cycle?
> 
> It seems that taking the node lock is the appropriate level of
> synchronization, since the concern is with the contents of a shared
> cache at that level. There is no change of metadata which would require
> the mutex.

Okay. I will fix it.

Thanks.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 02/11] mm/slab: remove BAD_ALIEN_MAGIC again
  2016-03-28  8:58   ` Geert Uytterhoeven
@ 2016-03-30  8:11     ` Joonsoo Kim
  0 siblings, 0 replies; 29+ messages in thread
From: Joonsoo Kim @ 2016-03-30  8:11 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, Linux MM, linux-kernel

On Mon, Mar 28, 2016 at 10:58:38AM +0200, Geert Uytterhoeven wrote:
> Hi Joonsoo,
> 
> On Mon, Mar 28, 2016 at 7:26 AM,  <js1304@gmail.com> wrote:
> > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >
> > The initial attempt to remove BAD_ALIEN_MAGIC was reverted by
> > commit edcad2509550 ("Revert "slab: remove BAD_ALIEN_MAGIC"")
> > because it caused a problem on m68k, which has many nodes but
> > !CONFIG_NUMA. In that case the alien cache isn't used at all, but
> > to cope with some initialization paths a garbage value is used,
> > and that value is BAD_ALIEN_MAGIC. Now this patch sets
> > use_alien_caches to 0 when !CONFIG_NUMA, so there is no
> > initialization path problem and we don't need BAD_ALIEN_MAGIC at
> > all. Remove it.
> >
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> I gave this a try on m68k/ARAnyM, and it didn't crash, unlike the previous
> version that was reverted, so
> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>

Thanks for testing!!!

Thanks.
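
As a rough illustration of what the quoted commit message describes, the
early-init path might end up looking like this (the exact condition is an
assumption for illustration, not quoted from the patch):

void __init kmem_cache_init(void)
{
	/*
	 * Alien caches only matter with real NUMA; with !CONFIG_NUMA (or a
	 * single possible node) disable them outright instead of parking a
	 * BAD_ALIEN_MAGIC value in the pointers during early init.
	 */
	if (!IS_ENABLED(CONFIG_NUMA) || num_possible_nodes() == 1)
		use_alien_caches = 0;

	/* ... rest of the existing initialization ... */
}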

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 04/11] mm/slab: factor out kmem_cache_node initialization code
  2016-03-29  0:56   ` Christoph Lameter
@ 2016-03-30  8:12     ` Joonsoo Kim
  0 siblings, 0 replies; 29+ messages in thread
From: Joonsoo Kim @ 2016-03-30  8:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel

On Mon, Mar 28, 2016 at 07:56:15PM -0500, Christoph Lameter wrote:
> On Mon, 28 Mar 2016, js1304@gmail.com wrote:
> 
> > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > -		spin_lock_irq(&n->list_lock);
> > -		n->free_limit =
> > -			(1 + nr_cpus_node(node)) *
> > -			cachep->batchcount + cachep->num;
> > -		spin_unlock_irq(&n->list_lock);
> > +		ret = init_cache_node(cachep, node, GFP_KERNEL);
> > +		if (ret)
> > +			return ret;
> 
> Drop ret and do a
> 
> 	return init_cache_node(...);
> 
> instead?

Will do it.

Thanks.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 05/11] mm/slab: clean-up kmem_cache_node setup
  2016-03-29  0:58   ` Christoph Lameter
@ 2016-03-30  8:15     ` Joonsoo Kim
  0 siblings, 0 replies; 29+ messages in thread
From: Joonsoo Kim @ 2016-03-30  8:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel

On Mon, Mar 28, 2016 at 07:58:29PM -0500, Christoph Lameter wrote:
> On Mon, 28 Mar 2016, js1304@gmail.com wrote:
> 
> >   * This initializes kmem_cache_node or resizes various caches for all nodes.
> >   */
> > -static int alloc_kmem_cache_node(struct kmem_cache *cachep, gfp_t gfp)
> > +static int setup_kmem_cache_node_node(struct kmem_cache *cachep, gfp_t gfp)
> 
> ... _node_node? Isn't there a better name for it?

I will think about it more. The reason I used this naming is that there
is another place that uses this naming convention; I'm just mimicking
it. :) It would be much appreciated if you have a suggestion.

Thanks.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 06/11] mm/slab: don't keep free slabs if free_objects exceeds free_limit
  2016-03-29  1:03   ` Christoph Lameter
@ 2016-03-30  8:25     ` Joonsoo Kim
  0 siblings, 0 replies; 29+ messages in thread
From: Joonsoo Kim @ 2016-03-30  8:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel

On Mon, Mar 28, 2016 at 08:03:16PM -0500, Christoph Lameter wrote:
> On Mon, 28 Mar 2016, js1304@gmail.com wrote:
> 
> > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >
> > Currently, the decision to free a slab is made every time a free object
> > is put back into the slab. This has a problem: free slabs are not freed
> > even if we have free slabs and more free_objects than free_limit
> 
> There needs to be a better explanation here since I do not get why there
> is an issue with checking after free if a slab is actually free.

Okay. Consider the following situation where 3 objects are freed.

free_limit = 10
nr_free = 9

free(free slab) free(free slab) free(not free slab)

If we check one by one, then when nr_free > free_limit (at the last
free), we cannot free a slab because the slab currently being processed
isn't a free slab.

But if we check once at the end, we can free 1 free slab.

I will add more explanation in the next version.
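
A rough sketch of that idea (the helper name trim_free_slabs() and its
exact shape are assumptions for illustration, not the posted patch; the
caller is assumed to hold n->list_lock and to destroy the gathered slabs
afterwards):

static void trim_free_slabs(struct kmem_cache *cachep,
			    struct kmem_cache_node *n,
			    struct list_head *list)
{
	struct page *page, *tmp;

	/* Run once after a whole batch of frees has been accounted. */
	list_for_each_entry_safe(page, tmp, &n->slabs_free, lru) {
		if (n->free_objects <= n->free_limit)
			break;

		/* Move the free slab out; the caller calls slabs_destroy(). */
		list_move(&page->lru, list);
		n->free_objects -= cachep->num;
	}
}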

> 
> > when the slab being processed isn't a free slab. This can cause the
> > slab subsystem to keep too much memory. This patch tries to fix it by
> > checking the number of free objects after all free work is done. If
> > there is a free slab at that time, we can free it, so we keep the
> > number of free slabs as small as possible.
> 
> OK, so if we check after the free work is done, the number of free slabs
> may be higher than the limit that was set, and then we free the
> additional slabs to get back down to that limit?

Yes.

Thanks.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 07/11] mm/slab: racy access/modify the slab color
  2016-03-29  1:05   ` Christoph Lameter
@ 2016-03-30  8:25     ` Joonsoo Kim
  0 siblings, 0 replies; 29+ messages in thread
From: Joonsoo Kim @ 2016-03-30  8:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel

On Mon, Mar 28, 2016 at 08:05:41PM -0500, Christoph Lameter wrote:
> On Mon, 28 Mar 2016, js1304@gmail.com wrote:
> 
> > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >
> > The slab color doesn't need to be updated strictly. Locking just to
> > change the slab color could cause more lock contention, so this patch
> > makes access to and modification of the slab color racy. This is a
> > preparation step for implementing a lockless allocation path when
> > there are no free objects in the kmem_cache.
> 
> Acked-by: Christoph Lameter <cl@linux.com>
> 
> The rest of the description does not relate to this patch and does not
> actually reflect the improvement of applying this patch. Remove the rest?

No, the improvement below is for this individual patch.

Thanks.

> 
> 
> > Below is the result of concurrent allocation/free in slab allocation
> > benchmark made by Christoph a long time ago. I make the output simpler.
> > The number shows cycle count during alloc/free respectively so less
> > is better.
> >
> > * Before
> > Kmalloc N*alloc N*free(32): Average=365/806
> > Kmalloc N*alloc N*free(64): Average=452/690
> > Kmalloc N*alloc N*free(128): Average=736/886
> > Kmalloc N*alloc N*free(256): Average=1167/985
> > Kmalloc N*alloc N*free(512): Average=2088/1125
> > Kmalloc N*alloc N*free(1024): Average=4115/1184
> > Kmalloc N*alloc N*free(2048): Average=8451/1748
> > Kmalloc N*alloc N*free(4096): Average=16024/2048
> >
> > * After
> > Kmalloc N*alloc N*free(32): Average=355/750
> > Kmalloc N*alloc N*free(64): Average=452/812
> > Kmalloc N*alloc N*free(128): Average=559/1070
> > Kmalloc N*alloc N*free(256): Average=1176/980
> > Kmalloc N*alloc N*free(512): Average=1939/1189
> > Kmalloc N*alloc N*free(1024): Average=3521/1278
> > Kmalloc N*alloc N*free(2048): Average=7152/1838
> > Kmalloc N*alloc N*free(4096): Average=13438/2013
> >
> > It shows that contention is reduced for object size >= 1024
> > and performance increases by roughly 15%.
> >
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > ---
> >  mm/slab.c | 26 +++++++++++++-------------
> >  1 file changed, 13 insertions(+), 13 deletions(-)
> >
> > diff --git a/mm/slab.c b/mm/slab.c
> > index df11757..52fc5e3 100644
> > --- a/mm/slab.c
> > +++ b/mm/slab.c
> > @@ -2536,20 +2536,7 @@ static int cache_grow(struct kmem_cache *cachep,
> >  	}
> >  	local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
> >
> > -	/* Take the node list lock to change the colour_next on this node */
> >  	check_irq_off();
> > -	n = get_node(cachep, nodeid);
> > -	spin_lock(&n->list_lock);
> > -
> > -	/* Get colour for the slab, and cal the next value. */
> > -	offset = n->colour_next;
> > -	n->colour_next++;
> > -	if (n->colour_next >= cachep->colour)
> > -		n->colour_next = 0;
> > -	spin_unlock(&n->list_lock);
> > -
> > -	offset *= cachep->colour_off;
> > -
> >  	if (gfpflags_allow_blocking(local_flags))
> >  		local_irq_enable();
> >
> > @@ -2570,6 +2557,19 @@ static int cache_grow(struct kmem_cache *cachep,
> >  	if (!page)
> >  		goto failed;
> >
> > +	n = get_node(cachep, nodeid);
> > +
> > +	/* Get colour for the slab, and cal the next value. */
> > +	n->colour_next++;
> > +	if (n->colour_next >= cachep->colour)
> > +		n->colour_next = 0;
> > +
> > +	offset = n->colour_next;
> > +	if (offset >= cachep->colour)
> > +		offset = 0;
> > +
> > +	offset *= cachep->colour_off;
> > +
> >  	/* Get slab management. */
> >  	freelist = alloc_slabmgmt(cachep, page, offset,
> >  			local_flags & ~GFP_CONSTRAINT_MASK, nodeid);
> >
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink()
  2016-03-28  5:26 ` [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink() js1304
  2016-03-29  0:50   ` Christoph Lameter
@ 2016-03-31 10:53   ` Nikolay Borisov
  2016-04-01  2:18     ` Joonsoo Kim
  1 sibling, 1 reply; 29+ messages in thread
From: Nikolay Borisov @ 2016-03-31 10:53 UTC (permalink / raw)
  To: js1304, Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel, Joonsoo Kim



On 03/28/2016 08:26 AM, js1304@gmail.com wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> Major kmem_cache metadata in the slab subsystem is synchronized with
> the slab_mutex. In SLAB, if any of it is changed, the node's shared
> array cache is freed and re-populated. If __kmem_cache_shrink() is
> called at the same time, it will call drain_array() with n->shared
> without holding the node lock, so a problem can happen.
> 
> We can fix this small theoretical race condition by holding the node
> lock in drain_array(), but holding the slab_mutex in kmem_cache_shrink()
> looks like a more appropriate solution because a stable state makes
> things less error-prone and this is not a performance-critical path.
> 
> In addition, annotate the SLAB functions accordingly.

Just a nit, but wouldn't it be better to use lockdep_assert_held/_once
instead of comment-style annotation? In both cases, someone who wants to
understand which locks have to be held will go and read the source, but
in my mind it's easier to miss a comment line than a lockdep_assert.
Furthermore, if lockdep is enabled, a locking violation would spew useful
info to dmesg.
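
For instance, the rule could be expressed directly in code (the helper
name here is hypothetical; lockdep_assert_held() and slab_mutex are the
real interfaces):

#include <linux/lockdep.h>

/* Hypothetical helper; the assertion is the point, not the name. */
static void shrink_shared_arrays(struct kmem_cache *cachep)
{
	lockdep_assert_held(&slab_mutex);	/* instead of a comment */

	/* ... drain and re-populate the shared array caches ... */
}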

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink()
  2016-03-31 10:53   ` Nikolay Borisov
@ 2016-04-01  2:18     ` Joonsoo Kim
  0 siblings, 0 replies; 29+ messages in thread
From: Joonsoo Kim @ 2016-04-01  2:18 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Jesper Dangaard Brouer, linux-mm, linux-kernel

On Thu, Mar 31, 2016 at 01:53:14PM +0300, Nikolay Borisov wrote:
> 
> 
> On 03/28/2016 08:26 AM, js1304@gmail.com wrote:
> > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > 
> > Major kmem_cache metadata in the slab subsystem is synchronized with
> > the slab_mutex. In SLAB, if any of it is changed, the node's shared
> > array cache is freed and re-populated. If __kmem_cache_shrink() is
> > called at the same time, it will call drain_array() with n->shared
> > without holding the node lock, so a problem can happen.
> > 
> > We can fix this small theoretical race condition by holding the node
> > lock in drain_array(), but holding the slab_mutex in kmem_cache_shrink()
> > looks like a more appropriate solution because a stable state makes
> > things less error-prone and this is not a performance-critical path.
> > 
> > In addition, annotate the SLAB functions accordingly.
> 
> Just a nit, but wouldn't it be better to use lockdep_assert_held/_once
> instead of comment-style annotation? In both cases, someone who wants to
> understand which locks have to be held will go and read the source, but
> in my mind it's easier to miss a comment line than a lockdep_assert.
> Furthermore, if lockdep is enabled, a locking violation would spew useful
> info to dmesg.

Good idea. I'm not sure if lockdep_assert is the best fit, but I will add
something to check it rather than just adding a comment.

Thanks.

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2016-04-01  2:16 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-28  5:26 mm/slab: reduce lock contention in alloc path js1304
2016-03-28  5:26 ` [PATCH 01/11] mm/slab: hold a slab_mutex when calling __kmem_cache_shrink() js1304
2016-03-29  0:50   ` Christoph Lameter
2016-03-30  8:11     ` Joonsoo Kim
2016-03-31 10:53   ` Nikolay Borisov
2016-04-01  2:18     ` Joonsoo Kim
2016-03-28  5:26 ` [PATCH 02/11] mm/slab: remove BAD_ALIEN_MAGIC again js1304
2016-03-28  8:58   ` Geert Uytterhoeven
2016-03-30  8:11     ` Joonsoo Kim
2016-03-28 21:19   ` Andrew Morton
2016-03-29  0:53     ` Christoph Lameter
2016-03-28  5:26 ` [PATCH 03/11] mm/slab: drain the free slab as much as possible js1304
2016-03-29  0:54   ` Christoph Lameter
2016-03-28  5:26 ` [PATCH 04/11] mm/slab: factor out kmem_cache_node initialization code js1304
2016-03-29  0:56   ` Christoph Lameter
2016-03-30  8:12     ` Joonsoo Kim
2016-03-28  5:26 ` [PATCH 05/11] mm/slab: clean-up kmem_cache_node setup js1304
2016-03-29  0:58   ` Christoph Lameter
2016-03-30  8:15     ` Joonsoo Kim
2016-03-28  5:26 ` [PATCH 06/11] mm/slab: don't keep free slabs if free_objects exceeds free_limit js1304
2016-03-29  1:03   ` Christoph Lameter
2016-03-30  8:25     ` Joonsoo Kim
2016-03-28  5:26 ` [PATCH 07/11] mm/slab: racy access/modify the slab color js1304
2016-03-29  1:05   ` Christoph Lameter
2016-03-30  8:25     ` Joonsoo Kim
2016-03-28  5:26 ` [PATCH 08/11] mm/slab: make cache_grow() handle the page allocated on arbitrary node js1304
2016-03-28  5:26 ` [PATCH 09/11] mm/slab: separate cache_grow() to two parts js1304
2016-03-28  5:27 ` [PATCH 10/11] mm/slab: refill cpu cache through a new slab without holding a node lock js1304
2016-03-28  5:27 ` [PATCH 11/11] mm/slab: lockless decision to grow cache js1304

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).