* [PATCH 0/6] mm/slab: optimize allocation fastpath
@ 2015-01-05  1:37 ` Joonsoo Kim
  0 siblings, 0 replies; 34+ messages in thread
From: Joonsoo Kim @ 2015-01-05  1:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	linux-mm, linux-kernel, Jesper Dangaard Brouer

SLAB always disables irqs before executing any object alloc/free
operation. This is really painful in terms of performance. A benchmark
that does alloc/free repeatedly shows that each alloc/free is roughly
2 times slower than SLUB's (27 ns vs. 14 ns). To improve performance,
this patchset implements the allocation fastpath without disabling irqs.

The approach is similar to the allocation fastpath in SLUB. A
transaction id (tid) is introduced and updated on every operation. In
the allocation fastpath, an object in the array cache is read
speculatively. Then, the index of the object's position in the array
cache and the transaction id are updated atomically through
this_cpu_cmpxchg_double(). If the tid is unchanged up to this update,
it is guaranteed that no concurrent client has allocated or freed an
object from this array cache, so the allocation can succeed without
disabling irqs.
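
Roughly, the fastpath looks like the sketch below (simplified: the
preempt handling around the tid read, the pfmemalloc checks and the
statistics are omitted, so this is not the exact code of patch 6):

	do {
		/* Speculatively read the per-cpu array cache state. */
		ac = cpu_cache_get(cachep);
		tid = ac->tid;
		avail = ac->avail;
		if (unlikely(!avail))
			goto slowpath;	/* refill with irqs disabled */

		objp = ac->entry[avail - 1];

		/*
		 * Commit only if nobody touched this array cache in the
		 * meantime: both avail and tid must still match.
		 */
	} while (!this_cpu_cmpxchg_double(
			cachep->cpu_cache->avail, cachep->cpu_cache->tid,
			avail, tid,
			avail - 1, next_tid(tid)));

	return objp;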

The above benchmark shows that alloc/free fastpath performance
improves by roughly 22% (27 ns -> 21 ns).

Unfortunately, I cannot optimize the free fastpath, because
speculatively writing the pointer of the object being freed into the
array cache is not possible. If anyone has a good idea to optimize the
free fastpath, please let me know.

Thanks.

Joonsoo Kim (6):
  mm/slab: fix gfp flags of percpu allocation at boot phase
  mm/slab: remove kmemleak_erase() call
  mm/slab: clean-up __ac_get_obj() to prepare future changes
  mm/slab: rearrange irq management
  mm/slab: cleanup ____cache_alloc()
  mm/slab: allocation fastpath without disabling irq

 include/linux/kmemleak.h |    8 --
 mm/slab.c                |  257 +++++++++++++++++++++++++++++++---------------
 2 files changed, 176 insertions(+), 89 deletions(-)

-- 
1.7.9.5


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/6] mm/slab: fix gfp flags of percpu allocation at boot phase
  2015-01-05  1:37 ` Joonsoo Kim
@ 2015-01-05  1:37   ` Joonsoo Kim
  -1 siblings, 0 replies; 34+ messages in thread
From: Joonsoo Kim @ 2015-01-05  1:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	linux-mm, linux-kernel, Jesper Dangaard Brouer

__alloc_percpu() implicitly passes GFP_KERNEL to the core function of
the percpu allocator. During the boot phase this is not a valid gfp
flag, so change it.

Without this change, while implementing a new feature, I found that
__alloc_percpu() calls kmalloc(), which is not yet initialized at this
point, and the system fails to boot. The percpu allocator regards
GFP_KERNEL as a sign that the system is fully initialized and then
aggressively tries to make spare room. With GFP_NOWAIT it doesn't do
that, so the system boots successfully.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/slab.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/slab.c b/mm/slab.c
index 65b5dcb..1150c8b 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1990,9 +1990,12 @@ static struct array_cache __percpu *alloc_kmem_cache_cpus(
 	int cpu;
 	size_t size;
 	struct array_cache __percpu *cpu_cache;
+	gfp_t gfp_flags = GFP_KERNEL;
 
 	size = sizeof(void *) * entries + sizeof(struct array_cache);
-	cpu_cache = __alloc_percpu(size, sizeof(void *));
+	if (slab_state < FULL)
+		gfp_flags = GFP_NOWAIT;
+	cpu_cache = __alloc_percpu_gfp(size, sizeof(void *), gfp_flags);
 
 	if (!cpu_cache)
 		return NULL;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 2/6] mm/slab: remove kmemleak_erase() call
  2015-01-05  1:37 ` Joonsoo Kim
@ 2015-01-05  1:37   ` Joonsoo Kim
  -1 siblings, 0 replies; 34+ messages in thread
From: Joonsoo Kim @ 2015-01-05  1:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	linux-mm, linux-kernel, Jesper Dangaard Brouer

We already call kmemleak_no_scan() in the initialization step of the
array cache, so kmemleak doesn't scan the array cache. Therefore, we
don't need to call kmemleak_erase() here.

Since this was the last caller of kmemleak_erase(), remove the
kmemleak_erase() definition completely.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/kmemleak.h |    8 --------
 mm/slab.c                |   12 ------------
 2 files changed, 20 deletions(-)

diff --git a/include/linux/kmemleak.h b/include/linux/kmemleak.h
index e705467..8470733 100644
--- a/include/linux/kmemleak.h
+++ b/include/linux/kmemleak.h
@@ -52,11 +52,6 @@ static inline void kmemleak_free_recursive(const void *ptr, unsigned long flags)
 		kmemleak_free(ptr);
 }
 
-static inline void kmemleak_erase(void **ptr)
-{
-	*ptr = NULL;
-}
-
 #else
 
 static inline void kmemleak_init(void)
@@ -98,9 +93,6 @@ static inline void kmemleak_ignore(const void *ptr)
 static inline void kmemleak_scan_area(const void *ptr, size_t size, gfp_t gfp)
 {
 }
-static inline void kmemleak_erase(void **ptr)
-{
-}
 static inline void kmemleak_no_scan(const void *ptr)
 {
 }
diff --git a/mm/slab.c b/mm/slab.c
index 1150c8b..9aa58fc 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2942,20 +2942,8 @@ static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 
 	STATS_INC_ALLOCMISS(cachep);
 	objp = cache_alloc_refill(cachep, flags, force_refill);
-	/*
-	 * the 'ac' may be updated by cache_alloc_refill(),
-	 * and kmemleak_erase() requires its correct value.
-	 */
-	ac = cpu_cache_get(cachep);
 
 out:
-	/*
-	 * To avoid a false negative, if an object that is in one of the
-	 * per-CPU caches is leaked, we need to make sure kmemleak doesn't
-	 * treat the array pointers as a reference to the object.
-	 */
-	if (objp)
-		kmemleak_erase(&ac->entry[ac->avail]);
 	return objp;
 }
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 3/6] mm/slab: clean-up __ac_get_obj() to prepare future changes
  2015-01-05  1:37 ` Joonsoo Kim
@ 2015-01-05  1:37   ` Joonsoo Kim
  -1 siblings, 0 replies; 34+ messages in thread
From: Joonsoo Kim @ 2015-01-05  1:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	linux-mm, linux-kernel, Jesper Dangaard Brouer

This patch is a clean-up and a preparation to optimize the allocation
fastpath. Until now, SLAB has handled allocation requests with irqs
disabled. To improve performance, irqs will no longer be disabled at
the start of the allocation fastpath. This requires changing the
interface and the assumptions of __ac_get_obj(): the object will be
passed to __ac_get_obj() rather than fetched from the array cache
inside __ac_get_obj(), and irqs will not be disabled on entry.

To handle this upcoming situation, this patch changes the interface
and the name of the function to make them suitable for future use.
The main purpose of this function, exchanging a pfmemalloc object for
a non-pfmemalloc one when we are not a legitimate user of that memory,
is unchanged.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/slab.c |   91 +++++++++++++++++++++++++++++++++++--------------------------
 1 file changed, 52 insertions(+), 39 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 9aa58fc..62cd5c6 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -720,50 +720,62 @@ out:
 	spin_unlock_irqrestore(&n->list_lock, flags);
 }
 
-static void *__ac_get_obj(struct kmem_cache *cachep, struct array_cache *ac,
-						gfp_t flags, bool force_refill)
+static void *get_obj_from_pfmemalloc_obj(struct kmem_cache *cachep,
+				struct array_cache *ac, void *objp,
+				gfp_t flags, bool force_refill)
 {
 	int i;
-	void *objp = ac->entry[--ac->avail];
+	struct kmem_cache_node *n;
+	LIST_HEAD(list);
+	int node;
 
-	/* Ensure the caller is allowed to use objects from PFMEMALLOC slab */
-	if (unlikely(is_obj_pfmemalloc(objp))) {
-		struct kmem_cache_node *n;
+	BUG_ON(ac->avail >= ac->limit);
+	BUG_ON(objp != ac->entry[ac->avail]);
 
-		if (gfp_pfmemalloc_allowed(flags)) {
-			clear_obj_pfmemalloc(&objp);
-			return objp;
-		}
+	if (gfp_pfmemalloc_allowed(flags)) {
+		clear_obj_pfmemalloc(&objp);
+		return objp;
+	}
 
-		/* The caller cannot use PFMEMALLOC objects, find another one */
-		for (i = 0; i < ac->avail; i++) {
-			/* If a !PFMEMALLOC object is found, swap them */
-			if (!is_obj_pfmemalloc(ac->entry[i])) {
-				objp = ac->entry[i];
-				ac->entry[i] = ac->entry[ac->avail];
-				ac->entry[ac->avail] = objp;
-				return objp;
-			}
-		}
+	/* The caller cannot use PFMEMALLOC objects, find another one */
+	for (i = 0; i < ac->avail; i++) {
+		if (is_obj_pfmemalloc(ac->entry[i]))
+			continue;
 
-		/*
-		 * If there are empty slabs on the slabs_free list and we are
-		 * being forced to refill the cache, mark this one !pfmemalloc.
-		 */
-		n = get_node(cachep, numa_mem_id());
-		if (!list_empty(&n->slabs_free) && force_refill) {
-			struct page *page = virt_to_head_page(objp);
-			ClearPageSlabPfmemalloc(page);
-			clear_obj_pfmemalloc(&objp);
-			recheck_pfmemalloc_active(cachep, ac);
-			return objp;
-		}
+		/* !PFMEMALLOC object is found, swap them */
+		objp = ac->entry[i];
+		ac->entry[i] = ac->entry[ac->avail];
+		ac->entry[ac->avail] = objp;
 
-		/* No !PFMEMALLOC objects available */
-		ac->avail++;
-		objp = NULL;
+		return objp;
 	}
 
+	/*
+	 * If there are empty slabs on the slabs_free list and we are
+	 * being forced to refill the cache, mark this one !pfmemalloc.
+	 */
+	node = numa_mem_id();
+	n = get_node(cachep, node);
+	if (!list_empty(&n->slabs_free) && force_refill) {
+		struct page *page = virt_to_head_page(objp);
+
+		ClearPageSlabPfmemalloc(page);
+		clear_obj_pfmemalloc(&objp);
+		recheck_pfmemalloc_active(cachep, ac);
+
+		return objp;
+	}
+
+	/* No !PFMEMALLOC objects available */
+	if (ac->avail < ac->limit)
+		ac->entry[ac->avail++] = objp;
+	else {
+		spin_lock(&n->list_lock);
+		free_block(cachep, &objp, 1, node, &list);
+		spin_unlock(&n->list_lock);
+	}
+	objp = NULL;
+
 	return objp;
 }
 
@@ -772,10 +784,11 @@ static inline void *ac_get_obj(struct kmem_cache *cachep,
 {
 	void *objp;
 
-	if (unlikely(sk_memalloc_socks()))
-		objp = __ac_get_obj(cachep, ac, flags, force_refill);
-	else
-		objp = ac->entry[--ac->avail];
+	objp = ac->entry[--ac->avail];
+	if (unlikely(sk_memalloc_socks()) && is_obj_pfmemalloc(objp)) {
+		objp = get_obj_from_pfmemalloc_obj(cachep, ac, objp,
+						flags, force_refill);
+	}
 
 	return objp;
 }
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 4/6] mm/slab: rearrange irq management
  2015-01-05  1:37 ` Joonsoo Kim
@ 2015-01-05  1:37   ` Joonsoo Kim
  -1 siblings, 0 replies; 34+ messages in thread
From: Joonsoo Kim @ 2015-01-05  1:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	linux-mm, linux-kernel, Jesper Dangaard Brouer

Currently, irqs are disabled at the very beginning of the allocation
functions. In the following patch, some of the allocation functions
will be changed to work without disabling irqs, so rearrange the irq
management code as a preparation step.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/slab.c |   22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 62cd5c6..1246ac6 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2934,8 +2934,9 @@ static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 	void *objp;
 	struct array_cache *ac;
 	bool force_refill = false;
+	unsigned long save_flags;
 
-	check_irq_off();
+	local_irq_save(save_flags);
 
 	ac = cpu_cache_get(cachep);
 	if (likely(ac->avail)) {
@@ -2957,6 +2958,8 @@ static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 	objp = cache_alloc_refill(cachep, flags, force_refill);
 
 out:
+	local_irq_restore(save_flags);
+
 	return objp;
 }
 
@@ -3082,13 +3085,15 @@ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
 	struct kmem_cache_node *n;
 	void *obj;
 	int x;
+	unsigned long save_flags;
 
 	VM_BUG_ON(nodeid < 0 || nodeid >= MAX_NUMNODES);
 	n = get_node(cachep, nodeid);
 	BUG_ON(!n);
 
+	local_irq_save(save_flags);
+
 retry:
-	check_irq_off();
 	spin_lock(&n->list_lock);
 	entry = n->slabs_partial.next;
 	if (entry == &n->slabs_partial) {
@@ -3126,9 +3131,10 @@ must_grow:
 	if (x)
 		goto retry;
 
-	return fallback_alloc(cachep, flags);
+	obj = fallback_alloc(cachep, flags);
 
 done:
+	local_irq_restore(save_flags);
 	return obj;
 }
 
@@ -3150,14 +3156,15 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
 	cachep = memcg_kmem_get_cache(cachep, flags);
 
 	cache_alloc_debugcheck_before(cachep, flags);
-	local_irq_save(save_flags);
 
 	if (nodeid == NUMA_NO_NODE)
 		nodeid = slab_node;
 
 	if (unlikely(!get_node(cachep, nodeid))) {
 		/* Node not bootstrapped yet */
+		local_irq_save(save_flags);
 		ptr = fallback_alloc(cachep, flags);
+		local_irq_restore(save_flags);
 		goto out;
 	}
 
@@ -3174,8 +3181,8 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
 	}
 	/* ___cache_alloc_node can fall back to other nodes */
 	ptr = ____cache_alloc_node(cachep, flags, nodeid);
-  out:
-	local_irq_restore(save_flags);
+
+out:
 	ptr = cache_alloc_debugcheck_after(cachep, flags, ptr, caller);
 	kmemleak_alloc_recursive(ptr, cachep->object_size, 1, cachep->flags,
 				 flags);
@@ -3225,7 +3232,6 @@ __do_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 static __always_inline void *
 slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
 {
-	unsigned long save_flags;
 	void *objp;
 
 	flags &= gfp_allowed_mask;
@@ -3238,9 +3244,7 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
 	cachep = memcg_kmem_get_cache(cachep, flags);
 
 	cache_alloc_debugcheck_before(cachep, flags);
-	local_irq_save(save_flags);
 	objp = __do_cache_alloc(cachep, flags);
-	local_irq_restore(save_flags);
 	objp = cache_alloc_debugcheck_after(cachep, flags, objp, caller);
 	kmemleak_alloc_recursive(objp, cachep->object_size, 1, cachep->flags,
 				 flags);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 5/6] mm/slab: cleanup ____cache_alloc()
  2015-01-05  1:37 ` Joonsoo Kim
@ 2015-01-05  1:37   ` Joonsoo Kim
  -1 siblings, 0 replies; 34+ messages in thread
From: Joonsoo Kim @ 2015-01-05  1:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	linux-mm, linux-kernel, Jesper Dangaard Brouer

This cleanup makes the code more readable and helps future changes.
In the following patch, a lot of code will be added to this function.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/slab.c |   26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 1246ac6..449fc6b 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2939,21 +2939,23 @@ static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 	local_irq_save(save_flags);
 
 	ac = cpu_cache_get(cachep);
-	if (likely(ac->avail)) {
-		ac->touched = 1;
-		objp = ac_get_obj(cachep, ac, flags, false);
+	if (unlikely(!ac->avail))
+		goto slowpath;
 
-		/*
-		 * Allow for the possibility all avail objects are not allowed
-		 * by the current flags
-		 */
-		if (objp) {
-			STATS_INC_ALLOCHIT(cachep);
-			goto out;
-		}
-		force_refill = true;
+	ac->touched = 1;
+	objp = ac_get_obj(cachep, ac, flags, false);
+
+	/*
+	 * Allow for the possibility all avail objects are not allowed
+	 * by the current flags
+	 */
+	if (likely(objp)) {
+		STATS_INC_ALLOCHIT(cachep);
+		goto out;
 	}
+	force_refill = true;
 
+slowpath:
 	STATS_INC_ALLOCMISS(cachep);
 	objp = cache_alloc_refill(cachep, flags, force_refill);
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 6/6] mm/slab: allocation fastpath without disabling irq
  2015-01-05  1:37 ` Joonsoo Kim
@ 2015-01-05  1:37   ` Joonsoo Kim
  -1 siblings, 0 replies; 34+ messages in thread
From: Joonsoo Kim @ 2015-01-05  1:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	linux-mm, linux-kernel, Jesper Dangaard Brouer

SLAB always disables irqs before executing any object alloc/free
operation. This is really painful in terms of performance. A benchmark
that does alloc/free repeatedly shows that each alloc/free is roughly
2 times slower than SLUB's (27 ns vs. 14 ns). To improve performance,
this patch implements the allocation fastpath without disabling irqs.

A transaction id (tid) is introduced and updated on every operation.
In the allocation fastpath, an object in the array cache is read
speculatively. Then, the index of the object's position in the array
cache and the transaction id are updated atomically through
this_cpu_cmpxchg_double(). If the tid is unchanged up to this update,
it is guaranteed that no concurrent client has allocated or freed an
object from this array cache, so the allocation can succeed without
disabling irqs.

This approach is similar to the allocation fastpath in SLUB.

The above-mentioned benchmark shows that alloc/free fastpath
performance improves by roughly 22% (27 ns -> 21 ns).

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/slab.c |  151 +++++++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 118 insertions(+), 33 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 449fc6b..54656f0 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -168,6 +168,41 @@ typedef unsigned short freelist_idx_t;
 
 #define SLAB_OBJ_MAX_NUM ((1 << sizeof(freelist_idx_t) * BITS_PER_BYTE) - 1)
 
+#ifdef CONFIG_PREEMPT
+/*
+ * Calculate the next globally unique transaction for disambiguiation
+ * during cmpxchg. The transactions start with the cpu number and are then
+ * incremented by CONFIG_NR_CPUS.
+ */
+#define TID_STEP  roundup_pow_of_two(CONFIG_NR_CPUS)
+#else
+/*
+ * No preemption supported therefore also no need to check for
+ * different cpus.
+ */
+#define TID_STEP 1
+#endif
+
+static inline unsigned long next_tid(unsigned long tid)
+{
+	return tid + TID_STEP;
+}
+
+static inline unsigned int tid_to_cpu(unsigned long tid)
+{
+	return tid % TID_STEP;
+}
+
+static inline unsigned long tid_to_event(unsigned long tid)
+{
+	return tid / TID_STEP;
+}
+
+static inline unsigned int init_tid(int cpu)
+{
+	return cpu;
+}
+
 /*
  * true if a page was allocated from pfmemalloc reserves for network-based
  * swap
@@ -187,7 +222,8 @@ static bool pfmemalloc_active __read_mostly;
  *
  */
 struct array_cache {
-	unsigned int avail;
+	unsigned long avail;
+	unsigned long tid;
 	unsigned int limit;
 	unsigned int batchcount;
 	unsigned int touched;
@@ -657,7 +693,8 @@ static void start_cpu_timer(int cpu)
 	}
 }
 
-static void init_arraycache(struct array_cache *ac, int limit, int batch)
+static void init_arraycache(struct array_cache *ac, int limit,
+				int batch, int cpu)
 {
 	/*
 	 * The array_cache structures contain pointers to free object.
@@ -669,6 +706,7 @@ static void init_arraycache(struct array_cache *ac, int limit, int batch)
 	kmemleak_no_scan(ac);
 	if (ac) {
 		ac->avail = 0;
+		ac->tid = init_tid(cpu);
 		ac->limit = limit;
 		ac->batchcount = batch;
 		ac->touched = 0;
@@ -676,13 +714,13 @@ static void init_arraycache(struct array_cache *ac, int limit, int batch)
 }
 
 static struct array_cache *alloc_arraycache(int node, int entries,
-					    int batchcount, gfp_t gfp)
+					    int batchcount, gfp_t gfp, int cpu)
 {
 	size_t memsize = sizeof(void *) * entries + sizeof(struct array_cache);
 	struct array_cache *ac = NULL;
 
 	ac = kmalloc_node(memsize, gfp, node);
-	init_arraycache(ac, entries, batchcount);
+	init_arraycache(ac, entries, batchcount, cpu);
 	return ac;
 }
 
@@ -721,22 +759,36 @@ out:
 }
 
 static void *get_obj_from_pfmemalloc_obj(struct kmem_cache *cachep,
-				struct array_cache *ac, void *objp,
-				gfp_t flags, bool force_refill)
+				void *objp, gfp_t flags, bool force_refill)
 {
 	int i;
 	struct kmem_cache_node *n;
+	struct array_cache *ac;
 	LIST_HEAD(list);
-	int node;
-
-	BUG_ON(ac->avail >= ac->limit);
-	BUG_ON(objp != ac->entry[ac->avail]);
+	int page_node, node;
+	unsigned long save_flags;
 
 	if (gfp_pfmemalloc_allowed(flags)) {
 		clear_obj_pfmemalloc(&objp);
 		return objp;
 	}
 
+	local_irq_save(save_flags);
+	page_node = page_to_nid(virt_to_page(objp));
+	node = numa_mem_id();
+
+	/*
+	 * Because we disable irq just now, cpu can be changed
+	 * and we are on different node with object node. In this rare
+	 * case, just return pfmemalloc object for simplicity.
+	 */
+	if (unlikely(node != page_node)) {
+		clear_obj_pfmemalloc(&objp);
+		goto out;
+	}
+
+	ac = cpu_cache_get(cachep);
+
 	/* The caller cannot use PFMEMALLOC objects, find another one */
 	for (i = 0; i < ac->avail; i++) {
 		if (is_obj_pfmemalloc(ac->entry[i]))
@@ -747,7 +799,7 @@ static void *get_obj_from_pfmemalloc_obj(struct kmem_cache *cachep,
 		ac->entry[i] = ac->entry[ac->avail];
 		ac->entry[ac->avail] = objp;
 
-		return objp;
+		goto out;
 	}
 
 	/*
@@ -763,7 +815,7 @@ static void *get_obj_from_pfmemalloc_obj(struct kmem_cache *cachep,
 		clear_obj_pfmemalloc(&objp);
 		recheck_pfmemalloc_active(cachep, ac);
 
-		return objp;
+		goto out;
 	}
 
 	/* No !PFMEMALLOC objects available */
@@ -776,6 +828,8 @@ static void *get_obj_from_pfmemalloc_obj(struct kmem_cache *cachep,
 	}
 	objp = NULL;
 
+out:
+	local_irq_restore(save_flags);
 	return objp;
 }
 
@@ -784,9 +838,10 @@ static inline void *ac_get_obj(struct kmem_cache *cachep,
 {
 	void *objp;
 
+	ac->tid = next_tid(ac->tid);
 	objp = ac->entry[--ac->avail];
 	if (unlikely(sk_memalloc_socks()) && is_obj_pfmemalloc(objp)) {
-		objp = get_obj_from_pfmemalloc_obj(cachep, ac, objp,
+		objp = get_obj_from_pfmemalloc_obj(cachep, objp,
 						flags, force_refill);
 	}
 
@@ -812,6 +867,7 @@ static inline void ac_put_obj(struct kmem_cache *cachep, struct array_cache *ac,
 	if (unlikely(sk_memalloc_socks()))
 		objp = __ac_put_obj(cachep, ac, objp);
 
+	ac->tid = next_tid(ac->tid);
 	ac->entry[ac->avail++] = objp;
 }
 
@@ -825,7 +881,8 @@ static int transfer_objects(struct array_cache *to,
 		struct array_cache *from, unsigned int max)
 {
 	/* Figure out how many entries to transfer */
-	int nr = min3(from->avail, max, to->limit - to->avail);
+	int nr = min3(from->avail, (unsigned long)max,
+			(unsigned long)to->limit - to->avail);
 
 	if (!nr)
 		return 0;
@@ -882,7 +939,7 @@ static struct alien_cache *__alloc_alien_cache(int node, int entries,
 	struct alien_cache *alc = NULL;
 
 	alc = kmalloc_node(memsize, gfp, node);
-	init_arraycache(&alc->ac, entries, batch);
+	init_arraycache(&alc->ac, entries, batch, 0);
 	spin_lock_init(&alc->lock);
 	return alc;
 }
@@ -1117,6 +1174,7 @@ static void cpuup_canceled(long cpu)
 		nc = per_cpu_ptr(cachep->cpu_cache, cpu);
 		if (nc) {
 			free_block(cachep, nc->entry, nc->avail, node, &list);
+			nc->tid = next_tid(nc->tid);
 			nc->avail = 0;
 		}
 
@@ -1187,7 +1245,7 @@ static int cpuup_prepare(long cpu)
 		if (cachep->shared) {
 			shared = alloc_arraycache(node,
 				cachep->shared * cachep->batchcount,
-				0xbaadf00d, GFP_KERNEL);
+				0xbaadf00d, GFP_KERNEL, cpu);
 			if (!shared)
 				goto bad;
 		}
@@ -2008,14 +2066,14 @@ static struct array_cache __percpu *alloc_kmem_cache_cpus(
 	size = sizeof(void *) * entries + sizeof(struct array_cache);
 	if (slab_state < FULL)
 		gfp_flags = GFP_NOWAIT;
-	cpu_cache = __alloc_percpu_gfp(size, sizeof(void *), gfp_flags);
+	cpu_cache = __alloc_percpu_gfp(size, 2 * sizeof(void *), gfp_flags);
 
 	if (!cpu_cache)
 		return NULL;
 
 	for_each_possible_cpu(cpu) {
 		init_arraycache(per_cpu_ptr(cpu_cache, cpu),
-				entries, batchcount);
+				entries, batchcount, cpu);
 	}
 
 	return cpu_cache;
@@ -2051,6 +2109,7 @@ static int __init_refok setup_cpu_cache(struct kmem_cache *cachep, gfp_t gfp)
 			jiffies + REAPTIMEOUT_NODE +
 			((unsigned long)cachep) % REAPTIMEOUT_NODE;
 
+	cpu_cache_get(cachep)->tid = init_tid(smp_processor_id());
 	cpu_cache_get(cachep)->avail = 0;
 	cpu_cache_get(cachep)->limit = BOOT_CPUCACHE_ENTRIES;
 	cpu_cache_get(cachep)->batchcount = 1;
@@ -2339,6 +2398,7 @@ static void do_drain(void *arg)
 	free_block(cachep, ac->entry, ac->avail, node, &list);
 	spin_unlock(&n->list_lock);
 	slabs_destroy(cachep, &list);
+	ac->tid = next_tid(ac->tid);
 	ac->avail = 0;
 }
 
@@ -2774,6 +2834,9 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags,
 		goto force_grow;
 retry:
 	ac = cpu_cache_get(cachep);
+	if (ac->avail)
+		goto avail;
+
 	batchcount = ac->batchcount;
 	if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
 		/*
@@ -2790,6 +2853,7 @@ retry:
 
 	/* See if we can refill from the shared array */
 	if (n->shared && transfer_objects(ac, n->shared, batchcount)) {
+		ac->tid = next_tid(ac->tid);
 		n->shared->touched = 1;
 		goto alloc_done;
 	}
@@ -2854,6 +2918,8 @@ force_grow:
 		if (!ac->avail)		/* objects refilled by interrupt? */
 			goto retry;
 	}
+
+avail:
 	ac->touched = 1;
 
 	return ac_get_obj(cachep, ac, flags, force_refill);
@@ -2935,31 +3001,48 @@ static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 	struct array_cache *ac;
 	bool force_refill = false;
 	unsigned long save_flags;
+	unsigned long tid;
+	unsigned long avail;
 
-	local_irq_save(save_flags);
-
+redo:
+	preempt_disable();
 	ac = cpu_cache_get(cachep);
-	if (unlikely(!ac->avail))
+	tid = ac->tid;
+	preempt_enable();
+
+	avail = ac->avail;
+	if (unlikely(!avail))
 		goto slowpath;
 
 	ac->touched = 1;
-	objp = ac_get_obj(cachep, ac, flags, false);
-
-	/*
-	 * Allow for the possibility all avail objects are not allowed
-	 * by the current flags
-	 */
-	if (likely(objp)) {
-		STATS_INC_ALLOCHIT(cachep);
-		goto out;
+	objp = ac->entry[avail - 1];
+	if (unlikely(!this_cpu_cmpxchg_double(
+		cachep->cpu_cache->avail, cachep->cpu_cache->tid,
+		avail, tid,
+		avail - 1, next_tid(tid))))
+		goto redo;
+
+	if (unlikely(sk_memalloc_socks() && is_obj_pfmemalloc(objp))) {
+		/*
+		 * Allow for the possibility all avail objects are not
+		 * allowed by the current flags
+		 */
+		objp = get_obj_from_pfmemalloc_obj(cachep, objp,
+						flags, force_refill);
+		if (!objp) {
+			force_refill = true;
+			goto slowpath;
+		}
 	}
-	force_refill = true;
+
+	STATS_INC_ALLOCHIT(cachep);
+	return objp;
 
 slowpath:
+	local_irq_save(save_flags);
 	STATS_INC_ALLOCMISS(cachep);
 	objp = cache_alloc_refill(cachep, flags, force_refill);
 
-out:
 	local_irq_restore(save_flags);
 
 	return objp;
@@ -3353,6 +3436,7 @@ free_done:
 #endif
 	spin_unlock(&n->list_lock);
 	slabs_destroy(cachep, &list);
+	ac->tid = next_tid(ac->tid);
 	ac->avail -= batchcount;
 	memmove(ac->entry, &(ac->entry[batchcount]), sizeof(void *)*ac->avail);
 }
@@ -3605,7 +3689,7 @@ static int alloc_kmem_cache_node(struct kmem_cache *cachep, gfp_t gfp)
 		if (cachep->shared) {
 			new_shared = alloc_arraycache(node,
 				cachep->shared*cachep->batchcount,
-					0xbaadf00d, gfp);
+					0xbaadf00d, gfp, 0);
 			if (!new_shared) {
 				free_alien_cache(new_alien);
 				goto fail;
@@ -3829,6 +3913,7 @@ static void drain_array(struct kmem_cache *cachep, struct kmem_cache_node *n,
 			if (tofree > ac->avail)
 				tofree = (ac->avail + 1) / 2;
 			free_block(cachep, ac->entry, tofree, node, &list);
+			ac->tid = next_tid(ac->tid);
 			ac->avail -= tofree;
 			memmove(ac->entry, &(ac->entry[tofree]),
 				sizeof(void *) * ac->avail);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/6] mm/slab: allocation fastpath without disabling irq
  2015-01-05  1:37   ` Joonsoo Kim
@ 2015-01-05 15:28     ` Christoph Lameter
  -1 siblings, 0 replies; 34+ messages in thread
From: Christoph Lameter @ 2015-01-05 15:28 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Pekka Enberg, David Rientjes, linux-mm,
	linux-kernel, Jesper Dangaard Brouer

On Mon, 5 Jan 2015, Joonsoo Kim wrote:

> index 449fc6b..54656f0 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -168,6 +168,41 @@ typedef unsigned short freelist_idx_t;
>
>  #define SLAB_OBJ_MAX_NUM ((1 << sizeof(freelist_idx_t) * BITS_PER_BYTE) - 1)
>
> +#ifdef CONFIG_PREEMPT
> +/*
> + * Calculate the next globally unique transaction for disambiguiation
> + * during cmpxchg. The transactions start with the cpu number and are then
> + * incremented by CONFIG_NR_CPUS.
> + */
> +#define TID_STEP  roundup_pow_of_two(CONFIG_NR_CPUS)
> +#else
> +/*
> + * No preemption supported therefore also no need to check for
> + * different cpus.
> + */
> +#define TID_STEP 1
> +#endif
> +
> +static inline unsigned long next_tid(unsigned long tid)
> +{
> +	return tid + TID_STEP;
> +}
> +
> +static inline unsigned int tid_to_cpu(unsigned long tid)
> +{
> +	return tid % TID_STEP;
> +}
> +
> +static inline unsigned long tid_to_event(unsigned long tid)
> +{
> +	return tid / TID_STEP;
> +}
> +
> +static inline unsigned int init_tid(int cpu)
> +{
> +	return cpu;
> +}
> +

Ok, the above stuff needs to go into the common code. Maybe in mm/slab.h?
And it's a significant feature contributed by me, so I'd like to have an
attribution here.
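
(Purely as an illustrative sketch -- the helpers quoted above hoisted into
mm/slab.h unchanged, not a finished patch:)

        /* mm/slab.h: shared transaction id machinery (sketch) */
        #ifdef CONFIG_PREEMPT
        #define TID_STEP  roundup_pow_of_two(CONFIG_NR_CPUS)
        #else
        #define TID_STEP  1
        #endif

        static inline unsigned long next_tid(unsigned long tid)
        {
                return tid + TID_STEP;
        }

        static inline unsigned int init_tid(int cpu)
        {
                return cpu;
        }

        /* tid_to_cpu()/tid_to_event() would move along with these */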

>  /*
>   * true if a page was allocated from pfmemalloc reserves for network-based
>   * swap
> @@ -187,7 +222,8 @@ static bool pfmemalloc_active __read_mostly;
>   *
>   */
>  struct array_cache {
> -	unsigned int avail;
> +	unsigned long avail;
> +	unsigned long tid;
>  	unsigned int limit;
>  	unsigned int batchcount;
>  	unsigned int touched;
> @@ -657,7 +693,8 @@ static void start_cpu_timer(int cpu)
>  	}
>  }

This increases the per cpu struct size and should lead to a small
performance penalty.

> -	 */
> -	if (likely(objp)) {
> -		STATS_INC_ALLOCHIT(cachep);
> -		goto out;
> +	objp = ac->entry[avail - 1];
> +	if (unlikely(!this_cpu_cmpxchg_double(
> +		cachep->cpu_cache->avail, cachep->cpu_cache->tid,
> +		avail, tid,
> +		avail - 1, next_tid(tid))))
> +		goto redo;


Hmm... Ok that looks good.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/6] mm/slab: allocation fastpath without disabling irq
  2015-01-05  1:37   ` Joonsoo Kim
@ 2015-01-05 17:21     ` Andreas Mohr
  -1 siblings, 0 replies; 34+ messages in thread
From: Andreas Mohr @ 2015-01-05 17:21 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, linux-mm, linux-kernel, Jesper Dangaard Brouer

Hi,

Joonsoo Kim wrote:
> + * Calculate the next globally unique transaction for disambiguiation

"disambiguation"


> +	ac->tid = next_tid(ac->tid);
(and all others)

object oriented:
array_cache_next_tid(ac);
(or perhaps rather: array_cache_start_transaction(ac);?).
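
(i.e. a trivial wrapper along these lines -- name as proposed here, purely
illustrative:)

        static inline void array_cache_next_tid(struct array_cache *ac)
        {
                ac->tid = next_tid(ac->tid);
        }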


> +	/*
> +	 * Because we disable irq just now, cpu can be changed
> +	 * and we are on different node with object node. In this rare
> +	 * case, just return pfmemalloc object for simplicity.
> +	 */

"are on a node which is different from object's node"




General thoughts (maybe just rambling, but that's just my feelings vs.
this mechanism, so maybe it's food for thought):
To me, the existing implementation seems too fond of IRQ fumbling
(i.e., affecting of oh so nicely *unrelated*
outer global environment context stuff).
A proper implementation wouldn't need *any* knowledge of this
(i.e., modifying such "IRQ disable" side effects,
to avoid having a scheduler hit and possibly ending up on another node).

Thus to me, the whole handling seems somewhat wrong and split
(since there remains the need to deal with scheduler distortion/disruption).
The bare-metal "inner" algorithm should not need to depend on such shenanigans
but simply be able to carry out its task unaffected,
where IRQs are simply always left enabled
(or at least potentially disabled by other kernel components only)
and the code then elegantly/inherently deals with IRQ complications.

Since the node change is scheduler-driven (I assume),
any (changes of) context attributes
which are relevant to (affect) SLAB-internal operations
ought to be implicitly/automatically re-assigned by the scheduler,
and then the most that should be needed is a *final* check in SLAB
(possibly in an outer user-facing layer of it)
whether the current final calculation result still matches expectations,
i.e. whether there was no disruption
(in which case we'd also do a goto redo: operation or some such :).

These thoughts also mean that I'm unsure (difficult to determine)
of whether this change is good (i.e. a clean step in the right direction),
or whether instead the implementation could easily directly be made
fully independent from IRQ constraints.

Thanks,

Andreas Mohr

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/6] mm/slab: allocation fastpath without disabling irq
  2015-01-05 17:21     ` Andreas Mohr
@ 2015-01-05 17:52       ` Christoph Lameter
  -1 siblings, 0 replies; 34+ messages in thread
From: Christoph Lameter @ 2015-01-05 17:52 UTC (permalink / raw)
  To: Andreas Mohr
  Cc: Joonsoo Kim, Andrew Morton, Pekka Enberg, David Rientjes,
	linux-mm, linux-kernel, Jesper Dangaard Brouer

On Mon, 5 Jan 2015, Andreas Mohr wrote:

> These thoughts also mean that I'm unsure (difficult to determine)
> of whether this change is good (i.e. a clean step in the right direction),
> or whether instead the implementation could easily directly be made
> fully independent from IRQ constraints.

We have thought a couple of times about making it independent of
interrupts. We can do that if there is a guarantee that no slab operations
are going to be performed from an interrupt context. That in turn will
simplify allocator design significantly.

Regarding this patchset: I think this has the character of an RFC at this
point. There are some good ideas here but this needs to mature a bit and
get lots of feedback.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/6] mm/slab: allocation fastpath without disabling irq
  2015-01-05 15:28     ` Christoph Lameter
@ 2015-01-06  1:04       ` Joonsoo Kim
  -1 siblings, 0 replies; 34+ messages in thread
From: Joonsoo Kim @ 2015-01-06  1:04 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Pekka Enberg, David Rientjes, linux-mm,
	linux-kernel, Jesper Dangaard Brouer

On Mon, Jan 05, 2015 at 09:28:14AM -0600, Christoph Lameter wrote:
> On Mon, 5 Jan 2015, Joonsoo Kim wrote:
> 
> > index 449fc6b..54656f0 100644
> > --- a/mm/slab.c
> > +++ b/mm/slab.c
> > @@ -168,6 +168,41 @@ typedef unsigned short freelist_idx_t;
> >
> >  #define SLAB_OBJ_MAX_NUM ((1 << sizeof(freelist_idx_t) * BITS_PER_BYTE) - 1)
> >
> > +#ifdef CONFIG_PREEMPT
> > +/*
> > + * Calculate the next globally unique transaction for disambiguiation
> > + * during cmpxchg. The transactions start with the cpu number and are then
> > + * incremented by CONFIG_NR_CPUS.
> > + */
> > +#define TID_STEP  roundup_pow_of_two(CONFIG_NR_CPUS)
> > +#else
> > +/*
> > + * No preemption supported therefore also no need to check for
> > + * different cpus.
> > + */
> > +#define TID_STEP 1
> > +#endif
> > +
> > +static inline unsigned long next_tid(unsigned long tid)
> > +{
> > +	return tid + TID_STEP;
> > +}
> > +
> > +static inline unsigned int tid_to_cpu(unsigned long tid)
> > +{
> > +	return tid % TID_STEP;
> > +}
> > +
> > +static inline unsigned long tid_to_event(unsigned long tid)
> > +{
> > +	return tid / TID_STEP;
> > +}
> > +
> > +static inline unsigned int init_tid(int cpu)
> > +{
> > +	return cpu;
> > +}
> > +
> 
> Ok the above stuff needs to go into the common code. Maybe in mm/slab.h?
> And its a significant feature contributed by me so I'd like to have an
> attribution here.

Okay. I will try!

> 
> >  /*
> >   * true if a page was allocated from pfmemalloc reserves for network-based
> >   * swap
> > @@ -187,7 +222,8 @@ static bool pfmemalloc_active __read_mostly;
> >   *
> >   */
> >  struct array_cache {
> > -	unsigned int avail;
> > +	unsigned long avail;
> > +	unsigned long tid;
> >  	unsigned int limit;
> >  	unsigned int batchcount;
> >  	unsigned int touched;
> > @@ -657,7 +693,8 @@ static void start_cpu_timer(int cpu)
> >  	}
> >  }
> 
> This increases the per cpu struct size and should lead to a small
> performance penalty.

Yes, but it's marginal compared to the improvement from this patchset.
> 
> > -	 */
> > -	if (likely(objp)) {
> > -		STATS_INC_ALLOCHIT(cachep);
> > -		goto out;
> > +	objp = ac->entry[avail - 1];
> > +	if (unlikely(!this_cpu_cmpxchg_double(
> > +		cachep->cpu_cache->avail, cachep->cpu_cache->tid,
> > +		avail, tid,
> > +		avail - 1, next_tid(tid))))
> > +		goto redo;
> 
> 
> Hmm... Ok that looks good.

Thanks.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/6] mm/slab: allocation fastpath without disabling irq
  2015-01-05 17:21     ` Andreas Mohr
@ 2015-01-06  1:31       ` Joonsoo Kim
  -1 siblings, 0 replies; 34+ messages in thread
From: Joonsoo Kim @ 2015-01-06  1:31 UTC (permalink / raw)
  To: Andreas Mohr
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	linux-mm, linux-kernel, Jesper Dangaard Brouer

Hello,

On Mon, Jan 05, 2015 at 06:21:39PM +0100, Andreas Mohr wrote:
> Hi,
> 
> Joonsoo Kim wrote:
> > + * Calculate the next globally unique transaction for disambiguiation
> 
> "disambiguation"

Okay.

> 
> > +	ac->tid = next_tid(ac->tid);
> (and all others)
> 
> object oriented:
> array_cache_next_tid(ac);
> (or perhaps rather: array_cache_start_transaction(ac);?).

Okay. Christoph requested common transaction id management code. If
the above object-oriented design fits that, I will do it.
> 
> > +	/*
> > +	 * Because we disable irq just now, cpu can be changed
> > +	 * and we are on different node with object node. In this rare
> > +	 * case, just return pfmemalloc object for simplicity.
> > +	 */
> 
> "are on a node which is different from object's node"
> 

Thanks. :)

> 
> 
> General thoughts (maybe just rambling, but that's just my feelings vs.
> this mechanism, so maybe it's food for thought):
> To me, the existing implementation seems too fond of IRQ fumbling
> (i.e., affecting of oh so nicely *unrelated*
> outer global environment context stuff).
> A proper implementation wouldn't need *any* knowledge of this
> (i.e., modifying such "IRQ disable" side effects,
> to avoid having a scheduler hit and possibly ending up on another node).
> 
> Thus to me, the whole handling seems somewhat wrong and split
> (since there remains the need to deal with scheduler distortion/disruption).
> The bare-metal "inner" algorithm should not need to depend on such shenanigans
> but simply be able to carry out its task unaffected,
> where IRQs are simply always left enabled
> (or at least potentially disabled by other kernel components only)
> and the code then elegantly/inherently deals with IRQ complications.

I'm not sure I understand your opinion correctly. If my response
misses your point, please explain your thought in more detail.

IRQ manipulation is done to synchronize the array cache, not to freeze
the allocation context; the latter is just a side effect. As Christoph
said, slab operations can be executed in interrupt context, so we must
protect the array cache even when we are in process context.
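
(A contrived sketch of the hazard -- not actual kernel code -- showing what
could go wrong if the plain paths ran with IRQs enabled:)

        /* process context popping an object from the per-cpu array cache */
        objp = ac->entry[--ac->avail];
        /*
         * If an interrupt fires around the read-modify-write of ac->avail
         * and its handler calls kfree(), the handler pushes into the very
         * same array cache:
         *
         *      ac->entry[ac->avail++] = other_objp;
         *
         * and the two unsynchronized updates can corrupt avail/entry.
         * Hence local_irq_save() around the plain paths, or the
         * tid + this_cpu_cmpxchg_double() scheme in the new fastpath.
         */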

> Since the node change is scheduler-driven (I assume),
> any (changes of) context attributes
> which are relevant to (affect) SLAB-internal operations
> ought to be implicitly/automatically re-assigned by the scheduler,
> and then the most that should be needed is a *final* check in SLAB
> (possibly in an outer user-facing layer of it)
> whether the current final calculation result still matches expectations,
> i.e. whether there was no disruption
> (in which case we'd also do a goto redo: operation or some such :).

If we could do the whole procedure without IRQ manipulation, a final
check would be the more elegant solution. But the current implementation
necessarily needs IRQ manipulation for synchronization at a pretty early
phase. In this situation, checking and adjusting the context as the first
step may be better than a final check and redo.
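
(For reference, the "final check and redo" shape is roughly what the patch
already does on the alloc side -- a simplified sketch, with the pfmemalloc
handling and the preempt_disable() around the tid read omitted:)

        do {
                ac = cpu_cache_get(cachep);
                tid = ac->tid;
                avail = ac->avail;
                if (unlikely(!avail))
                        goto slowpath;  /* irq-disabled refill, as before */
                objp = ac->entry[avail - 1];
        } while (!this_cpu_cmpxchg_double(cachep->cpu_cache->avail,
                                          cachep->cpu_cache->tid,
                                          avail, tid,
                                          avail - 1, next_tid(tid)));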

> These thoughts also mean that I'm unsure (difficult to determine)
> of whether this change is good (i.e. a clean step in the right direction),
> or whether instead the implementation could easily directly be made
> fully independent from IRQ constraints.

Is there any issue where this change prevents further improvement?
I think that we can easily go in the right direction as soon as we find
a better solution, even if this change is merged.

Thanks.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/6] mm/slab: allocation fastpath without disabling irq
  2015-01-06  1:31       ` Joonsoo Kim
@ 2015-01-06 10:34         ` Andreas Mohr
  -1 siblings, 0 replies; 34+ messages in thread
From: Andreas Mohr @ 2015-01-06 10:34 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andreas Mohr, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, linux-mm, linux-kernel, Jesper Dangaard Brouer

On Tue, Jan 06, 2015 at 10:31:22AM +0900, Joonsoo Kim wrote:
> Hello,
> 
> On Mon, Jan 05, 2015 at 06:21:39PM +0100, Andreas Mohr wrote:
> > Hi,
> > 
> > Joonsoo Kim wrote:
> > > +	ac->tid = next_tid(ac->tid);
> > (and all others)
> > 
> > object oriented:
> > array_cache_next_tid(ac);
> > (or perhaps rather: array_cache_start_transaction(ac);?).
> 
> Okay. Christoph request common transaction id management code. If
> above object oriented design fit that, I will do it.

Yeah, that may easily have been the same thing.
This function would then obviously and simply do a
    ac->tid = next_tid(ac->tid);
dance internally
(thereby likely introducing some nice instruction cache savings, too).


> > General thoughts (maybe just rambling, but that's just my feelings vs.
> > this mechanism, so maybe it's food for thought):
> > To me, the existing implementation seems too fond of IRQ fumbling
> > (i.e., affecting of oh so nicely *unrelated*
> > outer global environment context stuff).
> > A proper implementation wouldn't need *any* knowledge of this
> > (i.e., modifying such "IRQ disable" side effects,
> > to avoid having a scheduler hit and possibly ending up on another node).
> > 
> > Thus to me, the whole handling seems somewhat wrong and split
> > (since there remains the need to deal with scheduler distortion/disruption).
> > The bare-metal "inner" algorithm should not need to depend on such shenanigans
> > but simply be able to carry out its task unaffected,
> > where IRQs are simply always left enabled
> > (or at least potentially disabled by other kernel components only)
> > and the code then elegantly/inherently deals with IRQ complications.
> 
> I'm not sure I understand your opinion correctly. If my response is
> wrong, please let me know your thought more correctly.

I have to admit that I was talking about general implementation guidelines
rather than taking a very close look at what can be done *here*.
Put differently, my goal at this point was just to state weird questions,
for people to then reason about the bigger picture
and come to possibly brilliant conclusions ;)


> IRQ manipulation is done for synchronization of array cache, not for
> freezing allocation context. That is just side-effect. As Christoph
> said, slab operation could be executed in interrupt context so we
> should protect array cache even if we are in process context.

Ah, that clarifies collision areas, thanks!

In general the goal likely should be
to attempt to do as many things "in parallel" (guaranteed-collision-free,
via cleanly instance-separate data) as possible,
and then once the result has been calculated,
do quick/short updating (insertion/deletion)
of the shared/contended resource (here: array cache)
with the necessary protection (IRQs disabled, in this case).
Via some layering tricks, one could manage to do all the calculation handling
without even drawing in any "IRQ management" dependency
to that inner code part ("code file"?),
but the array cache management would then remain IRQ-affected
(in those outer layers which know that they are unfortunately still drawn down
by an annoying dependency on IRQ shenanigans).
Or, in other words: achieve a clean separation of layers
so that it's obvious which ones get hampered by annoying dependencies
and which ones are able to carry out their job unaffected.


And for the "disruption via interrupt context" part:
we currently seem to have "synchronous" handling,
where one needs to forcefully block access to a shared/contended resource
(which would be done
either via annoying mutex contention,
or even via wholesale blocking of IRQ execution).
To remove/reduce friction,
one should either remove any additional foreign-context access to that resource,
thereby making this implementation cleanly (if woefully)
single-context only (as Christoph seems to be intending,
by removing SLAB access from IRQ handlers?),
or ideally implement fully parallel instance-separate data management
(but I guess that's not what one would want
for a global multi-sized-area and thus fragmentation-avoiding SL*B allocator,
since that would mean splitting global memory resources
into per-instance areas?).

Alternatively, one could go for "asynchronous" handling,
where SL*B updates of any foreign-context (but not main-context)
are *delayed*,
by merely queuing them into a simple submission queue
which then will be delay-applied by main-context
either once main-context enters a certain "quiet" state (e.g. context switch?),
or once main-context needs to actively take into account
these effects of foreign-context registrations.
But for an allocator (memory manager),
such "asynchronous" handling might be completely impossible anyway,
since it might need to have a consistent global view,
of all resources at all times
(--> we're back to having a synchronous handling requirement,
via lock contention).
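
(A hand-wavy sketch of that "asynchronous" idea -- every name below is
hypothetical, nothing like it exists in mm/slab.c today:)

        /* per-cpu spill queue that foreign/interrupt context would push
         * into instead of touching the array cache directly */
        struct deferred_free {
                void *objs[16];         /* arbitrary bound for the sketch */
                unsigned int nr;
        };

        /* the owning context would drain it later under its usual
         * protection, e.g. at refill time:
         *
         *      while (df->nr)
         *              free_one(cachep, df->objs[--df->nr]);
         *
         * where free_one() is a stand-in for the real free path, not an
         * existing function
         */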


> > These thoughts also mean that I'm unsure (difficult to determine)
> > of whether this change is good (i.e. a clean step in the right direction),
> > or whether instead the implementation could easily directly be made
> > fully independent from IRQ constraints.
> 
> Is there any issue that this change prevent further improvment?
> I think that we can go right direction easily as soon as we find
> a better solution even if this change is merged.

Ah, I see you're skillfully asking the right forward progress question ;)

To which I'm currently limited to saying
that from my side there are no objections to be added to the list
(other than the minor mechanical parts directly stated in my review).

Thanks,

Andreas Mohr

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/6] mm/slab: allocation fastpath without disabling irq
  2015-01-06 10:34         ` Andreas Mohr
@ 2015-01-06 15:33           ` Christoph Lameter
  -1 siblings, 0 replies; 34+ messages in thread
From: Christoph Lameter @ 2015-01-06 15:33 UTC (permalink / raw)
  To: Andreas Mohr
  Cc: Joonsoo Kim, Andrew Morton, Pekka Enberg, David Rientjes,
	linux-mm, linux-kernel, Jesper Dangaard Brouer

On Tue, 6 Jan 2015, Andreas Mohr wrote:
> by merely queuing them into a simple submission queue
> which then will be delay-applied by main-context
> either once main-context enters a certain "quiet" state (e.g. context switch?),
> or once main-context needs to actively take into account

This is basically the same approach as you mentioned before and would
multiply the resources needed. I think we are close to being able to avoid
allocations from interrupt contexts. Someone would need to perform an
audit to see what is left to be done. If so then lots of allocator paths
both in the page allocator and slab allocator can be dramatically
simplified.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/6] mm/slab: allocation fastpath without disabling irq
  2015-01-06 15:33           ` Christoph Lameter
@ 2015-01-06 16:26             ` Andreas Mohr
  -1 siblings, 0 replies; 34+ messages in thread
From: Andreas Mohr @ 2015-01-06 16:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andreas Mohr, Joonsoo Kim, Andrew Morton, Pekka Enberg,
	David Rientjes, linux-mm, linux-kernel, Jesper Dangaard Brouer

On Tue, Jan 06, 2015 at 09:33:15AM -0600, Christoph Lameter wrote:
> On Tue, 6 Jan 2015, Andreas Mohr wrote:
> > by merely queuing them into a simple submission queue
> > which then will be delay-applied by main-context
> > either once main-context enters a certain "quiet" state (e.g. context switch?),
> > or once main-context needs to actively take into account
> 
> This is basically the same approach as you mentioned before and would
> multiply the resources needed. I think we are close to be able to avoid
> allocations from interrupt contexts. Someone would need to perform an
> audit to see what is left to be done. If so then lots of allocator paths
> both in the page allocator and slab allocator can be dramatically
> simplified.

OK, so we seem to be already well near the finishing line of single-context
operation.

In case of multi-context access, in general I'd guess
that challenges are similar to
"traditional" multi-thread-capable heap allocator implementations in userspace,
where I'm sure plenty of research papers (some dead-tree?)
have been written about how to achieve scalable low-contention
multi-thread access arbitration to the same shared underlying memory resource
(but since in Linux circles some implementations exceed usual
accumulated research knowledge / Best Practice, such papers may or may not be
of much help ;)

Andreas Mohr

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/6] mm/slab: allocation fastpath without disabling irq
  2015-01-06 10:34         ` Andreas Mohr
@ 2015-01-08  7:54           ` Joonsoo Kim
  -1 siblings, 0 replies; 34+ messages in thread
From: Joonsoo Kim @ 2015-01-08  7:54 UTC (permalink / raw)
  To: Andreas Mohr
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	linux-mm, linux-kernel, Jesper Dangaard Brouer

On Tue, Jan 06, 2015 at 11:34:39AM +0100, Andreas Mohr wrote:
> On Tue, Jan 06, 2015 at 10:31:22AM +0900, Joonsoo Kim wrote:
> > Hello,
> > 
> > On Mon, Jan 05, 2015 at 06:21:39PM +0100, Andreas Mohr wrote:
> > > Hi,
> > > 
> > > Joonsoo Kim wrote:
> > > > +	ac->tid = next_tid(ac->tid);
> > > (and all others)
> > > 
> > > object oriented:
> > > array_cache_next_tid(ac);
> > > (or perhaps rather: array_cache_start_transaction(ac);?).
> > 
> > Okay. Christoph request common transaction id management code. If
> > above object oriented design fit that, I will do it.
> 
> Yeah, that may easily have been the same thing.
> This function would then obviously and simply do a
>     ac->tid = next_tid(ac->tid);
> dance internally
> (thereby likely introducing some nice instruction cache savings, too).
> 
> 
> > > General thoughts (maybe just rambling, but that's just my feelings vs.
> > > this mechanism, so maybe it's food for thought):
> > > To me, the existing implementation seems too fond of IRQ fumbling
> > > (i.e., affecting of oh so nicely *unrelated*
> > > outer global environment context stuff).
> > > A proper implementation wouldn't need *any* knowledge of this
> > > (i.e., modifying such "IRQ disable" side effects,
> > > to avoid having a scheduler hit and possibly ending up on another node).
> > > 
> > > Thus to me, the whole handling seems somewhat wrong and split
> > > (since there remains the need to deal with scheduler distortion/disruption).
> > > The bare-metal "inner" algorithm should not need to depend on such shenanigans
> > > but simply be able to carry out its task unaffected,
> > > where IRQs are simply always left enabled
> > > (or at least potentially disabled by other kernel components only)
> > > and the code then elegantly/inherently deals with IRQ complications.
> > 
> > I'm not sure I understand your opinion correctly. If my response is
> > wrong, please let me know your thought more correctly.
> 
> I have to admit that I was talking general implementation guidelines
> rather than having a very close look at what can be done *here*.
> Put differently, my goal at this point was just to state weird questions,
> for people to then reason about the bigger picture
> and come to possibly brilliant conclusions ;)
> 
> 
> > IRQ manipulation is done for synchronization of array cache, not for
> > freezing allocation context. That is just side-effect. As Christoph
> > said, slab operation could be executed in interrupt context so we
> > should protect array cache even if we are in process context.
> 
> Ah, that clarifies collision areas, thanks!
> 
> In general the goal likely should be
> to attempt to do as many things "in parallel" (guaranteed-collision-free,
> via cleanly instance-separate data) as possible,
> and then once the result has been calculated,
> do quick/short updating (insertion/deletion)
> of the shared/contended resource (here: array cache)
> with the necessary protection (IRQs disabled, in this case).
> Via some layering tricks, one could manage to do all the calculation handling
> without even drawing in any "IRQ management" dependency
> to that inner code part ("code file"?),
> but the array cache management would then remain IRQ-affected
> (in those outer layers which know that they are unfortunately still drawn down
> by an annoying dependency on IRQ shenanigans).
> Or, in other words: achieve a clean separation of layers
> so that it's obvious which ones get hampered by annoying dependencies
> and which ones are able to carry out their job unaffected.

Hello,

Thanks for the good explanation.
I understand your suggestion and it sounds good. However, most of the
inner operations in SLAB are also affected by IRQ because they aim at
updating the array cache. :) If operations were done only in a single
context, more optimization might be possible, but that isn't yet the case.

Anyway, your suggestion makes me realize the bigger picture to aim for. I
will keep this in mind. Thank you.

> 
> 
> And for the "disruption via interrupt context" part:
> we currently seem to have "synchronous" handling,
> where one needs to forcefully block access to a shared/contended resource
> (which would be done
> either via annoying mutex contention,
> or even via wholesale blocking of IRQ execution).
> To remove/reduce friction,
> one should either remove any additional foreign-context access to that resource,
> thereby making this implementation cleanly (if woefully)
> single-context only (as Christoph seems to be intending,
> by removing SLAB access from IRQ handlers?),
> or ideally implement fully parallel instance-separate data management
> (but I guess that's not what one would want
> for a global multi-sized-area and thus fragmentation-avoiding SL*B allocator,
> since that would mean splitting global memory resources
> into per-instance areas?).
> 
> Alternatively, one could go for "asynchronous" handling,
> where SL*B updates of any foreign-context (but not main-context)
> are *delayed*,
> by merely queuing them into a simple submission queue
> which then will be delay-applied by main-context
> either once main-context enters a certain "quiet" state (e.g. context switch?),
> or once main-context needs to actively take into account
> these effects of foreign-context registrations.
> But for an allocator (memory manager),
> such "asynchronous" handling might be completely impossible anyway,
> since it might need to have a consistent global view,
> of all resources at all times
> (--> we're back to having a synchronous handling requirement,
> via lock contention).

While implementing this patchset, I also thought about having another
queue for foreign contexts, but didn't go in that direction since it needs
complicated code and I can't easily estimate the performance trade-off of
that implementation.

> 
> 
> > > These thoughts also mean that I'm unsure (difficult to determine)
> > > of whether this change is good (i.e. a clean step in the right direction),
> > > or whether instead the implementation could easily directly be made
> > > fully independent from IRQ constraints.
> > 
> > Is there any issue that this change prevent further improvment?
> > I think that we can go right direction easily as soon as we find
> > a better solution even if this change is merged.
> 
> Ah, I see you're skillfully asking the right forward progress question ;)
> 
> To which I'm currently limited to saying
> that from my side there are no objections to be added to the list
> (other than the minor mechanical parts directly stated in my review).

Okay.

Thanks.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/6] mm/slab: allocation fastpath without disabling irq
@ 2015-01-08  7:54           ` Joonsoo Kim
  0 siblings, 0 replies; 34+ messages in thread
From: Joonsoo Kim @ 2015-01-08  7:54 UTC (permalink / raw)
  To: Andreas Mohr
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	linux-mm, linux-kernel, Jesper Dangaard Brouer

On Tue, Jan 06, 2015 at 11:34:39AM +0100, Andreas Mohr wrote:
> On Tue, Jan 06, 2015 at 10:31:22AM +0900, Joonsoo Kim wrote:
> > Hello,
> > 
> > On Mon, Jan 05, 2015 at 06:21:39PM +0100, Andreas Mohr wrote:
> > > Hi,
> > > 
> > > Joonsoo Kim wrote:
> > > > +	ac->tid = next_tid(ac->tid);
> > > (and all others)
> > > 
> > > object oriented:
> > > array_cache_next_tid(ac);
> > > (or perhaps rather: array_cache_start_transaction(ac);?).
> > 
> > Okay. Christoph request common transaction id management code. If
> > above object oriented design fit that, I will do it.
> 
> Yeah, that may easily have been the same thing.
> This function would then obviously and simply do a
>     ac->tid = next_tid(ac->tid);
> dance internally
> (thereby likely introducing some nice instruction cache savings, too).
> 
> 
> > > General thoughts (maybe just rambling, but that's just my feelings vs.
> > > this mechanism, so maybe it's food for thought):
> > > To me, the existing implementation seems too fond of IRQ fumbling
> > > (i.e., affecting of oh so nicely *unrelated*
> > > outer global environment context stuff).
> > > A proper implementation wouldn't need *any* knowledge of this
> > > (i.e., modifying such "IRQ disable" side effects,
> > > to avoid having a scheduler hit and possibly ending up on another node).
> > > 
> > > Thus to me, the whole handling seems somewhat wrong and split
> > > (since there remains the need to deal with scheduler distortion/disruption).
> > > The bare-metal "inner" algorithm should not need to depend on such shenanigans
> > > but simply be able to carry out its task unaffected,
> > > where IRQs are simply always left enabled
> > > (or at least potentially disabled by other kernel components only)
> > > and the code then elegantly/inherently deals with IRQ complications.
> > 
> > I'm not sure I understand your opinion correctly. If my response is
> > wrong, please let me know your thought more correctly.
> 
> I have to admit that I was talking general implementation guidelines
> rather than having a very close look at what can be done *here*.
> Put differently, my goal at this point was just to state weird questions,
> for people to then reason about the bigger picture
> and come to possibly brilliant conclusions ;)
> 
> 
> > IRQ manipulation is done for synchronization of array cache, not for
> > freezing allocation context. That is just side-effect. As Christoph
> > said, slab operation could be executed in interrupt context so we
> > should protect array cache even if we are in process context.
> 
> Ah, that clarifies collision areas, thanks!
> 
> In general the goal likely should be
> to attempt to do as many things "in parallel" (guaranteed-collision-free,
> via cleanly instance-separate data) as possible,
> and then once the result has been calculated,
> do quick/short updating (insertion/deletion)
> of the shared/contended resource (here: array cache)
> with the necessary protection (IRQs disabled, in this case).
> Via some layering tricks, one could manage to do all the calculation handling
> without even drawing in any "IRQ management" dependency
> to that inner code part ("code file"?),
> but the array cache management would then remain IRQ-affected
> (in those outer layers which know that they are unfortunately still drawn down
> by an annoying dependency on IRQ shenanigans).
> Or, in other words: achieve a clean separation of layers
> so that it's obvious which ones get hampered by annoying dependencies
> and which ones are able to carry out their job unaffected.

Hello,

Thanks for good explanation.
I understand your suggestion and it sounds good. However, most of
inner operations in SLAB are also affected by IRQ because they aim at
updating array cache. :) If operation is done only in signle-context,
more optimization may be possible, but, not yet possible.

Anyway, your suggestion makes me relaize a big picture to go. I will
keep this in mind. Thank you.

> 
> 
> And for the "disruption via interrupt context" part:
> we currently seem to have "synchronous" handling,
> where one needs to forcefully block access to a shared/contended resource
> (which would be done
> either via annoying mutex contention,
> or even via wholesale blocking of IRQ execution).
> To remove/reduce friction,
> one should either remove any additional foreign-context access to that resource,
> thereby making this implementation cleanly (if woefully)
> single-context only (as Christoph seems to be intending,
> by removing SLAB access from IRQ handlers?),
> or ideally implement fully parallel instance-separate data management
> (but I guess that's not what one would want
> for a global multi-sized-area and thus fragmentation-avoiding SL*B allocator,
> since that would mean splitting global memory resources
> into per-instance areas?).
> 
> Alternatively, one could go for "asynchronous" handling,
> where SL*B updates of any foreign-context (but not main-context)
> are *delayed*,
> by merely queuing them into a simple submission queue
> which then will be delay-applied by main-context
> either once main-context enters a certain "quiet" state (e.g. context switch?),
> or once main-context needs to actively take into account
> these effects of foreign-context registrations.
> But for an allocator (memory manager),
> such "asynchronous" handling might be completely impossible anyway,
> since it might need to have a consistent global view,
> of all resources at all times
> (--> we're back to having a synchronous handling requirement,
> via lock contention).

While implementing this patchset, I also thought about having another
queue for foreign contexts, but didn't go in that direction since it
needs complicated code and I couldn't easily estimate the performance
trade-off of such an implementation.
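
For what it's worth, one way such a foreign-context queue might look is a
lockless per-cpu llist that interrupt-context frees push onto and that the
owning context drains later. This is only a sketch of the trade-off being
discussed; none of these names exist in mm/slab.c, and free_one_locally()
is a placeholder for whatever the real local free path would be:

    #include <linux/llist.h>
    #include <linux/percpu.h>

    static DEFINE_PER_CPU(struct llist_head, slab_deferred_frees);

    /*
     * Called from foreign (e.g. interrupt) context instead of touching the
     * array cache directly; the freed object's own memory holds the
     * llist_node, just as slab freelists reuse object memory.
     */
    static void defer_foreign_free(void *objp)
    {
            llist_add((struct llist_node *)objp,
                      this_cpu_ptr(&slab_deferred_frees));
    }

    /* Called by the owning context at a convenient, quiet point. */
    static void drain_deferred_frees(struct kmem_cache *cachep)
    {
            struct llist_node *node, *next;

            node = llist_del_all(this_cpu_ptr(&slab_deferred_frees));
            while (node) {
                    next = node->next;
                    free_one_locally(cachep, node);  /* placeholder hook */
                    node = next;
            }
    }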

> 
> 
> > > These thoughts also mean that I'm unsure (difficult to determine)
> > > of whether this change is good (i.e. a clean step in the right direction),
> > > or whether instead the implementation could easily directly be made
> > > fully independent from IRQ constraints.
> > 
> > Is there any way this change prevents further improvement?
> > I think we can still move in the right direction as soon as we find
> > a better solution, even if this change is merged.
> 
> Ah, I see you're skillfully asking the right forward progress question ;)
> 
> To which I'm currently limited to saying
> that from my side there are no objections to be added to the list
> (other than the minor mechanical parts directly stated in my review).

Okay.

Thanks.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/6] mm/slab: remove kmemleak_erase() call
  2015-01-05  1:37   ` Joonsoo Kim
@ 2015-01-08 12:01     ` Catalin Marinas
  -1 siblings, 0 replies; 34+ messages in thread
From: Catalin Marinas @ 2015-01-08 12:01 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	linux-mm, Linux Kernel Mailing List, Jesper Dangaard Brouer

On 5 January 2015 at 01:37, Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> We already call kmemleak_no_scan() in the initialization step of the
> array cache, so kmemleak doesn't scan the array cache. Therefore, we
> don't need to call kmemleak_erase() here.
>
> And since this call site is the last caller of kmemleak_erase(), remove
> the kmemleak_erase() definition completely.

Good point.

Acked-by: Catalin Marinas <catalin.marinas@arm.com>

-- 
Catalin
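
For context on the quoted point, a sketch of the initialization step the
commit message refers to: the whole array_cache is marked no-scan once, so
erasing individual slots on allocation buys nothing. This is paraphrased,
not the verbatim mm/slab.c code, and the exact function and placement may
differ:

    static void init_arraycache_sketch(struct array_cache *ac, int limit, int batch)
    {
            /*
             * The array cache keeps stale pointers to free objects, so
             * kmemleak must not scan it at all; otherwise those pointers
             * could be counted as valid references.
             */
            kmemleak_no_scan(ac);
            if (ac) {
                    ac->avail = 0;
                    ac->limit = limit;
                    ac->batchcount = batch;
                    ac->touched = 0;
            }
    }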

^ permalink raw reply	[flat|nested] 34+ messages in thread


end of thread, other threads:[~2015-01-08 12:02 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-05  1:37 [PATCH 0/6] mm/slab: optimize allocation fastpath Joonsoo Kim
2015-01-05  1:37 ` Joonsoo Kim
2015-01-05  1:37 ` [PATCH 1/6] mm/slab: fix gfp flags of percpu allocation at boot phase Joonsoo Kim
2015-01-05  1:37   ` Joonsoo Kim
2015-01-05  1:37 ` [PATCH 2/6] mm/slab: remove kmemleak_erase() call Joonsoo Kim
2015-01-05  1:37   ` Joonsoo Kim
2015-01-08 12:01   ` Catalin Marinas
2015-01-08 12:01     ` Catalin Marinas
2015-01-05  1:37 ` [PATCH 3/6] mm/slab: clean-up __ac_get_obj() to prepare future changes Joonsoo Kim
2015-01-05  1:37   ` Joonsoo Kim
2015-01-05  1:37 ` [PATCH 4/6] mm/slab: rearrange irq management Joonsoo Kim
2015-01-05  1:37   ` Joonsoo Kim
2015-01-05  1:37 ` [PATCH 5/6] mm/slab: cleanup ____cache_alloc() Joonsoo Kim
2015-01-05  1:37   ` Joonsoo Kim
2015-01-05  1:37 ` [PATCH 6/6] mm/slab: allocation fastpath without disabling irq Joonsoo Kim
2015-01-05  1:37   ` Joonsoo Kim
2015-01-05 15:28   ` Christoph Lameter
2015-01-05 15:28     ` Christoph Lameter
2015-01-06  1:04     ` Joonsoo Kim
2015-01-06  1:04       ` Joonsoo Kim
2015-01-05 17:21   ` Andreas Mohr
2015-01-05 17:21     ` Andreas Mohr
2015-01-05 17:52     ` Christoph Lameter
2015-01-05 17:52       ` Christoph Lameter
2015-01-06  1:31     ` Joonsoo Kim
2015-01-06  1:31       ` Joonsoo Kim
2015-01-06 10:34       ` Andreas Mohr
2015-01-06 10:34         ` Andreas Mohr
2015-01-06 15:33         ` Christoph Lameter
2015-01-06 15:33           ` Christoph Lameter
2015-01-06 16:26           ` Andreas Mohr
2015-01-06 16:26             ` Andreas Mohr
2015-01-08  7:54         ` Joonsoo Kim
2015-01-08  7:54           ` Joonsoo Kim
