* [PATCH 0/3] Slab allocator array operations V2
@ 2015-02-10 19:48 Christoph Lameter
  2015-02-10 19:48 ` [PATCH 1/3] Slab infrastructure for array operations Christoph Lameter
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Christoph Lameter @ 2015-02-10 19:48 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, penberg, iamjoonsoo, Jesper Dangaard Brouer

V1->V2:
- Allocator will determine how to acquire the objects. Remove
  the flags that we exposed to the subsystems in V1.
- Restructure patch a bit to minimize size
- Add material provided by Jesper.

Attached is a series of 3 patches that implement functionality to allocate
arrays of pointers to slab objects. The slab allocators can use this to
offer more optimized allocation and free paths.
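
For reference, the intended call pattern looks roughly like the sketch
below (illustration only, not part of the patches; the function name,
cache pointer and batch size of 16 are made up). kmem_cache_alloc_array()
may return fewer objects than requested, so only the returned count is
used and freed:

static void example_batch_user(struct kmem_cache *cache)
{
	void *objs[16];
	int nr, i;

	nr = kmem_cache_alloc_array(cache, GFP_KERNEL, 16, objs);
	for (i = 0; i < nr; i++) {
		/* ... use objs[i] ... */
	}
	kmem_cache_free_array(cache, nr, objs);
}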

Allocator performance issues were discovered by the network subsystem
developers when trying to get the kernel to send at line rate to
saturate a 40G link. Jesper developed special queueing methods
to compensate for the performance issues. See the following material:

LWN: Improving Linux networking performance
 - http://lwn.net/Articles/629155/
 - YouTube: https://www.youtube.com/watch?v=3XG9-X777Jo

LWN: Toward a more efficient slab allocator
 - http://lwn.net/Articles/629152/
 - YouTube: https://www.youtube.com/watch?v=s0lZzP1jOzI




* [PATCH 1/3] Slab infrastructure for array operations
  2015-02-10 19:48 [PATCH 0/3] Slab allocator array operations V2 Christoph Lameter
@ 2015-02-10 19:48 ` Christoph Lameter
  2015-02-10 22:43   ` Jesper Dangaard Brouer
  2015-02-10 23:58   ` David Rientjes
  2015-02-10 19:48 ` [PATCH 2/3] slub: Support " Christoph Lameter
  2015-02-10 19:48 ` [PATCH 3/3] Array alloc test code Christoph Lameter
  2 siblings, 2 replies; 26+ messages in thread
From: Christoph Lameter @ 2015-02-10 19:48 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, penberg, iamjoonsoo, Jesper Dangaard Brouer

[-- Attachment #1: array_alloc --]
[-- Type: text/plain, Size: 5198 bytes --]

This patch adds the basic infrastructure for alloc / free operations
on pointer arrays. It includes a fallback function that can perform
the array operations using the single alloc and free that every
slab allocator performs.

Allocators must define _HAVE_SLAB_ALLOCATOR_ARRAY_OPERATIONS in their
header files in order to implement their own fast versions of
these array operations.

Array operations allow a reduction of the processing overhead
during allocation and therefore speed up acquisition of larger
amounts of objects.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/include/linux/slab.h
===================================================================
--- linux.orig/include/linux/slab.h
+++ linux/include/linux/slab.h
@@ -123,6 +123,7 @@ struct kmem_cache *memcg_create_kmem_cac
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
 void kmem_cache_free(struct kmem_cache *, void *);
+void kmem_cache_free_array(struct kmem_cache *, size_t, void **);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
@@ -289,6 +290,8 @@ static __always_inline int kmalloc_index
 
 void *__kmalloc(size_t size, gfp_t flags);
 void *kmem_cache_alloc(struct kmem_cache *, gfp_t flags);
+int kmem_cache_alloc_array(struct kmem_cache *, gfp_t gfpflags,
+				size_t nr, void **);
 
 #ifdef CONFIG_NUMA
 void *__kmalloc_node(size_t size, gfp_t flags, int node);
Index: linux/mm/slab_common.c
===================================================================
--- linux.orig/mm/slab_common.c
+++ linux/mm/slab_common.c
@@ -105,6 +105,83 @@ static inline int kmem_cache_sanity_chec
 }
 #endif
 
+/*
+ * Fallback function that just calls kmem_cache_alloc
+ * for each element. This may be used if not all
+ * objects can be allocated or as a generic fallback
+ * if the allocator cannot support buik operations.
+ */
+int __kmem_cache_alloc_array(struct kmem_cache *s,
+		gfp_t flags, size_t nr, void **p)
+{
+	int i;
+
+	for (i = 0; i < nr; i++) {
+		void *x = kmem_cache_alloc(s, flags);
+
+		if (!x)
+			return i;
+		p[i] = x;
+	}
+	return i;
+}
+
+int kmem_cache_alloc_array(struct kmem_cache *s,
+		gfp_t flags, size_t nr, void **p)
+{
+	int i = 0;
+
+#ifdef _HAVE_SLAB_ALLOCATOR_ARRAY_OPERATIONS
+	/*
+	 * First extract objects from partial lists in order to
+	 * avoid further fragmentation.
+	 */
+	i += slab_array_alloc_from_partial(s, nr - i, p + i);
+
+	/*
+	 * If there are still a larger number of objects to be allocated
+	 * use the page allocator directly.
+	 */
+	if (nr - i > objects_per_slab_page(s))
+		i += slab_array_alloc_from_page_allocator(s,
+				flags, nr - i, p + i);
+
+	/* Get per cpu objects that may be available */
+	if (i < nr)
+		i += slab_array_alloc_from_local(s, nr - i, p + i);
+
+#endif
+	/*
+	 * If a fully filled array has been requested then fill it
+	 * up if there are objects missing using the regular kmem_cache_alloc()
+	 */
+	if (i < nr)
+		i += __kmem_cache_alloc_array(s, flags, nr - i, p + i);
+
+	return i;
+}
+EXPORT_SYMBOL(kmem_cache_alloc_array);
+
+/*
+ * Fallback function for objects that an allocator does not want
+ * to deal with or for allocators that do not support bulk operations.
+ */
+void __kmem_cache_free_array(struct kmem_cache *s, size_t nr, void **p)
+{
+	int i;
+
+	for (i = 0; i < nr; i++)
+		kmem_cache_free(s, p[i]);
+}
+
+#ifndef _HAVE_SLAB_ALLOCATOR_ARRAY_OPERATIONS
+void kmem_cache_free_array(struct kmem_cache *s, size_t nr, void **p)
+{
+	__kmem_cache_free_array(s, nr, p);
+}
+EXPORT_SYMBOL(kmem_cache_free_array);
+#endif
+
 #ifdef CONFIG_MEMCG_KMEM
 static int memcg_alloc_cache_params(struct mem_cgroup *memcg,
 		struct kmem_cache *s, struct kmem_cache *root_cache)
Index: linux/mm/slab.h
===================================================================
--- linux.orig/mm/slab.h
+++ linux/mm/slab.h
@@ -69,6 +69,9 @@ extern struct kmem_cache *kmem_cache;
 unsigned long calculate_alignment(unsigned long flags,
 		unsigned long align, unsigned long size);
 
+/* Determine the number of objects per slab page */
+unsigned objects_per_slab_page(struct kmem_cache *);
+
 #ifndef CONFIG_SLOB
 /* Kmalloc array related functions */
 void create_kmalloc_caches(unsigned long);
@@ -362,4 +365,12 @@ void *slab_next(struct seq_file *m, void
 void slab_stop(struct seq_file *m, void *p);
 int memcg_slab_show(struct seq_file *m, void *p);
 
+void __kmem_cache_free_array(struct kmem_cache *s, int nr, void **p);
+void __kmem_cache_alloc_array(struct kmem_cache *s, gfp_t flags, int nr, void **p);
+
+int slab_array_alloc_from_partial(struct kmem_cache *s, size_t nr, void **p);
+int slab_array_alloc_from_local(struct kmem_cache *s, size_t nr, void **p);
+int slab_array_alloc_from_page_allocator(struct kmem_cache *s, gfp_t flags,
+					size_t nr, void **p);
+
 #endif /* MM_SLAB_H */
Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c
+++ linux/mm/slub.c
@@ -332,6 +332,11 @@ static inline int oo_objects(struct kmem
 	return x.x & OO_MASK;
 }
 
+unsigned objects_per_slab_page(struct kmem_cache *s)
+{
+	return oo_objects(s->oo);
+}
+
 /*
  * Per slab locking using the pagelock
  */



* [PATCH 2/3] slub: Support for array operations
  2015-02-10 19:48 [PATCH 0/3] Slab allocator array operations V2 Christoph Lameter
  2015-02-10 19:48 ` [PATCH 1/3] Slab infrastructure for array operations Christoph Lameter
@ 2015-02-10 19:48 ` Christoph Lameter
  2015-02-11  4:48   ` Jesper Dangaard Brouer
  2015-02-13  2:45   ` Joonsoo Kim
  2015-02-10 19:48 ` [PATCH 3/3] Array alloc test code Christoph Lameter
  2 siblings, 2 replies; 26+ messages in thread
From: Christoph Lameter @ 2015-02-10 19:48 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, penberg, iamjoonsoo, Jesper Dangaard Brouer

[-- Attachment #1: array_alloc_slub --]
[-- Type: text/plain, Size: 5154 bytes --]

The major portions are there but there is no support yet for
directly allocating per cpu objects. There could also be more
sophisticated code to exploit the batch freeing.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/include/linux/slub_def.h
===================================================================
--- linux.orig/include/linux/slub_def.h
+++ linux/include/linux/slub_def.h
@@ -110,4 +110,5 @@ static inline void sysfs_slab_remove(str
 }
 #endif
 
+#define _HAVE_SLAB_ALLOCATOR_ARRAY_OPERATIONS
 #endif /* _LINUX_SLUB_DEF_H */
Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c
+++ linux/mm/slub.c
@@ -1379,13 +1379,9 @@ static void setup_object(struct kmem_cac
 		s->ctor(object);
 }
 
-static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *__new_slab(struct kmem_cache *s, gfp_t flags, int node)
 {
 	struct page *page;
-	void *start;
-	void *p;
-	int order;
-	int idx;
 
 	if (unlikely(flags & GFP_SLAB_BUG_MASK)) {
 		pr_emerg("gfp: %u\n", flags & GFP_SLAB_BUG_MASK);
@@ -1394,33 +1390,42 @@ static struct page *new_slab(struct kmem
 
 	page = allocate_slab(s,
 		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
-	if (!page)
-		goto out;
+	if (page) {
+		inc_slabs_node(s, page_to_nid(page), page->objects);
+		page->slab_cache = s;
+		__SetPageSlab(page);
+		if (page->pfmemalloc)
+			SetPageSlabPfmemalloc(page);
+	}
 
-	order = compound_order(page);
-	inc_slabs_node(s, page_to_nid(page), page->objects);
-	page->slab_cache = s;
-	__SetPageSlab(page);
-	if (page->pfmemalloc)
-		SetPageSlabPfmemalloc(page);
-
-	start = page_address(page);
-
-	if (unlikely(s->flags & SLAB_POISON))
-		memset(start, POISON_INUSE, PAGE_SIZE << order);
-
-	for_each_object_idx(p, idx, s, start, page->objects) {
-		setup_object(s, page, p);
-		if (likely(idx < page->objects))
-			set_freepointer(s, p, p + s->size);
-		else
-			set_freepointer(s, p, NULL);
-	}
-
-	page->freelist = start;
-	page->inuse = page->objects;
-	page->frozen = 1;
-out:
+	return page;
+}
+
+static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+{
+	struct page *page = __new_slab(s, flags, node);
+
+	if (page) {
+		void *p;
+		int idx;
+		void *start = page_address(page);
+
+		if (unlikely(s->flags & SLAB_POISON))
+			memset(start, POISON_INUSE,
+				PAGE_SIZE << compound_order(page));
+
+		for_each_object_idx(p, idx, s, start, page->objects) {
+			setup_object(s, page, p);
+			if (likely(idx < page->objects))
+				set_freepointer(s, p, p + s->size);
+			else
+				set_freepointer(s, p, NULL);
+		}
+
+		page->freelist = start;
+		page->inuse = page->objects;
+		page->frozen = 1;
+	}
 	return page;
 }
 
@@ -2516,8 +2521,78 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trac
 #endif
 #endif
 
+int slab_array_alloc_from_partial(struct kmem_cache *s,
+			size_t nr, void **p)
+{
+	void **end = p + nr;
+	struct kmem_cache_node *n = get_node(s, numa_mem_id());
+	int allocated = 0;
+	unsigned long flags;
+	struct page *page, *page2;
+
+	if (!n->nr_partial)
+		return 0;
+
+
+	spin_lock_irqsave(&n->list_lock, flags);
+	list_for_each_entry_safe(page, page2, &n->partial, lru) {
+		void *freelist;
+
+		if (page->objects - page->inuse > end - p)
+			/* More objects free in page than we want */
+			break;
+		list_del(&page->lru);
+		slab_lock(page);
+		freelist = page->freelist;
+		page->inuse = page->objects;
+		page->freelist = NULL;
+		slab_unlock(page);
+		/* Grab all available objects */
+		while (freelist) {
+			*p++ = freelist;
+			freelist = get_freepointer(s, freelist);
+			allocated++;
+		}
+	}
+	spin_unlock_irqrestore(&n->list_lock, flags);
+	return allocated;
+}
+
+int slab_array_alloc_from_page_allocator(struct kmem_cache *s,
+		gfp_t flags, size_t nr, void **p)
+{
+	void **end = p + nr;
+	int allocated = 0;
+
+	while (end - p >= oo_objects(s->oo)) {
+		struct page *page = __new_slab(s, flags, NUMA_NO_NODE);
+		void *q = page_address(page);
+		int i;
+
+		/* Use all the objects */
+		for (i = 0; i < page->objects; i++) {
+			setup_object(s, page, q);
+			*p++ = q;
+			q += s->size;
+		}
+
+		page->inuse = page->objects;
+		page->freelist = NULL;
+		allocated += page->objects;
+	}
+	return allocated;
+}
+
+int slab_array_alloc_from_local(struct kmem_cache *s,
+		size_t nr, void **p)
+{
+	/* Go for the per cpu partials list first */
+	/* Use the cpu_slab if objects are still needed */
+	return 0;
+}
+
 /*
- * Slow patch handling. This may still be called frequently since objects
+ * Slow path handling. This may still be called frequently since objects
  * have a longer lifetime than the cpu slabs in most processing loads.
  *
  * So we still attempt to reduce cache line usage. Just take the slab
@@ -2637,6 +2712,14 @@ slab_empty:
 	discard_slab(s, page);
 }
 
+void kmem_cache_free_array(struct kmem_cache *s, size_t nr, void **p)
+{
+	void **end = p + nr;
+
+	for ( ; p < end; p++)
+		__slab_free(s, virt_to_head_page(*p), *p, 0);
+}
+
 /*
  * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
  * can perform fastpath freeing without additional function calls.



* [PATCH 3/3] Array alloc test code
  2015-02-10 19:48 [PATCH 0/3] Slab allocator array operations V2 Christoph Lameter
  2015-02-10 19:48 ` [PATCH 1/3] Slab infrastructure for array operations Christoph Lameter
  2015-02-10 19:48 ` [PATCH 2/3] slub: Support " Christoph Lameter
@ 2015-02-10 19:48 ` Christoph Lameter
  2 siblings, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2015-02-10 19:48 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, linux-mm, penberg, iamjoonsoo, Jesper Dangaard Brouer

[-- Attachment #1: array_alloc_test --]
[-- Type: text/plain, Size: 991 bytes --]

A simple bit of test code, just thrown in, that allocates 100 objects and
frees them again.

It spews out complaints about interrupts being disabled since we are in an
initcall, but it shows that the code works.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c
+++ linux/mm/slub.c
@@ -5308,6 +5308,22 @@ static int __init slab_sysfs_init(void)
 
 	mutex_unlock(&slab_mutex);
 	resiliency_test();
+
+	/* Test array alloc */
+	{
+		void *arr[100];
+		int nr;
+
+		printk(KERN_INFO "Array allocation test\n");
+		printk(KERN_INFO "---------------------\n");
+		printk(KERN_INFO "Allocating 100 objects\n");
+		nr = kmem_cache_alloc_array(kmem_cache_node, GFP_KERNEL, 100, arr);
+		printk(KERN_INFO "Number allocated = %d\n", nr);
+		printk(KERN_INFO "Freeing the objects\n");
+		kmem_cache_free_array(kmem_cache_node, 100, arr);
+		printk(KERN_INFO "Array allocation test done.\n");
+	}
+
 	return 0;
 }
 



* Re: [PATCH 1/3] Slab infrastructure for array operations
  2015-02-10 19:48 ` [PATCH 1/3] Slab infrastructure for array operations Christoph Lameter
@ 2015-02-10 22:43   ` Jesper Dangaard Brouer
  2015-02-10 23:58   ` David Rientjes
  1 sibling, 0 replies; 26+ messages in thread
From: Jesper Dangaard Brouer @ 2015-02-10 22:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, linux-mm, penberg, iamjoonsoo, brouer



On Tue, 10 Feb 2015 13:48:05 -0600 Christoph Lameter <cl@linux.com> wrote:
[...]
> Index: linux/mm/slab_common.c
> ===================================================================
> --- linux.orig/mm/slab_common.c
> +++ linux/mm/slab_common.c
> @@ -105,6 +105,83 @@ static inline int kmem_cache_sanity_chec
>  }
>  #endif
>  
> +/*
> + * Fallback function that just calls kmem_cache_alloc
> + * for each element. This may be used if not all
> + * objects can be allocated or as a generic fallback
> + * if the allocator cannot support buik operations.
                                      ^^^^
Minor typo "buik" -> "bulk"

> + */
> +int __kmem_cache_alloc_array(struct kmem_cache *s,
> +		gfp_t flags, size_t nr, void **p)
> +{
[...]

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH 1/3] Slab infrastructure for array operations
  2015-02-10 19:48 ` [PATCH 1/3] Slab infrastructure for array operations Christoph Lameter
  2015-02-10 22:43   ` Jesper Dangaard Brouer
@ 2015-02-10 23:58   ` David Rientjes
  2015-02-11 18:47     ` Christoph Lameter
  1 sibling, 1 reply; 26+ messages in thread
From: David Rientjes @ 2015-02-10 23:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, linux-mm, penberg, iamjoonsoo,
	Jesper Dangaard Brouer

On Tue, 10 Feb 2015, Christoph Lameter wrote:

> This patch adds the basic infrastructure for alloc / free operations
> on pointer arrays. It includes a fallback function that can perform
> the array operations using the single alloc and free that every
> slab allocator performs.
> 
> Allocators must define _HAVE_SLAB_ALLOCATOR_ARRAY_OPERATIONS in their
> header files in order to implement their own fast versions of
> these array operations.
> 
> Array operations allow a reduction of the processing overhead
> during allocation and therefore speed up acquisition of larger
> amounts of objects.
> 

This doesn't apply to -mm because of commits f707780a2121 ("slab: embed 
memcg_cache_params to kmem_cache") and 0d48f42820db ("memcg: free 
memcg_caches slot on css offline"), but it should be trivial to resolve.

> Signed-off-by: Christoph Lameter <cl@linux.com>
> 
> Index: linux/include/linux/slab.h
> ===================================================================
> --- linux.orig/include/linux/slab.h
> +++ linux/include/linux/slab.h
> @@ -123,6 +123,7 @@ struct kmem_cache *memcg_create_kmem_cac
>  void kmem_cache_destroy(struct kmem_cache *);
>  int kmem_cache_shrink(struct kmem_cache *);
>  void kmem_cache_free(struct kmem_cache *, void *);
> +void kmem_cache_free_array(struct kmem_cache *, size_t, void **);
>  
>  /*
>   * Please use this macro to create slab caches. Simply specify the
> @@ -289,6 +290,8 @@ static __always_inline int kmalloc_index
>  
>  void *__kmalloc(size_t size, gfp_t flags);
>  void *kmem_cache_alloc(struct kmem_cache *, gfp_t flags);
> +int kmem_cache_alloc_array(struct kmem_cache *, gfp_t gfpflags,
> +				size_t nr, void **);
>  
>  #ifdef CONFIG_NUMA
>  void *__kmalloc_node(size_t size, gfp_t flags, int node);
> Index: linux/mm/slab_common.c
> ===================================================================
> --- linux.orig/mm/slab_common.c
> +++ linux/mm/slab_common.c
> @@ -105,6 +105,83 @@ static inline int kmem_cache_sanity_chec
>  }
>  #endif
>  
> +/*
> + * Fallback function that just calls kmem_cache_alloc
> + * for each element. This may be used if not all
> + * objects can be allocated or as a generic fallback
> + * if the allocator cannot support buik operations.
> + */
> +int __kmem_cache_alloc_array(struct kmem_cache *s,
> +		gfp_t flags, size_t nr, void **p)
> +{
> +	int i;
> +
> +	for (i = 0; i < nr; i++) {
> +		void *x = kmem_cache_alloc(s, flags);
> +
> +		if (!x)
> +			return i;
> +		p[i] = x;
> +	}
> +	return i;
> +}

If size_t is unsigned long and i is int and i overflows then bad things 
happen.  I don't expect that we'll have any callers that have such large 
values of nr, but it shouldn't index negatively into an array.
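
A minimal sketch of the kind of fix being asked for, i.e. just using an
unsigned index (whether the int return type should be widened as well is
a separate question):

int __kmem_cache_alloc_array(struct kmem_cache *s,
		gfp_t flags, size_t nr, void **p)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		void *x = kmem_cache_alloc(s, flags);

		if (!x)
			return i;
		p[i] = x;
	}
	return i;
}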

> +
> +int kmem_cache_alloc_array(struct kmem_cache *s,
> +		gfp_t flags, size_t nr, void **p)
> +{
> +	int i = 0;
> +
> +#ifdef _HAVE_SLAB_ALLOCATOR_ARRAY_OPERATIONS
> +	/*
> +	 * First extract objects from partial lists in order to
> +	 * avoid further fragmentation.
> +	 */
> +	i += slab_array_alloc_from_partial(s, nr - i, p + i);
> +
> +	/*
> +	 * If there are still a larger number of objects to be allocated
> +	 * use the page allocator directly.
> +	 */
> +	if (nr - i > objects_per_slab_page(s))
> +		i += slab_array_alloc_from_page_allocator(s,
> +				flags, nr - i, p + i);
> +
> +	/* Get per cpu objects that may be available */
> +	if (i < nr)
> +		i += slab_array_alloc_from_local(s, nr - i, p + i);
> +
> +#endif

This patch is referencing functions that don't exist and can do so since 
it's not compiled, but I think this belongs in the next patch.  I also 
think that this particular implementation may be slub-specific so I would 
have expected just a call to an allocator-defined
__kmem_cache_alloc_array() here with i = __kmem_cache_alloc_array().

If that's done, then slab and slob can just define dummy inline
__kmem_cache_alloc_array() functions that
return 0 instead of using _HAVE_SLAB_ALLOCATOR_ARRAY_OPERATIONS at all.

> +	/*
> +	 * If a fully filled array has been requested then fill it
> +	 * up if there are objects missing using the regular kmem_cache_alloc()
> +	 */
> +	if (i < nr)
> +		i += __kmem_cache_alloc_array(s, flags, nr - i, p + i);
> +
> +	return i;
> +}
> +EXPORT_SYMBOL(kmem_cache_alloc_array);
> +
> +/*
> + * Fallback function for objects that an allocator does not want
> + * to deal with or for allocators that do not support bulk operations.
> + */
> +void __kmem_cache_free_array(struct kmem_cache *s, size_t nr, void **p)
> +{
> +	int i;
> +
> +	for (i = 0; i < nr; i++)
> +		kmem_cache_free(s, p[i]);
> +}
> +
> +#ifndef _HAVE_SLAB_ALLOCATOR_ARRAY_OPERATIONS
> +void kmem_cache_free_array(struct kmem_cache *s, size_t nr, void **p)
> +{
> +	__kmem_cache_free_array(s, nr, p);
> +}
> +EXPORT_SYMBOL(kmem_cache_free_array);
> +#endif
> +

Hmm, not sure why the allocator would be required to do the 
EXPORT_SYMBOL() if it defines kmem_cache_free_array() itself.  This 
becomes simpler if you remove _HAVE_SLAB_ALLOCATOR_ARRAY_OPERATIONS 
entirely and just have slab and slob do __kmem_cache_free_array() 
behavior.
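
Concretely, with the macro gone and every allocator supplying its own
__kmem_cache_alloc_array()/__kmem_cache_free_array(), the core could look
something like the sketch below (illustration of the suggestion only, not
an existing implementation; the fallback fill loop is written out inline):

/* mm/slab_common.c */
int kmem_cache_alloc_array(struct kmem_cache *s, gfp_t flags,
		size_t nr, void **p)
{
	size_t i;

	/* Allocator specific bulk path; may return fewer than nr objects */
	i = __kmem_cache_alloc_array(s, flags, nr, p);

	/* Fill the rest using the regular single object allocation */
	while (i < nr) {
		void *x = kmem_cache_alloc(s, flags);

		if (!x)
			break;
		p[i++] = x;
	}
	return i;
}
EXPORT_SYMBOL(kmem_cache_alloc_array);

void kmem_cache_free_array(struct kmem_cache *s, size_t nr, void **p)
{
	__kmem_cache_free_array(s, nr, p);
}
EXPORT_SYMBOL(kmem_cache_free_array);

/* mm/slob.c (and mm/slab.c until it grows a bulk path) */
size_t __kmem_cache_alloc_array(struct kmem_cache *s, gfp_t flags,
		size_t nr, void **p)
{
	return 0;	/* no bulk fast path, the core falls back */
}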

>  #ifdef CONFIG_MEMCG_KMEM
>  static int memcg_alloc_cache_params(struct mem_cgroup *memcg,
>  		struct kmem_cache *s, struct kmem_cache *root_cache)
> Index: linux/mm/slab.h
> ===================================================================
> --- linux.orig/mm/slab.h
> +++ linux/mm/slab.h
> @@ -69,6 +69,9 @@ extern struct kmem_cache *kmem_cache;
>  unsigned long calculate_alignment(unsigned long flags,
>  		unsigned long align, unsigned long size);
>  
> +/* Determine the number of objects per slab page */
> +unsigned objects_per_slab_page(struct kmem_cache *);

Seems like it should be in the next patch.

> +
>  #ifndef CONFIG_SLOB
>  /* Kmalloc array related functions */
>  void create_kmalloc_caches(unsigned long);
> @@ -362,4 +365,12 @@ void *slab_next(struct seq_file *m, void
>  void slab_stop(struct seq_file *m, void *p);
>  int memcg_slab_show(struct seq_file *m, void *p);
>  
> +void __kmem_cache_free_array(struct kmem_cache *s, int nr, void **p);
> +void __kmem_cache_alloc_array(struct kmem_cache *s, gfp_t flags, int nr, void **p);

Longer than 80 chars.

> +
> +int slab_array_alloc_from_partial(struct kmem_cache *s, size_t nr, void **p);
> +int slab_array_alloc_from_local(struct kmem_cache *s, size_t nr, void **p);
> +int slab_array_alloc_from_page_allocator(struct kmem_cache *s, gfp_t flags,
> +					size_t nr, void **p);
> +
>  #endif /* MM_SLAB_H */
> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c
> +++ linux/mm/slub.c
> @@ -332,6 +332,11 @@ static inline int oo_objects(struct kmem
>  	return x.x & OO_MASK;
>  }
>  
> +unsigned objects_per_slab_page(struct kmem_cache *s)
> +{
> +	return oo_objects(s->oo);
> +}
> +
>  /*
>   * Per slab locking using the pagelock
>   */


* Re: [PATCH 2/3] slub: Support for array operations
  2015-02-10 19:48 ` [PATCH 2/3] slub: Support " Christoph Lameter
@ 2015-02-11  4:48   ` Jesper Dangaard Brouer
  2015-02-11 19:07     ` Christoph Lameter
  2015-02-13  2:45   ` Joonsoo Kim
  1 sibling, 1 reply; 26+ messages in thread
From: Jesper Dangaard Brouer @ 2015-02-11  4:48 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, linux-mm, penberg, iamjoonsoo, brouer


On Tue, 10 Feb 2015 13:48:06 -0600 Christoph Lameter <cl@linux.com> wrote:

> The major portions are there but there is no support yet for
> directly allocating per cpu objects. There could also be more
> sophisticated code to exploit the batch freeing.
> 
> Signed-off-by: Christoph Lameter <cl@linux.com>
> 
[...]
> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c
> +++ linux/mm/slub.c
[...]
> @@ -2516,8 +2521,78 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trac
>  #endif
>  #endif
>  
> +int slab_array_alloc_from_partial(struct kmem_cache *s,
> +			size_t nr, void **p)
> +{
> +	void **end = p + nr;
> +	struct kmem_cache_node *n = get_node(s, numa_mem_id());
> +	int allocated = 0;
> +	unsigned long flags;
> +	struct page *page, *page2;
> +
> +	if (!n->nr_partial)
> +		return 0;
> +
> +
> +	spin_lock_irqsave(&n->list_lock, flags);

This is quite an expensive lock with irqsave.


> +	list_for_each_entry_safe(page, page2, &n->partial, lru) {
> +		void *freelist;
> +
> +		if (page->objects - page->inuse > end - p)
> +			/* More objects free in page than we want */
> +			break;
> +		list_del(&page->lru);
> +		slab_lock(page);

Yet another lock cost.

> +		freelist = page->freelist;
> +		page->inuse = page->objects;
> +		page->freelist = NULL;
> +		slab_unlock(page);
> +		/* Grab all available objects */
> +		while (freelist) {
> +			*p++ = freelist;
> +			freelist = get_freepointer(s, freelist);
> +			allocated++;
> +		}
> +	}
> +	spin_unlock_irqrestore(&n->list_lock, flags);
> +	return allocated;

I estimate (on my CPU) the locking cost itself is more than 32ns, plus
the irqsave (which I've also found quite expensive, alone 14ns).  Thus,
estimated 46ns.  Single elem slub fast path cost is 18-19ns. Thus 3-4
elem bulking should be enough to amortize the cost, guess we are still
good :-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH 1/3] Slab infrastructure for array operations
  2015-02-10 23:58   ` David Rientjes
@ 2015-02-11 18:47     ` Christoph Lameter
  2015-02-11 20:18       ` David Rientjes
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Lameter @ 2015-02-11 18:47 UTC (permalink / raw)
  To: David Rientjes
  Cc: akpm, linux-kernel, linux-mm, penberg, iamjoonsoo,
	Jesper Dangaard Brouer

On Tue, 10 Feb 2015, David Rientjes wrote:

> > +int kmem_cache_alloc_array(struct kmem_cache *s,
> > +		gfp_t flags, size_t nr, void **p)
> > +{
> > +	int i = 0;
> > +
> > +#ifdef _HAVE_SLAB_ALLOCATOR_ARRAY_OPERATIONS
> > +	/*

...

> > +		i += slab_array_alloc_from_local(s, nr - i, p + i);
> > +
> > +#endif
>
> This patch is referencing functions that don't exist and can do so since
> it's not compiled, but I think this belongs in the next patch.  I also
> think that this particular implementation may be slub-specific so I would
> have expected just a call to an allocator-defined
> __kmem_cache_alloc_array() here with i = __kmem_cache_alloc_array().

The implementation is generic and can be used in the same way for SLAB.
SLOB does not have these types of object though.

> return 0 instead of using _HAVE_SLAB_ALLOCATOR_ARRAY_OPERATIONS at all.

Ok that is a good idea. I'll just drop that macro and have all allocators
provide dummy functions.

> > +#ifndef _HAVE_SLAB_ALLOCATOR_ARRAY_OPERATIONS
> > +void kmem_cache_free_array(struct kmem_cache *s, size_t nr, void **p)
> > +{
> > +	__kmem_cache_free_array(s, nr, p);
> > +}
> > +EXPORT_SYMBOL(kmem_cache_free_array);
> > +#endif
> > +
>
> Hmm, not sure why the allocator would be required to do the
> EXPORT_SYMBOL() if it defines kmem_cache_free_array() itself.  This

Keeping the EXPORT with the definition is the custom as far as I could
tell.


* Re: [PATCH 2/3] slub: Support for array operations
  2015-02-11  4:48   ` Jesper Dangaard Brouer
@ 2015-02-11 19:07     ` Christoph Lameter
  2015-02-11 21:43       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Lameter @ 2015-02-11 19:07 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: akpm, linux-kernel, linux-mm, penberg, iamjoonsoo

On Wed, 11 Feb 2015, Jesper Dangaard Brouer wrote:

> > +
> > +
> > +	spin_lock_irqsave(&n->list_lock, flags);
>
> This is quite an expensive lock with irqsave.

Yes but we take it for all partial pages.

> Yet another lock cost.

Yup the page access is shared but there is one per page. Contention is
unlikely.

> > +	spin_unlock_irqrestore(&n->list_lock, flags);
> > +	return allocated;
>
> I estimate (on my CPU) the locking cost itself is more than 32ns, plus
> the irqsave (which I've also found quite expensive, alone 14ns).  Thus,
> estimated 46ns.  Single elem slub fast path cost is 18-19ns. Thus 3-4
> elem bulking should be enough to amortized the cost, guess we are still
> good :-)

We can require that interrupts are off when the functions are called. Then
we can avoid the "save" part?



* Re: [PATCH 1/3] Slab infrastructure for array operations
  2015-02-11 18:47     ` Christoph Lameter
@ 2015-02-11 20:18       ` David Rientjes
  2015-02-11 22:04         ` Christoph Lameter
  2015-02-13  2:35         ` Joonsoo Kim
  0 siblings, 2 replies; 26+ messages in thread
From: David Rientjes @ 2015-02-11 20:18 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, linux-mm, penberg, iamjoonsoo,
	Jesper Dangaard Brouer

On Wed, 11 Feb 2015, Christoph Lameter wrote:

> > This patch is referencing functions that don't exist and can do so since
> > it's not compiled, but I think this belongs in the next patch.  I also
> > think that this particular implementation may be slub-specific so I would
> > have expected just a call to an allocator-defined
> > __kmem_cache_alloc_array() here with i = __kmem_cache_alloc_array().
> 
> The implementation is generic and can be used in the same way for SLAB.
> SLOB does not have these types of object though.
> 

Ok, I didn't know if the slab implementation would follow the same format 
with the same callbacks or whether this would need to be cleaned up later.  

> > return 0 instead of using _HAVE_SLAB_ALLOCATOR_ARRAY_OPERATIONS at all.
> 
> Ok that is a good idea. I'll just drop that macro and have all allocators
> provide dummy functions.
> 
> > > +#ifndef _HAVE_SLAB_ALLOCATOR_ARRAY_OPERATIONS
> > > +void kmem_cache_free_array(struct kmem_cache *s, size_t nr, void **p)
> > > +{
> > > +	__kmem_cache_free_array(s, nr, p);
> > > +}
> > > +EXPORT_SYMBOL(kmem_cache_free_array);
> > > +#endif
> > > +
> >
> > Hmm, not sure why the allocator would be required to do the
> > EXPORT_SYMBOL() if it defines kmem_cache_free_array() itself.  This
> 
> Keeping the EXPORT with the definition is the custom as far as I could
> tell.
> 

If you do dummy functions for all the allocators, then this should be as 
simple as unconditionally defining kmem_cache_free_array() and doing 
EXPORT_SYMBOL() here and then using your current implementation of 
__kmem_cache_free_array() for mm/slab.c.


* Re: [PATCH 2/3] slub: Support for array operations
  2015-02-11 19:07     ` Christoph Lameter
@ 2015-02-11 21:43       ` Jesper Dangaard Brouer
  2015-02-11 22:06         ` Christoph Lameter
  0 siblings, 1 reply; 26+ messages in thread
From: Jesper Dangaard Brouer @ 2015-02-11 21:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, linux-mm, penberg, iamjoonsoo, brouer

On Wed, 11 Feb 2015 13:07:24 -0600 (CST)
Christoph Lameter <cl@linux.com> wrote:

> On Wed, 11 Feb 2015, Jesper Dangaard Brouer wrote:
> 
> > > +
> > > +
> > > +	spin_lock_irqsave(&n->list_lock, flags);
> >
> > This is quite an expensive lock with irqsave.
> 
> Yes but we take it for all partial pages.

Sure, that is good, but this might be a contention point. In a micro
benchmark, this contention should be visible, but in real use-cases the
given subsystem also needs to spend time to use these elements before
requesting a new batch (as long as NIC cleanup cycles don't get too
synchronized)


> > Yet another lock cost.
> 
> Yup the page access is shared but there is one per page. Contention is
> unlikely.

Yes, contention is unlikely, but every atomic operation is expensive.
On my system the measured cost is 8ns, and a lock/unlock does two, thus
16ns.  Which we then do per page freelist.


> > > +	spin_unlock_irqrestore(&n->list_lock, flags);
> > > +	return allocated;
> >
> > I estimate (on my CPU) the locking cost itself is more than 32ns, plus
> > the irqsave (which I've also found quite expensive, alone 14ns).  Thus,
> > estimated 46ns.  Single elem slub fast path cost is 18-19ns. Thus 3-4
> > elem bulking should be enough to amortize the cost, guess we are still
> > good :-)
> 
> We can require that interrupts are off when the functions are called. Then
> we can avoid the "save" part?

Yes, we could also do so with an "_irqoff" variant of the func call,
but given we are defining the API we can just require this from the
start.

I plan to use this in softirq, where I know interrupts are on, but I
can use the less-expensive "non-save" variant local_irq_{disable,enable}.

Measurements show (x86_64 E5-2695):
 *  2.860 ns cost for local_irq_{disable,enable}
 * 14.840 ns cost for local_irq_save()+local_irq_restore()
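
To make that concrete, a softirq caller could then look something like
this (purely a sketch; the function name and batch size are made up, and
it assumes the array call is specified to be made with interrupts off):

/* NIC cleanup path running in softirq context, interrupts enabled */
static int refill_object_batch(struct kmem_cache *cache, void **batch,
			       int want)
{
	int nr;

	local_irq_disable();	/* cheap variant, no flags save/restore */
	nr = kmem_cache_alloc_array(cache, GFP_ATOMIC, want, batch);
	local_irq_enable();

	return nr;	/* number of objects actually obtained */
}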

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH 1/3] Slab infrastructure for array operations
  2015-02-11 20:18       ` David Rientjes
@ 2015-02-11 22:04         ` Christoph Lameter
  2015-02-12  0:35           ` David Rientjes
  2015-02-13  2:35         ` Joonsoo Kim
  1 sibling, 1 reply; 26+ messages in thread
From: Christoph Lameter @ 2015-02-11 22:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: akpm, linux-kernel, linux-mm, penberg, iamjoonsoo,
	Jesper Dangaard Brouer

On Wed, 11 Feb 2015, David Rientjes wrote:

> > >
> > > Hmm, not sure why the allocator would be required to do the
> > > EXPORT_SYMBOL() if it defines kmem_cache_free_array() itself.  This
> >
> > Keeping the EXPORT with the definition is the custom as far as I could
> > tell.
> >
>
> If you do dummy functions for all the allocators, then this should be as
> simple as unconditionally defining kmem_cache_free_array() and doing
> EXPORT_SYMBOL() here and then using your current implementation of
> __kmem_cache_free_array() for mm/slab.c.

That works if I put an EXPORT_SYMBOL in mm/slab_common.c and define the
function in mm/slub.c?




* Re: [PATCH 2/3] slub: Support for array operations
  2015-02-11 21:43       ` Jesper Dangaard Brouer
@ 2015-02-11 22:06         ` Christoph Lameter
  2015-02-12  0:16           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Lameter @ 2015-02-11 22:06 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: akpm, linux-kernel, linux-mm, penberg, iamjoonsoo

On Thu, 12 Feb 2015, Jesper Dangaard Brouer wrote:

> > > This is quite an expensive lock with irqsave.
> >
> > Yes but we take it for all partial pages.
>
> Sure, that is good, but this might be a contention point. In a micro
> benchmark, this contention should be visible, but in real use-cases the
> given subsystem also needs to spend time to use these elements before
> requesting a new batch (as long as NIC cleanup cycles don't get too
> synchronized)

Yes definitely it will be a contention point. There is no way around it.

> > Yup the page access is shared but there is one per page. Contention is
> > unlikely.
>
> Yes, contention is unlikely, but every atomic operation is expensive.
> On my system the measured cost is 8ns, and a lock/unlock does two, thus
> 16ns.  Which we then do per page freelist.

Not sure what we can do about this.

> > We can require that interrupts are off when the functions are called. Then
> > we can avoid the "save" part?
>
> Yes, we could also do so with an "_irqoff" variant of the func call,
> but given we are defining the API we can just require this from the
> start.

All right. Let's do that then.


* Re: [PATCH 2/3] slub: Support for array operations
  2015-02-11 22:06         ` Christoph Lameter
@ 2015-02-12  0:16           ` Jesper Dangaard Brouer
  2015-02-12  2:46             ` Christoph Lameter
  0 siblings, 1 reply; 26+ messages in thread
From: Jesper Dangaard Brouer @ 2015-02-12  0:16 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, linux-mm, penberg, iamjoonsoo, brouer

On Wed, 11 Feb 2015 16:06:50 -0600 (CST)
Christoph Lameter <cl@linux.com> wrote:

> On Thu, 12 Feb 2015, Jesper Dangaard Brouer wrote:
> 
> > > > This is quite an expensive lock with irqsave.
[...]
> > > We can require that interrupts are off when the functions are called. Then
> > > we can avoid the "save" part?
> >
> > Yes, we could also do so with an "_irqoff" variant of the func call,
> > but given we are defining the API we can just require this from the
> > start.
> 
> All right. Let's do that then.

Okay. Some measurements to guide this choice.

Measured on my laptop CPU i7-2620M CPU @ 2.70GHz:

 * 12.775 ns - "clean" spin_lock_unlock
 * 21.099 ns - irqsave variant spinlock
 * 22.808 ns - "manual" irqsave before spin_lock
 * 14.618 ns - "manual" local_irq_disable + spin_lock

Reproducible via my github repo:
 https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c

The clean spin_lock_unlock is 8.324 ns faster than the irqsave variant.
The irqsave variant is actually faster than expected, as the measurement
of an isolated local_irq_save_restore was 13.256 ns.

The difference to the "manual" irqsave is only 1.709 ns, which is approx
the cost of an extra function call.

If one can use the non-flags-save version of local_irq_disable, then one
can save 6.481 ns (on this specific CPU and kernel config 3.17.8-200.fc20.x86_64).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

https://github.com/netoptimizer/prototype-kernel/commit/1471ac60


* Re: [PATCH 1/3] Slab infrastructure for array operations
  2015-02-11 22:04         ` Christoph Lameter
@ 2015-02-12  0:35           ` David Rientjes
  0 siblings, 0 replies; 26+ messages in thread
From: David Rientjes @ 2015-02-12  0:35 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, linux-mm, penberg, iamjoonsoo,
	Jesper Dangaard Brouer

On Wed, 11 Feb 2015, Christoph Lameter wrote:

> > > > Hmm, not sure why the allocator would be required to do the
> > > > EXPORT_SYMBOL() if it defines kmem_cache_free_array() itself.  This
> > >
> > > Keeping the EXPORT with the definition is the custom as far as I could
> > > tell.
> > >
> >
> > If you do dummy functions for all the allocators, then this should be as
> > simple as unconditionally defining kmem_cache_free_array() and doing
> > EXPORT_SYMBOL() here and then using your current implementation of
> > __kmem_cache_free_array() for mm/slab.c.
> 
> That works if I put an EXPORT_SYMBOL in mm/slab_common.c and define the
> function in mm/slub.c?
> 

No, my suggestion was for the same pattern as kmem_cache_alloc_array().  
In other words, I think you should leave the definition of 
kmem_cache_free_array() the way it is in your patch, remove the #ifndef 
since _HAVE_SLAB_ALLOCATOR_ARRAY_OPERATIONS is going away, and then define 
a __kmem_cache_free_array() for each allocator.


* Re: [PATCH 2/3] slub: Support for array operations
  2015-02-12  0:16           ` Jesper Dangaard Brouer
@ 2015-02-12  2:46             ` Christoph Lameter
  0 siblings, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2015-02-12  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: akpm, linux-kernel, linux-mm, penberg, iamjoonsoo

On Thu, 12 Feb 2015, Jesper Dangaard Brouer wrote:

> Measured on my laptop CPU i7-2620M CPU @ 2.70GHz:
>
>  * 12.775 ns - "clean" spin_lock_unlock
>  * 21.099 ns - irqsave variant spinlock
>  * 22.808 ns - "manual" irqsave before spin_lock
>  * 14.618 ns - "manual" local_irq_disable + spin_lock
>
> Reproducible via my github repo:
>  https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c
>
> The clean spin_lock_unlock is 8.324 ns faster than the irqsave variant.
> The irqsave variant is actually faster than expected, as the measurement
> of an isolated local_irq_save_restore was 13.256 ns.

I am using spin_lock_irq() in the current version on my system. If the
performance of that is a problem then please optimize that function.


* Re: [PATCH 1/3] Slab infrastructure for array operations
  2015-02-11 20:18       ` David Rientjes
  2015-02-11 22:04         ` Christoph Lameter
@ 2015-02-13  2:35         ` Joonsoo Kim
  2015-02-13 15:47           ` Christoph Lameter
  1 sibling, 1 reply; 26+ messages in thread
From: Joonsoo Kim @ 2015-02-13  2:35 UTC (permalink / raw)
  To: David Rientjes
  Cc: Christoph Lameter, akpm, linux-kernel, linux-mm, penberg,
	iamjoonsoo, Jesper Dangaard Brouer

On Wed, Feb 11, 2015 at 12:18:07PM -0800, David Rientjes wrote:
> On Wed, 11 Feb 2015, Christoph Lameter wrote:
> 
> > > This patch is referencing functions that don't exist and can do so since
> > > it's not compiled, but I think this belongs in the next patch.  I also
> > > think that this particular implementation may be slub-specific so I would
> > > have expected just a call to an allocator-defined
> > > __kmem_cache_alloc_array() here with i = __kmem_cache_alloc_array().
> > 
> > The implementation is generic and can be used in the same way for SLAB.
> > SLOB does not have these types of object though.
> > 
> 
> Ok, I didn't know if the slab implementation would follow the same format 
> with the same callbacks or whether this would need to be cleaned up later.  

Hello, Christoph.

I also think that this implementation is slub-specific. For example,
in the slab case, it is always better to access the local cpu cache first
rather than the page allocator, since slab doesn't use a list to manage free
objects and there is no cache line overhead as in slub. I think that,
in kmem_cache_alloc_array(), just calling an allocator-defined
__kmem_cache_alloc_array() is a better approach.

Thanks.


* Re: [PATCH 2/3] slub: Support for array operations
  2015-02-10 19:48 ` [PATCH 2/3] slub: Support " Christoph Lameter
  2015-02-11  4:48   ` Jesper Dangaard Brouer
@ 2015-02-13  2:45   ` Joonsoo Kim
  2015-02-13 15:49     ` Christoph Lameter
  1 sibling, 1 reply; 26+ messages in thread
From: Joonsoo Kim @ 2015-02-13  2:45 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, linux-mm, penberg, iamjoonsoo,
	Jesper Dangaard Brouer

On Tue, Feb 10, 2015 at 01:48:06PM -0600, Christoph Lameter wrote:
> The major portions are there but there is no support yet for
> directly allocating per cpu objects. There could also be more
> sophisticated code to exploit the batch freeing.
> 
> Signed-off-by: Christoph Lameter <cl@linux.com>
> 
> Index: linux/include/linux/slub_def.h
> ===================================================================
> --- linux.orig/include/linux/slub_def.h
> +++ linux/include/linux/slub_def.h
> @@ -110,4 +110,5 @@ static inline void sysfs_slab_remove(str
>  }
>  #endif
>  
> +#define _HAVE_SLAB_ALLOCATOR_ARRAY_OPERATIONS
>  #endif /* _LINUX_SLUB_DEF_H */
> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c
> +++ linux/mm/slub.c
> @@ -1379,13 +1379,9 @@ static void setup_object(struct kmem_cac
>  		s->ctor(object);
>  }
>  
> -static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
> +static struct page *__new_slab(struct kmem_cache *s, gfp_t flags, int node)
>  {
>  	struct page *page;
> -	void *start;
> -	void *p;
> -	int order;
> -	int idx;
>  
>  	if (unlikely(flags & GFP_SLAB_BUG_MASK)) {
>  		pr_emerg("gfp: %u\n", flags & GFP_SLAB_BUG_MASK);
> @@ -1394,33 +1390,42 @@ static struct page *new_slab(struct kmem
>  
>  	page = allocate_slab(s,
>  		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
> -	if (!page)
> -		goto out;
> +	if (page) {
> +		inc_slabs_node(s, page_to_nid(page), page->objects);
> +		page->slab_cache = s;
> +		__SetPageSlab(page);
> +		if (page->pfmemalloc)
> +			SetPageSlabPfmemalloc(page);
> +	}
>  
> -	order = compound_order(page);
> -	inc_slabs_node(s, page_to_nid(page), page->objects);
> -	page->slab_cache = s;
> -	__SetPageSlab(page);
> -	if (page->pfmemalloc)
> -		SetPageSlabPfmemalloc(page);
> -
> -	start = page_address(page);
> -
> -	if (unlikely(s->flags & SLAB_POISON))
> -		memset(start, POISON_INUSE, PAGE_SIZE << order);
> -
> -	for_each_object_idx(p, idx, s, start, page->objects) {
> -		setup_object(s, page, p);
> -		if (likely(idx < page->objects))
> -			set_freepointer(s, p, p + s->size);
> -		else
> -			set_freepointer(s, p, NULL);
> -	}
> -
> -	page->freelist = start;
> -	page->inuse = page->objects;
> -	page->frozen = 1;
> -out:
> +	return page;
> +}
> +
> +static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
> +{
> +	struct page *page = __new_slab(s, flags, node);
> +
> +	if (page) {
> +		void *p;
> +		int idx;
> +		void *start = page_address(page);
> +
> +		if (unlikely(s->flags & SLAB_POISON))
> +			memset(start, POISON_INUSE,
> +				PAGE_SIZE << compound_order(page));

I'm not sure, but isn't this poisoning also needed for
slab_array_alloc_from_page_allocator()?

> +
> +		for_each_object_idx(p, idx, s, start, page->objects) {
> +			setup_object(s, page, p);
> +			if (likely(idx < page->objects))
> +				set_freepointer(s, p, p + s->size);
> +			else
> +				set_freepointer(s, p, NULL);
> +		}
> +
> +		page->freelist = start;
> +		page->inuse = page->objects;
> +		page->frozen = 1;
> +	}
>  	return page;
>  }
>  
> @@ -2516,8 +2521,78 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trac
>  #endif
>  #endif
>  
> +int slab_array_alloc_from_partial(struct kmem_cache *s,
> +			size_t nr, void **p)
> +{
> +	void **end = p + nr;
> +	struct kmem_cache_node *n = get_node(s, numa_mem_id());
> +	int allocated = 0;
> +	unsigned long flags;
> +	struct page *page, *page2;
> +
> +	if (!n->nr_partial)
> +		return 0;
> +
> +
> +	spin_lock_irqsave(&n->list_lock, flags);
> +	list_for_each_entry_safe(page, page2, &n->partial, lru) {
> +		void *freelist;
> +
> +		if (page->objects - page->inuse > end - p)
> +			/* More objects free in page than we want */
> +			break;
> +		list_del(&page->lru);
> +		slab_lock(page);

slab_lock() doesn't protect the freelist if CONFIG_HAVE_CMPXCHG_DOUBLE is
enabled. You should use the cmpxchg_double_slab() machinery instead.

And, a better solution is to use acquire_slab() rather than
re-implementing the freelist detach here.

> +		freelist = page->freelist;
> +		page->inuse = page->objects;
> +		page->freelist = NULL;
> +		slab_unlock(page);
> +		/* Grab all available objects */
> +		while (freelist) {
> +			*p++ = freelist;
> +			freelist = get_freepointer(s, freelist);
> +			allocated++;
> +		}

Fetching all objects while holding the node lock could result in enormous
lock contention. How about getting the free object pointers without holding
the node lock? We can temporarily store all the freelist heads in
array p and then fetch each object pointer without holding the node lock.
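
Roughly something like the sketch below, i.e. the same function but with
the freelist walk moved outside the node lock (the heads[] bound and
anything else not in the original patch is an arbitrary assumption):

int slab_array_alloc_from_partial(struct kmem_cache *s,
			size_t nr, void **p)
{
	void **end = p + nr;
	struct kmem_cache_node *n = get_node(s, numa_mem_id());
	void *heads[8];		/* arbitrary bound on slabs drained per call */
	int nr_heads = 0;
	int allocated = 0;
	unsigned long flags;
	struct page *page, *page2;

	if (!n->nr_partial)
		return 0;

	spin_lock_irqsave(&n->list_lock, flags);
	list_for_each_entry_safe(page, page2, &n->partial, lru) {
		int free = page->objects - page->inuse;

		if (nr_heads == ARRAY_SIZE(heads) || free > end - p - allocated)
			break;
		list_del(&page->lru);
		slab_lock(page);
		heads[nr_heads++] = page->freelist;	/* stash the head only */
		allocated += free;
		page->inuse = page->objects;
		page->freelist = NULL;
		slab_unlock(page);
	}
	spin_unlock_irqrestore(&n->list_lock, flags);

	/* Expand the detached freelists with the node lock dropped */
	while (nr_heads) {
		void *freelist = heads[--nr_heads];

		while (freelist) {
			*p++ = freelist;
			freelist = get_freepointer(s, freelist);
		}
	}
	return allocated;
}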

Thanks.


* Re: [PATCH 1/3] Slab infrastructure for array operations
  2015-02-13  2:35         ` Joonsoo Kim
@ 2015-02-13 15:47           ` Christoph Lameter
  2015-02-13 21:20             ` David Rientjes
  2015-02-17  5:15             ` Joonsoo Kim
  0 siblings, 2 replies; 26+ messages in thread
From: Christoph Lameter @ 2015-02-13 15:47 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: David Rientjes, akpm, linux-kernel, linux-mm, penberg,
	iamjoonsoo, Jesper Dangaard Brouer

On Fri, 13 Feb 2015, Joonsoo Kim wrote:
>
> I also think that this implementation is slub-specific. For example,
> in the slab case, it is always better to access the local cpu cache first
> rather than the page allocator, since slab doesn't use a list to manage free
> objects and there is no cache line overhead as in slub. I think that,
> in kmem_cache_alloc_array(), just calling an allocator-defined
> __kmem_cache_alloc_array() is a better approach.

What do you mean by "better"? Please be specific as to where you would see
a difference. And slab definitely manages free objects although
differently than slub. SLAB manages per cpu (local) objects, per node
partial lists etc. Same as SLUB. The cache line overhead is there but not
that big a difference in terms of choosing objects to get first.

For a large allocation it is beneficial for both allocators to first reduce
the list of partial allocated slab pages on a node.

Going to the local objects first is enticing since these are cache hot but
there are only a limited number of these available and there are issues
with acquiring a large number of objects. For SLAB the objects are dispersed
and not spatially local. For SLUB the number of objects is usually much
more limited than SLAB (but that is configurable these days via the cpu
partial pages). SLUB allocates spatially local objects from one page
before moving to the other. This is an advantage. However, it has to
traverse a linked list instead of an array (SLAB).




* Re: [PATCH 2/3] slub: Support for array operations
  2015-02-13  2:45   ` Joonsoo Kim
@ 2015-02-13 15:49     ` Christoph Lameter
  2015-02-17  5:26       ` Joonsoo Kim
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Lameter @ 2015-02-13 15:49 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: akpm, linux-kernel, linux-mm, penberg, iamjoonsoo,
	Jesper Dangaard Brouer

On Fri, 13 Feb 2015, Joonsoo Kim wrote:

> > +			*p++ = freelist;
> > +			freelist = get_freepointer(s, freelist);
> > +			allocated++;
> > +		}
>
> Fetching all objects with holding node lock could result in enomourous
> lock contention. How about getting free ojbect pointer without holding
> the node lock? We can temporarilly store all head of freelists in
> array p and can fetch each object pointer without holding node lock.


Could do that but let's first see if there is really an issue. The other
cpus sharing the same partial lists presumably have cpu local objects to get
through first before they hit this lock.



* Re: [PATCH 1/3] Slab infrastructure for array operations
  2015-02-13 15:47           ` Christoph Lameter
@ 2015-02-13 21:20             ` David Rientjes
  2015-02-17  5:15             ` Joonsoo Kim
  1 sibling, 0 replies; 26+ messages in thread
From: David Rientjes @ 2015-02-13 21:20 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Joonsoo Kim, akpm, linux-kernel, linux-mm, penberg, iamjoonsoo,
	Jesper Dangaard Brouer

On Fri, 13 Feb 2015, Christoph Lameter wrote:

> > I also think that this implementation is slub-specific. For example,
> > in the slab case, it is always better to access the local cpu cache first
> > rather than the page allocator, since slab doesn't use a list to manage free
> > objects and there is no cache line overhead as in slub. I think that,
> > in kmem_cache_alloc_array(), just calling an allocator-defined
> > __kmem_cache_alloc_array() is a better approach.
> 
> What do you mean by "better"? Please be specific as to where you would see
> a difference. And slab definitely manages free objects although
> differently than slub. SLAB manages per cpu (local) objects, per node
> partial lists etc. Same as SLUB. The cache line overhead is there but not
> that big a difference in terms of choosing objects to get first.
> 

I think because we currently lack a non-fallback implementation for slab 
that it may be premature to discuss what would be unified if such an 
implementation were to exist.  That unification can always happen later 
if/when the slab implementation is proposed, but I don't think we should 
be unifying an implementation that doesn't exist.  

In other words, I think it would be much cleaner to just define the 
generic array alloc and array free functions in mm/slab_common.c along 
with their EXPORT_SYMBOL()'s as simple callbacks to per-allocator 
__kmem_cache_{alloc,free}_array() implementations.  I think it's also 
better from a source code perspective to avoid reading two different 
functions and then realizing that nothing is actually unified between them 
(and the absence of an unnecessary #ifdef is currently helpful).


* Re: [PATCH 1/3] Slab infrastructure for array operations
  2015-02-13 15:47           ` Christoph Lameter
  2015-02-13 21:20             ` David Rientjes
@ 2015-02-17  5:15             ` Joonsoo Kim
  2015-02-17 16:03               ` Christoph Lameter
  1 sibling, 1 reply; 26+ messages in thread
From: Joonsoo Kim @ 2015-02-17  5:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Rientjes, akpm, linux-kernel, linux-mm, penberg,
	iamjoonsoo, Jesper Dangaard Brouer

On Fri, Feb 13, 2015 at 09:47:59AM -0600, Christoph Lameter wrote:
> On Fri, 13 Feb 2015, Joonsoo Kim wrote:
> >
> > I also think that this implementation is slub-specific. For example,
> > in the slab case, it is always better to access the local cpu cache first
> > rather than the page allocator, since slab doesn't use a list to manage free
> > objects and there is no cache line overhead as in slub. I think that,
> > in kmem_cache_alloc_array(), just calling an allocator-defined
> > __kmem_cache_alloc_array() is a better approach.
> 
> What do you mean by "better"? Please be specific as to where you would see
> a difference. And slab definitely manages free objects although
> differently than slub. SLAB manages per cpu (local) objects, per node
> partial lists etc. Same as SLUB. The cache line overhead is there but not
> that big a difference in terms of choosing objects to get first.
> 
> For a large allocation it is beneficial for both allocators to first reduce
> the list of partial allocated slab pages on a node.
> 
> Going to the local objects first is enticing since these are cache hot but
> there are only a limited number of these available and there are issues
> with acquiring a large number of objects. For SLAB the objects are dispersed
> and not spatially local. For SLUB the number of objects is usually much
> more limited than SLAB (but that is configurable these days via the cpu
> partial pages). SLUB allocates spatially local objects from one page
> before moving to the other. This is an advantage. However, it has to
> traverse a linked list instead of an array (SLAB).

Hello,

Hmm... so far, SLAB focuses on temporal locality rather than spatial locality,
as you know. Why does SLAB need to consider spatial locality first in this
kmem_cache_alloc_array() case?

And, although we use partial list first, we can't reduce
fragmentation as much as SLUB. Local cache may keep some free objects
of the partial slab so just exhausting free objects of partial slab doesn't
mean that there is no free object left. For SLUB, exhausting free
objects of partial slab means there is no free object left.

If we allocate objects from local cache as much as possible, we can
keep temporal locality and return objects as fast as possible since
returning objects from local cache just needs memcpy from local array
cache to destination array.

This cannot be implemented by using the kmem_cache_alloc_array() you
suggested, and this is why I think just calling an allocator-defined
__kmem_cache_alloc_array() is a better approach.

As David said, there is no implementation for SLAB yet, and we have
different opinions about what the SLAB implementation should look like.
It's better to delay the detailed implementation of
kmem_cache_alloc_array() until the SLAB implementation is agreed upon.
Until then, calling __kmem_cache_alloc_array() from
kmem_cache_alloc_array() is sufficient to provide the functionality.
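
In other words, for now the generic entry point could simply be the thin
dispatch sketched below (assuming __kmem_cache_alloc_array() takes the
same arguments as the kmem_cache_alloc_array() prototype from patch 1/3):

/* Generic layer: defer entirely to the active allocator's implementation. */
int kmem_cache_alloc_array(struct kmem_cache *s, gfp_t gfpflags,
				size_t nr, void **p)
{
	return __kmem_cache_alloc_array(s, gfpflags, nr, p);
}
EXPORT_SYMBOL(kmem_cache_alloc_array);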

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/3] slub: Support for array operations
  2015-02-13 15:49     ` Christoph Lameter
@ 2015-02-17  5:26       ` Joonsoo Kim
  0 siblings, 0 replies; 26+ messages in thread
From: Joonsoo Kim @ 2015-02-17  5:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-kernel, linux-mm, penberg, iamjoonsoo,
	Jesper Dangaard Brouer

On Fri, Feb 13, 2015 at 09:49:24AM -0600, Christoph Lameter wrote:
> On Fri, 13 Feb 2015, Joonsoo Kim wrote:
> 
> > > +			*p++ = freelist;
> > > +			freelist = get_freepointer(s, freelist);
> > > +			allocated++;
> > > +		}
> >
> > Fetching all objects while holding the node lock could result in
> > enormous lock contention. How about getting the free object pointers
> > without holding the node lock? We could temporarily store all the
> > freelist heads in the array p and then fetch each object pointer
> > without holding the node lock.
> 
> 
> We could do that, but let's first see if there really is an issue. The
> other cpus sharing the same partial lists presumably have cpu-local
> objects to get through first before they hit this lock.

Okay.

Thanks.
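
For reference, the two-phase scheme from the quoted exchange could look
roughly like the sketch below. It assumes SLUB's current layout (partial
pages linked through page->lru, freelists walked with get_freepointer());
detach_partial_freelist() is an invented helper that unlinks a page's
freelist and returns its head, and returning the leftovers of a partially
consumed freelist is omitted.

static int alloc_array_two_phase(struct kmem_cache *s,
				 struct kmem_cache_node *n,
				 size_t nr, void **p)
{
	void *heads[8];		/* small on-stack scratch for freelist heads */
	struct page *page, *t;
	unsigned long flags;
	size_t out = 0;
	int nr_heads = 0, i;

	/* Phase 1: under the node lock, only detach whole freelists. */
	spin_lock_irqsave(&n->list_lock, flags);
	list_for_each_entry_safe(page, t, &n->partial, lru) {
		heads[nr_heads++] = detach_partial_freelist(s, page);
		if (nr_heads == ARRAY_SIZE(heads))
			break;
	}
	spin_unlock_irqrestore(&n->list_lock, flags);

	/* Phase 2: expand the freelists into p[] without the lock held. */
	for (i = 0; i < nr_heads && out < nr; i++) {
		void *object = heads[i];

		while (object && out < nr) {
			p[out++] = object;
			object = get_freepointer(s, object);
		}
	}
	return out;
}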


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] Slab infrastructure for array operations
  2015-02-17  5:15             ` Joonsoo Kim
@ 2015-02-17 16:03               ` Christoph Lameter
  2015-02-17 21:32                 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Lameter @ 2015-02-17 16:03 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: David Rientjes, akpm, linux-kernel, linux-mm, penberg,
	iamjoonsoo, Jesper Dangaard Brouer

On Tue, 17 Feb 2015, Joonsoo Kim wrote:

> Hmm... so far, SLAB has focused on temporal locality rather than spatial
> locality, as you know. Why does SLAB need to consider spatial locality
> first in this kmem_cache_alloc_array() case?

Well, we are talking about a large number of objects, and going around
randomly in memory is going to cause a lot of TLB misses. Spatial locality
increases the effectiveness of processing these objects.

> And, although we use the partial list first, we can't reduce
> fragmentation as much as SLUB does. The local cache may keep some free
> objects of a partial slab, so just exhausting the free objects of a
> partial slab doesn't mean that there are no free objects left. For SLUB,
> exhausting the free objects of a partial slab does mean there are no
> free objects left.

SLUB will still have the per cpu objects in the per cpu page and the per
cpu slab pages.

> If we allocate objects from the local cache as much as possible, we can
> keep temporal locality and return objects as fast as possible, since
> returning objects from the local cache just needs a memcpy from the
> local array cache to the destination array.

I thought the point was that this is used to allocate very large amounts
of objects. The hotness is not that big of an issue.

> As David said, there is no implementation for SLAB yet, and we have
> different opinions about what the SLAB implementation should look like.
> It's better to delay the detailed implementation of
> kmem_cache_alloc_array() until the SLAB implementation is agreed upon.
> Until then, calling __kmem_cache_alloc_array() from
> kmem_cache_alloc_array() is sufficient to provide the functionality.

It's not that detailed. It just lays out the basic strategy for the
array allocs. First, go to the partial lists to decrease fragmentation.
Then bypass the allocator layers completely and go directly to the page
allocator if all the objects that a page will accommodate can be put into
the array. Lastly, use the cpu-hot objects to fill in the leftover (which
would in any case be fewer than the objects in a page).
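
In outline, something like the sketch below (only an illustration of the
strategy just described; every function name here is a placeholder rather
than a function from the posted patches):

static int slub_alloc_array_outline(struct kmem_cache *s, gfp_t gfpflags,
					size_t nr, void **p)
{
	size_t allocated = 0;

	/* 1. Drain partially allocated slab pages to reduce fragmentation. */
	allocated += alloc_from_partial_lists(s, nr, p);

	/* 2. Whole slab pages taken straight from the page allocator. */
	while (nr - allocated >= objects_per_slab_page(s))
		allocated += alloc_from_new_slab_page(s, gfpflags,
						      p + allocated);

	/* 3. Top up the remainder from the cpu-hot per cpu objects. */
	allocated += alloc_from_cpu_freelist(s, nr - allocated,
					     p + allocated);

	return allocated;
}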



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] Slab infrastructure for array operations
  2015-02-17 16:03               ` Christoph Lameter
@ 2015-02-17 21:32                 ` Jesper Dangaard Brouer
  2015-02-18 23:02                   ` Christoph Lameter
  0 siblings, 1 reply; 26+ messages in thread
From: Jesper Dangaard Brouer @ 2015-02-17 21:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Joonsoo Kim, David Rientjes, akpm, linux-kernel, linux-mm,
	penberg, iamjoonsoo, brouer

On Tue, 17 Feb 2015 10:03:51 -0600 (CST)
Christoph Lameter <cl@linux.com> wrote:

> On Tue, 17 Feb 2015, Joonsoo Kim wrote:
> 
[...]
> > If we allocate objects from the local cache as much as possible, we can
> > keep temporal locality and return objects as fast as possible, since
> > returning objects from the local cache just needs a memcpy from the
> > local array cache to the destination array.
> 
> I thought the point was that this is used to allocate very large amounts
> of objects. The hotness is not that big of an issue.
>

(My use-case is in the area of 32-64 elems)

[...]
> 
> It's not that detailed. It just lays out the basic strategy for the
> array allocs. First, go to the partial lists to decrease fragmentation.
> Then bypass the allocator layers completely and go directly to the page
> allocator if all the objects that a page will accommodate can be put into
> the array. Lastly, use the cpu-hot objects to fill in the leftover (which
> would in any case be fewer than the objects in a page).

IMHO this strategy is a bit off from what I was looking for.

I would prefer the first elements to be cache-hot, while the later/rest of
the elements can be more cache-cold. The reasoning behind this is that the
subsystem calling this alloc_array has likely run out of elems (from its
local store/previous call) and needs to hand out one elem immediately
after this call returns.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] Slab infrastructure for array operations
  2015-02-17 21:32                 ` Jesper Dangaard Brouer
@ 2015-02-18 23:02                   ` Christoph Lameter
  0 siblings, 0 replies; 26+ messages in thread
From: Christoph Lameter @ 2015-02-18 23:02 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Joonsoo Kim, David Rientjes, akpm, linux-kernel, linux-mm,
	penberg, iamjoonsoo

On Wed, 18 Feb 2015, Jesper Dangaard Brouer wrote:

> (My use-case is in the area of 32-64 elems)

Ok that is in the realm of a couple of pages from the page allocator?

> > It's not that detailed. It just lays out the basic strategy for the
> > array allocs. First, go to the partial lists to decrease fragmentation.
> > Then bypass the allocator layers completely and go directly to the page
> > allocator if all the objects that a page will accommodate can be put
> > into the array. Lastly, use the cpu-hot objects to fill in the leftover
> > (which would in any case be fewer than the objects in a page).
>
> IMHO this strategy is a bit off from what I was looking for.
>
> I would prefer the first elements to be cache-hot, while the later/rest
> of the elements can be more cache-cold. The reasoning behind this is
> that the subsystem calling this alloc_array has likely run out of elems
> (from its local store/previous call) and needs to hand out one elem
> immediately after this call returns.

The problem is that going for the cache-hot objects involves dealing with
synchronization that you would not have to spend time on when going
directly to the page allocator, or when going to the partial lists and
retrieving multiple objects while taking a single lock.

Per cpu object allocation (cache hot!) is already optimized to the hilt.
There won't be much of a benefit.

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2015-02-18 23:02 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-10 19:48 [PATCH 0/3] Slab allocator array operations V2 Christoph Lameter
2015-02-10 19:48 ` [PATCH 1/3] Slab infrastructure for array operations Christoph Lameter
2015-02-10 22:43   ` Jesper Dangaard Brouer
2015-02-10 23:58   ` David Rientjes
2015-02-11 18:47     ` Christoph Lameter
2015-02-11 20:18       ` David Rientjes
2015-02-11 22:04         ` Christoph Lameter
2015-02-12  0:35           ` David Rientjes
2015-02-13  2:35         ` Joonsoo Kim
2015-02-13 15:47           ` Christoph Lameter
2015-02-13 21:20             ` David Rientjes
2015-02-17  5:15             ` Joonsoo Kim
2015-02-17 16:03               ` Christoph Lameter
2015-02-17 21:32                 ` Jesper Dangaard Brouer
2015-02-18 23:02                   ` Christoph Lameter
2015-02-10 19:48 ` [PATCH 2/3] slub: Support " Christoph Lameter
2015-02-11  4:48   ` Jesper Dangaard Brouer
2015-02-11 19:07     ` Christoph Lameter
2015-02-11 21:43       ` Jesper Dangaard Brouer
2015-02-11 22:06         ` Christoph Lameter
2015-02-12  0:16           ` Jesper Dangaard Brouer
2015-02-12  2:46             ` Christoph Lameter
2015-02-13  2:45   ` Joonsoo Kim
2015-02-13 15:49     ` Christoph Lameter
2015-02-17  5:26       ` Joonsoo Kim
2015-02-10 19:48 ` [PATCH 3/3] Array alloc test code Christoph Lameter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).