linux-mm.kvack.org archive mirror
* [S+Q3 00/23] SLUB: The Unified slab allocator (V3)
@ 2010-08-04  2:45 Christoph Lameter
  2010-08-04  2:45 ` [S+Q3 01/23] percpu: make @dyn_size always mean min dyn_size in first chunk init functions Christoph Lameter
                   ` (23 more replies)
  0 siblings, 24 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04  2:45 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, linux-kernel, Nick Piggin, David Rientjes

The following is a first release of an allocator based on SLAB
and SLUB that integrates the best approaches from both allocators. The
per cpu queueing is like that of the two prior releases. The NUMA
facilities were much improved over V2. Shared and alien cache support
was added to track the cache hot state of objects.

After these patches SLUB will track the cpu cache contents
like SLAB attempted to. There are a number of architectural differences:

1. SLUB accurately tracks cpu caches instead of assuming that there
   is only a single cpu cache per node or system.

2. SLUB object expiration is tied into the page reclaim logic. There
   is no periodic cache expiration.

3. SLUB caches are dynamically configurable via the sysfs filesystem.

4. There is no per slab page metadata structure to maintain (aside
   from the object bitmap that usually fits into the page struct).

5. Keeps all the other good features of SLUB as well.

SLUB+Q is a merging of SLUB with some queueing concepts from SLAB and a
new way of managing objects in the slabs using bitmaps. It uses a percpu
queue so that free operations can be properly buffered and a bitmap for
managing the free/allocated state in the slabs. It is slightly less
efficient than SLUB (due to the need to place larger bitmaps, sized at a
few words, in some slab pages if there are more than BITS_PER_LONG
objects in a slab) but in general does not increase space use too much.

The SLAB scheme of not touching the object during management is adopted.
SLUB+Q can efficiently free and allocate cache cold objects without
causing cache misses.
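
To make the queueing scheme concrete, here is a minimal stand-alone C
sketch of the two structures described above: a small per cpu array that
buffers pointers to freed objects and a per slab bitmap that tracks which
object slots are in use. All names and sizes are made up for illustration
and do not correspond to the actual SLUB+Q code.

/*
 * Illustrative sketch only: a per cpu queue of free object pointers and
 * a per slab allocation bitmap. Names and sizes are invented for this
 * example and do not match the real SLUB+Q implementation.
 */
#include <stdio.h>
#include <string.h>

#define QUEUE_SIZE	8			/* objects buffered per cpu */
#define BITS_PER_LONG	(8 * sizeof(unsigned long))
#define OBJS_PER_SLAB	70			/* more than one bitmap word */
#define BITMAP_WORDS	((OBJS_PER_SLAB + BITS_PER_LONG - 1) / BITS_PER_LONG)

struct cpu_queue {
	int nr;					/* objects currently queued */
	void *objects[QUEUE_SIZE];		/* buffered free objects */
};

/* Mark object @idx allocated; the object itself is never touched. */
static void bitmap_set_bit(unsigned long *map, int idx)
{
	map[idx / BITS_PER_LONG] |= 1UL << (idx % BITS_PER_LONG);
}

/* Mark object @idx free again. */
static void bitmap_clear_bit(unsigned long *map, int idx)
{
	map[idx / BITS_PER_LONG] &= ~(1UL << (idx % BITS_PER_LONG));
}

int main(void)
{
	unsigned long slab_bitmap[BITMAP_WORDS];
	struct cpu_queue q = { 0 };
	char objects[OBJS_PER_SLAB][32];	/* stand-in for slab memory */

	memset(slab_bitmap, 0, sizeof(slab_bitmap));

	/* "Allocate" object 65: mark its slot used in the slab bitmap. */
	bitmap_set_bit(slab_bitmap, 65);
	printf("after alloc: bitmap word = %#lx\n",
	       slab_bitmap[65 / BITS_PER_LONG]);

	/* "Free" only buffers the pointer in the per cpu queue ... */
	if (q.nr < QUEUE_SIZE)
		q.objects[q.nr++] = objects[65];

	/* ... the slab bitmap is touched only when the queue is drained. */
	bitmap_clear_bit(slab_bitmap, 65);
	q.nr--;
	printf("after drain: bitmap word = %#lx, queued = %d\n",
	       slab_bitmap[65 / BITS_PER_LONG], q.nr);
	return 0;
}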

I have had limited time for benchmarking this release so far since I
was more focused on getting the SLAB features merged in and making them
work reliably with all the usual SLUB bells and whistles. The queueing
scheme from the SLUB+Q V1/V2 releases was not changed, so the basic
SMP performance is still the same. V1 and V2 did not have NUMA-clean
queues and therefore the performance on NUMA systems was not great.

Since the basic queueing scheme was taken from SLAB we should be seeing
similar or better performance on NUMA. However, I am limited to two node
systems at this point. On those systems the alien caches are allocated
at a size similar to the shared caches, which means that further
optimizations will for now be geared toward small NUMA systems.



Patches against 2.6.35

1,2 Some percpu changes that I hope will be merged independently in the
	2.6.36 cycle.

3-13 Cleanup patches for SLUB that are general improvements. Some of those
	are already in the slab tree for 2.6.36.

14-18 Minimal set that realizes per cpu queues without fancy shared or alien
    queues.  This should be enough to be competitive with SLAB on SMP
    on modern hardware, as the earlier measurements show.

19   NUMA policies applied at the object level. This will cause significantly
	more processing in the allocator hotpath for the NUMA case on
	particular slabs so that individual allocations can be redirected
	to different nodes.

20	Shared caches per cache sibling group between processors.

21	Alien caches per cache sibling group. Just adds a couple of
	shared caches and uses them for foreign nodes.

22	Cache expiration

23	Expire caches from page reclaim logic in mm/vmscan.c


* [S+Q3 01/23] percpu: make @dyn_size always mean min dyn_size in first chunk init functions
  2010-08-04  2:45 [S+Q3 00/23] SLUB: The Unified slab allocator (V3) Christoph Lameter
@ 2010-08-04  2:45 ` Christoph Lameter
  2010-08-04  2:45 ` [S+Q3 02/23] percpu: allow limited allocation before slab is online Christoph Lameter
                   ` (22 subsequent siblings)
  23 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04  2:45 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: linux-mm, Tejun Heo, David Rientjes, linux-kernel, Nick Piggin

[-- Attachment #1: percpu_early_1 --]
[-- Type: text/plain, Size: 5878 bytes --]

From: Tejun Heo <tj@kernel.org>

In pcpu_build_alloc_info() and pcpu_embed_first_chunk(), @dyn_size was
ssize_t, -1 meant auto-size, 0 forced 0 and positive meant minimum
size.  There's no use case for forcing 0 and the upcoming early alloc
support always requires non-zero dynamic size.  Make @dyn_size always
mean minimum dyn_size.

While at it, make pcpu_build_alloc_info() static since it doesn't have
any external caller, as suggested by David Rientjes.
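
As a quick illustration of the new calling convention, the stand-alone
sketch below redoes the page alignment arithmetic with made-up sizes; it
is not the kernel code, the real version lives in pcpu_build_alloc_info().

/*
 * Sketch of the new @dyn_size semantics: the caller passes a minimum,
 * and any padding added to page-align the first chunk is folded back
 * into the dynamic area. Sizes below are made up for illustration.
 */
#include <stdio.h>

#define PAGE_SIZE	4096UL
#define PFN_ALIGN(x)	(((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

int main(void)
{
	unsigned long static_size = 45000;	/* .data..percpu, say */
	unsigned long reserved_size = 8192;	/* reserved (module) area */
	unsigned long dyn_size = 12 << 10;	/* requested minimum: 12K */

	unsigned long size_sum = PFN_ALIGN(static_size + reserved_size + dyn_size);

	/* dyn_size grows to soak up the alignment padding */
	dyn_size = size_sum - static_size - reserved_size;

	printf("size_sum=%lu dyn_size=%lu\n", size_sum, dyn_size);
	return 0;
}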

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

Index: work/include/linux/percpu.h
===================================================================
--- work.orig/include/linux/percpu.h
+++ work/include/linux/percpu.h
@@ -104,16 +104,11 @@ extern struct pcpu_alloc_info * __init p
 							     int nr_units);
 extern void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai);

-extern struct pcpu_alloc_info * __init pcpu_build_alloc_info(
-				size_t reserved_size, ssize_t dyn_size,
-				size_t atom_size,
-				pcpu_fc_cpu_distance_fn_t cpu_distance_fn);
-
 extern int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 					 void *base_addr);

 #ifdef CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK
-extern int __init pcpu_embed_first_chunk(size_t reserved_size, ssize_t dyn_size,
+extern int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
 				size_t atom_size,
 				pcpu_fc_cpu_distance_fn_t cpu_distance_fn,
 				pcpu_fc_alloc_fn_t alloc_fn,
Index: work/mm/percpu.c
===================================================================
--- work.orig/mm/percpu.c
+++ work/mm/percpu.c
@@ -1013,20 +1013,6 @@ phys_addr_t per_cpu_ptr_to_phys(void *ad
 		return page_to_phys(pcpu_addr_to_page(addr));
 }

-static inline size_t pcpu_calc_fc_sizes(size_t static_size,
-					size_t reserved_size,
-					ssize_t *dyn_sizep)
-{
-	size_t size_sum;
-
-	size_sum = PFN_ALIGN(static_size + reserved_size +
-			     (*dyn_sizep >= 0 ? *dyn_sizep : 0));
-	if (*dyn_sizep != 0)
-		*dyn_sizep = size_sum - static_size - reserved_size;
-
-	return size_sum;
-}
-
 /**
  * pcpu_alloc_alloc_info - allocate percpu allocation info
  * @nr_groups: the number of groups
@@ -1085,7 +1071,7 @@ void __init pcpu_free_alloc_info(struct
 /**
  * pcpu_build_alloc_info - build alloc_info considering distances between CPUs
  * @reserved_size: the size of reserved percpu area in bytes
- * @dyn_size: free size for dynamic allocation in bytes, -1 for auto
+ * @dyn_size: minimum free size for dynamic allocation in bytes
  * @atom_size: allocation atom size
  * @cpu_distance_fn: callback to determine distance between cpus, optional
  *
@@ -1103,8 +1089,8 @@ void __init pcpu_free_alloc_info(struct
  * On success, pointer to the new allocation_info is returned.  On
  * failure, ERR_PTR value is returned.
  */
-struct pcpu_alloc_info * __init pcpu_build_alloc_info(
-				size_t reserved_size, ssize_t dyn_size,
+static struct pcpu_alloc_info * __init pcpu_build_alloc_info(
+				size_t reserved_size, size_t dyn_size,
 				size_t atom_size,
 				pcpu_fc_cpu_distance_fn_t cpu_distance_fn)
 {
@@ -1123,13 +1109,15 @@ struct pcpu_alloc_info * __init pcpu_bui
 	memset(group_map, 0, sizeof(group_map));
 	memset(group_cnt, 0, sizeof(group_cnt));

+	size_sum = PFN_ALIGN(static_size + reserved_size + dyn_size);
+	dyn_size = size_sum - static_size - reserved_size;
+
 	/*
 	 * Determine min_unit_size, alloc_size and max_upa such that
 	 * alloc_size is multiple of atom_size and is the smallest
 	 * which can accomodate 4k aligned segments which are equal to
 	 * or larger than min_unit_size.
 	 */
-	size_sum = pcpu_calc_fc_sizes(static_size, reserved_size, &dyn_size);
 	min_unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE);

 	alloc_size = roundup(min_unit_size, atom_size);
@@ -1532,7 +1520,7 @@ early_param("percpu_alloc", percpu_alloc
 /**
  * pcpu_embed_first_chunk - embed the first percpu chunk into bootmem
  * @reserved_size: the size of reserved percpu area in bytes
- * @dyn_size: free size for dynamic allocation in bytes, -1 for auto
+ * @dyn_size: minimum free size for dynamic allocation in bytes
  * @atom_size: allocation atom size
  * @cpu_distance_fn: callback to determine distance between cpus, optional
  * @alloc_fn: function to allocate percpu page
@@ -1553,10 +1541,7 @@ early_param("percpu_alloc", percpu_alloc
  * vmalloc space is not orders of magnitude larger than distances
  * between node memory addresses (ie. 32bit NUMA machines).
  *
- * When @dyn_size is positive, dynamic area might be larger than
- * specified to fill page alignment.  When @dyn_size is auto,
- * @dyn_size is just big enough to fill page alignment after static
- * and reserved areas.
+ * @dyn_size specifies the minimum dynamic area size.
  *
  * If the needed size is smaller than the minimum or specified unit
  * size, the leftover is returned using @free_fn.
@@ -1564,7 +1549,7 @@ early_param("percpu_alloc", percpu_alloc
  * RETURNS:
  * 0 on success, -errno on failure.
  */
-int __init pcpu_embed_first_chunk(size_t reserved_size, ssize_t dyn_size,
+int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
 				  size_t atom_size,
 				  pcpu_fc_cpu_distance_fn_t cpu_distance_fn,
 				  pcpu_fc_alloc_fn_t alloc_fn,
@@ -1695,7 +1680,7 @@ int __init pcpu_page_first_chunk(size_t

 	snprintf(psize_str, sizeof(psize_str), "%luK", PAGE_SIZE >> 10);

-	ai = pcpu_build_alloc_info(reserved_size, -1, PAGE_SIZE, NULL);
+	ai = pcpu_build_alloc_info(reserved_size, 0, PAGE_SIZE, NULL);
 	if (IS_ERR(ai))
 		return PTR_ERR(ai);
 	BUG_ON(ai->nr_groups != 1);


* [S+Q3 02/23] percpu: allow limited allocation before slab is online
  2010-08-04  2:45 [S+Q3 00/23] SLUB: The Unified slab allocator (V3) Christoph Lameter
  2010-08-04  2:45 ` [S+Q3 01/23] percpu: make @dyn_size always mean min dyn_size in first chunk init functions Christoph Lameter
@ 2010-08-04  2:45 ` Christoph Lameter
  2010-08-04  2:45 ` [S+Q3 03/23] slub: Use a constant for an unspecified node Christoph Lameter
                   ` (21 subsequent siblings)
  23 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04  2:45 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: linux-mm, Tejun Heo, linux-kernel, Nick Piggin, David Rientjes

[-- Attachment #1: percpu_early_2 --]
[-- Type: text/plain, Size: 6913 bytes --]

From: Tejun Heo <tj@kernel.org>

This patch updates percpu allocator such that it can serve limited
amount of allocation before slab comes online.  This is primarily to
allow slab to depend on working percpu allocator.

Two parameters, PERCPU_DYNAMIC_EARLY_SIZE and SLOTS, determine how
much memory space and allocation map slots are reserved.  If this
reserved area is exhausted, WARN_ON_ONCE() will trigger and allocation
will fail till slab comes online.
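
The sketch below shows the general bootstrap pattern in stand-alone C: a
fixed early pool, a warning when it is exhausted, and a switch to the
real allocator later. The names are stand-ins and the details differ
from the actual mm/percpu.c implementation.

/*
 * Generic sketch (not the percpu code itself) of the bootstrap pattern:
 * serve a fixed number of early requests from a static buffer, warn and
 * fail when it runs out, and use the real allocator once it is online.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

#define EARLY_SLOTS 4			/* stand-in for PERCPU_DYNAMIC_EARLY_SLOTS */

static bool allocator_online;		/* stand-in for slab_is_available() */
static int early_used;
static long early_pool[EARLY_SLOTS];	/* stand-in for the __initdata maps */

static void *early_or_real_alloc(size_t size)
{
	if (allocator_online)
		return malloc(size);	/* the real allocator from here on */

	if (early_used >= EARLY_SLOTS) {
		/* WARN_ON_ONCE() in the kernel; the allocation just fails */
		fprintf(stderr, "early pool exhausted\n");
		return NULL;
	}
	return &early_pool[early_used++];
}

int main(void)
{
	for (int i = 0; i < EARLY_SLOTS + 1; i++)
		printf("early alloc %d -> %p\n", i,
		       early_or_real_alloc(sizeof(long)));

	allocator_online = true;	/* "slab" is up, real allocations work */
	void *p = early_or_real_alloc(128);
	printf("late alloc -> %p\n", p);
	free(p);
	return 0;
}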

The following changes are made to implement early alloc.

* pcpu_mem_alloc() now checks slab_is_available()

* Chunks are allocated using pcpu_mem_alloc()

* Init paths make sure ai->dyn_size is at least as large as
  PERCPU_DYNAMIC_EARLY_SIZE.

* Initial alloc maps are allocated in __initdata and copied to
  kmalloc'd areas once slab is online.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
 include/linux/percpu.h |   13 ++++++++++++
 init/main.c            |    1 
 mm/percpu.c            |   52 +++++++++++++++++++++++++++++++++++++------------
 3 files changed, 54 insertions(+), 12 deletions(-)

Index: linux-2.6/mm/percpu.c
===================================================================
--- linux-2.6.orig/mm/percpu.c	2010-07-07 08:47:18.000000000 -0500
+++ linux-2.6/mm/percpu.c	2010-07-07 08:47:19.000000000 -0500
@@ -282,6 +282,9 @@ static void __maybe_unused pcpu_next_pop
  */
 static void *pcpu_mem_alloc(size_t size)
 {
+	if (WARN_ON_ONCE(!slab_is_available()))
+		return NULL;
+
 	if (size <= PAGE_SIZE)
 		return kzalloc(size, GFP_KERNEL);
 	else {
@@ -392,13 +395,6 @@ static int pcpu_extend_area_map(struct p
 	old_size = chunk->map_alloc * sizeof(chunk->map[0]);
 	memcpy(new, chunk->map, old_size);
 
-	/*
-	 * map_alloc < PCPU_DFL_MAP_ALLOC indicates that the chunk is
-	 * one of the first chunks and still using static map.
-	 */
-	if (chunk->map_alloc >= PCPU_DFL_MAP_ALLOC)
-		old = chunk->map;
-
 	chunk->map_alloc = new_alloc;
 	chunk->map = new;
 	new = NULL;
@@ -604,7 +600,7 @@ static struct pcpu_chunk *pcpu_alloc_chu
 {
 	struct pcpu_chunk *chunk;
 
-	chunk = kzalloc(pcpu_chunk_struct_size, GFP_KERNEL);
+	chunk = pcpu_mem_alloc(pcpu_chunk_struct_size);
 	if (!chunk)
 		return NULL;
 
@@ -1109,7 +1105,9 @@ static struct pcpu_alloc_info * __init p
 	memset(group_map, 0, sizeof(group_map));
 	memset(group_cnt, 0, sizeof(group_cnt));
 
-	size_sum = PFN_ALIGN(static_size + reserved_size + dyn_size);
+	/* calculate size_sum and ensure dyn_size is enough for early alloc */
+	size_sum = PFN_ALIGN(static_size + reserved_size +
+			    max_t(size_t, dyn_size, PERCPU_DYNAMIC_EARLY_SIZE));
 	dyn_size = size_sum - static_size - reserved_size;
 
 	/*
@@ -1338,7 +1336,8 @@ int __init pcpu_setup_first_chunk(const 
 				  void *base_addr)
 {
 	static char cpus_buf[4096] __initdata;
-	static int smap[2], dmap[2];
+	static int smap[PERCPU_DYNAMIC_EARLY_SLOTS] __initdata;
+	static int dmap[PERCPU_DYNAMIC_EARLY_SLOTS] __initdata;
 	size_t dyn_size = ai->dyn_size;
 	size_t size_sum = ai->static_size + ai->reserved_size + dyn_size;
 	struct pcpu_chunk *schunk, *dchunk = NULL;
@@ -1361,14 +1360,13 @@ int __init pcpu_setup_first_chunk(const 
 } while (0)
 
 	/* sanity checks */
-	BUILD_BUG_ON(ARRAY_SIZE(smap) >= PCPU_DFL_MAP_ALLOC ||
-		     ARRAY_SIZE(dmap) >= PCPU_DFL_MAP_ALLOC);
 	PCPU_SETUP_BUG_ON(ai->nr_groups <= 0);
 	PCPU_SETUP_BUG_ON(!ai->static_size);
 	PCPU_SETUP_BUG_ON(!base_addr);
 	PCPU_SETUP_BUG_ON(ai->unit_size < size_sum);
 	PCPU_SETUP_BUG_ON(ai->unit_size & ~PAGE_MASK);
 	PCPU_SETUP_BUG_ON(ai->unit_size < PCPU_MIN_UNIT_SIZE);
+	PCPU_SETUP_BUG_ON(ai->dyn_size < PERCPU_DYNAMIC_EARLY_SIZE);
 	PCPU_SETUP_BUG_ON(pcpu_verify_alloc_info(ai) < 0);
 
 	/* process group information and build config tables accordingly */
@@ -1806,3 +1804,33 @@ void __init setup_per_cpu_areas(void)
 		__per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];
 }
 #endif /* CONFIG_HAVE_SETUP_PER_CPU_AREA */
+
+/*
+ * First and reserved chunks are initialized with temporary allocation
+ * map in initdata so that they can be used before slab is online.
+ * This function is called after slab is brought up and replaces those
+ * with properly allocated maps.
+ */
+void __init percpu_init_late(void)
+{
+	struct pcpu_chunk *target_chunks[] =
+		{ pcpu_first_chunk, pcpu_reserved_chunk, NULL };
+	struct pcpu_chunk *chunk;
+	unsigned long flags;
+	int i;
+
+	for (i = 0; (chunk = target_chunks[i]); i++) {
+		int *map;
+		const size_t size = PERCPU_DYNAMIC_EARLY_SLOTS * sizeof(map[0]);
+
+		BUILD_BUG_ON(size > PAGE_SIZE);
+
+		map = pcpu_mem_alloc(size);
+		BUG_ON(!map);
+
+		spin_lock_irqsave(&pcpu_lock, flags);
+		memcpy(map, chunk->map, size);
+		chunk->map = map;
+		spin_unlock_irqrestore(&pcpu_lock, flags);
+	}
+}
Index: linux-2.6/init/main.c
===================================================================
--- linux-2.6.orig/init/main.c	2010-07-07 08:45:22.000000000 -0500
+++ linux-2.6/init/main.c	2010-07-07 08:47:19.000000000 -0500
@@ -532,6 +532,7 @@ static void __init mm_init(void)
 	page_cgroup_init_flatmem();
 	mem_init();
 	kmem_cache_init();
+	percpu_init_late();
 	pgtable_cache_init();
 	vmalloc_init();
 }
Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2010-07-07 08:47:18.000000000 -0500
+++ linux-2.6/include/linux/percpu.h	2010-07-07 08:47:19.000000000 -0500
@@ -45,6 +45,16 @@
 #define PCPU_MIN_UNIT_SIZE		PFN_ALIGN(64 << 10)
 
 /*
+ * Percpu allocator can serve percpu allocations before slab is
+ * initialized which allows slab to depend on the percpu allocator.
+ * The following two parameters decide how much resource to
+ * preallocate for this.  Keep PERCPU_DYNAMIC_RESERVE equal to or
+ * larger than PERCPU_DYNAMIC_EARLY_SIZE.
+ */
+#define PERCPU_DYNAMIC_EARLY_SLOTS	128
+#define PERCPU_DYNAMIC_EARLY_SIZE	(12 << 10)
+
+/*
  * PERCPU_DYNAMIC_RESERVE indicates the amount of free area to piggy
  * back on the first chunk for dynamic percpu allocation if arch is
  * manually allocating and mapping it for faster access (as a part of
@@ -135,6 +145,7 @@ extern bool is_kernel_percpu_address(uns
 #ifndef CONFIG_HAVE_SETUP_PER_CPU_AREA
 extern void __init setup_per_cpu_areas(void);
 #endif
+extern void __init percpu_init_late(void);
 
 #else /* CONFIG_SMP */
 
@@ -148,6 +159,8 @@ static inline bool is_kernel_percpu_addr
 
 static inline void __init setup_per_cpu_areas(void) { }
 
+static inline void __init percpu_init_late(void) { }
+
 static inline void *pcpu_lpage_remapped(void *kaddr)
 {
 	return NULL;


* [S+Q3 03/23] slub: Use a constant for an unspecified node.
  2010-08-04  2:45 [S+Q3 00/23] SLUB: The Unified slab allocator (V3) Christoph Lameter
  2010-08-04  2:45 ` [S+Q3 01/23] percpu: make @dyn_size always mean min dyn_size in first chunk init functions Christoph Lameter
  2010-08-04  2:45 ` [S+Q3 02/23] percpu: allow limited allocation before slab is online Christoph Lameter
@ 2010-08-04  2:45 ` Christoph Lameter
  2010-08-04  3:34   ` David Rientjes
  2010-08-04  2:45 ` [S+Q3 04/23] SLUB: Constants need UL Christoph Lameter
                   ` (20 subsequent siblings)
  23 siblings, 1 reply; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04  2:45 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: linux-mm, KAMEZAWA Hiroyuki, David Rientjes, linux-kernel, Nick Piggin

[-- Attachment #1: slab_node_unspecified --]
[-- Type: text/plain, Size: 3002 bytes --]

kmalloc_node() and friends can be passed a constant -1 to indicate
that no choice was made for the node from which the object needs to
come.

Use NUMA_NO_NODE instead of -1.
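
For illustration, a tiny stand-alone sketch of the calling convention;
the helper below is a made-up stand-in, not the real kmalloc_node().

/*
 * Illustration only: how a "no preference" node id threads through a
 * node-aware allocation helper. NUMA_NO_NODE is -1 in the kernel; the
 * helper below is a stand-in, not the real kmalloc_node().
 */
#include <stdio.h>
#include <stdlib.h>

#define NUMA_NO_NODE	(-1)

static int this_node(void) { return 0; }	/* pretend we run on node 0 */

static void *sketch_alloc_node(size_t size, int node)
{
	if (node == NUMA_NO_NODE)		/* reads better than "== -1" */
		node = this_node();
	printf("allocating %zu bytes from node %d\n", size, node);
	return malloc(size);
}

int main(void)
{
	free(sketch_alloc_node(64, NUMA_NO_NODE));	/* no preference */
	free(sketch_alloc_node(64, 1));			/* explicit node 1 */
	return 0;
}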

CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-07-26 12:57:52.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-07-26 12:57:59.000000000 -0500
@@ -1073,7 +1073,7 @@ static inline struct page *alloc_slab_pa
 
 	flags |= __GFP_NOTRACK;
 
-	if (node == -1)
+	if (node == NUMA_NO_NODE)
 		return alloc_pages(flags, order);
 	else
 		return alloc_pages_exact_node(node, flags, order);
@@ -1387,7 +1387,7 @@ static struct page *get_any_partial(stru
 static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
 {
 	struct page *page;
-	int searchnode = (node == -1) ? numa_node_id() : node;
+	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
 
 	page = get_partial_node(get_node(s, searchnode));
 	if (page || (flags & __GFP_THISNODE) || node != -1)
@@ -1515,7 +1515,7 @@ static void flush_all(struct kmem_cache 
 static inline int node_match(struct kmem_cache_cpu *c, int node)
 {
 #ifdef CONFIG_NUMA
-	if (node != -1 && c->node != node)
+	if (node != NUMA_NO_NODE && c->node != node)
 		return 0;
 #endif
 	return 1;
@@ -1727,7 +1727,7 @@ static __always_inline void *slab_alloc(
 
 void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
 {
-	void *ret = slab_alloc(s, gfpflags, -1, _RET_IP_);
+	void *ret = slab_alloc(s, gfpflags, NUMA_NO_NODE, _RET_IP_);
 
 	trace_kmem_cache_alloc(_RET_IP_, ret, s->objsize, s->size, gfpflags);
 
@@ -1738,7 +1738,7 @@ EXPORT_SYMBOL(kmem_cache_alloc);
 #ifdef CONFIG_TRACING
 void *kmem_cache_alloc_notrace(struct kmem_cache *s, gfp_t gfpflags)
 {
-	return slab_alloc(s, gfpflags, -1, _RET_IP_);
+	return slab_alloc(s, gfpflags, NUMA_NO_NODE, _RET_IP_);
 }
 EXPORT_SYMBOL(kmem_cache_alloc_notrace);
 #endif
@@ -2728,7 +2728,7 @@ void *__kmalloc(size_t size, gfp_t flags
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
 		return s;
 
-	ret = slab_alloc(s, flags, -1, _RET_IP_);
+	ret = slab_alloc(s, flags, NUMA_NO_NODE, _RET_IP_);
 
 	trace_kmalloc(_RET_IP_, ret, size, s->size, flags);
 
@@ -3312,7 +3312,7 @@ void *__kmalloc_track_caller(size_t size
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
 		return s;
 
-	ret = slab_alloc(s, gfpflags, -1, caller);
+	ret = slab_alloc(s, gfpflags, NUMA_NO_NODE, caller);
 
 	/* Honor the call site pointer we recieved. */
 	trace_kmalloc(caller, ret, size, s->size, gfpflags);


* [S+Q3 04/23] SLUB: Constants need UL
  2010-08-04  2:45 [S+Q3 00/23] SLUB: The Unified slab allocator (V3) Christoph Lameter
                   ` (2 preceding siblings ...)
  2010-08-04  2:45 ` [S+Q3 03/23] slub: Use a constant for an unspecified node Christoph Lameter
@ 2010-08-04  2:45 ` Christoph Lameter
  2010-08-04  2:45 ` [S+Q3 05/23] Slub: Force no inlining of debug functions Christoph Lameter
                   ` (19 subsequent siblings)
  23 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04  2:45 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, David Rientjes, linux-kernel, Nick Piggin

[-- Attachment #1: slub_constant_ul --]
[-- Type: text/plain, Size: 1160 bytes --]

UL suffix is missing in some constants. Conform to how slab.h uses constants.
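
Beyond consistency, an explicit UL avoids surprises when a 32 bit wide
constant is used to mask a 64 bit flags word. The stand-alone example
below shows the general pitfall; it is not a claim that slub.c currently
hits it.

/*
 * Why the UL suffix matters for flag masks on 64 bit: without it the
 * constant is an unsigned int, so ~CONSTANT is only 32 bits wide and a
 * mask operation silently clears the upper half of an unsigned long.
 * Illustration only; not a statement about a specific bug in slub.c.
 */
#include <stdio.h>

#define POISON_INT	0x80000000	/* unsigned int with 32 bit int */
#define POISON_UL	0x80000000UL	/* unsigned long */

int main(void)
{
	unsigned long flags = 0xF00000000UL | 0x1;	/* high + low bits */

	unsigned long bad  = flags & ~POISON_INT;	/* ~ done in 32 bits */
	unsigned long good = flags & ~POISON_UL;	/* ~ done in 64 bits */

	printf("bad  = %#lx\n", bad);	/* high bits gone:   0x1 */
	printf("good = %#lx\n", good);	/* high bits intact: 0xf00000001 */
	return 0;
}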

Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |   20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-07-06 14:53:16.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-07-06 15:08:24.000000000 -0500
@@ -162,8 +162,8 @@
 #define MAX_OBJS_PER_PAGE	65535 /* since page.objects is u16 */
 
 /* Internal SLUB flags */
-#define __OBJECT_POISON		0x80000000 /* Poison object */
-#define __SYSFS_ADD_DEFERRED	0x40000000 /* Not yet visible via sysfs */
+#define __OBJECT_POISON		0x80000000UL /* Poison object */
+#define __SYSFS_ADD_DEFERRED	0x40000000UL /* Not yet visible via sysfs */
 
 static int kmem_size = sizeof(struct kmem_cache);
 



* [S+Q3 05/23] Slub: Force no inlining of debug functions
  2010-08-04  2:45 [S+Q3 00/23] SLUB: The Unified slab allocator (V3) Christoph Lameter
                   ` (3 preceding siblings ...)
  2010-08-04  2:45 ` [S+Q3 04/23] SLUB: Constants need UL Christoph Lameter
@ 2010-08-04  2:45 ` Christoph Lameter
  2010-08-04  2:45 ` [S+Q3 06/23] slub: Check kasprintf results in kmem_cache_init() Christoph Lameter
                   ` (18 subsequent siblings)
  23 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04  2:45 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, linux-kernel, Nick Piggin, David Rientjes

[-- Attachment #1: slub_nolinline --]
[-- Type: text/plain, Size: 1360 bytes --]

The compiler folds the debugging functions into the critical paths.
Avoid that by adding noinline to the functions that check for
problems.
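
For reference, a minimal stand-alone example of the attribute (GCC
syntax); the function names are made up and only illustrate keeping a
rarely used check out of the inlined fast path.

/*
 * Minimal illustration of noinline (GCC/Clang attribute syntax).
 * Keeping the rarely used debug check out of line keeps the caller's
 * fast path small; the names here are invented, not the slub.c ones.
 */
#include <stdio.h>

#define noinline __attribute__((noinline))

static noinline int debug_check(const void *object)
{
	/* imagine expensive consistency checks here */
	return object != NULL;
}

static inline void *fast_alloc(void *object, int debug_enabled)
{
	if (debug_enabled && !debug_check(object))	/* out-of-line call */
		return NULL;
	return object;					/* hot path stays short */
}

int main(void)
{
	int x;
	printf("%p\n", fast_alloc(&x, 1));
	return 0;
}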

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 mm/slub.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-07-29 18:32:26.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-07-29 18:32:33.000000000 -0500
@@ -857,7 +857,7 @@ static void setup_object_debug(struct km
 	init_tracking(s, object);
 }
 
-static int alloc_debug_processing(struct kmem_cache *s, struct page *page,
+static noinline int alloc_debug_processing(struct kmem_cache *s, struct page *page,
 					void *object, unsigned long addr)
 {
 	if (!check_slab(s, page))
@@ -897,8 +897,8 @@ bad:
 	return 0;
 }
 
-static int free_debug_processing(struct kmem_cache *s, struct page *page,
-					void *object, unsigned long addr)
+static noinline int free_debug_processing(struct kmem_cache *s,
+		 struct page *page, void *object, unsigned long addr)
 {
 	if (!check_slab(s, page))
 		goto fail;


* [S+Q3 06/23] slub: Check kasprintf results in kmem_cache_init()
  2010-08-04  2:45 [S+Q3 00/23] SLUB: The Unified slab allocator (V3) Christoph Lameter
                   ` (4 preceding siblings ...)
  2010-08-04  2:45 ` [S+Q3 05/23] Slub: Force no inlining of debug functions Christoph Lameter
@ 2010-08-04  2:45 ` Christoph Lameter
  2010-08-04  2:45 ` [S+Q3 07/23] slub: Use kmem_cache flags to detect if slab is in debugging mode Christoph Lameter
                   ` (17 subsequent siblings)
  23 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04  2:45 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, David Rientjes, linux-kernel, Nick Piggin

[-- Attachment #1: slub_check_kasprintf_result --]
[-- Type: text/plain, Size: 1298 bytes --]

Small allocations may fail during slab bringup, which is fatal. Add a
BUG_ON() so that we fail immediately rather than failing later during
sysfs processing.

CC: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-07-06 15:12:14.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-07-06 15:13:48.000000000 -0500
@@ -3118,9 +3118,12 @@ void __init kmem_cache_init(void)
 	slab_state = UP;
 
 	/* Provide the correct kmalloc names now that the caches are up */
-	for (i = KMALLOC_SHIFT_LOW; i < SLUB_PAGE_SHIFT; i++)
-		kmalloc_caches[i]. name =
-			kasprintf(GFP_NOWAIT, "kmalloc-%d", 1 << i);
+	for (i = KMALLOC_SHIFT_LOW; i < SLUB_PAGE_SHIFT; i++) {
+		char *s = kasprintf(GFP_NOWAIT, "kmalloc-%d", 1 << i);
+
+		BUG_ON(!s);
+		kmalloc_caches[i].name = s;
+	}
 
 #ifdef CONFIG_SMP
 	register_cpu_notifier(&slab_notifier);


* [S+Q3 07/23] slub: Use kmem_cache flags to detect if slab is in debugging mode.
  2010-08-04  2:45 [S+Q3 00/23] SLUB: The Unified slab allocator (V3) Christoph Lameter
                   ` (5 preceding siblings ...)
  2010-08-04  2:45 ` [S+Q3 06/23] slub: Check kasprintf results in kmem_cache_init() Christoph Lameter
@ 2010-08-04  2:45 ` Christoph Lameter
  2010-08-04  2:45 ` [S+Q3 08/23] slub: remove dynamic dma slab allocation Christoph Lameter
                   ` (16 subsequent siblings)
  23 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04  2:45 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, David Rientjes, linux-kernel, Nick Piggin

[-- Attachment #1: slub_debug_on --]
[-- Type: text/plain, Size: 4169 bytes --]

The cacheline with the flags is reachable from the hot paths after the
percpu allocator changes went in. So there is no longer any need to put
a flag into each slab page. Get rid of the SlubDebug page flag and use
the flags in kmem_cache instead.

Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/page-flags.h |    2 --
 mm/slub.c                  |   33 ++++++++++++---------------------
 2 files changed, 12 insertions(+), 23 deletions(-)

Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2010-07-28 12:03:19.000000000 -0500
+++ linux-2.6/include/linux/page-flags.h	2010-07-28 12:44:57.000000000 -0500
@@ -128,7 +128,6 @@ enum pageflags {
 
 	/* SLUB */
 	PG_slub_frozen = PG_active,
-	PG_slub_debug = PG_error,
 };
 
 #ifndef __GENERATING_BOUNDS_H
@@ -215,7 +214,6 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEAR
 __PAGEFLAG(SlobFree, slob_free)
 
 __PAGEFLAG(SlubFrozen, slub_frozen)
-__PAGEFLAG(SlubDebug, slub_debug)
 
 /*
  * Private page markings that may be used by the filesystem that owns the page
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-07-28 12:44:56.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-07-28 12:44:57.000000000 -0500
@@ -107,11 +107,17 @@
  * 			the fast path and disables lockless freelists.
  */
 
+#define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
+		SLAB_TRACE | SLAB_DEBUG_FREE)
+
+static inline int kmem_cache_debug(struct kmem_cache *s)
+{
 #ifdef CONFIG_SLUB_DEBUG
-#define SLABDEBUG 1
+	return unlikely(s->flags & SLAB_DEBUG_FLAGS);
 #else
-#define SLABDEBUG 0
+	return 0;
 #endif
+}
 
 /*
  * Issues still to be resolved:
@@ -1157,9 +1163,6 @@ static struct page *new_slab(struct kmem
 	inc_slabs_node(s, page_to_nid(page), page->objects);
 	page->slab = s;
 	page->flags |= 1 << PG_slab;
-	if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
-			SLAB_STORE_USER | SLAB_TRACE))
-		__SetPageSlubDebug(page);
 
 	start = page_address(page);
 
@@ -1186,14 +1189,13 @@ static void __free_slab(struct kmem_cach
 	int order = compound_order(page);
 	int pages = 1 << order;
 
-	if (unlikely(SLABDEBUG && PageSlubDebug(page))) {
+	if (kmem_cache_debug(s)) {
 		void *p;
 
 		slab_pad_check(s, page);
 		for_each_object(p, s, page_address(page),
 						page->objects)
 			check_object(s, page, p, 0);
-		__ClearPageSlubDebug(page);
 	}
 
 	kmemcheck_free_shadow(page, compound_order(page));
@@ -1415,8 +1417,7 @@ static void unfreeze_slab(struct kmem_ca
 			stat(s, tail ? DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD);
 		} else {
 			stat(s, DEACTIVATE_FULL);
-			if (SLABDEBUG && PageSlubDebug(page) &&
-						(s->flags & SLAB_STORE_USER))
+			if (kmem_cache_debug(s) && (s->flags & SLAB_STORE_USER))
 				add_full(n, page);
 		}
 		slab_unlock(page);
@@ -1624,7 +1625,7 @@ load_freelist:
 	object = c->page->freelist;
 	if (unlikely(!object))
 		goto another_slab;
-	if (unlikely(SLABDEBUG && PageSlubDebug(c->page)))
+	if (kmem_cache_debug(s))
 		goto debug;
 
 	c->freelist = get_freepointer(s, object);
@@ -1783,7 +1784,7 @@ static void __slab_free(struct kmem_cach
 	stat(s, FREE_SLOWPATH);
 	slab_lock(page);
 
-	if (unlikely(SLABDEBUG && PageSlubDebug(page)))
+	if (kmem_cache_debug(s))
 		goto debug;
 
 checks_ok:
@@ -3398,16 +3399,6 @@ static void validate_slab_slab(struct km
 	} else
 		printk(KERN_INFO "SLUB %s: Skipped busy slab 0x%p\n",
 			s->name, page);
-
-	if (s->flags & DEBUG_DEFAULT_FLAGS) {
-		if (!PageSlubDebug(page))
-			printk(KERN_ERR "SLUB %s: SlubDebug not set "
-				"on slab 0x%p\n", s->name, page);
-	} else {
-		if (PageSlubDebug(page))
-			printk(KERN_ERR "SLUB %s: SlubDebug set on "
-				"slab 0x%p\n", s->name, page);
-	}
 }
 
 static int validate_slab_node(struct kmem_cache *s,


* [S+Q3 08/23] slub: remove dynamic dma slab allocation
  2010-08-04  2:45 [S+Q3 00/23] SLUB: The Unified slab allocator (V3) Christoph Lameter
                   ` (6 preceding siblings ...)
  2010-08-04  2:45 ` [S+Q3 07/23] slub: Use kmem_cache flags to detect if slab is in debugging mode Christoph Lameter
@ 2010-08-04  2:45 ` Christoph Lameter
  2010-08-04  2:45 ` [S+Q3 09/23] slub: Remove static kmem_cache_cpu array for boot Christoph Lameter
                   ` (15 subsequent siblings)
  23 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04  2:45 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, linux-kernel, Nick Piggin, David Rientjes

[-- Attachment #1: slub_remove_dynamic_dma --]
[-- Type: text/plain, Size: 8612 bytes --]

Remove the dynamic dma slab allocation since it causes too many issues
with nested locks and the like. The change also avoids passing gfpflags
into many functions.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |  151 ++++++++++++++++----------------------------------------------
 1 file changed, 40 insertions(+), 111 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-07-27 22:51:36.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-07-27 22:51:36.000000000 -0500
@@ -2065,7 +2065,7 @@ init_kmem_cache_node(struct kmem_cache_n
 
 static DEFINE_PER_CPU(struct kmem_cache_cpu, kmalloc_percpu[KMALLOC_CACHES]);
 
-static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
+static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
 {
 	if (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches)
 		/*
@@ -2092,7 +2092,7 @@ static inline int alloc_kmem_cache_cpus(
  * when allocating for the kmalloc_node_cache. This is used for bootstrapping
  * memory on a fresh node that has no slab structures yet.
  */
-static void early_kmem_cache_node_alloc(gfp_t gfpflags, int node)
+static void early_kmem_cache_node_alloc(int node)
 {
 	struct page *page;
 	struct kmem_cache_node *n;
@@ -2100,7 +2100,7 @@ static void early_kmem_cache_node_alloc(
 
 	BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
 
-	page = new_slab(kmalloc_caches, gfpflags, node);
+	page = new_slab(kmalloc_caches, GFP_KERNEL, node);
 
 	BUG_ON(!page);
 	if (page_to_nid(page) != node) {
@@ -2144,7 +2144,7 @@ static void free_kmem_cache_nodes(struct
 	}
 }
 
-static int init_kmem_cache_nodes(struct kmem_cache *s, gfp_t gfpflags)
+static int init_kmem_cache_nodes(struct kmem_cache *s)
 {
 	int node;
 
@@ -2152,11 +2152,11 @@ static int init_kmem_cache_nodes(struct 
 		struct kmem_cache_node *n;
 
 		if (slab_state == DOWN) {
-			early_kmem_cache_node_alloc(gfpflags, node);
+			early_kmem_cache_node_alloc(node);
 			continue;
 		}
 		n = kmem_cache_alloc_node(kmalloc_caches,
-						gfpflags, node);
+						GFP_KERNEL, node);
 
 		if (!n) {
 			free_kmem_cache_nodes(s);
@@ -2173,7 +2173,7 @@ static void free_kmem_cache_nodes(struct
 {
 }
 
-static int init_kmem_cache_nodes(struct kmem_cache *s, gfp_t gfpflags)
+static int init_kmem_cache_nodes(struct kmem_cache *s)
 {
 	init_kmem_cache_node(&s->local_node, s);
 	return 1;
@@ -2313,7 +2313,7 @@ static int calculate_sizes(struct kmem_c
 
 }
 
-static int kmem_cache_open(struct kmem_cache *s, gfp_t gfpflags,
+static int kmem_cache_open(struct kmem_cache *s,
 		const char *name, size_t size,
 		size_t align, unsigned long flags,
 		void (*ctor)(void *))
@@ -2349,10 +2349,10 @@ static int kmem_cache_open(struct kmem_c
 #ifdef CONFIG_NUMA
 	s->remote_node_defrag_ratio = 1000;
 #endif
-	if (!init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
+	if (!init_kmem_cache_nodes(s))
 		goto error;
 
-	if (alloc_kmem_cache_cpus(s, gfpflags & ~SLUB_DMA))
+	if (alloc_kmem_cache_cpus(s))
 		return 1;
 
 	free_kmem_cache_nodes(s);
@@ -2512,6 +2512,10 @@ EXPORT_SYMBOL(kmem_cache_destroy);
 struct kmem_cache kmalloc_caches[KMALLOC_CACHES] __cacheline_aligned;
 EXPORT_SYMBOL(kmalloc_caches);
 
+#ifdef CONFIG_ZONE_DMA
+static struct kmem_cache kmalloc_dma_caches[SLUB_PAGE_SHIFT];
+#endif
+
 static int __init setup_slub_min_order(char *str)
 {
 	get_option(&str, &slub_min_order);
@@ -2548,116 +2552,26 @@ static int __init setup_slub_nomerge(cha
 
 __setup("slub_nomerge", setup_slub_nomerge);
 
-static struct kmem_cache *create_kmalloc_cache(struct kmem_cache *s,
-		const char *name, int size, gfp_t gfp_flags)
+static void create_kmalloc_cache(struct kmem_cache *s,
+		const char *name, int size, unsigned int flags)
 {
-	unsigned int flags = 0;
-
-	if (gfp_flags & SLUB_DMA)
-		flags = SLAB_CACHE_DMA;
-
 	/*
 	 * This function is called with IRQs disabled during early-boot on
 	 * single CPU so there's no need to take slub_lock here.
 	 */
-	if (!kmem_cache_open(s, gfp_flags, name, size, ARCH_KMALLOC_MINALIGN,
+	if (!kmem_cache_open(s, name, size, ARCH_KMALLOC_MINALIGN,
 								flags, NULL))
 		goto panic;
 
 	list_add(&s->list, &slab_caches);
 
-	if (sysfs_slab_add(s))
-		goto panic;
-	return s;
+	if (!sysfs_slab_add(s))
+		return;
 
 panic:
 	panic("Creation of kmalloc slab %s size=%d failed.\n", name, size);
 }
 
-#ifdef CONFIG_ZONE_DMA
-static struct kmem_cache *kmalloc_caches_dma[SLUB_PAGE_SHIFT];
-
-static void sysfs_add_func(struct work_struct *w)
-{
-	struct kmem_cache *s;
-
-	down_write(&slub_lock);
-	list_for_each_entry(s, &slab_caches, list) {
-		if (s->flags & __SYSFS_ADD_DEFERRED) {
-			s->flags &= ~__SYSFS_ADD_DEFERRED;
-			sysfs_slab_add(s);
-		}
-	}
-	up_write(&slub_lock);
-}
-
-static DECLARE_WORK(sysfs_add_work, sysfs_add_func);
-
-static noinline struct kmem_cache *dma_kmalloc_cache(int index, gfp_t flags)
-{
-	struct kmem_cache *s;
-	char *text;
-	size_t realsize;
-	unsigned long slabflags;
-	int i;
-
-	s = kmalloc_caches_dma[index];
-	if (s)
-		return s;
-
-	/* Dynamically create dma cache */
-	if (flags & __GFP_WAIT)
-		down_write(&slub_lock);
-	else {
-		if (!down_write_trylock(&slub_lock))
-			goto out;
-	}
-
-	if (kmalloc_caches_dma[index])
-		goto unlock_out;
-
-	realsize = kmalloc_caches[index].objsize;
-	text = kasprintf(flags & ~SLUB_DMA, "kmalloc_dma-%d",
-			 (unsigned int)realsize);
-
-	s = NULL;
-	for (i = 0; i < KMALLOC_CACHES; i++)
-		if (!kmalloc_caches[i].size)
-			break;
-
-	BUG_ON(i >= KMALLOC_CACHES);
-	s = kmalloc_caches + i;
-
-	/*
-	 * Must defer sysfs creation to a workqueue because we don't know
-	 * what context we are called from. Before sysfs comes up, we don't
-	 * need to do anything because our sysfs initcall will start by
-	 * adding all existing slabs to sysfs.
-	 */
-	slabflags = SLAB_CACHE_DMA|SLAB_NOTRACK;
-	if (slab_state >= SYSFS)
-		slabflags |= __SYSFS_ADD_DEFERRED;
-
-	if (!text || !kmem_cache_open(s, flags, text,
-			realsize, ARCH_KMALLOC_MINALIGN, slabflags, NULL)) {
-		s->size = 0;
-		kfree(text);
-		goto unlock_out;
-	}
-
-	list_add(&s->list, &slab_caches);
-	kmalloc_caches_dma[index] = s;
-
-	if (slab_state >= SYSFS)
-		schedule_work(&sysfs_add_work);
-
-unlock_out:
-	up_write(&slub_lock);
-out:
-	return kmalloc_caches_dma[index];
-}
-#endif
-
 /*
  * Conversion table for small slabs sizes / 8 to the index in the
  * kmalloc array. This is necessary for slabs < 192 since we have non power
@@ -2710,7 +2624,7 @@ static struct kmem_cache *get_slab(size_
 
 #ifdef CONFIG_ZONE_DMA
 	if (unlikely((flags & SLUB_DMA)))
-		return dma_kmalloc_cache(index, flags);
+		return &kmalloc_dma_caches[index];
 
 #endif
 	return &kmalloc_caches[index];
@@ -3049,7 +2963,7 @@ void __init kmem_cache_init(void)
 	 * kmem_cache_open for slab_state == DOWN.
 	 */
 	create_kmalloc_cache(&kmalloc_caches[0], "kmem_cache_node",
-		sizeof(struct kmem_cache_node), GFP_NOWAIT);
+		sizeof(struct kmem_cache_node), 0);
 	kmalloc_caches[0].refcount = -1;
 	caches++;
 
@@ -3062,18 +2976,18 @@ void __init kmem_cache_init(void)
 	/* Caches that are not of the two-to-the-power-of size */
 	if (KMALLOC_MIN_SIZE <= 32) {
 		create_kmalloc_cache(&kmalloc_caches[1],
-				"kmalloc-96", 96, GFP_NOWAIT);
+				"kmalloc-96", 96, 0);
 		caches++;
 	}
 	if (KMALLOC_MIN_SIZE <= 64) {
 		create_kmalloc_cache(&kmalloc_caches[2],
-				"kmalloc-192", 192, GFP_NOWAIT);
+				"kmalloc-192", 192, 0);
 		caches++;
 	}
 
 	for (i = KMALLOC_SHIFT_LOW; i < SLUB_PAGE_SHIFT; i++) {
 		create_kmalloc_cache(&kmalloc_caches[i],
-			"kmalloc", 1 << i, GFP_NOWAIT);
+			"kmalloc", 1 << i, 0);
 		caches++;
 	}
 
@@ -3146,6 +3060,21 @@ void __init kmem_cache_init(void)
 
 void __init kmem_cache_init_late(void)
 {
+#ifdef CONFIG_ZONE_DMA
+	int i;
+
+	for (i = 0; i < SLUB_PAGE_SHIFT; i++) {
+		struct kmem_cache *s = &kmalloc_caches[i];
+
+		if (s && s->size) {
+			char *name = kasprintf(GFP_KERNEL,
+				 "dma-kmalloc-%d", s->objsize);
+
+			create_kmalloc_cache(&kmalloc_dma_caches[i],
+				name, s->objsize, SLAB_CACHE_DMA);
+		}
+	}
+#endif
 }
 
 /*
@@ -3240,7 +3169,7 @@ struct kmem_cache *kmem_cache_create(con
 
 	s = kmalloc(kmem_size, GFP_KERNEL);
 	if (s) {
-		if (kmem_cache_open(s, GFP_KERNEL, name,
+		if (kmem_cache_open(s, name,
 				size, align, flags, ctor)) {
 			list_add(&s->list, &slab_caches);
 			up_write(&slub_lock);


* [S+Q3 09/23] slub: Remove static kmem_cache_cpu array for boot
  2010-08-04  2:45 [S+Q3 00/23] SLUB: The Unified slab allocator (V3) Christoph Lameter
                   ` (7 preceding siblings ...)
  2010-08-04  2:45 ` [S+Q3 08/23] slub: remove dynamic dma slab allocation Christoph Lameter
@ 2010-08-04  2:45 ` Christoph Lameter
  2010-08-04  2:45 ` [S+Q3 10/23] slub: Allow removal of slab caches during boot V2 Christoph Lameter
                   ` (14 subsequent siblings)
  23 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04  2:45 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: linux-mm, Tejun Heo, linux-kernel, Nick Piggin, David Rientjes

[-- Attachment #1: maybe_remove_static --]
[-- Type: text/plain, Size: 1535 bytes --]

The percpu allocator can now handle allocations during early boot.
So drop the static kmem_cache_cpu array.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |   17 ++++-------------
 1 file changed, 4 insertions(+), 13 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-07-26 14:26:17.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-07-26 14:26:20.000000000 -0500
@@ -2063,23 +2063,14 @@ init_kmem_cache_node(struct kmem_cache_n
 #endif
 }
 
-static DEFINE_PER_CPU(struct kmem_cache_cpu, kmalloc_percpu[KMALLOC_CACHES]);
-
 static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
 {
-	if (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches)
-		/*
-		 * Boot time creation of the kmalloc array. Use static per cpu data
-		 * since the per cpu allocator is not available yet.
-		 */
-		s->cpu_slab = kmalloc_percpu + (s - kmalloc_caches);
-	else
-		s->cpu_slab =  alloc_percpu(struct kmem_cache_cpu);
+	BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
+			SLUB_PAGE_SHIFT * sizeof(struct kmem_cache));
 
-	if (!s->cpu_slab)
-		return 0;
+	s->cpu_slab = alloc_percpu(struct kmem_cache_cpu);
 
-	return 1;
+	return s->cpu_slab != NULL;
 }
 
 #ifdef CONFIG_NUMA


* [S+Q3 10/23] slub: Allow removal of slab caches during boot V2
  2010-08-04  2:45 [S+Q3 00/23] SLUB: The Unified slab allocator (V3) Christoph Lameter
                   ` (8 preceding siblings ...)
  2010-08-04  2:45 ` [S+Q3 09/23] slub: Remove static kmem_cache_cpu array for boot Christoph Lameter
@ 2010-08-04  2:45 ` Christoph Lameter
  2010-08-04  2:45 ` [S+Q3 11/23] slub: Dynamically size kmalloc cache allocations Christoph Lameter
                   ` (13 subsequent siblings)
  23 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04  2:45 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: linux-mm, Benjamin Herrenschmidt, Roland Dreier, linux-kernel,
	Nick Piggin, David Rientjes

[-- Attachment #1: slub_sysfs_remove_during_boot --]
[-- Type: text/plain, Size: 3271 bytes --]

Serialize kmem_cache_create and kmem_cache_destroy using the slub_lock.
This is only possible now that the use of the slub_lock during dynamic
dma cache creation has been removed.

Then make sure that the setup of the slab sysfs entries does not race
with kmem_cache_create and kmem_cache destroy.

If a slab cache is removed before we have set up sysfs then simply skip
over the sysfs handling.

V1->V2:
- Do proper synchronization to address race conditions

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Roland Dreier <rdreier@cisco.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |   24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-07-27 22:51:41.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-07-27 22:51:43.000000000 -0500
@@ -2482,7 +2482,6 @@ void kmem_cache_destroy(struct kmem_cach
 	s->refcount--;
 	if (!s->refcount) {
 		list_del(&s->list);
-		up_write(&slub_lock);
 		if (kmem_cache_close(s)) {
 			printk(KERN_ERR "SLUB %s: %s called for cache that "
 				"still has objects.\n", s->name, __func__);
@@ -2491,8 +2490,8 @@ void kmem_cache_destroy(struct kmem_cach
 		if (s->flags & SLAB_DESTROY_BY_RCU)
 			rcu_barrier();
 		sysfs_slab_remove(s);
-	} else
-		up_write(&slub_lock);
+	}
+	up_write(&slub_lock);
 }
 EXPORT_SYMBOL(kmem_cache_destroy);
 
@@ -3147,14 +3146,12 @@ struct kmem_cache *kmem_cache_create(con
 		 */
 		s->objsize = max(s->objsize, (int)size);
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
-		up_write(&slub_lock);
 
 		if (sysfs_slab_alias(s, name)) {
-			down_write(&slub_lock);
 			s->refcount--;
-			up_write(&slub_lock);
 			goto err;
 		}
+		up_write(&slub_lock);
 		return s;
 	}
 
@@ -3163,14 +3160,12 @@ struct kmem_cache *kmem_cache_create(con
 		if (kmem_cache_open(s, name,
 				size, align, flags, ctor)) {
 			list_add(&s->list, &slab_caches);
-			up_write(&slub_lock);
 			if (sysfs_slab_add(s)) {
-				down_write(&slub_lock);
 				list_del(&s->list);
-				up_write(&slub_lock);
 				kfree(s);
 				goto err;
 			}
+			up_write(&slub_lock);
 			return s;
 		}
 		kfree(s);
@@ -4418,6 +4413,13 @@ static int sysfs_slab_add(struct kmem_ca
 
 static void sysfs_slab_remove(struct kmem_cache *s)
 {
+	if (slab_state < SYSFS)
+		/*
+		 * Sysfs has not been setup yet so no need to remove the
+		 * cache from sysfs.
+		 */
+		return;
+
 	kobject_uevent(&s->kobj, KOBJ_REMOVE);
 	kobject_del(&s->kobj);
 	kobject_put(&s->kobj);
@@ -4463,8 +4465,11 @@ static int __init slab_sysfs_init(void)
 	struct kmem_cache *s;
 	int err;
 
+	down_write(&slub_lock);
+
 	slab_kset = kset_create_and_add("slab", &slab_uevent_ops, kernel_kobj);
 	if (!slab_kset) {
+		up_write(&slub_lock);
 		printk(KERN_ERR "Cannot register slab subsystem.\n");
 		return -ENOSYS;
 	}
@@ -4489,6 +4494,7 @@ static int __init slab_sysfs_init(void)
 		kfree(al);
 	}
 
+	up_write(&slub_lock);
 	resiliency_test();
 	return 0;
 }


* [S+Q3 11/23] slub: Dynamically size kmalloc cache allocations
  2010-08-04  2:45 [S+Q3 00/23] SLUB: The Unified slab allocator (V3) Christoph Lameter
                   ` (9 preceding siblings ...)
  2010-08-04  2:45 ` [S+Q3 10/23] slub: Allow removal of slab caches during boot V2 Christoph Lameter
@ 2010-08-04  2:45 ` Christoph Lameter
  2010-08-04  2:45 ` [S+Q3 12/23] slub: Extract hooks for memory checkers from hotpaths Christoph Lameter
                   ` (12 subsequent siblings)
  23 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04  2:45 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, linux-kernel, Nick Piggin, David Rientjes

[-- Attachment #1: slub_dynamic_kmem_alloc --]
[-- Type: text/plain, Size: 11476 bytes --]

kmalloc caches are statically defined and may take up a lot of space
just because the size of the node array has to be dimensioned for the
largest node count supported.

This patch makes the size of the kmem_cache structure dynamic throughout by
creating a kmem_cache slab cache for the kmem_cache objects. The bootstrap
occurs by allocating the initial one or two kmem_cache objects from the
page allocator.
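
The saving comes from sizing the structure for the node count actually
present instead of the maximum. Below is a stand-alone sketch of that
offsetof() based sizing, using a toy structure rather than the real
struct kmem_cache.

/*
 * Sketch of the sizing trick used in the patch: allocate only as many
 * node pointers as the machine actually has, instead of MAX_NUMNODES.
 * The structure below is a toy stand-in for struct kmem_cache.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

#define MAX_NUMNODES 1024		/* what a static array must assume */

struct toy_node;			/* per node data, details irrelevant */

struct toy_cache {
	unsigned long flags;
	int size;
	struct toy_node *node[MAX_NUMNODES];	/* only nr_node_ids used */
};

int main(void)
{
	int nr_node_ids = 2;		/* a small two node box */

	size_t full = sizeof(struct toy_cache);
	size_t trimmed = offsetof(struct toy_cache, node) +
			 nr_node_ids * sizeof(struct toy_node *);

	struct toy_cache *s = malloc(trimmed);	/* what the patch does, per cache */

	printf("static size: %zu bytes, dynamic size: %zu bytes\n",
	       full, trimmed);
	free(s);
	return 0;
}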

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slub_def.h |    7 -
 mm/slub.c                |  181 +++++++++++++++++++++++++++++++++++------------
 2 files changed, 139 insertions(+), 49 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2010-07-26 14:25:16.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2010-07-26 14:26:24.000000000 -0500
@@ -136,19 +136,16 @@ struct kmem_cache {
 
 #ifdef CONFIG_ZONE_DMA
 #define SLUB_DMA __GFP_DMA
-/* Reserve extra caches for potential DMA use */
-#define KMALLOC_CACHES (2 * SLUB_PAGE_SHIFT)
 #else
 /* Disable DMA functionality */
 #define SLUB_DMA (__force gfp_t)0
-#define KMALLOC_CACHES SLUB_PAGE_SHIFT
 #endif
 
 /*
  * We keep the general caches in an array of slab caches that are used for
  * 2^x bytes of allocations.
  */
-extern struct kmem_cache kmalloc_caches[KMALLOC_CACHES];
+extern struct kmem_cache *kmalloc_caches[SLUB_PAGE_SHIFT];
 
 /*
  * Sorry that the following has to be that ugly but some versions of GCC
@@ -213,7 +210,7 @@ static __always_inline struct kmem_cache
 	if (index == 0)
 		return NULL;
 
-	return &kmalloc_caches[index];
+	return kmalloc_caches[index];
 }
 
 void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-07-26 14:26:20.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-07-26 14:26:24.000000000 -0500
@@ -179,7 +179,7 @@ static struct notifier_block slab_notifi
 
 static enum {
 	DOWN,		/* No slab functionality available */
-	PARTIAL,	/* kmem_cache_open() works but kmalloc does not */
+	PARTIAL,	/* Kmem_cache_node works */
 	UP,		/* Everything works but does not show up in sysfs */
 	SYSFS		/* Sysfs up */
 } slab_state = DOWN;
@@ -2074,6 +2074,8 @@ static inline int alloc_kmem_cache_cpus(
 }
 
 #ifdef CONFIG_NUMA
+static struct kmem_cache *kmem_cache_node;
+
 /*
  * No kmalloc_node yet so do it by hand. We know that this is the first
  * slab on the node for this slabcache. There are no concurrent accesses
@@ -2089,9 +2091,9 @@ static void early_kmem_cache_node_alloc(
 	struct kmem_cache_node *n;
 	unsigned long flags;
 
-	BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
+	BUG_ON(kmem_cache_node->size < sizeof(struct kmem_cache_node));
 
-	page = new_slab(kmalloc_caches, GFP_KERNEL, node);
+	page = new_slab(kmem_cache_node, GFP_KERNEL, node);
 
 	BUG_ON(!page);
 	if (page_to_nid(page) != node) {
@@ -2103,15 +2105,15 @@ static void early_kmem_cache_node_alloc(
 
 	n = page->freelist;
 	BUG_ON(!n);
-	page->freelist = get_freepointer(kmalloc_caches, n);
+	page->freelist = get_freepointer(kmem_cache_node, n);
 	page->inuse++;
-	kmalloc_caches->node[node] = n;
+	kmem_cache_node->node[node] = n;
 #ifdef CONFIG_SLUB_DEBUG
-	init_object(kmalloc_caches, n, 1);
-	init_tracking(kmalloc_caches, n);
+	init_object(kmem_cache_node, n, 1);
+	init_tracking(kmem_cache_node, n);
 #endif
-	init_kmem_cache_node(n, kmalloc_caches);
-	inc_slabs_node(kmalloc_caches, node, page->objects);
+	init_kmem_cache_node(n, kmem_cache_node);
+	inc_slabs_node(kmem_cache_node, node, page->objects);
 
 	/*
 	 * lockdep requires consistent irq usage for each lock
@@ -2129,8 +2131,10 @@ static void free_kmem_cache_nodes(struct
 
 	for_each_node_state(node, N_NORMAL_MEMORY) {
 		struct kmem_cache_node *n = s->node[node];
+
 		if (n)
-			kmem_cache_free(kmalloc_caches, n);
+			kmem_cache_free(kmem_cache_node, n);
+
 		s->node[node] = NULL;
 	}
 }
@@ -2146,7 +2150,7 @@ static int init_kmem_cache_nodes(struct 
 			early_kmem_cache_node_alloc(node);
 			continue;
 		}
-		n = kmem_cache_alloc_node(kmalloc_caches,
+		n = kmem_cache_alloc_node(kmem_cache_node,
 						GFP_KERNEL, node);
 
 		if (!n) {
@@ -2499,11 +2503,13 @@ EXPORT_SYMBOL(kmem_cache_destroy);
  *		Kmalloc subsystem
  *******************************************************************/
 
-struct kmem_cache kmalloc_caches[KMALLOC_CACHES] __cacheline_aligned;
+struct kmem_cache *kmalloc_caches[SLUB_PAGE_SHIFT];
 EXPORT_SYMBOL(kmalloc_caches);
 
+static struct kmem_cache *kmem_cache;
+
 #ifdef CONFIG_ZONE_DMA
-static struct kmem_cache kmalloc_dma_caches[SLUB_PAGE_SHIFT];
+static struct kmem_cache *kmalloc_dma_caches[SLUB_PAGE_SHIFT];
 #endif
 
 static int __init setup_slub_min_order(char *str)
@@ -2542,9 +2548,13 @@ static int __init setup_slub_nomerge(cha
 
 __setup("slub_nomerge", setup_slub_nomerge);
 
-static void create_kmalloc_cache(struct kmem_cache *s,
+static void __init create_kmalloc_cache(struct kmem_cache **sp,
 		const char *name, int size, unsigned int flags)
 {
+	struct kmem_cache *s;
+
+	s = kmem_cache_alloc(kmem_cache, GFP_NOWAIT);
+
 	/*
 	 * This function is called with IRQs disabled during early-boot on
 	 * single CPU so there's no need to take slub_lock here.
@@ -2553,6 +2563,8 @@ static void create_kmalloc_cache(struct 
 								flags, NULL))
 		goto panic;
 
+	*sp = s;
+
 	list_add(&s->list, &slab_caches);
 
 	if (!sysfs_slab_add(s))
@@ -2614,10 +2626,10 @@ static struct kmem_cache *get_slab(size_
 
 #ifdef CONFIG_ZONE_DMA
 	if (unlikely((flags & SLUB_DMA)))
-		return &kmalloc_dma_caches[index];
+		return kmalloc_dma_caches[index];
 
 #endif
-	return &kmalloc_caches[index];
+	return kmalloc_caches[index];
 }
 
 void *__kmalloc(size_t size, gfp_t flags)
@@ -2941,46 +2953,114 @@ static int slab_memory_callback(struct n
  *			Basic setup of slabs
  *******************************************************************/
 
+/*
+ * Used for early kmem_cache structures that were allocated using
+ * the page allocator
+ */
+
+static void __init kmem_cache_bootstrap_fixup(struct kmem_cache *s)
+{
+	int node;
+
+	list_add(&s->list, &slab_caches);
+	sysfs_slab_add(s);
+	s->refcount = -1;
+
+	for_each_node(node) {
+		struct kmem_cache_node *n = get_node(s, node);
+		struct page *p;
+
+		if (n) {
+			list_for_each_entry(p, &n->partial, lru)
+				p->slab = s;
+
+#ifdef CONFIG_SLAB_DEBUG
+			list_for_each_entry(p, &n->full, lru)
+				p->slab = s;
+#endif
+		}
+	}
+}
+
 void __init kmem_cache_init(void)
 {
 	int i;
 	int caches = 0;
+	struct kmem_cache *temp_kmem_cache;
+	int order;
 
 #ifdef CONFIG_NUMA
+	struct kmem_cache *temp_kmem_cache_node;
+	unsigned long kmalloc_size;
+
+	kmem_size = offsetof(struct kmem_cache, node) +
+				nr_node_ids * sizeof(struct kmem_cache_node *);
+
+	/* Allocate two kmem_caches from the page allocator */
+	kmalloc_size = ALIGN(kmem_size, cache_line_size());
+	order = get_order(2 * kmalloc_size);
+	kmem_cache = (void *)__get_free_pages(GFP_NOWAIT, order);
+
 	/*
 	 * Must first have the slab cache available for the allocations of the
 	 * struct kmem_cache_node's. There is special bootstrap code in
 	 * kmem_cache_open for slab_state == DOWN.
 	 */
-	create_kmalloc_cache(&kmalloc_caches[0], "kmem_cache_node",
-		sizeof(struct kmem_cache_node), 0);
-	kmalloc_caches[0].refcount = -1;
-	caches++;
+	kmem_cache_node = (void *)kmem_cache + kmalloc_size;
+
+	kmem_cache_open(kmem_cache_node, "kmem_cache_node",
+		sizeof(struct kmem_cache_node),
+		0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
 
 	hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);
+#else
+	/* Allocate a single kmem_cache from the page allocator */
+	kmem_size = sizeof(struct kmem_cache);
+	order = get_order(kmem_size);
+	kmem_cache = (void *)__get_free_pages(GFP_NOWAIT, order);
 #endif
 
 	/* Able to allocate the per node structures */
 	slab_state = PARTIAL;
 
-	/* Caches that are not of the two-to-the-power-of size */
-	if (KMALLOC_MIN_SIZE <= 32) {
-		create_kmalloc_cache(&kmalloc_caches[1],
-				"kmalloc-96", 96, 0);
-		caches++;
-	}
-	if (KMALLOC_MIN_SIZE <= 64) {
-		create_kmalloc_cache(&kmalloc_caches[2],
-				"kmalloc-192", 192, 0);
-		caches++;
-	}
+	temp_kmem_cache = kmem_cache;
+	kmem_cache_open(kmem_cache, "kmem_cache", kmem_size,
+		0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+	kmem_cache = kmem_cache_alloc(kmem_cache, GFP_NOWAIT);
+	memcpy(kmem_cache, temp_kmem_cache, kmem_size);
 
-	for (i = KMALLOC_SHIFT_LOW; i < SLUB_PAGE_SHIFT; i++) {
-		create_kmalloc_cache(&kmalloc_caches[i],
-			"kmalloc", 1 << i, 0);
-		caches++;
-	}
+#ifdef CONFIG_NUMA
+	/*
+	 * Allocate kmem_cache_node properly from the kmem_cache slab.
+	 * kmem_cache_node is separately allocated so no need to
+	 * update any list pointers.
+	 */
+	temp_kmem_cache_node = kmem_cache_node;
 
+	kmem_cache_node = kmem_cache_alloc(kmem_cache, GFP_NOWAIT);
+	memcpy(kmem_cache_node, temp_kmem_cache_node, kmem_size);
+
+	kmem_cache_bootstrap_fixup(kmem_cache_node);
+
+	caches++;
+#else
+	/*
+	 * kmem_cache has kmem_cache_node embedded and we moved it!
+	 * Update the list heads
+	 */
+	INIT_LIST_HEAD(&kmem_cache->local_node.partial);
+	list_splice(&temp_kmem_cache->local_node.partial, &kmem_cache->local_node.partial);
+#ifdef CONFIG_SLUB_DEBUG
+	INIT_LIST_HEAD(&kmem_cache->local_node.full);
+	list_splice(&temp_kmem_cache->local_node.full, &kmem_cache->local_node.full);
+#endif
+#endif
+	kmem_cache_bootstrap_fixup(kmem_cache);
+	caches++;
+	/* Free temporary boot structure */
+	free_pages((unsigned long)temp_kmem_cache, order);
+
+	/* Now we can use the kmem_cache to allocate kmalloc slabs */
 
 	/*
 	 * Patch up the size_index table if we have strange large alignment
@@ -3020,6 +3100,25 @@ void __init kmem_cache_init(void)
 			size_index[size_index_elem(i)] = 8;
 	}
 
+	/* Caches that are not of the two-to-the-power-of size */
+	if (KMALLOC_MIN_SIZE <= 32) {
+		create_kmalloc_cache(&kmalloc_caches[1],
+				"kmalloc-96", 96, 0);
+		caches++;
+	}
+
+	if (KMALLOC_MIN_SIZE <= 64) {
+		create_kmalloc_cache(&kmalloc_caches[2],
+				"kmalloc-192", 192, 0);
+		caches++;
+	}
+
+	for (i = KMALLOC_SHIFT_LOW; i < SLUB_PAGE_SHIFT; i++) {
+		create_kmalloc_cache(&kmalloc_caches[i],
+			"kmalloc", 1 << i, 0);
+		caches++;
+	}
+
 	slab_state = UP;
 
 	/* Provide the correct kmalloc names now that the caches are up */
@@ -3027,18 +3126,12 @@ void __init kmem_cache_init(void)
 		char *s = kasprintf(GFP_NOWAIT, "kmalloc-%d", 1 << i);
 
 		BUG_ON(!s);
-		kmalloc_caches[i].name = s;
+		kmalloc_caches[i]->name = s;
 	}
 
 #ifdef CONFIG_SMP
 	register_cpu_notifier(&slab_notifier);
 #endif
-#ifdef CONFIG_NUMA
-	kmem_size = offsetof(struct kmem_cache, node) +
-				nr_node_ids * sizeof(struct kmem_cache_node *);
-#else
-	kmem_size = sizeof(struct kmem_cache);
-#endif
 
 	printk(KERN_INFO
 		"SLUB: Genslabs=%d, HWalign=%d, Order=%d-%d, MinObjects=%d,"
@@ -3054,7 +3147,7 @@ void __init kmem_cache_init_late(void)
 	int i;
 
 	for (i = 0; i < SLUB_PAGE_SHIFT; i++) {
-		struct kmem_cache *s = &kmalloc_caches[i];
+		struct kmem_cache *s = kmalloc_caches[i];
 
 		if (s && s->size) {
 			char *name = kasprintf(GFP_KERNEL,


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [S+Q3 12/23] slub: Extract hooks for memory checkers from hotpaths
  2010-08-04  2:45 [S+Q3 00/23] SLUB: The Unified slab allocator (V3) Christoph Lameter
                   ` (10 preceding siblings ...)
  2010-08-04  2:45 ` [S+Q3 11/23] slub: Dynamically size kmalloc cache allocations Christoph Lameter
@ 2010-08-04  2:45 ` Christoph Lameter
  2010-08-04  2:45 ` [S+Q3 13/23] slub: Move gfpflag masking out of the hotpath Christoph Lameter
                   ` (11 subsequent siblings)
  23 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04  2:45 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, linux-kernel, Nick Piggin, David Rientjes

[-- Attachment #1: slub_extract --]
[-- Type: text/plain, Size: 3211 bytes --]

Extract the code that memory checkers and other verification tools use
from the hotpaths into inline hooks. This makes it easier to add new
checkers and reduces the clutter in the hotpaths.
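
For illustration only, a minimal userspace sketch of the hook pattern; all
names here (struct cache, pre_alloc_hook, alloc_object and the checker
stubs) are hypothetical stand-ins and not the kernel code:

#include <stdio.h>
#include <stdlib.h>

struct cache { size_t objsize; };

/*
 * Checker stubs; they stand in for checkers that are compiled out in
 * production configurations.
 */
static inline int check_should_fail(struct cache *c) { (void)c; return 0; }
static inline void check_track_alloc(struct cache *c, void *p) { (void)c; (void)p; }

/* The hooks bundle all checker calls so the hotpath has one call site each. */
static inline int pre_alloc_hook(struct cache *c)
{
	return check_should_fail(c);	/* non-zero means refuse the allocation */
}

static inline void post_alloc_hook(struct cache *c, void *object)
{
	check_track_alloc(c, object);	/* tell the checkers about the new object */
}

static void *alloc_object(struct cache *c)
{
	void *object;

	if (pre_alloc_hook(c))
		return NULL;

	object = malloc(c->objsize);	/* stand-in for the real fastpath */
	post_alloc_hook(c, object);
	return object;
}

int main(void)
{
	struct cache c = { .objsize = 64 };
	void *p = alloc_object(&c);

	printf("allocated %p\n", p);
	free(p);
	return 0;
}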

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |   49 ++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 38 insertions(+), 11 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-07-26 14:26:24.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-07-26 14:26:33.000000000 -0500
@@ -793,6 +793,37 @@ static void trace(struct kmem_cache *s, 
 }
 
 /*
+ * Hooks for other subsystems that check memory allocations. In a typical
+ * production configuration these hooks all should produce no code at all.
+ */
+static inline int slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
+{
+	lockdep_trace_alloc(flags);
+	might_sleep_if(flags & __GFP_WAIT);
+
+	return should_failslab(s->objsize, flags, s->flags);
+}
+
+static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags, void *object)
+{
+	kmemcheck_slab_alloc(s, flags, object, s->objsize);
+	kmemleak_alloc_recursive(object, s->objsize, 1, s->flags, flags);
+}
+
+static inline void slab_free_hook(struct kmem_cache *s, void *x)
+{
+	kmemleak_free_recursive(x, s->flags);
+}
+
+static inline void slab_free_hook_irq(struct kmem_cache *s, void *object)
+{
+	kmemcheck_slab_free(s, object, s->objsize);
+	debug_check_no_locks_freed(object, s->objsize);
+	if (!(s->flags & SLAB_DEBUG_OBJECTS))
+		debug_check_no_obj_freed(object, s->objsize);
+}
+
+/*
  * Tracking of fully allocated slabs for debugging purposes.
  */
 static void add_full(struct kmem_cache_node *n, struct page *page)
@@ -1698,10 +1729,7 @@ static __always_inline void *slab_alloc(
 
 	gfpflags &= gfp_allowed_mask;
 
-	lockdep_trace_alloc(gfpflags);
-	might_sleep_if(gfpflags & __GFP_WAIT);
-
-	if (should_failslab(s->objsize, gfpflags, s->flags))
+	if (!slab_pre_alloc_hook(s, gfpflags))
 		return NULL;
 
 	local_irq_save(flags);
@@ -1720,8 +1748,7 @@ static __always_inline void *slab_alloc(
 	if (unlikely(gfpflags & __GFP_ZERO) && object)
 		memset(object, 0, s->objsize);
 
-	kmemcheck_slab_alloc(s, gfpflags, object, s->objsize);
-	kmemleak_alloc_recursive(object, s->objsize, 1, s->flags, gfpflags);
+	slab_post_alloc_hook(s, gfpflags, object);
 
 	return object;
 }
@@ -1851,13 +1878,13 @@ static __always_inline void slab_free(st
 	struct kmem_cache_cpu *c;
 	unsigned long flags;
 
-	kmemleak_free_recursive(x, s->flags);
+	slab_free_hook(s, x);
+
 	local_irq_save(flags);
 	c = __this_cpu_ptr(s->cpu_slab);
-	kmemcheck_slab_free(s, object, s->objsize);
-	debug_check_no_locks_freed(object, s->objsize);
-	if (!(s->flags & SLAB_DEBUG_OBJECTS))
-		debug_check_no_obj_freed(object, s->objsize);
+
+	slab_free_hook_irq(s, x);
+
 	if (likely(page == c->page && c->node >= 0)) {
 		set_freepointer(s, object, c->freelist);
 		c->freelist = object;


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [S+Q3 13/23] slub: Move gfpflag masking out of the hotpath
  2010-08-04  2:45 [S+Q3 00/23] SLUB: The Unified slab allocator (V3) Christoph Lameter
                   ` (11 preceding siblings ...)
  2010-08-04  2:45 ` [S+Q3 12/23] slub: Extract hooks for memory checkers from hotpaths Christoph Lameter
@ 2010-08-04  2:45 ` Christoph Lameter
  2010-08-04  2:45 ` [S+Q3 14/23] slub: Add SLAB style per cpu queueing Christoph Lameter
                   ` (10 subsequent siblings)
  23 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04  2:45 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, linux-kernel, Nick Piggin, David Rientjes

[-- Attachment #1: slub_move_gfpflags --]
[-- Type: text/plain, Size: 1790 bytes --]

Move the gfpflags masking into the checker hooks and into the slowpaths.
Masking the gfpflags requires access to a global variable and thus adds
an additional cacheline reference to the hotpaths.

If no hooks are active then the gfpflags masking becomes dead code that
the compiler can optimize away.
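
As a simplified illustration only (userspace C with hypothetical names, not
the kernel code): once the masking lives inside the hook, a configuration
whose hook body is empty never references the global from the hotpath.

#include <stdio.h>
#include <stdlib.h>

/* Stand-in for gfp_allowed_mask; reading it costs a cacheline reference. */
unsigned int allowed_mask = ~0u;

/*
 * The mask is applied inside the hook. When DEBUG_HOOKS is not defined the
 * hook body is empty, the masking is gone and allowed_mask is never read
 * on the allocation path.
 */
static inline void pre_alloc_hook(unsigned int flags)
{
#ifdef DEBUG_HOOKS
	flags &= allowed_mask;
	fprintf(stderr, "allocating with flags %#x\n", flags);
#else
	(void)flags;
#endif
}

static void *alloc_object(size_t size, unsigned int flags)
{
	pre_alloc_hook(flags);		/* no mask and no global access here */
	return malloc(size);		/* stand-in for the real fastpath */
}

int main(void)
{
	void *p = alloc_object(64, 0x10);

	printf("allocated %p\n", p);
	free(p);
	return 0;
}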

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-07-26 14:26:33.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-07-26 14:26:47.000000000 -0500
@@ -798,6 +798,7 @@ static void trace(struct kmem_cache *s, 
  */
 static inline int slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
 {
+	flags &= gfp_allowed_mask;
 	lockdep_trace_alloc(flags);
 	might_sleep_if(flags & __GFP_WAIT);
 
@@ -806,6 +807,7 @@ static inline int slab_pre_alloc_hook(st
 
 static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags, void *object)
 {
+	flags &= gfp_allowed_mask;
 	kmemcheck_slab_alloc(s, flags, object, s->objsize);
 	kmemleak_alloc_recursive(object, s->objsize, 1, s->flags, flags);
 }
@@ -1679,6 +1681,7 @@ new_slab:
 		goto load_freelist;
 	}
 
+	gfpflags &= gfp_allowed_mask;
 	if (gfpflags & __GFP_WAIT)
 		local_irq_enable();
 
@@ -1727,8 +1730,6 @@ static __always_inline void *slab_alloc(
 	struct kmem_cache_cpu *c;
 	unsigned long flags;
 
-	gfpflags &= gfp_allowed_mask;
-
 	if (!slab_pre_alloc_hook(s, gfpflags))
 		return NULL;
 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [S+Q3 14/23] slub: Add SLAB style per cpu queueing
  2010-08-04  2:45 [S+Q3 00/23] SLUB: The Unified slab allocator (V3) Christoph Lameter
                   ` (12 preceding siblings ...)
  2010-08-04  2:45 ` [S+Q3 13/23] slub: Move gfpflag masking out of the hotpath Christoph Lameter
@ 2010-08-04  2:45 ` Christoph Lameter
  2010-08-04  2:45 ` [S+Q3 15/23] slub: Allow resizing of per cpu queues Christoph Lameter
                   ` (9 subsequent siblings)
  23 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04  2:45 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, linux-kernel, Nick Piggin, David Rientjes

[-- Attachment #1: unified_core --]
[-- Type: text/plain, Size: 50271 bytes --]

This patch adds SLAB style per cpu queueing and a new way of managing
objects in the slabs using bitmaps. It uses a percpu queue so that free
operations can be properly buffered and a bitmap for managing the
free/allocated state in the slabs. The approach uses slightly more memory
(because large bitmaps, sized a few words, must be placed in some slab
pages) but in general it competes well in terms of space use.
The bitmap storage format avoids the per slab management structure that
SLAB needs for each slab page, so the metadata is more compact and
easily fits into a cacheline.

The SLAB scheme of not touching the object during management is adopted.
SLUB can now efficiently free and allocate cache cold objects.

The queueing scheme also addresses the issue that the free slowpath
was taken too frequently.

This patch only implements statically sized per cpu queues and does
not deal with NUMA queueing or shared queueing. Frees to remote nodes
are simply returned directly to the slab, taking the per page slab lock.
(A later patch introduces the infamous alien caches to SLUB.)
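
For illustration only, a toy userspace model of the two ideas, a free-object
bitmap per slab and a small per cpu queue that batches frees; the names and
sizes (struct slab, struct cpu_queue, OBJECTS, QUEUE_SIZE) are hypothetical
and far simpler than the patch below:

#include <stdio.h>

#define OBJECTS		16		/* objects per toy slab */
#define QUEUE_SIZE	8		/* toy per cpu queue capacity */

struct slab {
	unsigned long free_map;		/* bit set = object is free */
	char objects[OBJECTS][64];
};

struct cpu_queue {
	int count;
	void *object[QUEUE_SIZE];
};

/* Refill the queue by clearing free bits and recording object addresses. */
static void refill_queue(struct slab *s, struct cpu_queue *q)
{
	for (int i = 0; i < OBJECTS && q->count < QUEUE_SIZE; i++)
		if (s->free_map & (1UL << i)) {
			s->free_map &= ~(1UL << i);
			q->object[q->count++] = s->objects[i];
		}
}

/* Drain the queue back to the slab by setting the corresponding bits. */
static void drain_queue(struct slab *s, struct cpu_queue *q)
{
	while (q->count) {
		char (*p)[64] = q->object[--q->count];

		s->free_map |= 1UL << (p - s->objects);
	}
}

static void *queue_alloc(struct slab *s, struct cpu_queue *q)
{
	if (!q->count)
		refill_queue(s, q);
	return q->count ? q->object[--q->count] : NULL;
}

static void queue_free(struct slab *s, struct cpu_queue *q, void *p)
{
	if (q->count == QUEUE_SIZE)
		drain_queue(s, q);	/* push cache cold objects back in a batch */
	q->object[q->count++] = p;
}

int main(void)
{
	struct slab s = { .free_map = (1UL << OBJECTS) - 1 };
	struct cpu_queue q = { 0 };
	void *a = queue_alloc(&s, &q);
	void *b = queue_alloc(&s, &q);

	queue_free(&s, &q, a);
	queue_free(&s, &q, b);
	printf("free object bitmap: %lx\n", s.free_map);
	return 0;
}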

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/page-flags.h |    5 
 include/linux/slub_def.h   |   47 +-
 init/Kconfig               |   14 
 mm/slub.c                  |  990 ++++++++++++++++++++-------------------------
 4 files changed, 488 insertions(+), 568 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-07-31 17:44:35.111063054 -0500
+++ linux-2.6/mm/slub.c	2010-07-31 18:25:53.244357184 -0500
@@ -1,11 +1,11 @@
 /*
- * SLUB: A slab allocator that limits cache line use instead of queuing
- * objects in per cpu and per node lists.
+ * SLUB: The unified slab allocator.
  *
  * The allocator synchronizes using per slab locks and only
  * uses a centralized lock to manage a pool of partial slabs.
  *
  * (C) 2007 SGI, Christoph Lameter
+ * (C) 2010 Linux Foundation, Christoph Lameter
  */
 
 #include <linux/mm.h>
@@ -84,27 +84,6 @@
  * minimal so we rely on the page allocators per cpu caches for
  * fast frees and allocs.
  *
- * Overloading of page flags that are otherwise used for LRU management.
- *
- * PageActive 		The slab is frozen and exempt from list processing.
- * 			This means that the slab is dedicated to a purpose
- * 			such as satisfying allocations for a specific
- * 			processor. Objects may be freed in the slab while
- * 			it is frozen but slab_free will then skip the usual
- * 			list operations. It is up to the processor holding
- * 			the slab to integrate the slab into the slab lists
- * 			when the slab is no longer needed.
- *
- * 			One use of this flag is to mark slabs that are
- * 			used for allocations. Then such a slab becomes a cpu
- * 			slab. The cpu slab may be equipped with an additional
- * 			freelist that allows lockless access to
- * 			free objects in addition to the regular freelist
- * 			that requires the slab lock.
- *
- * PageError		Slab requires special handling due to debug
- * 			options set. This moves	slab handling out of
- * 			the fast path and disables lockless freelists.
  */
 
 #define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
@@ -259,38 +238,95 @@
 	return 1;
 }
 
-static inline void *get_freepointer(struct kmem_cache *s, void *object)
-{
-	return *(void **)(object + s->offset);
-}
-
-static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
-{
-	*(void **)(object + s->offset) = fp;
-}
-
 /* Loop over all objects in a slab */
 #define for_each_object(__p, __s, __addr, __objects) \
 	for (__p = (__addr); __p < (__addr) + (__objects) * (__s)->size;\
 			__p += (__s)->size)
 
-/* Scan freelist */
-#define for_each_free_object(__p, __s, __free) \
-	for (__p = (__free); __p; __p = get_freepointer((__s), __p))
-
 /* Determine object index from a given position */
 static inline int slab_index(void *p, struct kmem_cache *s, void *addr)
 {
 	return (p - addr) / s->size;
 }
 
+static inline int map_in_page_struct(struct page *page)
+{
+	return page->objects <= BITS_PER_LONG;
+}
+
+static inline unsigned long *map(struct page *page)
+{
+	if (map_in_page_struct(page))
+		return (unsigned long *)&page->freelist;
+	else
+		return page->freelist;
+}
+
+static inline int map_size(struct page *page)
+{
+	return BITS_TO_LONGS(page->objects) * sizeof(unsigned long);
+}
+
+static inline int available(struct page *page)
+{
+	return bitmap_weight(map(page), page->objects);
+}
+
+static inline int all_objects_available(struct page *page)
+{
+	return bitmap_full(map(page), page->objects);
+}
+
+static inline int all_objects_used(struct page *page)
+{
+	return bitmap_empty(map(page), page->objects);
+}
+
+static inline int inuse(struct page *page)
+{
+	return page->objects - available(page);
+}
+
+/*
+ * Basic queue functions
+ */
+
+static inline void *queue_get(struct kmem_cache_queue *q)
+{
+	return q->object[--q->objects];
+}
+
+static inline void queue_put(struct kmem_cache_queue *q, void *object)
+{
+	q->object[q->objects++] = object;
+}
+
+static inline int queue_full(struct kmem_cache_queue *q)
+{
+	return q->objects == QUEUE_SIZE;
+}
+
+static inline int queue_empty(struct kmem_cache_queue *q)
+{
+	return q->objects == 0;
+}
+
 static inline struct kmem_cache_order_objects oo_make(int order,
 						unsigned long size)
 {
-	struct kmem_cache_order_objects x = {
-		(order << OO_SHIFT) + (PAGE_SIZE << order) / size
-	};
+	struct kmem_cache_order_objects x;
+	unsigned long objects;
+	unsigned long page_size = PAGE_SIZE << order;
+	unsigned long ws = sizeof(unsigned long);
+
+	objects = page_size / size;
+
+	if (objects > BITS_PER_LONG)
+		/* Bitmap must fit into the slab as well */
+		objects = ((page_size / ws) * BITS_PER_LONG) /
+			((size / ws) * BITS_PER_LONG + 1);
 
+	x.x = (order << OO_SHIFT) + objects;
 	return x;
 }
 
@@ -357,10 +393,7 @@
 {
 	struct track *p;
 
-	if (s->offset)
-		p = object + s->offset + sizeof(void *);
-	else
-		p = object + s->inuse;
+	p = object + s->inuse;
 
 	return p + alloc;
 }
@@ -408,8 +441,8 @@
 
 static void print_page_info(struct page *page)
 {
-	printk(KERN_ERR "INFO: Slab 0x%p objects=%u used=%u fp=0x%p flags=0x%04lx\n",
-		page, page->objects, page->inuse, page->freelist, page->flags);
+	printk(KERN_ERR "INFO: Slab 0x%p objects=%u new=%u fp=0x%p flags=0x%04lx\n",
+		page, page->objects, available(page), page->freelist, page->flags);
 
 }
 
@@ -448,8 +481,8 @@
 
 	print_page_info(page);
 
-	printk(KERN_ERR "INFO: Object 0x%p @offset=%tu fp=0x%p\n\n",
-			p, p - addr, get_freepointer(s, p));
+	printk(KERN_ERR "INFO: Object 0x%p @offset=%tu\n\n",
+			p, p - addr);
 
 	if (p > addr + 16)
 		print_section("Bytes b4", p - 16, 16);
@@ -460,10 +493,7 @@
 		print_section("Redzone", p + s->objsize,
 			s->inuse - s->objsize);
 
-	if (s->offset)
-		off = s->offset + sizeof(void *);
-	else
-		off = s->inuse;
+	off = s->inuse;
 
 	if (s->flags & SLAB_STORE_USER)
 		off += 2 * sizeof(struct track);
@@ -557,8 +587,6 @@
  *
  * object address
  * 	Bytes of the object to be managed.
- * 	If the freepointer may overlay the object then the free
- * 	pointer is the first word of the object.
  *
  * 	Poisoning uses 0x6b (POISON_FREE) and the last byte is
  * 	0xa5 (POISON_END)
@@ -574,9 +602,8 @@
  * object + s->inuse
  * 	Meta data starts here.
  *
- * 	A. Free pointer (if we cannot overwrite object on free)
- * 	B. Tracking data for SLAB_STORE_USER
- * 	C. Padding to reach required alignment boundary or at mininum
+ * 	A. Tracking data for SLAB_STORE_USER
+ * 	B. Padding to reach required alignment boundary or at minimum
  * 		one word if debugging is on to be able to detect writes
  * 		before the word boundary.
  *
@@ -594,10 +621,6 @@
 {
 	unsigned long off = s->inuse;	/* The end of info */
 
-	if (s->offset)
-		/* Freepointer is placed after the object. */
-		off += sizeof(void *);
-
 	if (s->flags & SLAB_STORE_USER)
 		/* We also have user information there */
 		off += 2 * sizeof(struct track);
@@ -622,15 +645,42 @@
 		return 1;
 
 	start = page_address(page);
-	length = (PAGE_SIZE << compound_order(page));
-	end = start + length;
-	remainder = length % s->size;
+	end = start + (PAGE_SIZE << compound_order(page));
+
+	/* Check for special case of bitmap at the end of the page */
+	if (!map_in_page_struct(page)) {
+		if ((u8 *)page->freelist > start && (u8 *)page->freelist < end)
+			end = page->freelist;
+		else
+			slab_err(s, page, "pagemap pointer invalid =%p start=%p end=%p objects=%d",
+				page->freelist, start, end, page->objects);
+	}
+
+	length = end - start;
+	remainder = length - page->objects * s->size;
 	if (!remainder)
 		return 1;
 
 	fault = check_bytes(end - remainder, POISON_INUSE, remainder);
-	if (!fault)
-		return 1;
+	if (!fault) {
+		u8 *freelist_end;
+
+		if (map_in_page_struct(page))
+			return 1;
+
+		end = start + (PAGE_SIZE << compound_order(page));
+		freelist_end = page->freelist + map_size(page);
+		remainder = end - freelist_end;
+
+		if (!remainder)
+			return 1;
+
+		fault = check_bytes(freelist_end, POISON_INUSE,
+				remainder);
+		if (!fault)
+			return 1;
+	}
+
 	while (end > fault && end[-1] == POISON_INUSE)
 		end--;
 
@@ -673,25 +723,6 @@
 		 */
 		check_pad_bytes(s, page, p);
 	}
-
-	if (!s->offset && active)
-		/*
-		 * Object and freepointer overlap. Cannot check
-		 * freepointer while object is allocated.
-		 */
-		return 1;
-
-	/* Check free pointer validity */
-	if (!check_valid_pointer(s, page, get_freepointer(s, p))) {
-		object_err(s, page, p, "Freepointer corrupt");
-		/*
-		 * No choice but to zap it and thus lose the remainder
-		 * of the free objects in this slab. May cause
-		 * another error because the object count is now wrong.
-		 */
-		set_freepointer(s, p, NULL);
-		return 0;
-	}
 	return 1;
 }
 
@@ -712,51 +743,45 @@
 			s->name, page->objects, maxobj);
 		return 0;
 	}
-	if (page->inuse > page->objects) {
-		slab_err(s, page, "inuse %u > max %u",
-			s->name, page->inuse, page->objects);
-		return 0;
-	}
+
 	/* Slab_pad_check fixes things up after itself */
 	slab_pad_check(s, page);
 	return 1;
 }
 
 /*
- * Determine if a certain object on a page is on the freelist. Must hold the
- * slab lock to guarantee that the chains are in a consistent state.
+ * Determine if a certain object on a page is on the free map.
  */
-static int on_freelist(struct kmem_cache *s, struct page *page, void *search)
+static int object_marked_free(struct kmem_cache *s, struct page *page, void *search)
+{
+	return test_bit(slab_index(search, s, page_address(page)), map(page));
+}
+
+/* Verify the integrity of the metadata in a slab page */
+static int verify_slab(struct kmem_cache *s, struct page *page)
 {
 	int nr = 0;
-	void *fp = page->freelist;
-	void *object = NULL;
 	unsigned long max_objects;
+	void *start = page_address(page);
+	unsigned long size = PAGE_SIZE << compound_order(page);
 
-	while (fp && nr <= page->objects) {
-		if (fp == search)
-			return 1;
-		if (!check_valid_pointer(s, page, fp)) {
-			if (object) {
-				object_err(s, page, object,
-					"Freechain corrupt");
-				set_freepointer(s, object, NULL);
-				break;
-			} else {
-				slab_err(s, page, "Freepointer corrupt");
-				page->freelist = NULL;
-				page->inuse = page->objects;
-				slab_fix(s, "Freelist cleared");
-				return 0;
-			}
-			break;
-		}
-		object = fp;
-		fp = get_freepointer(s, object);
-		nr++;
+	nr = available(page);
+
+	if (map_in_page_struct(page))
+		max_objects = size / s->size;
+	else {
+		if (page->freelist <= start || page->freelist >= start + size) {
+			slab_err(s, page, "Invalid pointer to bitmap of free objects max_objects=%d!",
+				page->objects);
+			/* Switch to bitmap in page struct */
+			page->objects = max_objects = BITS_PER_LONG;
+			page->freelist = 0L;
+			slab_fix(s, "Slab sized for %d objects. All objects marked in use.",
+				BITS_PER_LONG);
+		} else
+			max_objects = ((void *)page->freelist - start) / s->size;
 	}
 
-	max_objects = (PAGE_SIZE << compound_order(page)) / s->size;
 	if (max_objects > MAX_OBJS_PER_PAGE)
 		max_objects = MAX_OBJS_PER_PAGE;
 
@@ -765,24 +790,19 @@
 			"should be %d", page->objects, max_objects);
 		page->objects = max_objects;
 		slab_fix(s, "Number of objects adjusted.");
+		return 0;
 	}
-	if (page->inuse != page->objects - nr) {
-		slab_err(s, page, "Wrong object count. Counter is %d but "
-			"counted were %d", page->inuse, page->objects - nr);
-		page->inuse = page->objects - nr;
-		slab_fix(s, "Object count adjusted.");
-	}
-	return search == NULL;
+	return 1;
 }
 
 static void trace(struct kmem_cache *s, struct page *page, void *object,
 								int alloc)
 {
 	if (s->flags & SLAB_TRACE) {
-		printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
+		printk(KERN_INFO "TRACE %s %s 0x%p free=%d fp=0x%p\n",
 			s->name,
 			alloc ? "alloc" : "free",
-			object, page->inuse,
+			object, available(page),
 			page->freelist);
 
 		if (!alloc)
@@ -828,14 +848,19 @@
 /*
  * Tracking of fully allocated slabs for debugging purposes.
  */
-static void add_full(struct kmem_cache_node *n, struct page *page)
+static inline void add_full(struct kmem_cache *s,
+		struct kmem_cache_node *n, struct page *page)
 {
+
+	if (!(s->flags & SLAB_STORE_USER))
+		return;
+
 	spin_lock(&n->list_lock);
 	list_add(&page->lru, &n->full);
 	spin_unlock(&n->list_lock);
 }
 
-static void remove_full(struct kmem_cache *s, struct page *page)
+static inline void remove_full(struct kmem_cache *s, struct page *page)
 {
 	struct kmem_cache_node *n;
 
@@ -896,25 +921,30 @@
 	init_tracking(s, object);
 }
 
-static noinline int alloc_debug_processing(struct kmem_cache *s, struct page *page,
-					void *object, unsigned long addr)
+static noinline int alloc_debug_processing(struct kmem_cache *s,
+		 		void *object, unsigned long addr)
 {
+	struct page *page = virt_to_head_page(object);
+
 	if (!check_slab(s, page))
 		goto bad;
 
-	if (!on_freelist(s, page, object)) {
-		object_err(s, page, object, "Object already allocated");
+	if (!check_valid_pointer(s, page, object)) {
+		object_err(s, page, object, "Pointer check fails");
 		goto bad;
 	}
 
-	if (!check_valid_pointer(s, page, object)) {
-		object_err(s, page, object, "Freelist Pointer check fails");
+	if (object_marked_free(s, page, object)) {
+		object_err(s, page, object, "Allocated object still marked free in slab");
 		goto bad;
 	}
 
 	if (!check_object(s, page, object, 0))
 		goto bad;
 
+	if (!verify_slab(s, page))
+		goto bad;
+
 	/* Success perform special debug activities for allocs */
 	if (s->flags & SLAB_STORE_USER)
 		set_track(s, object, TRACK_ALLOC, addr);
@@ -930,8 +960,7 @@
 		 * as used avoids touching the remaining objects.
 		 */
 		slab_fix(s, "Marking all objects used");
-		page->inuse = page->objects;
-		page->freelist = NULL;
+		bitmap_zero(map(page), page->objects);
 	}
 	return 0;
 }
@@ -947,7 +976,7 @@
 		goto fail;
 	}
 
-	if (on_freelist(s, page, object)) {
+	if (object_marked_free(s, page, object)) {
 		object_err(s, page, object, "Object already free");
 		goto fail;
 	}
@@ -970,13 +999,11 @@
 		goto fail;
 	}
 
-	/* Special debug activities for freeing objects */
-	if (!PageSlubFrozen(page) && !page->freelist)
-		remove_full(s, page);
 	if (s->flags & SLAB_STORE_USER)
 		set_track(s, object, TRACK_FREE, addr);
 	trace(s, page, object, 0);
 	init_object(s, object, 0);
+	verify_slab(s, page);
 	return 1;
 
 fail:
@@ -1081,7 +1108,8 @@
 			{ return 1; }
 static inline int check_object(struct kmem_cache *s, struct page *page,
 			void *object, int active) { return 1; }
-static inline void add_full(struct kmem_cache_node *n, struct page *page) {}
+static inline void add_full(struct kmem_cache *s,
+		struct kmem_cache_node *n, struct page *page) {}
 static inline unsigned long kmem_cache_flags(unsigned long objsize,
 	unsigned long flags, const char *name,
 	void (*ctor)(void *))
@@ -1183,8 +1211,8 @@
 {
 	struct page *page;
 	void *start;
-	void *last;
 	void *p;
+	unsigned long size;
 
 	BUG_ON(flags & GFP_SLAB_BUG_MASK);
 
@@ -1196,23 +1224,20 @@
 	inc_slabs_node(s, page_to_nid(page), page->objects);
 	page->slab = s;
 	page->flags |= 1 << PG_slab;
-
 	start = page_address(page);
+	size = PAGE_SIZE << compound_order(page);
 
 	if (unlikely(s->flags & SLAB_POISON))
-		memset(start, POISON_INUSE, PAGE_SIZE << compound_order(page));
+		memset(start, POISON_INUSE, size);
 
-	last = start;
-	for_each_object(p, s, start, page->objects) {
-		setup_object(s, page, last);
-		set_freepointer(s, last, p);
-		last = p;
-	}
-	setup_object(s, page, last);
-	set_freepointer(s, last, NULL);
+	if (!map_in_page_struct(page))
+		page->freelist = start + page->objects * s->size;
+
+	bitmap_fill(map(page), page->objects);
+
+	for_each_object(p, s, start, page->objects)
+		setup_object(s, page, p);
 
-	page->freelist = start;
-	page->inuse = 0;
 out:
 	return page;
 }
@@ -1329,7 +1354,6 @@
 	if (slab_trylock(page)) {
 		list_del(&page->lru);
 		n->nr_partial--;
-		__SetPageSlubFrozen(page);
 		return 1;
 	}
 	return 0;
@@ -1432,114 +1456,144 @@
 }
 
 /*
- * Move a page back to the lists.
- *
- * Must be called with the slab lock held.
- *
- * On exit the slab lock will have been dropped.
+ * Move the vector of objects back to the slab pages they came from
  */
-static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
+void drain_objects(struct kmem_cache *s, void **object, int nr)
 {
-	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
+	int i;
 
-	__ClearPageSlubFrozen(page);
-	if (page->inuse) {
+	for (i = 0 ; i < nr; ) {
 
-		if (page->freelist) {
-			add_partial(n, page, tail);
-			stat(s, tail ? DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD);
-		} else {
-			stat(s, DEACTIVATE_FULL);
-			if (kmem_cache_debug(s) && (s->flags & SLAB_STORE_USER))
-				add_full(n, page);
+		void *p = object[i];
+		struct page *page = virt_to_head_page(p);
+		void *addr = page_address(page);
+		unsigned long size = PAGE_SIZE << compound_order(page);
+		int was_fully_allocated;
+		unsigned long *m;
+		unsigned long offset;
+
+		if (kmem_cache_debug(s) && !PageSlab(page)) {
+			object_err(s, page, object[i], "Object from non-slab page");
+			i++;
+			continue;
 		}
-		slab_unlock(page);
-	} else {
-		stat(s, DEACTIVATE_EMPTY);
-		if (n->nr_partial < s->min_partial) {
+
+		slab_lock(page);
+		m = map(page);
+		was_fully_allocated = bitmap_empty(m, page->objects);
+
+		offset = p - addr;
+
+
+		while (i < nr) {
+
+			int bit;
+			unsigned long new_offset;
+
+			if (offset >= size)
+				break;
+
+			if (kmem_cache_debug(s) && offset % s->size) {
+				object_err(s, page, object[i], "Misaligned object");
+				i++;
+				new_offset = object[i] - addr;
+				continue;
+			}
+
+			bit = offset / s->size;
+
 			/*
-			 * Adding an empty slab to the partial slabs in order
-			 * to avoid page allocator overhead. This slab needs
-			 * to come after the other slabs with objects in
-			 * so that the others get filled first. That way the
-			 * size of the partial list stays small.
-			 *
-			 * kmem_cache_shrink can reclaim any empty slabs from
-			 * the partial list.
-			 */
-			add_partial(n, page, 1);
-			slab_unlock(page);
-		} else {
+			 * Fast loop to fold a sequence of objects into the slab
+			 * avoiding division and virt_to_head_page()
+ 			 */
+			do {
+
+				if (kmem_cache_debug(s)) {
+					if (unlikely(__test_and_set_bit(bit, m)))
+						object_err(s, page, object[i], "Double free");
+				} else
+					__set_bit(bit, m);
+
+				i++;
+				bit++;
+				offset += s->size;
+				new_offset = object[i] - addr;
+
+			} while (new_offset ==  offset && i < nr && new_offset < size);
+
+			offset = new_offset;
+		}
+		if (bitmap_full(m, page->objects)) {
+
+			/* All objects are available now */
+			if (!was_fully_allocated) {
+
+				remove_partial(s, page);
+				stat(s, FREE_REMOVE_PARTIAL);
+			} else
+				remove_full(s, page);
+
 			slab_unlock(page);
-			stat(s, FREE_SLAB);
 			discard_slab(s, page);
+
+  		} else {
+
+			/* Some objects are available now */
+			if (was_fully_allocated) {
+
+				/* Slab had no free objects but has them now */
+				remove_full(s, page);
+				add_partial(get_node(s, page_to_nid(page)), page, 1);
+				stat(s, FREE_ADD_PARTIAL);
+			}
+			slab_unlock(page);
 		}
 	}
 }
 
-/*
- * Remove the cpu slab
- */
-static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
+static inline void drain_queue(struct kmem_cache *s, struct kmem_cache_queue *q, int nr)
 {
-	struct page *page = c->page;
-	int tail = 1;
-
-	if (page->freelist)
-		stat(s, DEACTIVATE_REMOTE_FREES);
-	/*
-	 * Merge cpu freelist into slab freelist. Typically we get here
-	 * because both freelists are empty. So this is unlikely
-	 * to occur.
-	 */
-	while (unlikely(c->freelist)) {
-		void **object;
+	int t = min(nr, q->objects);
 
-		tail = 0;	/* Hot objects. Put the slab first */
+	drain_objects(s, q->object, t);
 
-		/* Retrieve object from cpu_freelist */
-		object = c->freelist;
-		c->freelist = get_freepointer(s, c->freelist);
-
-		/* And put onto the regular freelist */
-		set_freepointer(s, object, page->freelist);
-		page->freelist = object;
-		page->inuse--;
-	}
-	c->page = NULL;
-	unfreeze_slab(s, page, tail);
+	q->objects -= t;
+	if (q->objects)
+		memcpy(q->object, q->object + t,
+					q->objects * sizeof(void *));
 }
-
-static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
+/*
+ * Drain all objects from a per cpu queue
+ */
+static void flush_cpu_objects(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
-	stat(s, CPUSLAB_FLUSH);
-	slab_lock(c->page);
-	deactivate_slab(s, c);
+	drain_queue(s, &c->q, c->q.objects);
+ 	stat(s, QUEUE_FLUSH);
 }
 
 /*
- * Flush cpu slab.
+ * Flush cpu objects.
  *
  * Called from IPI handler with interrupts disabled.
  */
-static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
+static void __flush_cpu_objects(void *d)
 {
-	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+	struct kmem_cache *s = d;
+	struct kmem_cache_cpu *c = __this_cpu_ptr(s->cpu_slab);
 
-	if (likely(c && c->page))
-		flush_slab(s, c);
+	if (c->q.objects)
+		flush_cpu_objects(s, c);
 }
 
-static void flush_cpu_slab(void *d)
+static void flush_all(struct kmem_cache *s)
 {
-	struct kmem_cache *s = d;
-
-	__flush_cpu_slab(s, smp_processor_id());
+	on_each_cpu(__flush_cpu_objects, s, 1);
 }
 
-static void flush_all(struct kmem_cache *s)
+struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s, int n)
 {
-	on_each_cpu(flush_cpu_slab, s, 1);
+	return __alloc_percpu(sizeof(struct kmem_cache_cpu),
+		__alignof__(struct kmem_cache_cpu));
 }
 
 /*
@@ -1557,7 +1611,7 @@
 
 static int count_free(struct page *page)
 {
-	return page->objects - page->inuse;
+	return available(page);
 }
 
 static unsigned long count_partial(struct kmem_cache_node *n,
@@ -1619,139 +1673,149 @@
 }
 
 /*
- * Slow path. The lockless freelist is empty or we need to perform
- * debugging duties.
- *
- * Interrupts are disabled.
- *
- * Processing is still very fast if new objects have been freed to the
- * regular freelist. In that case we simply take over the regular freelist
- * as the lockless freelist and zap the regular freelist.
- *
- * If that is not working then we fall back to the partial lists. We take the
- * first element of the freelist as the object to allocate now and move the
- * rest of the freelist to the lockless freelist.
- *
- * And if we were unable to get a new slab from the partial slab lists then
- * we need to allocate a new slab. This is the slowest path since it involves
- * a call to the page allocator and the setup of a new slab.
- */
-static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
-			  unsigned long addr, struct kmem_cache_cpu *c)
-{
-	void **object;
-	struct page *new;
-
-	/* We handle __GFP_ZERO in the caller */
-	gfpflags &= ~__GFP_ZERO;
-
-	if (!c->page)
-		goto new_slab;
-
-	slab_lock(c->page);
-	if (unlikely(!node_match(c, node)))
-		goto another_slab;
-
-	stat(s, ALLOC_REFILL);
-
-load_freelist:
-	object = c->page->freelist;
-	if (unlikely(!object))
-		goto another_slab;
-	if (kmem_cache_debug(s))
-		goto debug;
-
-	c->freelist = get_freepointer(s, object);
-	c->page->inuse = c->page->objects;
-	c->page->freelist = NULL;
-	c->node = page_to_nid(c->page);
-unlock_out:
-	slab_unlock(c->page);
-	stat(s, ALLOC_SLOWPATH);
-	return object;
+ * Retrieve pointers to nr objects from a slab into the object array.
+ * Slab must be locked.
+ */
+void retrieve_objects(struct kmem_cache *s, struct page *page, void **object, int nr)
+{
+	void *addr = page_address(page);
+	unsigned long *m = map(page);
+
+	while (nr > 0) {
+		int i = find_first_bit(m, page->objects);
+		void *a;
 
-another_slab:
-	deactivate_slab(s, c);
+		__clear_bit(i, m);
+		a = addr + i * s->size;
 
-new_slab:
-	new = get_partial(s, gfpflags, node);
-	if (new) {
-		c->page = new;
-		stat(s, ALLOC_FROM_PARTIAL);
-		goto load_freelist;
-	}
-
-	gfpflags &= gfp_allowed_mask;
-	if (gfpflags & __GFP_WAIT)
-		local_irq_enable();
-
-	new = new_slab(s, gfpflags, node);
-
-	if (gfpflags & __GFP_WAIT)
-		local_irq_disable();
-
-	if (new) {
-		c = __this_cpu_ptr(s->cpu_slab);
-		stat(s, ALLOC_SLAB);
-		if (c->page)
-			flush_slab(s, c);
-		slab_lock(new);
-		__SetPageSlubFrozen(new);
-		c->page = new;
-		goto load_freelist;
+		/*
+		 * Fast loop to get a sequence of objects out of the slab
+		 * without find_first_bit() and multiplication
+		 */
+		do {
+			nr--;
+			object[nr] = a;
+			a += s->size;
+			i++;
+		} while (nr > 0 && i < page->objects && __test_and_clear_bit(i, m));
 	}
-	if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
-		slab_out_of_memory(s, gfpflags, node);
-	return NULL;
-debug:
-	if (!alloc_debug_processing(s, c->page, object, addr))
-		goto another_slab;
+}
+
+static inline void refill_queue(struct kmem_cache *s,
+		struct kmem_cache_queue *q, struct page *page, int nr)
+{
+	int d;
 
-	c->page->inuse++;
-	c->page->freelist = get_freepointer(s, object);
-	c->node = -1;
-	goto unlock_out;
+	d = min(BATCH_SIZE - q->objects, nr);
+	retrieve_objects(s, page, q->object + q->objects, d);
+	q->objects += d;
 }
 
-/*
- * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
- * have the fastpath folded into their functions. So no function call
- * overhead for requests that can be satisfied on the fastpath.
- *
- * The fastpath works by first checking if the lockless freelist can be used.
- * If not then __slab_alloc is called for slow processing.
- *
- * Otherwise we can simply pick the next object from the lockless free list.
- */
-static __always_inline void *slab_alloc(struct kmem_cache *s,
+void to_lists(struct kmem_cache *s, struct page *page, int tail)
+{
+	if (!all_objects_used(page))
+
+		add_partial(get_node(s, page_to_nid(page)), page, tail);
+
+	else
+		add_full(s, get_node(s, page_to_nid(page)), page);
+}
+
+/* Handling of objects from other nodes */
+
+static void slab_free_alien(struct kmem_cache *s,
+	struct kmem_cache_cpu *c, struct page *page, void *object, int node)
+{
+#ifdef CONFIG_NUMA
+	/* Direct free to the slab */
+	drain_objects(s, &object, 1);
+#endif
+}
+
+/* Generic allocation */
+
+static void *slab_alloc(struct kmem_cache *s,
 		gfp_t gfpflags, int node, unsigned long addr)
 {
-	void **object;
+	void *object;
 	struct kmem_cache_cpu *c;
+	struct kmem_cache_queue *q;
 	unsigned long flags;
 
-	if (!slab_pre_alloc_hook(s, gfpflags))
+	if (slab_pre_alloc_hook(s, gfpflags))
 		return NULL;
 
+redo:
 	local_irq_save(flags);
 	c = __this_cpu_ptr(s->cpu_slab);
-	object = c->freelist;
-	if (unlikely(!object || !node_match(c, node)))
+	q = &c->q;
+	if (unlikely(queue_empty(q) || !node_match(c, node))) {
 
-		object = __slab_alloc(s, gfpflags, node, addr, c);
+		if (unlikely(!node_match(c, node))) {
+			flush_cpu_objects(s, c);
+			c->node = node;
+		}
 
-	else {
-		c->freelist = get_freepointer(s, object);
+		while (q->objects < BATCH_SIZE) {
+			struct page *new;
+
+			new = get_partial(s, gfpflags & ~__GFP_ZERO, node);
+			if (unlikely(!new)) {
+
+				gfpflags &= gfp_allowed_mask;
+
+				if (gfpflags & __GFP_WAIT)
+					local_irq_enable();
+
+				new = new_slab(s, gfpflags, node);
+
+				if (gfpflags & __GFP_WAIT)
+					local_irq_disable();
+
+				/* process may have moved to different cpu */
+				c = __this_cpu_ptr(s->cpu_slab);
+				q = &c->q;
+
+ 				if (!new) {
+					if (queue_empty(q))
+						goto oom;
+					break;
+				}
+				stat(s, ALLOC_SLAB);
+				slab_lock(new);
+			} else
+				stat(s, ALLOC_FROM_PARTIAL);
+
+			refill_queue(s, q, new, available(new));
+			to_lists(s, new, 1);
+
+			slab_unlock(new);
+		}
+		stat(s, ALLOC_SLOWPATH);
+
+	} else
 		stat(s, ALLOC_FASTPATH);
+
+	object = queue_get(q);
+
+	if (kmem_cache_debug(s)) {
+		if (!alloc_debug_processing(s, object, addr))
+			goto redo;
 	}
 	local_irq_restore(flags);
 
-	if (unlikely(gfpflags & __GFP_ZERO) && object)
+	if (unlikely(gfpflags & __GFP_ZERO))
 		memset(object, 0, s->objsize);
 
 	slab_post_alloc_hook(s, gfpflags, object);
 
 	return object;
+
+oom:
+	local_irq_restore(flags);
+	if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
+		slab_out_of_memory(s, gfpflags, node);
+	return NULL;
 }
 
 void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
@@ -1795,114 +1859,52 @@
 EXPORT_SYMBOL(kmem_cache_alloc_node_notrace);
 #endif
 
-/*
- * Slow patch handling. This may still be called frequently since objects
- * have a longer lifetime than the cpu slabs in most processing loads.
- *
- * So we still attempt to reduce cache line usage. Just take the slab
- * lock and free the item. If there is no additional partial page
- * handling required then we can return immediately.
- */
-static void __slab_free(struct kmem_cache *s, struct page *page,
+static void slab_free(struct kmem_cache *s, struct page *page,
 			void *x, unsigned long addr)
 {
-	void *prior;
-	void **object = (void *)x;
-
-	stat(s, FREE_SLOWPATH);
-	slab_lock(page);
+	struct kmem_cache_cpu *c;
+	struct kmem_cache_queue *q;
+	unsigned long flags;
 
-	if (kmem_cache_debug(s))
-		goto debug;
+	slab_free_hook(s, x);
 
-checks_ok:
-	prior = page->freelist;
-	set_freepointer(s, object, prior);
-	page->freelist = object;
-	page->inuse--;
-
-	if (unlikely(PageSlubFrozen(page))) {
-		stat(s, FREE_FROZEN);
-		goto out_unlock;
-	}
+	local_irq_save(flags);
+	if (kmem_cache_debug(s)
+			&& !free_debug_processing(s, page, x, addr))
+		goto out;
 
-	if (unlikely(!page->inuse))
-		goto slab_empty;
+	slab_free_hook_irq(s, x);
 
-	/*
-	 * Objects left in the slab. If it was not on the partial list before
-	 * then add it.
-	 */
-	if (unlikely(!prior)) {
-		add_partial(get_node(s, page_to_nid(page)), page, 1);
-		stat(s, FREE_ADD_PARTIAL);
-	}
+	c = __this_cpu_ptr(s->cpu_slab);
 
-out_unlock:
-	slab_unlock(page);
-	return;
+	if (NUMA_BUILD) {
+		int node = page_to_nid(page);
 
-slab_empty:
-	if (prior) {
-		/*
-		 * Slab still on the partial list.
-		 */
-		remove_partial(s, page);
-		stat(s, FREE_REMOVE_PARTIAL);
+		if (unlikely(node != c->node)) {
+			slab_free_alien(s, c, page, x, node);
+			stat(s, FREE_ALIEN);
+			goto out;
+		}
 	}
-	slab_unlock(page);
-	stat(s, FREE_SLAB);
-	discard_slab(s, page);
-	return;
-
-debug:
-	if (!free_debug_processing(s, page, x, addr))
-		goto out_unlock;
-	goto checks_ok;
-}
-
-/*
- * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
- * can perform fastpath freeing without additional function calls.
- *
- * The fastpath is only possible if we are freeing to the current cpu slab
- * of this processor. This typically the case if we have just allocated
- * the item before.
- *
- * If fastpath is not possible then fall back to __slab_free where we deal
- * with all sorts of special processing.
- */
-static __always_inline void slab_free(struct kmem_cache *s,
-			struct page *page, void *x, unsigned long addr)
-{
-	void **object = (void *)x;
-	struct kmem_cache_cpu *c;
-	unsigned long flags;
 
-	slab_free_hook(s, x);
+	q = &c->q;
 
-	local_irq_save(flags);
-	c = __this_cpu_ptr(s->cpu_slab);
+	if (unlikely(queue_full(q))) {
 
-	slab_free_hook_irq(s, x);
+		drain_queue(s, q, BATCH_SIZE);
+		stat(s, FREE_SLOWPATH);
 
-	if (likely(page == c->page && c->node >= 0)) {
-		set_freepointer(s, object, c->freelist);
-		c->freelist = object;
-		stat(s, FREE_FASTPATH);
 	} else
-		__slab_free(s, page, x, addr);
+		stat(s, FREE_FASTPATH);
 
+	queue_put(q, x);
+out:
 	local_irq_restore(flags);
 }
 
 void kmem_cache_free(struct kmem_cache *s, void *x)
 {
-	struct page *page;
-
-	page = virt_to_head_page(x);
-
-	slab_free(s, page, x, _RET_IP_);
+	slab_free(s, virt_to_head_page(x), x, _RET_IP_);
 
 	trace_kmem_cache_free(_RET_IP_, x);
 }
@@ -1920,11 +1922,6 @@
 }
 
 /*
- * Object placement in a slab is made very easy because we always start at
- * offset 0. If we tune the size of the object to the alignment then we can
- * get the required alignment by putting one properly sized object after
- * another.
- *
  * Notice that the allocation order determines the sizes of the per cpu
  * caches. Each processor has always one slab available for allocations.
  * Increasing the allocation order reduces the number of times that slabs
@@ -2019,7 +2016,7 @@
 	 */
 	min_objects = slub_min_objects;
 	if (!min_objects)
-		min_objects = 4 * (fls(nr_cpu_ids) + 1);
+		min_objects = min(BITS_PER_LONG, 4 * (fls(nr_cpu_ids) + 1));
 	max_objects = (PAGE_SIZE << slub_max_order)/size;
 	min_objects = min(min_objects, max_objects);
 
@@ -2131,10 +2128,7 @@
 				"in order to be able to continue\n");
 	}
 
-	n = page->freelist;
-	BUG_ON(!n);
-	page->freelist = get_freepointer(kmem_cache_node, n);
-	page->inuse++;
+	retrieve_objects(kmem_cache_node, page, (void **)&n, 1);
 	kmem_cache_node->node[node] = n;
 #ifdef CONFIG_SLUB_DEBUG
 	init_object(kmem_cache_node, n, 1);
@@ -2219,10 +2213,11 @@
 static int calculate_sizes(struct kmem_cache *s, int forced_order)
 {
 	unsigned long flags = s->flags;
-	unsigned long size = s->objsize;
+	unsigned long size;
 	unsigned long align = s->align;
 	int order;
 
+	size = s->objsize;
 	/*
 	 * Round up object size to the next word boundary. We can only
 	 * place the free pointer at word boundaries and this determines
@@ -2254,24 +2249,10 @@
 
 	/*
 	 * With that we have determined the number of bytes in actual use
-	 * by the object. This is the potential offset to the free pointer.
+	 * by the object.
 	 */
 	s->inuse = size;
 
-	if (((flags & (SLAB_DESTROY_BY_RCU | SLAB_POISON)) ||
-		s->ctor)) {
-		/*
-		 * Relocate free pointer after the object if it is not
-		 * permitted to overwrite the first word of the object on
-		 * kmem_cache_free.
-		 *
-		 * This is the case if we do RCU, have a constructor or
-		 * destructor or are poisoning the objects.
-		 */
-		s->offset = size;
-		size += sizeof(void *);
-	}
-
 #ifdef CONFIG_SLUB_DEBUG
 	if (flags & SLAB_STORE_USER)
 		/*
@@ -2357,7 +2338,6 @@
 		 */
 		if (get_order(s->size) > get_order(s->objsize)) {
 			s->flags &= ~DEBUG_METADATA_FLAGS;
-			s->offset = 0;
 			if (!calculate_sizes(s, -1))
 				goto error;
 		}
@@ -2382,9 +2362,9 @@
 error:
 	if (flags & SLAB_PANIC)
 		panic("Cannot create slab %s size=%lu realsize=%u "
-			"order=%u offset=%u flags=%lx\n",
+			"order=%u flags=%lx\n",
 			s->name, (unsigned long)size, s->size, oo_order(s->oo),
-			s->offset, flags);
+			flags);
 	return 0;
 }
 
@@ -2438,19 +2418,14 @@
 #ifdef CONFIG_SLUB_DEBUG
 	void *addr = page_address(page);
 	void *p;
-	long *map = kzalloc(BITS_TO_LONGS(page->objects) * sizeof(long),
-			    GFP_ATOMIC);
+	long *m = map(page);
 
-	if (!map)
-		return;
 	slab_err(s, page, "%s", text);
 	slab_lock(page);
-	for_each_free_object(p, s, page->freelist)
-		set_bit(slab_index(p, s, addr), map);
 
 	for_each_object(p, s, addr, page->objects) {
 
-		if (!test_bit(slab_index(p, s, addr), map)) {
+		if (!test_bit(slab_index(p, s, addr), m)) {
 			printk(KERN_ERR "INFO: Object 0x%p @offset=%tu\n",
 							p, p - addr);
 			print_tracking(s, p);
@@ -2471,7 +2446,7 @@
 
 	spin_lock_irqsave(&n->list_lock, flags);
 	list_for_each_entry_safe(page, h, &n->partial, lru) {
-		if (!page->inuse) {
+		if (all_objects_available(page)) {
 			list_del(&page->lru);
 			discard_slab(s, page);
 			n->nr_partial--;
@@ -2829,7 +2804,7 @@
 		 * list_lock. page->inuse here is the upper limit.
 		 */
 		list_for_each_entry_safe(page, t, &n->partial, lru) {
-			if (!page->inuse && slab_trylock(page)) {
+			if (all_objects_available(page) && slab_trylock(page)) {
 				/*
 				 * Must hold slab lock here because slab_free
 				 * may have freed the last object and be
@@ -2841,7 +2816,7 @@
 				discard_slab(s, page);
 			} else {
 				list_move(&page->lru,
-				slabs_by_inuse + page->inuse);
+				slabs_by_inuse + inuse(page));
 			}
 		}
 
@@ -3322,7 +3297,7 @@
 		down_read(&slub_lock);
 		list_for_each_entry(s, &slab_caches, list) {
 			local_irq_save(flags);
-			__flush_cpu_slab(s, cpu);
+			flush_cpu_objects(s, per_cpu_ptr(s->cpu_slab ,cpu));
 			local_irq_restore(flags);
 		}
 		up_read(&slub_lock);
@@ -3392,7 +3367,7 @@
 #ifdef CONFIG_SLUB_DEBUG
 static int count_inuse(struct page *page)
 {
-	return page->inuse;
+	return inuse(page);
 }
 
 static int count_total(struct page *page)
@@ -3400,54 +3375,52 @@
 	return page->objects;
 }
 
-static int validate_slab(struct kmem_cache *s, struct page *page,
-						unsigned long *map)
+static int validate_slab(struct kmem_cache *s, struct page *page)
 {
 	void *p;
 	void *addr = page_address(page);
+	unsigned long *m = map(page);
+	unsigned long errors = 0;
 
-	if (!check_slab(s, page) ||
-			!on_freelist(s, page, NULL))
+	if (!check_slab(s, page) || !verify_slab(s, page))
 		return 0;
 
-	/* Now we know that a valid freelist exists */
-	bitmap_zero(map, page->objects);
+	for_each_object(p, s, addr, page->objects) {
+		int bit = slab_index(p, s, addr);
+		int used = !test_bit(bit, m);
 
-	for_each_free_object(p, s, page->freelist) {
-		set_bit(slab_index(p, s, addr), map);
-		if (!check_object(s, page, p, 0))
-			return 0;
+		if (!check_object(s, page, p, used))
+			errors++;
 	}
 
-	for_each_object(p, s, addr, page->objects)
-		if (!test_bit(slab_index(p, s, addr), map))
-			if (!check_object(s, page, p, 1))
-				return 0;
-	return 1;
+	return errors;
 }
 
-static void validate_slab_slab(struct kmem_cache *s, struct page *page,
-						unsigned long *map)
+static unsigned long validate_slab_slab(struct kmem_cache *s, struct page *page)
 {
+	unsigned long errors = 0;
+
 	if (slab_trylock(page)) {
-		validate_slab(s, page, map);
+		errors = validate_slab(s, page);
 		slab_unlock(page);
 	} else
 		printk(KERN_INFO "SLUB %s: Skipped busy slab 0x%p\n",
 			s->name, page);
+	return errors;
 }
 
 static int validate_slab_node(struct kmem_cache *s,
-		struct kmem_cache_node *n, unsigned long *map)
+		struct kmem_cache_node *n)
 {
 	unsigned long count = 0;
 	struct page *page;
 	unsigned long flags;
+	unsigned long errors = 0;
 
 	spin_lock_irqsave(&n->list_lock, flags);
 
 	list_for_each_entry(page, &n->partial, lru) {
-		validate_slab_slab(s, page, map);
+		errors += validate_slab_slab(s, page);
 		count++;
 	}
 	if (count != n->nr_partial)
@@ -3458,7 +3431,7 @@
 		goto out;
 
 	list_for_each_entry(page, &n->full, lru) {
-		validate_slab_slab(s, page, map);
+		validate_slab_slab(s, page);
 		count++;
 	}
 	if (count != atomic_long_read(&n->nr_slabs))
@@ -3468,26 +3441,20 @@
 
 out:
 	spin_unlock_irqrestore(&n->list_lock, flags);
-	return count;
+	return errors;
 }
 
 static long validate_slab_cache(struct kmem_cache *s)
 {
 	int node;
 	unsigned long count = 0;
-	unsigned long *map = kmalloc(BITS_TO_LONGS(oo_objects(s->max)) *
-				sizeof(unsigned long), GFP_KERNEL);
-
-	if (!map)
-		return -ENOMEM;
 
 	flush_all(s);
 	for_each_node_state(node, N_NORMAL_MEMORY) {
 		struct kmem_cache_node *n = get_node(s, node);
 
-		count += validate_slab_node(s, n, map);
+		count += validate_slab_node(s, n);
 	}
-	kfree(map);
 	return count;
 }
 
@@ -3676,18 +3643,14 @@
 }
 
 static void process_slab(struct loc_track *t, struct kmem_cache *s,
-		struct page *page, enum track_item alloc,
-		long *map)
+		struct page *page, enum track_item alloc)
 {
 	void *addr = page_address(page);
+	unsigned long *m = map(page);
 	void *p;
 
-	bitmap_zero(map, page->objects);
-	for_each_free_object(p, s, page->freelist)
-		set_bit(slab_index(p, s, addr), map);
-
 	for_each_object(p, s, addr, page->objects)
-		if (!test_bit(slab_index(p, s, addr), map))
+		if (!test_bit(slab_index(p, s, addr), m))
 			add_location(t, s, get_track(s, p, alloc));
 }
 
@@ -3698,12 +3661,9 @@
 	unsigned long i;
 	struct loc_track t = { 0, 0, NULL };
 	int node;
-	unsigned long *map = kmalloc(BITS_TO_LONGS(oo_objects(s->max)) *
-				     sizeof(unsigned long), GFP_KERNEL);
 
-	if (!map || !alloc_loc_track(&t, PAGE_SIZE / sizeof(struct location),
+	if (!alloc_loc_track(&t, PAGE_SIZE / sizeof(struct location),
 				     GFP_TEMPORARY)) {
-		kfree(map);
 		return sprintf(buf, "Out of memory\n");
 	}
 	/* Push back cpu slabs */
@@ -3719,9 +3679,9 @@
 
 		spin_lock_irqsave(&n->list_lock, flags);
 		list_for_each_entry(page, &n->partial, lru)
-			process_slab(&t, s, page, alloc, map);
+			process_slab(&t, s, page, alloc);
 		list_for_each_entry(page, &n->full, lru)
-			process_slab(&t, s, page, alloc, map);
+			process_slab(&t, s, page, alloc);
 		spin_unlock_irqrestore(&n->list_lock, flags);
 	}
 
@@ -3772,7 +3732,6 @@
 	}
 
 	free_loc_track(&t);
-	kfree(map);
 	if (!t.count)
 		len += sprintf(buf, "No data\n");
 	return len;
@@ -3788,7 +3747,6 @@
 
 #define SO_ALL		(1 << SL_ALL)
 #define SO_PARTIAL	(1 << SL_PARTIAL)
-#define SO_CPU		(1 << SL_CPU)
 #define SO_OBJECTS	(1 << SL_OBJECTS)
 #define SO_TOTAL	(1 << SL_TOTAL)
 
@@ -3806,30 +3764,6 @@
 		return -ENOMEM;
 	per_cpu = nodes + nr_node_ids;
 
-	if (flags & SO_CPU) {
-		int cpu;
-
-		for_each_possible_cpu(cpu) {
-			struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
-
-			if (!c || c->node < 0)
-				continue;
-
-			if (c->page) {
-					if (flags & SO_TOTAL)
-						x = c->page->objects;
-				else if (flags & SO_OBJECTS)
-					x = c->page->inuse;
-				else
-					x = 1;
-
-				total += x;
-				nodes[c->node] += x;
-			}
-			per_cpu[c->node]++;
-		}
-	}
-
 	if (flags & SO_ALL) {
 		for_each_node_state(node, N_NORMAL_MEMORY) {
 			struct kmem_cache_node *n = get_node(s, node);
@@ -3999,11 +3933,35 @@
 }
 SLAB_ATTR_RO(partial);
 
-static ssize_t cpu_slabs_show(struct kmem_cache *s, char *buf)
+static ssize_t cpu_queues_show(struct kmem_cache *s, char *buf)
 {
-	return show_slab_objects(s, buf, SO_CPU);
+	unsigned long total = 0;
+	int x;
+	int cpu;
+	unsigned long *cpus;
+
+	cpus = kzalloc(1 * sizeof(unsigned long) * nr_cpu_ids, GFP_KERNEL);
+	if (!cpus)
+		return -ENOMEM;
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+
+		total += c->q.objects;
+	}
+
+	x = sprintf(buf, "%lu", total);
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+
+		if (c->q.objects)
+			x += sprintf(buf + x, " C%d=%u", cpu, c->q.objects);
+	}
+	kfree(cpus);
+	return x + sprintf(buf + x, "\n");
 }
-SLAB_ATTR_RO(cpu_slabs);
+SLAB_ATTR_RO(cpu_queues);
 
 static ssize_t objects_show(struct kmem_cache *s, char *buf)
 {
@@ -4297,19 +4255,12 @@
 STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
 STAT_ATTR(FREE_FASTPATH, free_fastpath);
 STAT_ATTR(FREE_SLOWPATH, free_slowpath);
-STAT_ATTR(FREE_FROZEN, free_frozen);
 STAT_ATTR(FREE_ADD_PARTIAL, free_add_partial);
 STAT_ATTR(FREE_REMOVE_PARTIAL, free_remove_partial);
 STAT_ATTR(ALLOC_FROM_PARTIAL, alloc_from_partial);
 STAT_ATTR(ALLOC_SLAB, alloc_slab);
-STAT_ATTR(ALLOC_REFILL, alloc_refill);
 STAT_ATTR(FREE_SLAB, free_slab);
-STAT_ATTR(CPUSLAB_FLUSH, cpuslab_flush);
-STAT_ATTR(DEACTIVATE_FULL, deactivate_full);
-STAT_ATTR(DEACTIVATE_EMPTY, deactivate_empty);
-STAT_ATTR(DEACTIVATE_TO_HEAD, deactivate_to_head);
-STAT_ATTR(DEACTIVATE_TO_TAIL, deactivate_to_tail);
-STAT_ATTR(DEACTIVATE_REMOTE_FREES, deactivate_remote_frees);
+STAT_ATTR(QUEUE_FLUSH, queue_flush);
 STAT_ATTR(ORDER_FALLBACK, order_fallback);
 #endif
 
@@ -4324,7 +4275,7 @@
 	&total_objects_attr.attr,
 	&slabs_attr.attr,
 	&partial_attr.attr,
-	&cpu_slabs_attr.attr,
+	&cpu_queues_attr.attr,
 	&ctor_attr.attr,
 	&aliases_attr.attr,
 	&align_attr.attr,
@@ -4351,19 +4302,12 @@
 	&alloc_slowpath_attr.attr,
 	&free_fastpath_attr.attr,
 	&free_slowpath_attr.attr,
-	&free_frozen_attr.attr,
 	&free_add_partial_attr.attr,
 	&free_remove_partial_attr.attr,
 	&alloc_from_partial_attr.attr,
 	&alloc_slab_attr.attr,
-	&alloc_refill_attr.attr,
 	&free_slab_attr.attr,
-	&cpuslab_flush_attr.attr,
-	&deactivate_full_attr.attr,
-	&deactivate_empty_attr.attr,
-	&deactivate_to_head_attr.attr,
-	&deactivate_to_tail_attr.attr,
-	&deactivate_remote_frees_attr.attr,
+	&queue_flush_attr.attr,
 	&order_fallback_attr.attr,
 #endif
 #ifdef CONFIG_FAILSLAB
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2010-07-31 17:44:35.159064006 -0500
+++ linux-2.6/include/linux/page-flags.h	2010-07-31 17:44:36.775096062 -0500
@@ -125,9 +125,6 @@
 
 	/* SLOB */
 	PG_slob_free = PG_private,
-
-	/* SLUB */
-	PG_slub_frozen = PG_active,
 };
 
 #ifndef __GENERATING_BOUNDS_H
@@ -213,8 +210,6 @@
 
 __PAGEFLAG(SlobFree, slob_free)
 
-__PAGEFLAG(SlubFrozen, slub_frozen)
-
 /*
  * Private page markings that may be used by the filesystem that owns the page
  * for its own purposes.
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2010-07-31 17:44:35.131063451 -0500
+++ linux-2.6/include/linux/slub_def.h	2010-07-31 18:25:28.827872663 -0500
@@ -2,9 +2,10 @@
 #define _LINUX_SLUB_DEF_H
 
 /*
- * SLUB : A Slab allocator without object queues.
+ * SLUB : The Unified Slab allocator.
  *
- * (C) 2007 SGI, Christoph Lameter
+ * (C) 2007-2008 SGI, Christoph Lameter
+ * (C) 2008-2010 Linux Foundation, Christoph Lameter
  */
 #include <linux/types.h>
 #include <linux/gfp.h>
@@ -14,33 +15,36 @@
 #include <linux/kmemleak.h>
 
 enum stat_item {
-	ALLOC_FASTPATH,		/* Allocation from cpu slab */
-	ALLOC_SLOWPATH,		/* Allocation by getting a new cpu slab */
-	FREE_FASTPATH,		/* Free to cpu slub */
-	FREE_SLOWPATH,		/* Freeing not to cpu slab */
-	FREE_FROZEN,		/* Freeing to frozen slab */
-	FREE_ADD_PARTIAL,	/* Freeing moves slab to partial list */
-	FREE_REMOVE_PARTIAL,	/* Freeing removes last object */
-	ALLOC_FROM_PARTIAL,	/* Cpu slab acquired from partial list */
-	ALLOC_SLAB,		/* Cpu slab acquired from page allocator */
-	ALLOC_REFILL,		/* Refill cpu slab from slab freelist */
+	ALLOC_FASTPATH,		/* Allocation from cpu queue */
+	ALLOC_SLOWPATH,		/* Allocation required refilling of queue */
+	FREE_FASTPATH,		/* Free to cpu queue */
+	FREE_SLOWPATH,		/* Required pushing objects out of the queue */
+	FREE_ADD_PARTIAL,	/* Freeing moved slab to partial list */
+	FREE_REMOVE_PARTIAL,	/* Freeing removed from partial list */
+	ALLOC_FROM_PARTIAL,	/* slab with objects acquired from partial */
+	ALLOC_SLAB,		/* New slab acquired from page allocator */
+	FREE_ALIEN,		/* Free to alien node */
 	FREE_SLAB,		/* Slab freed to the page allocator */
-	CPUSLAB_FLUSH,		/* Abandoning of the cpu slab */
-	DEACTIVATE_FULL,	/* Cpu slab was full when deactivated */
-	DEACTIVATE_EMPTY,	/* Cpu slab was empty when deactivated */
-	DEACTIVATE_TO_HEAD,	/* Cpu slab was moved to the head of partials */
-	DEACTIVATE_TO_TAIL,	/* Cpu slab was moved to the tail of partials */
-	DEACTIVATE_REMOTE_FREES,/* Slab contained remotely freed objects */
+	QUEUE_FLUSH,		/* Flushing of the per cpu queue */
 	ORDER_FALLBACK,		/* Number of times fallback was necessary */
 	NR_SLUB_STAT_ITEMS };
 
+#define QUEUE_SIZE 50
+#define BATCH_SIZE 25
+
+/* Queueing structure used for per cpu, l3 cache and alien queueing */
+struct kmem_cache_queue {
+	int objects;		/* Available objects */
+	int max;		/* Queue capacity */
+	void *object[QUEUE_SIZE];
+};
+
 struct kmem_cache_cpu {
-	void **freelist;	/* Pointer to first free per cpu object */
-	struct page *page;	/* The slab from which we are allocating */
-	int node;		/* The node of the page (or -1 for debug) */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
+	int node;		/* objects only from this numa node */
+	struct kmem_cache_queue q;
 };
 
 struct kmem_cache_node {
@@ -72,8 +76,8 @@
 	unsigned long flags;
 	int size;		/* The size of an object including meta data */
 	int objsize;		/* The size of an object without meta data */
-	int offset;		/* Free pointer offset. */
 	struct kmem_cache_order_objects oo;
+	int batch;		/* batch size */
 
 	/* Allocation and freeing of slabs */
 	struct kmem_cache_order_objects max;
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig	2010-07-31 17:44:35.091062658 -0500
+++ linux-2.6/init/Kconfig	2010-07-31 17:44:36.779096141 -0500
@@ -1087,14 +1087,14 @@
 	  per cpu and per node queues.
 
 config SLUB
-	bool "SLUB (Unqueued Allocator)"
+	bool "SLUB (Unified allocator)"
 	help
-	   SLUB is a slab allocator that minimizes cache line usage
-	   instead of managing queues of cached objects (SLAB approach).
-	   Per cpu caching is realized using slabs of objects instead
-	   of queues of objects. SLUB can use memory efficiently
-	   and has enhanced diagnostics. SLUB is the default choice for
-	   a slab allocator.
+	   SLUB is a slab allocator that minimizes metadata and provides
+	   a clean implementation that is faster than SLAB. SLUB has many
+	   of the queueing characteristic of the original SLAB allocator
+	   but uses a bit map to manage objects in slabs. SLUB can use
+	   memory more efficiently and has enhanced diagnostic and
+	   resiliency features compared with SLAB.
 
 config SLOB
 	depends on EMBEDDED


* [S+Q3 15/23] slub: Allow resizing of per cpu queues
  2010-08-04  2:45 [S+Q3 00/23] SLUB: The Unified slab allocator (V3) Christoph Lameter
                   ` (13 preceding siblings ...)
  2010-08-04  2:45 ` [S+Q3 14/23] slub: Add SLAB style per cpu queueing Christoph Lameter
@ 2010-08-04  2:45 ` Christoph Lameter
  2010-08-04  2:45 ` [S+Q3 16/23] slub: Get rid of useless function count_free() Christoph Lameter
                   ` (8 subsequent siblings)
  23 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04  2:45 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, linux-kernel, Nick Piggin, David Rientjes

[-- Attachment #1: unified_resize --]
[-- Type: text/plain, Size: 13063 bytes --]

Allow resizing of the cpu queue and batch size. This follows the same
basic steps as SLAB.

Careful: The ->cpu pointer becomes volatile with this patch. References
to the ->cpu pointer fall into one of three cases (a sketch of the first
two follows the list):

A. Occur with interrupts disabled. This guarantees that nothing on the
   processor itself interferes. This only serializes access to a single
   processor specific area.

B. Occur with slub_lock taken for operations on all per cpu areas.
   Taking the slub_lock guarantees that no resizing operation will occur
   while accessing the percpu areas. The data in the percpu areas
   is volatile even with slub_lock since the alloc and free functions
   do not take slub_lock and will operate on fields of kmem_cache_cpu.

C. Are racy: Tolerable for statistics. The ->cpu pointer must always
   point to a valid kmem_cache_cpu area.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slub_def.h |    9 -
 mm/slub.c                |  218 +++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 197 insertions(+), 30 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-07-31 18:25:53.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-07-31 19:02:05.003563067 -0500
@@ -195,10 +195,19 @@
 
 #endif
 
+/*
+ * We allow stat calls while slub_lock is taken or while interrupts
+ * are enabled for simplicities sake.
+ *
+ * This results in potential inaccuracies. If the platform does not
+ * support per cpu atomic operations vs. interrupts then the counters
+ * may be updated in a racy manner due to slab processing in
+ * interrupts.
+ */
 static inline void stat(struct kmem_cache *s, enum stat_item si)
 {
 #ifdef CONFIG_SLUB_STATS
-	__this_cpu_inc(s->cpu_slab->stat[si]);
+	__this_cpu_inc(s->cpu->stat[si]);
 #endif
 }
 
@@ -303,7 +312,7 @@
 
 static inline int queue_full(struct kmem_cache_queue *q)
 {
-	return q->objects == QUEUE_SIZE;
+	return q->objects == q->max;
 }
 
 static inline int queue_empty(struct kmem_cache_queue *q)
@@ -1571,6 +1580,11 @@
  	stat(s, QUEUE_FLUSH);
 }
 
+struct flush_control {
+	struct kmem_cache *s;
+	struct kmem_cache_cpu *c;
+};
+
 /*
  * Flush cpu objects.
  *
@@ -1578,22 +1592,96 @@
  */
 static void __flush_cpu_objects(void *d)
 {
-	struct kmem_cache *s = d;
-	struct kmem_cache_cpu *c = __this_cpu_ptr(s->cpu_slab);
+	struct flush_control *f = d;
+	struct kmem_cache_cpu *c = __this_cpu_ptr(f->c);
 
 	if (c->q.objects)
-		flush_cpu_objects(s, c);
+		flush_cpu_objects(f->s, c);
 }
 
 static void flush_all(struct kmem_cache *s)
 {
-	on_each_cpu(__flush_cpu_objects, s, 1);
+	struct flush_control f = { s, s->cpu };
+
+	on_each_cpu(__flush_cpu_objects, &f, 1);
 }
 
 struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s, int n)
 {
-	return __alloc_percpu(sizeof(struct kmem_cache_cpu),
-		__alignof__(struct kmem_cache_cpu));
+	struct kmem_cache_cpu *k;
+	int cpu;
+	int size;
+	int max;
+
+	/* Size the queue and the allocation to cacheline sizes */
+	size = ALIGN(n * sizeof(void *) + sizeof(struct kmem_cache_cpu), cache_line_size());
+
+	k = __alloc_percpu(size, cache_line_size());
+	if (!k)
+		return NULL;
+
+	max = (size - sizeof(struct kmem_cache_cpu)) / sizeof(void *);
+
+	for_each_possible_cpu(cpu) {
+		struct kmem_cache_cpu *c = per_cpu_ptr(k, cpu);
+
+		c->q.max = max;
+	}
+
+	s->cpu_queue = max;
+	return k;
+}
+
+
+static void resize_cpu_queue(struct kmem_cache *s, int queue)
+{
+	struct kmem_cache_cpu *n = alloc_kmem_cache_cpu(s, queue);
+	struct flush_control f;
+
+	/* Create the new cpu queue and then free the old one */
+	f.s = s;
+	f.c = s->cpu;
+
+	/* We can only shrink the queue here since the new
+	 * queue size may be smaller and there may be concurrent
+	 * slab operations. The update of the queue must be seen
+	 * before the change of the location of the percpu queue.
+	 *
+	 * Note that the queue may contain more object than the
+	 * queue size after this operation.
+	 */
+	if (queue < s->queue) {
+		s->queue = queue;
+		s->batch = (s->queue + 1) / 2;
+		barrier();
+	}
+
+	/* This is critical since allocation and free runs
+	 * concurrently without taking the slub_lock!
+	 * We point the cpu pointer to a different per cpu
+	 * segment to redirect current processing and then
+	 * flush the cpu objects on the old cpu structure.
+	 *
+	 * The old percpu structure is no longer reachable
+	 * since slab_alloc/free must have terminated in order
+	 * to execute __flush_cpu_objects. Both require
+	 * interrupts to be disabled.
+	 */
+	s->cpu = n;
+	on_each_cpu(__flush_cpu_objects, &f, 1);
+
+	/*
+	 * If the queue needs to be extended then we deferred
+	 * the update until now when the larger sized queue
+	 * has been allocated and is working.
+	 */
+	if (queue > s->queue) {
+		s->queue = queue;
+		s->batch = (s->queue + 1) / 2;
+	}
+
+	if (slab_state > UP)
+		free_percpu(f.c);
 }
 
 /*
@@ -1706,7 +1794,7 @@
 {
 	int d;
 
-	d = min(BATCH_SIZE - q->objects, nr);
+	d = min(s->batch - q->objects, nr);
 	retrieve_objects(s, page, q->object + q->objects, d);
 	q->objects += d;
 }
@@ -1747,7 +1835,7 @@
 
 redo:
 	local_irq_save(flags);
-	c = __this_cpu_ptr(s->cpu_slab);
+	c = __this_cpu_ptr(s->cpu);
 	q = &c->q;
 	if (unlikely(queue_empty(q) || !node_match(c, node))) {
 
@@ -1756,7 +1844,7 @@
 			c->node = node;
 		}
 
-		while (q->objects < BATCH_SIZE) {
+		while (q->objects < s->batch) {
 			struct page *new;
 
 			new = get_partial(s, gfpflags & ~__GFP_ZERO, node);
@@ -1773,7 +1861,7 @@
 					local_irq_disable();
 
 				/* process may have moved to different cpu */
-				c = __this_cpu_ptr(s->cpu_slab);
+				c = __this_cpu_ptr(s->cpu);
 				q = &c->q;
 
  				if (!new) {
@@ -1875,7 +1963,7 @@
 
 	slab_free_hook_irq(s, x);
 
-	c = __this_cpu_ptr(s->cpu_slab);
+	c = __this_cpu_ptr(s->cpu);
 
 	if (NUMA_BUILD) {
 		int node = page_to_nid(page);
@@ -1891,7 +1979,7 @@
 
 	if (unlikely(queue_full(q))) {
 
-		drain_queue(s, q, BATCH_SIZE);
+		drain_queue(s, q, s->batch);
 		stat(s, FREE_SLOWPATH);
 
 	} else
@@ -2093,9 +2181,9 @@
 	BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
 			SLUB_PAGE_SHIFT * sizeof(struct kmem_cache));
 
-	s->cpu_slab = alloc_percpu(struct kmem_cache_cpu);
+	s->cpu = alloc_kmem_cache_cpu(s, s->queue);
 
-	return s->cpu_slab != NULL;
+	return s->cpu != NULL;
 }
 
 #ifdef CONFIG_NUMA
@@ -2317,6 +2405,18 @@
 
 }
 
+static int initial_queue_size(int size)
+{
+	if (size > PAGE_SIZE)
+		return 8;
+	else if (size > 1024)
+		return 24;
+	else if (size > 256)
+		return 54;
+	else
+		return 120;
+}
+
 static int kmem_cache_open(struct kmem_cache *s,
 		const char *name, size_t size,
 		size_t align, unsigned long flags,
@@ -2355,6 +2455,9 @@
 	if (!init_kmem_cache_nodes(s))
 		goto error;
 
+	s->queue = initial_queue_size(s->size);
+	s->batch = (s->queue + 1) / 2;
+
 	if (alloc_kmem_cache_cpus(s))
 		return 1;
 
@@ -2465,8 +2568,9 @@
 {
 	int node;
 
+	down_read(&slub_lock);
 	flush_all(s);
-	free_percpu(s->cpu_slab);
+	free_percpu(s->cpu);
 	/* Attempt to free all objects */
 	for_each_node_state(node, N_NORMAL_MEMORY) {
 		struct kmem_cache_node *n = get_node(s, node);
@@ -2476,6 +2580,7 @@
 			return 1;
 	}
 	free_kmem_cache_nodes(s);
+	up_read(&slub_lock);
 	return 0;
 }
 
@@ -3122,6 +3227,7 @@
 		caches++;
 	}
 
+	/* Now the kmalloc array is fully functional (*not* the dma array) */
 	slab_state = UP;
 
 	/* Provide the correct kmalloc names now that the caches are up */
@@ -3149,6 +3255,7 @@
 #ifdef CONFIG_ZONE_DMA
 	int i;
 
+	/* Create the dma kmalloc array and make it operational */
 	for (i = 0; i < SLUB_PAGE_SHIFT; i++) {
 		struct kmem_cache *s = kmalloc_caches[i];
 
@@ -3297,7 +3404,7 @@
 		down_read(&slub_lock);
 		list_for_each_entry(s, &slab_caches, list) {
 			local_irq_save(flags);
-			flush_cpu_objects(s, per_cpu_ptr(s->cpu_slab ,cpu));
+			flush_cpu_objects(s, per_cpu_ptr(s->cpu, cpu));
 			local_irq_restore(flags);
 		}
 		up_read(&slub_lock);
@@ -3764,6 +3871,7 @@
 		return -ENOMEM;
 	per_cpu = nodes + nr_node_ids;
 
+	down_read(&slub_lock);
 	if (flags & SO_ALL) {
 		for_each_node_state(node, N_NORMAL_MEMORY) {
 			struct kmem_cache_node *n = get_node(s, node);
@@ -3794,6 +3902,7 @@
 			nodes[node] += x;
 		}
 	}
+
 	x = sprintf(buf, "%lu", total);
 #ifdef CONFIG_NUMA
 	for_each_node_state(node, N_NORMAL_MEMORY)
@@ -3801,6 +3910,7 @@
 			x += sprintf(buf + x, " N%d=%lu",
 					node, nodes[node]);
 #endif
+	up_read(&slub_lock);
 	kfree(nodes);
 	return x + sprintf(buf + x, "\n");
 }
@@ -3904,6 +4014,57 @@
 }
 SLAB_ATTR(min_partial);
 
+static ssize_t cpu_queue_size_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%u\n", s->queue);
+}
+
+static ssize_t cpu_queue_size_store(struct kmem_cache *s,
+			 const char *buf, size_t length)
+{
+	unsigned long queue;
+	int err;
+
+	err = strict_strtoul(buf, 10, &queue);
+	if (err)
+		return err;
+
+	if (queue > 10000 || queue < 4)
+		return -EINVAL;
+
+	if (s->batch > queue)
+		s->batch = queue;
+
+	down_write(&slub_lock);
+	resize_cpu_queue(s, queue);
+	up_write(&slub_lock);
+	return length;
+}
+SLAB_ATTR(cpu_queue_size);
+
+static ssize_t cpu_batch_size_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%u\n", s->batch);
+}
+
+static ssize_t cpu_batch_size_store(struct kmem_cache *s,
+			 const char *buf, size_t length)
+{
+	unsigned long batch;
+	int err;
+
+	err = strict_strtoul(buf, 10, &batch);
+	if (err)
+		return err;
+
+	if (batch < s->queue || batch < 4)
+		return -EINVAL;
+
+	s->batch = batch;
+	return length;
+}
+SLAB_ATTR(cpu_batch_size);
+
 static ssize_t ctor_show(struct kmem_cache *s, char *buf)
 {
 	if (s->ctor) {
@@ -3944,8 +4105,9 @@
 	if (!cpus)
 		return -ENOMEM;
 
+	down_read(&slub_lock);
 	for_each_online_cpu(cpu) {
-		struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+		struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu, cpu);
 
 		total += c->q.objects;
 	}
@@ -3953,11 +4115,14 @@
 	x = sprintf(buf, "%lu", total);
 
 	for_each_online_cpu(cpu) {
-		struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+		struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu, cpu);
+		struct kmem_cache_queue *q = &c->q;
 
-		if (c->q.objects)
-			x += sprintf(buf + x, " C%d=%u", cpu, c->q.objects);
+		if (!queue_empty(q))
+			x += sprintf(buf + x, " C%d=%u/%u",
+				cpu, q->objects, q->max);
 	}
+	up_read(&slub_lock);
 	kfree(cpus);
 	return x + sprintf(buf + x, "\n");
 }
@@ -4209,12 +4374,14 @@
 	if (!data)
 		return -ENOMEM;
 
+	down_read(&slub_lock);
 	for_each_online_cpu(cpu) {
-		unsigned x = per_cpu_ptr(s->cpu_slab, cpu)->stat[si];
+		unsigned x = per_cpu_ptr(s->cpu, cpu)->stat[si];
 
 		data[cpu] = x;
 		sum += x;
 	}
+	up_read(&slub_lock);
 
 	len = sprintf(buf, "%lu", sum);
 
@@ -4232,8 +4399,10 @@
 {
 	int cpu;
 
+	down_write(&slub_lock);
 	for_each_online_cpu(cpu)
-		per_cpu_ptr(s->cpu_slab, cpu)->stat[si] = 0;
+		per_cpu_ptr(s->cpu, cpu)->stat[si] = 0;
+	up_write(&slub_lock);
 }
 
 #define STAT_ATTR(si, text) 					\
@@ -4270,6 +4439,8 @@
 	&objs_per_slab_attr.attr,
 	&order_attr.attr,
 	&min_partial_attr.attr,
+	&cpu_queue_size_attr.attr,
+	&cpu_batch_size_attr.attr,
 	&objects_attr.attr,
 	&objects_partial_attr.attr,
 	&total_objects_attr.attr,
@@ -4631,7 +4802,7 @@
 	seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", s->name, nr_inuse,
 		   nr_objs, s->size, oo_objects(s->oo),
 		   (1 << oo_order(s->oo)));
-	seq_printf(m, " : tunables %4u %4u %4u", 0, 0, 0);
+	seq_printf(m, " : tunables %4u %4u %4u", s->queue, s->batch, 0);
 	seq_printf(m, " : slabdata %6lu %6lu %6lu", nr_slabs, nr_slabs,
 		   0UL);
 	seq_putc(m, '\n');
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2010-07-31 18:25:28.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2010-07-31 19:00:58.738236361 -0500
@@ -29,14 +29,11 @@
 	ORDER_FALLBACK,		/* Number of times fallback was necessary */
 	NR_SLUB_STAT_ITEMS };
 
-#define QUEUE_SIZE 50
-#define BATCH_SIZE 25
-
 /* Queueing structure used for per cpu, l3 cache and alien queueing */
 struct kmem_cache_queue {
 	int objects;		/* Available objects */
 	int max;		/* Queue capacity */
-	void *object[QUEUE_SIZE];
+	void *object[];
 };
 
 struct kmem_cache_cpu {
@@ -71,7 +68,7 @@
  * Slab cache management.
  */
 struct kmem_cache {
-	struct kmem_cache_cpu *cpu_slab;
+	struct kmem_cache_cpu *cpu;
 	/* Used for retriving partial slabs etc */
 	unsigned long flags;
 	int size;		/* The size of an object including meta data */
@@ -87,6 +84,8 @@
 	void (*ctor)(void *);
 	int inuse;		/* Offset to metadata */
 	int align;		/* Alignment */
+	int queue;		/* specified queue size */
+	int cpu_queue;		/* cpu queue size */
 	unsigned long min_partial;
 	const char *name;	/* Name (only for display!) */
 	struct list_head list;	/* List of slab caches */


* [S+Q3 16/23] slub: Get rid of useless function count_free()
  2010-08-04  2:45 [S+Q3 00/23] SLUB: The Unified slab allocator (V3) Christoph Lameter
                   ` (14 preceding siblings ...)
  2010-08-04  2:45 ` [S+Q3 15/23] slub: Allow resizing of per cpu queues Christoph Lameter
@ 2010-08-04  2:45 ` Christoph Lameter
  2010-08-04  2:45 ` [S+Q3 17/23] slub: Remove MAX_OBJS limitation Christoph Lameter
                   ` (7 subsequent siblings)
  23 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04  2:45 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, linux-kernel, Nick Piggin, David Rientjes

[-- Attachment #1: unified_drop_count_free --]
[-- Type: text/plain, Size: 1585 bytes --]

count_free() is identical to available(), so drop it and use available()
directly.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |   11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-07-30 18:44:54.767739966 -0500
+++ linux-2.6/mm/slub.c	2010-07-30 18:45:24.248349179 -0500
@@ -1697,11 +1697,6 @@
 	return 1;
 }
 
-static int count_free(struct page *page)
-{
-	return available(page);
-}
-
 static unsigned long count_partial(struct kmem_cache_node *n,
 					int (*get_count)(struct page *))
 {
@@ -1750,7 +1745,7 @@
 		if (!n)
 			continue;
 
-		nr_free  = count_partial(n, count_free);
+		nr_free  = count_partial(n, available);
 		nr_slabs = node_nr_slabs(n);
 		nr_objs  = node_nr_objs(n);
 
@@ -3906,7 +3901,7 @@
 			x = atomic_long_read(&n->total_objects);
 		else if (flags & SO_OBJECTS)
 			x = atomic_long_read(&n->total_objects) -
-				count_partial(n, count_free);
+				count_partial(n, available);
 
 			else
 				x = atomic_long_read(&n->nr_slabs);
@@ -4792,7 +4787,7 @@
 		nr_partials += n->nr_partial;
 		nr_slabs += atomic_long_read(&n->nr_slabs);
 		nr_objs += atomic_long_read(&n->total_objects);
-		nr_free += count_partial(n, count_free);
+		nr_free += count_partial(n, available);
 	}
 
 	nr_inuse = nr_objs - nr_free;


* [S+Q3 17/23] slub: Remove MAX_OBJS limitation
  2010-08-04  2:45 [S+Q3 00/23] SLUB: The Unified slab allocator (V3) Christoph Lameter
                   ` (15 preceding siblings ...)
  2010-08-04  2:45 ` [S+Q3 16/23] slub: Get rid of useless function count_free() Christoph Lameter
@ 2010-08-04  2:45 ` Christoph Lameter
  2010-08-04  2:45 ` [S+Q3 18/23] slub: Drop allocator announcement Christoph Lameter
                   ` (6 subsequent siblings)
  23 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04  2:45 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, linux-kernel, Nick Piggin, David Rientjes

[-- Attachment #1: unified_unlimited --]
[-- Type: text/plain, Size: 2130 bytes --]

The "inuse" field in the page struct is no longer needed. Extend the
objects field to 32 bits, allowing a practically unlimited number of
objects per slab.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/mm_types.h |    5 +----
 mm/slub.c                |    7 -------
 2 files changed, 1 insertion(+), 11 deletions(-)

Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h	2010-07-30 18:37:56.171016883 -0500
+++ linux-2.6/include/linux/mm_types.h	2010-07-30 18:45:28.624439565 -0500
@@ -40,10 +40,7 @@
 					 * to show when page is mapped
 					 * & limit reverse map searches.
 					 */
-		struct {		/* SLUB */
-			u16 inuse;
-			u16 objects;
-		};
+		u32 objects;		/* SLUB */
 	};
 	union {
 	    struct {
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-07-30 18:45:24.248349179 -0500
+++ linux-2.6/mm/slub.c	2010-07-30 18:45:28.628439648 -0500
@@ -144,7 +144,6 @@
 
 #define OO_SHIFT	16
 #define OO_MASK		((1 << OO_SHIFT) - 1)
-#define MAX_OBJS_PER_PAGE	65535 /* since page.objects is u16 */
 
 /* Internal SLUB flags */
 #define __OBJECT_POISON		0x80000000UL /* Poison object */
@@ -791,9 +790,6 @@
 			max_objects = ((void *)page->freelist - start) / s->size;
 	}
 
-	if (max_objects > MAX_OBJS_PER_PAGE)
-		max_objects = MAX_OBJS_PER_PAGE;
-
 	if (page->objects != max_objects) {
 		slab_err(s, page, "Wrong number of objects. Found %d but "
 			"should be %d", page->objects, max_objects);
@@ -2060,9 +2056,6 @@
 	int rem;
 	int min_order = slub_min_order;
 
-	if ((PAGE_SIZE << min_order) / size > MAX_OBJS_PER_PAGE)
-		return get_order(size * MAX_OBJS_PER_PAGE) - 1;
-
 	for (order = max(min_order,
 				fls(min_objects * size - 1) - PAGE_SHIFT);
 			order <= max_order; order++) {


* [S+Q3 18/23] slub: Drop allocator announcement
  2010-08-04  2:45 [S+Q3 00/23] SLUB: The Unified slab allocator (V3) Christoph Lameter
                   ` (16 preceding siblings ...)
  2010-08-04  2:45 ` [S+Q3 17/23] slub: Remove MAX_OBJS limitation Christoph Lameter
@ 2010-08-04  2:45 ` Christoph Lameter
  2010-08-04  2:45 ` [S+Q3 19/23] slub: Object based NUMA policies Christoph Lameter
                   ` (5 subsequent siblings)
  23 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04  2:45 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, linux-kernel, Nick Piggin, David Rientjes

[-- Attachment #1: unified_remove_banner --]
[-- Type: text/plain, Size: 1106 bytes --]

People get confused because the output repeats some basic hardware
configuration values. Some of the items listed no
longer have the same relevance in the queued form of SLUB.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |    7 -------
 1 file changed, 7 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-07-30 18:45:28.628439648 -0500
+++ linux-2.6/mm/slub.c	2010-07-30 18:45:32.632522338 -0500
@@ -3229,13 +3229,6 @@
 #ifdef CONFIG_SMP
 	register_cpu_notifier(&slab_notifier);
 #endif
-
-	printk(KERN_INFO
-		"SLUB: Genslabs=%d, HWalign=%d, Order=%d-%d, MinObjects=%d,"
-		" CPUs=%d, Nodes=%d\n",
-		caches, cache_line_size(),
-		slub_min_order, slub_max_order, slub_min_objects,
-		nr_cpu_ids, nr_node_ids);
 }
 
 void __init kmem_cache_init_late(void)


* [S+Q3 19/23] slub: Object based NUMA policies
  2010-08-04  2:45 [S+Q3 00/23] SLUB: The Unified slab allocator (V3) Christoph Lameter
                   ` (17 preceding siblings ...)
  2010-08-04  2:45 ` [S+Q3 18/23] slub: Drop allocator announcement Christoph Lameter
@ 2010-08-04  2:45 ` Christoph Lameter
  2010-08-04  2:45 ` [S+Q3 20/23] slub: Shared cache to exploit cross cpu caching abilities Christoph Lameter
                   ` (4 subsequent siblings)
  23 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04  2:45 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, linux-kernel, Nick Piggin, David Rientjes

[-- Attachment #1: unified_object_based_policies --]
[-- Type: text/plain, Size: 5513 bytes --]

SLUB currently applies memory policies and cpuset restrictions only at the
page level. This patch changes that to apply policies to individual
allocations (like SLAB). This comes at the cost of increased complexity in
the allocator.

The allocation path does not build alien queues (added in a later patch)
and is a bit inefficient since a slab has to be taken from the partial
lists (via lock and unlock) and possibly shifted back after taking one
object out of it.

Memory policies and cpuset redirection are only applied to slabs marked with
SLAB_MEM_SPREAD (also like SLAB).

Use Lee Schermerhorn's new *_mem functionality to always find the nearest
node in case we are running on a memoryless node.
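
As a hypothetical usage sketch (not part of the patch; the cache name and
sizes are invented): only a cache created with SLAB_MEM_SPREAD opts into
the per object redirection, and only allocations that do not request an
explicit node are subject to it.

#include <linux/slab.h>

static struct kmem_cache *spread_cache;

static int spread_example(void)
{
	void *obj;

	/* SLAB_MEM_SPREAD enables per object policy/cpuset spreading */
	spread_cache = kmem_cache_create("spread_example", 256, 0,
					 SLAB_MEM_SPREAD, NULL);
	if (!spread_cache)
		return -ENOMEM;

	/*
	 * No node requested: find_numa_node() may redirect the allocation
	 * to cpuset_mem_spread_node() or to the task mempolicy node.
	 */
	obj = kmem_cache_alloc(spread_cache, GFP_KERNEL);
	if (obj)
		kmem_cache_free(spread_cache, obj);

	kmem_cache_destroy(spread_cache);
	return 0;
}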

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slub_def.h |    3 +
 mm/slub.c                |   94 +++++++++++++++++++++++++++++++++++------------
 2 files changed, 73 insertions(+), 24 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-07-31 18:27:10.913898557 -0500
+++ linux-2.6/mm/slub.c	2010-07-31 18:27:15.733994218 -0500
@@ -1451,7 +1451,7 @@
 static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
 {
 	struct page *page;
-	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
 
 	page = get_partial_node(get_node(s, searchnode));
 	if (page || (flags & __GFP_THISNODE))
@@ -1622,6 +1622,7 @@
 		struct kmem_cache_cpu *c = per_cpu_ptr(k, cpu);
 
 		c->q.max = max;
+		c->node = cpu_to_mem(cpu);
 	}
 
 	s->cpu_queue = max;
@@ -1680,19 +1681,6 @@
 		free_percpu(f.c);
 }
 
-/*
- * Check if the objects in a per cpu structure fit numa
- * locality expectations.
- */
-static inline int node_match(struct kmem_cache_cpu *c, int node)
-{
-#ifdef CONFIG_NUMA
-	if (node != NUMA_NO_NODE && c->node != node)
-		return 0;
-#endif
-	return 1;
-}
-
 static unsigned long count_partial(struct kmem_cache_node *n,
 					int (*get_count)(struct page *))
 {
@@ -1752,6 +1740,26 @@
 }
 
 /*
+ * Determine the final numa node from which the allocation will
+ * be occurring. Allocations can be redirected for slabs marked
+ * with SLAB_MEM_SPREAD by memory policies and cpusets options.
+ */
+static inline int find_numa_node(struct kmem_cache *s, int node)
+{
+#ifdef CONFIG_NUMA
+	if (unlikely(s->flags & SLAB_MEM_SPREAD)) {
+		if (node == NUMA_NO_NODE && !in_interrupt()) {
+			if (cpuset_do_slab_mem_spread())
+				node = cpuset_mem_spread_node();
+			else if (current->mempolicy)
+				node = slab_node(current->mempolicy);
+		}
+	}
+#endif
+	return node;
+}
+
+/*
  * Retrieve pointers to nr objects from a slab into the object array.
  * Slab must be locked.
  */
@@ -1802,6 +1810,42 @@
 
 /* Handling of objects from other nodes */
 
+static void *slab_alloc_node(struct kmem_cache *s, struct kmem_cache_cpu *c,
+						gfp_t gfpflags, int node)
+{
+#ifdef CONFIG_NUMA
+	struct kmem_cache_node *n = get_node(s, node);
+	struct page *page;
+	void *object;
+
+	page = get_partial_node(n);
+	if (!page) {
+		gfpflags &= gfp_allowed_mask;
+
+		if (gfpflags & __GFP_WAIT)
+			local_irq_enable();
+
+		page = new_slab(s, gfpflags | GFP_THISNODE, node);
+
+		if (gfpflags & __GFP_WAIT)
+			local_irq_disable();
+
+ 		if (!page)
+			return NULL;
+
+		slab_lock(page);
+ 	}
+
+	retrieve_objects(s, page, &object, 1);
+
+	to_lists(s, page, 0);
+	slab_unlock(page);
+	return object;
+#else
+	return NULL;
+#endif
+}
+
 static void slab_free_alien(struct kmem_cache *s,
 	struct kmem_cache_cpu *c, struct page *page, void *object, int node)
 {
@@ -1827,13 +1871,20 @@
 redo:
 	local_irq_save(flags);
 	c = __this_cpu_ptr(s->cpu);
-	q = &c->q;
-	if (unlikely(queue_empty(q) || !node_match(c, node))) {
 
-		if (unlikely(!node_match(c, node))) {
-			flush_cpu_objects(s, c);
-			c->node = node;
+	node = find_numa_node(s, node);
+
+	if (NUMA_BUILD && node != NUMA_NO_NODE) {
+		if (unlikely(node != c->node)) {
+			object = slab_alloc_node(s, c, gfpflags, node);
+			if (!object)
+				goto oom;
+			stat(s, ALLOC_REMOTE);
+			goto got_it;
 		}
+	}
+	q = &c->q;
+	if (unlikely(queue_empty(q))) {
 
 		while (q->objects < s->batch) {
 			struct page *new;
@@ -1877,6 +1928,7 @@
 
 	object = queue_get(q);
 
+got_it:
 	if (kmem_cache_debug(s)) {
 		if (!alloc_debug_processing(s, object, addr))
 			goto redo;
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2010-07-31 18:26:09.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2010-07-31 18:27:15.733994218 -0500
@@ -23,6 +23,7 @@
 	FREE_REMOVE_PARTIAL,	/* Freeing removed from partial list */
 	ALLOC_FROM_PARTIAL,	/* slab with objects acquired from partial */
 	ALLOC_SLAB,		/* New slab acquired from page allocator */
+	ALLOC_REMOTE,		/* Allocation from remote slab */
 	FREE_ALIEN,		/* Free to alien node */
 	FREE_SLAB,		/* Slab freed to the page allocator */
 	QUEUE_FLUSH,		/* Flushing of the per cpu queue */
@@ -40,7 +41,7 @@
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
-	int node;		/* objects only from this numa node */
+	int node;		/* The memory node local to the cpu */
 	struct kmem_cache_queue q;
 };
 


* [S+Q3 20/23] slub: Shared cache to exploit cross cpu caching abilities.
  2010-08-04  2:45 [S+Q3 00/23] SLUB: The Unified slab allocator (V3) Christoph Lameter
                   ` (18 preceding siblings ...)
  2010-08-04  2:45 ` [S+Q3 19/23] slub: Object based NUMA policies Christoph Lameter
@ 2010-08-04  2:45 ` Christoph Lameter
  2010-08-17  5:52   ` David Rientjes
  2010-08-04  2:45 ` [S+Q3 21/23] slub: Support Alien Caches Christoph Lameter
                   ` (3 subsequent siblings)
  23 siblings, 1 reply; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04  2:45 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, linux-kernel, Nick Piggin, David Rientjes

[-- Attachment #1: unified_shared_cache --]
[-- Type: text/plain, Size: 17212 bytes --]

This is strictly a performance enhancement: it better tracks objects that
are likely still hot in the lowest-level cpu caches of the processors.

SLAB uses one shared cache per NUMA node or one globally. However, that
is not satisfactory for contemporary cpus, which may have multiple
independent cpu caches per node. In that situation SLAB treats
cache cold objects like cache hot objects.

The shared caches of SLUB are allocated per physical cpu cache and serve
all cpus sharing that cache. Shared cache content will not cross physical
caches.

The shared cache can be dynamically configured via
/sys/kernel/slab/<cache>/shared_queue_size

The current shared cache state is available via
cat /sys/kernel/slab/<cache>/shared_caches

Shared caches are always allocated in the sizes available in the kmalloc
array; the requested size is rounded up to the next available size.

For example, on my Dell with 8 cpus in 2 packages, where each pair of cpus
shares an L2 cache, I get:

christoph@:/sys/kernel/slab$ cat kmalloc-64/shared_caches
384 C0,2=66/126 C1,3=126/126 C4,6=126/126 C5,7=66/126
christoph@:/sys/kernel/slab$ cat kmalloc-64/per_cpu_caches
617 C0=54/125 C1=37/125 C2=102/125 C3=76/125 C4=81/125 C5=108/125 C6=72/125 C7=87/125
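
The following is a rough, self-contained userspace model (illustrative
names, a mutex standing in for the kernel spinlock, fixed array sizes) of
the hand-off between a per cpu queue and its shared queue described above:
the allocation side pulls a batch of the hottest objects when the cpu
queue runs empty, and the free side pushes the coldest objects out when
the cpu queue is full.

#include <pthread.h>
#include <string.h>

struct queue {
	int objects;
	int max;
	pthread_mutex_t lock;		/* only used for the shared queue */
	void *object[128];
};

/* allocation side: cpu queue ran empty, pull up to batch hot objects */
int refill_from_shared(struct queue *cpu_q, struct queue *shared, int batch)
{
	int n;

	pthread_mutex_lock(&shared->lock);
	n = shared->objects < batch ? shared->objects : batch;
	shared->objects -= n;
	memcpy(cpu_q->object + cpu_q->objects,
	       shared->object + shared->objects, n * sizeof(void *));
	cpu_q->objects += n;
	pthread_mutex_unlock(&shared->lock);
	return n;
}

/* free side: cpu queue is full, push the coldest (oldest) objects out */
int push_to_shared(struct queue *cpu_q, struct queue *shared, int batch)
{
	int n;

	pthread_mutex_lock(&shared->lock);
	n = shared->max - shared->objects;
	if (n > batch)
		n = batch;
	memcpy(shared->object + shared->objects, cpu_q->object,
	       n * sizeof(void *));
	shared->objects += n;
	pthread_mutex_unlock(&shared->lock);

	cpu_q->objects -= n;
	memmove(cpu_q->object, cpu_q->object + n,
		cpu_q->objects * sizeof(void *));
	return n;
}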

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 include/linux/slub_def.h |    9 +
 mm/slub.c                |  423 ++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 405 insertions(+), 27 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2010-08-03 13:04:49.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2010-08-03 15:52:01.000000000 -0500
@@ -24,6 +24,8 @@ enum stat_item {
 	ALLOC_FROM_PARTIAL,	/* slab with objects acquired from partial */
 	ALLOC_SLAB,		/* New slab acquired from page allocator */
 	ALLOC_REMOTE,		/* Allocation from remote slab */
+	ALLOC_SHARED,		/* Allocation caused a shared cache transaction */
+	FREE_SHARED,		/* Free caused a shared cache transaction */
 	FREE_ALIEN,		/* Free to alien node */
 	FREE_SLAB,		/* Slab freed to the page allocator */
 	QUEUE_FLUSH,		/* Flushing of the per cpu queue */
@@ -34,6 +36,10 @@ enum stat_item {
 struct kmem_cache_queue {
 	int objects;		/* Available objects */
 	int max;		/* Queue capacity */
+	union {
+		struct kmem_cache_queue *shared; /* cpu q -> shared q */
+		spinlock_t lock;	  /* shared queue: lock */
+	};
 	void *object[];
 };
 
@@ -87,12 +93,15 @@ struct kmem_cache {
 	int align;		/* Alignment */
 	int queue;		/* specified queue size */
 	int cpu_queue;		/* cpu queue size */
+	int shared_queue;	/* Actual shared queue size */
+	int nr_shared;		/* Total # of shared caches */
 	unsigned long min_partial;
 	const char *name;	/* Name (only for display!) */
 	struct list_head list;	/* List of slab caches */
 #ifdef CONFIG_SLUB_DEBUG
 	struct kobject kobj;	/* For sysfs */
 #endif
+	int shared_queue_sysfs;	/* Desired shared queue size */
 
 #ifdef CONFIG_NUMA
 	/*
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-08-03 13:04:49.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-08-03 15:52:01.000000000 -0500
@@ -1556,7 +1556,8 @@ void drain_objects(struct kmem_cache *s,
 	}
 }
 
-static inline void drain_queue(struct kmem_cache *s, struct kmem_cache_queue *q, int nr)
+static inline int drain_queue(struct kmem_cache *s,
+		struct kmem_cache_queue *q, int nr)
 {
 	int t = min(nr, q->objects);
 
@@ -1566,13 +1567,35 @@ static inline void drain_queue(struct km
 	if (q->objects)
 		memcpy(q->object, q->object + t,
 					q->objects * sizeof(void *));
+	return t;
 }
+
+static inline int drain_shared_cache(struct kmem_cache *s,
+				 struct kmem_cache_queue *q)
+{
+	int n = 0;
+
+	if (!q)
+		return n;
+
+	if (!queue_empty(q)) {
+		spin_lock(&q->lock);
+		if (q->objects)
+			n = drain_queue(s, q, q->objects);
+		spin_unlock(&q->lock);
+	}
+	return n;
+}
+
 /*
  * Drain all objects from a per cpu queue
  */
 static void flush_cpu_objects(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
-	drain_queue(s, &c->q, c->q.objects);
+	struct kmem_cache_queue *q = &c->q;
+
+	drain_queue(s, q, q->objects);
+	drain_shared_cache(s, q->shared);
  	stat(s, QUEUE_FLUSH);
 }
 
@@ -1629,6 +1652,207 @@ struct kmem_cache_cpu *alloc_kmem_cache_
 	return k;
 }
 
+/* Shared cache management */
+
+static inline int get_shared_objects(struct kmem_cache_queue *q,
+		void **l, int nr)
+{
+	int d;
+
+	spin_lock(&q->lock);
+	d = min(nr, q->objects);
+	q->objects -= d;
+	memcpy(l, q->object + q->objects, d * sizeof(void *));
+	spin_unlock(&q->lock);
+
+	return d;
+}
+
+static inline int put_shared_objects(struct kmem_cache_queue *q,
+				void **l, int nr)
+{
+	int d;
+
+	spin_lock(&q->lock);
+	d = min(nr, q->max - q->objects);
+	memcpy(q->object + q->objects, l,  d * sizeof(void *));
+	q->objects += d;
+	spin_unlock(&q->lock);
+
+	return d;
+}
+
+static struct kmem_cache *get_slab(size_t size, gfp_t flags);
+
+static inline unsigned long shared_cache_size(int n)
+{
+	return n * sizeof(void *) + sizeof(struct kmem_cache_queue);
+}
+
+static inline unsigned long shared_cache_capacity(unsigned long size)
+{
+	return (size - sizeof(struct kmem_cache_queue)) / sizeof(void *);
+}
+
+static inline void init_shared_cache(struct kmem_cache_queue *q, int max)
+{
+	q->max = max;
+	spin_lock_init(&q->lock);
+}
+
+
+/* Determine a list of the active shared caches */
+struct kmem_cache_queue **shared_caches(struct kmem_cache *s)
+{
+	int cpu;
+	struct kmem_cache_queue **caches;
+	int nr;
+	int n;
+
+	caches = kmalloc(sizeof(struct kmem_cache_queue *)
+				* (s->nr_shared + 1), GFP_KERNEL);
+	if (!caches)
+		return NULL;
+
+	nr = 0;
+
+	/* Build list of shared caches */
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu, cpu);
+		struct kmem_cache_queue *q = c->q.shared;
+
+		if (!q)
+			continue;
+
+		for (n = 0; n < nr; n++)
+			if (caches[n] == q)
+				break;
+
+		if (n < nr)
+			continue;
+
+		caches[nr++] = q;
+	}
+	caches[nr] = NULL;
+	BUG_ON(nr != s->nr_shared);
+	return caches;
+}
+
+/*
+ * Allocate shared cpu caches.
+ * A shared cache is allocated for each series of cpus sharing a single cache
+ */
+static void alloc_shared_caches(struct kmem_cache *s)
+{
+	int cpu;
+	int max;
+	int size;
+	void *p;
+
+	if (slab_state < SYSFS || s->shared_queue_sysfs == 0)
+		return;
+
+	/*
+	 * Determine the size. Round it up to the size that a kmalloc cache
+	 * supporting that size has. This will often align the size to a
+	 * power of 2 especially on machines that have large kmalloc
+	 * alignment requirements.
+	 */
+	size = shared_cache_size(s->shared_queue_sysfs);
+	if (size < PAGE_SIZE / 2)
+		size = get_slab(size, GFP_KERNEL)->objsize;
+	else
+		size = PAGE_SHIFT << get_order(size);
+
+	max = shared_cache_capacity(size);
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu, cpu);
+		struct kmem_cache_queue *l;
+		int x;
+		const struct cpumask *map =
+				per_cpu(cpu_info.llc_shared_map, cpu) ;
+
+		/* Skip cpus that already have assigned shared caches */
+		if (c->q.shared)
+			continue;
+
+		/* Allocate shared cache */
+		p = kmalloc_node(size, GFP_KERNEL | __GFP_ZERO, c->node);
+		if (!p) {
+			printk(KERN_WARNING "SLUB: Out of memory allocating"
+				" shared cache for %s cpu %d node %d\n",
+				s->name, cpu, c->node);
+			continue;
+		}
+
+		l = p;
+		init_shared_cache(l, max);
+
+		if (cpumask_weight(map) < 2) {
+
+			/*
+			 * No information available on how to setup the shared
+			 * caches. Cpu will not have shared or alien caches.
+			 */
+			printk_once(KERN_WARNING "SLUB: Cache topology"
+				" information unusable. No shared caches\n");
+
+			kfree(p);
+			continue;
+		}
+
+		/* Link all cpus in this group to the shared cache */
+		for_each_cpu(x, map) {
+			struct kmem_cache_cpu *z = per_cpu_ptr(s->cpu, x);
+
+			if (z->node == c->node)
+				z->q.shared = l;
+		}
+		s->nr_shared++;
+	}
+	s->shared_queue = max;
+}
+
+/*
+ * Flush shared caches.
+ *
+ * Called from IPI handler with interrupts disabled.
+ */
+static void __remove_shared_cache(void *d)
+{
+	struct kmem_cache *s = d;
+	struct kmem_cache_cpu *c = __this_cpu_ptr(s->cpu);
+	struct kmem_cache_queue *q = c->q.shared;
+
+	c->q.shared = NULL;
+	drain_shared_cache(s, q);
+}
+
+
+static int remove_shared_caches(struct kmem_cache *s)
+{
+	struct kmem_cache_queue **caches;
+	int i;
+
+	caches = shared_caches(s);
+	if (!caches)
+		return -ENOMEM;
+
+	/* Go through a transaction on each cpu removing the pointers to the shared caches */
+	on_each_cpu(__remove_shared_cache, s, 1);
+
+	for(i = 0; i < s->nr_shared; i++) {
+		void *p = caches[i];
+
+		kfree(p);
+	}
+
+	kfree(caches);
+	s->nr_shared = 0;
+	s->shared_queue = 0;
+	return 0;
+}
 
 static void resize_cpu_queue(struct kmem_cache *s, int queue)
 {
@@ -1792,8 +2016,9 @@ static inline void refill_queue(struct k
 		struct kmem_cache_queue *q, struct page *page, int nr)
 {
 	int d;
+	int batch = min_t(int, q->max, s->batch);
 
-	d = min(s->batch - q->objects, nr);
+	d = min(batch - q->objects, nr);
 	retrieve_objects(s, page, q->object + q->objects, d);
 	q->objects += d;
 }
@@ -1886,6 +2111,20 @@ redo:
 	q = &c->q;
 	if (unlikely(queue_empty(q))) {
 
+		struct kmem_cache_queue *l = q->shared;
+
+		if (l && !queue_empty(l)) {
+
+			/*
+			 * Refill the cpu queue with the hottest objects
+			 * from the shared cache queue
+			 */
+			q->objects = get_shared_objects(l,
+						q->object, s->batch);
+			stat(s, ALLOC_SHARED);
+
+		}
+		else
 		while (q->objects < s->batch) {
 			struct page *new;
 
@@ -2022,9 +2261,22 @@ static void slab_free(struct kmem_cache 
 
 	if (unlikely(queue_full(q))) {
 
-		drain_queue(s, q, s->batch);
-		stat(s, FREE_SLOWPATH);
+		struct kmem_cache_queue *l = q->shared;
 
+		/* Shared queue available and has space ? */
+		if (l && !queue_full(l)) {
+			/* Push coldest objects into the shared queue */
+			int d = put_shared_objects(l, q->object, s->batch);
+
+			q->objects -=  d;
+			memcpy(q->object, q->object + d,
+					q->objects  * sizeof(void *));
+			stat(s, FREE_SHARED);
+		}
+		if (queue_full(q))
+			drain_queue(s, q, s->batch);
+
+		stat(s, FREE_SLOWPATH);
 	} else
 		stat(s, FREE_FASTPATH);
 
@@ -2498,8 +2750,11 @@ static int kmem_cache_open(struct kmem_c
 	s->queue = initial_queue_size(s->size);
 	s->batch = (s->queue + 1) / 2;
 
-	if (alloc_kmem_cache_cpus(s))
+	if (alloc_kmem_cache_cpus(s)) {
+		s->shared_queue_sysfs = s->queue;
+		alloc_shared_caches(s);
 		return 1;
+	}
 
 	free_kmem_cache_nodes(s);
 error:
@@ -3270,12 +3525,21 @@ void __init kmem_cache_init(void)
 	/* Now the kmalloc array is fully functional (*not* the dma array) */
 	slab_state = UP;
 
-	/* Provide the correct kmalloc names now that the caches are up */
-	for (i = KMALLOC_SHIFT_LOW; i < SLUB_PAGE_SHIFT; i++) {
-		char *s = kasprintf(GFP_NOWAIT, "kmalloc-%d", 1 << i);
+	/*
+	 * Provide the correct kmalloc names and enable the shared caches
+	 * now that the kmalloc array is functional
+	 */
+	for (i = 0; i < SLUB_PAGE_SHIFT; i++) {
+		struct kmem_cache *s = kmalloc_caches[i];
 
-		BUG_ON(!s);
-		kmalloc_caches[i]->name = s;
+		if (!s)
+			continue;
+
+		if (strcmp(s->name, "kmalloc") == 0)
+			s->name = kasprintf(GFP_NOWAIT,
+				"kmalloc-%d", s->objsize);
+
+		BUG_ON(!s->name);
 	}
 
 #ifdef CONFIG_SMP
@@ -3298,6 +3562,9 @@ void __init kmem_cache_init_late(void)
 
 			create_kmalloc_cache(&kmalloc_dma_caches[i],
 				name, s->objsize, SLAB_CACHE_DMA);
+
+			/* DMA caches are rarely used. Reduce memory consumption */
+			kmalloc_dma_caches[i]->shared_queue_sysfs = 0;
 		}
 	}
 #endif
@@ -4047,10 +4314,40 @@ static ssize_t min_partial_store(struct 
 }
 SLAB_ATTR(min_partial);
 
-static ssize_t cpu_queue_size_show(struct kmem_cache *s, char *buf)
+static ssize_t queue_size_show(struct kmem_cache *s, char *buf)
 {
 	return sprintf(buf, "%u\n", s->queue);
 }
+SLAB_ATTR_RO(queue_size);
+
+
+static ssize_t batch_size_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%u\n", s->batch);
+}
+
+static ssize_t batch_size_store(struct kmem_cache *s,
+			 const char *buf, size_t length)
+{
+	unsigned long batch;
+	int err;
+
+	err = strict_strtoul(buf, 10, &batch);
+	if (err)
+		return err;
+
+	if (batch < s->queue || batch < 4)
+		return -EINVAL;
+
+	s->batch = batch;
+	return length;
+}
+SLAB_ATTR(batch_size);
+
+static ssize_t cpu_queue_size_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%u\n", s->cpu_queue);
+}
 
 static ssize_t cpu_queue_size_store(struct kmem_cache *s,
 			 const char *buf, size_t length)
@@ -4075,28 +4372,82 @@ static ssize_t cpu_queue_size_store(stru
 }
 SLAB_ATTR(cpu_queue_size);
 
-static ssize_t cpu_batch_size_show(struct kmem_cache *s, char *buf)
+static ssize_t shared_queue_size_show(struct kmem_cache *s, char *buf)
 {
-	return sprintf(buf, "%u\n", s->batch);
+	return sprintf(buf, "%u %u\n", s->shared_queue, s->shared_queue_sysfs);
 }
 
-static ssize_t cpu_batch_size_store(struct kmem_cache *s,
+static ssize_t shared_queue_size_store(struct kmem_cache *s,
 			 const char *buf, size_t length)
 {
-	unsigned long batch;
+	unsigned long queue;
 	int err;
 
-	err = strict_strtoul(buf, 10, &batch);
+	err = strict_strtoul(buf, 10, &queue);
 	if (err)
 		return err;
 
-	if (batch < s->queue || batch < 4)
+	if (queue > 10000 || queue < 4)
 		return -EINVAL;
 
-	s->batch = batch;
-	return length;
+	if (s->batch > queue)
+		s->batch = queue;
+
+	down_write(&slub_lock);
+	s->shared_queue_sysfs = queue;
+	err = remove_shared_caches(s);
+	if (!err)
+		alloc_shared_caches(s);
+	up_write(&slub_lock);
+	return err ? err : length;
+}
+SLAB_ATTR(shared_queue_size);
+
+static ssize_t shared_caches_show(struct kmem_cache *s, char *buf)
+{
+	unsigned long total = 0;
+	int x, n;
+	int cpu;
+	struct kmem_cache_queue **caches;
+
+	down_read(&slub_lock);
+	caches = shared_caches(s);
+	if (!caches) {
+		up_read(&slub_lock);
+		return -ENOMEM;
+	}
+
+	for (n = 0; n < s->nr_shared; n++)
+		total += caches[n]->objects;
+
+	x = sprintf(buf, "%lu", total);
+
+	for (n = 0; n < s->nr_shared; n++) {
+		int first = 1;
+		struct kmem_cache_queue *q = caches[n];
+
+		x += sprintf(buf + x, " C");
+
+		/* Find cpus using the shared cache */
+		for_each_online_cpu(cpu) {
+			struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu, cpu);
+
+			if (q != c->q.shared)
+				continue;
+
+			if (first)
+				first = 0;
+			else
+				x += sprintf(buf + x, ",");
+			x += sprintf(buf + x, "%d", cpu);
+		}
+		x += sprintf(buf +x, "=%d/%d", q->objects, q->max);
+	}
+	up_read(&slub_lock);
+	kfree(caches);
+	return x + sprintf(buf + x, "\n");
 }
-SLAB_ATTR(cpu_batch_size);
+SLAB_ATTR_RO(shared_caches);
 
 static ssize_t ctor_show(struct kmem_cache *s, char *buf)
 {
@@ -4127,7 +4478,7 @@ static ssize_t partial_show(struct kmem_
 }
 SLAB_ATTR_RO(partial);
 
-static ssize_t cpu_queues_show(struct kmem_cache *s, char *buf)
+static ssize_t per_cpu_caches_show(struct kmem_cache *s, char *buf)
 {
 	unsigned long total = 0;
 	int x;
@@ -4159,7 +4510,7 @@ static ssize_t cpu_queues_show(struct km
 	kfree(cpus);
 	return x + sprintf(buf + x, "\n");
 }
-SLAB_ATTR_RO(cpu_queues);
+SLAB_ATTR_RO(per_cpu_caches);
 
 static ssize_t objects_show(struct kmem_cache *s, char *buf)
 {
@@ -4472,14 +4823,17 @@ static struct attribute *slab_attrs[] = 
 	&objs_per_slab_attr.attr,
 	&order_attr.attr,
 	&min_partial_attr.attr,
+	&queue_size_attr.attr,
+	&batch_size_attr.attr,
 	&cpu_queue_size_attr.attr,
-	&cpu_batch_size_attr.attr,
+	&shared_queue_size_attr.attr,
 	&objects_attr.attr,
 	&objects_partial_attr.attr,
 	&total_objects_attr.attr,
 	&slabs_attr.attr,
 	&partial_attr.attr,
-	&cpu_queues_attr.attr,
+	&per_cpu_caches_attr.attr,
+	&shared_caches_attr.attr,
 	&ctor_attr.attr,
 	&aliases_attr.attr,
 	&align_attr.attr,
@@ -4750,6 +5104,7 @@ static int __init slab_sysfs_init(void)
 		if (err)
 			printk(KERN_ERR "SLUB: Unable to add boot slab %s"
 						" to sysfs\n", s->name);
+		alloc_shared_caches(s);
 	}
 
 	while (alias_list) {
@@ -4806,6 +5161,19 @@ static void s_stop(struct seq_file *m, v
 	up_read(&slub_lock);
 }
 
+static unsigned long shared_objects(struct kmem_cache *s)
+{
+	unsigned long shared;
+	int n;
+	struct kmem_cache_queue **caches;
+
+	caches = shared_caches(s);
+	for(n = 0; n < s->nr_shared; n++)
+		shared += caches[n]->objects;
+
+	kfree(caches);
+	return shared;
+}
 static int s_show(struct seq_file *m, void *p)
 {
 	unsigned long nr_partials = 0;
@@ -4835,9 +5203,10 @@ static int s_show(struct seq_file *m, vo
 	seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", s->name, nr_inuse,
 		   nr_objs, s->size, oo_objects(s->oo),
 		   (1 << oo_order(s->oo)));
-	seq_printf(m, " : tunables %4u %4u %4u", s->queue, s->batch, 0);
+	seq_printf(m, " : tunables %4u %4u %4u", s->cpu_queue, s->batch, s->shared_queue);
+
 	seq_printf(m, " : slabdata %6lu %6lu %6lu", nr_slabs, nr_slabs,
-		   0UL);
+		   shared_objects(s));
 	seq_putc(m, '\n');
 	return 0;
 }


* [S+Q3 21/23] slub: Support Alien Caches
  2010-08-04  2:45 [S+Q3 00/23] SLUB: The Unified slab allocator (V3) Christoph Lameter
                   ` (19 preceding siblings ...)
  2010-08-04  2:45 ` [S+Q3 20/23] slub: Shared cache to exploit cross cpu caching abilities Christoph Lameter
@ 2010-08-04  2:45 ` Christoph Lameter
  2010-08-04  2:45 ` [S+Q3 22/23] slub: Cached object expiration Christoph Lameter
                   ` (2 subsequent siblings)
  23 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04  2:45 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, linux-kernel, Nick Piggin, David Rientjes

[-- Attachment #1: unified_alien_cache --]
[-- Type: text/plain, Size: 13651 bytes --]

Alien caches are essential to track cachelines from a foreign node that are
present in a local cpu cache. They are therefore a form of the previously
introduced shared cache. One alien cache per foreign node (the number of
nodes minus one) is allocated for *each* lowest-level shared cpu cache.

SLAB's problem in this area is that the cpu caches are not properly
tracked: if there are multiple cpu caches on the same node then SLAB may
not properly track the cache hotness of objects.

Alien caches are sized differently from shared caches but are allocated
in the same contiguous memory area. The shared cache pointer is used
to reach the alien caches too: at positive offsets we find shared cache
objects, and at negative offsets the alien caches are placed (sketched
below).

Alien caches can be switched off and configured on a cache-by-cache
basis via /sys/kernel/slab/<cache>/alien_queue_size.

Alien status is available in /sys/kernel/slab/<cache>/alien_caches.
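
A rough illustration of the layout just described (simplified types; the
struct is a cut-down stand-in for kmem_cache_queue and alien_shift for
s->alien_shift). The alien queues sit below the shared queue in the same
allocation, so the per cpu shared pointer reaches an alien queue with a
negative offset:

/*
 *  base                                      shared = base + alien_size
 *  | alien queue (slot n-1) | ... | alien queue (slot 1) | shared queue |
 *
 * Each alien slot occupies (1 << alien_shift) bytes; slot numbers are the
 * remapped node numbers 1..nr_node_ids-1 used by alien_cache().
 */
struct queue {
	int objects;
	int max;
	void *object[];
};

static inline struct queue *alien_slot(void *shared, int slot,
					int alien_shift)
{
	return (struct queue *)((char *)shared -
				((long)slot << alien_shift));
}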

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 include/linux/slub_def.h |    1 
 mm/slub.c                |  339 +++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 327 insertions(+), 13 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-08-03 15:58:51.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-08-03 15:58:53.000000000 -0500
@@ -31,8 +31,10 @@
 
 /*
  * Lock order:
- *   1. slab_lock(page)
- *   2. slab->list_lock
+ *
+ *   1. alien kmem_cache_cpu->lock lock
+ *   2. slab_lock(page)
+ *   3. kmem_cache_node->list_lock
  *
  *   The slab_lock protects operations on the object of a particular
  *   slab and its metadata in the page struct. If the slab lock
@@ -148,6 +150,16 @@ static inline int kmem_cache_debug(struc
 /* Internal SLUB flags */
 #define __OBJECT_POISON		0x80000000UL /* Poison object */
 #define __SYSFS_ADD_DEFERRED	0x40000000UL /* Not yet visible via sysfs */
+#define __ALIEN_CACHE		0x20000000UL /* Slab has alien caches */
+
+static inline int aliens(struct kmem_cache *s)
+{
+#ifdef CONFIG_NUMA
+	return (s->flags & __ALIEN_CACHE) != 0;
+#else
+	return 0;
+#endif
+}
 
 static int kmem_size = sizeof(struct kmem_cache);
 
@@ -1587,6 +1599,9 @@ static inline int drain_shared_cache(str
 	return n;
 }
 
+static void drain_alien_caches(struct kmem_cache *s,
+				struct kmem_cache_cpu *c);
+
 /*
  * Drain all objects from a per cpu queue
  */
@@ -1596,6 +1611,7 @@ static void flush_cpu_objects(struct kme
 
 	drain_queue(s, q, q->objects);
 	drain_shared_cache(s, q->shared);
+	drain_alien_caches(s, c);
  	stat(s, QUEUE_FLUSH);
 }
 
@@ -1739,6 +1755,53 @@ struct kmem_cache_queue **shared_caches(
 }
 
 /*
+ * Alien caches which are also shared caches
+ */
+
+#ifdef CONFIG_NUMA
+/* Given an allocation context determine the alien queue to use */
+static inline struct kmem_cache_queue *alien_cache(struct kmem_cache *s,
+		struct kmem_cache_cpu *c, int node)
+{
+	void *p = c->q.shared;
+
+	/* If the cache does not have any alien caches return NULL */
+	if (!aliens(s) || !p || node == c->node)
+		return NULL;
+
+	/*
+	 * Map [0..(c->node - 1)] -> [1..c->node].
+	 *
+	 * This effectively removes the current node (which is serviced by
+	 * the shared cachei) from the list and avoids hitting 0 (which would
+	 * result in accessing the shared queue used for the cpu cache).
+	 */
+	if (node < c->node)
+		node++;
+
+	p -= (node << s->alien_shift);
+
+	return (struct kmem_cache_queue *)p;
+}
+
+static inline void drain_alien_caches(struct kmem_cache *s,
+					 struct kmem_cache_cpu *c)
+{
+	int node;
+
+	for_each_node(node)
+		if (node != c->node);
+			drain_shared_cache(s, alien_cache(s, c, node));
+}
+
+#else
+static inline void drain_alien_caches(struct kmem_cache *s,
+				 struct kmem_cache_cpu *c) {}
+#endif
+
+static struct kmem_cache *get_slab(size_t size, gfp_t flags);
+
+/*
  * Allocate shared cpu caches.
  * A shared cache is allocated for each series of cpus sharing a single cache
  */
@@ -1748,23 +1811,30 @@ static void alloc_shared_caches(struct k
 	int max;
 	int size;
 	void *p;
+	int alien_max = 0;
+	int alien_size = 0;
 
 	if (slab_state < SYSFS || s->shared_queue_sysfs == 0)
 		return;
 
+	if (aliens(s)) {
+		alien_size = (nr_node_ids - 1) << s->alien_shift;
+		alien_max = shared_cache_capacity(1 << s->alien_shift);
+	}
+
 	/*
 	 * Determine the size. Round it up to the size that a kmalloc cache
 	 * supporting that size has. This will often align the size to a
 	 * power of 2 especially on machines that have large kmalloc
 	 * alignment requirements.
 	 */
-	size = shared_cache_size(s->shared_queue_sysfs);
+	size = shared_cache_size(s->shared_queue_sysfs) + alien_size;
 	if (size < PAGE_SIZE / 2)
 		size = get_slab(size, GFP_KERNEL)->objsize;
 	else
 		size = PAGE_SHIFT << get_order(size);
 
-	max = shared_cache_capacity(size);
+	max = shared_cache_capacity(size - alien_size);
 
 	for_each_online_cpu(cpu) {
 		struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu, cpu);
@@ -1786,8 +1856,26 @@ static void alloc_shared_caches(struct k
 			continue;
 		}
 
-		l = p;
+		l = p + alien_size;
 		init_shared_cache(l, max);
+#ifdef CONFIG_NUMA
+		/* And initialize the alien caches now */
+		if (aliens(s)) {
+			int node;
+
+			for (node = 0; node < nr_node_ids - 1; node++) {
+				struct kmem_cache_queue *a =
+					p + (node << s->alien_shift);
+
+				init_shared_cache(a, alien_max);
+			}
+		}
+		if (cpumask_weight(map) < 2)  {
+			printk_once(KERN_WARNING "SLUB: Unusable processor"
+				" cache topology. Shared cache per numa node.\n");
+			map = cpumask_of_node(c->node);
+		}
+#endif
 
 		if (cpumask_weight(map) < 2) {
 
@@ -1827,6 +1915,7 @@ static void __remove_shared_cache(void *
 
 	c->q.shared = NULL;
 	drain_shared_cache(s, q);
+	drain_alien_caches(s, c);
 }
 
 
@@ -1845,6 +1934,9 @@ static int remove_shared_caches(struct k
 	for(i = 0; i < s->nr_shared; i++) {
 		void *p = caches[i];
 
+		if (aliens(s))
+			p -= (nr_node_ids - 1) << s->alien_shift;
+
 		kfree(p);
 	}
 
@@ -2039,11 +2131,23 @@ static void *slab_alloc_node(struct kmem
 						gfp_t gfpflags, int node)
 {
 #ifdef CONFIG_NUMA
-	struct kmem_cache_node *n = get_node(s, node);
+	struct kmem_cache_queue *a = alien_cache(s, c, node);
 	struct page *page;
 	void *object;
 
-	page = get_partial_node(n);
+	if (a) {
+redo:
+		spin_lock(&a->lock);
+		if (likely(!queue_empty(a))) {
+			object = queue_get(a);
+			spin_unlock(&a->lock);
+			return object;
+		}
+		spin_unlock(&a->lock);
+	}
+
+	/* Cross node allocation and lock taking ! */
+	page = get_partial_node(s->node[node]);
 	if (!page) {
 		gfpflags &= gfp_allowed_mask;
 
@@ -2061,10 +2165,19 @@ static void *slab_alloc_node(struct kmem
 		slab_lock(page);
  	}
 
-	retrieve_objects(s, page, &object, 1);
+	if (a) {
+		spin_lock(&a->lock);
+		refill_queue(s, a, page, available(page));
+		spin_unlock(&a->lock);
+	} else
+		retrieve_objects(s, page, &object, 1);
 
 	to_lists(s, page, 0);
 	slab_unlock(page);
+
+	if (a)
+		goto redo;
+
 	return object;
 #else
 	return NULL;
@@ -2075,8 +2188,17 @@ static void slab_free_alien(struct kmem_
 	struct kmem_cache_cpu *c, struct page *page, void *object, int node)
 {
 #ifdef CONFIG_NUMA
-	/* Direct free to the slab */
-	drain_objects(s, &object, 1);
+	struct kmem_cache_queue *a = alien_cache(s, c, node);
+
+	if (a) {
+		spin_lock(&a->lock);
+		while (unlikely(queue_full(a)))
+			drain_queue(s, a, s->batch);
+		queue_put(a, object);
+		spin_unlock(&a->lock);
+	} else
+		/* Direct free to the slab */
+		drain_objects(s, &object, 1);
 #endif
 }
 
@@ -2741,15 +2863,53 @@ static int kmem_cache_open(struct kmem_c
 	 */
 	set_min_partial(s, ilog2(s->size));
 	s->refcount = 1;
-#ifdef CONFIG_NUMA
-	s->remote_node_defrag_ratio = 1000;
-#endif
 	if (!init_kmem_cache_nodes(s))
 		goto error;
 
 	s->queue = initial_queue_size(s->size);
 	s->batch = (s->queue + 1) / 2;
 
+#ifdef CONFIG_NUMA
+	s->remote_node_defrag_ratio = 1000;
+	if (nr_node_ids > 1) {
+		/*
+		 * Alien cache configuration. The more NUMA nodes we have the
+		 * smaller the alien caches become since the penalties in terms
+		 * of space and latency increase. The user will have code for
+		 * locality on these boxes anyways since a large portion of
+		 * memory will be distant to the processor.
+		 *
+		 * A set of alien caches is allocated for each lowest level
+		 * cpu cache. The alien set covers all nodes except the node
+		 * that is nearest to the processor.
+		 *
+		 * Create large alien cache for small node configuration so
+		 * that these can work like shared caches do to preserve the
+		 * cpu cache hot state of objects.
+		 */
+		int lines = fls(ALIGN(shared_cache_size(s->queue),
+						cache_line_size()) -1);
+		int min = fls(cache_line_size() - 1);
+
+		/* Limit the sizes of the alien caches to some sane values */
+		if (nr_node_ids <= 4)
+			/*
+			 * Keep the sizes roughly the same as the shared cache
+			 * unless it gets too huge.
+			 */
+			s->alien_shift = min(PAGE_SHIFT - 1, lines);
+
+		else if (nr_node_ids <= 32)
+			/* Maximum of 4 cachelines */
+			s->alien_shift = min(2 + min, lines);
+		else
+			/* Clamp down to one cacheline */
+			s->alien_shift = min;
+
+		s->flags |= __ALIEN_CACHE;
+	}
+#endif
+
 	if (alloc_kmem_cache_cpus(s)) {
 		s->shared_queue_sysfs = s->queue;
 		alloc_shared_caches(s);
@@ -4745,6 +4905,157 @@ static ssize_t remote_node_defrag_ratio_
 	return length;
 }
 SLAB_ATTR(remote_node_defrag_ratio);
+
+static ssize_t alien_queue_size_show(struct kmem_cache *s, char *buf)
+{
+	if (aliens(s))
+		return sprintf(buf, "%lu %u\n",
+			((1 << s->alien_shift)
+				- sizeof(struct kmem_cache_queue)) /
+				sizeof(void *), s->alien_shift);
+	else
+		return sprintf(buf, "0\n");
+}
+
+static ssize_t alien_queue_size_store(struct kmem_cache *s,
+			 const char *buf, size_t length)
+{
+	unsigned long queue;
+	int err;
+	int oldshift;
+
+	if (nr_node_ids == 1)
+		return -ENOSYS;
+
+	oldshift = s->alien_shift;
+
+	err = strict_strtoul(buf, 10, &queue);
+	if (err)
+		return err;
+
+	if (queue > 65535)
+		return -EINVAL;
+
+	if (queue == 0) {
+		s->flags &= ~__ALIEN_CACHE;
+		s->alien_shift = 0;
+	} else {
+		unsigned long size;
+
+		s->flags |= __ALIEN_CACHE;
+
+		size = max_t(unsigned long, cache_line_size(),
+			 sizeof(struct kmem_cache_queue)
+				+ queue * sizeof(void *));
+		size = ALIGN(size, cache_line_size());
+		s->alien_shift = fls(size + (size -1)) - 1;
+	}
+
+	if (oldshift != s->alien_shift) {
+		down_write(&slub_lock);
+		err = remove_shared_caches(s);
+		if (!err)
+			alloc_shared_caches(s);
+		up_write(&slub_lock);
+	}
+	return err ? err : length;
+}
+SLAB_ATTR(alien_queue_size);
+
+static ssize_t alien_caches_show(struct kmem_cache *s, char *buf)
+{
+	unsigned long total;
+	int x;
+	int n;
+	int cpu, node;
+	struct kmem_cache_queue **caches;
+
+	if (!(s->flags & __ALIEN_CACHE) || s->alien_shift == 0)
+		return -ENOSYS;
+
+	down_read(&slub_lock);
+	caches = shared_caches(s);
+	if (!caches) {
+		up_read(&slub_lock);
+		return -ENOMEM;
+	}
+
+	total = 0;
+	for (n = 0; n < s->nr_shared; n++) {
+		struct kmem_cache_queue *q = caches[n];
+
+		for (node = 1; node < nr_node_ids; node++) {
+			struct kmem_cache_queue *a =
+				(void *)q - (node << s->alien_shift);
+
+			total += a->objects;
+		}
+	}
+	x = sprintf(buf, "%lu", total);
+
+	for (n = 0; n < s->nr_shared; n++) {
+		struct kmem_cache_queue *q = caches[n];
+		struct kmem_cache_queue *a;
+		struct kmem_cache_cpu *c = NULL;
+		int first;
+
+		x += sprintf(buf + x, " C");
+		first = 1;
+		/* Find cpus using the shared cache */
+		for_each_online_cpu(cpu) {
+			struct kmem_cache_cpu *z = per_cpu_ptr(s->cpu, cpu);
+
+			if (q != z->q.shared)
+				continue;
+
+			if (z)
+				c = z;
+
+			if (first)
+				first = 0;
+			else
+				x += sprintf(buf + x, ",");
+
+			x += sprintf(buf + x, "%d", cpu);
+		}
+
+		if (!c) {
+			x += sprintf(buf +x, "=<none>");
+			continue;
+		}
+
+		/* The total of objects for a particular shared cache */
+		total = 0;
+		for_each_online_node(node) {
+			a = alien_cache(s, c, node);
+
+			if (a)
+				total += a->objects;
+		}
+		x += sprintf(buf +x, "=%lu[", total);
+
+		first = 1;
+		for_each_online_node(node) {
+			a = alien_cache(s, c, node);
+
+			if (a) {
+				if (first)
+					first = 0;
+				else
+					x += sprintf(buf + x, ":");
+
+				x += sprintf(buf + x, "N%d=%d/%d",
+						node, a->objects, a->max);
+			}
+		}
+		x += sprintf(buf + x, "]");
+	}
+	up_read(&slub_lock);
+	kfree(caches);
+	return x + sprintf(buf + x, "\n");
+}
+SLAB_ATTR_RO(alien_caches);
 #endif
 
 #ifdef CONFIG_SLUB_STATS
@@ -4854,6 +5165,8 @@ static struct attribute *slab_attrs[] = 
 #endif
 #ifdef CONFIG_NUMA
 	&remote_node_defrag_ratio_attr.attr,
+	&alien_caches_attr.attr,
+	&alien_queue_size_attr.attr,
 #endif
 #ifdef CONFIG_SLUB_STATS
 	&alloc_fastpath_attr.attr,
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2010-08-03 15:58:51.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2010-08-03 15:58:52.000000000 -0500
@@ -82,6 +82,7 @@ struct kmem_cache {
 	int objsize;		/* The size of an object without meta data */
 	struct kmem_cache_order_objects oo;
 	int batch;		/* batch size */
+	int alien_shift;	/* Shift to size alien caches */
 
 	/* Allocation and freeing of slabs */
 	struct kmem_cache_order_objects max;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [S+Q3 22/23] slub: Cached object expiration
  2010-08-04  2:45 [S+Q3 00/23] SLUB: The Unified slab allocator (V3) Christoph Lameter
                   ` (20 preceding siblings ...)
  2010-08-04  2:45 ` [S+Q3 21/23] slub: Support Alien Caches Christoph Lameter
@ 2010-08-04  2:45 ` Christoph Lameter
  2010-08-04  2:45 ` [S+Q3 23/23] vmscan: Tie slub object expiration into page reclaim Christoph Lameter
  2010-08-04  4:39 ` [S+Q3 00/23] SLUB: The Unified slab allocator (V3) David Rientjes
  23 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04  2:45 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, linux-kernel, Nick Piggin, David Rientjes

[-- Attachment #1: unified_expire --]
[-- Type: text/plain, Size: 11191 bytes --]

Provides a variety of functions that allow expiring objects from slabs.

kmem_cache_expire(struct kmem_cache *s, int node)
	Expire objects of a specific slab cache.

kmem_cache_expire_all(int node)
	Walk through all caches and expire objects.

Functions return the number of bytes reclaimed.
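
As a rough in-kernel usage sketch (not part of the patch; the function
name shed_slab_memory(), the my_cache pointer and the nid argument are
made up, only the two exported calls come from this series):

	#include <linux/numa.h>
	#include <linux/slab.h>

	/* Sketch only: try one cache on one node, then fall back to everything */
	static unsigned long shed_slab_memory(struct kmem_cache *my_cache, int nid)
	{
		unsigned long reclaimed;

		/* Expire cached objects of a single cache on node nid ... */
		reclaimed = kmem_cache_expire(my_cache, nid);

		/* ... and if nothing came back, walk all caches globally. */
		if (!reclaimed)
			reclaimed = kmem_cache_expire_all(NUMA_NO_NODE);

		return reclaimed;
	}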

Object expiration works in stages that are ordered by how performance
sensitive the cached data is. Expiration can be called multiple times
and will then gradually touch more and more performance sensitive
cached data.

Levels of expiration

first		Empty partial slabs
		Alien caches
		Shared caches
last		Cpu caches

Manual expiration may be done by using the sysfs filesystem.

	/sys/kernel/slab/<cache>/expire

can take a node number or -1 for global expiration.

A cat of the file will display the number of bytes reclaimed by the
most recent expiration run.
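
For example (illustrative transcript only; the cache name and the byte
count shown are made up): writing a node number, or -1, triggers a run
and a subsequent cat shows the bytes reclaimed by it:

	# echo 0 > /sys/kernel/slab/kmalloc-64/expire
	# cat /sys/kernel/slab/kmalloc-64/expire
	16384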

SLAB performs a scan of all its slabs every 2 seconds.
The approach here means that the user (or the kernel) has more
control over when cached data is expired and thereby over the
times at which the OS may disturb the application with extensive
processing.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>


---
 include/linux/slab.h     |    3 
 include/linux/slub_def.h |    1 
 mm/slub.c                |  283 ++++++++++++++++++++++++++++++++++++++---------
 3 files changed, 238 insertions(+), 49 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-08-03 21:19:00.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-08-03 21:19:00.000000000 -0500
@@ -3320,6 +3320,213 @@ void kfree(const void *x)
 }
 EXPORT_SYMBOL(kfree);
 
+static struct list_head *alloc_slabs_by_inuse(struct kmem_cache *s)
+{
+	int objects = oo_objects(s->max);
+	struct list_head *h =
+		kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL);
+
+	return h;
+}
+
+static int shrink_partial_list(struct kmem_cache *s, int node,
+				struct list_head *slabs_by_inuse)
+{
+	int i;
+	struct kmem_cache_node *n = get_node(s, node);
+	struct page *page;
+	struct page *t;
+	int reclaimed = 0;
+	unsigned long flags;
+	int objects = oo_objects(s->max);
+
+	if (!n->nr_partial)
+		return 0;
+
+	for (i = 0; i < objects; i++)
+		INIT_LIST_HEAD(slabs_by_inuse + i);
+
+	spin_lock_irqsave(&n->list_lock, flags);
+
+	/*
+	 * Build lists indexed by the items in use in each slab.
+	 *
+	 * Note that concurrent frees may occur while we hold the
+	 * list_lock. page->inuse here is the upper limit.
+	 */
+	list_for_each_entry_safe(page, t, &n->partial, lru) {
+		if (all_objects_available(page) && slab_trylock(page)) {
+			/*
+			 * Must hold slab lock here because slab_free
+			 * may have freed the last object and be
+			 * waiting to release the slab.
+			 */
+			list_del(&page->lru);
+			n->nr_partial--;
+			slab_unlock(page);
+			discard_slab(s, page);
+			reclaimed++;
+		} else {
+			list_move(&page->lru,
+			slabs_by_inuse + inuse(page));
+		}
+	}
+
+	/*
+	 * Rebuild the partial list with the slabs filled up most
+	 * first and the least used slabs at the end.
+	 * This will cause the partial list to be shrunk during
+	 * allocations and memory to be freed up when more objects
+	 * are freed in pages at the tail.
+	 */
+	for (i = objects - 1; i >= 0; i--)
+		list_splice(slabs_by_inuse + i, n->partial.prev);
+
+	spin_unlock_irqrestore(&n->list_lock, flags);
+	return reclaimed;
+}
+
+static int expire_cache(struct kmem_cache *s, struct kmem_cache_cpu *c,
+		struct kmem_cache_queue *q, int lock)
+{
+	unsigned long flags = 0;
+	int n;
+
+	/* No queue (f.e. alien_cache() returns NULL for the local node) */
+	if (!q || queue_empty(q))
+		return 0;
+
+	if (lock)
+		spin_lock(&q->lock);
+	else
+		local_irq_save(flags);
+
+	n = drain_queue(s, q, s->batch);
+
+	if (lock)
+		spin_unlock(&q->lock);
+	else
+		local_irq_restore(flags);
+
+	return n;
+}
+
+/*
+ * Cache expiration is called when the kernel is low on memory in a node
+ * or globally (specify node == NUMA_NO_NODE).
+ *
+ * Cache expiration works by reducing caching memory used by the allocator.
+ * It starts with caches that are not that important for performance.
+ * If it cannot retrieve memory in a low importance cache then it will
+ * start expiring data from more important caches.
+ * The function returns 0 when all caches have been expired and no
+ * objects are cached anymore.
+ *
+ * low impact	 	Dropping of empty partial list slabs
+ *			Drop a batch from the alien caches
+ *                      Drop a batch from the shared caches
+ * high impact		Drop a batch from the cpu caches
+ */
+
+unsigned long kmem_cache_expire(struct kmem_cache *s, int node)
+{
+	struct list_head *slabs_by_inuse = alloc_slabs_by_inuse(s);
+	int reclaimed = 0;
+	int cpu;
+	cpumask_var_t saved_mask;
+
+	if (!slabs_by_inuse)
+		return -ENOMEM;
+
+	if (node != NUMA_NO_NODE)
+		reclaimed = shrink_partial_list(s, node, slabs_by_inuse)
+				* (PAGE_SIZE << oo_order(s->oo));
+	else {
+		int n;
+
+		for_each_node_state(n, N_NORMAL_MEMORY)
+			reclaimed +=
+				shrink_partial_list(s, n, slabs_by_inuse)
+					* (PAGE_SIZE << oo_order(s->oo));
+	}
+
+	kfree(slabs_by_inuse);
+
+	if (reclaimed)
+		return reclaimed;
+#ifdef CONFIG_NUMA
+	if (aliens(s))
+	    for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu, cpu);
+
+		if (!c->q.shared)
+			continue;
+
+		if (node == NUMA_NO_NODE) {
+			int x;
+
+			for_each_online_node(x)
+				reclaimed += expire_cache(s, c,
+					alien_cache(s, c, x), 1) * s->size;
+
+		} else
+		if (c->node != node)
+			reclaimed += expire_cache(s, c,
+				alien_cache(s, c, node), 1) * s->size;
+	}
+
+	if (reclaimed)
+		return reclaimed;
+#endif
+
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu, cpu);
+
+		if (!c->q.shared)
+			continue;
+
+		if (node != NUMA_NO_NODE && c->node != node)
+			continue;
+
+		reclaimed += expire_cache(s, c, c->q.shared, 1) * s->size;
+	}
+
+	if (reclaimed)
+		return reclaimed;
+
+	if (alloc_cpumask_var(&saved_mask, GFP_KERNEL)) {
+		cpumask_copy(saved_mask, &current->cpus_allowed);
+		for_each_online_cpu(cpu) {
+			struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu, cpu);
+
+			if (node != NUMA_NO_NODE && c->node != node)
+				continue;
+
+			/*
+			 * Switch affinity to the target cpu to allow access
+			 * to the cpu cache
+			 */
+			set_cpus_allowed_ptr(current, cpumask_of(cpu));
+			reclaimed += expire_cache(s, c, &c->q, 0) * s->size;
+		}
+		set_cpus_allowed_ptr(current, saved_mask);
+		free_cpumask_var(saved_mask);
+	}
+
+	return reclaimed;
+}
+
+unsigned long kmem_cache_expire_all(int node)
+{
+	struct kmem_cache *s;
+	unsigned long n = 0;
+
+	down_read(&slub_lock);
+	list_for_each_entry(s, &slab_caches, list)
+		n += kmem_cache_expire(s, node);
+	up_read(&slub_lock);
+	return n;
+}
+
 /*
  * kmem_cache_shrink removes empty slabs from the partial lists and sorts
  * the remaining slabs by the number of items in use. The slabs with the
@@ -3333,62 +3540,16 @@ EXPORT_SYMBOL(kfree);
 int kmem_cache_shrink(struct kmem_cache *s)
 {
 	int node;
-	int i;
-	struct kmem_cache_node *n;
-	struct page *page;
-	struct page *t;
-	int objects = oo_objects(s->max);
-	struct list_head *slabs_by_inuse =
-		kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL);
-	unsigned long flags;
+	struct list_head *slabs_by_inuse = alloc_slabs_by_inuse(s);
 
 	if (!slabs_by_inuse)
 		return -ENOMEM;
 
 	flush_all(s);
-	for_each_node_state(node, N_NORMAL_MEMORY) {
-		n = get_node(s, node);
-
-		if (!n->nr_partial)
-			continue;
-
-		for (i = 0; i < objects; i++)
-			INIT_LIST_HEAD(slabs_by_inuse + i);
-
-		spin_lock_irqsave(&n->list_lock, flags);
 
-		/*
-		 * Build lists indexed by the items in use in each slab.
-		 *
-		 * Note that concurrent frees may occur while we hold the
-		 * list_lock. page->inuse here is the upper limit.
-		 */
-		list_for_each_entry_safe(page, t, &n->partial, lru) {
-			if (all_objects_available(page) && slab_trylock(page)) {
-				/*
-				 * Must hold slab lock here because slab_free
-				 * may have freed the last object and be
-				 * waiting to release the slab.
-				 */
-				list_del(&page->lru);
-				n->nr_partial--;
-				slab_unlock(page);
-				discard_slab(s, page);
-			} else {
-				list_move(&page->lru,
-				slabs_by_inuse + inuse(page));
-			}
-		}
+	for_each_node_state(node, N_NORMAL_MEMORY)
+		shrink_partial_list(s, node, slabs_by_inuse);
 
-		/*
-		 * Rebuild the partial list with the slabs filled up most
-		 * first and the least used slabs at the end.
-		 */
-		for (i = objects - 1; i >= 0; i--)
-			list_splice(slabs_by_inuse + i, n->partial.prev);
-
-		spin_unlock_irqrestore(&n->list_lock, flags);
-	}
 
 	kfree(slabs_by_inuse);
 	return 0;
@@ -4867,6 +5028,29 @@ static ssize_t shrink_store(struct kmem_
 }
 SLAB_ATTR(shrink);
 
+static ssize_t expire_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%lu\n", s->last_expired_bytes);
+}
+
+static ssize_t expire_store(struct kmem_cache *s,
+			const char *buf, size_t length)
+{
+	long node;
+	int err;
+
+	err = strict_strtol(buf, 10, &node);
+	if (err)
+		return err;
+
+	if (node >= nr_node_ids || node < -1)
+		return -EINVAL;
+
+	s->last_expired_bytes = kmem_cache_expire(s, node);
+	return length;
+}
+SLAB_ATTR(expire);
+
 static ssize_t alloc_calls_show(struct kmem_cache *s, char *buf)
 {
 	if (!(s->flags & SLAB_STORE_USER))
@@ -5158,6 +5342,7 @@ static struct attribute *slab_attrs[] = 
 	&store_user_attr.attr,
 	&validate_attr.attr,
 	&shrink_attr.attr,
+	&expire_attr.attr,
 	&alloc_calls_attr.attr,
 	&free_calls_attr.attr,
 #ifdef CONFIG_ZONE_DMA
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h	2010-08-03 21:18:41.000000000 -0500
+++ linux-2.6/include/linux/slab.h	2010-08-03 21:19:00.000000000 -0500
@@ -103,12 +103,15 @@ struct kmem_cache *kmem_cache_create(con
 			void (*)(void *));
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
+unsigned long kmem_cache_expire(struct kmem_cache *, int);
 void kmem_cache_free(struct kmem_cache *, void *);
 unsigned int kmem_cache_size(struct kmem_cache *);
 const char *kmem_cache_name(struct kmem_cache *);
 int kern_ptr_validate(const void *ptr, unsigned long size);
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
 
+unsigned long kmem_cache_expire_all(int node);
+
 /*
  * Please use this macro to create slab caches. Simply specify the
  * name of the structure and maybe some flags that are listed above.
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2010-08-03 21:19:00.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2010-08-03 21:19:00.000000000 -0500
@@ -101,6 +101,7 @@ struct kmem_cache {
 	struct list_head list;	/* List of slab caches */
 #ifdef CONFIG_SLUB_DEBUG
 	struct kobject kobj;	/* For sysfs */
+	unsigned long last_expired_bytes;
 #endif
 	int shared_queue_sysfs;	/* Desired shared queue size */
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [S+Q3 23/23] vmscan: Tie slub object expiration into page reclaim
  2010-08-04  2:45 [S+Q3 00/23] SLUB: The Unified slab allocator (V3) Christoph Lameter
                   ` (21 preceding siblings ...)
  2010-08-04  2:45 ` [S+Q3 22/23] slub: Cached object expiration Christoph Lameter
@ 2010-08-04  2:45 ` Christoph Lameter
  2010-08-04  4:39 ` [S+Q3 00/23] SLUB: The Unified slab allocator (V3) David Rientjes
  23 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04  2:45 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, linux-kernel, Nick Piggin, David Rientjes

[-- Attachment #1: unified_vmscan --]
[-- Type: text/plain, Size: 1517 bytes --]

We already do slab reclaim during page reclaim. Add a call to
object expiration in slub whenever shrink_slab() is called.
If the reclaim is zone specific then use the node of the zone
to restrict reclaim in slub.

Signed-off-by: Christoph Lameter <cl@linux.com>


---
 mm/vmscan.c |    4 ++++
 1 file changed, 4 insertions(+)

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2010-07-30 18:37:47.638837043 -0500
+++ linux-2.6/mm/vmscan.c	2010-07-30 18:57:44.867515416 -0500
@@ -1826,6 +1826,7 @@
 				sc->nr_reclaimed += reclaim_state->reclaimed_slab;
 				reclaim_state->reclaimed_slab = 0;
 			}
+			kmem_cache_expire_all(NUMA_NO_NODE);
 		}
 		total_scanned += sc->nr_scanned;
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
@@ -2133,6 +2134,7 @@
 			reclaim_state->reclaimed_slab = 0;
 			nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
 						lru_pages);
+			kmem_cache_expire_all(nid);
 			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
 			total_scanned += sc.nr_scanned;
 			if (zone->all_unreclaimable)
@@ -2640,6 +2642,8 @@
 		 */
 		sc.nr_reclaimed += slab_reclaimable -
 			zone_page_state(zone, NR_SLAB_RECLAIMABLE);
+
+		kmem_cache_expire_all(zone_to_nid(zone));
 	}
 
 	p->reclaim_state = NULL;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [S+Q3 03/23] slub: Use a constant for a unspecified node.
  2010-08-04  2:45 ` [S+Q3 03/23] slub: Use a constant for a unspecified node Christoph Lameter
@ 2010-08-04  3:34   ` David Rientjes
  2010-08-04 16:15     ` Christoph Lameter
  0 siblings, 1 reply; 47+ messages in thread
From: David Rientjes @ 2010-08-04  3:34 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, linux-mm, KAMEZAWA Hiroyuki, linux-kernel, Nick Piggin

On Tue, 3 Aug 2010, Christoph Lameter wrote:

> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c	2010-07-26 12:57:52.000000000 -0500
> +++ linux-2.6/mm/slub.c	2010-07-26 12:57:59.000000000 -0500
> @@ -1073,7 +1073,7 @@ static inline struct page *alloc_slab_pa
>  
>  	flags |= __GFP_NOTRACK;
>  
> -	if (node == -1)
> +	if (node == NUMA_NO_NODE)
>  		return alloc_pages(flags, order);
>  	else
>  		return alloc_pages_exact_node(node, flags, order);
> @@ -1387,7 +1387,7 @@ static struct page *get_any_partial(stru
>  static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
>  {
>  	struct page *page;
> -	int searchnode = (node == -1) ? numa_node_id() : node;
> +	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
>  
>  	page = get_partial_node(get_node(s, searchnode));
>  	if (page || (flags & __GFP_THISNODE) || node != -1)

This has a merge conflict with 2.6.35 since it has this:

	page = get_partial_node(get_node(s, searchnode));
	if (page || (flags & __GFP_THISNODE))
		return page;

	return get_any_partial(s, flags);

so what happened to the dropped check for returning get_any_partial() when 
node != -1?  I added the check for benchmarking.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)
  2010-08-04  2:45 [S+Q3 00/23] SLUB: The Unified slab allocator (V3) Christoph Lameter
                   ` (22 preceding siblings ...)
  2010-08-04  2:45 ` [S+Q3 23/23] vmscan: Tie slub object expiration into page reclaim Christoph Lameter
@ 2010-08-04  4:39 ` David Rientjes
  2010-08-04 16:17   ` Christoph Lameter
  23 siblings, 1 reply; 47+ messages in thread
From: David Rientjes @ 2010-08-04  4:39 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-mm, linux-kernel, Nick Piggin

On Tue, 3 Aug 2010, Christoph Lameter wrote:

> The following is a first release of an allocator based on SLAB
> and SLUB that integrates the best approaches from both allocators. The
> per cpu queuing is like the two prior releases. The NUMA facilities
> were much improved vs V2. Shared and alien cache support was added to
> track the cache hot state of objects. 
> 

This insta-reboots on my netperf benchmarking servers (but works with 
numa=off), so I'll have to wait until I can hook up a serial before 
benchmarking this series.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [S+Q3 03/23] slub: Use a constant for a unspecified node.
  2010-08-04  3:34   ` David Rientjes
@ 2010-08-04 16:15     ` Christoph Lameter
  2010-08-05  7:40       ` David Rientjes
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04 16:15 UTC (permalink / raw)
  To: David Rientjes
  Cc: Pekka Enberg, linux-mm, KAMEZAWA Hiroyuki, linux-kernel, Nick Piggin

On Tue, 3 Aug 2010, David Rientjes wrote:

> >  static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
> >  {
> >  	struct page *page;
> > -	int searchnode = (node == -1) ? numa_node_id() : node;
> > +	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> >
> >  	page = get_partial_node(get_node(s, searchnode));
> >  	if (page || (flags & __GFP_THISNODE) || node != -1)
>
> This has a merge conflict with 2.6.35 since it has this:
>
> 	page = get_partial_node(get_node(s, searchnode));
> 	if (page || (flags & __GFP_THISNODE))
> 		return page;
>
> 	return get_any_partial(s, flags);
>
> so what happened to the dropped check for returning get_any_partial() when
> node != -1?  I added the check for benchmarking.

Strange, no merge conflict here. Are you sure you are using upstream?

GFP_THISNODE does not matter too much. If page == NULL then we failed
to allocate a page on a specific node and have to either give up (and then
extend the slab) or take a page from another node.

We always have to give up and go to the page allocator if GFP_THISNODE was
set. The modification is to additionally go to the page allocator if
a node was specified even without GFP_THISNODE. So checking for
GFP_THISNODE does not make sense anymore.




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)
  2010-08-04  4:39 ` [S+Q3 00/23] SLUB: The Unified slab allocator (V3) David Rientjes
@ 2010-08-04 16:17   ` Christoph Lameter
  2010-08-05  8:38     ` David Rientjes
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Lameter @ 2010-08-04 16:17 UTC (permalink / raw)
  To: David Rientjes; +Cc: Pekka Enberg, linux-mm, linux-kernel, Nick Piggin

On Tue, 3 Aug 2010, David Rientjes wrote:

> On Tue, 3 Aug 2010, Christoph Lameter wrote:
>
> > The following is a first release of an allocator based on SLAB
> > and SLUB that integrates the best approaches from both allocators. The
> > per cpu queuing is like the two prior releases. The NUMA facilities
> > were much improved vs V2. Shared and alien cache support was added to
> > track the cache hot state of objects.
> >
>
> This insta-reboots on my netperf benchmarking servers (but works with
> numa=off), so I'll have to wait until I can hook up a serial before
> benchmarking this series.

There are potential issues with

1. The size of per cpu reservation on bootup and the new percpu code that
allows allocations for per cpu areas during bootup. Sometimes I wonder if I
should just go back to static allocs for that.

2. The topology information provided by the machine for the cache setup.

3. My code of course.

A boot log would be appreciated.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [S+Q3 03/23] slub: Use a constant for a unspecified node.
  2010-08-04 16:15     ` Christoph Lameter
@ 2010-08-05  7:40       ` David Rientjes
  0 siblings, 0 replies; 47+ messages in thread
From: David Rientjes @ 2010-08-05  7:40 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, linux-mm, KAMEZAWA Hiroyuki, linux-kernel, Nick Piggin

On Wed, 4 Aug 2010, Christoph Lameter wrote:

> > >  static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
> > >  {
> > >  	struct page *page;
> > > -	int searchnode = (node == -1) ? numa_node_id() : node;
> > > +	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> > >
> > >  	page = get_partial_node(get_node(s, searchnode));
> > >  	if (page || (flags & __GFP_THISNODE) || node != -1)
> >
> > This has a merge conflict with 2.6.35 since it has this:
> >
> > 	page = get_partial_node(get_node(s, searchnode));
> > 	if (page || (flags & __GFP_THISNODE))
> > 		return page;
> >
> > 	return get_any_partial(s, flags);
> >
> > so what happened to the dropped check for returning get_any_partial() when
> > node != -1?  I added the check for benchmarking.
> 
> Strange no merge conflict here. Are you sure you use upstream?
> 

Yes, 2.6.35 does not have the node != -1 check and Linus hasn't pulled 
slub/fixes from Pekka's tree yet.  Even when he does, "slub numa: Fix rare 
allocation from unexpected node" removes the __GFP_THISNODE check before 
adding node != -1, so this definitely doesn't apply to anybody else's 
tree.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)
  2010-08-04 16:17   ` Christoph Lameter
@ 2010-08-05  8:38     ` David Rientjes
  2010-08-05 17:33       ` Christoph Lameter
  0 siblings, 1 reply; 47+ messages in thread
From: David Rientjes @ 2010-08-05  8:38 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-mm, linux-kernel, Nick Piggin

On Wed, 4 Aug 2010, Christoph Lameter wrote:

> > This insta-reboots on my netperf benchmarking servers (but works with
> > numa=off), so I'll have to wait until I can hook up a serial before
> > benchmarking this series.
> 
> There are potential issues with
> 
> 1. The size of per cpu reservation on bootup and the new percpu code that
> allows allocations for per cpu areas during bootup. Sometime I wonder if I
> should just go back to static allocs for that.
> 
> 2. The topology information provided by the machine for the cache setup.
> 
> 3. My code of course.
> 

I bisected this to patch 8 but still don't have a bootlog.  I'm assuming 
in the meantime that something is kmallocing DMA memory on this machine 
prior to kmem_cache_init_late() and get_slab() is returning a NULL 
pointer.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)
  2010-08-05  8:38     ` David Rientjes
@ 2010-08-05 17:33       ` Christoph Lameter
  2010-08-17  4:56         ` David Rientjes
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Lameter @ 2010-08-05 17:33 UTC (permalink / raw)
  To: David Rientjes; +Cc: Pekka Enberg, linux-mm, linux-kernel, Nick Piggin

On Thu, 5 Aug 2010, David Rientjes wrote:

> I bisected this to patch 8 but still don't have a bootlog.  I'm assuming
> in the meantime that something is kmallocing DMA memory on this machine
> prior to kmem_cache_init_late() and get_slab() is returning a NULL
> pointer.

There is a kernel option "earlyprintk=..." that allows you to see early
boot messages.

If this indeed is a problem with the DMA caches then try the following
patch:



Subject: slub: Move dma cache initialization up

Do dma kmalloc initialization in kmem_cache_init and not in kmem_cache_init_late()

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 mm/slub.c |    9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-08-05 12:24:21.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-08-05 12:28:58.000000000 -0500
@@ -3866,13 +3866,8 @@ void __init kmem_cache_init(void)
 #ifdef CONFIG_SMP
 	register_cpu_notifier(&slab_notifier);
 #endif
-}

-void __init kmem_cache_init_late(void)
-{
 #ifdef CONFIG_ZONE_DMA
-	int i;
-
 	/* Create the dma kmalloc array and make it operational */
 	for (i = 0; i < SLUB_PAGE_SHIFT; i++) {
 		struct kmem_cache *s = kmalloc_caches[i];
@@ -3891,6 +3886,10 @@ void __init kmem_cache_init_late(void)
 #endif
 }

+void __init kmem_cache_init_late(void)
+{
+}
+
 /*
  * Find a mergeable slab cache
  */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)
  2010-08-05 17:33       ` Christoph Lameter
@ 2010-08-17  4:56         ` David Rientjes
  2010-08-17  7:55           ` Tejun Heo
  2010-08-17 17:23           ` Christoph Lameter
  0 siblings, 2 replies; 47+ messages in thread
From: David Rientjes @ 2010-08-17  4:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, linux-mm, linux-kernel, Nick Piggin, Tejun Heo

On Thu, 5 Aug 2010, Christoph Lameter wrote:

> > I bisected this to patch 8 but still don't have a bootlog.  I'm assuming
> > in the meantime that something is kmallocing DMA memory on this machine
> > prior to kmem_cache_init_late() and get_slab() is returning a NULL
> > pointer.
> 
> There is a kernel option "earlyprintk=..." that allows you to see early
> boot messages.
> 

Ok, so this is panicking because of the error handling when trying to 
create sysfs directories with the same name (in this case, :dt-0000064).  
I'll look into why this isn't failing gracefully later, but I isolated 
this to the new code that statically allocates the DMA caches in 
kmem_cache_init_late().

The iteration runs from 0 to SLUB_PAGE_SHIFT; that's actually incorrect 
since the kmem_cache_node cache occupies the first spot in the 
kmalloc_caches array and has a size, 64 bytes, equal to a power of two 
that is duplicated later.  So this patch tries creating two DMA kmalloc 
caches with 64 byte object size which triggers a BUG_ON() during 
kmem_cache_release() in the error handling later.

The fix is to start the iteration at 1 instead of 0 so that all other 
caches have their equivalent DMA caches created and the special-case 
kmem_cache_node cache is excluded (see below).

I'm really curious why nobody else ran into this problem before, 
especially if they have CONFIG_SLUB_DEBUG enabled so 
struct kmem_cache_node has the same size.  Perhaps my early bug report 
caused people not to test the series...

I'm adding Tejun Heo to the cc because of another thing that may be 
problematic: alloc_percpu() allocates GFP_KERNEL memory, so when we try to 
allocate kmem_cache_cpu for a DMA cache we may be returning memory from a 
node that doesn't include lowmem so there will be no affinity between the 
struct and the slab.  I'm wondering if it would be better for the percpu 
allocator to be extended for kzalloc_node(), or vmalloc_node(), when 
allocating memory after the slab layer is up.

There're a couple more issues with the patch as well:

 - the entire iteration in kmem_cache_init_late() needs to be protected by 
   slub_lock.  The comment in create_kmalloc_cache() should be revised 
   since you're no longer calling it only with irqs disabled.  
   kmem_cache_init_late() has irqs enabled and, thus, slab_caches must be 
   protected.

 - a BUG_ON(!name) needs to be added in kmem_cache_init_late() when 
   kasprintf() returns NULL.  This isn't checked in kmem_cache_open() so 
   it'll only encounter a problem in the sysfs layer.  Adding a BUG_ON() 
   will help track those down.

Otherwise, I didn't find any problem with removing the dynamic DMA cache 
allocation on my machines.

Please fold this into patch 8.

Signed-off-by: David Rientjes <rientjes@google.com>
---
diff --git a/mm/slub.c b/mm/slub.c
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2552,13 +2552,12 @@ static int __init setup_slub_nomerge(char *str)
 
 __setup("slub_nomerge", setup_slub_nomerge);
 
+/*
+ * Requires slub_lock if called when irqs are enabled after early boot.
+ */
 static void create_kmalloc_cache(struct kmem_cache *s,
 		const char *name, int size, unsigned int flags)
 {
-	/*
-	 * This function is called with IRQs disabled during early-boot on
-	 * single CPU so there's no need to take slub_lock here.
-	 */
 	if (!kmem_cache_open(s, name, size, ARCH_KMALLOC_MINALIGN,
 								flags, NULL))
 		goto panic;
@@ -3063,17 +3062,20 @@ void __init kmem_cache_init_late(void)
 #ifdef CONFIG_ZONE_DMA
 	int i;
 
-	for (i = 0; i < SLUB_PAGE_SHIFT; i++) {
+	down_write(&slub_lock);
+	for (i = 1; i < SLUB_PAGE_SHIFT; i++) {
 		struct kmem_cache *s = &kmalloc_caches[i];
 
-		if (s && s->size) {
+		if (s->size) {
 			char *name = kasprintf(GFP_KERNEL,
 				 "dma-kmalloc-%d", s->objsize);
 
+			BUG_ON(!name);
 			create_kmalloc_cache(&kmalloc_dma_caches[i],
 				name, s->objsize, SLAB_CACHE_DMA);
 		}
 	}
+	up_write(&slub_lock);
 #endif
 }
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [S+Q3 20/23] slub: Shared cache to exploit cross cpu caching abilities.
  2010-08-04  2:45 ` [S+Q3 20/23] slub: Shared cache to exploit cross cpu caching abilities Christoph Lameter
@ 2010-08-17  5:52   ` David Rientjes
  2010-08-17 17:51     ` Christoph Lameter
  0 siblings, 1 reply; 47+ messages in thread
From: David Rientjes @ 2010-08-17  5:52 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-mm, linux-kernel, Nick Piggin

On Tue, 3 Aug 2010, Christoph Lameter wrote:

> Strictly a performance enhancement by better tracking of objects
> that are likely in the lowest cpu caches of processors.
> 
> SLAB uses one shared cache per NUMA node or one globally. However, that
> is not satisfactory for contemporary cpus. Those may have multiple
> independent cpu caches per node. SLAB in these situations treats
> cache cold objects like cache hot objects.
> 
> The shared caches of slub are per physical cpu cache for all cpus using
> that cache. Shared cache content will not cross physical caches.
> 
> The shared cache can be dynamically configured via
> /sys/kernel/slab/<cache>/shared_queue
> 
> The current shared cache state is available via
> cat /sys/kernel/slab/<cache>/shared_caches
> 
> Shared caches are always allocated in the sizes available in the kmalloc
> array. Cache sizes are rounded up to the sizes available.
> 
> F.e. on my Dell with 8 cpus in 2 packages, in which every 2 cpus share
> an l2 cache, I get:
> 
> christoph@:/sys/kernel/slab$ cat kmalloc-64/shared_caches
> 384 C0,2=66/126 C1,3=126/126 C4,6=126/126 C5,7=66/126
> christoph@:/sys/kernel/slab$ cat kmalloc-64/per_cpu_caches
> 617 C0=54/125 C1=37/125 C2=102/125 C3=76/125 C4=81/125 C5=108/125 C6=72/125 C7=87/125
> 

This explodes on the memset() in slab_alloc() because of __GFP_ZERO on my 
system:

[    1.922641] BUG: unable to handle kernel paging request at 0000007e7e581f70
[    1.923625] IP: [<ffffffff811053ee>] slab_alloc+0x549/0x590
[    1.923625] PGD 0 
[    1.923625] Oops: 0002 [#1] SMP 
[    1.923625] last sysfs file: 
[    1.923625] CPU 12 
[    1.923625] Modules linked in:
[    1.923625] 
[    1.923625] Pid: 1, comm: swapper Not tainted 2.6.35-slubq #1
[    1.923625] RIP: 0010:[<ffffffff811053ee>]  [<ffffffff811053ee>] slab_alloc+0x549/0x590
[    1.923625] RSP: 0000:ffff88047e09dd30  EFLAGS: 00010246
[    1.923625] RAX: 0000000000000000 RBX: ffff88047fc04500 RCX: 0000000000000010
[    1.923625] RDX: 0000000000000003 RSI: 0000000000000348 RDI: 0000007e7e581f70
[    1.923625] RBP: ffff88047e09dde0 R08: ffff88048e200000 R09: ffffffff81ad2c70
[    1.923625] R10: ffff88107e51fd20 R11: 0000000000000000 R12: 0000007e7e581f70
[    1.923625] R13: 0000000000000001 R14: ffff880c7e54eb28 R15: 00000000000080d0
[    1.923625] FS:  0000000000000000(0000) GS:ffff880c8e200000(0000) knlGS:0000000000000000
[    1.923625] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[    1.923625] CR2: 0000007e7e581f70 CR3: 0000000001a04000 CR4: 00000000000006e0
[    1.923625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    1.923625] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[    1.923625] Process swapper (pid: 1, threadinfo ffff88047e09c000, task ffff88107e468000)
[    1.923625] Stack:
[    1.923625]  ffff88047e09dd60 ffffffff81162c4d 0000000000000008 ffff88087dd5f870
[    1.923625] <0> ffff88047e09dfd8 ffffffff81106e14 ffff88047e09dd80 ffff88107e468670
[    1.923625] <0> ffff88107e468670 ffff88107e468000 ffff88047e09ddd0 ffff88107e468000
[    1.923625] Call Trace:
[    1.923625]  [<ffffffff81162c4d>] ? sysfs_find_dirent+0x3f/0x58
[    1.923625]  [<ffffffff81106e14>] ? alloc_shared_caches+0x10f/0x277
[    1.923625]  [<ffffffff811060f8>] __kmalloc_node+0x78/0xa3
[    1.923625]  [<ffffffff81106e14>] alloc_shared_caches+0x10f/0x277
[    1.923625]  [<ffffffff811065e8>] ? kfree+0x85/0x8d
[    1.923625]  [<ffffffff81b09661>] slab_sysfs_init+0x96/0x10a
[    1.923625]  [<ffffffff81b095cb>] ? slab_sysfs_init+0x0/0x10a
[    1.923625]  [<ffffffff810001f9>] do_one_initcall+0x5e/0x14e
[    1.923625]  [<ffffffff81aec6bb>] kernel_init+0x178/0x202
[    1.923625]  [<ffffffff81030954>] kernel_thread_helper+0x4/0x10
[    1.923625]  [<ffffffff81aec543>] ? kernel_init+0x0/0x202
[    1.923625]  [<ffffffff81030950>] ? kernel_thread_helper+0x0/0x10
[    1.923625] Code: 95 78 ff ff ff 4c 89 e6 48 89 df e8 13 f4 ff ff 85 c0 0f 84 44 fb ff ff ff 75 b0 9d 66 45 85 ff 79 3b 48 63 4b 14 31 c0 4c 89 e7 <f3> aa eb 2e ff 75 b0 9d 41 f7 c7 00 02 00 00 75 1e 48 c7 c7 10 
[    1.923625] RIP  [<ffffffff811053ee>] slab_alloc+0x549/0x590
[    1.923625]  RSP <ffff88047e09dd30>
[    1.923625] CR2: 0000007e7e581f70

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)
  2010-08-17  4:56         ` David Rientjes
@ 2010-08-17  7:55           ` Tejun Heo
  2010-08-17 13:56             ` Christoph Lameter
  2010-08-17 17:23           ` Christoph Lameter
  1 sibling, 1 reply; 47+ messages in thread
From: Tejun Heo @ 2010-08-17  7:55 UTC (permalink / raw)
  To: David Rientjes
  Cc: Christoph Lameter, Pekka Enberg, linux-mm, linux-kernel, Nick Piggin

Hello,

On 08/17/2010 06:56 AM, David Rientjes wrote:
> I'm adding Tejun Heo to the cc because of another thing that may be 
> problematic: alloc_percpu() allocates GFP_KERNEL memory, so when we try to 
> allocate kmem_cache_cpu for a DMA cache we may be returning memory from a 
> node that doesn't include lowmem so there will be no affinity between the 
> struct and the slab.  I'm wondering if it would be better for the percpu 
> allocator to be extended for kzalloc_node(), or vmalloc_node(), when 
> allocating memory after the slab layer is up.

Hmmm... do you mean adding @gfp_mask to percpu allocation function?
I've been thinking about adding it for atomic allocations (Christoph,
do you still want it?).  I've been sort of against it because I
primarily don't really like atomic allocations (it often just pushes
error handling complexities elsewhere where it becomes more complex)
and it would also require making vmalloc code do atomic allocations.

Most of percpu use cases seem pretty happy with GFP_KERNEL allocation,
so I'm still quite reluctant to change that.  We can add a semi
internal interface w/ @gfp_mask but w/o GFP_ATOMIC support, which is a
bit ugly.  How important would this be?

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)
  2010-08-17  7:55           ` Tejun Heo
@ 2010-08-17 13:56             ` Christoph Lameter
  0 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-17 13:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Rientjes, Pekka Enberg, linux-mm, linux-kernel, Nick Piggin

On Tue, 17 Aug 2010, Tejun Heo wrote:

> Hello,
>
> On 08/17/2010 06:56 AM, David Rientjes wrote:
> > I'm adding Tejun Heo to the cc because of another thing that may be
> > problematic: alloc_percpu() allocates GFP_KERNEL memory, so when we try to
> > allocate kmem_cache_cpu for a DMA cache we may be returning memory from a
> > node that doesn't include lowmem so there will be no affinity between the
> > struct and the slab.  I'm wondering if it would be better for the percpu
> > allocator to be extended for kzalloc_node(), or vmalloc_node(), when
> > allocating memory after the slab layer is up.
>
> Hmmm... do you mean adding @gfp_mask to percpu allocation function?

DMA caches may only exist on certain nodes because others do not have a
DMA zone. Their role is quite limited these days. DMA caches allocated on
nodes without DMA zones would have their percpu area allocated on the node
but the DMA allocations would be redirected to the closest node with DMA
memory.

> I've been thinking about adding it for atomic allocations (Christoph,
> do you still want it?).  I've been sort of against it because I
> primarily don't really like atomic allocations (it often just pushes
> error handling complexities elsewhere where it becomes more complex)
> and it would also require making vmalloc code do atomic allocations.

At this point I would think that we do not need that support.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)
  2010-08-17  4:56         ` David Rientjes
  2010-08-17  7:55           ` Tejun Heo
@ 2010-08-17 17:23           ` Christoph Lameter
  2010-08-17 17:29             ` Christoph Lameter
  2010-08-17 18:02             ` David Rientjes
  1 sibling, 2 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-17 17:23 UTC (permalink / raw)
  To: David Rientjes
  Cc: Pekka Enberg, linux-mm, linux-kernel, Nick Piggin, Tejun Heo

On Mon, 16 Aug 2010, David Rientjes wrote:

> Ok, so this is panicking because of the error handling when trying to
> create sysfs directories with the same name (in this case, :dt-0000064).
> I'll look into why this isn't failing gracefully later, but I isolated
> this to the new code that statically allocates the DMA caches in
> kmem_cache_init_late().

Hmm.... Strange. The DMA caches should create a distinct pattern there.

> The iteration runs from 0 to SLUB_PAGE_SHIFT; that's actually incorrect
> since the kmem_cache_node cache occupies the first spot in the
> kmalloc_caches array and has a size, 64 bytes, equal to a power of two
> that is duplicated later.  So this patch tries creating two DMA kmalloc
> caches with 64 byte object size which triggers a BUG_ON() during
> kmem_cache_release() in the error handling later.

The kmem_cache_node cache is no longer at position 0.
kmalloc_caches[0] should be NULL and therefore be skipped.

> The fix is to start the iteration at 1 instead of 0 so that all other
> caches have their equivalent DMA caches created and the special-case
> kmem_cache_node cache is excluded (see below).
>
> I'm really curious why nobody else ran into this problem before,
> especially if they have CONFIG_SLUB_DEBUG enabled so
> struct kmem_cache_node has the same size.  Perhaps my early bug report
> caused people not to test the series...

Which patches were applied?

>  - the entire iteration in kmem_cache_init_late() needs to be protected by
>    slub_lock.  The comment in create_kmalloc_cache() should be revised
>    since you're no longer calling it only with irqs disabled.
>    kmem_cache_init_late() has irqs enabled and, thus, slab_caches must be
>    protected.

I moved it to kmem_cache_init() which is run when we only have one
execution thread. That takes care of the issue and ensures that the dma
caches are available as early as before.

>  - a BUG_ON(!name) needs to be added in kmem_cache_init_late() when
>    kasprintf() returns NULL.  This isn't checked in kmem_cache_open() so
>    it'll only encounter a problem in the sysfs layer.  Adding a BUG_ON()
>    will help track those down.

Ok.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)
  2010-08-17 17:23           ` Christoph Lameter
@ 2010-08-17 17:29             ` Christoph Lameter
  2010-08-17 18:02             ` David Rientjes
  1 sibling, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-17 17:29 UTC (permalink / raw)
  To: David Rientjes
  Cc: Pekka Enberg, linux-mm, linux-kernel, Nick Piggin, Tejun Heo

On Tue, 17 Aug 2010, Christoph Lameter wrote:

> > I'm really curious why nobody else ran into this problem before,
> > especially if they have CONFIG_SLUB_DEBUG enabled so
> > struct kmem_cache_node has the same size.  Perhaps my early bug report
> > caused people not to test the series...
>
> Which patches were applied?


If you do not apply all patches then you can be at a stage where
kmalloc_caches[0] is still used for kmem_cache_node. Then things break.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [S+Q3 20/23] slub: Shared cache to exploit cross cpu caching abilities.
  2010-08-17  5:52   ` David Rientjes
@ 2010-08-17 17:51     ` Christoph Lameter
  2010-08-17 18:42       ` David Rientjes
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Lameter @ 2010-08-17 17:51 UTC (permalink / raw)
  To: David Rientjes; +Cc: Pekka Enberg, linux-mm, linux-kernel, Nick Piggin

On Mon, 16 Aug 2010, David Rientjes wrote:

> This explodes on the memset() in slab_alloc() because of __GFP_ZERO on my
> system:

Well that seems to be because __kmalloc_node returned an invalid address. Run
with full debugging please?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)
  2010-08-17 17:23           ` Christoph Lameter
  2010-08-17 17:29             ` Christoph Lameter
@ 2010-08-17 18:02             ` David Rientjes
  2010-08-17 18:47               ` Christoph Lameter
  1 sibling, 1 reply; 47+ messages in thread
From: David Rientjes @ 2010-08-17 18:02 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, linux-mm, linux-kernel, Nick Piggin, Tejun Heo

On Tue, 17 Aug 2010, Christoph Lameter wrote:

> > Ok, so this is panicking because of the error handling when trying to
> > create sysfs directories with the same name (in this case, :dt-0000064).
> > I'll look into why this isn't failing gracefully later, but I isolated
> > this to the new code that statically allocates the DMA caches in
> > kmem_cache_init_late().
> 
> Hmm.... Strange. The DMA caches should create a distinct pattern there.
> 

They do after patch 11 when you introduce dynamically sized kmalloc 
caches, but not when only patches 1-8 are applied.  Since this wasn't 
booting on my system, I bisected the problem to patch 8 where 
kmem_cache_init_late() would create two DMA caches of size 64 bytes: one 
because of kmalloc_caches[0] (kmem_cache_node) and one because of 
kmalloc_caches[6] (2^6 = 64).  So my fixes are necessary for patch 8 but 
obsoleted later, and then the shared cache support panics on memset().

> >  - the entire iteration in kmem_cache_init_late() needs to be protected by
> >    slub_lock.  The comment in create_kmalloc_cache() should be revised
> >    since you're no longer calling it only with irqs disabled.
> >    kmem_cache_init_late() has irqs enabled and, thus, slab_caches must be
> >    protected.
> 
> I moved it to kmem_cache_init() which is run when we only have one
> execution thread. That takes care of the issue and ensures that the dma
> caches are available as early as before.
> 

I didn't know if that was a debugging patch for me or if you wanted to 
push that as part of your series; I'm not sure if you actually need to 
move it to kmem_cache_init() now that slub_state is protected by 
slub_lock.  I'm not sure if we want to allocate DMA objects between 
kmem_cache_init() and kmem_cache_init_late().

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [S+Q3 20/23] slub: Shared cache to exploit cross cpu caching abilities.
  2010-08-17 17:51     ` Christoph Lameter
@ 2010-08-17 18:42       ` David Rientjes
  2010-08-17 18:50         ` Christoph Lameter
  0 siblings, 1 reply; 47+ messages in thread
From: David Rientjes @ 2010-08-17 18:42 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-mm, linux-kernel, Nick Piggin

On Tue, 17 Aug 2010, Christoph Lameter wrote:

> > This explodes on the memset() in slab_alloc() because of __GFP_ZERO on my
> > system:
> 
> Well, that seems to be because __kmalloc_node returned an invalid address.
> Can you run with full debugging, please?
> 

Lots of data, so I trimmed it down to something reasonable by eliminating 
reports that were very similar.  (It also looks like some metadata is 
getting displayed incorrectly, such as negative pids and 10-digit cpu 
numbers.)

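For reference when reading the dumps, the poison and redzone byte values
come from include/linux/poison.h (the comments are my reading of how SLUB
uses them):

#define SLUB_RED_INACTIVE	0xbb	/* redzone while the object is free */
#define SLUB_RED_ACTIVE		0xcc	/* redzone while the object is allocated */
#define POISON_INUSE		0x5a	/* 'Z': padding / not yet allocated space */
#define POISON_FREE		0x6b	/* 'k': payload of a freed object */
#define POISON_END		0xa5	/* last byte of a poisoned area */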
[   14.152177] =============================================================================
[   14.153172] BUG kmalloc-16: Object padding overwritten
[   14.153172] -----------------------------------------------------------------------------
[   14.153172] 
[   14.153172] INFO: 0xffff88107e595ea8-0xffff88107e595eab. First byte 0x0 instead of 0x5a
[   14.153172] INFO: Freed in 0x7e00000000 age=18446743536838353798 cpu=0 pid=0
[   14.153172] INFO: Slab 0xffffea0039ba3898 objects=51 new=4 fp=0x0007800000000000 flags=0xe00000000000080
[   14.153172] INFO: Object 0xffff88107e595e60 @offset=3680
[   14.153172] 
[   14.153172] Bytes b4 0xffff88107e595e50:  00 00 00 00 7e 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ....~...ZZZZZZZZ
[   14.153172]   Object 0xffff88107e595e60:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkk
[   14.153172]  Redzone 0xffff88107e595e70:  bb bb bb bb bb bb bb bb                                 
[   14.153172]  Padding 0xffff88107e595ea8:  00 00 00 00 5a 5a 5a 5a                         ....ZZZZ        
[   14.153172] Pid: 1, comm: swapper Not tainted 2.6.35-slubq #1
[   14.153172] Call Trace:
[   14.153172]  [<ffffffff81104333>] print_trailer+0x134/0x13f
[   14.153172]  [<ffffffff811043f5>] check_bytes_and_report+0xb7/0xe8
[   14.153172]  [<ffffffff8110455e>] check_object+0x138/0x150
[   14.153172]  [<ffffffff81106f18>] ? alloc_shared_caches+0x10f/0x277
[   14.153172]  [<ffffffff811049bf>] alloc_debug_processing+0xd5/0x160
[   14.153172]  [<ffffffff811054d7>] slab_alloc+0x52e/0x590
[   14.153172]  [<ffffffff81106f18>] ? alloc_shared_caches+0x10f/0x277
[   14.153172]  [<ffffffff811061fc>] __kmalloc_node+0x78/0xa3
[   14.153172]  [<ffffffff81106f18>] alloc_shared_caches+0x10f/0x277
[   14.153172]  [<ffffffff81b09661>] slab_sysfs_init+0x96/0x10a
[   14.153172]  [<ffffffff81b095cb>] ? slab_sysfs_init+0x0/0x10a
[   14.153172]  [<ffffffff810001f9>] do_one_initcall+0x5e/0x14e
[   14.153172]  [<ffffffff81aec6bb>] kernel_init+0x178/0x202
[   14.153172]  [<ffffffff81030954>] kernel_thread_helper+0x4/0x10
[   14.153172]  [<ffffffff81aec543>] ? kernel_init+0x0/0x202
[   14.153172]  [<ffffffff81030950>] ? kernel_thread_helper+0x0/0x10
[   14.153172] FIX kmalloc-16: Restoring 0xffff88107e595ea8-0xffff88107e595eab=0x5a

...

[   15.751474] =============================================================================
[   15.752467] BUG kmalloc-16: Redzone overwritten
[   15.752467] -----------------------------------------------------------------------------
[   15.752467] 
[   15.752467] INFO: 0xffff880c7e5f3ec0-0xffff880c7e5f3ec7. First byte 0x30 instead of 0xbb
[   15.752467] INFO: Allocated in 0xffff88087e4f11e0 age=131909211166235 cpu=2119111312 pid=-30712
[   15.752467] INFO: Freed in 0xffff88087e4f13f0 age=131909211165707 cpu=2119111840 pid=-30712
[   15.752467] INFO: Slab 0xffffea002bba4d28 objects=51 new=3 fp=0x0007000000000000 flags=0xa00000000000080
[   15.752467] INFO: Object 0xffff880c7e5f3eb0 @offset=3760
[   15.752467] 
[   15.752467] Bytes b4 0xffff880c7e5f3ea0:  18 00 00 00 7e 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ....~...ZZZZZZZZ
[   15.752467]   Object 0xffff880c7e5f3eb0:  d0 0f 4f 7e 08 88 ff ff 80 10 4f 7e 08 88 ff ff .O~....O~..
[   15.752467]  Redzone 0xffff880c7e5f3ec0:  30 11 4f 7e 08 88 ff ff                         0.O~..        
[   15.752467]  Padding 0xffff880c7e5f3ef8:  00 16 4f 7e 08 88 ff ff                         ..O~..        
[   15.752467] Pid: 1, comm: swapper Not tainted 2.6.35-slubq #1
[   15.752467] Call Trace:
[   15.752467]  [<ffffffff81104333>] print_trailer+0x134/0x13f
[   15.752467]  [<ffffffff811043f5>] check_bytes_and_report+0xb7/0xe8
[   15.752467]  [<ffffffff81104481>] check_object+0x5b/0x150
[   15.752467]  [<ffffffff81106f18>] ? alloc_shared_caches+0x10f/0x277
[   15.752467]  [<ffffffff811049bf>] alloc_debug_processing+0xd5/0x160
[   15.752467]  [<ffffffff811054d7>] slab_alloc+0x52e/0x590
[   15.752467]  [<ffffffff81106f18>] ? alloc_shared_caches+0x10f/0x277
[   15.752467]  [<ffffffff811061fc>] __kmalloc_node+0x78/0xa3
[   15.752467]  [<ffffffff81106f18>] alloc_shared_caches+0x10f/0x277
[   15.752467]  [<ffffffff81b09661>] slab_sysfs_init+0x96/0x10a
[   15.752467]  [<ffffffff81b095cb>] ? slab_sysfs_init+0x0/0x10a
[   15.752467]  [<ffffffff810001f9>] do_one_initcall+0x5e/0x14e
[   15.752467]  [<ffffffff81aec6bb>] kernel_init+0x178/0x202
[   15.752467]  [<ffffffff81030954>] kernel_thread_helper+0x4/0x10
[   15.752467]  [<ffffffff81aec543>] ? kernel_init+0x0/0x202
[   15.752467]  [<ffffffff81030950>] ? kernel_thread_helper+0x0/0x10
[   15.752467] FIX kmalloc-16: Restoring 0xffff880c7e5f3ec0-0xffff880c7e5f3ec7=0xbb

...

[   15.752467] =============================================================================
[   15.752467] BUG kmalloc-16: Pointer check fails
[   15.752467] -----------------------------------------------------------------------------
[   15.752467] 
[   15.752467] INFO: Allocated in 0xffff880c7e5f4080 age=131874850999735 cpu=2119539736 pid=-30704
[   15.752467] INFO: Freed in 0xc00000000 age=18446743536838355610 cpu=4 pid=1
[   15.752467] INFO: Slab 0xffffea002bba4d28 objects=51 new=0 fp=0x(null) flags=0xa00000000000080
[   15.752467] INFO: Object 0xffff880c7e5f3ff0 @offset=4080
[   15.752467] 
[   15.752467] Bytes b4 0xffff880c7e5f3fe0:  00 00 00 00 7e 00 00 00 00 00 00 00 5a 5a 5a 5a ....~.......ZZZZ
[   15.752467]   Object 0xffff880c7e5f3ff0:  5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
[   15.752467]  Redzone 0xffff880c7e5f4000:  80 35 37 7e 04 88 ff ff                         .57~..        
[   15.752467]  Padding 0xffff880c7e5f4038:  00 00 00 00 00 00 00 00                         ........        
[   15.752467] Pid: 1, comm: swapper Not tainted 2.6.35-slubq #1
[   15.752467] Call Trace:
[   15.752467]  [<ffffffff81104333>] print_trailer+0x134/0x13f
[   15.752467]  [<ffffffff81106f18>] ? alloc_shared_caches+0x10f/0x277
[   15.752467]  [<ffffffff81104858>] object_err+0x3a/0x43
[   15.752467]  [<ffffffff81106f18>] ? alloc_shared_caches+0x10f/0x277
[   15.752467]  [<ffffffff811049ad>] alloc_debug_processing+0xc3/0x160
[   15.752467]  [<ffffffff811054d7>] slab_alloc+0x52e/0x590
[   15.752467]  [<ffffffff81106f18>] ? alloc_shared_caches+0x10f/0x277
[   15.752467]  [<ffffffff811061fc>] __kmalloc_node+0x78/0xa3
[   15.752467]  [<ffffffff81106f18>] alloc_shared_caches+0x10f/0x277
[   15.752467]  [<ffffffff81b09661>] slab_sysfs_init+0x96/0x10a
[   15.752467]  [<ffffffff81b095cb>] ? slab_sysfs_init+0x0/0x10a
[   15.752467]  [<ffffffff810001f9>] do_one_initcall+0x5e/0x14e
[   15.752467]  [<ffffffff81aec6bb>] kernel_init+0x178/0x202
[   15.752467]  [<ffffffff81030954>] kernel_thread_helper+0x4/0x10
[   15.752467]  [<ffffffff81aec543>] ? kernel_init+0x0/0x202
[   15.752467]  [<ffffffff81030950>] ? kernel_thread_helper+0x0/0x10
[   15.752467] FIX kmalloc-16: Marking all objects used


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)
  2010-08-17 18:02             ` David Rientjes
@ 2010-08-17 18:47               ` Christoph Lameter
  2010-08-17 18:54                 ` David Rientjes
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Lameter @ 2010-08-17 18:47 UTC (permalink / raw)
  To: David Rientjes
  Cc: Pekka Enberg, linux-mm, linux-kernel, Nick Piggin, Tejun Heo

On Tue, 17 Aug 2010, David Rientjes wrote:

> I didn't know if that was a debugging patch for me or if you wanted to
> push that as part of your series; I'm not sure if you actually need to
> move it to kmem_cache_init() now that slub_state is protected by
> slub_lock.  I'm not sure if we want to allocate DMA objects between
> kmem_cache_init() and kmem_cache_init_late().

Drivers may allocate dma buffers during initialization.

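A purely illustrative (hypothetical) example of what such a driver does,
which is why the DMA kmalloc caches have to exist that early:

#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>

static void *example_dma_buf;	/* hypothetical driver state */

static int __init example_driver_init(void)
{
	/* GFP_DMA allocations are served from the dma-kmalloc caches */
	example_dma_buf = kmalloc(64, GFP_KERNEL | GFP_DMA);
	if (!example_dma_buf)
		return -ENOMEM;
	return 0;
}
module_init(example_driver_init);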

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [S+Q3 20/23] slub: Shared cache to exploit cross cpu caching abilities.
  2010-08-17 18:42       ` David Rientjes
@ 2010-08-17 18:50         ` Christoph Lameter
  2010-08-17 19:02           ` David Rientjes
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Lameter @ 2010-08-17 18:50 UTC (permalink / raw)
  To: David Rientjes; +Cc: Pekka Enberg, linux-mm, linux-kernel, Nick Piggin

On Tue, 17 Aug 2010, David Rientjes wrote:

> On Tue, 17 Aug 2010, Christoph Lameter wrote:
>
> > > This explodes on the memset() in slab_alloc() because of __GFP_ZERO on my
> > > system:
> >
> > Well, that seems to be because __kmalloc_node returned an invalid address.
> > Can you run with full debugging, please?
> >
>
> Lots of data, so I trimmed it down to something reasonable by eliminating
> reports that were very similar.  (It also looks like some metadata is
> getting displayed incorrectly, such as negative pids and 10-digit cpu
> numbers.)

Well, yes, I guess that is the result of large-scale corruption that is
reaching into the debug fields of the object.

> [   15.752467]
> [   15.752467] INFO: 0xffff880c7e5f3ec0-0xffff880c7e5f3ec7. First byte 0x30 instead of 0xbb
> [   15.752467] INFO: Allocated in 0xffff88087e4f11e0 age=131909211166235 cpu=2119111312 pid=-30712
> [   15.752467] INFO: Freed in 0xffff88087e4f13f0 age=131909211165707 cpu=2119111840 pid=-30712
> [   15.752467] INFO: Slab 0xffffea002bba4d28 objects=51 new=3 fp=0x0007000000000000 flags=0xa00000000000080
> [   15.752467] INFO: Object 0xffff880c7e5f3eb0 @offset=3760
> [   15.752467]
> [   15.752467] Bytes b4 0xffff880c7e5f3ea0:  18 00 00 00 7e 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ....~...ZZZZZZZZ
> [   15.752467]   Object 0xffff880c7e5f3eb0:  d0 0f 4f 7e 08 88 ff ff 80 10 4f 7e 08 88 ff ff .O~....O~..
> [   15.752467]  Redzone 0xffff880c7e5f3ec0:  30 11 4f 7e 08 88 ff ff                         0.O~..
> [   15.752467]  Padding 0xffff880c7e5f3ef8:  00 16 4f 7e 08 88 ff ff                         ..O~..

16 bytes were allocated, but a pointer array much larger than that is being used.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)
  2010-08-17 18:47               ` Christoph Lameter
@ 2010-08-17 18:54                 ` David Rientjes
  2010-08-17 19:34                   ` Christoph Lameter
  0 siblings, 1 reply; 47+ messages in thread
From: David Rientjes @ 2010-08-17 18:54 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, linux-mm, linux-kernel, Nick Piggin, Tejun Heo

On Tue, 17 Aug 2010, Christoph Lameter wrote:

> > I didn't know if that was a debugging patch for me or if you wanted to
> > push that as part of your series; I'm not sure if you actually need to
> > move it to kmem_cache_init() now that slub_state is protected by
> > slub_lock.  I'm not sure if we want to allocate DMA objects between
> > kmem_cache_init() and kmem_cache_init_late().
> 
> Drivers may allocate dma buffers during initialization.
> 

Ok, I moved the DMA cache creation from kmem_cache_init_late() to 
kmem_cache_init().  Note: the kasprintf() will need to use GFP_NOWAIT and 
not GFP_KERNEL now.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [S+Q3 20/23] slub: Shared cache to exploit cross cpu caching abilities.
  2010-08-17 18:50         ` Christoph Lameter
@ 2010-08-17 19:02           ` David Rientjes
  2010-08-17 19:32             ` Christoph Lameter
  0 siblings, 1 reply; 47+ messages in thread
From: David Rientjes @ 2010-08-17 19:02 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-mm, linux-kernel, Nick Piggin

On Tue, 17 Aug 2010, Christoph Lameter wrote:

> Well, yes, I guess that is the result of large-scale corruption that is
> reaching into the debug fields of the object.
> 
> > [   15.752467]
> > [   15.752467] INFO: 0xffff880c7e5f3ec0-0xffff880c7e5f3ec7. First byte 0x30 instead of 0xbb
> > [   15.752467] INFO: Allocated in 0xffff88087e4f11e0 age=131909211166235 cpu=2119111312 pid=-30712
> > [   15.752467] INFO: Freed in 0xffff88087e4f13f0 age=131909211165707 cpu=2119111840 pid=-30712
> > [   15.752467] INFO: Slab 0xffffea002bba4d28 objects=51 new=3 fp=0x0007000000000000 flags=0xa00000000000080
> > [   15.752467] INFO: Object 0xffff880c7e5f3eb0 @offset=3760
> > [   15.752467]
> > [   15.752467] Bytes b4 0xffff880c7e5f3ea0:  18 00 00 00 7e 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ....~...ZZZZZZZZ
> > [   15.752467]   Object 0xffff880c7e5f3eb0:  d0 0f 4f 7e 08 88 ff ff 80 10 4f 7e 08 88 ff ff .O~....O~..
> > [   15.752467]  Redzone 0xffff880c7e5f3ec0:  30 11 4f 7e 08 88 ff ff                         0.O~..
> > [   15.752467]  Padding 0xffff880c7e5f3ef8:  00 16 4f 7e 08 88 ff ff                         ..O~..
> 
> 16 bytes were allocated, but a pointer array much larger than that is being used.
> 

Since the problem persists with and without CONFIG_SLUB_DEBUG_ON, I'd 
speculate that this is a problem with node scalability on my 4-node system 
if this boots fine for you.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [S+Q3 20/23] slub: Shared cache to exploit cross cpu caching abilities.
  2010-08-17 19:02           ` David Rientjes
@ 2010-08-17 19:32             ` Christoph Lameter
  2010-08-18 19:32               ` Christoph Lameter
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Lameter @ 2010-08-17 19:32 UTC (permalink / raw)
  To: David Rientjes; +Cc: Pekka Enberg, linux-mm, linux-kernel, Nick Piggin

On Tue, 17 Aug 2010, David Rientjes wrote:

> On Tue, 17 Aug 2010, Christoph Lameter wrote:
>
> > Well, yes, I guess that is the result of large-scale corruption that is
> > reaching into the debug fields of the object.
> >
> > > [   15.752467]
> > > [   15.752467] INFO: 0xffff880c7e5f3ec0-0xffff880c7e5f3ec7. First byte 0x30 instead of 0xbb
> > > [   15.752467] INFO: Allocated in 0xffff88087e4f11e0 age=131909211166235 cpu=2119111312 pid=-30712
> > > [   15.752467] INFO: Freed in 0xffff88087e4f13f0 age=131909211165707 cpu=2119111840 pid=-30712
> > > [   15.752467] INFO: Slab 0xffffea002bba4d28 objects=51 new=3 fp=0x0007000000000000 flags=0xa00000000000080
> > > [   15.752467] INFO: Object 0xffff880c7e5f3eb0 @offset=3760
> > > [   15.752467]
> > > [   15.752467] Bytes b4 0xffff880c7e5f3ea0:  18 00 00 00 7e 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ....~...ZZZZZZZZ
> > > [   15.752467]   Object 0xffff880c7e5f3eb0:  d0 0f 4f 7e 08 88 ff ff 80 10 4f 7e 08 88 ff ff .O~....O~..
> > > [   15.752467]  Redzone 0xffff880c7e5f3ec0:  30 11 4f 7e 08 88 ff ff                         0.O~..
> > > [   15.752467]  Padding 0xffff880c7e5f3ef8:  00 16 4f 7e 08 88 ff ff                         ..O~..
> >
> > 16 bytes were allocated, but a pointer array much larger than that is being used.
> >
>
> Since the problem persists with and without CONFIG_SLUB_DEBUG_ON, I'd
> speculate that this is a problem with node scalability on my 4-node system
> if this boots fine for you.

Looking at it. I have a fakenuma setup here that does not trigger it.
Guess I need something more real.

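For reference, a fakenuma setup is presumably x86-64 NUMA emulation
(CONFIG_NUMA_EMU), i.e. something like this on the kernel command line,
which fakes the node topology but not the remote-access behaviour of real
NUMA hardware:

	numa=fake=4		# pretend the machine has four nodes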


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)
  2010-08-17 18:54                 ` David Rientjes
@ 2010-08-17 19:34                   ` Christoph Lameter
  0 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-17 19:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Pekka Enberg, linux-mm, linux-kernel, Nick Piggin, Tejun Heo

On Tue, 17 Aug 2010, David Rientjes wrote:

> On Tue, 17 Aug 2010, Christoph Lameter wrote:
>
> > > I didn't know if that was a debugging patch for me or if you wanted to
> > > push that as part of your series; I'm not sure if you actually need to
> > > move it to kmem_cache_init() now that slub_state is protected by
> > > slub_lock.  I'm not sure if we want to allocate DMA objects between
> > > kmem_cache_init() and kmem_cache_init_late().
> >
> > Drivers may allocate dma buffers during initialization.
> >
>
> Ok, I moved the DMA cache creation from kmem_cache_init_late() to
> kmem_cache_init().  Note: the kasprintf() will need to use GFP_NOWAIT and
> not GFP_KERNEL now.

OK. I have revised the patch, since there is also a problem with the
indirection on kmalloc_caches.

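Presumably the indirection issue is the switch, earlier in the series,
from the old static array of kmem_cache structs to an array of pointers
filled in at boot; roughly the following declarations (a sketch, not the
exact hunks):

	/* before the series: caches embedded in a static array */
	extern struct kmem_cache kmalloc_caches[KMALLOC_CACHES];

	/* after the dynamic-sizing patch: one more dereference everywhere */
	extern struct kmem_cache *kmalloc_caches[SLUB_PAGE_SHIFT];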


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [S+Q3 20/23] slub: Shared cache to exploit cross cpu caching abilities.
  2010-08-17 19:32             ` Christoph Lameter
@ 2010-08-18 19:32               ` Christoph Lameter
  0 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2010-08-18 19:32 UTC (permalink / raw)
  To: David Rientjes; +Cc: Pekka Enberg, linux-mm, linux-kernel, Nick Piggin

On Tue, 17 Aug 2010, Christoph Lameter wrote:

> > >
> > > > [   15.752467]
> > > > [   15.752467] INFO: 0xffff880c7e5f3ec0-0xffff880c7e5f3ec7. First byte 0x30 instead of 0xbb
> > > > [   15.752467] INFO: Allocated in 0xffff88087e4f11e0 age=131909211166235 cpu=2119111312 pid=-30712
> > > > [   15.752467] INFO: Freed in 0xffff88087e4f13f0 age=131909211165707 cpu=2119111840 pid=-30712
> > > > [   15.752467] INFO: Slab 0xffffea002bba4d28 objects=51 new=3 fp=0x0007000000000000 flags=0xa00000000000080
> > > > [   15.752467] INFO: Object 0xffff880c7e5f3eb0 @offset=3760
> > > > [   15.752467]
> > > > [   15.752467] Bytes b4 0xffff880c7e5f3ea0:  18 00 00 00 7e 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ....~...ZZZZZZZZ
> > > > [   15.752467]   Object 0xffff880c7e5f3eb0:  d0 0f 4f 7e 08 88 ff ff 80 10 4f 7e 08 88 ff ff .O~....O~..
> > > > [   15.752467]  Redzone 0xffff880c7e5f3ec0:  30 11 4f 7e 08 88 ff ff                         0.O~..
> > > > [   15.752467]  Padding 0xffff880c7e5f3ef8:  00 16 4f 7e 08 88 ff ff                         ..O~..
> > >
> > > 16 bytes were allocated, but a pointer array much larger than that is being used.
> > >
> >
> > Since the problem persists with and without CONFIG_SLUB_DEBUG_ON, I'd
> > speculate that this is a problem with node scalability on my 4-node system
> > if this boots fine for you.
>
> Looking at it. I have a fakenuma setup here that does not trigger it.
> Guess I need something more real.

Cannot reproduce it on my real 2-node NUMA system either.
The trouble is that such a setup results in only one alien cache per cpu
caching domain.


^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2010-08-18 19:32 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-04  2:45 [S+Q3 00/23] SLUB: The Unified slab allocator (V3) Christoph Lameter
2010-08-04  2:45 ` [S+Q3 01/23] percpu: make @dyn_size always mean min dyn_size in first chunk init functions Christoph Lameter
2010-08-04  2:45 ` [S+Q3 02/23] percpu: allow limited allocation before slab is online Christoph Lameter
2010-08-04  2:45 ` [S+Q3 03/23] slub: Use a constant for a unspecified node Christoph Lameter
2010-08-04  3:34   ` David Rientjes
2010-08-04 16:15     ` Christoph Lameter
2010-08-05  7:40       ` David Rientjes
2010-08-04  2:45 ` [S+Q3 04/23] SLUB: Constants need UL Christoph Lameter
2010-08-04  2:45 ` [S+Q3 05/23] Subjec Slub: Force no inlining of debug functions Christoph Lameter
2010-08-04  2:45 ` [S+Q3 06/23] slub: Check kasprintf results in kmem_cache_init() Christoph Lameter
2010-08-04  2:45 ` [S+Q3 07/23] slub: Use kmem_cache flags to detect if slab is in debugging mode Christoph Lameter
2010-08-04  2:45 ` [S+Q3 08/23] slub: remove dynamic dma slab allocation Christoph Lameter
2010-08-04  2:45 ` [S+Q3 09/23] slub: Remove static kmem_cache_cpu array for boot Christoph Lameter
2010-08-04  2:45 ` [S+Q3 10/23] slub: Allow removal of slab caches during boot V2 Christoph Lameter
2010-08-04  2:45 ` [S+Q3 11/23] slub: Dynamically size kmalloc cache allocations Christoph Lameter
2010-08-04  2:45 ` [S+Q3 12/23] slub: Extract hooks for memory checkers from hotpaths Christoph Lameter
2010-08-04  2:45 ` [S+Q3 13/23] slub: Move gfpflag masking out of the hotpath Christoph Lameter
2010-08-04  2:45 ` [S+Q3 14/23] slub: Add SLAB style per cpu queueing Christoph Lameter
2010-08-04  2:45 ` [S+Q3 15/23] slub: Allow resizing of per cpu queues Christoph Lameter
2010-08-04  2:45 ` [S+Q3 16/23] slub: Get rid of useless function count_free() Christoph Lameter
2010-08-04  2:45 ` [S+Q3 17/23] slub: Remove MAX_OBJS limitation Christoph Lameter
2010-08-04  2:45 ` [S+Q3 18/23] slub: Drop allocator announcement Christoph Lameter
2010-08-04  2:45 ` [S+Q3 19/23] slub: Object based NUMA policies Christoph Lameter
2010-08-04  2:45 ` [S+Q3 20/23] slub: Shared cache to exploit cross cpu caching abilities Christoph Lameter
2010-08-17  5:52   ` David Rientjes
2010-08-17 17:51     ` Christoph Lameter
2010-08-17 18:42       ` David Rientjes
2010-08-17 18:50         ` Christoph Lameter
2010-08-17 19:02           ` David Rientjes
2010-08-17 19:32             ` Christoph Lameter
2010-08-18 19:32               ` Christoph Lameter
2010-08-04  2:45 ` [S+Q3 21/23] slub: Support Alien Caches Christoph Lameter
2010-08-04  2:45 ` [S+Q3 22/23] slub: Cached object expiration Christoph Lameter
2010-08-04  2:45 ` [S+Q3 23/23] vmscan: Tie slub object expiration into page reclaim Christoph Lameter
2010-08-04  4:39 ` [S+Q3 00/23] SLUB: The Unified slab allocator (V3) David Rientjes
2010-08-04 16:17   ` Christoph Lameter
2010-08-05  8:38     ` David Rientjes
2010-08-05 17:33       ` Christoph Lameter
2010-08-17  4:56         ` David Rientjes
2010-08-17  7:55           ` Tejun Heo
2010-08-17 13:56             ` Christoph Lameter
2010-08-17 17:23           ` Christoph Lameter
2010-08-17 17:29             ` Christoph Lameter
2010-08-17 18:02             ` David Rientjes
2010-08-17 18:47               ` Christoph Lameter
2010-08-17 18:54                 ` David Rientjes
2010-08-17 19:34                   ` Christoph Lameter

This is a public inbox; see the mirroring instructions
for how to clone and mirror all data and code used for this inbox,
as well as URLs for NNTP newsgroup(s).