* [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench
@ 2010-06-25 21:20 Christoph Lameter
  2010-06-25 21:20 ` [S+Q 01/16] [PATCH] ipc/sem.c: Bugfix for semop() not reporting successful operation Christoph Lameter
                   ` (16 more replies)
  0 siblings, 17 replies; 72+ messages in thread
From: Christoph Lameter @ 2010-06-25 21:20 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, Nick Piggin, Matt Mackall

The following patchset cleans up some pieces and then equips SLUB with
per-cpu queues that work similarly to SLAB's queues. With that approach
SLUB wins in hackbench:

#!/bin/bash 
uname -a
echo "./hackbench 100 process 200000"
./hackbench 100 process 200000
echo "./hackbench 100 process 20000"
./hackbench 100 process 20000
echo "./hackbench 100 process 20000"
./hackbench 100 process 20000
echo "./hackbench 100 process 20000"
./hackbench 100 process 20000
echo "./hackbench 10 process 20000"
./hackbench 10 process 20000
echo "./hackbench 10 process 20000"
./hackbench 10 process 20000
echo "./hackbench 10 process 20000"
./hackbench 10 process 20000
echo "./hackbench 1 process 20000"
./hackbench 1 process 20000
echo "./hackbench 1 process 20000"
./hackbench 1 process 20000
echo "./hackbench 1 process 20000"
./hackbench 1 process 20000

Procs	NR		SLAB	SLUB	SLUB+Queuing
----------------------------------------------------
100	200000		2741.3	2764.7	2231.9
100	20000		279.3	270.3	219.0
100	20000		278.0	273.1	219.2
100	20000		279.0	271.7	218.8
10 	20000		34.0	35.6	28.8
10	20000		30.3	35.2	28.4
10	20000		32.9	34.6	28.4
1	20000		6.4	6.7	6.5
1	20000		6.3	6.8	6.5
1	20000		6.4	6.9	6.4


SLUB+Q merges SLUB with some queuing concepts from SLAB and a new way of
managing objects in the slabs using bitmaps. It uses a percpu queue so that
free operations can be properly buffered, and a bitmap to manage the
free/allocated state of the objects in a slab. It is slightly less efficient
than SLUB in terms of space (a bitmap of a few words has to be placed in slab
pages that hold more than BITS_PER_LONG objects) but in general does not
increase space use too much.

SLUB+Q adopts the SLAB scheme of not touching the object during management,
so cache-cold objects can be freed and allocated efficiently without
causing cache misses.
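
To illustrate the two ideas in isolation (a per-cpu queue that buffers free
operations, and a per-slab bitmap that tracks the free/allocated state
without ever touching the objects), here is a minimal userspace sketch. All
names and sizes are made up for illustration; this is not the code in the
patchset:

#include <stddef.h>
#include <stdio.h>

#define OBJS_PER_SLAB	(8 * sizeof(unsigned long))	/* one bitmap word */
#define QUEUE_SIZE	32				/* per-cpu queue depth */

struct slab {
	unsigned long free_bitmap;		/* bit set = object is free */
	char objects[OBJS_PER_SLAB][64];
};

struct cpu_queue {
	unsigned int nr;
	void *objects[QUEUE_SIZE];	/* buffered pointers, objects untouched */
};

/* Take one free object out of the slab by clearing its bit. */
static void *slab_alloc_from_bitmap(struct slab *s)
{
	int i;

	if (!s->free_bitmap)
		return NULL;
	i = __builtin_ctzl(s->free_bitmap);
	s->free_bitmap &= ~(1UL << i);
	return s->objects[i];
}

/* Mark an object free again without reading or writing its memory. */
static void slab_free_to_bitmap(struct slab *s, void *obj)
{
	size_t i = ((char *)obj - (char *)s->objects) / sizeof(s->objects[0]);

	s->free_bitmap |= 1UL << i;
}

/* Allocation: refill the per-cpu queue from the bitmap when it runs empty. */
static void *queue_alloc(struct cpu_queue *q, struct slab *s)
{
	if (!q->nr) {
		void *obj;

		while (q->nr < QUEUE_SIZE / 2 &&
		       (obj = slab_alloc_from_bitmap(s)))
			q->objects[q->nr++] = obj;
	}
	return q->nr ? q->objects[--q->nr] : NULL;
}

/* Free: buffer the pointer, drain half the queue to the bitmap when full. */
static void queue_free(struct cpu_queue *q, struct slab *s, void *obj)
{
	if (q->nr == QUEUE_SIZE)
		while (q->nr > QUEUE_SIZE / 2)
			slab_free_to_bitmap(s, q->objects[--q->nr]);
	q->objects[q->nr++] = obj;
}

int main(void)
{
	static struct slab slab = { .free_bitmap = ~0UL };
	static struct cpu_queue queue;
	void *obj = queue_alloc(&queue, &slab);

	queue_free(&queue, &slab, obj);
	printf("allocated and freed %p\n", obj);
	return 0;
}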


* [S+Q 01/16] [PATCH] ipc/sem.c: Bugfix for semop() not reporting successful operation
  2010-06-25 21:20 [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench Christoph Lameter
@ 2010-06-25 21:20 ` Christoph Lameter
  2010-06-28  2:17   ` KAMEZAWA Hiroyuki
  2010-06-28 16:48     ` Pekka Enberg
  2010-06-25 21:20 ` [S+Q 02/16] [PATCH 1/2] percpu: make @dyn_size always mean min dyn_size in first chunk init functions Christoph Lameter
                   ` (15 subsequent siblings)
  16 siblings, 2 replies; 72+ messages in thread
From: Christoph Lameter @ 2010-06-25 21:20 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, Manfred Spraul, Nick Piggin, Matt Mackall

[-- Attachment #1: 0001-ipc-sem.c-Bugfix-for-semop.patch --]
[-- Type: text/plain, Size: 2629 bytes --]

[Necessary to make 2.6.35-rc3 not deadlock. Not sure if this is the "right"(tm)
fix]

The last change to improve scalability moved the actual wake-up out of
the section that is protected by spin_lock(sma->sem_perm.lock).

This means that IN_WAKEUP can still be in queue.status even when the
spinlock is held by the current task. Thus the same loop that is performed
when queue.status is read without the spinlock must also be performed with
the spinlock held.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 ipc/sem.c |   36 ++++++++++++++++++++++++++++++------
 1 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 506c849..523665f 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1256,6 +1256,32 @@ out:
 	return un;
 }
 
+
+/** get_queue_result - Retrieve the result code from sem_queue
+ * @q: Pointer to queue structure
+ *
+ * The function retrieves the return code from the pending queue. If
+ * IN_WAKEUP is found in q->status, then we must loop until the value
+ * is replaced with the final value: This may happen if a task is
+ * woken up by an unrelated event (e.g. signal) and in parallel the task
+ * is woken up by another task because it got the requested semaphores.
+ *
+ * The function can be called with or without holding the semaphore spinlock.
+ */
+static int get_queue_result(struct sem_queue *q)
+{
+	int error;
+
+	error = q->status;
+	while(unlikely(error == IN_WAKEUP)) {
+		cpu_relax();
+		error = q->status;
+	}
+
+	return error;
+}
+
+
 SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 		unsigned, nsops, const struct timespec __user *, timeout)
 {
@@ -1409,11 +1435,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 	else
 		schedule();
 
-	error = queue.status;
-	while(unlikely(error == IN_WAKEUP)) {
-		cpu_relax();
-		error = queue.status;
-	}
+	error = get_queue_result(&queue);
 
 	if (error != -EINTR) {
 		/* fast path: update_queue already obtained all requested
@@ -1427,10 +1449,12 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 		goto out_free;
 	}
 
+	error = get_queue_result(&queue);
+
 	/*
 	 * If queue.status != -EINTR we are woken up by another process
 	 */
-	error = queue.status;
+
 	if (error != -EINTR) {
 		goto out_unlock_free;
 	}
-- 
1.7.0.1



* [S+Q 02/16] [PATCH 1/2] percpu: make @dyn_size always mean min dyn_size in first chunk init functions
  2010-06-25 21:20 [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench Christoph Lameter
  2010-06-25 21:20 ` [S+Q 01/16] [PATCH] ipc/sem.c: Bugfix for semop() not reporting successful operation Christoph Lameter
@ 2010-06-25 21:20 ` Christoph Lameter
  2010-06-27  5:06   ` David Rientjes
  2010-06-25 21:20 ` [S+Q 03/16] [PATCH 2/2] percpu: allow limited allocation before slab is online Christoph Lameter
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 72+ messages in thread
From: Christoph Lameter @ 2010-06-25 21:20 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: linux-mm, David Rientjes, Tejun Heo, Nick Piggin, Matt Mackall

[-- Attachment #1: percpu_early_1 --]
[-- Type: text/plain, Size: 5713 bytes --]

In pcpu_build_alloc_info() and pcpu_embed_first_chunk(), @dyn_size was an
ssize_t: -1 meant auto-size, 0 forced 0, and a positive value meant minimum
size.  There is no use case for forcing 0, and the upcoming early alloc
support always requires a non-zero dynamic size.  Make @dyn_size always mean
the minimum dyn_size.
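
For callers the change looks like this (a sketch based on the
pcpu_page_first_chunk() hunk below; the 20 KB value is just a hypothetical
positive minimum):

	/* before: -1 asked for automatic sizing */
	ai = pcpu_build_alloc_info(reserved_size, -1, PAGE_SIZE, NULL);

	/* after: @dyn_size is always a minimum; 0 means "just enough to
	 * fill page alignment after the static and reserved areas", and
	 * a positive value may still be rounded up to page alignment. */
	ai = pcpu_build_alloc_info(reserved_size, 0, PAGE_SIZE, NULL);
	ai = pcpu_build_alloc_info(reserved_size, 20 << 10, PAGE_SIZE, NULL);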

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2010-06-18 12:23:22.000000000 -0500
+++ linux-2.6/include/linux/percpu.h	2010-06-18 12:24:52.000000000 -0500
@@ -105,7 +105,7 @@ extern struct pcpu_alloc_info * __init p
 extern void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai);
 
 extern struct pcpu_alloc_info * __init pcpu_build_alloc_info(
-				size_t reserved_size, ssize_t dyn_size,
+				size_t reserved_size, size_t dyn_size,
 				size_t atom_size,
 				pcpu_fc_cpu_distance_fn_t cpu_distance_fn);
 
@@ -113,7 +113,7 @@ extern int __init pcpu_setup_first_chunk
 					 void *base_addr);
 
 #ifdef CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK
-extern int __init pcpu_embed_first_chunk(size_t reserved_size, ssize_t dyn_size,
+extern int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
 				size_t atom_size,
 				pcpu_fc_cpu_distance_fn_t cpu_distance_fn,
 				pcpu_fc_alloc_fn_t alloc_fn,
Index: linux-2.6/mm/percpu.c
===================================================================
--- linux-2.6.orig/mm/percpu.c	2010-06-18 11:20:35.000000000 -0500
+++ linux-2.6/mm/percpu.c	2010-06-18 12:24:52.000000000 -0500
@@ -988,20 +988,6 @@ phys_addr_t per_cpu_ptr_to_phys(void *ad
 		return page_to_phys(pcpu_addr_to_page(addr));
 }
 
-static inline size_t pcpu_calc_fc_sizes(size_t static_size,
-					size_t reserved_size,
-					ssize_t *dyn_sizep)
-{
-	size_t size_sum;
-
-	size_sum = PFN_ALIGN(static_size + reserved_size +
-			     (*dyn_sizep >= 0 ? *dyn_sizep : 0));
-	if (*dyn_sizep != 0)
-		*dyn_sizep = size_sum - static_size - reserved_size;
-
-	return size_sum;
-}
-
 /**
  * pcpu_alloc_alloc_info - allocate percpu allocation info
  * @nr_groups: the number of groups
@@ -1060,7 +1046,7 @@ void __init pcpu_free_alloc_info(struct 
 /**
  * pcpu_build_alloc_info - build alloc_info considering distances between CPUs
  * @reserved_size: the size of reserved percpu area in bytes
- * @dyn_size: free size for dynamic allocation in bytes, -1 for auto
+ * @dyn_size: free size for dynamic allocation in bytes
  * @atom_size: allocation atom size
  * @cpu_distance_fn: callback to determine distance between cpus, optional
  *
@@ -1079,7 +1065,7 @@ void __init pcpu_free_alloc_info(struct 
  * failure, ERR_PTR value is returned.
  */
 struct pcpu_alloc_info * __init pcpu_build_alloc_info(
-				size_t reserved_size, ssize_t dyn_size,
+				size_t reserved_size, size_t dyn_size,
 				size_t atom_size,
 				pcpu_fc_cpu_distance_fn_t cpu_distance_fn)
 {
@@ -1098,13 +1084,15 @@ struct pcpu_alloc_info * __init pcpu_bui
 	memset(group_map, 0, sizeof(group_map));
 	memset(group_cnt, 0, sizeof(group_map));
 
+	size_sum = PFN_ALIGN(static_size + reserved_size + dyn_size);
+	dyn_size = size_sum - static_size - reserved_size;
+
 	/*
 	 * Determine min_unit_size, alloc_size and max_upa such that
 	 * alloc_size is multiple of atom_size and is the smallest
 	 * which can accomodate 4k aligned segments which are equal to
 	 * or larger than min_unit_size.
 	 */
-	size_sum = pcpu_calc_fc_sizes(static_size, reserved_size, &dyn_size);
 	min_unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE);
 
 	alloc_size = roundup(min_unit_size, atom_size);
@@ -1508,7 +1496,7 @@ early_param("percpu_alloc", percpu_alloc
 /**
  * pcpu_embed_first_chunk - embed the first percpu chunk into bootmem
  * @reserved_size: the size of reserved percpu area in bytes
- * @dyn_size: free size for dynamic allocation in bytes, -1 for auto
+ * @dyn_size: minimum free size for dynamic allocation in bytes
  * @atom_size: allocation atom size
  * @cpu_distance_fn: callback to determine distance between cpus, optional
  * @alloc_fn: function to allocate percpu page
@@ -1529,10 +1517,7 @@ early_param("percpu_alloc", percpu_alloc
  * vmalloc space is not orders of magnitude larger than distances
  * between node memory addresses (ie. 32bit NUMA machines).
  *
- * When @dyn_size is positive, dynamic area might be larger than
- * specified to fill page alignment.  When @dyn_size is auto,
- * @dyn_size is just big enough to fill page alignment after static
- * and reserved areas.
+ * @dyn_size specifies the minimum dynamic area size.
  *
  * If the needed size is smaller than the minimum or specified unit
  * size, the leftover is returned using @free_fn.
@@ -1540,7 +1525,7 @@ early_param("percpu_alloc", percpu_alloc
  * RETURNS:
  * 0 on success, -errno on failure.
  */
-int __init pcpu_embed_first_chunk(size_t reserved_size, ssize_t dyn_size,
+int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
 				  size_t atom_size,
 				  pcpu_fc_cpu_distance_fn_t cpu_distance_fn,
 				  pcpu_fc_alloc_fn_t alloc_fn,
@@ -1671,7 +1656,7 @@ int __init pcpu_page_first_chunk(size_t 
 
 	snprintf(psize_str, sizeof(psize_str), "%luK", PAGE_SIZE >> 10);
 
-	ai = pcpu_build_alloc_info(reserved_size, -1, PAGE_SIZE, NULL);
+	ai = pcpu_build_alloc_info(reserved_size, 0, PAGE_SIZE, NULL);
 	if (IS_ERR(ai))
 		return PTR_ERR(ai);
 	BUG_ON(ai->nr_groups != 1);


* [S+Q 03/16] [PATCH 2/2] percpu: allow limited allocation before slab is online
  2010-06-25 21:20 [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench Christoph Lameter
  2010-06-25 21:20 ` [S+Q 01/16] [PATCH] ipc/sem.c: Bugfix for semop() not reporting successful operation Christoph Lameter
  2010-06-25 21:20 ` [S+Q 02/16] [PATCH 1/2] percpu: make @dyn_size always mean min dyn_size in first chunk init functions Christoph Lameter
@ 2010-06-25 21:20 ` Christoph Lameter
  2010-06-25 21:20 ` [S+Q 04/16] slub: Use a constant for an unspecified node Christoph Lameter
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 72+ messages in thread
From: Christoph Lameter @ 2010-06-25 21:20 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: linux-mm, David Rientjes, Tejun Heo, Nick Piggin, Matt Mackall

[-- Attachment #1: percpu_early_2 --]
[-- Type: text/plain, Size: 6880 bytes --]

This patch updates the percpu allocator so that it can serve a limited
amount of allocations before slab comes online.  This is primarily to
allow slab to depend on a working percpu allocator.

Two parameters, PERCPU_DYNAMIC_EARLY_SIZE and PERCPU_DYNAMIC_EARLY_SLOTS,
determine how much memory space and how many allocation map slots are
reserved.  If this reserved area is exhausted, WARN_ON_ONCE() triggers and
allocations fail until slab comes online.

The following changes are made to implement early alloc.

* pcpu_mem_alloc() now checks slab_is_available()

* Chunks are allocated using pcpu_mem_alloc()

* Init paths make sure ai->dyn_size is at least as large as
  PERCPU_DYNAMIC_EARLY_SIZE.

* Initial alloc maps are allocated in __initdata and copied to
  kmalloc'd areas once slab is online.
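
As a rough model of the pattern these changes implement (a static
__initdata-style map serves the first chunk until slab is up, after which
it is replaced by a properly allocated copy), here is a minimal userspace
sketch. The names are made up for illustration; this is not the kernel code:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define EARLY_SLOTS 128			/* models PERCPU_DYNAMIC_EARLY_SLOTS */

static int early_map[EARLY_SLOTS];	/* models the __initdata smap/dmap */
static int *chunk_map = early_map;	/* first chunk starts on the static map */
static bool slab_online;

/* models pcpu_mem_alloc(): warn and refuse before slab is available */
static void *mem_alloc(size_t size)
{
	if (!slab_online) {
		fprintf(stderr, "WARN: percpu allocation before slab is online\n");
		return NULL;
	}
	return calloc(1, size);
}

/* models percpu_init_late(): swap the static map for an allocated copy */
static void init_late(void)
{
	int *map = mem_alloc(EARLY_SLOTS * sizeof(*map));

	if (!map)
		abort();		/* BUG_ON() in the real code */
	memcpy(map, chunk_map, EARLY_SLOTS * sizeof(*map));
	chunk_map = map;
}

int main(void)
{
	chunk_map[0] = 42;		/* early use goes through the static map */
	slab_online = true;		/* kmem_cache_init() has run */
	init_late();
	printf("%d\n", chunk_map[0]);	/* still 42, now from a heap-backed map */
	return 0;
}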

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
 include/linux/percpu.h |   13 ++++++++++++
 init/main.c            |    1
 mm/percpu.c            |   52 +++++++++++++++++++++++++++++++++++++------------
 3 files changed, 54 insertions(+), 12 deletions(-)

Index: linux-2.6/mm/percpu.c
===================================================================
--- linux-2.6.orig/mm/percpu.c	2010-06-23 14:43:39.000000000 -0500
+++ linux-2.6/mm/percpu.c	2010-06-23 14:43:54.000000000 -0500
@@ -282,6 +282,9 @@ static void __maybe_unused pcpu_next_pop
  */
 static void *pcpu_mem_alloc(size_t size)
 {
+	if (WARN_ON_ONCE(!slab_is_available()))
+		return NULL;
+
 	if (size <= PAGE_SIZE)
 		return kzalloc(size, GFP_KERNEL);
 	else {
@@ -392,13 +395,6 @@ static int pcpu_extend_area_map(struct p
 	old_size = chunk->map_alloc * sizeof(chunk->map[0]);
 	memcpy(new, chunk->map, old_size);
 
-	/*
-	 * map_alloc < PCPU_DFL_MAP_ALLOC indicates that the chunk is
-	 * one of the first chunks and still using static map.
-	 */
-	if (chunk->map_alloc >= PCPU_DFL_MAP_ALLOC)
-		old = chunk->map;
-
 	chunk->map_alloc = new_alloc;
 	chunk->map = new;
 	new = NULL;
@@ -604,7 +600,7 @@ static struct pcpu_chunk *pcpu_alloc_chu
 {
 	struct pcpu_chunk *chunk;
 
-	chunk = kzalloc(pcpu_chunk_struct_size, GFP_KERNEL);
+	chunk = pcpu_mem_alloc(pcpu_chunk_struct_size);
 	if (!chunk)
 		return NULL;
 
@@ -1084,7 +1080,9 @@ struct pcpu_alloc_info * __init pcpu_bui
 	memset(group_map, 0, sizeof(group_map));
 	memset(group_cnt, 0, sizeof(group_map));
 
-	size_sum = PFN_ALIGN(static_size + reserved_size + dyn_size);
+	/* calculate size_sum and ensure dyn_size is enough for early alloc */
+	size_sum = PFN_ALIGN(static_size + reserved_size +
+			    max_t(size_t, dyn_size, PERCPU_DYNAMIC_EARLY_SIZE));
 	dyn_size = size_sum - static_size - reserved_size;
 
 	/*
@@ -1314,7 +1312,8 @@ int __init pcpu_setup_first_chunk(const 
 				  void *base_addr)
 {
 	static char cpus_buf[4096] __initdata;
-	static int smap[2], dmap[2];
+	static int smap[PERCPU_DYNAMIC_EARLY_SLOTS] __initdata;
+	static int dmap[PERCPU_DYNAMIC_EARLY_SLOTS] __initdata;
 	size_t dyn_size = ai->dyn_size;
 	size_t size_sum = ai->static_size + ai->reserved_size + dyn_size;
 	struct pcpu_chunk *schunk, *dchunk = NULL;
@@ -1337,14 +1336,13 @@ int __init pcpu_setup_first_chunk(const 
 } while (0)
 
 	/* sanity checks */
-	BUILD_BUG_ON(ARRAY_SIZE(smap) >= PCPU_DFL_MAP_ALLOC ||
-		     ARRAY_SIZE(dmap) >= PCPU_DFL_MAP_ALLOC);
 	PCPU_SETUP_BUG_ON(ai->nr_groups <= 0);
 	PCPU_SETUP_BUG_ON(!ai->static_size);
 	PCPU_SETUP_BUG_ON(!base_addr);
 	PCPU_SETUP_BUG_ON(ai->unit_size < size_sum);
 	PCPU_SETUP_BUG_ON(ai->unit_size & ~PAGE_MASK);
 	PCPU_SETUP_BUG_ON(ai->unit_size < PCPU_MIN_UNIT_SIZE);
+	PCPU_SETUP_BUG_ON(ai->dyn_size < PERCPU_DYNAMIC_EARLY_SIZE);
 	PCPU_SETUP_BUG_ON(pcpu_verify_alloc_info(ai) < 0);
 
 	/* process group information and build config tables accordingly */
@@ -1782,3 +1780,33 @@ void __init setup_per_cpu_areas(void)
 		__per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];
 }
 #endif /* CONFIG_HAVE_SETUP_PER_CPU_AREA */
+
+/*
+ * First and reserved chunks are initialized with temporary allocation
+ * map in initdata so that they can be used before slab is online.
+ * This function is called after slab is brought up and replaces those
+ * with properly allocated maps.
+ */
+void __init percpu_init_late(void)
+{
+	struct pcpu_chunk *target_chunks[] =
+		{ pcpu_first_chunk, pcpu_reserved_chunk, NULL };
+	struct pcpu_chunk *chunk;
+	unsigned long flags;
+	int i;
+
+	for (i = 0; (chunk = target_chunks[i]); i++) {
+		int *map;
+		const size_t size = PERCPU_DYNAMIC_EARLY_SLOTS * sizeof(map[0]);
+
+		BUILD_BUG_ON(size > PAGE_SIZE);
+
+		map = pcpu_mem_alloc(size);
+		BUG_ON(!map);
+
+		spin_lock_irqsave(&pcpu_lock, flags);
+		memcpy(map, chunk->map, size);
+		chunk->map = map;
+		spin_unlock_irqrestore(&pcpu_lock, flags);
+	}
+}
Index: linux-2.6/init/main.c
===================================================================
--- linux-2.6.orig/init/main.c	2010-06-22 09:45:34.000000000 -0500
+++ linux-2.6/init/main.c	2010-06-23 14:43:54.000000000 -0500
@@ -522,6 +522,7 @@ static void __init mm_init(void)
 	page_cgroup_init_flatmem();
 	mem_init();
 	kmem_cache_init();
+	percpu_init_late();
 	pgtable_cache_init();
 	vmalloc_init();
 }
Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2010-06-23 14:43:39.000000000 -0500
+++ linux-2.6/include/linux/percpu.h	2010-06-23 14:43:54.000000000 -0500
@@ -45,6 +45,16 @@
 #define PCPU_MIN_UNIT_SIZE		PFN_ALIGN(64 << 10)
 
 /*
+ * Percpu allocator can serve percpu allocations before slab is
+ * initialized which allows slab to depend on the percpu allocator.
+ * The following two parameters decide how much resource to
+ * preallocate for this.  Keep PERCPU_DYNAMIC_RESERVE equal to or
+ * larger than PERCPU_DYNAMIC_EARLY_SIZE.
+ */
+#define PERCPU_DYNAMIC_EARLY_SLOTS	128
+#define PERCPU_DYNAMIC_EARLY_SIZE	(12 << 10)
+
+/*
  * PERCPU_DYNAMIC_RESERVE indicates the amount of free area to piggy
  * back on the first chunk for dynamic percpu allocation if arch is
  * manually allocating and mapping it for faster access (as a part of
@@ -140,6 +150,7 @@ extern bool is_kernel_percpu_address(uns
 #ifndef CONFIG_HAVE_SETUP_PER_CPU_AREA
 extern void __init setup_per_cpu_areas(void);
 #endif
+extern void __init percpu_init_late(void);
 
 #else /* CONFIG_SMP */
 
@@ -153,6 +164,8 @@ static inline bool is_kernel_percpu_addr
 
 static inline void __init setup_per_cpu_areas(void) { }
 
+static inline void __init percpu_init_late(void) { }
+
 static inline void *pcpu_lpage_remapped(void *kaddr)
 {
 	return NULL;


* [S+Q 04/16] slub: Use a constant for an unspecified node.
  2010-06-25 21:20 [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench Christoph Lameter
                   ` (2 preceding siblings ...)
  2010-06-25 21:20 ` [S+Q 03/16] [PATCH 2/2] percpu: allow limited allocation before slab is online Christoph Lameter
@ 2010-06-25 21:20 ` Christoph Lameter
  2010-06-28  2:25   ` KAMEZAWA Hiroyuki
  2010-06-25 21:20 ` [S+Q 05/16] SLUB: Constants need UL Christoph Lameter
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 72+ messages in thread
From: Christoph Lameter @ 2010-06-25 21:20 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, David Rientjes, Nick Piggin, Matt Mackall

[-- Attachment #1: slab_node_unspecified --]
[-- Type: text/plain, Size: 2321 bytes --]

kmalloc_node() and friends can be passed a constant -1 to indicate
that no choice was made for the node from which the object needs to
come.

Use NUMA_NO_NODE instead of -1.
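
For illustration, the calling convention before and after (NUMA_NO_NODE is
defined as -1 in the kernel headers):

	/* before: magic constant */
	object = kmem_cache_alloc_node(cachep, GFP_KERNEL, -1);

	/* after: self-documenting */
	object = kmem_cache_alloc_node(cachep, GFP_KERNEL, NUMA_NO_NODE);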

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slab.h |    2 ++
 mm/slub.c            |   10 +++++-----
 2 files changed, 7 insertions(+), 5 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-06-01 08:51:39.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-06-01 08:58:46.000000000 -0500
@@ -1073,7 +1073,7 @@ static inline struct page *alloc_slab_pa
 
 	flags |= __GFP_NOTRACK;
 
-	if (node == -1)
+	if (node == NUMA_NO_NODE)
 		return alloc_pages(flags, order);
 	else
 		return alloc_pages_exact_node(node, flags, order);
@@ -1727,7 +1727,7 @@ static __always_inline void *slab_alloc(
 
 void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
 {
-	void *ret = slab_alloc(s, gfpflags, -1, _RET_IP_);
+	void *ret = slab_alloc(s, gfpflags, NUMA_NO_NODE, _RET_IP_);
 
 	trace_kmem_cache_alloc(_RET_IP_, ret, s->objsize, s->size, gfpflags);
 
@@ -1738,7 +1738,7 @@ EXPORT_SYMBOL(kmem_cache_alloc);
 #ifdef CONFIG_TRACING
 void *kmem_cache_alloc_notrace(struct kmem_cache *s, gfp_t gfpflags)
 {
-	return slab_alloc(s, gfpflags, -1, _RET_IP_);
+	return slab_alloc(s, gfpflags, NUMA_NO_NODE, _RET_IP_);
 }
 EXPORT_SYMBOL(kmem_cache_alloc_notrace);
 #endif
@@ -2728,7 +2728,7 @@ void *__kmalloc(size_t size, gfp_t flags
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
 		return s;
 
-	ret = slab_alloc(s, flags, -1, _RET_IP_);
+	ret = slab_alloc(s, flags, NUMA_NO_NODE, _RET_IP_);
 
 	trace_kmalloc(_RET_IP_, ret, size, s->size, flags);
 
@@ -3312,7 +3312,7 @@ void *__kmalloc_track_caller(size_t size
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
 		return s;
 
-	ret = slab_alloc(s, gfpflags, -1, caller);
+	ret = slab_alloc(s, gfpflags, NUMA_NO_NODE, caller);
 
 	/* Honor the call site pointer we recieved. */
 	trace_kmalloc(caller, ret, size, s->size, gfpflags);


* [S+Q 05/16] SLUB: Constants need UL
  2010-06-25 21:20 [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench Christoph Lameter
                   ` (3 preceding siblings ...)
  2010-06-25 21:20 ` [S+Q 04/16] slub: Use a constant for an unspecified node Christoph Lameter
@ 2010-06-25 21:20 ` Christoph Lameter
  2010-06-26 23:31   ` David Rientjes
  2010-06-28  2:27   ` KAMEZAWA Hiroyuki
  2010-06-25 21:20 ` [S+Q 06/16] slub: Use kmem_cache flags to detect if slab is in debugging mode Christoph Lameter
                   ` (11 subsequent siblings)
  16 siblings, 2 replies; 72+ messages in thread
From: Christoph Lameter @ 2010-06-25 21:20 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, Nick Piggin, Matt Mackall

[-- Attachment #1: slub_constant_ul --]
[-- Type: text/plain, Size: 1095 bytes --]

The UL suffix is missing from some constants. Conform to how slab.h defines its constants.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-05-24 14:40:33.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-05-24 14:42:46.000000000 -0500
@@ -162,8 +162,8 @@
 #define MAX_OBJS_PER_PAGE	65535 /* since page.objects is u16 */
 
 /* Internal SLUB flags */
-#define __OBJECT_POISON		0x80000000 /* Poison object */
-#define __SYSFS_ADD_DEFERRED	0x40000000 /* Not yet visible via sysfs */
+#define __OBJECT_POISON		0x80000000UL /* Poison object */
+#define __SYSFS_ADD_DEFERRED	0x40000000UL /* Not yet visible via sysfs */
 
 static int kmem_size = sizeof(struct kmem_cache);
 


* [S+Q 06/16] slub: Use kmem_cache flags to detect if slab is in debugging mode.
  2010-06-25 21:20 [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench Christoph Lameter
                   ` (4 preceding siblings ...)
  2010-06-25 21:20 ` [S+Q 05/16] SLUB: Constants need UL Christoph Lameter
@ 2010-06-25 21:20 ` Christoph Lameter
  2010-06-26 23:31   ` David Rientjes
  2010-06-25 21:20 ` [S+Q 07/16] slub: discard_slab_unlock Christoph Lameter
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 72+ messages in thread
From: Christoph Lameter @ 2010-06-25 21:20 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, Nick Piggin, Matt Mackall

[-- Attachment #1: slub_debug_on --]
[-- Type: text/plain, Size: 3974 bytes --]

The cacheline with the flags is reachable from the hot paths after the
percpu allocator changes went in, so there is no longer any need to put a
flag into each slab page. Get rid of the SlubDebug page flag and use the
flags in kmem_cache instead.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/page-flags.h |    1 -
 mm/slub.c                  |   33 ++++++++++++---------------------
 2 files changed, 12 insertions(+), 22 deletions(-)

Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2010-05-28 11:37:33.000000000 -0500
+++ linux-2.6/include/linux/page-flags.h	2010-06-01 08:58:50.000000000 -0500
@@ -215,7 +215,6 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEAR
 __PAGEFLAG(SlobFree, slob_free)
 
 __PAGEFLAG(SlubFrozen, slub_frozen)
-__PAGEFLAG(SlubDebug, slub_debug)
 
 /*
  * Private page markings that may be used by the filesystem that owns the page
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-06-01 08:58:49.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-06-01 08:58:50.000000000 -0500
@@ -107,11 +107,17 @@
  * 			the fast path and disables lockless freelists.
  */
 
+#define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
+		SLAB_TRACE | SLAB_DEBUG_FREE)
+
+static inline int kmem_cache_debug(struct kmem_cache *s)
+{
 #ifdef CONFIG_SLUB_DEBUG
-#define SLABDEBUG 1
+	return unlikely(s->flags & SLAB_DEBUG_FLAGS);
 #else
-#define SLABDEBUG 0
+	return 0;
 #endif
+}
 
 /*
  * Issues still to be resolved:
@@ -1157,9 +1163,6 @@ static struct page *new_slab(struct kmem
 	inc_slabs_node(s, page_to_nid(page), page->objects);
 	page->slab = s;
 	page->flags |= 1 << PG_slab;
-	if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
-			SLAB_STORE_USER | SLAB_TRACE))
-		__SetPageSlubDebug(page);
 
 	start = page_address(page);
 
@@ -1186,14 +1189,13 @@ static void __free_slab(struct kmem_cach
 	int order = compound_order(page);
 	int pages = 1 << order;
 
-	if (unlikely(SLABDEBUG && PageSlubDebug(page))) {
+	if (kmem_cache_debug(s)) {
 		void *p;
 
 		slab_pad_check(s, page);
 		for_each_object(p, s, page_address(page),
 						page->objects)
 			check_object(s, page, p, 0);
-		__ClearPageSlubDebug(page);
 	}
 
 	kmemcheck_free_shadow(page, compound_order(page));
@@ -1415,8 +1417,7 @@ static void unfreeze_slab(struct kmem_ca
 			stat(s, tail ? DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD);
 		} else {
 			stat(s, DEACTIVATE_FULL);
-			if (SLABDEBUG && PageSlubDebug(page) &&
-						(s->flags & SLAB_STORE_USER))
+			if (kmem_cache_debug(s) && (s->flags & SLAB_STORE_USER))
 				add_full(n, page);
 		}
 		slab_unlock(page);
@@ -1624,7 +1625,7 @@ load_freelist:
 	object = c->page->freelist;
 	if (unlikely(!object))
 		goto another_slab;
-	if (unlikely(SLABDEBUG && PageSlubDebug(c->page)))
+	if (kmem_cache_debug(s))
 		goto debug;
 
 	c->freelist = get_freepointer(s, object);
@@ -1783,7 +1784,7 @@ static void __slab_free(struct kmem_cach
 	stat(s, FREE_SLOWPATH);
 	slab_lock(page);
 
-	if (unlikely(SLABDEBUG && PageSlubDebug(page)))
+	if (kmem_cache_debug(s))
 		goto debug;
 
 checks_ok:
@@ -3395,16 +3396,6 @@ static void validate_slab_slab(struct km
 	} else
 		printk(KERN_INFO "SLUB %s: Skipped busy slab 0x%p\n",
 			s->name, page);
-
-	if (s->flags & DEBUG_DEFAULT_FLAGS) {
-		if (!PageSlubDebug(page))
-			printk(KERN_ERR "SLUB %s: SlubDebug not set "
-				"on slab 0x%p\n", s->name, page);
-	} else {
-		if (PageSlubDebug(page))
-			printk(KERN_ERR "SLUB %s: SlubDebug set on "
-				"slab 0x%p\n", s->name, page);
-	}
 }
 
 static int validate_slab_node(struct kmem_cache *s,


* [S+Q 07/16] slub: discard_slab_unlock
  2010-06-25 21:20 [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench Christoph Lameter
                   ` (5 preceding siblings ...)
  2010-06-25 21:20 ` [S+Q 06/16] slub: Use kmem_cache flags to detect if slab is in debugging mode Christoph Lameter
@ 2010-06-25 21:20 ` Christoph Lameter
  2010-06-26 23:34   ` David Rientjes
  2010-06-25 21:20 ` [S+Q 08/16] slub: remove dynamic dma slab allocation Christoph Lameter
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 72+ messages in thread
From: Christoph Lameter @ 2010-06-25 21:20 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, Nick Piggin, Matt Mackall

[-- Attachment #1: slub_discard_unlock --]
[-- Type: text/plain, Size: 1719 bytes --]

The sequence of unlocking a slab and then freeing it occurs multiple times.
Put the common code into a single function.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |   16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-06-01 08:58:50.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-06-01 08:58:54.000000000 -0500
@@ -1260,6 +1260,13 @@ static __always_inline int slab_trylock(
 	return rc;
 }
 
+static void discard_slab_unlock(struct kmem_cache *s,
+	struct page *page)
+{
+	slab_unlock(page);
+	discard_slab(s, page);
+}
+
 /*
  * Management of partially allocated slabs
  */
@@ -1437,9 +1444,8 @@ static void unfreeze_slab(struct kmem_ca
 			add_partial(n, page, 1);
 			slab_unlock(page);
 		} else {
-			slab_unlock(page);
 			stat(s, FREE_SLAB);
-			discard_slab(s, page);
+			discard_slab_unlock(s, page);
 		}
 	}
 }
@@ -1822,9 +1828,8 @@ slab_empty:
 		remove_partial(s, page);
 		stat(s, FREE_REMOVE_PARTIAL);
 	}
-	slab_unlock(page);
 	stat(s, FREE_SLAB);
-	discard_slab(s, page);
+	discard_slab_unlock(s, page);
 	return;
 
 debug:
@@ -2893,8 +2898,7 @@ int kmem_cache_shrink(struct kmem_cache 
 				 */
 				list_del(&page->lru);
 				n->nr_partial--;
-				slab_unlock(page);
-				discard_slab(s, page);
+				discard_slab_unlock(s, page);
 			} else {
 				list_move(&page->lru,
 				slabs_by_inuse + page->inuse);


* [S+Q 08/16] slub: remove dynamic dma slab allocation
  2010-06-25 21:20 [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench Christoph Lameter
                   ` (6 preceding siblings ...)
  2010-06-25 21:20 ` [S+Q 07/16] slub: discard_slab_unlock Christoph Lameter
@ 2010-06-25 21:20 ` Christoph Lameter
  2010-06-26 23:52   ` David Rientjes
  2010-06-28  2:33   ` KAMEZAWA Hiroyuki
  2010-06-25 21:20 ` [S+Q 09/16] [percpu] make allocpercpu usable during early boot Christoph Lameter
                   ` (8 subsequent siblings)
  16 siblings, 2 replies; 72+ messages in thread
From: Christoph Lameter @ 2010-06-25 21:20 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, Nick Piggin, Matt Mackall

[-- Attachment #1: slub_remove_dynamic_dma --]
[-- Type: text/plain, Size: 8922 bytes --]

Remove the dynamic DMA slab allocation since it causes too many issues with
nested locks and the like. The change also avoids passing gfpflags into many
functions.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |  153 ++++++++++++++++----------------------------------------------
 1 file changed, 41 insertions(+), 112 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-06-15 12:40:58.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-06-15 12:41:36.000000000 -0500
@@ -2070,7 +2070,7 @@ init_kmem_cache_node(struct kmem_cache_n
 
 static DEFINE_PER_CPU(struct kmem_cache_cpu, kmalloc_percpu[KMALLOC_CACHES]);
 
-static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
+static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
 {
 	if (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches)
 		/*
@@ -2097,7 +2097,7 @@ static inline int alloc_kmem_cache_cpus(
  * when allocating for the kmalloc_node_cache. This is used for bootstrapping
  * memory on a fresh node that has no slab structures yet.
  */
-static void early_kmem_cache_node_alloc(gfp_t gfpflags, int node)
+static void early_kmem_cache_node_alloc(int node)
 {
 	struct page *page;
 	struct kmem_cache_node *n;
@@ -2105,7 +2105,7 @@ static void early_kmem_cache_node_alloc(
 
 	BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
 
-	page = new_slab(kmalloc_caches, gfpflags, node);
+	page = new_slab(kmalloc_caches, GFP_KERNEL, node);
 
 	BUG_ON(!page);
 	if (page_to_nid(page) != node) {
@@ -2149,7 +2149,7 @@ static void free_kmem_cache_nodes(struct
 	}
 }
 
-static int init_kmem_cache_nodes(struct kmem_cache *s, gfp_t gfpflags)
+static int init_kmem_cache_nodes(struct kmem_cache *s)
 {
 	int node;
 
@@ -2157,11 +2157,11 @@ static int init_kmem_cache_nodes(struct 
 		struct kmem_cache_node *n;
 
 		if (slab_state == DOWN) {
-			early_kmem_cache_node_alloc(gfpflags, node);
+			early_kmem_cache_node_alloc(node);
 			continue;
 		}
 		n = kmem_cache_alloc_node(kmalloc_caches,
-						gfpflags, node);
+						GFP_KERNEL, node);
 
 		if (!n) {
 			free_kmem_cache_nodes(s);
@@ -2178,7 +2178,7 @@ static void free_kmem_cache_nodes(struct
 {
 }
 
-static int init_kmem_cache_nodes(struct kmem_cache *s, gfp_t gfpflags)
+static int init_kmem_cache_nodes(struct kmem_cache *s)
 {
 	init_kmem_cache_node(&s->local_node, s);
 	return 1;
@@ -2318,7 +2318,7 @@ static int calculate_sizes(struct kmem_c
 
 }
 
-static int kmem_cache_open(struct kmem_cache *s, gfp_t gfpflags,
+static int kmem_cache_open(struct kmem_cache *s,
 		const char *name, size_t size,
 		size_t align, unsigned long flags,
 		void (*ctor)(void *))
@@ -2354,10 +2354,10 @@ static int kmem_cache_open(struct kmem_c
 #ifdef CONFIG_NUMA
 	s->remote_node_defrag_ratio = 1000;
 #endif
-	if (!init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
+	if (!init_kmem_cache_nodes(s))
 		goto error;
 
-	if (alloc_kmem_cache_cpus(s, gfpflags & ~SLUB_DMA))
+	if (alloc_kmem_cache_cpus(s))
 		return 1;
 
 	free_kmem_cache_nodes(s);
@@ -2517,6 +2517,10 @@ EXPORT_SYMBOL(kmem_cache_destroy);
 struct kmem_cache kmalloc_caches[KMALLOC_CACHES] __cacheline_aligned;
 EXPORT_SYMBOL(kmalloc_caches);
 
+#ifdef CONFIG_ZONE_DMA
+static struct kmem_cache kmalloc_dma_caches[SLUB_PAGE_SHIFT];
+#endif
+
 static int __init setup_slub_min_order(char *str)
 {
 	get_option(&str, &slub_min_order);
@@ -2553,116 +2557,26 @@ static int __init setup_slub_nomerge(cha
 
 __setup("slub_nomerge", setup_slub_nomerge);
 
-static struct kmem_cache *create_kmalloc_cache(struct kmem_cache *s,
-		const char *name, int size, gfp_t gfp_flags)
+static void create_kmalloc_cache(struct kmem_cache *s,
+		const char *name, int size, unsigned int flags)
 {
-	unsigned int flags = 0;
-
-	if (gfp_flags & SLUB_DMA)
-		flags = SLAB_CACHE_DMA;
-
 	/*
 	 * This function is called with IRQs disabled during early-boot on
 	 * single CPU so there's no need to take slub_lock here.
 	 */
-	if (!kmem_cache_open(s, gfp_flags, name, size, ARCH_KMALLOC_MINALIGN,
+	if (!kmem_cache_open(s, name, size, ARCH_KMALLOC_MINALIGN,
 								flags, NULL))
 		goto panic;
 
 	list_add(&s->list, &slab_caches);
 
-	if (sysfs_slab_add(s))
-		goto panic;
-	return s;
+	if (!sysfs_slab_add(s))
+		return;
 
 panic:
 	panic("Creation of kmalloc slab %s size=%d failed.\n", name, size);
 }
 
-#ifdef CONFIG_ZONE_DMA
-static struct kmem_cache *kmalloc_caches_dma[SLUB_PAGE_SHIFT];
-
-static void sysfs_add_func(struct work_struct *w)
-{
-	struct kmem_cache *s;
-
-	down_write(&slub_lock);
-	list_for_each_entry(s, &slab_caches, list) {
-		if (s->flags & __SYSFS_ADD_DEFERRED) {
-			s->flags &= ~__SYSFS_ADD_DEFERRED;
-			sysfs_slab_add(s);
-		}
-	}
-	up_write(&slub_lock);
-}
-
-static DECLARE_WORK(sysfs_add_work, sysfs_add_func);
-
-static noinline struct kmem_cache *dma_kmalloc_cache(int index, gfp_t flags)
-{
-	struct kmem_cache *s;
-	char *text;
-	size_t realsize;
-	unsigned long slabflags;
-	int i;
-
-	s = kmalloc_caches_dma[index];
-	if (s)
-		return s;
-
-	/* Dynamically create dma cache */
-	if (flags & __GFP_WAIT)
-		down_write(&slub_lock);
-	else {
-		if (!down_write_trylock(&slub_lock))
-			goto out;
-	}
-
-	if (kmalloc_caches_dma[index])
-		goto unlock_out;
-
-	realsize = kmalloc_caches[index].objsize;
-	text = kasprintf(flags & ~SLUB_DMA, "kmalloc_dma-%d",
-			 (unsigned int)realsize);
-
-	s = NULL;
-	for (i = 0; i < KMALLOC_CACHES; i++)
-		if (!kmalloc_caches[i].size)
-			break;
-
-	BUG_ON(i >= KMALLOC_CACHES);
-	s = kmalloc_caches + i;
-
-	/*
-	 * Must defer sysfs creation to a workqueue because we don't know
-	 * what context we are called from. Before sysfs comes up, we don't
-	 * need to do anything because our sysfs initcall will start by
-	 * adding all existing slabs to sysfs.
-	 */
-	slabflags = SLAB_CACHE_DMA|SLAB_NOTRACK;
-	if (slab_state >= SYSFS)
-		slabflags |= __SYSFS_ADD_DEFERRED;
-
-	if (!text || !kmem_cache_open(s, flags, text,
-			realsize, ARCH_KMALLOC_MINALIGN, slabflags, NULL)) {
-		s->size = 0;
-		kfree(text);
-		goto unlock_out;
-	}
-
-	list_add(&s->list, &slab_caches);
-	kmalloc_caches_dma[index] = s;
-
-	if (slab_state >= SYSFS)
-		schedule_work(&sysfs_add_work);
-
-unlock_out:
-	up_write(&slub_lock);
-out:
-	return kmalloc_caches_dma[index];
-}
-#endif
-
 /*
  * Conversion table for small slabs sizes / 8 to the index in the
  * kmalloc array. This is necessary for slabs < 192 since we have non power
@@ -2715,7 +2629,7 @@ static struct kmem_cache *get_slab(size_
 
 #ifdef CONFIG_ZONE_DMA
 	if (unlikely((flags & SLUB_DMA)))
-		return dma_kmalloc_cache(index, flags);
+		return &kmalloc_dma_caches[index];
 
 #endif
 	return &kmalloc_caches[index];
@@ -3053,7 +2967,7 @@ void __init kmem_cache_init(void)
 	 * kmem_cache_open for slab_state == DOWN.
 	 */
 	create_kmalloc_cache(&kmalloc_caches[0], "kmem_cache_node",
-		sizeof(struct kmem_cache_node), GFP_NOWAIT);
+		sizeof(struct kmem_cache_node), 0);
 	kmalloc_caches[0].refcount = -1;
 	caches++;
 
@@ -3066,18 +2980,18 @@ void __init kmem_cache_init(void)
 	/* Caches that are not of the two-to-the-power-of size */
 	if (KMALLOC_MIN_SIZE <= 32) {
 		create_kmalloc_cache(&kmalloc_caches[1],
-				"kmalloc-96", 96, GFP_NOWAIT);
+				"kmalloc-96", 96, 0);
 		caches++;
 	}
 	if (KMALLOC_MIN_SIZE <= 64) {
 		create_kmalloc_cache(&kmalloc_caches[2],
-				"kmalloc-192", 192, GFP_NOWAIT);
+				"kmalloc-192", 192, 0);
 		caches++;
 	}
 
 	for (i = KMALLOC_SHIFT_LOW; i < SLUB_PAGE_SHIFT; i++) {
 		create_kmalloc_cache(&kmalloc_caches[i],
-			"kmalloc", 1 << i, GFP_NOWAIT);
+			"kmalloc", 1 << i, 0);
 		caches++;
 	}
 
@@ -3124,7 +3038,7 @@ void __init kmem_cache_init(void)
 
 	/* Provide the correct kmalloc names now that the caches are up */
 	for (i = KMALLOC_SHIFT_LOW; i < SLUB_PAGE_SHIFT; i++)
-		kmalloc_caches[i]. name =
+		kmalloc_caches[i].name =
 			kasprintf(GFP_NOWAIT, "kmalloc-%d", 1 << i);
 
 #ifdef CONFIG_SMP
@@ -3147,6 +3061,21 @@ void __init kmem_cache_init(void)
 
 void __init kmem_cache_init_late(void)
 {
+#ifdef CONFIG_ZONE_DMA
+	int i;
+
+	for (i = 0; i < SLUB_PAGE_SHIFT; i++) {
+		struct kmem_cache *s = &kmalloc_caches[i];
+
+		if (s && s->size) {
+			char *name = kasprintf(GFP_KERNEL,
+				 "dma-kmalloc-%d", s->objsize);
+
+			create_kmalloc_cache(&kmalloc_dma_caches[i],
+				name, s->objsize, SLAB_CACHE_DMA);
+		}
+	}
+#endif
 }
 
 /*
@@ -3241,7 +3170,7 @@ struct kmem_cache *kmem_cache_create(con
 
 	s = kmalloc(kmem_size, GFP_KERNEL);
 	if (s) {
-		if (kmem_cache_open(s, GFP_KERNEL, name,
+		if (kmem_cache_open(s, name,
 				size, align, flags, ctor)) {
 			list_add(&s->list, &slab_caches);
 			up_write(&slub_lock);


* [S+Q 09/16] [percpu] make allocpercpu usable during early boot
  2010-06-25 21:20 [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench Christoph Lameter
                   ` (7 preceding siblings ...)
  2010-06-25 21:20 ` [S+Q 08/16] slub: remove dynamic dma slab allocation Christoph Lameter
@ 2010-06-25 21:20 ` Christoph Lameter
  2010-06-26  8:10   ` Tejun Heo
                     ` (2 more replies)
  2010-06-25 21:20 ` [S+Q 10/16] slub: Remove static kmem_cache_cpu array for boot Christoph Lameter
                   ` (7 subsequent siblings)
  16 siblings, 3 replies; 72+ messages in thread
From: Christoph Lameter @ 2010-06-25 21:20 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, tj, Nick Piggin, Matt Mackall

[-- Attachment #1: percpu_make_usable_during_early_boot --]
[-- Type: text/plain, Size: 1403 bytes --]

allocpercpu() may be used during early boot after the page allocator
has been bootstrapped but when interrupts are still off. Make sure
that we do not do GFP_KERNEL allocations if this occurs.

Cc: tj@kernel.org
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/percpu.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/percpu.c
===================================================================
--- linux-2.6.orig/mm/percpu.c	2010-06-23 14:43:54.000000000 -0500
+++ linux-2.6/mm/percpu.c	2010-06-23 14:44:05.000000000 -0500
@@ -275,7 +275,8 @@ static void __maybe_unused pcpu_next_pop
  * memory is always zeroed.
  *
  * CONTEXT:
- * Does GFP_KERNEL allocation.
+ * Does GFP_KERNEL allocation (May be called early in boot when
+ * interrupts are still disabled. Will then do GFP_NOWAIT alloc).
  *
  * RETURNS:
  * Pointer to the allocated area on success, NULL on failure.
@@ -286,7 +287,7 @@ static void *pcpu_mem_alloc(size_t size)
 		return NULL;
 
 	if (size <= PAGE_SIZE)
-		return kzalloc(size, GFP_KERNEL);
+		return kzalloc(size, GFP_KERNEL & gfp_allowed_mask);
 	else {
 		void *ptr = vmalloc(size);
 		if (ptr)


* [S+Q 10/16] slub: Remove static kmem_cache_cpu array for boot
  2010-06-25 21:20 [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench Christoph Lameter
                   ` (8 preceding siblings ...)
  2010-06-25 21:20 ` [S+Q 09/16] [percpu] make allocpercpu usable during early boot Christoph Lameter
@ 2010-06-25 21:20 ` Christoph Lameter
  2010-06-27  0:02   ` David Rientjes
  2010-06-25 21:20 ` [S+Q 11/16] slub: Dynamically size kmalloc cache allocations Christoph Lameter
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 72+ messages in thread
From: Christoph Lameter @ 2010-06-25 21:20 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, Tejun Heo, Nick Piggin, Matt Mackall

[-- Attachment #1: maybe_remove_static --]
[-- Type: text/plain, Size: 2253 bytes --]

The percpu allocator can now handle allocations during early boot,
so drop the static kmem_cache_cpu array.

Early memory allocations require GFP_NOWAIT instead of GFP_KERNEL.
Mask GFP_KERNEL with gfp_allowed_mask to get GFP_NOWAIT behavior in a
boot scenario.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |   21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

Index: linux-2.6.34/mm/slub.c
===================================================================
--- linux-2.6.34.orig/mm/slub.c	2010-06-22 09:50:00.000000000 -0500
+++ linux-2.6.34/mm/slub.c	2010-06-23 09:59:53.000000000 -0500
@@ -2068,23 +2068,14 @@ init_kmem_cache_node(struct kmem_cache_n
 #endif
 }
 
-static DEFINE_PER_CPU(struct kmem_cache_cpu, kmalloc_percpu[KMALLOC_CACHES]);
-
 static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
 {
-	if (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches)
-		/*
-		 * Boot time creation of the kmalloc array. Use static per cpu data
-		 * since the per cpu allocator is not available yet.
-		 */
-		s->cpu_slab = kmalloc_percpu + (s - kmalloc_caches);
-	else
-		s->cpu_slab =  alloc_percpu(struct kmem_cache_cpu);
+	BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
+			SLUB_PAGE_SHIFT * sizeof(struct kmem_cache));
 
-	if (!s->cpu_slab)
-		return 0;
+	s->cpu_slab = alloc_percpu(struct kmem_cache_cpu);
 
-	return 1;
+	return s->cpu_slab != NULL;
 }
 
 #ifdef CONFIG_NUMA
@@ -2105,7 +2096,7 @@ static void early_kmem_cache_node_alloc(
 
 	BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
 
-	page = new_slab(kmalloc_caches, GFP_KERNEL, node);
+	page = new_slab(kmalloc_caches, GFP_KERNEL & gfp_allowed_mask, node);
 
 	BUG_ON(!page);
 	if (page_to_nid(page) != node) {
@@ -2161,7 +2152,7 @@ static int init_kmem_cache_nodes(struct 
 			continue;
 		}
 		n = kmem_cache_alloc_node(kmalloc_caches,
-						GFP_KERNEL, node);
+			GFP_KERNEL & gfp_allowed_mask, node);
 
 		if (!n) {
 			free_kmem_cache_nodes(s);


* [S+Q 11/16] slub: Dynamically size kmalloc cache allocations
  2010-06-25 21:20 [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench Christoph Lameter
                   ` (9 preceding siblings ...)
  2010-06-25 21:20 ` [S+Q 10/16] slub: Remove static kmem_cache_cpu array for boot Christoph Lameter
@ 2010-06-25 21:20 ` Christoph Lameter
  2010-06-25 21:20 ` [S+Q 12/16] SLUB: Add SLAB style per cpu queueing Christoph Lameter
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 72+ messages in thread
From: Christoph Lameter @ 2010-06-25 21:20 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, Nick Piggin, Matt Mackall

[-- Attachment #1: slub_dynamic_kmem_alloc --]
[-- Type: text/plain, Size: 11531 bytes --]

kmalloc caches are statically defined and may take up a lot of space just
because the node array has to be dimensioned for the largest supported
node count.

This patch makes the size of the kmem_cache structure dynamic throughout by
creating a kmem_cache slab cache for the kmem_cache objects. The bootstrap
works by allocating the initial one or two kmem_cache objects from the
page allocator.
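
The bootstrap ordering is somewhat hard to follow in the diff, so here is a
rough userspace model of the idea (made-up names, not the kernel code; the
real patch additionally fixes up the page->slab back pointers, see
kmem_cache_bootstrap_fixup() below):

#include <stdlib.h>
#include <string.h>

struct cache {				/* stands in for struct kmem_cache */
	const char *name;
	size_t object_size;
};

static struct cache *cache_of_caches;	/* stands in for the "kmem_cache" cache */

/* stands in for kmem_cache_alloc(kmem_cache, GFP_NOWAIT) */
static void *cache_alloc(struct cache *from)
{
	return calloc(1, from->object_size);
}

int main(void)
{
	/* 1. Carve a temporary struct cache out of page-allocator memory. */
	struct cache *temp = calloc(1, sizeof(*temp));

	if (!temp)
		return 1;

	/* 2. Open it as the cache that will hold struct cache objects. */
	temp->name = "kmem_cache";
	temp->object_size = sizeof(struct cache);

	/* 3. Allocate the final structure from that cache itself and copy
	 *    the temporary contents over. */
	cache_of_caches = cache_alloc(temp);
	if (!cache_of_caches)
		return 1;
	memcpy(cache_of_caches, temp, sizeof(*temp));

	/* 4. The temporary page-allocator memory can now be freed; all
	 *    further struct cache objects come from cache_of_caches. */
	free(temp);
	return 0;
}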

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slub_def.h |    7 -
 mm/slub.c                |  182 +++++++++++++++++++++++++++++++++++------------
 2 files changed, 139 insertions(+), 50 deletions(-)

Index: linux-2.6.34/include/linux/slub_def.h
===================================================================
--- linux-2.6.34.orig/include/linux/slub_def.h	2010-06-23 09:46:33.000000000 -0500
+++ linux-2.6.34/include/linux/slub_def.h	2010-06-23 10:03:01.000000000 -0500
@@ -136,19 +136,16 @@ struct kmem_cache {
 
 #ifdef CONFIG_ZONE_DMA
 #define SLUB_DMA __GFP_DMA
-/* Reserve extra caches for potential DMA use */
-#define KMALLOC_CACHES (2 * SLUB_PAGE_SHIFT)
 #else
 /* Disable DMA functionality */
 #define SLUB_DMA (__force gfp_t)0
-#define KMALLOC_CACHES SLUB_PAGE_SHIFT
 #endif
 
 /*
  * We keep the general caches in an array of slab caches that are used for
  * 2^x bytes of allocations.
  */
-extern struct kmem_cache kmalloc_caches[KMALLOC_CACHES];
+extern struct kmem_cache *kmalloc_caches[SLUB_PAGE_SHIFT];
 
 /*
  * Sorry that the following has to be that ugly but some versions of GCC
@@ -213,7 +210,7 @@ static __always_inline struct kmem_cache
 	if (index == 0)
 		return NULL;
 
-	return &kmalloc_caches[index];
+	return kmalloc_caches[index];
 }
 
 void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
Index: linux-2.6.34/mm/slub.c
===================================================================
--- linux-2.6.34.orig/mm/slub.c	2010-06-23 10:02:49.000000000 -0500
+++ linux-2.6.34/mm/slub.c	2010-06-23 10:04:56.000000000 -0500
@@ -179,7 +179,7 @@ static struct notifier_block slab_notifi
 
 static enum {
 	DOWN,		/* No slab functionality available */
-	PARTIAL,	/* kmem_cache_open() works but kmalloc does not */
+	PARTIAL,	/* Kmem_cache_node works */
 	UP,		/* Everything works but does not show up in sysfs */
 	SYSFS		/* Sysfs up */
 } slab_state = DOWN;
@@ -2079,6 +2079,8 @@ static inline int alloc_kmem_cache_cpus(
 }
 
 #ifdef CONFIG_NUMA
+static struct kmem_cache *kmem_cache_node;
+
 /*
  * No kmalloc_node yet so do it by hand. We know that this is the first
  * slab on the node for this slabcache. There are no concurrent accesses
@@ -2094,9 +2096,9 @@ static void early_kmem_cache_node_alloc(
 	struct kmem_cache_node *n;
 	unsigned long flags;
 
-	BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
+	BUG_ON(kmem_cache_node->size < sizeof(struct kmem_cache_node));
 
-	page = new_slab(kmalloc_caches, GFP_KERNEL & gfp_allowed_mask, node);
+	page = new_slab(kmem_cache_node, GFP_KERNEL & gfp_allowed_mask, node);
 
 	BUG_ON(!page);
 	if (page_to_nid(page) != node) {
@@ -2108,15 +2110,15 @@ static void early_kmem_cache_node_alloc(
 
 	n = page->freelist;
 	BUG_ON(!n);
-	page->freelist = get_freepointer(kmalloc_caches, n);
+	page->freelist = get_freepointer(kmem_cache_node, n);
 	page->inuse++;
-	kmalloc_caches->node[node] = n;
+	kmem_cache_node->node[node] = n;
 #ifdef CONFIG_SLUB_DEBUG
-	init_object(kmalloc_caches, n, 1);
-	init_tracking(kmalloc_caches, n);
+	init_object(kmem_cache_node, n, 1);
+	init_tracking(kmem_cache_node, n);
 #endif
-	init_kmem_cache_node(n, kmalloc_caches);
-	inc_slabs_node(kmalloc_caches, node, page->objects);
+	init_kmem_cache_node(n, kmem_cache_node);
+	inc_slabs_node(kmem_cache_node, node, page->objects);
 
 	/*
 	 * lockdep requires consistent irq usage for each lock
@@ -2134,8 +2136,10 @@ static void free_kmem_cache_nodes(struct
 
 	for_each_node_state(node, N_NORMAL_MEMORY) {
 		struct kmem_cache_node *n = s->node[node];
+
 		if (n)
-			kmem_cache_free(kmalloc_caches, n);
+			kmem_cache_free(kmem_cache_node, n);
+
 		s->node[node] = NULL;
 	}
 }
@@ -2151,7 +2155,7 @@ static int init_kmem_cache_nodes(struct 
 			early_kmem_cache_node_alloc(node);
 			continue;
 		}
-		n = kmem_cache_alloc_node(kmalloc_caches,
+		n = kmem_cache_alloc_node(kmem_cache_node,
 			GFP_KERNEL & gfp_allowed_mask, node);
 
 		if (!n) {
@@ -2505,11 +2509,13 @@ EXPORT_SYMBOL(kmem_cache_destroy);
  *		Kmalloc subsystem
  *******************************************************************/
 
-struct kmem_cache kmalloc_caches[KMALLOC_CACHES] __cacheline_aligned;
+struct kmem_cache *kmalloc_caches[SLUB_PAGE_SHIFT];
 EXPORT_SYMBOL(kmalloc_caches);
 
+static struct kmem_cache *kmem_cache;
+
 #ifdef CONFIG_ZONE_DMA
-static struct kmem_cache kmalloc_dma_caches[SLUB_PAGE_SHIFT];
+static struct kmem_cache *kmalloc_dma_caches[SLUB_PAGE_SHIFT];
 #endif
 
 static int __init setup_slub_min_order(char *str)
@@ -2548,9 +2554,13 @@ static int __init setup_slub_nomerge(cha
 
 __setup("slub_nomerge", setup_slub_nomerge);
 
-static void create_kmalloc_cache(struct kmem_cache *s,
+static void create_kmalloc_cache(struct kmem_cache **sp,
 		const char *name, int size, unsigned int flags)
 {
+	struct kmem_cache *s;
+
+	s = kmem_cache_alloc(kmem_cache, GFP_NOWAIT);
+
 	/*
 	 * This function is called with IRQs disabled during early-boot on
 	 * single CPU so there's no need to take slub_lock here.
@@ -2559,6 +2569,8 @@ static void create_kmalloc_cache(struct 
 								flags, NULL))
 		goto panic;
 
+	*sp = s;
+
 	list_add(&s->list, &slab_caches);
 
 	if (!sysfs_slab_add(s))
@@ -2620,10 +2632,10 @@ static struct kmem_cache *get_slab(size_
 
 #ifdef CONFIG_ZONE_DMA
 	if (unlikely((flags & SLUB_DMA)))
-		return &kmalloc_dma_caches[index];
+		return kmalloc_dma_caches[index];
 
 #endif
-	return &kmalloc_caches[index];
+	return kmalloc_caches[index];
 }
 
 void *__kmalloc(size_t size, gfp_t flags)
@@ -2946,46 +2958,114 @@ static int slab_memory_callback(struct n
  *			Basic setup of slabs
  *******************************************************************/
 
+/*
+ * Used for early kmem_cache structures that were allocated using
+ * the page allocator
+ */
+
+static void __init kmem_cache_bootstrap_fixup(struct kmem_cache *s)
+{
+	int node;
+
+	list_add(&s->list, &slab_caches);
+	sysfs_slab_add(s);
+	s->refcount = -1;
+
+	for_each_node(node) {
+		struct kmem_cache_node *n = get_node(s, node);
+		struct page *p;
+
+		if (n) {
+			list_for_each_entry(p, &n->partial, lru)
+				p->slab = s;
+
+#ifdef CONFIG_SLAB_DEBUG
+			list_for_each_entry(p, &n->full, lru)
+				p->slab = s;
+#endif
+		}
+	}
+}
+
 void __init kmem_cache_init(void)
 {
 	int i;
 	int caches = 0;
+	struct kmem_cache *temp_kmem_cache;
+	int order;
 
 #ifdef CONFIG_NUMA
+	struct kmem_cache *temp_kmem_cache_node;
+	unsigned long kmalloc_size;
+
+	kmem_size = offsetof(struct kmem_cache, node) +
+				nr_node_ids * sizeof(struct kmem_cache_node *);
+
+	/* Allocate two kmem_caches from the page allocator */
+	kmalloc_size = ALIGN(kmem_size, cache_line_size());
+	order = get_order(2 * kmalloc_size);
+	kmem_cache = (void *)__get_free_pages(GFP_NOWAIT, order);
+
 	/*
 	 * Must first have the slab cache available for the allocations of the
 	 * struct kmem_cache_node's. There is special bootstrap code in
 	 * kmem_cache_open for slab_state == DOWN.
 	 */
-	create_kmalloc_cache(&kmalloc_caches[0], "kmem_cache_node",
-		sizeof(struct kmem_cache_node), 0);
-	kmalloc_caches[0].refcount = -1;
-	caches++;
+	kmem_cache_node = (void *)kmem_cache + kmalloc_size;
+
+	kmem_cache_open(kmem_cache_node, "kmem_cache_node",
+		sizeof(struct kmem_cache_node),
+		0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
 
 	hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);
+#else
+	/* Allocate a single kmem_cache from the page allocator */
+	kmem_size = sizeof(struct kmem_cache);
+	order = get_order(kmem_size);
+	kmem_cache = (void *)__get_free_pages(GFP_NOWAIT, order);
 #endif
 
 	/* Able to allocate the per node structures */
 	slab_state = PARTIAL;
 
-	/* Caches that are not of the two-to-the-power-of size */
-	if (KMALLOC_MIN_SIZE <= 32) {
-		create_kmalloc_cache(&kmalloc_caches[1],
-				"kmalloc-96", 96, 0);
-		caches++;
-	}
-	if (KMALLOC_MIN_SIZE <= 64) {
-		create_kmalloc_cache(&kmalloc_caches[2],
-				"kmalloc-192", 192, 0);
-		caches++;
-	}
+	temp_kmem_cache = kmem_cache;
+	kmem_cache_open(kmem_cache, "kmem_cache", kmem_size,
+		0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+	kmem_cache = kmem_cache_alloc(kmem_cache, GFP_NOWAIT);
+	memcpy(kmem_cache, temp_kmem_cache, kmem_size);
 
-	for (i = KMALLOC_SHIFT_LOW; i < SLUB_PAGE_SHIFT; i++) {
-		create_kmalloc_cache(&kmalloc_caches[i],
-			"kmalloc", 1 << i, 0);
-		caches++;
-	}
+#ifdef CONFIG_NUMA
+	/*
+	 * Allocate kmem_cache_node properly from the kmem_cache slab.
+	 * kmem_cache_node is separately allocated so no need to
+	 * update any list pointers.
+	 */
+	temp_kmem_cache_node = kmem_cache_node;
 
+	kmem_cache_node = kmem_cache_alloc(kmem_cache, GFP_NOWAIT);
+	memcpy(kmem_cache_node, temp_kmem_cache_node, kmem_size);
+
+	kmem_cache_bootstrap_fixup(kmem_cache_node);
+
+	caches++;
+#else
+	/*
+	 * kmem_cache has kmem_cache_node embedded and we moved it!
+	 * Update the list heads
+	 */
+	INIT_LIST_HEAD(&kmem_cache->local_node.partial);
+	list_splice(&temp_kmem_cache->local_node.partial, &kmem_cache->local_node.partial);
+#ifdef CONFIG_SLUB_DEBUG
+	INIT_LIST_HEAD(&kmem_cache->local_node.full);
+	list_splice(&temp_kmem_cache->local_node.full, &kmem_cache->local_node.full);
+#endif
+#endif
+	kmem_cache_bootstrap_fixup(kmem_cache);
+	caches++;
+	/* Free temporary boot structure */
+	free_pages((unsigned long)temp_kmem_cache, order);
+
+	/* Now we can use the kmem_cache to allocate kmalloc slabs */
 
 	/*
 	 * Patch up the size_index table if we have strange large alignment
@@ -3025,23 +3105,35 @@ void __init kmem_cache_init(void)
 			size_index[size_index_elem(i)] = 8;
 	}
 
+
+	/* Caches that are not of the two-to-the-power-of size */
+	if (KMALLOC_MIN_SIZE <= 32) {
+		create_kmalloc_cache(&kmalloc_caches[1],
+				"kmalloc-96", 96, 0);
+		caches++;
+	}
+	if (KMALLOC_MIN_SIZE <= 64) {
+		create_kmalloc_cache(&kmalloc_caches[2],
+				"kmalloc-192", 192, 0);
+		caches++;
+	}
+
+	for (i = KMALLOC_SHIFT_LOW; i < SLUB_PAGE_SHIFT; i++) {
+		create_kmalloc_cache(&kmalloc_caches[i],
+			"kmalloc", 1 << i, 0);
+		caches++;
+	}
+
 	slab_state = UP;
 
 	/* Provide the correct kmalloc names now that the caches are up */
 	for (i = KMALLOC_SHIFT_LOW; i < SLUB_PAGE_SHIFT; i++)
-		kmalloc_caches[i].name =
+		kmalloc_caches[i]->name =
 			kasprintf(GFP_NOWAIT, "kmalloc-%d", 1 << i);
 
 #ifdef CONFIG_SMP
 	register_cpu_notifier(&slab_notifier);
 #endif
-#ifdef CONFIG_NUMA
-	kmem_size = offsetof(struct kmem_cache, node) +
-				nr_node_ids * sizeof(struct kmem_cache_node *);
-#else
-	kmem_size = sizeof(struct kmem_cache);
-#endif
-
 	printk(KERN_INFO
 		"SLUB: Genslabs=%d, HWalign=%d, Order=%d-%d, MinObjects=%d,"
 		" CPUs=%d, Nodes=%d\n",
@@ -3056,7 +3148,7 @@ void __init kmem_cache_init_late(void)
 	int i;
 
 	for (i = 0; i < SLUB_PAGE_SHIFT; i++) {
-		struct kmem_cache *s = &kmalloc_caches[i];
+		struct kmem_cache *s = kmalloc_caches[i];
 
 		if (s && s->size) {
 			char *name = kasprintf(GFP_KERNEL,


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [S+Q 12/16] SLUB: Add SLAB style per cpu queueing
  2010-06-25 21:20 [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench Christoph Lameter
                   ` (10 preceding siblings ...)
  2010-06-25 21:20 ` [S+Q 11/16] slub: Dynamically size kmalloc cache allocations Christoph Lameter
@ 2010-06-25 21:20 ` Christoph Lameter
  2010-06-26  2:32   ` Nick Piggin
  2010-06-25 21:20 ` [S+Q 13/16] SLUB: Resize the new cpu queues Christoph Lameter
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 72+ messages in thread
From: Christoph Lameter @ 2010-06-25 21:20 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, Nick Piggin, Matt Mackall

[-- Attachment #1: sled_core --]
[-- Type: text/plain, Size: 43670 bytes --]

This patch adds SLAB style per cpu queueing and a new way of managing
objects in the slabs using bitmaps. A per cpu queue allows free
operations to be properly buffered and a bitmap tracks the
free/allocated state of the objects in each slab. The scheme uses
slightly more memory (slabs with many objects need a bitmap of a few
words placed in the slab page itself) but in general competes well in
terms of space use. The bitmap format avoids the separate slab
management structure that SLAB needs for each slab page, so the
metadata is more compact and easily fits into a cacheline.
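
As a rough illustration of the sizing (a simplified userspace sketch of
the oo_make() logic in the patch below, assuming object sizes are word
aligned as SLUB guarantees; not the actual kernel code):

#include <limits.h>	/* CHAR_BIT */

/* Objects that fit in a slab once the bitmap is carved out of it */
static unsigned long objects_per_slab(unsigned long page_size,
					unsigned long size)
{
	unsigned long ws = sizeof(unsigned long);
	unsigned long bpl = CHAR_BIT * ws;	/* BITS_PER_LONG */
	unsigned long objects = page_size / size;

	if (objects > bpl)
		/* Bitmap must be stored in the slab page itself */
		objects = ((page_size / ws) * bpl) /
				((size / ws) * bpl + 1);
	return objects;
}

With 4K pages and 64-bit longs a 32 byte object then yields 127 objects
instead of the naive 128; the resulting 16 byte bitmap fits in the 32
bytes left over at the end of the page.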

The SLAB scheme of not touching the object during management is adopted.
SLUB can now efficiently free and allocate cache cold objects.

The queueing scheme also addresses the issue that the free slowpath
was taken too frequently.

This patch only implements statically sized per cpu queues.
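
A minimal sketch of the queueing idea (userspace, heavily simplified;
refill_from_slabs() and drain_to_slabs() are placeholders for the
get_partial()/new_slab() and drain_objects() paths added below, and
the real code runs both paths with interrupts disabled):

#include <string.h>

#define QUEUE_SIZE 50		/* matches the constants added below */
#define BATCH_SIZE 25

struct cpu_queue {
	int objects;			/* number of objects in the queue */
	void *object[QUEUE_SIZE];	/* LIFO stack of object pointers */
};

/* Placeholders for the real refill and drain paths */
static int refill_from_slabs(void **object, int nr) { return 0; }
static void drain_to_slabs(void **object, int nr) { }

static void *queue_alloc(struct cpu_queue *q)
{
	if (!q->objects)	/* slow path: pull a batch out of the slabs */
		q->objects = refill_from_slabs(q->object, BATCH_SIZE);
	if (!q->objects)
		return NULL;
	return q->object[--q->objects];
}

static void queue_free(struct cpu_queue *q, void *x)
{
	if (q->objects >= QUEUE_SIZE) {	/* slow path: drain a batch */
		drain_to_slabs(q->object, BATCH_SIZE);
		q->objects -= BATCH_SIZE;
		memmove(q->object, q->object + BATCH_SIZE,
				q->objects * sizeof(void *));
	}
	q->object[q->objects++] = x;
}

Allocation pops the most recently freed (cache hot) object while the
drain pushes the oldest (cache cold) objects back into their slabs,
which is how frees stay cache cold as described above.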

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slub_def.h |   10 
 mm/slub.c                |  920 ++++++++++++++++++++---------------------------
 2 files changed, 416 insertions(+), 514 deletions(-)

Index: linux-2.6.34/include/linux/slub_def.h
===================================================================
--- linux-2.6.34.orig/include/linux/slub_def.h	2010-06-23 10:03:01.000000000 -0500
+++ linux-2.6.34/include/linux/slub_def.h	2010-06-23 10:22:30.000000000 -0500
@@ -34,13 +34,16 @@ enum stat_item {
 	ORDER_FALLBACK,		/* Number of times fallback was necessary */
 	NR_SLUB_STAT_ITEMS };
 
+#define QUEUE_SIZE 50
+#define BATCH_SIZE 25
+
 struct kmem_cache_cpu {
-	void **freelist;	/* Pointer to first free per cpu object */
-	struct page *page;	/* The slab from which we are allocating */
-	int node;		/* The node of the page (or -1 for debug) */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
+	int objects;		/* Number of objects available */
+	int node;		/* The node of the page (or -1 for debug) */
+	void *object[QUEUE_SIZE];		/* List of objects */
 };
 
 struct kmem_cache_node {
@@ -72,7 +75,6 @@ struct kmem_cache {
 	unsigned long flags;
 	int size;		/* The size of an object including meta data */
 	int objsize;		/* The size of an object without meta data */
-	int offset;		/* Free pointer offset. */
 	struct kmem_cache_order_objects oo;
 
 	/* Allocation and freeing of slabs */
Index: linux-2.6.34/mm/slub.c
===================================================================
--- linux-2.6.34.orig/mm/slub.c	2010-06-23 10:04:56.000000000 -0500
+++ linux-2.6.34/mm/slub.c	2010-06-23 10:24:11.000000000 -0500
@@ -84,27 +84,6 @@
  * minimal so we rely on the page allocators per cpu caches for
  * fast frees and allocs.
  *
- * Overloading of page flags that are otherwise used for LRU management.
- *
- * PageActive 		The slab is frozen and exempt from list processing.
- * 			This means that the slab is dedicated to a purpose
- * 			such as satisfying allocations for a specific
- * 			processor. Objects may be freed in the slab while
- * 			it is frozen but slab_free will then skip the usual
- * 			list operations. It is up to the processor holding
- * 			the slab to integrate the slab into the slab lists
- * 			when the slab is no longer needed.
- *
- * 			One use of this flag is to mark slabs that are
- * 			used for allocations. Then such a slab becomes a cpu
- * 			slab. The cpu slab may be equipped with an additional
- * 			freelist that allows lockless access to
- * 			free objects in addition to the regular freelist
- * 			that requires the slab lock.
- *
- * PageError		Slab requires special handling due to debug
- * 			options set. This moves	slab handling out of
- * 			the fast path and disables lockless freelists.
  */
 
 #define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
@@ -259,38 +238,71 @@ static inline int check_valid_pointer(st
 	return 1;
 }
 
-static inline void *get_freepointer(struct kmem_cache *s, void *object)
-{
-	return *(void **)(object + s->offset);
-}
-
-static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
-{
-	*(void **)(object + s->offset) = fp;
-}
-
 /* Loop over all objects in a slab */
 #define for_each_object(__p, __s, __addr, __objects) \
 	for (__p = (__addr); __p < (__addr) + (__objects) * (__s)->size;\
 			__p += (__s)->size)
 
-/* Scan freelist */
-#define for_each_free_object(__p, __s, __free) \
-	for (__p = (__free); __p; __p = get_freepointer((__s), __p))
-
 /* Determine object index from a given position */
 static inline int slab_index(void *p, struct kmem_cache *s, void *addr)
 {
 	return (p - addr) / s->size;
 }
 
+static inline int map_in_page_struct(struct page *page)
+{
+	return page->objects <= BITS_PER_LONG;
+}
+
+static inline unsigned long *map(struct page *page)
+{
+	if (map_in_page_struct(page))
+		return (unsigned long *)&page->freelist;
+	else
+		return page->freelist;
+}
+
+static inline int map_size(struct page *page)
+{
+	return BITS_TO_LONGS(page->objects) * sizeof(unsigned long);
+}
+
+static inline int available(struct page *page)
+{
+	return bitmap_weight(map(page), page->objects);
+}
+
+static inline int all_objects_available(struct page *page)
+{
+	return bitmap_full(map(page), page->objects);
+}
+
+static inline int all_objects_used(struct page *page)
+{
+	return bitmap_empty(map(page), page->objects);
+}
+
+static inline int inuse(struct page *page)
+{
+	return page->objects - available(page);
+}
+
 static inline struct kmem_cache_order_objects oo_make(int order,
 						unsigned long size)
 {
-	struct kmem_cache_order_objects x = {
-		(order << OO_SHIFT) + (PAGE_SIZE << order) / size
-	};
+	struct kmem_cache_order_objects x;
+	unsigned long objects;
+	unsigned long page_size = PAGE_SIZE << order;
+	unsigned long ws = sizeof(unsigned long);
+
+	objects = page_size / size;
+
+	if (objects > BITS_PER_LONG)
+		/* Bitmap must fit into the slab as well */
+		objects = ((page_size / ws) * BITS_PER_LONG) /
+			((size / ws) * BITS_PER_LONG + 1);
 
+	x.x = (order << OO_SHIFT) + objects;
 	return x;
 }
 
@@ -357,10 +369,7 @@ static struct track *get_track(struct km
 {
 	struct track *p;
 
-	if (s->offset)
-		p = object + s->offset + sizeof(void *);
-	else
-		p = object + s->inuse;
+	p = object + s->inuse;
 
 	return p + alloc;
 }
@@ -408,8 +417,8 @@ static void print_tracking(struct kmem_c
 
 static void print_page_info(struct page *page)
 {
-	printk(KERN_ERR "INFO: Slab 0x%p objects=%u used=%u fp=0x%p flags=0x%04lx\n",
-		page, page->objects, page->inuse, page->freelist, page->flags);
+	printk(KERN_ERR "INFO: Slab 0x%p objects=%u new=%u fp=0x%p flags=0x%04lx\n",
+		page, page->objects, available(page), page->freelist, page->flags);
 
 }
 
@@ -448,8 +457,8 @@ static void print_trailer(struct kmem_ca
 
 	print_page_info(page);
 
-	printk(KERN_ERR "INFO: Object 0x%p @offset=%tu fp=0x%p\n\n",
-			p, p - addr, get_freepointer(s, p));
+	printk(KERN_ERR "INFO: Object 0x%p @offset=%tu\n\n",
+			p, p - addr);
 
 	if (p > addr + 16)
 		print_section("Bytes b4", p - 16, 16);
@@ -460,10 +469,7 @@ static void print_trailer(struct kmem_ca
 		print_section("Redzone", p + s->objsize,
 			s->inuse - s->objsize);
 
-	if (s->offset)
-		off = s->offset + sizeof(void *);
-	else
-		off = s->inuse;
+	off = s->inuse;
 
 	if (s->flags & SLAB_STORE_USER)
 		off += 2 * sizeof(struct track);
@@ -557,8 +563,6 @@ static int check_bytes_and_report(struct
  *
  * object address
  * 	Bytes of the object to be managed.
- * 	If the freepointer may overlay the object then the free
- * 	pointer is the first word of the object.
  *
  * 	Poisoning uses 0x6b (POISON_FREE) and the last byte is
  * 	0xa5 (POISON_END)
@@ -574,9 +578,8 @@ static int check_bytes_and_report(struct
  * object + s->inuse
  * 	Meta data starts here.
  *
- * 	A. Free pointer (if we cannot overwrite object on free)
- * 	B. Tracking data for SLAB_STORE_USER
- * 	C. Padding to reach required alignment boundary or at mininum
+ * 	A. Tracking data for SLAB_STORE_USER
+ * 	B. Padding to reach required alignment boundary or at mininum
  * 		one word if debugging is on to be able to detect writes
  * 		before the word boundary.
  *
@@ -594,10 +597,6 @@ static int check_pad_bytes(struct kmem_c
 {
 	unsigned long off = s->inuse;	/* The end of info */
 
-	if (s->offset)
-		/* Freepointer is placed after the object. */
-		off += sizeof(void *);
-
 	if (s->flags & SLAB_STORE_USER)
 		/* We also have user information there */
 		off += 2 * sizeof(struct track);
@@ -622,15 +621,42 @@ static int slab_pad_check(struct kmem_ca
 		return 1;
 
 	start = page_address(page);
-	length = (PAGE_SIZE << compound_order(page));
-	end = start + length;
-	remainder = length % s->size;
+	end = start + (PAGE_SIZE << compound_order(page));
+
+	/* Check for special case of bitmap at the end of the page */
+	if (!map_in_page_struct(page)) {
+		if ((u8 *)page->freelist > start && (u8 *)page->freelist < end)
+			end = page->freelist;
+		else
+			slab_err(s, page, "pagemap pointer invalid =%p start=%p end=%p objects=%d",
+				page->freelist, start, end, page->objects);
+	}
+
+	length = end - start;
+	remainder = length - page->objects * s->size;
 	if (!remainder)
 		return 1;
 
 	fault = check_bytes(end - remainder, POISON_INUSE, remainder);
-	if (!fault)
-		return 1;
+	if (!fault) {
+		u8 *freelist_end;
+
+		if (map_in_page_struct(page))
+			return 1;
+
+		end = start + (PAGE_SIZE << compound_order(page));
+		freelist_end = page->freelist + map_size(page);
+		remainder = end - freelist_end;
+
+		if (!remainder)
+			return 1;
+
+		fault = check_bytes(freelist_end, POISON_INUSE,
+				remainder);
+		if (!fault)
+			return 1;
+	}
+
 	while (end > fault && end[-1] == POISON_INUSE)
 		end--;
 
@@ -673,25 +699,6 @@ static int check_object(struct kmem_cach
 		 */
 		check_pad_bytes(s, page, p);
 	}
-
-	if (!s->offset && active)
-		/*
-		 * Object and freepointer overlap. Cannot check
-		 * freepointer while object is allocated.
-		 */
-		return 1;
-
-	/* Check free pointer validity */
-	if (!check_valid_pointer(s, page, get_freepointer(s, p))) {
-		object_err(s, page, p, "Freepointer corrupt");
-		/*
-		 * No choice but to zap it and thus lose the remainder
-		 * of the free objects in this slab. May cause
-		 * another error because the object count is now wrong.
-		 */
-		set_freepointer(s, p, NULL);
-		return 0;
-	}
 	return 1;
 }
 
@@ -712,51 +719,45 @@ static int check_slab(struct kmem_cache 
 			s->name, page->objects, maxobj);
 		return 0;
 	}
-	if (page->inuse > page->objects) {
-		slab_err(s, page, "inuse %u > max %u",
-			s->name, page->inuse, page->objects);
-		return 0;
-	}
+
 	/* Slab_pad_check fixes things up after itself */
 	slab_pad_check(s, page);
 	return 1;
 }
 
 /*
- * Determine if a certain object on a page is on the freelist. Must hold the
- * slab lock to guarantee that the chains are in a consistent state.
+ * Determine if a certain object on a page is on the free map.
  */
-static int on_freelist(struct kmem_cache *s, struct page *page, void *search)
+static int object_marked_free(struct kmem_cache *s, struct page *page, void *search)
+{
+	return test_bit(slab_index(search, s, page_address(page)), map(page));
+}
+
+/* Verify the integrity of the metadata in a slab page */
+static int verify_slab(struct kmem_cache *s, struct page *page)
 {
 	int nr = 0;
-	void *fp = page->freelist;
-	void *object = NULL;
 	unsigned long max_objects;
+	void *start = page_address(page);
+	unsigned long size = PAGE_SIZE << compound_order(page);
 
-	while (fp && nr <= page->objects) {
-		if (fp == search)
-			return 1;
-		if (!check_valid_pointer(s, page, fp)) {
-			if (object) {
-				object_err(s, page, object,
-					"Freechain corrupt");
-				set_freepointer(s, object, NULL);
-				break;
-			} else {
-				slab_err(s, page, "Freepointer corrupt");
-				page->freelist = NULL;
-				page->inuse = page->objects;
-				slab_fix(s, "Freelist cleared");
-				return 0;
-			}
-			break;
-		}
-		object = fp;
-		fp = get_freepointer(s, object);
-		nr++;
+	nr = available(page);
+
+	if (map_in_page_struct(page))
+		max_objects = size / s->size;
+	else {
+		if (page->freelist <= start || page->freelist >= start + size) {
+			slab_err(s, page, "Invalid pointer to bitmap of free objects max_objects=%d!",
+				page->objects);
+			/* Switch to bitmap in page struct */
+			page->objects = max_objects = BITS_PER_LONG;
+			page->freelist = 0L;
+			slab_fix(s, "Slab sized for %d objects. All objects marked in use.",
+				BITS_PER_LONG);
+		} else
+			max_objects = ((void *)page->freelist - start) / s->size;
 	}
 
-	max_objects = (PAGE_SIZE << compound_order(page)) / s->size;
 	if (max_objects > MAX_OBJS_PER_PAGE)
 		max_objects = MAX_OBJS_PER_PAGE;
 
@@ -765,24 +766,19 @@ static int on_freelist(struct kmem_cache
 			"should be %d", page->objects, max_objects);
 		page->objects = max_objects;
 		slab_fix(s, "Number of objects adjusted.");
+		return 0;
 	}
-	if (page->inuse != page->objects - nr) {
-		slab_err(s, page, "Wrong object count. Counter is %d but "
-			"counted were %d", page->inuse, page->objects - nr);
-		page->inuse = page->objects - nr;
-		slab_fix(s, "Object count adjusted.");
-	}
-	return search == NULL;
+	return 1;
 }
 
 static void trace(struct kmem_cache *s, struct page *page, void *object,
 								int alloc)
 {
 	if (s->flags & SLAB_TRACE) {
-		printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n",
+		printk(KERN_INFO "TRACE %s %s 0x%p free=%d fp=0x%p\n",
 			s->name,
 			alloc ? "alloc" : "free",
-			object, page->inuse,
+			object, available(page),
 			page->freelist);
 
 		if (!alloc)
@@ -795,14 +791,19 @@ static void trace(struct kmem_cache *s, 
 /*
  * Tracking of fully allocated slabs for debugging purposes.
  */
-static void add_full(struct kmem_cache_node *n, struct page *page)
+static inline void add_full(struct kmem_cache *s,
+		struct kmem_cache_node *n, struct page *page)
 {
+
+	if (!(s->flags & SLAB_STORE_USER))
+		return;
+
 	spin_lock(&n->list_lock);
 	list_add(&page->lru, &n->full);
 	spin_unlock(&n->list_lock);
 }
 
-static void remove_full(struct kmem_cache *s, struct page *page)
+static inline void remove_full(struct kmem_cache *s, struct page *page)
 {
 	struct kmem_cache_node *n;
 
@@ -863,25 +864,30 @@ static void setup_object_debug(struct km
 	init_tracking(s, object);
 }
 
-static int alloc_debug_processing(struct kmem_cache *s, struct page *page,
+static int alloc_debug_processing(struct kmem_cache *s,
 					void *object, unsigned long addr)
 {
+	struct page *page = virt_to_head_page(object);
+
 	if (!check_slab(s, page))
 		goto bad;
 
-	if (!on_freelist(s, page, object)) {
-		object_err(s, page, object, "Object already allocated");
+	if (!check_valid_pointer(s, page, object)) {
+		object_err(s, page, object, "Pointer check fails");
 		goto bad;
 	}
 
-	if (!check_valid_pointer(s, page, object)) {
-		object_err(s, page, object, "Freelist Pointer check fails");
+	if (object_marked_free(s, page, object)) {
+		object_err(s, page, object, "Allocated object still marked free in slab");
 		goto bad;
 	}
 
 	if (!check_object(s, page, object, 0))
 		goto bad;
 
+	if (!verify_slab(s, page))
+		goto bad;
+
 	/* Success perform special debug activities for allocs */
 	if (s->flags & SLAB_STORE_USER)
 		set_track(s, object, TRACK_ALLOC, addr);
@@ -897,15 +903,16 @@ bad:
 		 * as used avoids touching the remaining objects.
 		 */
 		slab_fix(s, "Marking all objects used");
-		page->inuse = page->objects;
-		page->freelist = NULL;
+		bitmap_zero(map(page), page->objects);
 	}
 	return 0;
 }
 
-static int free_debug_processing(struct kmem_cache *s, struct page *page,
+static int free_debug_processing(struct kmem_cache *s,
 					void *object, unsigned long addr)
 {
+	struct page *page = virt_to_head_page(object);
+
 	if (!check_slab(s, page))
 		goto fail;
 
@@ -914,7 +921,7 @@ static int free_debug_processing(struct 
 		goto fail;
 	}
 
-	if (on_freelist(s, page, object)) {
+	if (object_marked_free(s, page, object)) {
 		object_err(s, page, object, "Object already free");
 		goto fail;
 	}
@@ -937,13 +944,11 @@ static int free_debug_processing(struct 
 		goto fail;
 	}
 
-	/* Special debug activities for freeing objects */
-	if (!PageSlubFrozen(page) && !page->freelist)
-		remove_full(s, page);
 	if (s->flags & SLAB_STORE_USER)
 		set_track(s, object, TRACK_FREE, addr);
 	trace(s, page, object, 0);
 	init_object(s, object, 0);
+	verify_slab(s, page);
 	return 1;
 
 fail:
@@ -1048,7 +1053,8 @@ static inline int slab_pad_check(struct 
 			{ return 1; }
 static inline int check_object(struct kmem_cache *s, struct page *page,
 			void *object, int active) { return 1; }
-static inline void add_full(struct kmem_cache_node *n, struct page *page) {}
+static inline void add_full(struct kmem_cache *s,
+		struct kmem_cache_node *n, struct page *page) {}
 static inline unsigned long kmem_cache_flags(unsigned long objsize,
 	unsigned long flags, const char *name,
 	void (*ctor)(void *))
@@ -1150,8 +1156,8 @@ static struct page *new_slab(struct kmem
 {
 	struct page *page;
 	void *start;
-	void *last;
 	void *p;
+	unsigned long size;
 
 	BUG_ON(flags & GFP_SLAB_BUG_MASK);
 
@@ -1163,23 +1169,20 @@ static struct page *new_slab(struct kmem
 	inc_slabs_node(s, page_to_nid(page), page->objects);
 	page->slab = s;
 	page->flags |= 1 << PG_slab;
-
 	start = page_address(page);
+	size = PAGE_SIZE << compound_order(page);
 
 	if (unlikely(s->flags & SLAB_POISON))
-		memset(start, POISON_INUSE, PAGE_SIZE << compound_order(page));
+		memset(start, POISON_INUSE, size);
 
-	last = start;
-	for_each_object(p, s, start, page->objects) {
-		setup_object(s, page, last);
-		set_freepointer(s, last, p);
-		last = p;
-	}
-	setup_object(s, page, last);
-	set_freepointer(s, last, NULL);
+	if (!map_in_page_struct(page))
+		page->freelist = start + page->objects * s->size;
+
+	bitmap_fill(map(page), page->objects);
+
+	for_each_object(p, s, start, page->objects)
+		setup_object(s, page, p);
 
-	page->freelist = start;
-	page->inuse = 0;
 out:
 	return page;
 }
@@ -1303,7 +1306,6 @@ static inline int lock_and_freeze_slab(s
 	if (slab_trylock(page)) {
 		list_del(&page->lru);
 		n->nr_partial--;
-		__SetPageSlubFrozen(page);
 		return 1;
 	}
 	return 0;
@@ -1406,113 +1408,132 @@ static struct page *get_partial(struct k
 }
 
 /*
- * Move a page back to the lists.
- *
- * Must be called with the slab lock held.
- *
- * On exit the slab lock will have been dropped.
+ * Move the vector of objects back to the slab pages they came from
  */
-static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
+void drain_objects(struct kmem_cache *s, void **object, int nr)
 {
-	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
+	int i;
 
-	__ClearPageSlubFrozen(page);
-	if (page->inuse) {
+	for (i = 0 ; i < nr; ) {
 
-		if (page->freelist) {
-			add_partial(n, page, tail);
-			stat(s, tail ? DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD);
-		} else {
-			stat(s, DEACTIVATE_FULL);
-			if (kmem_cache_debug(s) && (s->flags & SLAB_STORE_USER))
-				add_full(n, page);
+		void *p = object[i];
+		struct page *page = virt_to_head_page(p);
+		void *addr = page_address(page);
+		unsigned long size = PAGE_SIZE << compound_order(page);
+		int was_fully_allocated;
+		unsigned long *m;
+		unsigned long offset;
+
+		if (kmem_cache_debug(s) && !PageSlab(page)) {
+			object_err(s, page, object[i], "Object from non-slab page");
+			i++;
+			continue;
 		}
-		slab_unlock(page);
-	} else {
-		stat(s, DEACTIVATE_EMPTY);
-		if (n->nr_partial < s->min_partial) {
+
+		slab_lock(page);
+		m = map(page);
+		was_fully_allocated = bitmap_empty(m, page->objects);
+
+		offset = p - addr;
+
+
+		while (i < nr) {
+
+			int bit;
+			unsigned long new_offset;
+
+			if (offset >= size)
+				break;
+
+			if (kmem_cache_debug(s) && offset % s->size) {
+				object_err(s, page, object[i], "Misaligned object");
+				i++;
+				new_offset = object[i] - addr;
+				continue;
+			}
+
+			bit = offset / s->size;
+
 			/*
-			 * Adding an empty slab to the partial slabs in order
-			 * to avoid page allocator overhead. This slab needs
-			 * to come after the other slabs with objects in
-			 * so that the others get filled first. That way the
-			 * size of the partial list stays small.
-			 *
-			 * kmem_cache_shrink can reclaim any empty slabs from
-			 * the partial list.
-			 */
-			add_partial(n, page, 1);
-			slab_unlock(page);
-		} else {
-			stat(s, FREE_SLAB);
-			discard_slab_unlock(s, page);
+			 * Fast loop to fold a sequence of objects into the slab
+			 * avoiding division and virt_to_head_page()
+ 			 */
+			do {
+
+				if (kmem_cache_debug(s)) {
+					if (unlikely(__test_and_set_bit(bit, m)))
+						object_err(s, page, object[i], "Double free");
+				} else
+					__set_bit(bit, m);
+
+				i++;
+				bit++;
+				offset += s->size;
+				new_offset = object[i] - addr;
+
+			} while (new_offset ==  offset && i < nr && new_offset < size);
+
+			offset = new_offset;
 		}
-	}
-}
+		if (bitmap_full(m, page->objects)) {
 
-/*
- * Remove the cpu slab
- */
-static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
-{
-	struct page *page = c->page;
-	int tail = 1;
+			/* All objects are available now */
+			if (!was_fully_allocated)
 
-	if (page->freelist)
-		stat(s, DEACTIVATE_REMOTE_FREES);
-	/*
-	 * Merge cpu freelist into slab freelist. Typically we get here
-	 * because both freelists are empty. So this is unlikely
-	 * to occur.
-	 */
-	while (unlikely(c->freelist)) {
-		void **object;
+				remove_partial(s, page);
+			else
+				remove_full(s, page);
+
+			discard_slab_unlock(s, page);
 
-		tail = 0;	/* Hot objects. Put the slab first */
+  		} else {
 
-		/* Retrieve object from cpu_freelist */
-		object = c->freelist;
-		c->freelist = get_freepointer(s, c->freelist);
+			/* Some objects are available now */
+			if (was_fully_allocated) {
 
-		/* And put onto the regular freelist */
-		set_freepointer(s, object, page->freelist);
-		page->freelist = object;
-		page->inuse--;
+				/* Slab had no free objects but has them now */
+				remove_full(s, page);
+				add_partial(get_node(s, page_to_nid(page)), page, 1);
+				stat(s, FREE_REMOVE_PARTIAL);
+			}
+			slab_unlock(page);
+		}
 	}
-	c->page = NULL;
-	unfreeze_slab(s, page, tail);
 }
 
-static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
+/*
+ * Drain all objects from a per cpu queue
+ */
+static void flush_cpu_objects(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
-	stat(s, CPUSLAB_FLUSH);
-	slab_lock(c->page);
-	deactivate_slab(s, c);
+	drain_objects(s, c->object, c->objects);
+	c->objects = 0;
+ 	stat(s, CPUSLAB_FLUSH);
 }
 
 /*
- * Flush cpu slab.
+ * Flush cpu objects.
  *
  * Called from IPI handler with interrupts disabled.
  */
-static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
+static void __flush_cpu_objects(void *d)
 {
-	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+	struct kmem_cache *s = d;
+	struct kmem_cache_cpu *c = __this_cpu_ptr(s->cpu_slab);
 
-	if (likely(c && c->page))
-		flush_slab(s, c);
+	if (c->objects)
+		flush_cpu_objects(s, c);
 }
 
-static void flush_cpu_slab(void *d)
+static void flush_all(struct kmem_cache *s)
 {
-	struct kmem_cache *s = d;
-
-	__flush_cpu_slab(s, smp_processor_id());
+	on_each_cpu(__flush_cpu_objects, s, 1);
 }
 
-static void flush_all(struct kmem_cache *s)
+struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s, int n)
 {
-	on_each_cpu(flush_cpu_slab, s, 1);
+	return __alloc_percpu(sizeof(struct kmem_cache_cpu),
+		__alignof__(struct kmem_cache_cpu));
 }
 
 /*
@@ -1530,7 +1551,7 @@ static inline int node_match(struct kmem
 
 static int count_free(struct page *page)
 {
-	return page->objects - page->inuse;
+	return available(page);
 }
 
 static unsigned long count_partial(struct kmem_cache_node *n,
@@ -1592,144 +1613,127 @@ slab_out_of_memory(struct kmem_cache *s,
 }
 
 /*
- * Slow path. The lockless freelist is empty or we need to perform
- * debugging duties.
- *
- * Interrupts are disabled.
- *
- * Processing is still very fast if new objects have been freed to the
- * regular freelist. In that case we simply take over the regular freelist
- * as the lockless freelist and zap the regular freelist.
- *
- * If that is not working then we fall back to the partial lists. We take the
- * first element of the freelist as the object to allocate now and move the
- * rest of the freelist to the lockless freelist.
- *
- * And if we were unable to get a new slab from the partial slab lists then
- * we need to allocate a new slab. This is the slowest path since it involves
- * a call to the page allocator and the setup of a new slab.
+ * Retrieve pointers to nr objects from a slab into the object array.
+ * Slab must be locked.
  */
-static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
-			  unsigned long addr, struct kmem_cache_cpu *c)
+void retrieve_objects(struct kmem_cache *s, struct page *page, void **object, int nr)
 {
-	void **object;
-	struct page *new;
-
-	/* We handle __GFP_ZERO in the caller */
-	gfpflags &= ~__GFP_ZERO;
+	void *addr = page_address(page);
+	unsigned long *m = map(page);
 
-	if (!c->page)
-		goto new_slab;
+	while (nr > 0) {
+		int i = find_first_bit(m, page->objects);
+		void *a;
 
-	slab_lock(c->page);
-	if (unlikely(!node_match(c, node)))
-		goto another_slab;
-
-	stat(s, ALLOC_REFILL);
-
-load_freelist:
-	object = c->page->freelist;
-	if (unlikely(!object))
-		goto another_slab;
-	if (kmem_cache_debug(s))
-		goto debug;
-
-	c->freelist = get_freepointer(s, object);
-	c->page->inuse = c->page->objects;
-	c->page->freelist = NULL;
-	c->node = page_to_nid(c->page);
-unlock_out:
-	slab_unlock(c->page);
-	stat(s, ALLOC_SLOWPATH);
-	return object;
 
-another_slab:
-	deactivate_slab(s, c);
+		__clear_bit(i, m);
+		a = addr + i * s->size;
 
-new_slab:
-	new = get_partial(s, gfpflags, node);
-	if (new) {
-		c->page = new;
-		stat(s, ALLOC_FROM_PARTIAL);
-		goto load_freelist;
-	}
-
-	if (gfpflags & __GFP_WAIT)
-		local_irq_enable();
-
-	new = new_slab(s, gfpflags, node);
-
-	if (gfpflags & __GFP_WAIT)
-		local_irq_disable();
-
-	if (new) {
-		c = __this_cpu_ptr(s->cpu_slab);
-		stat(s, ALLOC_SLAB);
-		if (c->page)
-			flush_slab(s, c);
-		slab_lock(new);
-		__SetPageSlubFrozen(new);
-		c->page = new;
-		goto load_freelist;
+		/*
+		 * Fast loop to get a sequence of objects out of the slab
+		 * without find_first_bit() and multiplication
+		 */
+		do {
+			nr--;
+			object[nr] = a;
+			a += s->size;
+			i++;
+		} while (nr > 0 && i < page->objects && __test_and_clear_bit(i, m));
 	}
-	if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
-		slab_out_of_memory(s, gfpflags, node);
-	return NULL;
-debug:
-	if (!alloc_debug_processing(s, c->page, object, addr))
-		goto another_slab;
-
-	c->page->inuse++;
-	c->page->freelist = get_freepointer(s, object);
-	c->node = -1;
-	goto unlock_out;
 }
 
-/*
- * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
- * have the fastpath folded into their functions. So no function call
- * overhead for requests that can be satisfied on the fastpath.
- *
- * The fastpath works by first checking if the lockless freelist can be used.
- * If not then __slab_alloc is called for slow processing.
- *
- * Otherwise we can simply pick the next object from the lockless free list.
- */
-static __always_inline void *slab_alloc(struct kmem_cache *s,
+static void *slab_alloc(struct kmem_cache *s,
 		gfp_t gfpflags, int node, unsigned long addr)
 {
 	void **object;
 	struct kmem_cache_cpu *c;
 	unsigned long flags;
 
-	gfpflags &= gfp_allowed_mask;
-
 	lockdep_trace_alloc(gfpflags);
 	might_sleep_if(gfpflags & __GFP_WAIT);
 
 	if (should_failslab(s->objsize, gfpflags, s->flags))
 		return NULL;
 
+redo:
 	local_irq_save(flags);
 	c = __this_cpu_ptr(s->cpu_slab);
-	object = c->freelist;
-	if (unlikely(!object || !node_match(c, node)))
+	if (unlikely(!c->objects || !node_match(c, node))) {
 
-		object = __slab_alloc(s, gfpflags, node, addr, c);
+		gfpflags &= gfp_allowed_mask;
 
-	else {
-		c->freelist = get_freepointer(s, object);
+		if (unlikely(!node_match(c, node))) {
+			flush_cpu_objects(s, c);
+			c->node = node;
+		}
+
+		while (c->objects < BATCH_SIZE) {
+			struct page *new;
+			int d;
+
+			new = get_partial(s, gfpflags & ~__GFP_ZERO, node);
+			if (unlikely(!new)) {
+
+				if (gfpflags & __GFP_WAIT)
+					local_irq_enable();
+
+				new = new_slab(s, gfpflags, node);
+
+				if (gfpflags & __GFP_WAIT)
+					local_irq_disable();
+
+				/* process may have moved to different cpu */
+				c = __this_cpu_ptr(s->cpu_slab);
+
+ 				if (!new) {
+					if (!c->objects)
+						goto oom;
+					break;
+				}
+				stat(s, ALLOC_SLAB);
+				slab_lock(new);
+			} else
+				stat(s, ALLOC_FROM_PARTIAL);
+
+			d = min(BATCH_SIZE - c->objects, available(new));
+			retrieve_objects(s, new, c->object + c->objects, d);
+			c->objects += d;
+
+			if (!all_objects_used(new))
+
+				add_partial(get_node(s, page_to_nid(new)), new, 1);
+
+			else
+				add_full(s, get_node(s, page_to_nid(new)), new);
+
+			slab_unlock(new);
+		}
+		stat(s, ALLOC_SLOWPATH);
+
+	} else
 		stat(s, ALLOC_FASTPATH);
+
+	object = c->object[--c->objects];
+
+	if (kmem_cache_debug(s)) {
+		if (!alloc_debug_processing(s, object, addr))
+			goto redo;
 	}
 	local_irq_restore(flags);
 
-	if (unlikely(gfpflags & __GFP_ZERO) && object)
+	if (unlikely(gfpflags & __GFP_ZERO))
 		memset(object, 0, s->objsize);
 
 	kmemcheck_slab_alloc(s, gfpflags, object, s->objsize);
 	kmemleak_alloc_recursive(object, s->objsize, 1, s->flags, gfpflags);
 
 	return object;
+
+oom:
+	local_irq_restore(flags);
+	if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
+		slab_out_of_memory(s, gfpflags, node);
+	return NULL;
 }
 
 void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
@@ -1773,113 +1777,52 @@ void *kmem_cache_alloc_node_notrace(stru
 EXPORT_SYMBOL(kmem_cache_alloc_node_notrace);
 #endif
 
-/*
- * Slow patch handling. This may still be called frequently since objects
- * have a longer lifetime than the cpu slabs in most processing loads.
- *
- * So we still attempt to reduce cache line usage. Just take the slab
- * lock and free the item. If there is no additional partial page
- * handling required then we can return immediately.
- */
-static void __slab_free(struct kmem_cache *s, struct page *page,
+static void slab_free(struct kmem_cache *s,
 			void *x, unsigned long addr)
 {
-	void *prior;
-	void **object = (void *)x;
-
-	stat(s, FREE_SLOWPATH);
-	slab_lock(page);
-
-	if (kmem_cache_debug(s))
-		goto debug;
-
-checks_ok:
-	prior = page->freelist;
-	set_freepointer(s, object, prior);
-	page->freelist = object;
-	page->inuse--;
-
-	if (unlikely(PageSlubFrozen(page))) {
-		stat(s, FREE_FROZEN);
-		goto out_unlock;
-	}
-
-	if (unlikely(!page->inuse))
-		goto slab_empty;
-
-	/*
-	 * Objects left in the slab. If it was not on the partial list before
-	 * then add it.
-	 */
-	if (unlikely(!prior)) {
-		add_partial(get_node(s, page_to_nid(page)), page, 1);
-		stat(s, FREE_ADD_PARTIAL);
-	}
-
-out_unlock:
-	slab_unlock(page);
-	return;
-
-slab_empty:
-	if (prior) {
-		/*
-		 * Slab still on the partial list.
-		 */
-		remove_partial(s, page);
-		stat(s, FREE_REMOVE_PARTIAL);
-	}
-	stat(s, FREE_SLAB);
-	discard_slab_unlock(s, page);
-	return;
-
-debug:
-	if (!free_debug_processing(s, page, x, addr))
-		goto out_unlock;
-	goto checks_ok;
-}
-
-/*
- * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
- * can perform fastpath freeing without additional function calls.
- *
- * The fastpath is only possible if we are freeing to the current cpu slab
- * of this processor. This typically the case if we have just allocated
- * the item before.
- *
- * If fastpath is not possible then fall back to __slab_free where we deal
- * with all sorts of special processing.
- */
-static __always_inline void slab_free(struct kmem_cache *s,
-			struct page *page, void *x, unsigned long addr)
-{
 	void **object = (void *)x;
 	struct kmem_cache_cpu *c;
 	unsigned long flags;
 
 	kmemleak_free_recursive(x, s->flags);
+
 	local_irq_save(flags);
 	c = __this_cpu_ptr(s->cpu_slab);
+
 	kmemcheck_slab_free(s, object, s->objsize);
 	debug_check_no_locks_freed(object, s->objsize);
+
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
 		debug_check_no_obj_freed(object, s->objsize);
-	if (likely(page == c->page && c->node >= 0)) {
-		set_freepointer(s, object, c->freelist);
-		c->freelist = object;
-		stat(s, FREE_FASTPATH);
+
+	if (unlikely(c->objects >= QUEUE_SIZE)) {
+
+		int t = min(BATCH_SIZE, c->objects);
+
+		drain_objects(s, c->object, t);
+
+		c->objects -= t;
+		if (c->objects)
+			memcpy(c->object, c->object + t,
+					c->objects * sizeof(void *));
+
+		stat(s, FREE_SLOWPATH);
 	} else
-		__slab_free(s, page, x, addr);
+		stat(s, FREE_FASTPATH);
 
+	if (kmem_cache_debug(s)
+			&& !free_debug_processing(s, x, addr))
+		goto out;
+
+	c->object[c->objects++] = object;
+
+out:
 	local_irq_restore(flags);
 }
 
 void kmem_cache_free(struct kmem_cache *s, void *x)
 {
-	struct page *page;
-
-	page = virt_to_head_page(x);
-
-	slab_free(s, page, x, _RET_IP_);
+	slab_free(s, x, _RET_IP_);
 
 	trace_kmem_cache_free(_RET_IP_, x);
 }
@@ -1897,11 +1840,6 @@ static struct page *get_object_page(cons
 }
 
 /*
- * Object placement in a slab is made very easy because we always start at
- * offset 0. If we tune the size of the object to the alignment then we can
- * get the required alignment by putting one properly sized object after
- * another.
- *
  * Notice that the allocation order determines the sizes of the per cpu
  * caches. Each processor has always one slab available for allocations.
  * Increasing the allocation order reduces the number of times that slabs
@@ -1996,7 +1934,7 @@ static inline int calculate_order(int si
 	 */
 	min_objects = slub_min_objects;
 	if (!min_objects)
-		min_objects = 4 * (fls(nr_cpu_ids) + 1);
+		min_objects = min(BITS_PER_LONG, 4 * (fls(nr_cpu_ids) + 1));
 	max_objects = (PAGE_SIZE << slub_max_order)/size;
 	min_objects = min(min_objects, max_objects);
 
@@ -2108,10 +2046,7 @@ static void early_kmem_cache_node_alloc(
 				"in order to be able to continue\n");
 	}
 
-	n = page->freelist;
-	BUG_ON(!n);
-	page->freelist = get_freepointer(kmem_cache_node, n);
-	page->inuse++;
+	retrieve_objects(kmem_cache_node, page, (void **)&n, 1);
 	kmem_cache_node->node[node] = n;
 #ifdef CONFIG_SLUB_DEBUG
 	init_object(kmem_cache_node, n, 1);
@@ -2196,10 +2131,11 @@ static void set_min_partial(struct kmem_
 static int calculate_sizes(struct kmem_cache *s, int forced_order)
 {
 	unsigned long flags = s->flags;
-	unsigned long size = s->objsize;
+	unsigned long size;
 	unsigned long align = s->align;
 	int order;
 
+	size = s->objsize;
 	/*
 	 * Round up object size to the next word boundary. We can only
 	 * place the free pointer at word boundaries and this determines
@@ -2231,24 +2167,10 @@ static int calculate_sizes(struct kmem_c
 
 	/*
 	 * With that we have determined the number of bytes in actual use
-	 * by the object. This is the potential offset to the free pointer.
+	 * by the object.
 	 */
 	s->inuse = size;
 
-	if (((flags & (SLAB_DESTROY_BY_RCU | SLAB_POISON)) ||
-		s->ctor)) {
-		/*
-		 * Relocate free pointer after the object if it is not
-		 * permitted to overwrite the first word of the object on
-		 * kmem_cache_free.
-		 *
-		 * This is the case if we do RCU, have a constructor or
-		 * destructor or are poisoning the objects.
-		 */
-		s->offset = size;
-		size += sizeof(void *);
-	}
-
 #ifdef CONFIG_SLUB_DEBUG
 	if (flags & SLAB_STORE_USER)
 		/*
@@ -2334,7 +2256,6 @@ static int kmem_cache_open(struct kmem_c
 		 */
 		if (get_order(s->size) > get_order(s->objsize)) {
 			s->flags &= ~DEBUG_METADATA_FLAGS;
-			s->offset = 0;
 			if (!calculate_sizes(s, -1))
 				goto error;
 		}
@@ -2359,9 +2280,9 @@ static int kmem_cache_open(struct kmem_c
 error:
 	if (flags & SLAB_PANIC)
 		panic("Cannot create slab %s size=%lu realsize=%u "
-			"order=%u offset=%u flags=%lx\n",
+			"order=%u flags=%lx\n",
 			s->name, (unsigned long)size, s->size, oo_order(s->oo),
-			s->offset, flags);
+			flags);
 	return 0;
 }
 
@@ -2415,19 +2336,14 @@ static void list_slab_objects(struct kme
 #ifdef CONFIG_SLUB_DEBUG
 	void *addr = page_address(page);
 	void *p;
-	long *map = kzalloc(BITS_TO_LONGS(page->objects) * sizeof(long),
-			    GFP_ATOMIC);
+	long *m = map(page);
 
-	if (!map)
-		return;
 	slab_err(s, page, "%s", text);
 	slab_lock(page);
-	for_each_free_object(p, s, page->freelist)
-		set_bit(slab_index(p, s, addr), map);
 
 	for_each_object(p, s, addr, page->objects) {
 
-		if (!test_bit(slab_index(p, s, addr), map)) {
+		if (!test_bit(slab_index(p, s, addr), m)) {
 			printk(KERN_ERR "INFO: Object 0x%p @offset=%tu\n",
 							p, p - addr);
 			print_tracking(s, p);
@@ -2448,7 +2364,7 @@ static void free_partial(struct kmem_cac
 
 	spin_lock_irqsave(&n->list_lock, flags);
 	list_for_each_entry_safe(page, h, &n->partial, lru) {
-		if (!page->inuse) {
+		if (all_objects_available(page)) {
 			list_del(&page->lru);
 			discard_slab(s, page);
 			n->nr_partial--;
@@ -2759,7 +2675,7 @@ void kfree(const void *x)
 		put_page(page);
 		return;
 	}
-	slab_free(page->slab, page, object, _RET_IP_);
+	slab_free(page->slab, object, _RET_IP_);
 }
 EXPORT_SYMBOL(kfree);
 
@@ -2807,7 +2723,7 @@ int kmem_cache_shrink(struct kmem_cache 
 		 * list_lock. page->inuse here is the upper limit.
 		 */
 		list_for_each_entry_safe(page, t, &n->partial, lru) {
-			if (!page->inuse && slab_trylock(page)) {
+			if (all_objects_available(page) && slab_trylock(page)) {
 				/*
 				 * Must hold slab lock here because slab_free
 				 * may have freed the last object and be
@@ -2818,7 +2734,7 @@ int kmem_cache_shrink(struct kmem_cache 
 				discard_slab_unlock(s, page);
 			} else {
 				list_move(&page->lru,
-				slabs_by_inuse + page->inuse);
+				slabs_by_inuse + inuse(page));
 			}
 		}
 
@@ -3299,7 +3215,7 @@ static int __cpuinit slab_cpuup_callback
 		down_read(&slub_lock);
 		list_for_each_entry(s, &slab_caches, list) {
 			local_irq_save(flags);
-			__flush_cpu_slab(s, cpu);
+			flush_cpu_objects(s, per_cpu_ptr(s->cpu_slab ,cpu));
 			local_irq_restore(flags);
 		}
 		up_read(&slub_lock);
@@ -3369,7 +3285,7 @@ void *__kmalloc_node_track_caller(size_t
 #ifdef CONFIG_SLUB_DEBUG
 static int count_inuse(struct page *page)
 {
-	return page->inuse;
+	return inuse(page);
 }
 
 static int count_total(struct page *page)
@@ -3377,54 +3293,52 @@ static int count_total(struct page *page
 	return page->objects;
 }
 
-static int validate_slab(struct kmem_cache *s, struct page *page,
-						unsigned long *map)
+static int validate_slab(struct kmem_cache *s, struct page *page)
 {
 	void *p;
 	void *addr = page_address(page);
+	unsigned long *m = map(page);
+	unsigned long errors = 0;
 
-	if (!check_slab(s, page) ||
-			!on_freelist(s, page, NULL))
+	if (!check_slab(s, page) || !verify_slab(s, page))
 		return 0;
 
-	/* Now we know that a valid freelist exists */
-	bitmap_zero(map, page->objects);
+	for_each_object(p, s, addr, page->objects) {
+		int bit = slab_index(p, s, addr);
+		int used = !test_bit(bit, m);
 
-	for_each_free_object(p, s, page->freelist) {
-		set_bit(slab_index(p, s, addr), map);
-		if (!check_object(s, page, p, 0))
-			return 0;
+		if (!check_object(s, page, p, used))
+			errors++;
 	}
 
-	for_each_object(p, s, addr, page->objects)
-		if (!test_bit(slab_index(p, s, addr), map))
-			if (!check_object(s, page, p, 1))
-				return 0;
-	return 1;
+	return errors;
 }
 
-static void validate_slab_slab(struct kmem_cache *s, struct page *page,
-						unsigned long *map)
+static unsigned long validate_slab_slab(struct kmem_cache *s, struct page *page)
 {
+	unsigned long errors = 0;
+
 	if (slab_trylock(page)) {
-		validate_slab(s, page, map);
+		errors = validate_slab(s, page);
 		slab_unlock(page);
 	} else
 		printk(KERN_INFO "SLUB %s: Skipped busy slab 0x%p\n",
 			s->name, page);
+	return errors;
 }
 
 static int validate_slab_node(struct kmem_cache *s,
-		struct kmem_cache_node *n, unsigned long *map)
+		struct kmem_cache_node *n)
 {
 	unsigned long count = 0;
 	struct page *page;
 	unsigned long flags;
+	unsigned long errors;
 
 	spin_lock_irqsave(&n->list_lock, flags);
 
 	list_for_each_entry(page, &n->partial, lru) {
-		validate_slab_slab(s, page, map);
+		errors += validate_slab_slab(s, page);
 		count++;
 	}
 	if (count != n->nr_partial)
@@ -3435,7 +3349,7 @@ static int validate_slab_node(struct kme
 		goto out;
 
 	list_for_each_entry(page, &n->full, lru) {
-		validate_slab_slab(s, page, map);
+		validate_slab_slab(s, page);
 		count++;
 	}
 	if (count != atomic_long_read(&n->nr_slabs))
@@ -3445,26 +3359,20 @@ static int validate_slab_node(struct kme
 
 out:
 	spin_unlock_irqrestore(&n->list_lock, flags);
-	return count;
+	return errors;
 }
 
 static long validate_slab_cache(struct kmem_cache *s)
 {
 	int node;
 	unsigned long count = 0;
-	unsigned long *map = kmalloc(BITS_TO_LONGS(oo_objects(s->max)) *
-				sizeof(unsigned long), GFP_KERNEL);
-
-	if (!map)
-		return -ENOMEM;
 
 	flush_all(s);
 	for_each_node_state(node, N_NORMAL_MEMORY) {
 		struct kmem_cache_node *n = get_node(s, node);
 
-		count += validate_slab_node(s, n, map);
+		count += validate_slab_node(s, n);
 	}
-	kfree(map);
 	return count;
 }
 
@@ -3653,18 +3561,14 @@ static int add_location(struct loc_track
 }
 
 static void process_slab(struct loc_track *t, struct kmem_cache *s,
-		struct page *page, enum track_item alloc,
-		long *map)
+		struct page *page, enum track_item alloc)
 {
 	void *addr = page_address(page);
+	unsigned long *m = map(page);
 	void *p;
 
-	bitmap_zero(map, page->objects);
-	for_each_free_object(p, s, page->freelist)
-		set_bit(slab_index(p, s, addr), map);
-
 	for_each_object(p, s, addr, page->objects)
-		if (!test_bit(slab_index(p, s, addr), map))
+		if (!test_bit(slab_index(p, s, addr), m))
 			add_location(t, s, get_track(s, p, alloc));
 }
 
@@ -3675,12 +3579,9 @@ static int list_locations(struct kmem_ca
 	unsigned long i;
 	struct loc_track t = { 0, 0, NULL };
 	int node;
-	unsigned long *map = kmalloc(BITS_TO_LONGS(oo_objects(s->max)) *
-				     sizeof(unsigned long), GFP_KERNEL);
 
-	if (!map || !alloc_loc_track(&t, PAGE_SIZE / sizeof(struct location),
+	if (!alloc_loc_track(&t, PAGE_SIZE / sizeof(struct location),
 				     GFP_TEMPORARY)) {
-		kfree(map);
 		return sprintf(buf, "Out of memory\n");
 	}
 	/* Push back cpu slabs */
@@ -3696,9 +3597,9 @@ static int list_locations(struct kmem_ca
 
 		spin_lock_irqsave(&n->list_lock, flags);
 		list_for_each_entry(page, &n->partial, lru)
-			process_slab(&t, s, page, alloc, map);
+			process_slab(&t, s, page, alloc);
 		list_for_each_entry(page, &n->full, lru)
-			process_slab(&t, s, page, alloc, map);
+			process_slab(&t, s, page, alloc);
 		spin_unlock_irqrestore(&n->list_lock, flags);
 	}
 
@@ -3749,7 +3650,6 @@ static int list_locations(struct kmem_ca
 	}
 
 	free_loc_track(&t);
-	kfree(map);
 	if (!t.count)
 		len += sprintf(buf, "No data\n");
 	return len;
@@ -3792,11 +3692,11 @@ static ssize_t show_slab_objects(struct 
 			if (!c || c->node < 0)
 				continue;
 
-			if (c->page) {
-					if (flags & SO_TOTAL)
-						x = c->page->objects;
+			if (c->objects) {
+				if (flags & SO_TOTAL)
+					x = 0;
 				else if (flags & SO_OBJECTS)
-					x = c->page->inuse;
+					x = c->objects;
 				else
 					x = 1;
 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [S+Q 13/16] SLUB: Resize the new cpu queues
  2010-06-25 21:20 [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench Christoph Lameter
                   ` (11 preceding siblings ...)
  2010-06-25 21:20 ` [S+Q 12/16] SLUB: Add SLAB style per cpu queueing Christoph Lameter
@ 2010-06-25 21:20 ` Christoph Lameter
  2010-06-25 21:20 ` [S+Q 14/16] SLUB: Get rid of useless function count_free() Christoph Lameter
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 72+ messages in thread
From: Christoph Lameter @ 2010-06-25 21:20 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, Nick Piggin, Matt Mackall

[-- Attachment #1: sled_resize --]
[-- Type: text/plain, Size: 14291 bytes --]

Allow resizing of the cpu queue and batch size. This follows the same
basic steps as SLAB.

The statically allocated per cpu areas are removed since the per cpu
allocator is already available when kmem_cache_init is called. We can
dynamically size the per cpu data during bootstrap.
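
The per cpu structure therefore becomes, in sketch form (the fixed
QUEUE_SIZE array from the previous patch turns into a flexible array
member; calloc() stands in for the __alloc_percpu() call done in
alloc_kmem_cache_cpu() below):

#include <stdlib.h>

struct cpu_queue {
	int objects;		/* number of objects currently queued */
	int node;		/* node the queued objects should come from */
	void *object[];		/* flexible array, sized per cache */
};

static struct cpu_queue *alloc_cpu_queue(int queue_size)
{
	return calloc(1, sizeof(struct cpu_queue) +
			queue_size * sizeof(void *));
}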

Careful: This means that the ->cpu pointer becomes volatile. References
to the ->cpu pointer fall into one of three classes (a condensed sketch
of the resize ordering these rules permit follows the list):

A. Occur with interrupts disabled. This guarantees that nothing on the
   processor itself interferes. This only serializes access to a single
   processor specific area.

B. Occur with slub_lock taken for operations on all per cpu areas.
   Taking the slub_lock guarantees that no resizing operation will occur
   while accessing the percpu areas. The data in the percpu areas
   is volatile even with slub_lock since the alloc and free functions
   do not take slub_lock and will operate on fields of kmem_cache_cpu.

C. Are racy: This is true for the statistics. The ->cpu pointer must always
   point to a valid kmem_cache_cpu area.
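
Condensed sketch of the resize ordering these rules permit (the helpers
are placeholders for __alloc_percpu(), on_each_cpu(__flush_cpu_objects,
...) and free_percpu() as used by resize_cpu_queue() below):

#include <stddef.h>

struct cache_sketch {
	void *cpu;	/* per cpu queues, accessed only per rules A-C */
	int queue;	/* queue size limit */
	int batch;	/* batch size */
};

static void *percpu_alloc_queue(int n) { return NULL; }
static void percpu_free(void *p) { }
static void ipi_flush_all(struct cache_sketch *s, void *old) { }
#define barrier() __asm__ __volatile__("" ::: "memory")

static void resize_queue_sketch(struct cache_sketch *s, int queue)
{
	void *old = s->cpu;
	void *new = percpu_alloc_queue(queue);

	if (queue < s->queue) {		/* shrink limits before switching */
		s->queue = queue;
		s->batch = (queue + 1) / 2;
		barrier();
	}

	s->cpu = new;		/* alloc/free now see only the new area */
	ipi_flush_all(s, old);	/* rule A: no cpu is still inside the old one */

	if (queue > s->queue) {	/* grow limits once the larger area is live */
		s->queue = queue;
		s->batch = (queue + 1) / 2;
	}
	percpu_free(old);
}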

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slub_def.h |    9 --
 mm/slub.c                |  199 +++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 178 insertions(+), 30 deletions(-)

Index: linux-2.6.34/include/linux/slub_def.h
===================================================================
--- linux-2.6.34.orig/include/linux/slub_def.h	2010-06-23 10:05:03.000000000 -0500
+++ linux-2.6.34/include/linux/slub_def.h	2010-06-23 10:05:16.000000000 -0500
@@ -34,16 +34,13 @@ enum stat_item {
 	ORDER_FALLBACK,		/* Number of times fallback was necessary */
 	NR_SLUB_STAT_ITEMS };
 
-#define QUEUE_SIZE 50
-#define BATCH_SIZE 25
-
 struct kmem_cache_cpu {
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
 	int objects;		/* Number of objects available */
 	int node;		/* The node of the page (or -1 for debug) */
-	void *object[QUEUE_SIZE];		/* List of objects */
+	void *object[];		/* Dynamic alloc will allow larger sizes  */
 };
 
 struct kmem_cache_node {
@@ -70,12 +67,14 @@ struct kmem_cache_order_objects {
  * Slab cache management.
  */
 struct kmem_cache {
-	struct kmem_cache_cpu *cpu_slab;
+	struct kmem_cache_cpu *cpu;
 	/* Used for retriving partial slabs etc */
 	unsigned long flags;
 	int size;		/* The size of an object including meta data */
 	int objsize;		/* The size of an object without meta data */
 	struct kmem_cache_order_objects oo;
+	int queue;		/* per cpu queue size */
+	int batch;		/* batch size */
 
 	/* Allocation and freeing of slabs */
 	struct kmem_cache_order_objects max;
Index: linux-2.6.34/mm/slub.c
===================================================================
--- linux-2.6.34.orig/mm/slub.c	2010-06-23 10:05:03.000000000 -0500
+++ linux-2.6.34/mm/slub.c	2010-06-23 10:06:13.000000000 -0500
@@ -195,10 +195,19 @@ static inline void sysfs_slab_remove(str
 
 #endif
 
+/*
+ * We allow stat calls while slub_lock is taken or while interrupts
+ * are enabled for simplicity's sake.
+ *
+ * This results in potential inaccuracies. If the platform does not
+ * support per cpu atomic operations vs. interrupts then the counts
+ * may be updated in a racy manner due to slab processing in
+ * interrupts.
+ */
 static inline void stat(struct kmem_cache *s, enum stat_item si)
 {
 #ifdef CONFIG_SLUB_STATS
-	__this_cpu_inc(s->cpu_slab->stat[si]);
+	__this_cpu_inc(s->cpu->stat[si]);
 #endif
 }
 
@@ -1511,6 +1520,11 @@ static void flush_cpu_objects(struct kme
  	stat(s, CPUSLAB_FLUSH);
 }
 
+struct flush_control {
+	struct kmem_cache *s;
+	struct kmem_cache_cpu *c;
+};
+
 /*
  * Flush cpu objects.
  *
@@ -1518,24 +1532,78 @@ static void flush_cpu_objects(struct kme
  */
 static void __flush_cpu_objects(void *d)
 {
-	struct kmem_cache *s = d;
-	struct kmem_cache_cpu *c = __this_cpu_ptr(s->cpu_slab);
+	struct flush_control *f = d;
+	struct kmem_cache_cpu *c = __this_cpu_ptr(f->c);
 
 	if (c->objects)
-		flush_cpu_objects(s, c);
+		flush_cpu_objects(f->s, c);
 }
 
 static void flush_all(struct kmem_cache *s)
 {
-	on_each_cpu(__flush_cpu_objects, s, 1);
+	struct flush_control f = { s, s->cpu};
+
+	on_each_cpu(__flush_cpu_objects, &f, 1);
 }
 
 struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s, int n)
 {
-	return __alloc_percpu(sizeof(struct kmem_cache_cpu),
+	return __alloc_percpu(sizeof(struct kmem_cache_cpu) +
+	                       	sizeof(void *) * n,
 		__alignof__(struct kmem_cache_cpu));
 }
 
+static void resize_cpu_queue(struct kmem_cache *s, int queue)
+{
+	struct kmem_cache_cpu *n = alloc_kmem_cache_cpu(s, queue);
+	struct flush_control f;
+
+	/* Create the new cpu queue and then free the old one */
+	f.s = s;
+	f.c = s->cpu;
+
+	/* We can only shrink the queue here since the new
+	 * queue size may be smaller and there may be concurrent
+	 * slab operations. The update of the queue must be seen
+	 * before the change of the location of the percpu queue.
+	 *
+	 * Note that the queue may contain more objects than the
+	 * queue size after this operation.
+	 */
+	if (queue < s->queue) {
+		s->queue = queue;
+		s->batch = (s->queue + 1) / 2;
+		barrier();
+	}
+
+	/* This is critical since allocation and free runs
+	 * concurrently without taking the slub_lock!
+	 * We point the cpu pointer to a different per cpu
+	 * segment to redirect current processing and then
+	 * flush the cpu objects on the old cpu structure.
+	 *
+	 * The old percpu structure is no longer reachable
+	 * since slab_alloc/free must have terminated in order
+	 * to execute __flush_cpu_objects. Both require
+	 * interrupts to be disabled.
+	 */
+	s->cpu = n;
+	on_each_cpu(__flush_cpu_objects, &f, 1);
+
+	/*
+	 * If the queue needs to be extended then we deferred
+	 * the update until now when the larger sized queue
+	 * has been allocated and is working.
+	 */
+	if (queue > s->queue) {
+		s->queue = queue;
+		s->batch = (s->queue + 1) / 2;
+	}
+
+	if (slab_state > UP)
+		free_percpu(f.c);
+}
+
 /*
  * Check if the objects in a per cpu structure fit numa
  * locality expectations.
@@ -1657,7 +1725,7 @@ static void *slab_alloc(struct kmem_cach
 
 redo:
 	local_irq_save(flags);
-	c = __this_cpu_ptr(s->cpu_slab);
+	c = __this_cpu_ptr(s->cpu);
 	if (unlikely(!c->objects || !node_match(c, node))) {
 
 		gfpflags &= gfp_allowed_mask;
@@ -1667,7 +1735,7 @@ redo:
 			c->node = node;
 		}
 
-		while (c->objects < BATCH_SIZE) {
+		while (c->objects < s->batch) {
 			struct page *new;
 			int d;
 
@@ -1683,7 +1751,7 @@ redo:
 					local_irq_disable();
 
 				/* process may have moved to different cpu */
-				c = __this_cpu_ptr(s->cpu_slab);
+				c = __this_cpu_ptr(s->cpu);
 
  				if (!new) {
 					if (!c->objects)
@@ -1695,7 +1763,7 @@ redo:
 			} else
 				stat(s, ALLOC_FROM_PARTIAL);
 
-			d = min(BATCH_SIZE - c->objects, available(new));
+			d = min(s->batch - c->objects, available(new));
 			retrieve_objects(s, new, c->object + c->objects, d);
 			c->objects += d;
 
@@ -1787,7 +1855,7 @@ static void slab_free(struct kmem_cache 
 	kmemleak_free_recursive(x, s->flags);
 
 	local_irq_save(flags);
-	c = __this_cpu_ptr(s->cpu_slab);
+	c = __this_cpu_ptr(s->cpu);
 
 	kmemcheck_slab_free(s, object, s->objsize);
 	debug_check_no_locks_freed(object, s->objsize);
@@ -1795,9 +1863,9 @@ static void slab_free(struct kmem_cache 
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
 		debug_check_no_obj_freed(object, s->objsize);
 
-	if (unlikely(c->objects >= QUEUE_SIZE)) {
+	if (unlikely(c->objects >= s->queue)) {
 
-		int t = min(BATCH_SIZE, c->objects);
+		int t = min(s->batch, c->objects);
 
 		drain_objects(s, c->object, t);
 
@@ -2011,9 +2079,9 @@ static inline int alloc_kmem_cache_cpus(
 	BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
 			SLUB_PAGE_SHIFT * sizeof(struct kmem_cache));
 
-	s->cpu_slab = alloc_percpu(struct kmem_cache_cpu);
+	s->cpu = alloc_kmem_cache_cpu(s, s->queue);
 
-	return s->cpu_slab != NULL;
+	return s->cpu != NULL;
 }
 
 #ifdef CONFIG_NUMA
@@ -2235,6 +2303,18 @@ static int calculate_sizes(struct kmem_c
 
 }
 
+static int initial_queue_size(int size)
+{
+	if (size > PAGE_SIZE)
+		return 8;
+	else if (size > 1024)
+		return 24;
+	else if (size > 256)
+		return 54;
+	else
+		return 120;
+}
+
 static int kmem_cache_open(struct kmem_cache *s,
 		const char *name, size_t size,
 		size_t align, unsigned long flags,
@@ -2273,6 +2353,9 @@ static int kmem_cache_open(struct kmem_c
 	if (!init_kmem_cache_nodes(s))
 		goto error;
 
+	s->queue = initial_queue_size(s->size);
+	s->batch = (s->queue + 1) / 2;
+
 	if (alloc_kmem_cache_cpus(s))
 		return 1;
 
@@ -2383,8 +2466,9 @@ static inline int kmem_cache_close(struc
 {
 	int node;
 
+	down_read(&slub_lock);
 	flush_all(s);
-	free_percpu(s->cpu_slab);
+	free_percpu(s->cpu);
 	/* Attempt to free all objects */
 	for_each_node_state(node, N_NORMAL_MEMORY) {
 		struct kmem_cache_node *n = get_node(s, node);
@@ -2394,6 +2478,7 @@ static inline int kmem_cache_close(struc
 			return 1;
 	}
 	free_kmem_cache_nodes(s);
+	up_read(&slub_lock);
 	return 0;
 }
 
@@ -3040,6 +3125,7 @@ void __init kmem_cache_init(void)
 		caches++;
 	}
 
+	/* Now the kmalloc array is fully functional (*not* the dma array) */
 	slab_state = UP;
 
 	/* Provide the correct kmalloc names now that the caches are up */
@@ -3056,6 +3142,7 @@ void __init kmem_cache_init(void)
 		caches, cache_line_size(),
 		slub_min_order, slub_max_order, slub_min_objects,
 		nr_cpu_ids, nr_node_ids);
+
 }
 
 void __init kmem_cache_init_late(void)
@@ -3063,6 +3150,7 @@ void __init kmem_cache_init_late(void)
 #ifdef CONFIG_ZONE_DMA
 	int i;
 
+	/* Create the dma kmalloc array and make it operational */
 	for (i = 0; i < SLUB_PAGE_SHIFT; i++) {
 		struct kmem_cache *s = kmalloc_caches[i];
 
@@ -3167,7 +3255,7 @@ struct kmem_cache *kmem_cache_create(con
 		return s;
 	}
 
-	s = kmalloc(kmem_size, GFP_KERNEL);
+	s = kmalloc(kmem_size, irqs_disabled() ? GFP_NOWAIT : GFP_KERNEL);
 	if (s) {
 		if (kmem_cache_open(s, name,
 				size, align, flags, ctor)) {
@@ -3215,7 +3303,7 @@ static int __cpuinit slab_cpuup_callback
 		down_read(&slub_lock);
 		list_for_each_entry(s, &slab_caches, list) {
 			local_irq_save(flags);
-			flush_cpu_objects(s, per_cpu_ptr(s->cpu_slab ,cpu));
+			flush_cpu_objects(s, per_cpu_ptr(s->cpu, cpu));
 			local_irq_restore(flags);
 		}
 		up_read(&slub_lock);
@@ -3681,13 +3769,15 @@ static ssize_t show_slab_objects(struct 
 	nodes = kzalloc(2 * sizeof(unsigned long) * nr_node_ids, GFP_KERNEL);
 	if (!nodes)
 		return -ENOMEM;
+
+	down_read(&slub_lock);
 	per_cpu = nodes + nr_node_ids;
 
 	if (flags & SO_CPU) {
 		int cpu;
 
 		for_each_possible_cpu(cpu) {
-			struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+			struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu, cpu);
 
 			if (!c || c->node < 0)
 				continue;
@@ -3737,6 +3827,8 @@ static ssize_t show_slab_objects(struct 
 			nodes[node] += x;
 		}
 	}
+
+	up_read(&slub_lock);
 	x = sprintf(buf, "%lu", total);
 #ifdef CONFIG_NUMA
 	for_each_node_state(node, N_NORMAL_MEMORY)
@@ -3847,6 +3939,57 @@ static ssize_t min_partial_store(struct 
 }
 SLAB_ATTR(min_partial);
 
+static ssize_t cpu_queue_size_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%u\n", s->queue);
+}
+
+static ssize_t cpu_queue_size_store(struct kmem_cache *s,
+			 const char *buf, size_t length)
+{
+	unsigned long queue;
+	int err;
+
+	err = strict_strtoul(buf, 10, &queue);
+	if (err)
+		return err;
+
+	if (queue > 10000 || queue < 4)
+		return -EINVAL;
+
+	if (s->batch > queue)
+		s->batch = queue;
+
+	down_write(&slub_lock);
+	resize_cpu_queue(s, queue);
+	up_write(&slub_lock);
+	return length;
+}
+SLAB_ATTR(cpu_queue_size);
+
+static ssize_t cpu_batch_size_show(struct kmem_cache *s, char *buf)
+{
+	return sprintf(buf, "%u\n", s->batch);
+}
+
+static ssize_t cpu_batch_size_store(struct kmem_cache *s,
+			 const char *buf, size_t length)
+{
+	unsigned long batch;
+	int err;
+
+	err = strict_strtoul(buf, 10, &batch);
+	if (err)
+		return err;
+
+	if (batch > s->queue || batch < 4)
+		return -EINVAL;
+
+	s->batch = batch;
+	return length;
+}
+SLAB_ATTR(cpu_batch_size);
+
 static ssize_t ctor_show(struct kmem_cache *s, char *buf)
 {
 	if (s->ctor) {
@@ -3876,11 +4019,11 @@ static ssize_t partial_show(struct kmem_
 }
 SLAB_ATTR_RO(partial);
 
-static ssize_t cpu_slabs_show(struct kmem_cache *s, char *buf)
+static ssize_t cpu_show(struct kmem_cache *s, char *buf)
 {
 	return show_slab_objects(s, buf, SO_CPU);
 }
-SLAB_ATTR_RO(cpu_slabs);
+SLAB_ATTR_RO(cpu);
 
 static ssize_t objects_show(struct kmem_cache *s, char *buf)
 {
@@ -4128,12 +4271,14 @@ static int show_stat(struct kmem_cache *
 	if (!data)
 		return -ENOMEM;
 
+	down_read(&slub_lock);
 	for_each_online_cpu(cpu) {
-		unsigned x = per_cpu_ptr(s->cpu_slab, cpu)->stat[si];
+		unsigned x = per_cpu_ptr(s->cpu, cpu)->stat[si];
 
 		data[cpu] = x;
 		sum += x;
 	}
+	up_read(&slub_lock);
 
 	len = sprintf(buf, "%lu", sum);
 
@@ -4151,8 +4296,10 @@ static void clear_stat(struct kmem_cache
 {
 	int cpu;
 
+	down_write(&slub_lock);
 	for_each_online_cpu(cpu)
-		per_cpu_ptr(s->cpu_slab, cpu)->stat[si] = 0;
+		per_cpu_ptr(s->cpu, cpu)->stat[si] = 0;
+	up_write(&slub_lock);
 }
 
 #define STAT_ATTR(si, text) 					\
@@ -4196,12 +4343,14 @@ static struct attribute *slab_attrs[] = 
 	&objs_per_slab_attr.attr,
 	&order_attr.attr,
 	&min_partial_attr.attr,
+	&cpu_queue_size_attr.attr,
+	&cpu_batch_size_attr.attr,
 	&objects_attr.attr,
 	&objects_partial_attr.attr,
 	&total_objects_attr.attr,
 	&slabs_attr.attr,
 	&partial_attr.attr,
-	&cpu_slabs_attr.attr,
+	&cpu_attr.attr,
 	&ctor_attr.attr,
 	&aliases_attr.attr,
 	&align_attr.attr,
@@ -4553,7 +4702,7 @@ static int s_show(struct seq_file *m, vo
 	seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", s->name, nr_inuse,
 		   nr_objs, s->size, oo_objects(s->oo),
 		   (1 << oo_order(s->oo)));
-	seq_printf(m, " : tunables %4u %4u %4u", 0, 0, 0);
+	seq_printf(m, " : tunables %4u %4u %4u", s->queue, s->batch, 0);
 	seq_printf(m, " : slabdata %6lu %6lu %6lu", nr_slabs, nr_slabs,
 		   0UL);
 	seq_putc(m, '\n');


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [S+Q 14/16] SLUB: Get rid of useless function count_free()
  2010-06-25 21:20 [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench Christoph Lameter
                   ` (12 preceding siblings ...)
  2010-06-25 21:20 ` [S+Q 13/16] SLUB: Resize the new cpu queues Christoph Lameter
@ 2010-06-25 21:20 ` Christoph Lameter
  2010-06-25 21:20 ` [S+Q 15/16] SLUB: Remove MAX_OBJS limitation Christoph Lameter
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 72+ messages in thread
From: Christoph Lameter @ 2010-06-25 21:20 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, Nick Piggin, Matt Mackall

[-- Attachment #1: sled_drop_count_free --]
[-- Type: text/plain, Size: 1758 bytes --]

count_free() == available()

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |   11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

Index: linux-2.6.34/mm/slub.c
===================================================================
--- linux-2.6.34.orig/mm/slub.c	2010-06-23 10:24:15.000000000 -0500
+++ linux-2.6.34/mm/slub.c	2010-06-23 10:24:16.000000000 -0500
@@ -1617,11 +1617,6 @@ static inline int node_match(struct kmem
 	return 1;
 }
 
-static int count_free(struct page *page)
-{
-	return available(page);
-}
-
 static unsigned long count_partial(struct kmem_cache_node *n,
 					int (*get_count)(struct page *))
 {
@@ -1670,7 +1665,7 @@ slab_out_of_memory(struct kmem_cache *s,
 		if (!n)
 			continue;
 
-		nr_free  = count_partial(n, count_free);
+		nr_free  = count_partial(n, available);
 		nr_slabs = node_nr_slabs(n);
 		nr_objs  = node_nr_objs(n);
 
@@ -3805,7 +3800,7 @@ static ssize_t show_slab_objects(struct 
 			x = atomic_long_read(&n->total_objects);
 		else if (flags & SO_OBJECTS)
 			x = atomic_long_read(&n->total_objects) -
-				count_partial(n, count_free);
+				count_partial(n, available);
 
 			else
 				x = atomic_long_read(&n->nr_slabs);
@@ -4694,7 +4689,7 @@ static int s_show(struct seq_file *m, vo
 		nr_partials += n->nr_partial;
 		nr_slabs += atomic_long_read(&n->nr_slabs);
 		nr_objs += atomic_long_read(&n->total_objects);
-		nr_free += count_partial(n, count_free);
+		nr_free += count_partial(n, available);
 	}
 
 	nr_inuse = nr_objs - nr_free;


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [S+Q 15/16] SLUB: Remove MAX_OBJS limitation
  2010-06-25 21:20 [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench Christoph Lameter
                   ` (13 preceding siblings ...)
  2010-06-25 21:20 ` [S+Q 14/16] SLUB: Get rid of useless function count_free() Christoph Lameter
@ 2010-06-25 21:20 ` Christoph Lameter
  2010-06-25 21:20 ` [S+Q 16/16] slub: Drop allocator announcement Christoph Lameter
  2010-06-26  2:24 ` [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench Nick Piggin
  16 siblings, 0 replies; 72+ messages in thread
From: Christoph Lameter @ 2010-06-25 21:20 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, Nick Piggin, Matt Mackall

[-- Attachment #1: sled_unlimited_objects --]
[-- Type: text/plain, Size: 2285 bytes --]

There is no need anymore for the "inuse" field in the page struct.
Extend the objects field to 32 bit allowing a practically unlimited
number of objects.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/mm_types.h |    5 +----
 mm/slub.c                |    7 -------
 2 files changed, 1 insertion(+), 11 deletions(-)

Index: linux-2.6.34/include/linux/mm_types.h
===================================================================
--- linux-2.6.34.orig/include/linux/mm_types.h	2010-06-23 10:22:29.000000000 -0500
+++ linux-2.6.34/include/linux/mm_types.h	2010-06-23 10:24:21.000000000 -0500
@@ -40,10 +40,7 @@ struct page {
 					 * to show when page is mapped
 					 * & limit reverse map searches.
 					 */
-		struct {		/* SLUB */
-			u16 inuse;
-			u16 objects;
-		};
+		u32 objects;		/* SLUB */
 	};
 	union {
 	    struct {
Index: linux-2.6.34/mm/slub.c
===================================================================
--- linux-2.6.34.orig/mm/slub.c	2010-06-23 10:24:16.000000000 -0500
+++ linux-2.6.34/mm/slub.c	2010-06-23 10:24:21.000000000 -0500
@@ -144,7 +144,6 @@ static inline int kmem_cache_debug(struc
 
 #define OO_SHIFT	16
 #define OO_MASK		((1 << OO_SHIFT) - 1)
-#define MAX_OBJS_PER_PAGE	65535 /* since page.objects is u16 */
 
 /* Internal SLUB flags */
 #define __OBJECT_POISON		0x80000000UL /* Poison object */
@@ -767,9 +766,6 @@ static int verify_slab(struct kmem_cache
 			max_objects = ((void *)page->freelist - start) / s->size;
 	}
 
-	if (max_objects > MAX_OBJS_PER_PAGE)
-		max_objects = MAX_OBJS_PER_PAGE;
-
 	if (page->objects != max_objects) {
 		slab_err(s, page, "Wrong number of objects. Found %d but "
 			"should be %d", page->objects, max_objects);
@@ -1958,9 +1954,6 @@ static inline int slab_order(int size, i
 	int rem;
 	int min_order = slub_min_order;
 
-	if ((PAGE_SIZE << min_order) / size > MAX_OBJS_PER_PAGE)
-		return get_order(size * MAX_OBJS_PER_PAGE) - 1;
-
 	for (order = max(min_order,
 				fls(min_objects * size - 1) - PAGE_SHIFT);
 			order <= max_order; order++) {


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [S+Q 16/16] slub: Drop allocator announcement
  2010-06-25 21:20 [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench Christoph Lameter
                   ` (14 preceding siblings ...)
  2010-06-25 21:20 ` [S+Q 15/16] SLUB: Remove MAX_OBJS limitation Christoph Lameter
@ 2010-06-25 21:20 ` Christoph Lameter
  2010-06-26  2:24 ` [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench Nick Piggin
  16 siblings, 0 replies; 72+ messages in thread
From: Christoph Lameter @ 2010-06-25 21:20 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, Nick Piggin, Matt Mackall

[-- Attachment #1: mininum_objects --]
[-- Type: text/plain, Size: 1075 bytes --]

People get confused by the allocator announcement printed at boot, and some of
the items listed there no longer have the same relevance in the queued form of SLUB.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |    7 -------
 1 file changed, 7 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-06-25 16:08:05.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-06-25 16:08:28.000000000 -0500
@@ -3124,13 +3124,6 @@ void __init kmem_cache_init(void)
 #ifdef CONFIG_SMP
 	register_cpu_notifier(&slab_notifier);
 #endif
-	printk(KERN_INFO
-		"SLUB: Genslabs=%d, HWalign=%d, Order=%d-%d, MinObjects=%d,"
-		" CPUs=%d, Nodes=%d\n",
-		caches, cache_line_size(),
-		slub_min_order, slub_max_order, slub_min_objects,
-		nr_cpu_ids, nr_node_ids);
-
 }
 
 void __init kmem_cache_init_late(void)


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench
  2010-06-25 21:20 [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench Christoph Lameter
                   ` (15 preceding siblings ...)
  2010-06-25 21:20 ` [S+Q 16/16] slub: Drop allocator announcement Christoph Lameter
@ 2010-06-26  2:24 ` Nick Piggin
  2010-06-28  6:18   ` Pekka Enberg
  16 siblings, 1 reply; 72+ messages in thread
From: Nick Piggin @ 2010-06-26  2:24 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-mm, Matt Mackall

On Fri, Jun 25, 2010 at 04:20:26PM -0500, Christoph Lameter wrote:
> The following patchset cleans some pieces up and then equips SLUB with
> per cpu queues that work similar to SLABs queues. With that approach
> SLUB wins in hackbench:

Hackbench I don't think is that interesting. SLQB was beating SLAB
too.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 12/16] SLUB: Add SLAB style per cpu queueing
  2010-06-25 21:20 ` [S+Q 12/16] SLUB: Add SLAB style per cpu queueing Christoph Lameter
@ 2010-06-26  2:32   ` Nick Piggin
  2010-06-28 10:19     ` Christoph Lameter
  0 siblings, 1 reply; 72+ messages in thread
From: Nick Piggin @ 2010-06-26  2:32 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-mm, Matt Mackall

On Fri, Jun 25, 2010 at 04:20:38PM -0500, Christoph Lameter wrote:
> This patch adds SLAB style cpu queueing and uses a new way for
>  managing objects in the slabs using bitmaps. It uses a percpu queue so that
> free operations can be properly buffered and a bitmap for managing the
> free/allocated state in the slabs. It uses slightly more memory
> (due to the need to place large bitmaps --sized a few words--in some
> slab pages) but in general does compete well in terms of space use.
> The storage format using bitmaps avoids the SLAB management structure that
> SLAB needs for each slab page and therefore the metadata is more compact
> and easily fits into a cacheline.
> 
> The SLAB scheme of not touching the object during management is adopted.
> SLUB can now efficiently free and allocate cache cold objects.

BTW. this was never the problem with SLUB, because SLQB didn't have
the big performance regression on tpcc. SLUB IIRC had to touch more
cachelines per operation.
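
A rough sketch of the scheme described in the quoted text above -- the names
and the queue size here are illustrative only, not the actual structures from
the patch:

	#define SKETCH_QUEUE_SIZE 64	/* stand-in for the per cache s->queue */

	struct sketch_cpu_queue {
		int objects;				/* pointers currently buffered */
		void *object[SKETCH_QUEUE_SIZE];	/* cache cold objects, never dereferenced */
	};

	/*
	 * Freeing only stores the pointer; the object itself is not touched,
	 * so a cache cold object stays cold.  Returns 0 when the queue is
	 * full and the caller would drain a batch back to the slab pages,
	 * where a per slab bitmap (one bit per object, bit set == free)
	 * records the free/allocated state.
	 */
	static int sketch_free(struct sketch_cpu_queue *q, void *x)
	{
		if (q->objects >= SKETCH_QUEUE_SIZE)
			return 0;
		q->object[q->objects++] = x;
		return 1;
	}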



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 09/16] [percpu] make allocpercpu usable during early boot
  2010-06-25 21:20 ` [S+Q 09/16] [percpu] make allocpercpu usable during early boot Christoph Lameter
@ 2010-06-26  8:10   ` Tejun Heo
  2010-06-26 23:53     ` David Rientjes
  2010-06-29 15:15     ` Christoph Lameter
  2010-06-26 23:38   ` David Rientjes
  2010-06-28 17:03   ` Pekka Enberg
  2 siblings, 2 replies; 72+ messages in thread
From: Tejun Heo @ 2010-06-26  8:10 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-mm, Nick Piggin, Matt Mackall

On 06/25/2010 11:20 PM, Christoph Lameter wrote:
> allocpercpu() may be used during early boot after the page allocator
> has been bootstrapped but when interrupts are still off. Make sure
> that we do not do GFP_KERNEL allocations if this occurs.
> 
> Cc: tj@kernel.org
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

Acked-by: Tejun Heo <tj@kernel.org>

Christoph, how do you wanna route these patches?  I already have the
other two patches in the percpu tree, I can push this there too, which
then you can pull into the allocator tree.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 05/16] SLUB: Constants need UL
  2010-06-25 21:20 ` [S+Q 05/16] SLUB: Constants need UL Christoph Lameter
@ 2010-06-26 23:31   ` David Rientjes
  2010-06-28  2:27   ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 72+ messages in thread
From: David Rientjes @ 2010-06-26 23:31 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-mm, Nick Piggin, Matt Mackall

On Fri, 25 Jun 2010, Christoph Lameter wrote:

> UL suffix is missing in some constants. Conform to how slab.h uses constants.
> 
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

Acked-by: David Rientjes <rientjes@google.com>


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 06/16] slub: Use kmem_cache flags to detect if slab is in debugging mode.
  2010-06-25 21:20 ` [S+Q 06/16] slub: Use kmem_cache flags to detect if slab is in debugging mode Christoph Lameter
@ 2010-06-26 23:31   ` David Rientjes
  0 siblings, 0 replies; 72+ messages in thread
From: David Rientjes @ 2010-06-26 23:31 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-mm, Nick Piggin, Matt Mackall

On Fri, 25 Jun 2010, Christoph Lameter wrote:

> The cacheline with the flags is reachable from the hot paths after the
> percpu allocator changes went in. So there is no need anymore to put a
> flag into each slab page. Get rid of the SlubDebug flag and use
> the flags in kmem_cache instead.
> 
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

Acked-by: David Rientjes <rientjes@google.com>


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 07/16] slub: discard_slab_unlock
  2010-06-25 21:20 ` [S+Q 07/16] slub: discard_slab_unlock Christoph Lameter
@ 2010-06-26 23:34   ` David Rientjes
  2010-07-06 20:44     ` Christoph Lameter
  0 siblings, 1 reply; 72+ messages in thread
From: David Rientjes @ 2010-06-26 23:34 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-mm, Nick Piggin, Matt Mackall

On Fri, 25 Jun 2010, Christoph Lameter wrote:

> The sequence of unlocking a slab and freeing occurs multiple times.
> Put the common sequence into a single function.
> 

Did you want to respond to the comments I made about this patch at 
http://marc.info/?l=linux-mm&m=127689747432061 ?  Specifically, how it 
makes seeing if there are unmatched slab_lock() -> slab_unlock() pairs 
more difficult.

> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 
> ---
>  mm/slub.c |   16 ++++++++++------
>  1 file changed, 10 insertions(+), 6 deletions(-)
> 
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c	2010-06-01 08:58:50.000000000 -0500
> +++ linux-2.6/mm/slub.c	2010-06-01 08:58:54.000000000 -0500
> @@ -1260,6 +1260,13 @@ static __always_inline int slab_trylock(
>  	return rc;
>  }
>  
> +static void discard_slab_unlock(struct kmem_cache *s,
> +	struct page *page)
> +{
> +	slab_unlock(page);
> +	discard_slab(s, page);
> +}
> +
>  /*
>   * Management of partially allocated slabs
>   */
> @@ -1437,9 +1444,8 @@ static void unfreeze_slab(struct kmem_ca
>  			add_partial(n, page, 1);
>  			slab_unlock(page);
>  		} else {
> -			slab_unlock(page);
>  			stat(s, FREE_SLAB);
> -			discard_slab(s, page);
> +			discard_slab_unlock(s, page);
>  		}
>  	}
>  }
> @@ -1822,9 +1828,8 @@ slab_empty:
>  		remove_partial(s, page);
>  		stat(s, FREE_REMOVE_PARTIAL);
>  	}
> -	slab_unlock(page);
>  	stat(s, FREE_SLAB);
> -	discard_slab(s, page);
> +	discard_slab_unlock(s, page);
>  	return;
>  
>  debug:
> @@ -2893,8 +2898,7 @@ int kmem_cache_shrink(struct kmem_cache 
>  				 */
>  				list_del(&page->lru);
>  				n->nr_partial--;
> -				slab_unlock(page);
> -				discard_slab(s, page);
> +				discard_slab_unlock(s, page);
>  			} else {
>  				list_move(&page->lru,
>  				slabs_by_inuse + page->inuse);


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 09/16] [percpu] make allocpercpu usable during early boot
  2010-06-25 21:20 ` [S+Q 09/16] [percpu] make allocpercpu usable during early boot Christoph Lameter
  2010-06-26  8:10   ` Tejun Heo
@ 2010-06-26 23:38   ` David Rientjes
  2010-06-29 15:26     ` Christoph Lameter
  2010-06-28 17:03   ` Pekka Enberg
  2 siblings, 1 reply; 72+ messages in thread
From: David Rientjes @ 2010-06-26 23:38 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-mm, tj, Nick Piggin, Matt Mackall

On Fri, 25 Jun 2010, Christoph Lameter wrote:

> allocpercpu() may be used during early boot after the page allocator
> has been bootstrapped but when interrupts are still off. Make sure
> that we do not do GFP_KERNEL allocations if this occurs.
> 

Why isn't this being handled at a lower level, specifically in the slab 
allocator to prevent GFP_KERNEL from being used when irqs are disabled?  
We'll otherwise need to audit all slab allocations from the boot cpu for 
correctness.
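
If it were pushed down into the allocator itself, the entry points could mask
the flags once -- a sketch only (the queued slab_alloc() earlier in this series
already does such masking in its slow path):

	static void *slab_alloc_sketch(struct kmem_cache *s, gfp_t gfpflags, int node)
	{
		/*
		 * Strip flags that are not allowed yet, e.g. __GFP_WAIT while
		 * irqs are still disabled during early boot, so early callers
		 * could keep passing GFP_KERNEL unchanged.
		 */
		gfpflags &= gfp_allowed_mask;

		/* ... the usual fast and slow paths would follow here ... */
		return NULL;
	}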

> Cc: tj@kernel.org
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 
> ---
>  mm/percpu.c |    5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6/mm/percpu.c
> ===================================================================
> --- linux-2.6.orig/mm/percpu.c	2010-06-23 14:43:54.000000000 -0500
> +++ linux-2.6/mm/percpu.c	2010-06-23 14:44:05.000000000 -0500
> @@ -275,7 +275,8 @@ static void __maybe_unused pcpu_next_pop
>   * memory is always zeroed.
>   *
>   * CONTEXT:
> - * Does GFP_KERNEL allocation.
> + * Does GFP_KERNEL allocation (May be called early in boot when
> + * interrupts are still disabled. Will then do GFP_NOWAIT alloc).
>   *
>   * RETURNS:
>   * Pointer to the allocated area on success, NULL on failure.
> @@ -286,7 +287,7 @@ static void *pcpu_mem_alloc(size_t size)
>  		return NULL;
>  
>  	if (size <= PAGE_SIZE)
> -		return kzalloc(size, GFP_KERNEL);
> +		return kzalloc(size, GFP_KERNEL & gfp_allowed_mask);
>  	else {
>  		void *ptr = vmalloc(size);
>  		if (ptr)


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 08/16] slub: remove dynamic dma slab allocation
  2010-06-25 21:20 ` [S+Q 08/16] slub: remove dynamic dma slab allocation Christoph Lameter
@ 2010-06-26 23:52   ` David Rientjes
  2010-06-29 15:31     ` Christoph Lameter
  2010-06-28  2:33   ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 72+ messages in thread
From: David Rientjes @ 2010-06-26 23:52 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-mm, Nick Piggin, Matt Mackall

On Fri, 25 Jun 2010, Christoph Lameter wrote:

> Remove the dynamic dma slab allocation since this causes too many issues with
> nested locks etc etc. The change avoids passing gfpflags into many functions.
> 
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 
> ---
>  mm/slub.c |  153 ++++++++++++++++----------------------------------------------
>  1 file changed, 41 insertions(+), 112 deletions(-)
> 
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c	2010-06-15 12:40:58.000000000 -0500
> +++ linux-2.6/mm/slub.c	2010-06-15 12:41:36.000000000 -0500
> @@ -2070,7 +2070,7 @@ init_kmem_cache_node(struct kmem_cache_n
>  
>  static DEFINE_PER_CPU(struct kmem_cache_cpu, kmalloc_percpu[KMALLOC_CACHES]);
>  
> -static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
> +static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
>  {
>  	if (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches)
>  		/*
> @@ -2097,7 +2097,7 @@ static inline int alloc_kmem_cache_cpus(
>   * when allocating for the kmalloc_node_cache. This is used for bootstrapping
>   * memory on a fresh node that has no slab structures yet.
>   */
> -static void early_kmem_cache_node_alloc(gfp_t gfpflags, int node)
> +static void early_kmem_cache_node_alloc(int node)
>  {
>  	struct page *page;
>  	struct kmem_cache_node *n;
> @@ -2105,7 +2105,7 @@ static void early_kmem_cache_node_alloc(
>  
>  	BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
>  
> -	page = new_slab(kmalloc_caches, gfpflags, node);
> +	page = new_slab(kmalloc_caches, GFP_KERNEL, node);
>  
>  	BUG_ON(!page);
>  	if (page_to_nid(page) != node) {

This still passes GFP_KERNEL to the page allocator when not allowed by 
gfp_allowed_mask for early (non SLAB_CACHE_DMA) users of 
create_kmalloc_cache().

> @@ -2149,7 +2149,7 @@ static void free_kmem_cache_nodes(struct
>  	}
>  }
>  
> -static int init_kmem_cache_nodes(struct kmem_cache *s, gfp_t gfpflags)
> +static int init_kmem_cache_nodes(struct kmem_cache *s)
>  {
>  	int node;
>  
> @@ -2157,11 +2157,11 @@ static int init_kmem_cache_nodes(struct 
>  		struct kmem_cache_node *n;
>  
>  		if (slab_state == DOWN) {
> -			early_kmem_cache_node_alloc(gfpflags, node);
> +			early_kmem_cache_node_alloc(node);
>  			continue;
>  		}
>  		n = kmem_cache_alloc_node(kmalloc_caches,
> -						gfpflags, node);
> +						GFP_KERNEL, node);
>  
>  		if (!n) {
>  			free_kmem_cache_nodes(s);

slab_state != DOWN is still not an indication that GFP_KERNEL is safe; in 
fact, all users of GFP_KERNEL from kmem_cache_init() are unsafe.  These 
need to be GFP_NOWAIT.
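
For reference, with the usual boot-time restriction in place (the early
gfp_allowed_mask clears the sleeping and I/O bits -- an assumption of this
sketch, not something this patch changes), the masking collapses to exactly
that:

	/*
	 * GFP_KERNEL is __GFP_WAIT | __GFP_IO | __GFP_FS; while the boot
	 * restriction is active those bits are not in gfp_allowed_mask, so:
	 */
	gfp_t flags = GFP_KERNEL & gfp_allowed_mask;	/* == GFP_NOWAIT in early boot */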

> @@ -2178,7 +2178,7 @@ static void free_kmem_cache_nodes(struct
>  {
>  }
>  
> -static int init_kmem_cache_nodes(struct kmem_cache *s, gfp_t gfpflags)
> +static int init_kmem_cache_nodes(struct kmem_cache *s)
>  {
>  	init_kmem_cache_node(&s->local_node, s);
>  	return 1;
> @@ -2318,7 +2318,7 @@ static int calculate_sizes(struct kmem_c
>  
>  }
>  
> -static int kmem_cache_open(struct kmem_cache *s, gfp_t gfpflags,
> +static int kmem_cache_open(struct kmem_cache *s,
>  		const char *name, size_t size,
>  		size_t align, unsigned long flags,
>  		void (*ctor)(void *))
> @@ -2354,10 +2354,10 @@ static int kmem_cache_open(struct kmem_c
>  #ifdef CONFIG_NUMA
>  	s->remote_node_defrag_ratio = 1000;
>  #endif
> -	if (!init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
> +	if (!init_kmem_cache_nodes(s))
>  		goto error;
>  
> -	if (alloc_kmem_cache_cpus(s, gfpflags & ~SLUB_DMA))
> +	if (alloc_kmem_cache_cpus(s))
>  		return 1;
>  
>  	free_kmem_cache_nodes(s);
> @@ -2517,6 +2517,10 @@ EXPORT_SYMBOL(kmem_cache_destroy);
>  struct kmem_cache kmalloc_caches[KMALLOC_CACHES] __cacheline_aligned;
>  EXPORT_SYMBOL(kmalloc_caches);
>  
> +#ifdef CONFIG_ZONE_DMA
> +static struct kmem_cache kmalloc_dma_caches[SLUB_PAGE_SHIFT];
> +#endif
> +
>  static int __init setup_slub_min_order(char *str)
>  {
>  	get_option(&str, &slub_min_order);
> @@ -2553,116 +2557,26 @@ static int __init setup_slub_nomerge(cha
>  
>  __setup("slub_nomerge", setup_slub_nomerge);
>  
> -static struct kmem_cache *create_kmalloc_cache(struct kmem_cache *s,
> -		const char *name, int size, gfp_t gfp_flags)
> +static void create_kmalloc_cache(struct kmem_cache *s,
> +		const char *name, int size, unsigned int flags)
>  {
> -	unsigned int flags = 0;
> -
> -	if (gfp_flags & SLUB_DMA)
> -		flags = SLAB_CACHE_DMA;
> -
>  	/*
>  	 * This function is called with IRQs disabled during early-boot on
>  	 * single CPU so there's no need to take slub_lock here.
>  	 */
> -	if (!kmem_cache_open(s, gfp_flags, name, size, ARCH_KMALLOC_MINALIGN,
> +	if (!kmem_cache_open(s, name, size, ARCH_KMALLOC_MINALIGN,
>  								flags, NULL))
>  		goto panic;
>  
>  	list_add(&s->list, &slab_caches);
>  
> -	if (sysfs_slab_add(s))
> -		goto panic;
> -	return s;
> +	if (!sysfs_slab_add(s))
> +		return;
>  
>  panic:
>  	panic("Creation of kmalloc slab %s size=%d failed.\n", name, size);
>  }
>  
> -#ifdef CONFIG_ZONE_DMA
> -static struct kmem_cache *kmalloc_caches_dma[SLUB_PAGE_SHIFT];
> -
> -static void sysfs_add_func(struct work_struct *w)
> -{
> -	struct kmem_cache *s;
> -
> -	down_write(&slub_lock);
> -	list_for_each_entry(s, &slab_caches, list) {
> -		if (s->flags & __SYSFS_ADD_DEFERRED) {
> -			s->flags &= ~__SYSFS_ADD_DEFERRED;
> -			sysfs_slab_add(s);
> -		}
> -	}
> -	up_write(&slub_lock);
> -}
> -
> -static DECLARE_WORK(sysfs_add_work, sysfs_add_func);
> -
> -static noinline struct kmem_cache *dma_kmalloc_cache(int index, gfp_t flags)
> -{
> -	struct kmem_cache *s;
> -	char *text;
> -	size_t realsize;
> -	unsigned long slabflags;
> -	int i;
> -
> -	s = kmalloc_caches_dma[index];
> -	if (s)
> -		return s;
> -
> -	/* Dynamically create dma cache */
> -	if (flags & __GFP_WAIT)
> -		down_write(&slub_lock);
> -	else {
> -		if (!down_write_trylock(&slub_lock))
> -			goto out;
> -	}
> -
> -	if (kmalloc_caches_dma[index])
> -		goto unlock_out;
> -
> -	realsize = kmalloc_caches[index].objsize;
> -	text = kasprintf(flags & ~SLUB_DMA, "kmalloc_dma-%d",
> -			 (unsigned int)realsize);
> -
> -	s = NULL;
> -	for (i = 0; i < KMALLOC_CACHES; i++)
> -		if (!kmalloc_caches[i].size)
> -			break;
> -
> -	BUG_ON(i >= KMALLOC_CACHES);
> -	s = kmalloc_caches + i;
> -
> -	/*
> -	 * Must defer sysfs creation to a workqueue because we don't know
> -	 * what context we are called from. Before sysfs comes up, we don't
> -	 * need to do anything because our sysfs initcall will start by
> -	 * adding all existing slabs to sysfs.
> -	 */
> -	slabflags = SLAB_CACHE_DMA|SLAB_NOTRACK;
> -	if (slab_state >= SYSFS)
> -		slabflags |= __SYSFS_ADD_DEFERRED;
> -
> -	if (!text || !kmem_cache_open(s, flags, text,
> -			realsize, ARCH_KMALLOC_MINALIGN, slabflags, NULL)) {
> -		s->size = 0;
> -		kfree(text);
> -		goto unlock_out;
> -	}
> -
> -	list_add(&s->list, &slab_caches);
> -	kmalloc_caches_dma[index] = s;
> -
> -	if (slab_state >= SYSFS)
> -		schedule_work(&sysfs_add_work);
> -
> -unlock_out:
> -	up_write(&slub_lock);
> -out:
> -	return kmalloc_caches_dma[index];
> -}
> -#endif
> -
>  /*
>   * Conversion table for small slabs sizes / 8 to the index in the
>   * kmalloc array. This is necessary for slabs < 192 since we have non power
> @@ -2715,7 +2629,7 @@ static struct kmem_cache *get_slab(size_
>  
>  #ifdef CONFIG_ZONE_DMA
>  	if (unlikely((flags & SLUB_DMA)))
> -		return dma_kmalloc_cache(index, flags);
> +		return &kmalloc_dma_caches[index];
>  
>  #endif
>  	return &kmalloc_caches[index];
> @@ -3053,7 +2967,7 @@ void __init kmem_cache_init(void)
>  	 * kmem_cache_open for slab_state == DOWN.
>  	 */
>  	create_kmalloc_cache(&kmalloc_caches[0], "kmem_cache_node",
> -		sizeof(struct kmem_cache_node), GFP_NOWAIT);
> +		sizeof(struct kmem_cache_node), 0);
>  	kmalloc_caches[0].refcount = -1;
>  	caches++;
>  
> @@ -3066,18 +2980,18 @@ void __init kmem_cache_init(void)
>  	/* Caches that are not of the two-to-the-power-of size */
>  	if (KMALLOC_MIN_SIZE <= 32) {
>  		create_kmalloc_cache(&kmalloc_caches[1],
> -				"kmalloc-96", 96, GFP_NOWAIT);
> +				"kmalloc-96", 96, 0);
>  		caches++;
>  	}
>  	if (KMALLOC_MIN_SIZE <= 64) {
>  		create_kmalloc_cache(&kmalloc_caches[2],
> -				"kmalloc-192", 192, GFP_NOWAIT);
> +				"kmalloc-192", 192, 0);
>  		caches++;
>  	}
>  
>  	for (i = KMALLOC_SHIFT_LOW; i < SLUB_PAGE_SHIFT; i++) {
>  		create_kmalloc_cache(&kmalloc_caches[i],
> -			"kmalloc", 1 << i, GFP_NOWAIT);
> +			"kmalloc", 1 << i, 0);
>  		caches++;
>  	}
>  
> @@ -3124,7 +3038,7 @@ void __init kmem_cache_init(void)
>  
>  	/* Provide the correct kmalloc names now that the caches are up */
>  	for (i = KMALLOC_SHIFT_LOW; i < SLUB_PAGE_SHIFT; i++)
> -		kmalloc_caches[i]. name =
> +		kmalloc_caches[i].name =
>  			kasprintf(GFP_NOWAIT, "kmalloc-%d", 1 << i);
>  
>  #ifdef CONFIG_SMP
> @@ -3147,6 +3061,21 @@ void __init kmem_cache_init(void)
>  
>  void __init kmem_cache_init_late(void)
>  {
> +#ifdef CONFIG_ZONE_DMA
> +	int i;
> +
> +	for (i = 0; i < SLUB_PAGE_SHIFT; i++) {
> +		struct kmem_cache *s = &kmalloc_caches[i];
> +
> +		if (s && s->size) {
> +			char *name = kasprintf(GFP_KERNEL,
> +				 "dma-kmalloc-%d", s->objsize);
> +

You're still not handling the !name case; kasprintf() can return NULL both
here and in kmem_cache_init().  Nameless caches aren't allowed with
CONFIG_SLUB_DEBUG.

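A sketch of the kind of check being asked for (illustrative only, not a
tested patch):

	char *name = kasprintf(GFP_KERNEL, "dma-kmalloc-%d", s->objsize);

	if (!name)
		continue;	/* or panic(), as create_kmalloc_cache() already
				   does when cache creation itself fails */
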
> +			create_kmalloc_cache(&kmalloc_dma_caches[i],
> +				name, s->objsize, SLAB_CACHE_DMA);
> +		}
> +	}
> +#endif
>  }
>  
>  /*
> @@ -3241,7 +3170,7 @@ struct kmem_cache *kmem_cache_create(con
>  
>  	s = kmalloc(kmem_size, GFP_KERNEL);
>  	if (s) {
> -		if (kmem_cache_open(s, GFP_KERNEL, name,
> +		if (kmem_cache_open(s, name,
>  				size, align, flags, ctor)) {
>  			list_add(&s->list, &slab_caches);
>  			up_write(&slub_lock);
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 09/16] [percpu] make allocpercpu usable during early boot
  2010-06-26  8:10   ` Tejun Heo
@ 2010-06-26 23:53     ` David Rientjes
  2010-06-29 15:15     ` Christoph Lameter
  1 sibling, 0 replies; 72+ messages in thread
From: David Rientjes @ 2010-06-26 23:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, Pekka Enberg, linux-mm, Nick Piggin, Matt Mackall

On Sat, 26 Jun 2010, Tejun Heo wrote:

> On 06/25/2010 11:20 PM, Christoph Lameter wrote:
> > allocpercpu() may be used during early boot after the page allocator
> > has been bootstrapped but when interrupts are still off. Make sure
> > that we do not do GFP_KERNEL allocations if this occurs.
> > 
> > Cc: tj@kernel.org
> > Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 
> Acked-by: Tejun Heo <tj@kernel.org>
> 
> Christoph, how do you wanna route these patches?  I already have the
> other two patches in the percpu tree, I can push this there too, which
> then you can pull into the allocator tree.
> 

I think that's great for patches 2 and 3 in this series, but this patch is 
only a bandaid for allocations done in early boot whereas the real fix 
should be within a lower layer such as the slab or page allocator since 
the irq context on the boot cpu is not specific only to percpu.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 10/16] slub: Remove static kmem_cache_cpu array for boot
  2010-06-25 21:20 ` [S+Q 10/16] slub: Remove static kmem_cache_cpu array for boot Christoph Lameter
@ 2010-06-27  0:02   ` David Rientjes
  2010-06-29 15:35     ` Christoph Lameter
  0 siblings, 1 reply; 72+ messages in thread
From: David Rientjes @ 2010-06-27  0:02 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, linux-mm, Tejun Heo, Nick Piggin, Matt Mackall

On Fri, 25 Jun 2010, Christoph Lameter wrote:

> The percpu allocator can now handle allocations in early boot.
> So drop the static kmem_cache_cpu array.
> 
> Early memory allocations require the use of GFP_NOWAIT instead of
> GFP_KERNEL. Mask GFP_KERNEL with gfp_allowed_mask to get to GFP_NOWAIT
> in a boot scenario.
> 
> Cc: Tejun Heo <tj@kernel.org>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 
> ---
>  mm/slub.c |   21 ++++++---------------
>  1 file changed, 6 insertions(+), 15 deletions(-)
> 
> Index: linux-2.6.34/mm/slub.c
> ===================================================================
> --- linux-2.6.34.orig/mm/slub.c	2010-06-22 09:50:00.000000000 -0500
> +++ linux-2.6.34/mm/slub.c	2010-06-23 09:59:53.000000000 -0500
> @@ -2068,23 +2068,14 @@ init_kmem_cache_node(struct kmem_cache_n
>  #endif
>  }
>  
> -static DEFINE_PER_CPU(struct kmem_cache_cpu, kmalloc_percpu[KMALLOC_CACHES]);
> -
>  static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
>  {
> -	if (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches)
> -		/*
> -		 * Boot time creation of the kmalloc array. Use static per cpu data
> -		 * since the per cpu allocator is not available yet.
> -		 */
> -		s->cpu_slab = kmalloc_percpu + (s - kmalloc_caches);
> -	else
> -		s->cpu_slab =  alloc_percpu(struct kmem_cache_cpu);
> +	BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
> +			SLUB_PAGE_SHIFT * sizeof(struct kmem_cache));
>  
> -	if (!s->cpu_slab)
> -		return 0;
> +	s->cpu_slab = alloc_percpu(struct kmem_cache_cpu);
>  
> -	return 1;
> +	return s->cpu_slab != NULL;
>  }
>  
>  #ifdef CONFIG_NUMA
> @@ -2105,7 +2096,7 @@ static void early_kmem_cache_node_alloc(
>  
>  	BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
>  
> -	page = new_slab(kmalloc_caches, GFP_KERNEL, node);
> +	page = new_slab(kmalloc_caches, GFP_KERNEL & gfp_allowed_mask, node);
>  
>  	BUG_ON(!page);
>  	if (page_to_nid(page) != node) {

This needs to be merged into the preceding patch, since that patch broke new
slab allocations during early boot while irqs are still disabled; the masking
also deserves a big fat comment explaining why it is required in this
situation.

> @@ -2161,7 +2152,7 @@ static int init_kmem_cache_nodes(struct 
>  			continue;
>  		}
>  		n = kmem_cache_alloc_node(kmalloc_caches,
> -						GFP_KERNEL, node);
> +			GFP_KERNEL & gfp_allowed_mask, node);
>  
>  		if (!n) {
>  			free_kmem_cache_nodes(s);

Likewise.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 02/16] [PATCH 1/2] percpu: make @dyn_size always mean min dyn_size in first chunk init functions
  2010-06-25 21:20 ` [S+Q 02/16] [PATCH 1/2] percpu: make @dyn_size always mean min dyn_size in first chunk init functions Christoph Lameter
@ 2010-06-27  5:06   ` David Rientjes
  2010-06-27  8:21     ` Tejun Heo
  2010-06-29 15:36     ` Christoph Lameter
  0 siblings, 2 replies; 72+ messages in thread
From: David Rientjes @ 2010-06-27  5:06 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, linux-mm, Tejun Heo, Nick Piggin, Matt Mackall

On Fri, 25 Jun 2010, Christoph Lameter wrote:

> In pcpu_alloc_info()

You mean pcpu_build_alloc_info()?

This should have a "From: Tejun Heo <tj@kernel.org>" line, right?

> and pcpu_embed_first_chunk(), @dyn_size was
> ssize_t, -1 meant auto-size, 0 forced 0 and positive meant minimum
> size.  There's no use case for forcing 0 and the upcoming early alloc
> support always requires non-zero dynamic size.  Make @dyn_size always
> mean minimum dyn_size.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 
> Index: linux-2.6/include/linux/percpu.h
> ===================================================================
> --- linux-2.6.orig/include/linux/percpu.h	2010-06-18 12:23:22.000000000 -0500
> +++ linux-2.6/include/linux/percpu.h	2010-06-18 12:24:52.000000000 -0500
> @@ -105,7 +105,7 @@ extern struct pcpu_alloc_info * __init p
>  extern void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai);
>  
>  extern struct pcpu_alloc_info * __init pcpu_build_alloc_info(
> -				size_t reserved_size, ssize_t dyn_size,
> +				size_t reserved_size, size_t dyn_size,
>  				size_t atom_size,
>  				pcpu_fc_cpu_distance_fn_t cpu_distance_fn);
>  

This can just be removed entirely, it's unnecessarily global.

> @@ -113,7 +113,7 @@ extern int __init pcpu_setup_first_chunk
>  					 void *base_addr);
>  
>  #ifdef CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK
> -extern int __init pcpu_embed_first_chunk(size_t reserved_size, ssize_t dyn_size,
> +extern int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
>  				size_t atom_size,
>  				pcpu_fc_cpu_distance_fn_t cpu_distance_fn,
>  				pcpu_fc_alloc_fn_t alloc_fn,
> Index: linux-2.6/mm/percpu.c
> ===================================================================
> --- linux-2.6.orig/mm/percpu.c	2010-06-18 11:20:35.000000000 -0500
> +++ linux-2.6/mm/percpu.c	2010-06-18 12:24:52.000000000 -0500
> @@ -988,20 +988,6 @@ phys_addr_t per_cpu_ptr_to_phys(void *ad
>  		return page_to_phys(pcpu_addr_to_page(addr));
>  }
>  
> -static inline size_t pcpu_calc_fc_sizes(size_t static_size,
> -					size_t reserved_size,
> -					ssize_t *dyn_sizep)
> -{
> -	size_t size_sum;
> -
> -	size_sum = PFN_ALIGN(static_size + reserved_size +
> -			     (*dyn_sizep >= 0 ? *dyn_sizep : 0));
> -	if (*dyn_sizep != 0)
> -		*dyn_sizep = size_sum - static_size - reserved_size;
> -
> -	return size_sum;
> -}
> -
>  /**
>   * pcpu_alloc_alloc_info - allocate percpu allocation info
>   * @nr_groups: the number of groups
> @@ -1060,7 +1046,7 @@ void __init pcpu_free_alloc_info(struct 
>  /**
>   * pcpu_build_alloc_info - build alloc_info considering distances between CPUs
>   * @reserved_size: the size of reserved percpu area in bytes
> - * @dyn_size: free size for dynamic allocation in bytes, -1 for auto
> + * @dyn_size: free size for dynamic allocation in bytes

It's the minimum free size, it's not necessarily the exact size due to 
round-up.

>   * @atom_size: allocation atom size
>   * @cpu_distance_fn: callback to determine distance between cpus, optional
>   *
> @@ -1079,7 +1065,7 @@ void __init pcpu_free_alloc_info(struct 
>   * failure, ERR_PTR value is returned.
>   */
>  struct pcpu_alloc_info * __init pcpu_build_alloc_info(
> -				size_t reserved_size, ssize_t dyn_size,
> +				size_t reserved_size, size_t dyn_size,
>  				size_t atom_size,
>  				pcpu_fc_cpu_distance_fn_t cpu_distance_fn)
>  {
> @@ -1098,13 +1084,15 @@ struct pcpu_alloc_info * __init pcpu_bui
>  	memset(group_map, 0, sizeof(group_map));
>  	memset(group_cnt, 0, sizeof(group_map));
>  
> +	size_sum = PFN_ALIGN(static_size + reserved_size + dyn_size);
> +	dyn_size = size_sum - static_size - reserved_size;

Ok, so the only purpose of "dyn_size" is to be stored in the struct
pcpu_alloc_info later.  Before this patch, ai->dyn_size would always be 0
if that's what was passed to pcpu_build_alloc_info(), but due to this
arithmetic it now requires static_size + reserved_size to be pfn
aligned for that to hold.  Where is that enforced, or do we not care?

> +
>  	/*
>  	 * Determine min_unit_size, alloc_size and max_upa such that
>  	 * alloc_size is multiple of atom_size and is the smallest
>  	 * which can accomodate 4k aligned segments which are equal to
>  	 * or larger than min_unit_size.
>  	 */
> -	size_sum = pcpu_calc_fc_sizes(static_size, reserved_size, &dyn_size);
>  	min_unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE);
>  
>  	alloc_size = roundup(min_unit_size, atom_size);
> @@ -1508,7 +1496,7 @@ early_param("percpu_alloc", percpu_alloc
>  /**
>   * pcpu_embed_first_chunk - embed the first percpu chunk into bootmem
>   * @reserved_size: the size of reserved percpu area in bytes
> - * @dyn_size: free size for dynamic allocation in bytes, -1 for auto
> + * @dyn_size: minimum free size for dynamic allocation in bytes
>   * @atom_size: allocation atom size
>   * @cpu_distance_fn: callback to determine distance between cpus, optional
>   * @alloc_fn: function to allocate percpu page
> @@ -1529,10 +1517,7 @@ early_param("percpu_alloc", percpu_alloc
>   * vmalloc space is not orders of magnitude larger than distances
>   * between node memory addresses (ie. 32bit NUMA machines).
>   *
> - * When @dyn_size is positive, dynamic area might be larger than
> - * specified to fill page alignment.  When @dyn_size is auto,
> - * @dyn_size is just big enough to fill page alignment after static
> - * and reserved areas.
> + * @dyn_size specifies the minimum dynamic area size.
>   *
>   * If the needed size is smaller than the minimum or specified unit
>   * size, the leftover is returned using @free_fn.
> @@ -1540,7 +1525,7 @@ early_param("percpu_alloc", percpu_alloc
>   * RETURNS:
>   * 0 on success, -errno on failure.
>   */
> -int __init pcpu_embed_first_chunk(size_t reserved_size, ssize_t dyn_size,
> +int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
>  				  size_t atom_size,
>  				  pcpu_fc_cpu_distance_fn_t cpu_distance_fn,
>  				  pcpu_fc_alloc_fn_t alloc_fn,
> @@ -1671,7 +1656,7 @@ int __init pcpu_page_first_chunk(size_t 
>  
>  	snprintf(psize_str, sizeof(psize_str), "%luK", PAGE_SIZE >> 10);
>  
> -	ai = pcpu_build_alloc_info(reserved_size, -1, PAGE_SIZE, NULL);
> +	ai = pcpu_build_alloc_info(reserved_size, 0, PAGE_SIZE, NULL);
>  	if (IS_ERR(ai))
>  		return PTR_ERR(ai);
>  	BUG_ON(ai->nr_groups != 1);
> 
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 02/16] [PATCH 1/2] percpu: make @dyn_size always mean min dyn_size in first chunk init functions
  2010-06-27  5:06   ` David Rientjes
@ 2010-06-27  8:21     ` Tejun Heo
  2010-06-27 16:57       ` [S+Q 02/16] [PATCH 1/2 UPDATED] " Tejun Heo
  2010-06-27 19:24       ` [S+Q 02/16] [PATCH 1/2] " David Rientjes
  2010-06-29 15:36     ` Christoph Lameter
  1 sibling, 2 replies; 72+ messages in thread
From: Tejun Heo @ 2010-06-27  8:21 UTC (permalink / raw)
  To: David Rientjes
  Cc: Christoph Lameter, Pekka Enberg, linux-mm, Nick Piggin, Matt Mackall

Hello,

On 06/27/2010 07:06 AM, David Rientjes wrote:
> On Fri, 25 Jun 2010, Christoph Lameter wrote:
> 
>> In pcpu_alloc_info()
> 
> You mean pcpu_build_alloc_info()?

Yeap.

>> @@ -105,7 +105,7 @@ extern struct pcpu_alloc_info * __init p
>>  extern void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai);
>>  
>>  extern struct pcpu_alloc_info * __init pcpu_build_alloc_info(
>> -				size_t reserved_size, ssize_t dyn_size,
>> +				size_t reserved_size, size_t dyn_size,
>>  				size_t atom_size,
>>  				pcpu_fc_cpu_distance_fn_t cpu_distance_fn);
>>  
> 
> This can just be removed entirely, it's unnecessarily global.

Oh yeah, it's not used outside mm/percpu.c anymore.  I'll make it
static.

>>  /**
>>   * pcpu_alloc_alloc_info - allocate percpu allocation info
>>   * @nr_groups: the number of groups
>> @@ -1060,7 +1046,7 @@ void __init pcpu_free_alloc_info(struct 
>>  /**
>>   * pcpu_build_alloc_info - build alloc_info considering distances between CPUs
>>   * @reserved_size: the size of reserved percpu area in bytes
>> - * @dyn_size: free size for dynamic allocation in bytes, -1 for auto
>> + * @dyn_size: free size for dynamic allocation in bytes
> 
> It's the minimum free size, it's not necessarily the exact size due to 
> round-up.

Will update.

>>  struct pcpu_alloc_info * __init pcpu_build_alloc_info(
>> -				size_t reserved_size, ssize_t dyn_size,
>> +				size_t reserved_size, size_t dyn_size,
>>  				size_t atom_size,
>>  				pcpu_fc_cpu_distance_fn_t cpu_distance_fn)
>>  {
>> @@ -1098,13 +1084,15 @@ struct pcpu_alloc_info * __init pcpu_bui
>>  	memset(group_map, 0, sizeof(group_map));
>>  	memset(group_cnt, 0, sizeof(group_map));
>>  
>> +	size_sum = PFN_ALIGN(static_size + reserved_size + dyn_size);
>> +	dyn_size = size_sum - static_size - reserved_size;
> 
> Ok, so the only purpose of "dyn_size" is to be stored in the struct
> pcpu_alloc_info later.  Before this patch, ai->dyn_size would always be 0
> if that's what was passed to pcpu_build_alloc_info(), but due to this
> arithmetic it now requires static_size + reserved_size to be pfn
> aligned for that to hold.  Where is that enforced, or do we not care?

I'm not really following you, but

* Nobody called pcpu_build_alloc_info() w/ zero dyn_size.  It was
  either -1 or positive minimum size.

* None of static_size, reserved_size or dyn_size needs to be page
  aligned.

Thanks for the review.

-- 
tejun


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [S+Q 02/16] [PATCH 1/2 UPDATED] percpu: make @dyn_size always mean min dyn_size in first chunk init functions
  2010-06-27  8:21     ` Tejun Heo
@ 2010-06-27 16:57       ` Tejun Heo
  2010-06-27 19:25         ` David Rientjes
  2010-06-27 19:24       ` [S+Q 02/16] [PATCH 1/2] " David Rientjes
  1 sibling, 1 reply; 72+ messages in thread
From: Tejun Heo @ 2010-06-27 16:57 UTC (permalink / raw)
  To: David Rientjes
  Cc: Christoph Lameter, Pekka Enberg, linux-mm, Nick Piggin, Matt Mackall

In pcpu_build_alloc_info() and pcpu_embed_first_chunk(), @dyn_size was
ssize_t, -1 meant auto-size, 0 forced 0 and positive meant minimum
size.  There's no use case for forcing 0 and the upcoming early alloc
support always requires non-zero dynamic size.  Make @dyn_size always
mean minimum dyn_size.

While at it, make pcpu_build_alloc_info() static which doesn't have
any external caller as suggested by David Rientjes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
---

Here's the updated patch.  I've pushed out this and the second patch
to linux-next.  Please feel free to pull from the following git tree.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git for-next

I'll apply 09 once it's decided how it's going to be handled.

Thanks.

 include/linux/percpu.h |    7 +------
 mm/percpu.c            |   35 ++++++++++-------------------------
 2 files changed, 11 insertions(+), 31 deletions(-)

Index: work/include/linux/percpu.h
===================================================================
--- work.orig/include/linux/percpu.h
+++ work/include/linux/percpu.h
@@ -104,16 +104,11 @@ extern struct pcpu_alloc_info * __init p
 							     int nr_units);
 extern void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai);

-extern struct pcpu_alloc_info * __init pcpu_build_alloc_info(
-				size_t reserved_size, ssize_t dyn_size,
-				size_t atom_size,
-				pcpu_fc_cpu_distance_fn_t cpu_distance_fn);
-
 extern int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 					 void *base_addr);

 #ifdef CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK
-extern int __init pcpu_embed_first_chunk(size_t reserved_size, ssize_t dyn_size,
+extern int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
 				size_t atom_size,
 				pcpu_fc_cpu_distance_fn_t cpu_distance_fn,
 				pcpu_fc_alloc_fn_t alloc_fn,
Index: work/mm/percpu.c
===================================================================
--- work.orig/mm/percpu.c
+++ work/mm/percpu.c
@@ -1013,20 +1013,6 @@ phys_addr_t per_cpu_ptr_to_phys(void *ad
 		return page_to_phys(pcpu_addr_to_page(addr));
 }

-static inline size_t pcpu_calc_fc_sizes(size_t static_size,
-					size_t reserved_size,
-					ssize_t *dyn_sizep)
-{
-	size_t size_sum;
-
-	size_sum = PFN_ALIGN(static_size + reserved_size +
-			     (*dyn_sizep >= 0 ? *dyn_sizep : 0));
-	if (*dyn_sizep != 0)
-		*dyn_sizep = size_sum - static_size - reserved_size;
-
-	return size_sum;
-}
-
 /**
  * pcpu_alloc_alloc_info - allocate percpu allocation info
  * @nr_groups: the number of groups
@@ -1085,7 +1071,7 @@ void __init pcpu_free_alloc_info(struct
 /**
  * pcpu_build_alloc_info - build alloc_info considering distances between CPUs
  * @reserved_size: the size of reserved percpu area in bytes
- * @dyn_size: free size for dynamic allocation in bytes, -1 for auto
+ * @dyn_size: minimum free size for dynamic allocation in bytes
  * @atom_size: allocation atom size
  * @cpu_distance_fn: callback to determine distance between cpus, optional
  *
@@ -1103,8 +1089,8 @@ void __init pcpu_free_alloc_info(struct
  * On success, pointer to the new allocation_info is returned.  On
  * failure, ERR_PTR value is returned.
  */
-struct pcpu_alloc_info * __init pcpu_build_alloc_info(
-				size_t reserved_size, ssize_t dyn_size,
+static struct pcpu_alloc_info * __init pcpu_build_alloc_info(
+				size_t reserved_size, size_t dyn_size,
 				size_t atom_size,
 				pcpu_fc_cpu_distance_fn_t cpu_distance_fn)
 {
@@ -1123,13 +1109,15 @@ struct pcpu_alloc_info * __init pcpu_bui
 	memset(group_map, 0, sizeof(group_map));
 	memset(group_cnt, 0, sizeof(group_cnt));

+	size_sum = PFN_ALIGN(static_size + reserved_size + dyn_size);
+	dyn_size = size_sum - static_size - reserved_size;
+
 	/*
 	 * Determine min_unit_size, alloc_size and max_upa such that
 	 * alloc_size is multiple of atom_size and is the smallest
 	 * which can accomodate 4k aligned segments which are equal to
 	 * or larger than min_unit_size.
 	 */
-	size_sum = pcpu_calc_fc_sizes(static_size, reserved_size, &dyn_size);
 	min_unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE);

 	alloc_size = roundup(min_unit_size, atom_size);
@@ -1532,7 +1520,7 @@ early_param("percpu_alloc", percpu_alloc
 /**
  * pcpu_embed_first_chunk - embed the first percpu chunk into bootmem
  * @reserved_size: the size of reserved percpu area in bytes
- * @dyn_size: free size for dynamic allocation in bytes, -1 for auto
+ * @dyn_size: minimum free size for dynamic allocation in bytes
  * @atom_size: allocation atom size
  * @cpu_distance_fn: callback to determine distance between cpus, optional
  * @alloc_fn: function to allocate percpu page
@@ -1553,10 +1541,7 @@ early_param("percpu_alloc", percpu_alloc
  * vmalloc space is not orders of magnitude larger than distances
  * between node memory addresses (ie. 32bit NUMA machines).
  *
- * When @dyn_size is positive, dynamic area might be larger than
- * specified to fill page alignment.  When @dyn_size is auto,
- * @dyn_size is just big enough to fill page alignment after static
- * and reserved areas.
+ * @dyn_size specifies the minimum dynamic area size.
  *
  * If the needed size is smaller than the minimum or specified unit
  * size, the leftover is returned using @free_fn.
@@ -1564,7 +1549,7 @@ early_param("percpu_alloc", percpu_alloc
  * RETURNS:
  * 0 on success, -errno on failure.
  */
-int __init pcpu_embed_first_chunk(size_t reserved_size, ssize_t dyn_size,
+int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
 				  size_t atom_size,
 				  pcpu_fc_cpu_distance_fn_t cpu_distance_fn,
 				  pcpu_fc_alloc_fn_t alloc_fn,
@@ -1695,7 +1680,7 @@ int __init pcpu_page_first_chunk(size_t

 	snprintf(psize_str, sizeof(psize_str), "%luK", PAGE_SIZE >> 10);

-	ai = pcpu_build_alloc_info(reserved_size, -1, PAGE_SIZE, NULL);
+	ai = pcpu_build_alloc_info(reserved_size, 0, PAGE_SIZE, NULL);
 	if (IS_ERR(ai))
 		return PTR_ERR(ai);
 	BUG_ON(ai->nr_groups != 1);


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 02/16] [PATCH 1/2] percpu: make @dyn_size always mean min dyn_size in first chunk init functions
  2010-06-27  8:21     ` Tejun Heo
  2010-06-27 16:57       ` [S+Q 02/16] [PATCH 1/2 UPDATED] " Tejun Heo
@ 2010-06-27 19:24       ` David Rientjes
  1 sibling, 0 replies; 72+ messages in thread
From: David Rientjes @ 2010-06-27 19:24 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, Pekka Enberg, linux-mm, Nick Piggin, Matt Mackall

On Sun, 27 Jun 2010, Tejun Heo wrote:

> >>  struct pcpu_alloc_info * __init pcpu_build_alloc_info(
> >> -				size_t reserved_size, ssize_t dyn_size,
> >> +				size_t reserved_size, size_t dyn_size,
> >>  				size_t atom_size,
> >>  				pcpu_fc_cpu_distance_fn_t cpu_distance_fn)
> >>  {
> >> @@ -1098,13 +1084,15 @@ struct pcpu_alloc_info * __init pcpu_bui
> >>  	memset(group_map, 0, sizeof(group_map));
> >>  	memset(group_cnt, 0, sizeof(group_map));
> >>  
> >> +	size_sum = PFN_ALIGN(static_size + reserved_size + dyn_size);
> >> +	dyn_size = size_sum - static_size - reserved_size;
> > 
> > Ok, so the only purpose of "dyn_size" is to store in the struct 
> > pcpu_alloc_info later.  Before this patch, ai->dyn_size would always be 0 
> > if that's what was passed to pcpu_build_alloc_info(), but due to this 
> > arithmetic it now requires that static_size + reserved_size to be pfn 
> > aligned.  Where is that enforced or do we not care?
> 
> I'm not really following you, but
> 
> * Nobody called pcpu_build_alloc_info() w/ zero dyn_size.  It was
>   either -1 or positive minimum size.
> 

Ok, the commit description said that passing pcpu_build_alloc_info() a
dyn_size of 0 would force it to be 0.  The arithmetic introduced by this
patch would not necessarily preserve that: if static_size + reserved_size
is not page aligned, size_sum is rounded up past static_size +
reserved_size and ai->dyn_size ends up non-zero.  Since there are no
users passing a dyn_size of 0, my concern is addressed.
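To make that concrete, a small made-up example (assuming 4K pages; this is
an illustration, not code from the patch):

	size_t static_size   = 10000;	/* not page aligned */
	size_t reserved_size = 0;
	size_t dyn_size      = 0;	/* "forced 0" under the old interface */

	size_sum = PFN_ALIGN(static_size + reserved_size + dyn_size);
					/* = 12288, i.e. 3 pages */
	dyn_size = size_sum - static_size - reserved_size;
					/* = 2288, not 0 */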

Thanks!


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 02/16] [PATCH 1/2 UPDATED] percpu: make @dyn_size always mean min dyn_size in first chunk init functions
  2010-06-27 16:57       ` [S+Q 02/16] [PATCH 1/2 UPDATED] " Tejun Heo
@ 2010-06-27 19:25         ` David Rientjes
  0 siblings, 0 replies; 72+ messages in thread
From: David Rientjes @ 2010-06-27 19:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, Pekka Enberg, linux-mm, Nick Piggin, Matt Mackall

On Sun, 27 Jun 2010, Tejun Heo wrote:

> In pcpu_build_alloc_info() and pcpu_embed_first_chunk(), @dyn_size was
> ssize_t, -1 meant auto-size, 0 forced 0 and positive meant minimum
> size.  There's no use case for forcing 0 and the upcoming early alloc
> support always requires non-zero dynamic size.  Make @dyn_size always
> mean minimum dyn_size.
> 
> While at it, make pcpu_build_alloc_info() static which doesn't have
> any external caller as suggested by David Rientjes.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: David Rientjes <rientjes@google.com>

Acked-by: David Rientjes <rientjes@google.com>


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 01/16] [PATCH] ipc/sem.c: Bugfix for semop() not reporting successful operation
  2010-06-25 21:20 ` [S+Q 01/16] [PATCH] ipc/sem.c: Bugfix for semop() not reporting successful operation Christoph Lameter
@ 2010-06-28  2:17   ` KAMEZAWA Hiroyuki
  2010-06-28 16:45     ` Manfred Spraul
  2010-06-28 16:48     ` Pekka Enberg
  1 sibling, 1 reply; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-28  2:17 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, linux-mm, Manfred Spraul, Nick Piggin, Matt Mackall

On Fri, 25 Jun 2010 16:20:27 -0500
Christoph Lameter <cl@linux-foundation.org> wrote:

> [Necessary to make 2.6.35-rc3 not deadlock. Not sure if this is the "right"(tm)
> fix]
> 
> The last change to improve the scalability moved the actual wake-up out of
> the section that is protected by spin_lock(sma->sem_perm.lock).
> 
> This means that IN_WAKEUP can be in queue.status even when the spinlock is
> acquired by the current task. Thus the same loop that is performed when
> queue.status is read without the spinlock acquired must be performed when
> the spinlock is acquired.
> 
> Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>


Hmm, I'm sorry if I don't understand the code...

> 
> ---
>  ipc/sem.c |   36 ++++++++++++++++++++++++++++++------
>  1 files changed, 30 insertions(+), 6 deletions(-)
> 
> diff --git a/ipc/sem.c b/ipc/sem.c
> index 506c849..523665f 100644
> --- a/ipc/sem.c
> +++ b/ipc/sem.c
> @@ -1256,6 +1256,32 @@ out:
>  	return un;
>  }
>  
> +
> +/** get_queue_result - Retrieve the result code from sem_queue
> + * @q: Pointer to queue structure
> + *
> + * The function retrieve the return code from the pending queue. If 
> + * IN_WAKEUP is found in q->status, then we must loop until the value
> + * is replaced with the final value: This may happen if a task is
> + * woken up by an unrelated event (e.g. signal) and in parallel the task
> + * is woken up by another task because it got the requested semaphores.
> + *
> + * The function can be called with or without holding the semaphore spinlock.
> + */
> +static int get_queue_result(struct sem_queue *q)
> +{
> +	int error;
> +
> +	error = q->status;
> +	while(unlikely(error == IN_WAKEUP)) {
> +		cpu_relax();
> +		error = q->status;
> +	}
> +
> +	return error;
> +}

no memory barrier is required ?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 04/16] slub: Use a constant for a unspecified node.
  2010-06-25 21:20 ` [S+Q 04/16] slub: Use a constant for a unspecified node Christoph Lameter
@ 2010-06-28  2:25   ` KAMEZAWA Hiroyuki
  2010-06-29 15:38     ` Christoph Lameter
  0 siblings, 1 reply; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-28  2:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, linux-mm, David Rientjes, Nick Piggin, Matt Mackall

On Fri, 25 Jun 2010 16:20:30 -0500
Christoph Lameter <cl@linux-foundation.org> wrote:

> kmalloc_node() and friends can be passed a constant -1 to indicate
> that no choice was made for the node from which the object needs to
> come.
> 
> Use NUMA_NO_NODE instead of -1.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

How about more updates?

Hmm, by grep (mmotm)
==
static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
{
        struct page *page;
        int searchnode = (node == -1) ? numa_node_id() : node;
==
==
static inline int node_match(struct kmem_cache_cpu *c, int node)
{
#ifdef CONFIG_NUMA
        if (node != -1 && c->node != node)
                return 0;
#endif
        return 1;
}
==

==
debug:
        if (!alloc_debug_processing(s, c->page, object, addr))
                goto another_slab;

        c->page->inuse++;
        c->page->freelist = get_freepointer(s, object);
        c->node = -1;
        goto unlock_out;
}
==
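For completeness, converting those remaining spots would look roughly like
this (untested sketch, just substituting NUMA_NO_NODE for the open-coded
-1 in the hunks above):
==
        int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
        ...
        if (node != NUMA_NO_NODE && c->node != node)
                return 0;
        ...
        c->node = NUMA_NO_NODE;
==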

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 05/16] SLUB: Constants need UL
  2010-06-25 21:20 ` [S+Q 05/16] SLUB: Constants need UL Christoph Lameter
  2010-06-26 23:31   ` David Rientjes
@ 2010-06-28  2:27   ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-28  2:27 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-mm, Nick Piggin, Matt Mackall

On Fri, 25 Jun 2010 16:20:31 -0500
Christoph Lameter <cl@linux-foundation.org> wrote:

> UL suffix is missing in some constants. Conform to how slab.h uses constants.
> 
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 08/16] slub: remove dynamic dma slab allocation
  2010-06-25 21:20 ` [S+Q 08/16] slub: remove dynamic dma slab allocation Christoph Lameter
  2010-06-26 23:52   ` David Rientjes
@ 2010-06-28  2:33   ` KAMEZAWA Hiroyuki
  2010-06-29 15:41     ` Christoph Lameter
  1 sibling, 1 reply; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-28  2:33 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-mm, Nick Piggin, Matt Mackall

On Fri, 25 Jun 2010 16:20:34 -0500
Christoph Lameter <cl@linux-foundation.org> wrote:

> Remove the dynamic dma slab allocation since this causes too many issues with
> nested locks etc etc. The change avoids passing gfpflags into many functions.
> 
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 

Uh... I think just using GFP_KERNEL drops too many of the flags that the
caller requested via gfp_mask.

How about this ?

gfp_mask = (gfp_mask & GFP_RECLAIM_MASK); 

if you want to drop __GFP_DMA.

Thanks,
-Kame



> ---
>  mm/slub.c |  153 ++++++++++++++++----------------------------------------------
>  1 file changed, 41 insertions(+), 112 deletions(-)
> 
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c	2010-06-15 12:40:58.000000000 -0500
> +++ linux-2.6/mm/slub.c	2010-06-15 12:41:36.000000000 -0500
> @@ -2070,7 +2070,7 @@ init_kmem_cache_node(struct kmem_cache_n
>  
>  static DEFINE_PER_CPU(struct kmem_cache_cpu, kmalloc_percpu[KMALLOC_CACHES]);
>  
> -static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
> +static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
>  {
>  	if (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches)
>  		/*
> @@ -2097,7 +2097,7 @@ static inline int alloc_kmem_cache_cpus(
>   * when allocating for the kmalloc_node_cache. This is used for bootstrapping
>   * memory on a fresh node that has no slab structures yet.
>   */
> -static void early_kmem_cache_node_alloc(gfp_t gfpflags, int node)
> +static void early_kmem_cache_node_alloc(int node)
>  {
>  	struct page *page;
>  	struct kmem_cache_node *n;
> @@ -2105,7 +2105,7 @@ static void early_kmem_cache_node_alloc(
>  
>  	BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
>  
> -	page = new_slab(kmalloc_caches, gfpflags, node);
> +	page = new_slab(kmalloc_caches, GFP_KERNEL, node);
>  
>  	BUG_ON(!page);
>  	if (page_to_nid(page) != node) {
> @@ -2149,7 +2149,7 @@ static void free_kmem_cache_nodes(struct
>  	}
>  }
>  
> -static int init_kmem_cache_nodes(struct kmem_cache *s, gfp_t gfpflags)
> +static int init_kmem_cache_nodes(struct kmem_cache *s)
>  {
>  	int node;
>  
> @@ -2157,11 +2157,11 @@ static int init_kmem_cache_nodes(struct 
>  		struct kmem_cache_node *n;
>  
>  		if (slab_state == DOWN) {
> -			early_kmem_cache_node_alloc(gfpflags, node);
> +			early_kmem_cache_node_alloc(node);
>  			continue;
>  		}
>  		n = kmem_cache_alloc_node(kmalloc_caches,
> -						gfpflags, node);
> +						GFP_KERNEL, node);
>  
>  		if (!n) {
>  			free_kmem_cache_nodes(s);
> @@ -2178,7 +2178,7 @@ static void free_kmem_cache_nodes(struct
>  {
>  }
>  
> -static int init_kmem_cache_nodes(struct kmem_cache *s, gfp_t gfpflags)
> +static int init_kmem_cache_nodes(struct kmem_cache *s)
>  {
>  	init_kmem_cache_node(&s->local_node, s);
>  	return 1;
> @@ -2318,7 +2318,7 @@ static int calculate_sizes(struct kmem_c
>  
>  }
>  
> -static int kmem_cache_open(struct kmem_cache *s, gfp_t gfpflags,
> +static int kmem_cache_open(struct kmem_cache *s,
>  		const char *name, size_t size,
>  		size_t align, unsigned long flags,
>  		void (*ctor)(void *))
> @@ -2354,10 +2354,10 @@ static int kmem_cache_open(struct kmem_c
>  #ifdef CONFIG_NUMA
>  	s->remote_node_defrag_ratio = 1000;
>  #endif
> -	if (!init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
> +	if (!init_kmem_cache_nodes(s))
>  		goto error;
>  
> -	if (alloc_kmem_cache_cpus(s, gfpflags & ~SLUB_DMA))
> +	if (alloc_kmem_cache_cpus(s))
>  		return 1;
>  
>  	free_kmem_cache_nodes(s);
> @@ -2517,6 +2517,10 @@ EXPORT_SYMBOL(kmem_cache_destroy);
>  struct kmem_cache kmalloc_caches[KMALLOC_CACHES] __cacheline_aligned;
>  EXPORT_SYMBOL(kmalloc_caches);
>  
> +#ifdef CONFIG_ZONE_DMA
> +static struct kmem_cache kmalloc_dma_caches[SLUB_PAGE_SHIFT];
> +#endif
> +
>  static int __init setup_slub_min_order(char *str)
>  {
>  	get_option(&str, &slub_min_order);
> @@ -2553,116 +2557,26 @@ static int __init setup_slub_nomerge(cha
>  
>  __setup("slub_nomerge", setup_slub_nomerge);
>  
> -static struct kmem_cache *create_kmalloc_cache(struct kmem_cache *s,
> -		const char *name, int size, gfp_t gfp_flags)
> +static void create_kmalloc_cache(struct kmem_cache *s,
> +		const char *name, int size, unsigned int flags)
>  {
> -	unsigned int flags = 0;
> -
> -	if (gfp_flags & SLUB_DMA)
> -		flags = SLAB_CACHE_DMA;
> -
>  	/*
>  	 * This function is called with IRQs disabled during early-boot on
>  	 * single CPU so there's no need to take slub_lock here.
>  	 */
> -	if (!kmem_cache_open(s, gfp_flags, name, size, ARCH_KMALLOC_MINALIGN,
> +	if (!kmem_cache_open(s, name, size, ARCH_KMALLOC_MINALIGN,
>  								flags, NULL))
>  		goto panic;
>  
>  	list_add(&s->list, &slab_caches);
>  
> -	if (sysfs_slab_add(s))
> -		goto panic;
> -	return s;
> +	if (!sysfs_slab_add(s))
> +		return;
>  
>  panic:
>  	panic("Creation of kmalloc slab %s size=%d failed.\n", name, size);
>  }
>  
> -#ifdef CONFIG_ZONE_DMA
> -static struct kmem_cache *kmalloc_caches_dma[SLUB_PAGE_SHIFT];
> -
> -static void sysfs_add_func(struct work_struct *w)
> -{
> -	struct kmem_cache *s;
> -
> -	down_write(&slub_lock);
> -	list_for_each_entry(s, &slab_caches, list) {
> -		if (s->flags & __SYSFS_ADD_DEFERRED) {
> -			s->flags &= ~__SYSFS_ADD_DEFERRED;
> -			sysfs_slab_add(s);
> -		}
> -	}
> -	up_write(&slub_lock);
> -}
> -
> -static DECLARE_WORK(sysfs_add_work, sysfs_add_func);
> -
> -static noinline struct kmem_cache *dma_kmalloc_cache(int index, gfp_t flags)
> -{
> -	struct kmem_cache *s;
> -	char *text;
> -	size_t realsize;
> -	unsigned long slabflags;
> -	int i;
> -
> -	s = kmalloc_caches_dma[index];
> -	if (s)
> -		return s;
> -
> -	/* Dynamically create dma cache */
> -	if (flags & __GFP_WAIT)
> -		down_write(&slub_lock);
> -	else {
> -		if (!down_write_trylock(&slub_lock))
> -			goto out;
> -	}
> -
> -	if (kmalloc_caches_dma[index])
> -		goto unlock_out;
> -
> -	realsize = kmalloc_caches[index].objsize;
> -	text = kasprintf(flags & ~SLUB_DMA, "kmalloc_dma-%d",
> -			 (unsigned int)realsize);
> -
> -	s = NULL;
> -	for (i = 0; i < KMALLOC_CACHES; i++)
> -		if (!kmalloc_caches[i].size)
> -			break;
> -
> -	BUG_ON(i >= KMALLOC_CACHES);
> -	s = kmalloc_caches + i;
> -
> -	/*
> -	 * Must defer sysfs creation to a workqueue because we don't know
> -	 * what context we are called from. Before sysfs comes up, we don't
> -	 * need to do anything because our sysfs initcall will start by
> -	 * adding all existing slabs to sysfs.
> -	 */
> -	slabflags = SLAB_CACHE_DMA|SLAB_NOTRACK;
> -	if (slab_state >= SYSFS)
> -		slabflags |= __SYSFS_ADD_DEFERRED;
> -
> -	if (!text || !kmem_cache_open(s, flags, text,
> -			realsize, ARCH_KMALLOC_MINALIGN, slabflags, NULL)) {
> -		s->size = 0;
> -		kfree(text);
> -		goto unlock_out;
> -	}
> -
> -	list_add(&s->list, &slab_caches);
> -	kmalloc_caches_dma[index] = s;
> -
> -	if (slab_state >= SYSFS)
> -		schedule_work(&sysfs_add_work);
> -
> -unlock_out:
> -	up_write(&slub_lock);
> -out:
> -	return kmalloc_caches_dma[index];
> -}
> -#endif
> -
>  /*
>   * Conversion table for small slabs sizes / 8 to the index in the
>   * kmalloc array. This is necessary for slabs < 192 since we have non power
> @@ -2715,7 +2629,7 @@ static struct kmem_cache *get_slab(size_
>  
>  #ifdef CONFIG_ZONE_DMA
>  	if (unlikely((flags & SLUB_DMA)))
> -		return dma_kmalloc_cache(index, flags);
> +		return &kmalloc_dma_caches[index];
>  
>  #endif
>  	return &kmalloc_caches[index];
> @@ -3053,7 +2967,7 @@ void __init kmem_cache_init(void)
>  	 * kmem_cache_open for slab_state == DOWN.
>  	 */
>  	create_kmalloc_cache(&kmalloc_caches[0], "kmem_cache_node",
> -		sizeof(struct kmem_cache_node), GFP_NOWAIT);
> +		sizeof(struct kmem_cache_node), 0);
>  	kmalloc_caches[0].refcount = -1;
>  	caches++;
>  
> @@ -3066,18 +2980,18 @@ void __init kmem_cache_init(void)
>  	/* Caches that are not of the two-to-the-power-of size */
>  	if (KMALLOC_MIN_SIZE <= 32) {
>  		create_kmalloc_cache(&kmalloc_caches[1],
> -				"kmalloc-96", 96, GFP_NOWAIT);
> +				"kmalloc-96", 96, 0);
>  		caches++;
>  	}
>  	if (KMALLOC_MIN_SIZE <= 64) {
>  		create_kmalloc_cache(&kmalloc_caches[2],
> -				"kmalloc-192", 192, GFP_NOWAIT);
> +				"kmalloc-192", 192, 0);
>  		caches++;
>  	}
>  
>  	for (i = KMALLOC_SHIFT_LOW; i < SLUB_PAGE_SHIFT; i++) {
>  		create_kmalloc_cache(&kmalloc_caches[i],
> -			"kmalloc", 1 << i, GFP_NOWAIT);
> +			"kmalloc", 1 << i, 0);
>  		caches++;
>  	}
>  
> @@ -3124,7 +3038,7 @@ void __init kmem_cache_init(void)
>  
>  	/* Provide the correct kmalloc names now that the caches are up */
>  	for (i = KMALLOC_SHIFT_LOW; i < SLUB_PAGE_SHIFT; i++)
> -		kmalloc_caches[i]. name =
> +		kmalloc_caches[i].name =
>  			kasprintf(GFP_NOWAIT, "kmalloc-%d", 1 << i);
>  
>  #ifdef CONFIG_SMP
> @@ -3147,6 +3061,21 @@ void __init kmem_cache_init(void)
>  
>  void __init kmem_cache_init_late(void)
>  {
> +#ifdef CONFIG_ZONE_DMA
> +	int i;
> +
> +	for (i = 0; i < SLUB_PAGE_SHIFT; i++) {
> +		struct kmem_cache *s = &kmalloc_caches[i];
> +
> +		if (s && s->size) {
> +			char *name = kasprintf(GFP_KERNEL,
> +				 "dma-kmalloc-%d", s->objsize);
> +
> +			create_kmalloc_cache(&kmalloc_dma_caches[i],
> +				name, s->objsize, SLAB_CACHE_DMA);
> +		}
> +	}
> +#endif
>  }
>  
>  /*
> @@ -3241,7 +3170,7 @@ struct kmem_cache *kmem_cache_create(con
>  
>  	s = kmalloc(kmem_size, GFP_KERNEL);
>  	if (s) {
> -		if (kmem_cache_open(s, GFP_KERNEL, name,
> +		if (kmem_cache_open(s, name,
>  				size, align, flags, ctor)) {
>  			list_add(&s->list, &slab_caches);
>  			up_write(&slub_lock);
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench
  2010-06-26  2:24 ` [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench Nick Piggin
@ 2010-06-28  6:18   ` Pekka Enberg
  2010-06-28 10:12     ` Christoph Lameter
  2010-06-28 14:46     ` Matt Mackall
  0 siblings, 2 replies; 72+ messages in thread
From: Pekka Enberg @ 2010-06-28  6:18 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Lameter, linux-mm, Matt Mackall

On Sat, Jun 26, 2010 at 5:24 AM, Nick Piggin <npiggin@suse.de> wrote:
> On Fri, Jun 25, 2010 at 04:20:26PM -0500, Christoph Lameter wrote:
>> The following patchset cleans some pieces up and then equips SLUB with
>> per cpu queues that work similar to SLABs queues. With that approach
>> SLUB wins in hackbench:
>
> Hackbench I don't think is that interesting. SLQB was beating SLAB
> too.

We've seen regressions pop up with hackbench so I think it's
interesting. Not the most interesting one, for sure, nor conclusive.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench
  2010-06-28  6:18   ` Pekka Enberg
@ 2010-06-28 10:12     ` Christoph Lameter
  2010-06-28 15:18       ` Pekka Enberg
  2010-06-28 14:46     ` Matt Mackall
  1 sibling, 1 reply; 72+ messages in thread
From: Christoph Lameter @ 2010-06-28 10:12 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Nick Piggin, linux-mm, Matt Mackall

On Mon, 28 Jun 2010, Pekka Enberg wrote:

> > Hackbench I don't think is that interesting. SLQB was beating SLAB
> > too.
>
> We've seen regressions pop up with hackbench so I think it's
> interesting. Not the most interesting one, for sure, nor conclusive.
>
Hackbench was frequently cited in performance tests. Which benchmarks
would be of interest?  I am off this week, so don't expect a fast response
from me.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 12/16] SLUB: Add SLAB style per cpu queueing
  2010-06-26  2:32   ` Nick Piggin
@ 2010-06-28 10:19     ` Christoph Lameter
  0 siblings, 0 replies; 72+ messages in thread
From: Christoph Lameter @ 2010-06-28 10:19 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Pekka Enberg, linux-mm, Matt Mackall

On Sat, 26 Jun 2010, Nick Piggin wrote:

> > The SLAB scheme of not touching the object during management is adopted.
> > SLUB can now efficiently free and allocate cache cold objects.
>
> BTW. this was never the problem with SLUB, because SLQB didn't have
> the big performance regression on tpcc. SLUB IIRC had to touch more
> cachelines per operation.

I wish you were more detailed here. SLUB was designed for minimal cacheline
footprint and always had an edge there. AFAICT SLQB was able to address
tpcc mainly because the issue with the hotpath on free was addressed.
These were issues with atomic ops on free, not cache footprint.




^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench
  2010-06-28  6:18   ` Pekka Enberg
  2010-06-28 10:12     ` Christoph Lameter
@ 2010-06-28 14:46     ` Matt Mackall
  1 sibling, 0 replies; 72+ messages in thread
From: Matt Mackall @ 2010-06-28 14:46 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Nick Piggin, Christoph Lameter, linux-mm

On Mon, 2010-06-28 at 09:18 +0300, Pekka Enberg wrote:
> On Sat, Jun 26, 2010 at 5:24 AM, Nick Piggin <npiggin@suse.de> wrote:
> > On Fri, Jun 25, 2010 at 04:20:26PM -0500, Christoph Lameter wrote:
> >> The following patchset cleans some pieces up and then equips SLUB with
> >> per cpu queues that work similar to SLABs queues. With that approach
> >> SLUB wins in hackbench:
> >
> > Hackbench I don't think is that interesting. SLQB was beating SLAB
> > too.
> 
> We've seen regressions pop up with hackbench so I think it's
> interesting. Not the most interesting one, for sure, nor conclusive.

Looks like most of the stuff up to 12 is a good idea.

Christoph, is there any test where this is likely to lose substantial
ground to SLUB without queueing? Can we characterize that? We're in
danger now of getting into the situation where we can't drop SLUB for
the same reasons we can't drop SLAB - big performance regressions.

-- 
Mathematics is the supreme nostalgia of our time.



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench
  2010-06-28 10:12     ` Christoph Lameter
@ 2010-06-28 15:18       ` Pekka Enberg
  2010-06-28 18:54         ` David Rientjes
  2010-06-29 15:21         ` Christoph Lameter
  0 siblings, 2 replies; 72+ messages in thread
From: Pekka Enberg @ 2010-06-28 15:18 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, linux-mm, Matt Mackall, David Rientjes, Mel Gorman

On Mon, 28 Jun 2010, Pekka Enberg wrote:
>> > Hackbench I don't think is that interesting. SLQB was beating SLAB
>> > too.
>>
>> We've seen regressions pop up with hackbench so I think it's
>> interesting. Not the most interesting one, for sure, nor conclusive.

On Mon, Jun 28, 2010 at 1:12 PM, Christoph Lameter
<cl@linux-foundation.org> wrote:
> Hackbench was frequently cited in performance tests. Which benchmarks
> would be of interest?  I am off this week so dont expect a fast response
> from me.

I guess "netperf TCP_RR" is the most interesting one because that's a
known benchmark where SLUB performs poorly when compared to SLAB.
Mel's extensive slab benchmarks are also worth looking at:

http://lkml.indiana.edu/hypermail/linux/kernel/0902.0/00745.html


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 01/16] [PATCH] ipc/sem.c: Bugfix for semop() not reporting successful operation
  2010-06-28  2:17   ` KAMEZAWA Hiroyuki
@ 2010-06-28 16:45     ` Manfred Spraul
  2010-06-28 23:58       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 72+ messages in thread
From: Manfred Spraul @ 2010-06-28 16:45 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Christoph Lameter, Pekka Enberg, linux-mm, Nick Piggin, Matt Mackall

On 06/28/2010 04:17 AM, KAMEZAWA Hiroyuki wrote:
> On Fri, 25 Jun 2010 16:20:27 -0500
> Christoph Lameter<cl@linux-foundation.org>  wrote:
>
>    
>> [Necessary to make 2.6.35-rc3 not deadlock. Not sure if this is the "right"(tm)
>> fix]
>>
>> The last change to improve the scalability moved the actual wake-up out of
>> the section that is protected by spin_lock(sma->sem_perm.lock).
>>
>> This means that IN_WAKEUP can be in queue.status even when the spinlock is
>> acquired by the current task. Thus the same loop that is performed when
>> queue.status is read without the spinlock acquired must be performed when
>> the spinlock is acquired.
>>
>> Signed-off-by: Manfred Spraul<manfred@colorfullife.com>
>> Signed-off-by: Christoph Lameter<cl@linux-foundation.org>
>>      
>
> Hmm, I'm sorry if I don't understand the code...
>
>    
>> ---
>>   ipc/sem.c |   36 ++++++++++++++++++++++++++++++------
>>   1 files changed, 30 insertions(+), 6 deletions(-)
>>
>> diff --git a/ipc/sem.c b/ipc/sem.c
>> index 506c849..523665f 100644
>> --- a/ipc/sem.c
>> +++ b/ipc/sem.c
>> @@ -1256,6 +1256,32 @@ out:
>>   	return un;
>>   }
>>
>> +
>> +/** get_queue_result - Retrieve the result code from sem_queue
>> + * @q: Pointer to queue structure
>> + *
>> + * The function retrieve the return code from the pending queue. If
>> + * IN_WAKEUP is found in q->status, then we must loop until the value
>> + * is replaced with the final value: This may happen if a task is
>> + * woken up by an unrelated event (e.g. signal) and in parallel the task
>> + * is woken up by another task because it got the requested semaphores.
>> + *
>> + * The function can be called with or without holding the semaphore spinlock.
>> + */
>> +static int get_queue_result(struct sem_queue *q)
>> +{
>> +	int error;
>> +
>> +	error = q->status;
>> +	while(unlikely(error == IN_WAKEUP)) {
>> +		cpu_relax();
>> +		error = q->status;
>> +	}
>> +
>> +	return error;
>> +}
>>      
> no memory barrier is required ?
>
>    
No.
q->status is the only field that is read in the exit path of
sys_semtimedop(): after that, q->status is used as the return value of
sys_semtimedop(), without accessing any other field.
Thus no memory barrier is required: there is simply no other read/write
operation against which the read of q->status must be serialized.

There is an smp_wmb() in wake_up_sem_queue_do(), to ensure that all
writes done by the cpu that performs the wake-up are completed before
q->status is set to the final value.
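Schematically, the pattern is the following (not the literal ipc/sem.c
code, just the ordering described above):

	/* waker, in wake_up_sem_queue_*(): */
	q->status = IN_WAKEUP;		/* claim the queue entry */
	/* ... unlink q, wake_up_process(q->sleeper) ... */
	smp_wmb();			/* all writes above complete ... */
	q->status = error;		/* ... before the final status */

	/* sleeper, in get_queue_result(): */
	error = q->status;
	while (unlikely(error == IN_WAKEUP)) {
		cpu_relax();		/* compiler barrier, forces a re-read */
		error = q->status;
	}
	/* nothing else in *q is read after this point */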

--
     Manfred


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 01/16] [PATCH] ipc/sem.c: Bugfix for semop() not reporting  successful operation
  2010-06-25 21:20 ` [S+Q 01/16] [PATCH] ipc/sem.c: Bugfix for semop() not reporting successful operation Christoph Lameter
@ 2010-06-28 16:48     ` Pekka Enberg
  2010-06-28 16:48     ` Pekka Enberg
  1 sibling, 0 replies; 72+ messages in thread
From: Pekka Enberg @ 2010-06-28 16:48 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, Manfred Spraul, Nick Piggin, Matt Mackall, Andrew Morton, LKML

On Sat, Jun 26, 2010 at 12:20 AM, Christoph Lameter
<cl@linux-foundation.org> wrote:
> [Necessary to make 2.6.35-rc3 not deadlock. Not sure if this is the "right"(tm)
> fix]

Is this related to the SLUB patches? Regardless, let's add Andrew and
linux-kernel on CC.

> The last change to improve the scalability moved the actual wake-up out of
> the section that is protected by spin_lock(sma->sem_perm.lock).
>
> This means that IN_WAKEUP can be in queue.status even when the spinlock is
> acquired by the current task. Thus the same loop that is performed when
> queue.status is read without the spinlock acquired must be performed when
> the spinlock is acquired.
>
> Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
>
> ---
>  ipc/sem.c |   36 ++++++++++++++++++++++++++++++------
>  1 files changed, 30 insertions(+), 6 deletions(-)
>
> diff --git a/ipc/sem.c b/ipc/sem.c
> index 506c849..523665f 100644
> --- a/ipc/sem.c
> +++ b/ipc/sem.c
> @@ -1256,6 +1256,32 @@ out:
>        return un;
>  }
>
> +
> +/** get_queue_result - Retrieve the result code from sem_queue
> + * @q: Pointer to queue structure
> + *
> + * The function retrieve the return code from the pending queue. If
> + * IN_WAKEUP is found in q->status, then we must loop until the value
> + * is replaced with the final value: This may happen if a task is
> + * woken up by an unrelated event (e.g. signal) and in parallel the task
> + * is woken up by another task because it got the requested semaphores.
> + *
> + * The function can be called with or without holding the semaphore spinlock.
> + */
> +static int get_queue_result(struct sem_queue *q)
> +{
> +       int error;
> +
> +       error = q->status;
> +       while(unlikely(error == IN_WAKEUP)) {
> +               cpu_relax();
> +               error = q->status;
> +       }
> +
> +       return error;
> +}
> +
> +
>  SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
>                unsigned, nsops, const struct timespec __user *, timeout)
>  {
> @@ -1409,11 +1435,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
>        else
>                schedule();
>
> -       error = queue.status;
> -       while(unlikely(error == IN_WAKEUP)) {
> -               cpu_relax();
> -               error = queue.status;
> -       }
> +       error = get_queue_result(&queue);
>
>        if (error != -EINTR) {
>                /* fast path: update_queue already obtained all requested
> @@ -1427,10 +1449,12 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
>                goto out_free;
>        }
>
> +       error = get_queue_result(&queue);
> +
>        /*
>         * If queue.status != -EINTR we are woken up by another process
>         */
> -       error = queue.status;
> +
>        if (error != -EINTR) {
>                goto out_unlock_free;
>        }
> --
> 1.7.0.1
>
>

^ permalink raw reply	[flat|nested] 72+ messages in thread


* Re: [S+Q 09/16] [percpu] make allocpercpu usable during early boot
  2010-06-25 21:20 ` [S+Q 09/16] [percpu] make allocpercpu usable during early boot Christoph Lameter
  2010-06-26  8:10   ` Tejun Heo
  2010-06-26 23:38   ` David Rientjes
@ 2010-06-28 17:03   ` Pekka Enberg
  2010-06-29 15:45     ` Christoph Lameter
  2 siblings, 1 reply; 72+ messages in thread
From: Pekka Enberg @ 2010-06-28 17:03 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, tj, Nick Piggin, Matt Mackall, David Rientjes

On Sat, Jun 26, 2010 at 12:20 AM, Christoph Lameter
<cl@linux-foundation.org> wrote:
> allocpercpu() may be used during early boot after the page allocator
> has been bootstrapped but when interrupts are still off. Make sure
> that we do not do GFP_KERNEL allocations if this occurs.
>
> Cc: tj@kernel.org
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
>
> ---
>  mm/percpu.c |    5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/mm/percpu.c
> ===================================================================
> --- linux-2.6.orig/mm/percpu.c  2010-06-23 14:43:54.000000000 -0500
> +++ linux-2.6/mm/percpu.c       2010-06-23 14:44:05.000000000 -0500
> @@ -275,7 +275,8 @@ static void __maybe_unused pcpu_next_pop
>  * memory is always zeroed.
>  *
>  * CONTEXT:
> - * Does GFP_KERNEL allocation.
> + * Does GFP_KERNEL allocation (May be called early in boot when
> + * interrupts are still disabled. Will then do GFP_NOWAIT alloc).
>  *
>  * RETURNS:
>  * Pointer to the allocated area on success, NULL on failure.
> @@ -286,7 +287,7 @@ static void *pcpu_mem_alloc(size_t size)
>                return NULL;
>
>        if (size <= PAGE_SIZE)
> -               return kzalloc(size, GFP_KERNEL);
> +               return kzalloc(size, GFP_KERNEL & gfp_allowed_mask);
>        else {
>                void *ptr = vmalloc(size);
>                if (ptr)

This looks wrong to me. All slab allocators should do the gfp_allowed_mask
magic under the hood. Maybe it's triggering the kmalloc_large() path, which
needs the masking too?
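For reference, the masking in question sits at the top of the slab hot
path, along these general lines (a sketch from memory, not an exact quote
of the current tree):

	static __always_inline void *slab_alloc(struct kmem_cache *s,
			gfp_t gfpflags, int node, unsigned long addr)
	{
		/* e.g. drops __GFP_WAIT while gfp_allowed_mask is still
		 * restricted during early boot */
		gfpflags &= gfp_allowed_mask;
		...
	}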


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench
  2010-06-28 15:18       ` Pekka Enberg
@ 2010-06-28 18:54         ` David Rientjes
  2010-06-29 15:23           ` Christoph Lameter
  2010-06-29 15:21         ` Christoph Lameter
  1 sibling, 1 reply; 72+ messages in thread
From: David Rientjes @ 2010-06-28 18:54 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Nick Piggin, linux-mm, Matt Mackall, Mel Gorman

[-- Attachment #1: Type: TEXT/PLAIN, Size: 811 bytes --]

On Mon, 28 Jun 2010, Pekka Enberg wrote:

> > Hackbench was frequently cited in performance tests. Which benchmarks
> > would be of interest?  I am off this week so dont expect a fast response
> > from me.
> 
> I guess "netperf TCP_RR" is the most interesting one because that's a
> known benchmark where SLUB performs poorly when compared to SLAB.
> Mel's extensive slab benchmarks are also worth looking at:
> 
> http://lkml.indiana.edu/hypermail/linux/kernel/0902.0/00745.html
> 

In addition to that benchmark, which regresses on systems with larger 
numbers of cpus, you had posted results for slub vs slab for kernbench, 
aim9, and sysbench before slub was ever merged.  If you're going to use 
slab-like queueing in slub, it would be interesting to see if these 
particular benchmarks regress once again.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 01/16] [PATCH] ipc/sem.c: Bugfix for semop() not reporting successful operation
  2010-06-28 16:45     ` Manfred Spraul
@ 2010-06-28 23:58       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-28 23:58 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Christoph Lameter, Pekka Enberg, linux-mm, Nick Piggin, Matt Mackall

On Mon, 28 Jun 2010 18:45:25 +0200
Manfred Spraul <manfred@colorfullife.com> wrote:

> On 06/28/2010 04:17 AM, KAMEZAWA Hiroyuki wrote:
> > On Fri, 25 Jun 2010 16:20:27 -0500
> > Christoph Lameter<cl@linux-foundation.org>  wrote:
> >
> >    
> >> [Necessary to make 2.6.35-rc3 not deadlock. Not sure if this is the "right"(tm)
> >> fix]
> >>
> >> The last change to improve the scalability moved the actual wake-up out of
> >> the section that is protected by spin_lock(sma->sem_perm.lock).
> >>
> >> This means that IN_WAKEUP can be in queue.status even when the spinlock is
> >> acquired by the current task. Thus the same loop that is performed when
> >> queue.status is read without the spinlock acquired must be performed when
> >> the spinlock is acquired.
> >>
> >> Signed-off-by: Manfred Spraul<manfred@colorfullife.com>
> >> Signed-off-by: Christoph Lameter<cl@linux-foundation.org>
> >>      
> >
> > Hmm, I'm sorry if I don't understand the code...
> >
> >    
> >> ---
> >>   ipc/sem.c |   36 ++++++++++++++++++++++++++++++------
> >>   1 files changed, 30 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/ipc/sem.c b/ipc/sem.c
> >> index 506c849..523665f 100644
> >> --- a/ipc/sem.c
> >> +++ b/ipc/sem.c
> >> @@ -1256,6 +1256,32 @@ out:
> >>   	return un;
> >>   }
> >>
> >> +
> >> +/** get_queue_result - Retrieve the result code from sem_queue
> >> + * @q: Pointer to queue structure
> >> + *
> >> + * The function retrieve the return code from the pending queue. If
> >> + * IN_WAKEUP is found in q->status, then we must loop until the value
> >> + * is replaced with the final value: This may happen if a task is
> >> + * woken up by an unrelated event (e.g. signal) and in parallel the task
> >> + * is woken up by another task because it got the requested semaphores.
> >> + *
> >> + * The function can be called with or without holding the semaphore spinlock.
> >> + */
> >> +static int get_queue_result(struct sem_queue *q)
> >> +{
> >> +	int error;
> >> +
> >> +	error = q->status;
> >> +	while(unlikely(error == IN_WAKEUP)) {
> >> +		cpu_relax();
> >> +		error = q->status;
> >> +	}
> >> +
> >> +	return error;
> >> +}
> >>      
> > no memory barrier is required ?
> >
> >    
> No.
> q->status is the only field that is read in the exit path of 
> sys_semtimedop():
> After that, q->status is used as the return value of sys_semtimedop(), 
> without accessing any other field.
> Thus no memory barrier is required: there is just no other read/write 
> operation against which the read of q->status must be serialized.
> 
> There is a smp_wmb() wake_up_sem_queue_do(), to ensure that all writes 
> that are done by the cpu that does the wake-up are completed before 
> q->status is set to the final value.
> 

Thanks. BTW, does cpu_relax() always include asm("":::"memory") so that
the compiler cannot optimize the re-read away?
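(For reference, on x86 it does. Approximately the following, quoted from
memory; other architectures provide their own definition with an
equivalent "memory" clobber:)

	/* arch/x86/include/asm/processor.h, approximately */
	static inline void rep_nop(void)
	{
		asm volatile("rep; nop" ::: "memory");
	}

	static inline void cpu_relax(void)
	{
		rep_nop();
	}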

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 09/16] [percpu] make allocpercpu usable during early boot
  2010-06-26  8:10   ` Tejun Heo
  2010-06-26 23:53     ` David Rientjes
@ 2010-06-29 15:15     ` Christoph Lameter
  2010-06-29 15:30       ` Tejun Heo
  1 sibling, 1 reply; 72+ messages in thread
From: Christoph Lameter @ 2010-06-29 15:15 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Pekka Enberg, linux-mm, Nick Piggin, Matt Mackall

On Sat, 26 Jun 2010, Tejun Heo wrote:

> Christoph, how do you wanna route these patches?  I already have the
> other two patches in the percpu tree, I can push this there too, which
> then you can pull into the allocator tree.

Please push via your trees. Let's keep stuff subsystem-specific if
possible.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench
  2010-06-28 15:18       ` Pekka Enberg
  2010-06-28 18:54         ` David Rientjes
@ 2010-06-29 15:21         ` Christoph Lameter
  1 sibling, 0 replies; 72+ messages in thread
From: Christoph Lameter @ 2010-06-29 15:21 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Nick Piggin, linux-mm, Matt Mackall, David Rientjes, Mel Gorman

On Mon, 28 Jun 2010, Pekka Enberg wrote:

> I guess "netperf TCP_RR" is the most interesting one because that's a
> known benchmark where SLUB performs poorly when compared to SLAB.
> Mel's extensive slab benchmarks are also worth looking at:

I will look at it when I get time, but I am on vacation right now and
sitting in the hospital with my son, who managed to get himself there on
the first day of the "vacation". Guess it will take a week or so at least.



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench
  2010-06-28 18:54         ` David Rientjes
@ 2010-06-29 15:23           ` Christoph Lameter
  2010-06-29 15:55             ` Mike Travis
  0 siblings, 1 reply; 72+ messages in thread
From: Christoph Lameter @ 2010-06-29 15:23 UTC (permalink / raw)
  To: David Rientjes
  Cc: Pekka Enberg, Nick Piggin, linux-mm, Matt Mackall, Mel Gorman, travis

On Mon, 28 Jun 2010, David Rientjes wrote:

> In addition to that benchmark, which regresses on systems with larger
> numbers of cpus, you had posted results for slub vs slab for kernbench,
> aim9, and sysbench before slub was ever merged.  If you're going to use
> slab-like queueing in slub, it would be interesting to see if these
> particular benchmarks regress once again.

I do not have access to Itanium systems anymore. I hope Mike can run some
benchmarks?


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 09/16] [percpu] make allocpercpu usable during early boot
  2010-06-26 23:38   ` David Rientjes
@ 2010-06-29 15:26     ` Christoph Lameter
  0 siblings, 0 replies; 72+ messages in thread
From: Christoph Lameter @ 2010-06-29 15:26 UTC (permalink / raw)
  To: David Rientjes; +Cc: Pekka Enberg, linux-mm, tj, Nick Piggin, Matt Mackall

On Sat, 26 Jun 2010, David Rientjes wrote:

> On Fri, 25 Jun 2010, Christoph Lameter wrote:
>
> > allocpercpu() may be used during early boot after the page allocator
> > has been bootstrapped but when interrupts are still off. Make sure
> > that we do not do GFP_KERNEL allocations if this occurs.
> Why isn't this being handled at a lower level, specifically in the slab
> allocator to prevent GFP_KERNEL from being used when irqs are disabled?
> We'll otherwise need to audit all slab allocations from the boot cpu for
> correctness.

It is handled at a lower level when slab allocates from the page
allocator. But the logic that checks the flags passed to the slab
allocator does not mask the bits, and it seems that is the way people
want it to be. So we have to explicitly mask GFP_KERNEL at these call
sites.
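I.e. the pattern already used elsewhere in this series, for example in
early_kmem_cache_node_alloc():

	/* strip whatever gfp_allowed_mask has not enabled yet, e.g.
	 * __GFP_WAIT while irqs are still disabled during boot */
	page = new_slab(kmalloc_caches, GFP_KERNEL & gfp_allowed_mask, node);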



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 09/16] [percpu] make allocpercpu usable during early boot
  2010-06-29 15:15     ` Christoph Lameter
@ 2010-06-29 15:30       ` Tejun Heo
  2010-07-06 20:41         ` Christoph Lameter
  0 siblings, 1 reply; 72+ messages in thread
From: Tejun Heo @ 2010-06-29 15:30 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-mm, Nick Piggin, Matt Mackall

On 06/29/2010 05:15 PM, Christoph Lameter wrote:
> On Sat, 26 Jun 2010, Tejun Heo wrote:
> 
>> Christoph, how do you wanna route these patches?  I already have the
>> other two patches in the percpu tree, I can push this there too, which
>> then you can pull into the allocator tree.
> 
> Please push via your trees. Lets keep stuff subsystem specific if
> possible.

Sure, please feel free to pull from the following tree.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git for-next

I haven't committed the gfp_allowed_mask patch yet.  I'll commit it
once it gets resolved.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 08/16] slub: remove dynamic dma slab allocation
  2010-06-26 23:52   ` David Rientjes
@ 2010-06-29 15:31     ` Christoph Lameter
  0 siblings, 0 replies; 72+ messages in thread
From: Christoph Lameter @ 2010-06-29 15:31 UTC (permalink / raw)
  To: David Rientjes; +Cc: Pekka Enberg, linux-mm, Nick Piggin, Matt Mackall

On Sat, 26 Jun 2010, David Rientjes wrote:

> > -	page = new_slab(kmalloc_caches, gfpflags, node);
> > +	page = new_slab(kmalloc_caches, GFP_KERNEL, node);
> >
> >  	BUG_ON(!page);
> >  	if (page_to_nid(page) != node) {
>
> This still passes GFP_KERNEL to the page allocator when not allowed by
> gfp_allowed_mask for early (non SLAB_CACHE_DMA) users of
> create_kmalloc_cache().

Right, a later patch changes that. I could fold that hunk in here.

> > @@ -2157,11 +2157,11 @@ static int init_kmem_cache_nodes(struct
> >  		struct kmem_cache_node *n;
> >
> >  		if (slab_state == DOWN) {
> > -			early_kmem_cache_node_alloc(gfpflags, node);
> > +			early_kmem_cache_node_alloc(node);
> >  			continue;
> >  		}
> >  		n = kmem_cache_alloc_node(kmalloc_caches,
> > -						gfpflags, node);
> > +						GFP_KERNEL, node);
> >
> >  		if (!n) {
> >  			free_kmem_cache_nodes(s);
>
> slab_state != DOWN is still not an indication that GFP_KERNEL is safe; in
> fact, all users of GFP_KERNEL from kmem_cache_init() are unsafe.  These
> need to be GFP_NOWAIT.

slab_state == DOWN is a sure indicator that kmem_cache_alloc_node is not
functional. That is what we need to know here.

> > +#ifdef CONFIG_ZONE_DMA
> > +	int i;
> > +
> > +	for (i = 0; i < SLUB_PAGE_SHIFT; i++) {
> > +		struct kmem_cache *s = &kmalloc_caches[i];
> > +
> > +		if (s && s->size) {
> > +			char *name = kasprintf(GFP_KERNEL,
> > +				 "dma-kmalloc-%d", s->objsize);
> > +
>
> You're still not handling the case where !name, which kasprintf() can
> return both here and in kmem_cache_init().  Nameless caches aren't allowed
> for CONFIG_SLUB_DEBUG.

It was not handled before either. I can come up with a patch, but frankly
this is a rare corner case and not a high priority to get done.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 10/16] slub: Remove static kmem_cache_cpu array for boot
  2010-06-27  0:02   ` David Rientjes
@ 2010-06-29 15:35     ` Christoph Lameter
  0 siblings, 0 replies; 72+ messages in thread
From: Christoph Lameter @ 2010-06-29 15:35 UTC (permalink / raw)
  To: David Rientjes
  Cc: Pekka Enberg, linux-mm, Tejun Heo, Nick Piggin, Matt Mackall

On Sat, 26 Jun 2010, David Rientjes wrote:

> > @@ -2105,7 +2096,7 @@ static void early_kmem_cache_node_alloc(
> >
> >  	BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
> >
> > -	page = new_slab(kmalloc_caches, GFP_KERNEL, node);
> > +	page = new_slab(kmalloc_caches, GFP_KERNEL & gfp_allowed_mask, node);
> >
> >  	BUG_ON(!page);
> >  	if (page_to_nid(page) != node) {
>
> This needs to be merged into the preceding patch since it had broken new
> slab allocations during early boot while irqs are still disabled; it also
> seems deserving of a big fat comment about why it's required in this
> situation.

AFAICT the earlier patch did not break anything but left existing
behavior the way it was. Breakage would occur in this patch because it
results in allocations occurring earlier during boot.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 02/16] [PATCH 1/2] percpu: make @dyn_size always mean min dyn_size in first chunk init functions
  2010-06-27  5:06   ` David Rientjes
  2010-06-27  8:21     ` Tejun Heo
@ 2010-06-29 15:36     ` Christoph Lameter
  1 sibling, 0 replies; 72+ messages in thread
From: Christoph Lameter @ 2010-06-29 15:36 UTC (permalink / raw)
  To: David Rientjes
  Cc: Pekka Enberg, linux-mm, Tejun Heo, Nick Piggin, Matt Mackall

Hmmm... Ok, it seems to have been mangled by quilt send.



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 04/16] slub: Use a constant for a unspecified node.
  2010-06-28  2:25   ` KAMEZAWA Hiroyuki
@ 2010-06-29 15:38     ` Christoph Lameter
  0 siblings, 0 replies; 72+ messages in thread
From: Christoph Lameter @ 2010-06-29 15:38 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Pekka Enberg, linux-mm, David Rientjes, Nick Piggin, Matt Mackall

On Mon, 28 Jun 2010, KAMEZAWA Hiroyuki wrote:

> On Fri, 25 Jun 2010 16:20:30 -0500
> Christoph Lameter <cl@linux-foundation.org> wrote:
>
> > kmalloc_node() and friends can be passed a constant -1 to indicate
> > that no choice was made for the node from which the object needs to
> > come.
> >
> > Use NUMA_NO_NODE instead of -1.
> >
> > Signed-off-by: David Rientjes <rientjes@google.com>
> > Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> >
> Reviewd-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> How about more updates ?

That would be a great idea. Can you take over this patch and add the
missing pieces? I don't have much time in the next few weeks, and I am
also on vacation.
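
For readers following along, the shape of the cleanup is simply the
following (an illustrative snippet, not a hunk from the actual patch;
NUMA_NO_NODE is the constant from <linux/numa.h>, defined as -1):

#include <linux/slab.h>
#include <linux/numa.h>

static void *alloc_from_any_node(size_t size)
{
        /* Before the patch this was kmalloc_node(size, GFP_KERNEL, -1). */
        return kmalloc_node(size, GFP_KERNEL, NUMA_NO_NODE);
}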


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 08/16] slub: remove dynamic dma slab allocation
  2010-06-28  2:33   ` KAMEZAWA Hiroyuki
@ 2010-06-29 15:41     ` Christoph Lameter
  2010-06-30  0:26       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 72+ messages in thread
From: Christoph Lameter @ 2010-06-29 15:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Pekka Enberg, linux-mm, Nick Piggin, Matt Mackall

On Mon, 28 Jun 2010, KAMEZAWA Hiroyuki wrote:

> Uh...I think just using GFP_KERNEL drops too much
> requests-from-user-via-gfp_mask.

Sorry, I do not understand what the issue is. The DMA slabs are allocated
while user space is not yet active.

Please do not quote diff hunks that you do not comment on. I am on a slow
link (on vacation) and it's awkward to check for comments...


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 01/16] [PATCH] ipc/sem.c: Bugfix for semop() not reporting successful operation
  2010-06-28 16:48     ` Pekka Enberg
@ 2010-06-29 15:42       ` Christoph Lameter
  -1 siblings, 0 replies; 72+ messages in thread
From: Christoph Lameter @ 2010-06-29 15:42 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: linux-mm, Manfred Spraul, Nick Piggin, Matt Mackall, Andrew Morton, LKML

This is a patch from Manfred. Required to make 2.6.35-rc3 work.



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 09/16] [percpu] make allocpercpu usable during early boot
  2010-06-28 17:03   ` Pekka Enberg
@ 2010-06-29 15:45     ` Christoph Lameter
  2010-07-01  6:23       ` Pekka Enberg
  0 siblings, 1 reply; 72+ messages in thread
From: Christoph Lameter @ 2010-06-29 15:45 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, tj, Nick Piggin, Matt Mackall, David Rientjes

On Mon, 28 Jun 2010, Pekka Enberg wrote:
> > +               return kzalloc(size, GFP_KERNEL & gfp_allowed_mask);
> >        else {
> >                void *ptr = vmalloc(size);
> >                if (ptr)
>
> This looks wrong to me. All slab allocators should do gfp_allowed_mask
> magic under the hood. Maybe it's triggering kmalloc_large() path that
> needs the masking too?

They do the gfp_allowed_mask magic. But the checks at function entry of
the slab allocators do not mask the flags, so we get false positives
without this. All my protests against the checks doing it in this (IMHO
broken) way were ignored.
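
The kind of entry-point check being referred to looks roughly like this
(a simplified sketch with a made-up helper name, not the exact slub.c
code):

static inline void slab_alloc_entry_checks(gfp_t gfpflags)
{
        /*
         * These debug checks see the caller's raw flags.  During early
         * boot a caller may legitimately pass GFP_KERNEL while irqs are
         * still disabled, relying on the allocator to apply
         * gfp_allowed_mask later; checking before masking then triggers
         * a false positive might_sleep() warning.
         */
        lockdep_trace_alloc(gfpflags);
        might_sleep_if(gfpflags & __GFP_WAIT);
}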

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench
  2010-06-29 15:23           ` Christoph Lameter
@ 2010-06-29 15:55             ` Mike Travis
  0 siblings, 0 replies; 72+ messages in thread
From: Mike Travis @ 2010-06-29 15:55 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Rientjes, Pekka Enberg, Nick Piggin, linux-mm,
	Matt Mackall, Mel Gorman



Christoph Lameter wrote:
> On Mon, 28 Jun 2010, David Rientjes wrote:
> 
>> In addition to that benchmark, which regresses on systems with larger
>> numbers of cpus, you had posted results for slub vs slab for kernbench,
>> aim9, and sysbench before slub was ever merged.  If you're going to use
>> slab-like queueing in slub, it would be interesting to see if these
>> particular benchmarks regress once again.
> 
I do not have access to Itanium systems anymore. I hope Mike can run some
benchmarks.
> 

Sure, but I won't have a lot of time as we're pushing out the first
customer UV systems and that's keeping me pretty busy.

If it's all packaged up and ready to run, that would help a lot.

Thanks,
Mike


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 01/16] [PATCH] ipc/sem.c: Bugfix for semop() not reporting successful operation
  2010-06-29 15:42       ` Christoph Lameter
  (?)
@ 2010-06-29 19:08       ` Andrew Morton
  2010-06-30 19:38           ` Manfred Spraul
  -1 siblings, 1 reply; 72+ messages in thread
From: Andrew Morton @ 2010-06-29 19:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, linux-mm, Manfred Spraul, Nick Piggin, Matt Mackall, LKML

On Tue, 29 Jun 2010 10:42:42 -0500 (CDT)
Christoph Lameter <cl@linux-foundation.org> wrote:

> This is a patch from Manfred. Required to make 2.6.35-rc3 work.
> 

My current version of the patch is below.

I believe that Luca has still seen problems with this patch applied, so
its current status is "stuck, awaiting developments".

Is that a correct determination?

Thanks.


From: Manfred Spraul <manfred@colorfullife.com>

The last change to improve the scalability moved the actual wake-up out of
the section that is protected by spin_lock(sma->sem_perm.lock).

This means that IN_WAKEUP can be in queue.status even when the spinlock is
acquired by the current task.  Thus the same loop that is performed when
queue.status is read without the spinlock acquired must be performed when
the spinlock is acquired.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=16255

[akpm@linux-foundation.org: clean up kerneldoc, checkpatch warning and whitespace]
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reported-by: Luca Tettamanti <kronos.it@gmail.com>
Tested-by: Luca Tettamanti <kronos.it@gmail.com>
Reported-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 ipc/sem.c |   37 +++++++++++++++++++++++++++++++------
 1 file changed, 31 insertions(+), 6 deletions(-)

diff -puN ipc/sem.c~ipc-semc-bugfix-for-semop-not-reporting-successful-operation ipc/sem.c
--- a/ipc/sem.c~ipc-semc-bugfix-for-semop-not-reporting-successful-operation
+++ a/ipc/sem.c
@@ -1256,6 +1256,33 @@ out:
 	return un;
 }
 
+
+/**
+ * get_queue_result - Retrieve the result code from sem_queue
+ * @q: Pointer to queue structure
+ *
+ * Retrieve the return code from the pending queue. If IN_WAKEUP is found in
+ * q->status, then we must loop until the value is replaced with the final
+ * value: This may happen if a task is woken up by an unrelated event (e.g.
+ * signal) and in parallel the task is woken up by another task because it got
+ * the requested semaphores.
+ *
+ * The function can be called with or without holding the semaphore spinlock.
+ */
+static int get_queue_result(struct sem_queue *q)
+{
+	int error;
+
+	error = q->status;
+	while (unlikely(error == IN_WAKEUP)) {
+		cpu_relax();
+		error = q->status;
+	}
+
+	return error;
+}
+
+
 SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 		unsigned, nsops, const struct timespec __user *, timeout)
 {
@@ -1409,11 +1436,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, 
 	else
 		schedule();
 
-	error = queue.status;
-	while(unlikely(error == IN_WAKEUP)) {
-		cpu_relax();
-		error = queue.status;
-	}
+	error = get_queue_result(&queue);
 
 	if (error != -EINTR) {
 		/* fast path: update_queue already obtained all requested
@@ -1427,10 +1450,12 @@ SYSCALL_DEFINE4(semtimedop, int, semid, 
 		goto out_free;
 	}
 
+	error = get_queue_result(&queue);
+
 	/*
 	 * If queue.status != -EINTR we are woken up by another process
 	 */
-	error = queue.status;
+
 	if (error != -EINTR) {
 		goto out_unlock_free;
 	}
_


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 08/16] slub: remove dynamic dma slab allocation
  2010-06-29 15:41     ` Christoph Lameter
@ 2010-06-30  0:26       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 72+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-30  0:26 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-mm, Nick Piggin, Matt Mackall

On Tue, 29 Jun 2010 10:41:59 -0500 (CDT)
Christoph Lameter <cl@linux-foundation.org> wrote:

> On Mon, 28 Jun 2010, KAMEZAWA Hiroyuki wrote:
> 
> > Uh...I think just using GFP_KERNEL drops too much
> > requests-from-user-via-gfp_mask.
> 
> Sorry I do not understand what the issue is? The dma slabs are allocated
> while user space is not active yet.
> 
Sorry, I misunderstood the patch. It seems OK now.

> Please do not quote diff hunks that you do not comment on. I am on a slow
> link (vacation) and its awkward to check for comments...

Sure.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 01/16] [PATCH] ipc/sem.c: Bugfix for semop() not reporting successful operation
  2010-06-29 19:08       ` Andrew Morton
@ 2010-06-30 19:38           ` Manfred Spraul
  0 siblings, 0 replies; 72+ messages in thread
From: Manfred Spraul @ 2010-06-30 19:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, linux-mm, Nick Piggin,
	Matt Mackall, LKML

Hi Andrew,

On 06/29/2010 09:08 PM, Andrew Morton wrote:
> On Tue, 29 Jun 2010 10:42:42 -0500 (CDT)
> Christoph Lameter<cl@linux-foundation.org>  wrote:
>
>    
>> This is a patch from Manfred. Required to make 2.6.35-rc3 work.
>>
>>      
> My current version of the patch is below.
>
> I believe that Luca has still seen problems with this patch applied so
> its current status is "stuck, awaiting developments".
>
> Is that a correct determination?
>    

I would propose that you forward a patch to Linus - either the one you 
have in your tree or the v2 that I've just posted.
With stock 2.6.35-rc3, my semtimedop() stress tests produce an oops or
an invalid return value (i.e., semtimedop() returns 1) within a fraction
of a second.

With either of the patches applied, my test apps show the expected behavior.

--
     Manfred



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 01/16] [PATCH] ipc/sem.c: Bugfix for semop() not reporting successful operation
  2010-06-30 19:38           ` Manfred Spraul
@ 2010-06-30 19:51             ` Andrew Morton
  -1 siblings, 0 replies; 72+ messages in thread
From: Andrew Morton @ 2010-06-30 19:51 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Christoph Lameter, Pekka Enberg, linux-mm, Nick Piggin,
	Matt Mackall, LKML

On Wed, 30 Jun 2010 21:38:43 +0200
Manfred Spraul <manfred@colorfullife.com> wrote:

> Hi Andrew,
> 
> On 06/29/2010 09:08 PM, Andrew Morton wrote:
> > On Tue, 29 Jun 2010 10:42:42 -0500 (CDT)
> > Christoph Lameter<cl@linux-foundation.org>  wrote:
> >
> >    
> >> This is a patch from Manfred. Required to make 2.6.35-rc3 work.
> >>
> >>      
> > My current version of the patch is below.
> >
> > I believe that Luca has still seen problems with this patch applied so
> > its current status is "stuck, awaiting developments".
> >
> > Is that a correct determination?
> >    
> 
> I would propose that you forward a patch to Linus - either the one you 
> have in your tree or the v2 that I've just posted.

OK, I added the incremental change:

--- a/ipc/sem.c~ipc-semc-bugfix-for-semop-not-reporting-successful-operation-update
+++ a/ipc/sem.c
@@ -1440,7 +1440,14 @@ SYSCALL_DEFINE4(semtimedop, int, semid, 
 
 	if (error != -EINTR) {
 		/* fast path: update_queue already obtained all requested
-		 * resources */
+		 * resources.
+		 * Perform a smp_mb(): User space could assume that semop()
+		 * is a memory barrier: Without the mb(), the cpu could
+		 * speculatively read in user space stale data that was
+		 * overwritten by the previous owner of the semaphore.
+		 */
+		smp_mb();
+
 		goto out_free;
 	}
 
_

> With stock 2.6.35-rc3, my semtimedop() stress tests produces an oops or 
> an invalid return value (i.e.:semtimedop() returns with "1") within a 
> fraction of a second.
> 
> With either of the patches applied, my test apps show the expected behavior.

OK, I'll queue it up.
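
To illustrate the user-space assumption that the new comment describes,
here is a minimal sketch (illustrative only) of code that relies on
semop() acting as a full memory barrier:

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

static struct sembuf sem_unlock = { .sem_num = 0, .sem_op = 1,  .sem_flg = 0 };
static struct sembuf sem_lock   = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };

void producer(int semid, int *shared)
{
        shared[0] = 42;                 /* write the data ...             */
        semop(semid, &sem_unlock, 1);   /* ... then release the semaphore */
}

void consumer(int semid, int *shared)
{
        int value;

        semop(semid, &sem_lock, 1);     /* acquire the semaphore ...      */
        value = shared[0];              /* ... and expect to see 42 here  */
        (void)value;
}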

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 09/16] [percpu] make allocpercpu usable during early boot
  2010-06-29 15:45     ` Christoph Lameter
@ 2010-07-01  6:23       ` Pekka Enberg
  2010-07-06 14:32         ` Christoph Lameter
  0 siblings, 1 reply; 72+ messages in thread
From: Pekka Enberg @ 2010-07-01  6:23 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-mm, tj, Nick Piggin, Matt Mackall, David Rientjes

> On Mon, 28 Jun 2010, Pekka Enberg wrote:
>> > +               return kzalloc(size, GFP_KERNEL & gfp_allowed_mask);
>> >        else {
>> >                void *ptr = vmalloc(size);
>> >                if (ptr)
>>
>> This looks wrong to me. All slab allocators should do gfp_allowed_mask
>> magic under the hood. Maybe it's triggering kmalloc_large() path that
>> needs the masking too?

On Tue, Jun 29, 2010 at 6:45 PM, Christoph Lameter
<cl@linux-foundation.org> wrote:
> They do gfp_allowed_mask magic. But the checks at function entry of the
> slabs do not mask the masks so we get false positives without this. All my
> protest against the checks doing it this IMHO broken way were ignored.

Which checks are those? Are they in SLUB proper or are they introduced
in one of the SLEB patches? We definitely don't want to expose
gfp_allowed_mask here.

                        Pekka


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 09/16] [percpu] make allocpercpu usable during early boot
  2010-07-01  6:23       ` Pekka Enberg
@ 2010-07-06 14:32         ` Christoph Lameter
  2010-07-31  9:39           ` Pekka Enberg
  0 siblings, 1 reply; 72+ messages in thread
From: Christoph Lameter @ 2010-07-06 14:32 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: linux-mm, tj, Nick Piggin, Matt Mackall, David Rientjes

On Thu, 1 Jul 2010, Pekka Enberg wrote:

> > On Mon, 28 Jun 2010, Pekka Enberg wrote:
> >> > +               return kzalloc(size, GFP_KERNEL & gfp_allowed_mask);
> >> >        else {
> >> >                void *ptr = vmalloc(size);
> >> >                if (ptr)
> >>
> >> This looks wrong to me. All slab allocators should do gfp_allowed_mask
> >> magic under the hood. Maybe it's triggering kmalloc_large() path that
> >> needs the masking too?
>
> On Tue, Jun 29, 2010 at 6:45 PM, Christoph Lameter
> <cl@linux-foundation.org> wrote:
> > They do gfp_allowed_mask magic. But the checks at function entry of the
> > slabs do not mask the masks so we get false positives without this. All my
> > protest against the checks doing it this IMHO broken way were ignored.
>
> Which checks are those? Are they in SLUB proper or are they introduced
> in one of the SLEB patches? We definitely don't want to expose
> gfp_allowed_mask here.

Argh. The reason for the trouble here is that I moved the masking of the
gfp flags out of the hot path.

The masking of the bits now adds to the cache footprint of the hot paths
in all slab allocators. Gosh. Why is there constant contamination of the
hot paths with this stuff?

We only need this masking in the hot path if the debugging hooks need it.
Otherwise it's fine to defer it to the slow paths.

So how do I get that in there? Add "& gfp_allowed_mask" to the gfp mask
passed to the debugging hooks?

Or add a debug_hooks_alloc() function and make it empty if no debugging
functions are enabled?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 09/16] [percpu] make allocpercpu usable during early boot
  2010-06-29 15:30       ` Tejun Heo
@ 2010-07-06 20:41         ` Christoph Lameter
  0 siblings, 0 replies; 72+ messages in thread
From: Christoph Lameter @ 2010-07-06 20:41 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Pekka Enberg, linux-mm, Nick Piggin, Matt Mackall

On Tue, 29 Jun 2010, Tejun Heo wrote:

> I haven't committed the gfp_allowed_mask patch yet.  I'll commit it
> once it gets resolved.

Don't commit. I dropped it myself.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 07/16] slub: discard_slab_unlock
  2010-06-26 23:34   ` David Rientjes
@ 2010-07-06 20:44     ` Christoph Lameter
  0 siblings, 0 replies; 72+ messages in thread
From: Christoph Lameter @ 2010-07-06 20:44 UTC (permalink / raw)
  To: David Rientjes; +Cc: Pekka Enberg, linux-mm, Nick Piggin, Matt Mackall

On Sat, 26 Jun 2010, David Rientjes wrote:

> > The sequence of unlocking a slab and freeing occurs multiple times.
> > Put the common into a single function.
> >
>
> Did you want to respond to the comments I made about this patch at
> http://marc.info/?l=linux-mm&m=127689747432061 ?  Specifically, how it
> makes seeing if there are unmatched slab_lock() -> slab_unlock() pairs
> more difficult.

I don't think so. The name includes slab_unlock at the end. We could drop
this, but it's a frequent operation needed when disposing of a slab page.
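
For context, the helper under discussion is essentially the following (an
illustrative reconstruction based on the patch description, not
necessarily the code as merged):

static void discard_slab_unlock(struct kmem_cache *s, struct page *page)
{
        /*
         * Drop the slab lock first, then hand the page back to the page
         * allocator; this unlock-then-free pair shows up in several
         * places in slub.c.
         */
        slab_unlock(page);
        discard_slab(s, page);
}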


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [S+Q 09/16] [percpu] make allocpercpu usable during early boot
  2010-07-06 14:32         ` Christoph Lameter
@ 2010-07-31  9:39           ` Pekka Enberg
  0 siblings, 0 replies; 72+ messages in thread
From: Pekka Enberg @ 2010-07-31  9:39 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, tj, Nick Piggin, Matt Mackall, David Rientjes, torvalds, benh

Christoph Lameter wrote:
> On Thu, 1 Jul 2010, Pekka Enberg wrote:
> 
>>> On Mon, 28 Jun 2010, Pekka Enberg wrote:
>>>>> +               return kzalloc(size, GFP_KERNEL & gfp_allowed_mask);
>>>>>        else {
>>>>>                void *ptr = vmalloc(size);
>>>>>                if (ptr)
>>>> This looks wrong to me. All slab allocators should do gfp_allowed_mask
>>>> magic under the hood. Maybe it's triggering kmalloc_large() path that
>>>> needs the masking too?
>> On Tue, Jun 29, 2010 at 6:45 PM, Christoph Lameter
>> <cl@linux-foundation.org> wrote:
>>> They do gfp_allowed_mask magic. But the checks at function entry of the
>>> slabs do not mask the masks so we get false positives without this. All my
>>> protest against the checks doing it this IMHO broken way were ignored.
>> Which checks are those? Are they in SLUB proper or are they introduced
>> in one of the SLEB patches? We definitely don't want to expose
>> gfp_allowed_mask here.
> 
> Argh. The reason for the trouble here is because I moved the
> masking of the gfp flags out of the hot path.
> 
> The masking of the bits adds to the cache footprint of the hotpaths now in
> all slab allocators. Gosh. Why is there constant contamination of the hot
> paths with the stuff?

We should definitely move the gfp_allowed_mask masking out of the fast
paths. I think the ideal solution is to do it only deep in the page
allocator.

> We only need this masking in the hot path if the debugging hooks need it.
> Otherwise its fine to defer this to the slow paths.
> 
> So how do I get that in there? Add "& gfp_allowed_mask" to the gfp mask
> passed to the debugging hooks?
> 
> Or add a debug_hooks_alloc() function and make it empty if no debugging
> functions are enabled?

I think it's best to have separate debugging hooks that are empty if 
debugging is disabled, yes.
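
Roughly along these lines (a sketch with hypothetical names, not the code
that was eventually merged):

#ifdef CONFIG_SLUB_DEBUG
static inline void slab_debug_pre_alloc_hook(gfp_t gfpflags)
{
        /* Mask only for the debug checks, keeping the masking cost out
         * of the production hot path. */
        gfpflags &= gfp_allowed_mask;
        lockdep_trace_alloc(gfpflags);
        might_sleep_if(gfpflags & __GFP_WAIT);
}
#else
static inline void slab_debug_pre_alloc_hook(gfp_t gfpflags) { }
#endif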

			Pekka


^ permalink raw reply	[flat|nested] 72+ messages in thread

end of thread, other threads:[~2010-07-31  9:39 UTC | newest]

Thread overview: 72+ messages
2010-06-25 21:20 [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench Christoph Lameter
2010-06-25 21:20 ` [S+Q 01/16] [PATCH] ipc/sem.c: Bugfix for semop() not reporting successful operation Christoph Lameter
2010-06-28  2:17   ` KAMEZAWA Hiroyuki
2010-06-28 16:45     ` Manfred Spraul
2010-06-28 23:58       ` KAMEZAWA Hiroyuki
2010-06-28 16:48   ` Pekka Enberg
2010-06-28 16:48     ` Pekka Enberg
2010-06-29 15:42     ` Christoph Lameter
2010-06-29 15:42       ` Christoph Lameter
2010-06-29 19:08       ` Andrew Morton
2010-06-30 19:38         ` Manfred Spraul
2010-06-30 19:38           ` Manfred Spraul
2010-06-30 19:51           ` Andrew Morton
2010-06-30 19:51             ` Andrew Morton
2010-06-25 21:20 ` [S+Q 02/16] [PATCH 1/2] percpu: make @dyn_size always mean min dyn_size in first chunk init functions Christoph Lameter
2010-06-27  5:06   ` David Rientjes
2010-06-27  8:21     ` Tejun Heo
2010-06-27 16:57       ` [S+Q 02/16] [PATCH 1/2 UPDATED] " Tejun Heo
2010-06-27 19:25         ` David Rientjes
2010-06-27 19:24       ` [S+Q 02/16] [PATCH 1/2] " David Rientjes
2010-06-29 15:36     ` Christoph Lameter
2010-06-25 21:20 ` [S+Q 03/16] [PATCH 2/2] percpu: allow limited allocation before slab is online Christoph Lameter
2010-06-25 21:20 ` [S+Q 04/16] slub: Use a constant for a unspecified node Christoph Lameter
2010-06-28  2:25   ` KAMEZAWA Hiroyuki
2010-06-29 15:38     ` Christoph Lameter
2010-06-25 21:20 ` [S+Q 05/16] SLUB: Constants need UL Christoph Lameter
2010-06-26 23:31   ` David Rientjes
2010-06-28  2:27   ` KAMEZAWA Hiroyuki
2010-06-25 21:20 ` [S+Q 06/16] slub: Use kmem_cache flags to detect if slab is in debugging mode Christoph Lameter
2010-06-26 23:31   ` David Rientjes
2010-06-25 21:20 ` [S+Q 07/16] slub: discard_slab_unlock Christoph Lameter
2010-06-26 23:34   ` David Rientjes
2010-07-06 20:44     ` Christoph Lameter
2010-06-25 21:20 ` [S+Q 08/16] slub: remove dynamic dma slab allocation Christoph Lameter
2010-06-26 23:52   ` David Rientjes
2010-06-29 15:31     ` Christoph Lameter
2010-06-28  2:33   ` KAMEZAWA Hiroyuki
2010-06-29 15:41     ` Christoph Lameter
2010-06-30  0:26       ` KAMEZAWA Hiroyuki
2010-06-25 21:20 ` [S+Q 09/16] [percpu] make allocpercpu usable during early boot Christoph Lameter
2010-06-26  8:10   ` Tejun Heo
2010-06-26 23:53     ` David Rientjes
2010-06-29 15:15     ` Christoph Lameter
2010-06-29 15:30       ` Tejun Heo
2010-07-06 20:41         ` Christoph Lameter
2010-06-26 23:38   ` David Rientjes
2010-06-29 15:26     ` Christoph Lameter
2010-06-28 17:03   ` Pekka Enberg
2010-06-29 15:45     ` Christoph Lameter
2010-07-01  6:23       ` Pekka Enberg
2010-07-06 14:32         ` Christoph Lameter
2010-07-31  9:39           ` Pekka Enberg
2010-06-25 21:20 ` [S+Q 10/16] slub: Remove static kmem_cache_cpu array for boot Christoph Lameter
2010-06-27  0:02   ` David Rientjes
2010-06-29 15:35     ` Christoph Lameter
2010-06-25 21:20 ` [S+Q 11/16] slub: Dynamically size kmalloc cache allocations Christoph Lameter
2010-06-25 21:20 ` [S+Q 12/16] SLUB: Add SLAB style per cpu queueing Christoph Lameter
2010-06-26  2:32   ` Nick Piggin
2010-06-28 10:19     ` Christoph Lameter
2010-06-25 21:20 ` [S+Q 13/16] SLUB: Resize the new cpu queues Christoph Lameter
2010-06-25 21:20 ` [S+Q 14/16] SLUB: Get rid of useless function count_free() Christoph Lameter
2010-06-25 21:20 ` [S+Q 15/16] SLUB: Remove MAX_OBJS limitation Christoph Lameter
2010-06-25 21:20 ` [S+Q 16/16] slub: Drop allocator announcement Christoph Lameter
2010-06-26  2:24 ` [S+Q 00/16] SLUB with Queueing beats SLAB in hackbench Nick Piggin
2010-06-28  6:18   ` Pekka Enberg
2010-06-28 10:12     ` Christoph Lameter
2010-06-28 15:18       ` Pekka Enberg
2010-06-28 18:54         ` David Rientjes
2010-06-29 15:23           ` Christoph Lameter
2010-06-29 15:55             ` Mike Travis
2010-06-29 15:21         ` Christoph Lameter
2010-06-28 14:46     ` Matt Mackall
