linux-kernel.vger.kernel.org archive mirror
* [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
@ 2008-05-30  3:56 Christoph Lameter
  2008-05-30  3:56 ` [patch 01/41] cpu_alloc: Increase percpu area size to 128k Christoph Lameter
                   ` (41 more replies)
  0 siblings, 42 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

In various places the kernel maintains arrays of pointers indexed by
processor number. These are used to locate objects that need to be used
when executing on a specific processor. Both the slab allocator
and the page allocator use these arrays and there the arrays are used in
performance critical code. The allocpercpu functionality is a simple
allocator to provide these arrays. However, there are certain drawbacks
to using such arrays:

1. The arrays become huge for large systems and may be very sparsely
   populated (if they are dimensioned for NR_CPUS) on an architecture
   like IA64 that allows up to 4k cpus if a kernel is then booted on a
   machine that only supports 8 processors. We could use nr_cpu_ids there
   but we would still have to allocate entries for all possible processors
   up to the number of processor ids. cpu_alloc can deal with sparse
   cpu_maps.

2. The arrays cause surrounding variables to no longer fit into a single
   cacheline. The layout of core data structures is typically optimized so
   that variables frequently used together are placed in the same cacheline.
   Arrays of pointers move these variables far apart and destroy this effect.

3. A processor frequently follows only one pointer for its own use. Thus
   the cacheline with that pointer has to be kept in memory. The neighboring
   pointers all belong to other processors and are rarely used. So a whole
   cacheline of 128 bytes may be consumed while only 8 bytes of information
   are in constant use. It would be better to be able to place more
   information in this cacheline.

4. The lookup of the per cpu object is expensive and requires multiple
   memory accesses to:

   A) smp_processor_id()
   B) pointer to the base of the per cpu pointer array
   C) pointer to the per cpu object in the pointer array
   D) the per cpu object itself (a rough sketch of this access path
      follows this list).

5. Each use of allocpercpu requires its own per cpu pointer array. On large
   systems these large arrays have to be allocated again and again.

6. Processor hotplug cannot effectively track the per cpu objects
   since the VM cannot find all memory that was allocated for
   a specific cpu. It is impossible to add or remove objects in
   a consistent way. Although the allocpercpu subsystem was extended
   to add that capability, it is not used since doing so would require
   adding cpu hotplug callbacks to each and every user of allocpercpu
   in the kernel.
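
To make drawback 4 concrete, the access path through such a pointer
array looks roughly like the following sketch (simplified, not the
literal allocpercpu implementation; stat and counter are made-up names):

	struct stat_struct **stat;	/* NR_CPUS pointers, one per processor */
	struct stat_struct *p;

	/* A) processor number, B) array base, C) pointer slot ... */
	p = stat[smp_processor_id()];
	/* D) ... and only then the per cpu object itself */
	p->counter++;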

The patchset here provides a cpu allocator that arranges data differently.
Objects are placed tightly in linear areas reserved for each processor.
The areas are of a fixed size so that address calculation can be used
instead of a lookup. This means that:

1. The VM knows where all the per cpu variables are and it could remove
   or add cpu areas as cpus come online or go offline.

2. There is only a single per cpu array that is used for the percpu area
   and all per cpu allocations.

3. The lookup of a per cpu object is easy and requires memory accesses
   only to (worst case: the architecture does not provide cpu ops):

   A) the per cpu offset from the per cpu offset table
      (if it is the current processor then there is usually some
      more efficient means of retrieving the offset)
   B) the per cpu pointer to the object
   C) the per cpu object itself.

4. Surrounding variables can be placed in the same cacheline.
   This allows e.g. SLUB to avoid caching objects in per cpu structures
   since the kmem_cache structure is finally available without the need
   to access a cache cold cacheline.

5. A single pointer can be used regardless of the number of processors
   in the system.

The cpu allocator manages a fixed size per cpu data area. The size
can be configured as needed.
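
For illustration, allocation and access with the interfaces introduced
in this series look roughly like this (a minimal sketch; stat is a
made-up example structure):

	struct stat_struct {
		unsigned long counter;
	} *stat;
	unsigned long total = 0;
	int cpu;

	/* one instance per possible processor, packed into the cpu area */
	stat = CPU_ALLOC(struct stat_struct, GFP_KERNEL | __GFP_ZERO);

	/* atomic increment of this processor's instance (see the cpu ops patch) */
	CPU_INC(stat->counter);

	/* walk all instances, f.e. to fold a sum */
	for_each_possible_cpu(cpu)
		total += CPU_PTR(stat, cpu)->counter;

	CPU_FREE(stat);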

The current usage of the cpu area can be seen in the field

	cpu_bytes

in /proc/vmstat

The patchset is against 2.6.26-rc4.

There are two arch implementations of cpu ops provided.

1. x86. Another version of the zero based x86 patches
   exists from Mike.

2. IA64. Limited implementation since IA64 has
   no fast RMW (read-modify-write) ops. But we can avoid the addition
   of my_cpu_offset in hot paths.

This is a rather complex patchset and I am not sure how to merge it.
Maybe it would be best to merge a piece at a time beginning with the
basic infrastructure in the first few patches?

-- 


* [patch 01/41] cpu_alloc: Increase percpu area size to 128k
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-06-02 17:58   ` Luck, Tony
  2008-05-30  3:56 ` [patch 02/41] cpu alloc: The allocator Christoph Lameter
                   ` (40 subsequent siblings)
  41 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: ia64_increase_percpu_size --]
[-- Type: text/plain, Size: 910 bytes --]

The per cpu allocator requires more per cpu space and we are already near
the limit on IA64. Increase the maximum size of the IA64 per cpu area from
64K to 128K.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/asm-ia64/page.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/include/asm-ia64/page.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/page.h	2008-05-29 12:10:42.216486476 -0700
+++ linux-2.6/include/asm-ia64/page.h	2008-05-29 12:11:06.218953049 -0700
@@ -42,7 +42,7 @@
 #define PAGE_MASK		(~(PAGE_SIZE - 1))
 #define PAGE_ALIGN(addr)	(((addr) + PAGE_SIZE - 1) & PAGE_MASK)
 
-#define PERCPU_PAGE_SHIFT	16	/* log2() of max. size of per-CPU area */
+#define PERCPU_PAGE_SHIFT	17	/* log2() of max. size of per-CPU area */
 #define PERCPU_PAGE_SIZE	(__IA64_UL_CONST(1) << PERCPU_PAGE_SHIFT)
 
 

-- 


* [patch 02/41] cpu alloc: The allocator
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
  2008-05-30  3:56 ` [patch 01/41] cpu_alloc: Increase percpu area size to 128k Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  4:58   ` Andrew Morton
                     ` (3 more replies)
  2008-05-30  3:56 ` [patch 03/41] cpu alloc: Use cpu allocator instead of the builtin modules per cpu allocator Christoph Lameter
                   ` (39 subsequent siblings)
  41 siblings, 4 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_base --]
[-- Type: text/plain, Size: 9598 bytes --]

The per cpu allocator allows dynamic allocation of memory on all
processors simultaneously. A bitmap is used to track used areas.
The allocator implements tight packing to reduce the cache footprint
and increase speed since cacheline contention is typically not a concern
for memory mainly used by a single cpu. Small objects will fill up gaps
left by larger allocations that required alignments.

The size of the cpu_alloc area can be changed via make menuconfig.
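
As an illustration of the packing behavior described above, a sequence
of raw calls could look like this (a sketch only; the sizes and
alignments are arbitrary examples):

	/* 40 bytes, 64-byte aligned: may leave a gap in the unit bitmap */
	void *a = cpu_alloc(40, GFP_KERNEL, 64);

	/* a later small allocation can fill such a gap */
	void *b = cpu_alloc(8, GFP_KERNEL | __GFP_ZERO, sizeof(int));

	cpu_free(b, 8);
	cpu_free(a, 40);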

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/linux/percpu.h |   46 +++++++++++++
 include/linux/vmstat.h |    2 
 mm/Kconfig             |    6 +
 mm/Makefile            |    2 
 mm/cpu_alloc.c         |  167 +++++++++++++++++++++++++++++++++++++++++++++++++
 mm/vmstat.c            |    1 
 6 files changed, 222 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h	2008-05-29 19:41:21.000000000 -0700
+++ linux-2.6/include/linux/vmstat.h	2008-05-29 20:15:37.000000000 -0700
@@ -37,7 +37,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
 		FOR_ALL_ZONES(PGSCAN_KSWAPD),
 		FOR_ALL_ZONES(PGSCAN_DIRECT),
 		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
-		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+		PAGEOUTRUN, ALLOCSTALL, PGROTATED, CPU_BYTES,
 #ifdef CONFIG_HUGETLB_PAGE
 		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
 #endif
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig	2008-05-29 19:41:21.000000000 -0700
+++ linux-2.6/mm/Kconfig	2008-05-29 20:13:39.000000000 -0700
@@ -205,3 +205,9 @@ config NR_QUICK
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
+
+config CPU_ALLOC_SIZE
+	int "Size of cpu alloc area"
+	default "30000"
+	help
+	  Sets the maximum amount of memory that can be allocated via cpu_alloc
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile	2008-05-29 19:41:21.000000000 -0700
+++ linux-2.6/mm/Makefile	2008-05-29 20:15:41.000000000 -0700
@@ -11,7 +11,7 @@ obj-y			:= bootmem.o filemap.o mempool.o
 			   maccess.o page_alloc.o page-writeback.o pdflush.o \
 			   readahead.o swap.o truncate.o vmscan.o \
 			   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
-			   page_isolation.o $(mmu-y)
+			   page_isolation.o cpu_alloc.o $(mmu-y)
 
 obj-$(CONFIG_PROC_PAGE_MONITOR) += pagewalk.o
 obj-$(CONFIG_BOUNCE)	+= bounce.o
Index: linux-2.6/mm/cpu_alloc.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/cpu_alloc.c	2008-05-29 20:13:39.000000000 -0700
@@ -0,0 +1,167 @@
+/*
+ * Cpu allocator - Manage objects allocated for each processor
+ *
+ * (C) 2008 SGI, Christoph Lameter <clameter@sgi.com>
+ * 	Basic implementation with allocation and free from a dedicated per
+ * 	cpu area.
+ *
+ * The per cpu allocator allows dynamic allocation of memory on all
+ * processor simultaneously. A bitmap is used to track used areas.
+ * The allocator implements tight packing to reduce the cache footprint
+ * and increase speed since cacheline contention is typically not a concern
+ * for memory mainly used by a single cpu. Small objects will fill up gaps
+ * left by larger allocations that required alignments.
+ */
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/module.h>
+#include <linux/percpu.h>
+#include <linux/bitmap.h>
+#include <asm/sections.h>
+
+/*
+ * Basic allocation unit. A bit map is created to track the use of each
+ * UNIT_SIZE element in the cpu area.
+ */
+#define UNIT_TYPE int
+#define UNIT_SIZE sizeof(UNIT_TYPE)
+#define UNITS (CONFIG_CPU_ALLOC_SIZE / UNIT_SIZE)
+
+static DEFINE_PER_CPU(UNIT_TYPE, area[UNITS]);
+
+/*
+ * How many units are needed for an object of a given size
+ */
+static int size_to_units(unsigned long size)
+{
+	return DIV_ROUND_UP(size, UNIT_SIZE);
+}
+
+/*
+ * Lock to protect the bitmap and the meta data for the cpu allocator.
+ */
+static DEFINE_SPINLOCK(cpu_alloc_map_lock);
+static DECLARE_BITMAP(cpu_alloc_map, UNITS);
+static int first_free;		/* First known free unit */
+
+/*
+ * Mark an object as used in the cpu_alloc_map
+ *
+ * Must hold cpu_alloc_map_lock
+ */
+static void set_map(int start, int length)
+{
+	while (length-- > 0)
+		__set_bit(start++, cpu_alloc_map);
+}
+
+/*
+ * Mark an area as freed.
+ *
+ * Must hold cpu_alloc_map_lock
+ */
+static void clear_map(int start, int length)
+{
+	while (length-- > 0)
+		__clear_bit(start++, cpu_alloc_map);
+}
+
+/*
+ * Allocate an object of a certain size
+ *
+ * Returns a special pointer that can be used with CPU_PTR to find the
+ * address of the object for a certain cpu.
+ */
+void *cpu_alloc(unsigned long size, gfp_t gfpflags, unsigned long align)
+{
+	unsigned long start;
+	int units = size_to_units(size);
+	void *ptr;
+	int first;
+	unsigned long flags;
+
+	if (!size)
+		return ZERO_SIZE_PTR;
+
+	spin_lock_irqsave(&cpu_alloc_map_lock, flags);
+
+	first = 1;
+	start = first_free;
+
+	for ( ; ; ) {
+
+		start = find_next_zero_bit(cpu_alloc_map, UNITS, start);
+		if (start >= UNITS)
+			goto out_of_memory;
+
+		if (first)
+			first_free = start;
+
+		/*
+		 * Check alignment and that there is enough space after
+		 * the starting unit.
+		 */
+		if (start % (align / UNIT_SIZE) == 0 &&
+			find_next_bit(cpu_alloc_map, UNITS, start + 1)
+							>= start + units)
+				break;
+		start++;
+		first = 0;
+	}
+
+	if (first)
+		first_free = start + units;
+
+	if (start + units > UNITS)
+		goto out_of_memory;
+
+	set_map(start, units);
+	__count_vm_events(CPU_BYTES, units * UNIT_SIZE);
+
+	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
+
+	ptr = per_cpu_var(area) + start;
+
+	if (gfpflags & __GFP_ZERO) {
+		int cpu;
+
+		for_each_possible_cpu(cpu)
+			memset(CPU_PTR(ptr, cpu), 0, size);
+	}
+
+	return ptr;
+
+out_of_memory:
+	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
+	return NULL;
+}
+EXPORT_SYMBOL(cpu_alloc);
+
+/*
+ * Free an object. The pointer must be a cpu pointer allocated
+ * via cpu_alloc.
+ */
+void cpu_free(void *start, unsigned long size)
+{
+	unsigned long units = size_to_units(size);
+	unsigned long index = (int *)start - per_cpu_var(area);
+	unsigned long flags;
+
+	if (!start || start == ZERO_SIZE_PTR)
+		return;
+
+	BUG_ON(index >= UNITS ||
+		!test_bit(index, cpu_alloc_map) ||
+		!test_bit(index + units - 1, cpu_alloc_map));
+
+	spin_lock_irqsave(&cpu_alloc_map_lock, flags);
+
+	clear_map(index, units);
+	__count_vm_events(CPU_BYTES, -units * UNIT_SIZE);
+
+	if (index < first_free)
+		first_free = index;
+
+	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
+}
+EXPORT_SYMBOL(cpu_free);
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2008-05-29 19:41:21.000000000 -0700
+++ linux-2.6/mm/vmstat.c	2008-05-29 20:13:39.000000000 -0700
@@ -653,6 +653,7 @@ static const char * const vmstat_text[] 
 	"allocstall",
 
 	"pgrotated",
+	"cpu_bytes",
 #ifdef CONFIG_HUGETLB_PAGE
 	"htlb_buddy_alloc_success",
 	"htlb_buddy_alloc_fail",
Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2008-05-29 19:41:21.000000000 -0700
+++ linux-2.6/include/linux/percpu.h	2008-05-29 20:29:12.000000000 -0700
@@ -135,4 +135,50 @@ static inline void percpu_free(void *__p
 #define free_percpu(ptr)	percpu_free((ptr))
 #define per_cpu_ptr(ptr, cpu)	percpu_ptr((ptr), (cpu))
 
+
+/*
+ * cpu allocator definitions
+ *
+ * The cpu allocator allows allocating an instance of an object for each
+ * processor and the use of a single pointer to access all instances
+ * of the object. cpu_alloc provides optimized means for accessing the
+ * instance of the object belonging to the currently executing processor
+ * as well as special atomic operations on fields of objects of the
+ * currently executing processor.
+ *
+ * Cpu objects are typically small. The allocator packs them tightly
+ * to increase the chance on each access that a per cpu object is already
+ * cached. Alignments may be specified but the intent is to align the data
+ * properly due to cpu alignment constraints and not to avoid cacheline
+ * contention. Any holes left by aligning objects are filled up with smaller
+ * objects that are allocated later.
+ *
+ * Cpu data can be allocated using CPU_ALLOC. The resulting pointer is
+ * pointing to the instance of the variable in the per cpu area provided
+ * by the loader. It is generally an error to use the pointer directly
+ * unless we are booting the system.
+ *
+ * __GFP_ZERO may be passed as a flag to zero the allocated memory.
+ */
+
+/* Return a pointer to the instance of a object for a particular processor */
+#define CPU_PTR(__p, __cpu)	SHIFT_PERCPU_PTR((__p), per_cpu_offset(__cpu))
+
+/*
+ * Return a pointer to the instance of the object belonging to the processor
+ * running the current code.
+ */
+#define THIS_CPU(__p)	SHIFT_PERCPU_PTR((__p), my_cpu_offset)
+#define __THIS_CPU(__p)	SHIFT_PERCPU_PTR((__p), __my_cpu_offset)
+
+#define CPU_ALLOC(type, flags)	((typeof(type) *)cpu_alloc(sizeof(type), \
+					(flags), __alignof__(type)))
+#define CPU_FREE(pointer)	cpu_free((pointer), sizeof(*(pointer)))
+
+/*
+ * Raw calls
+ */
+void *cpu_alloc(unsigned long size, gfp_t flags, unsigned long align);
+void cpu_free(void *cpu_pointer, unsigned long size);
+
 #endif /* __LINUX_PERCPU_H */

-- 


* [patch 03/41] cpu alloc: Use cpu allocator instead of the builtin modules per cpu allocator
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
  2008-05-30  3:56 ` [patch 01/41] cpu_alloc: Increase percpu area size to 128k Christoph Lameter
  2008-05-30  3:56 ` [patch 02/41] cpu alloc: The allocator Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  4:58   ` Andrew Morton
  2008-05-30  6:08   ` Rusty Russell
  2008-05-30  3:56 ` [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations Christoph Lameter
                   ` (38 subsequent siblings)
  41 siblings, 2 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_replace_modules_per_cpu_allocator --]
[-- Type: text/plain, Size: 13022 bytes --]

Remove the builtin per cpu allocator from kernel/module.c and use cpu_alloc
instead.

The patch also removes PERCPU_ENOUGH_ROOM. The size of the cpu_alloc area is
now determined by CONFIG_CPU_ALLOC_SIZE. PERCPU_ENOUGH_ROOM reserved 8k for
modules by default, while CONFIG_CPU_ALLOC_SIZE defaults to 30k. Thus we have
more space to load modules.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 arch/powerpc/kernel/setup_64.c |    5 -
 arch/sparc64/kernel/smp.c      |    2 
 arch/x86/kernel/setup.c        |   11 +-
 include/asm-ia64/percpu.h      |    2 
 include/linux/module.h         |    1 
 include/linux/percpu.h         |   11 --
 init/main.c                    |    9 --
 kernel/lockdep.c               |    2 
 kernel/module.c                |  178 +++--------------------------------------
 9 files changed, 28 insertions(+), 193 deletions(-)

Index: linux-2.6/kernel/module.c
===================================================================
--- linux-2.6.orig/kernel/module.c	2008-05-29 17:57:39.825214766 -0700
+++ linux-2.6/kernel/module.c	2008-05-29 18:00:50.496815514 -0700
@@ -314,121 +314,6 @@ static struct module *find_module(const 
 	return NULL;
 }
 
-#ifdef CONFIG_SMP
-/* Number of blocks used and allocated. */
-static unsigned int pcpu_num_used, pcpu_num_allocated;
-/* Size of each block.  -ve means used. */
-static int *pcpu_size;
-
-static int split_block(unsigned int i, unsigned short size)
-{
-	/* Reallocation required? */
-	if (pcpu_num_used + 1 > pcpu_num_allocated) {
-		int *new;
-
-		new = krealloc(pcpu_size, sizeof(new[0])*pcpu_num_allocated*2,
-			       GFP_KERNEL);
-		if (!new)
-			return 0;
-
-		pcpu_num_allocated *= 2;
-		pcpu_size = new;
-	}
-
-	/* Insert a new subblock */
-	memmove(&pcpu_size[i+1], &pcpu_size[i],
-		sizeof(pcpu_size[0]) * (pcpu_num_used - i));
-	pcpu_num_used++;
-
-	pcpu_size[i+1] -= size;
-	pcpu_size[i] = size;
-	return 1;
-}
-
-static inline unsigned int block_size(int val)
-{
-	if (val < 0)
-		return -val;
-	return val;
-}
-
-static void *percpu_modalloc(unsigned long size, unsigned long align,
-			     const char *name)
-{
-	unsigned long extra;
-	unsigned int i;
-	void *ptr;
-
-	if (align > PAGE_SIZE) {
-		printk(KERN_WARNING "%s: per-cpu alignment %li > %li\n",
-		       name, align, PAGE_SIZE);
-		align = PAGE_SIZE;
-	}
-
-	ptr = __per_cpu_start;
-	for (i = 0; i < pcpu_num_used; ptr += block_size(pcpu_size[i]), i++) {
-		/* Extra for alignment requirement. */
-		extra = ALIGN((unsigned long)ptr, align) - (unsigned long)ptr;
-		BUG_ON(i == 0 && extra != 0);
-
-		if (pcpu_size[i] < 0 || pcpu_size[i] < extra + size)
-			continue;
-
-		/* Transfer extra to previous block. */
-		if (pcpu_size[i-1] < 0)
-			pcpu_size[i-1] -= extra;
-		else
-			pcpu_size[i-1] += extra;
-		pcpu_size[i] -= extra;
-		ptr += extra;
-
-		/* Split block if warranted */
-		if (pcpu_size[i] - size > sizeof(unsigned long))
-			if (!split_block(i, size))
-				return NULL;
-
-		/* Mark allocated */
-		pcpu_size[i] = -pcpu_size[i];
-		return ptr;
-	}
-
-	printk(KERN_WARNING "Could not allocate %lu bytes percpu data\n",
-	       size);
-	return NULL;
-}
-
-static void percpu_modfree(void *freeme)
-{
-	unsigned int i;
-	void *ptr = __per_cpu_start + block_size(pcpu_size[0]);
-
-	/* First entry is core kernel percpu data. */
-	for (i = 1; i < pcpu_num_used; ptr += block_size(pcpu_size[i]), i++) {
-		if (ptr == freeme) {
-			pcpu_size[i] = -pcpu_size[i];
-			goto free;
-		}
-	}
-	BUG();
-
- free:
-	/* Merge with previous? */
-	if (pcpu_size[i-1] >= 0) {
-		pcpu_size[i-1] += pcpu_size[i];
-		pcpu_num_used--;
-		memmove(&pcpu_size[i], &pcpu_size[i+1],
-			(pcpu_num_used - i) * sizeof(pcpu_size[0]));
-		i--;
-	}
-	/* Merge with next? */
-	if (i+1 < pcpu_num_used && pcpu_size[i+1] >= 0) {
-		pcpu_size[i] += pcpu_size[i+1];
-		pcpu_num_used--;
-		memmove(&pcpu_size[i+1], &pcpu_size[i+2],
-			(pcpu_num_used - (i+1)) * sizeof(pcpu_size[0]));
-	}
-}
-
 static unsigned int find_pcpusec(Elf_Ehdr *hdr,
 				 Elf_Shdr *sechdrs,
 				 const char *secstrings)
@@ -444,48 +329,6 @@ static void percpu_modcopy(void *pcpudes
 		memcpy(pcpudest + per_cpu_offset(cpu), from, size);
 }
 
-static int percpu_modinit(void)
-{
-	pcpu_num_used = 2;
-	pcpu_num_allocated = 2;
-	pcpu_size = kmalloc(sizeof(pcpu_size[0]) * pcpu_num_allocated,
-			    GFP_KERNEL);
-	/* Static in-kernel percpu data (used). */
-	pcpu_size[0] = -(__per_cpu_end-__per_cpu_start);
-	/* Free room. */
-	pcpu_size[1] = PERCPU_ENOUGH_ROOM + pcpu_size[0];
-	if (pcpu_size[1] < 0) {
-		printk(KERN_ERR "No per-cpu room for modules.\n");
-		pcpu_num_used = 1;
-	}
-
-	return 0;
-}
-__initcall(percpu_modinit);
-#else /* ... !CONFIG_SMP */
-static inline void *percpu_modalloc(unsigned long size, unsigned long align,
-				    const char *name)
-{
-	return NULL;
-}
-static inline void percpu_modfree(void *pcpuptr)
-{
-	BUG();
-}
-static inline unsigned int find_pcpusec(Elf_Ehdr *hdr,
-					Elf_Shdr *sechdrs,
-					const char *secstrings)
-{
-	return 0;
-}
-static inline void percpu_modcopy(void *pcpudst, const void *src,
-				  unsigned long size)
-{
-	/* pcpusec should be 0, and size of that section should be 0. */
-	BUG_ON(size != 0);
-}
-#endif /* CONFIG_SMP */
-
 #define MODINFO_ATTR(field)	\
 static void setup_modinfo_##field(struct module *mod, const char *s)  \
 {                                                                     \
@@ -1403,7 +1246,7 @@ static void free_module(struct module *m
 	module_free(mod, mod->module_init);
 	kfree(mod->args);
 	if (mod->percpu)
-		percpu_modfree(mod->percpu);
+		cpu_free(mod->percpu, mod->percpu_size);
 
 	/* Free lock-classes: */
 	lockdep_free_key_range(mod->module_core, mod->core_size);
@@ -1772,6 +1615,7 @@ static struct module *load_module(void _
 	unsigned int markersstringsindex;
 	struct module *mod;
 	long err = 0;
+	unsigned long percpu_size = 0;
 	void *percpu = NULL, *ptr = NULL; /* Stops spurious gcc warning */
 	struct exception_table_entry *extable;
 	mm_segment_t old_fs;
@@ -1918,15 +1762,25 @@ static struct module *load_module(void _
 
 	if (pcpuindex) {
 		/* We have a special allocation for this section. */
-		percpu = percpu_modalloc(sechdrs[pcpuindex].sh_size,
-					 sechdrs[pcpuindex].sh_addralign,
-					 mod->name);
+		unsigned long align = sechdrs[pcpuindex].sh_addralign;
+		unsigned long size = sechdrs[pcpuindex].sh_size;
+
+		if (align > PAGE_SIZE) {
+			printk(KERN_WARNING "%s: per-cpu alignment %li > %li\n",
+			mod->name, align, PAGE_SIZE);
+			align = PAGE_SIZE;
+		}
+		percpu = cpu_alloc(size, GFP_KERNEL|__GFP_ZERO, align);
+		if (!percpu)
+			printk(KERN_WARNING "Could not allocate %lu bytes percpu data\n",
+										size);
 		if (!percpu) {
 			err = -ENOMEM;
 			goto free_mod;
 		}
 		sechdrs[pcpuindex].sh_flags &= ~(unsigned long)SHF_ALLOC;
 		mod->percpu = percpu;
+		mod->percpu_size = percpu_size;
 	}
 
 	/* Determine total sizes, and put offsets in sh_entsize.  For now
@@ -2175,7 +2029,7 @@ static struct module *load_module(void _
 	module_free(mod, mod->module_core);
  free_percpu:
 	if (percpu)
-		percpu_modfree(percpu);
+		cpu_free(percpu, percpu_size);
  free_mod:
 	kfree(args);
  free_hdr:
Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2008-05-29 17:58:32.328714051 -0700
+++ linux-2.6/include/linux/percpu.h	2008-05-29 17:58:53.652714198 -0700
@@ -34,17 +34,6 @@
 #define EXPORT_PER_CPU_SYMBOL(var) EXPORT_SYMBOL(per_cpu__##var)
 #define EXPORT_PER_CPU_SYMBOL_GPL(var) EXPORT_SYMBOL_GPL(per_cpu__##var)
 
-/* Enough to cover all DEFINE_PER_CPUs in kernel, including modules. */
-#ifndef PERCPU_ENOUGH_ROOM
-#ifdef CONFIG_MODULES
-#define PERCPU_MODULE_RESERVE	8192
-#else
-#define PERCPU_MODULE_RESERVE	0
-#endif
-
-#define PERCPU_ENOUGH_ROOM						\
-	(__per_cpu_end - __per_cpu_start + PERCPU_MODULE_RESERVE)
-#endif	/* PERCPU_ENOUGH_ROOM */
 
 /*
  * Must be an lvalue. Since @var must be a simple identifier,
Index: linux-2.6/include/linux/module.h
===================================================================
--- linux-2.6.orig/include/linux/module.h	2008-05-29 17:57:38.341214464 -0700
+++ linux-2.6/include/linux/module.h	2008-05-29 17:58:53.652714198 -0700
@@ -334,6 +334,7 @@ struct module
 
 	/* Per-cpu data. */
 	void *percpu;
+	int percpu_size;
 
 	/* The command line arguments (may be mangled).  People like
 	   keeping pointers to this stuff */
Index: linux-2.6/arch/powerpc/kernel/setup_64.c
===================================================================
--- linux-2.6.orig/arch/powerpc/kernel/setup_64.c	2008-05-29 17:57:38.357214432 -0700
+++ linux-2.6/arch/powerpc/kernel/setup_64.c	2008-05-29 17:58:53.652714198 -0700
@@ -596,11 +596,6 @@ void __init setup_per_cpu_areas(void)
 
 	/* Copy section for each CPU (we discard the original) */
 	size = ALIGN(__per_cpu_end - __per_cpu_start, PAGE_SIZE);
-#ifdef CONFIG_MODULES
-	if (size < PERCPU_ENOUGH_ROOM)
-		size = PERCPU_ENOUGH_ROOM;
-#endif
-
 	for_each_possible_cpu(i) {
 		ptr = alloc_bootmem_pages_node(NODE_DATA(cpu_to_node(i)), size);
 		if (!ptr)
Index: linux-2.6/arch/sparc64/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/sparc64/kernel/smp.c	2008-05-29 17:57:38.364714166 -0700
+++ linux-2.6/arch/sparc64/kernel/smp.c	2008-05-29 17:58:53.652714198 -0700
@@ -1454,7 +1454,7 @@ void __init real_setup_per_cpu_areas(voi
 	char *ptr;
 
 	/* Copy section for each CPU (we discard the original) */
-	goal = PERCPU_ENOUGH_ROOM;
+	goal = __per_cpu_size;
 
 	__per_cpu_shift = PAGE_SHIFT;
 	for (size = PAGE_SIZE; size < goal; size <<= 1UL)
Index: linux-2.6/arch/x86/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup.c	2008-05-29 17:57:39.592714425 -0700
+++ linux-2.6/arch/x86/kernel/setup.c	2008-05-29 17:58:53.652714198 -0700
@@ -89,30 +89,29 @@ EXPORT_SYMBOL(__per_cpu_offset);
 void __init setup_per_cpu_areas(void)
 {
 	int i, highest_cpu = 0;
-	unsigned long size;
 
 #ifdef CONFIG_HOTPLUG_CPU
 	prefill_possible_map();
 #endif
 
 	/* Copy section for each CPU (we discard the original) */
-	size = PERCPU_ENOUGH_ROOM;
 	printk(KERN_INFO "PERCPU: Allocating %lu bytes of per cpu data\n",
-			  size);
+			  __per_cpu_size);
 
 	for_each_possible_cpu(i) {
 		char *ptr;
 #ifndef CONFIG_NEED_MULTIPLE_NODES
-		ptr = alloc_bootmem_pages(size);
+		ptr = alloc_bootmem_pages(__per_cpu_size);
 #else
 		int node = early_cpu_to_node(i);
 		if (!node_online(node) || !NODE_DATA(node)) {
-			ptr = alloc_bootmem_pages(size);
+			ptr = alloc_bootmem_pages(__per_cpu_size);
 			printk(KERN_INFO
 			       "cpu %d has no node or node-local memory\n", i);
 		}
 		else
-			ptr = alloc_bootmem_pages_node(NODE_DATA(node), size);
+			ptr = alloc_bootmem_pages_node(NODE_DATA(node),
+								__per_cpu_size);
 #endif
 		if (!ptr)
 			panic("Cannot allocate cpu data for CPU %d\n", i);
Index: linux-2.6/include/asm-ia64/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/percpu.h	2008-05-29 17:57:38.349214528 -0700
+++ linux-2.6/include/asm-ia64/percpu.h	2008-05-29 17:58:53.652714198 -0700
@@ -6,8 +6,6 @@
  *	David Mosberger-Tang <davidm@hpl.hp.com>
  */
 
-#define PERCPU_ENOUGH_ROOM PERCPU_PAGE_SIZE
-
 #ifdef __ASSEMBLY__
 # define THIS_CPU(var)	(per_cpu__##var)  /* use this to mark accesses to per-CPU variables... */
 #else /* !__ASSEMBLY__ */
Index: linux-2.6/init/main.c
===================================================================
--- linux-2.6.orig/init/main.c	2008-05-29 17:57:38.380714353 -0700
+++ linux-2.6/init/main.c	2008-05-29 17:58:53.652714198 -0700
@@ -393,18 +393,17 @@ EXPORT_SYMBOL(__per_cpu_offset);
 
 static void __init setup_per_cpu_areas(void)
 {
-	unsigned long size, i;
+	unsigned long i;
 	char *ptr;
 	unsigned long nr_possible_cpus = num_possible_cpus();
 
 	/* Copy section for each CPU (we discard the original) */
-	size = ALIGN(PERCPU_ENOUGH_ROOM, PAGE_SIZE);
-	ptr = alloc_bootmem_pages(size * nr_possible_cpus);
+	ptr = alloc_bootmem_pages(__per_cpu_size * nr_possible_cpus);
 
 	for_each_possible_cpu(i) {
 		__per_cpu_offset[i] = ptr - __per_cpu_start;
-		memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
-		ptr += size;
+		memcpy(ptr, __per_cpu_start, __per_cpu_size);
+		ptr += __per_cpu_size;
 	}
 }
 #endif /* CONFIG_HAVE_SETUP_PER_CPU_AREA */
Index: linux-2.6/kernel/lockdep.c
===================================================================
--- linux-2.6.orig/kernel/lockdep.c	2008-05-29 17:57:39.816713970 -0700
+++ linux-2.6/kernel/lockdep.c	2008-05-29 17:59:22.697422432 -0700
@@ -610,7 +610,7 @@ static int static_obj(void *obj)
 	 */
 	for_each_possible_cpu(i) {
 		start = (unsigned long) &__per_cpu_start + per_cpu_offset(i);
-		end   = (unsigned long) &__per_cpu_start + PERCPU_ENOUGH_ROOM
+		end   = (unsigned long) &__per_cpu_start + __per_cpu_size
 					+ per_cpu_offset(i);
 
 		if ((addr >= start) && (addr < end))

-- 


* [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (2 preceding siblings ...)
  2008-05-30  3:56 ` [patch 03/41] cpu alloc: Use cpu allocator instead of the builtin modules per cpu allocator Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  4:58   ` Andrew Morton
  2008-05-30  3:56 ` [patch 05/41] cpu alloc: Percpu_counter conversion Christoph Lameter
                   ` (37 subsequent siblings)
  41 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_ops_base --]
[-- Type: text/plain, Size: 7877 bytes --]

Currently the per cpu subsystem is not able to use the atomic capabilities
that are provided by many of the available processors.

This patch adds new functionality that allows optimizing per cpu
variable handling. In particular it provides a simple way to exploit
atomic operations in order to avoid having to disable interrupts or
perform address calculations to access per cpu data.

For example, using our current methods we may do:

	unsigned long flags;
	struct stat_struct *p;

	local_irq_save(flags);
	/* Calculate address of per processor area */
	p = CPU_PTR(stat, smp_processor_id());
	p->counter++;
	local_irq_restore(flags);

This sequence can be replaced by a single atomic CPU operation:

	CPU_INC(stat->counter);

Most processors have instructions to perform the increment using
a single atomic instruction. Processors may have segment registers,
global registers or per cpu mappings of per cpu areas that can be used
to generate atomic instructions that combine the following in a single
operation:

1. Adding of an offset / register to a base address
2. Read modify write operation on the address calculated by
   the instruction.

If 1+2 are combined in one instruction then the instruction is atomic
vs. interrupts. This means that per cpu atomic operations do not need
to disable interrupts to increment counters etc.

The existing methods in use in the kernel cannot utilize the power of
these atomic instructions. local_t does not really address the issue
since the offset calculation is performed before the atomic operation.
The sequence is therefore not atomic. Disabling interrupts or preemption
is required in order to use local_t.

local_t is also very specific to the x86 processor. The solution here can
utilize other methods than just those provided by the x86 instruction set.
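
For contrast, using local_t today requires something like the following
(a sketch with a hypothetical per cpu counter):

	DEFINE_PER_CPU(local_t, hits);

	/* address calculation and atomic op are separate steps, so the
	   task must not be migrated in between */
	preempt_disable();
	local_inc(&__get_cpu_var(hits));
	preempt_enable();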



On x86 the above CPU_INC translates into a single instruction:

	inc %%gs:(&stat->counter)

This instruction is interrupt safe since it either completes entirely
or not at all. Both the adding of the offset and the read modify write
are combined in one instruction.

The determination of the correct per cpu area for the current processor
does not require access to smp_processor_id() (expensive...). The gs
register is used to provide a processor specific offset to the respective
per cpu area where the per cpu variable resides.

Note that the counter offset into the struct was added *before* the segment
selector was applied. This is necessary to avoid calculations. In the past
we first determined the address of the stats structure on the respective
processor and then added the field offset. However, the offset may as
well be added earlier. The adding of the per cpu offset (here through the
gs register) must be done by the instruction used for atomic per cpu
access.
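
A cpu ops capable arch would therefore implement the operation along
these lines (a simplified sketch of the idea only, not the actual x86
implementation later in this series; assumes an unsigned long field):

	/* gs based offset plus read modify write in one instruction */
	#define CPU_INC(var)	asm("incq %%gs:%0" : "+m" (var))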



If "stat" was declared via DECLARE_PER_CPU then this patchset is capable of
convincing the linker to provide the proper base address. In that case
no calculations are necessary.

Should the stat structure be reachable via a register then the address
calculation capabilities can be leveraged to avoid calculations.

On IA64 we can get the same combination of operations in a single instruction
by using the virtual address that always maps to the local per cpu area:

	fetchadd &stat->counter + (VCPU_BASE - __per_cpu_start)

The access is forced into the per cpu address reachable via the virtualized
address. IA64 allows the embedding of an offset into the instruction. So the
fetchadd can perform both the relocation of the pointer into the per cpu
area as well as the atomic read modify write cycle.



In order to be able to exploit the atomicity of these instructions we
introduce a series of new functions that take either:

1. A per cpu pointer as returned by cpu_alloc() or CPU_ALLOC().

2. A per cpu variable address as returned by per_cpu_var(<percpuvarname>).

CPU_READ()
CPU_WRITE()
CPU_INC()
CPU_DEC()
CPU_ADD()
CPU_SUB()
CPU_XCHG()
CPU_CMPXCHG()
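
The fallback implementation below also provides __CPU_* and _CPU_*
variants for contexts that already exclude preemption or interrupts.
A rough usage sketch (stats is a made-up structure obtained from
CPU_ALLOC):

	struct stats {
		unsigned long irq_events, events, fastpath;
	} *stats = CPU_ALLOC(struct stats, GFP_KERNEL | __GFP_ZERO);

	CPU_INC(stats->irq_events);	/* also modified from interrupt context */
	_CPU_INC(stats->events);	/* never modified from an interrupt */
	__CPU_INC(stats->fastpath);	/* preemption/interrupts already off */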

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/percpu.h |  135 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 135 insertions(+)

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2008-05-28 22:31:43.000000000 -0700
+++ linux-2.6/include/linux/percpu.h	2008-05-28 23:38:17.000000000 -0700
@@ -179,4 +179,139 @@
 void *cpu_alloc(unsigned long size, gfp_t flags, unsigned long align);
 void cpu_free(void *cpu_pointer, unsigned long size);
 
+/*
+ * Fast atomic per cpu operations.
+ *
+ * The following operations can be overridden by arches to implement fast
+ * and efficient operations. The operations are atomic meaning that the
+ * determination of the processor, the calculation of the address and the
+ * operation on the data is an atomic operation.
+ *
+ * The parameter passed to the atomic per cpu operations is an lvalue not a
+ * pointer to the object.
+ */
+#ifndef CONFIG_HAVE_CPU_OPS
+
+/*
+ * Fallback in case the arch does not provide for atomic per cpu operations.
+ *
+ * The first group of macros is used when it is safe to update the per
+ * cpu variable because preemption is off (per cpu variables that are not
+ * updated from interrupt context) or because interrupts are already off.
+ */
+#define __CPU_READ(var)				\
+({						\
+	(*THIS_CPU(&(var)));			\
+})
+
+#define __CPU_WRITE(var, value)			\
+({						\
+	*THIS_CPU(&(var)) = (value);		\
+})
+
+#define __CPU_ADD(var, value)			\
+({						\
+	*THIS_CPU(&(var)) += (value);		\
+})
+
+#define __CPU_INC(var) __CPU_ADD((var), 1)
+#define __CPU_DEC(var) __CPU_ADD((var), -1)
+#define __CPU_SUB(var, value) __CPU_ADD((var), -(value))
+
+#define __CPU_CMPXCHG(var, old, new)		\
+({						\
+	typeof(obj) x;				\
+	typeof(obj) *p = THIS_CPU(&(obj));	\
+	x = *p;					\
+	if (x == (old))				\
+		*p = (new);			\
+	(x);					\
+})
+
+#define __CPU_XCHG(obj, new)			\
+({						\
+	typeof(obj) x;				\
+	typeof(obj) *p = THIS_CPU(&(obj));	\
+	x = *p;					\
+	*p = (new);				\
+	(x);					\
+})
+
+/*
+ * Second group used for per cpu variables that are not updated from an
+ * interrupt context. In that case we can simply disable preemption which
+ * may be free if the kernel is compiled without support for preemption.
+ */
+#define _CPU_READ __CPU_READ
+#define _CPU_WRITE __CPU_WRITE
+
+#define _CPU_ADD(var, value)			\
+({						\
+	preempt_disable();			\
+	__CPU_ADD((var), (value));		\
+	preempt_enable();			\
+})
+
+#define _CPU_INC(var) _CPU_ADD((var), 1)
+#define _CPU_DEC(var) _CPU_ADD((var), -1)
+#define _CPU_SUB(var, value) _CPU_ADD((var), -(value))
+
+#define _CPU_CMPXCHG(var, old, new)		\
+({						\
+	typeof(addr) x;				\
+	preempt_disable();			\
+	x = __CPU_CMPXCHG((var), (old), (new));	\
+	preempt_enable();			\
+	(x);					\
+})
+
+#define _CPU_XCHG(var, new)			\
+({						\
+	typeof(var) x;				\
+	preempt_disable();			\
+	x = __CPU_XCHG((var), (new));		\
+	preempt_enable();			\
+	(x);					\
+})
+
+/*
+ * Third group: Interrupt safe CPU functions
+ */
+#define CPU_READ __CPU_READ
+#define CPU_WRITE __CPU_WRITE
+
+#define CPU_ADD(var, value)			\
+({						\
+	unsigned long flags;			\
+	local_irq_save(flags);			\
+	__CPU_ADD((var), (value));		\
+	local_irq_restore(flags);		\
+})
+
+#define CPU_INC(var) CPU_ADD((var), 1)
+#define CPU_DEC(var) CPU_ADD((var), -1)
+#define CPU_SUB(var, value) CPU_ADD((var), -(value))
+
+#define CPU_CMPXCHG(var, old, new)		\
+({						\
+	unsigned long flags;			\
+	typeof(var) x;				\
+	local_irq_save(flags);			\
+	x = __CPU_CMPXCHG((var), (old), (new));	\
+	local_irq_restore(flags);		\
+	(x);					\
+})
+
+#define CPU_XCHG(var, new)			\
+({						\
+	unsigned long flags;			\
+	typeof(var) x;				\
+	local_irq_save(flags);			\
+	x = __CPU_XCHG((var), (new));		\
+	local_irq_restore(flags);		\
+	(x);					\
+})
+
+#endif /* CONFIG_HAVE_CPU_OPS */
+
 #endif /* __LINUX_PERCPU_H */

-- 


* [patch 05/41] cpu alloc: Percpu_counter conversion
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (3 preceding siblings ...)
  2008-05-30  3:56 ` [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  6:47   ` Rusty Russell
  2008-05-30  3:56 ` [patch 06/41] cpu alloc: crash_notes conversion Christoph Lameter
                   ` (36 subsequent siblings)
  41 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_percpu_counter_conversion --]
[-- Type: text/plain, Size: 2210 bytes --]

Use cpu_alloc instead of allocpercpu.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 lib/percpu_counter.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c	2008-05-28 17:56:17.000000000 -0700
+++ linux-2.6/lib/percpu_counter.c	2008-05-28 18:30:25.000000000 -0700
@@ -20,7 +20,7 @@
 
 	spin_lock(&fbc->lock);
 	for_each_possible_cpu(cpu) {
-		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+		s32 *pcount = CPU_PTR(fbc->counters, cpu);
 		*pcount = 0;
 	}
 	fbc->count = amount;
@@ -31,20 +31,18 @@
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
 {
 	s64 count;
-	s32 *pcount;
-	int cpu = get_cpu();
 
-	pcount = per_cpu_ptr(fbc->counters, cpu);
-	count = *pcount + amount;
+	preempt_disable();
+	count = __CPU_READ(*fbc->counters) + amount;
 	if (count >= batch || count <= -batch) {
 		spin_lock(&fbc->lock);
 		fbc->count += count;
-		*pcount = 0;
+		__CPU_WRITE(*fbc->counters, 0);
 		spin_unlock(&fbc->lock);
 	} else {
-		*pcount = count;
+		__CPU_WRITE(*fbc->counters, count);
 	}
-	put_cpu();
+	preempt_enable();
 }
 EXPORT_SYMBOL(__percpu_counter_add);
 
@@ -60,7 +58,7 @@
 	spin_lock(&fbc->lock);
 	ret = fbc->count;
 	for_each_online_cpu(cpu) {
-		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+		s32 *pcount = CPU_PTR(fbc->counters, cpu);
 		ret += *pcount;
 	}
 	spin_unlock(&fbc->lock);
@@ -74,7 +72,7 @@
 {
 	spin_lock_init(&fbc->lock);
 	fbc->count = amount;
-	fbc->counters = alloc_percpu(s32);
+	fbc->counters = CPU_ALLOC(s32, GFP_KERNEL|__GFP_ZERO);
 	if (!fbc->counters)
 		return -ENOMEM;
 #ifdef CONFIG_HOTPLUG_CPU
@@ -101,7 +99,7 @@
 	if (!fbc->counters)
 		return;
 
-	free_percpu(fbc->counters);
+	CPU_FREE(fbc->counters);
 	fbc->counters = NULL;
 #ifdef CONFIG_HOTPLUG_CPU
 	mutex_lock(&percpu_counters_lock);
@@ -128,7 +126,7 @@
 		unsigned long flags;
 
 		spin_lock_irqsave(&fbc->lock, flags);
-		pcount = per_cpu_ptr(fbc->counters, cpu);
+		pcount = CPU_PTR(fbc->counters, cpu);
 		fbc->count += *pcount;
 		*pcount = 0;
 		spin_unlock_irqrestore(&fbc->lock, flags);

-- 


* [patch 06/41] cpu alloc: crash_notes conversion
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (4 preceding siblings ...)
  2008-05-30  3:56 ` [patch 05/41] cpu alloc: Percpu_counter conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 07/41] cpu alloc: Workqueue conversion Christoph Lameter
                   ` (35 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_crash_notes_conversion --]
[-- Type: text/plain, Size: 2229 bytes --]

Convert crash_notes access to use cpu alloc.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 arch/ia64/kernel/crash.c |    2 +-
 drivers/base/cpu.c       |    2 +-
 kernel/kexec.c           |    4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6/arch/ia64/kernel/crash.c
===================================================================
--- linux-2.6.orig/arch/ia64/kernel/crash.c	2008-05-28 17:25:23.000000000 -0700
+++ linux-2.6/arch/ia64/kernel/crash.c	2008-05-28 18:27:04.000000000 -0700
@@ -72,7 +72,7 @@
 	dst[46] = (unsigned long)ia64_rse_skip_regs((unsigned long *)dst[46],
 			sof - sol);
 
-	buf = (u64 *) per_cpu_ptr(crash_notes, cpu);
+	buf = (u64 *) CPU_PTR(crash_notes, cpu);
 	if (!buf)
 		return;
 	buf = append_elf_note(buf, KEXEC_CORE_NOTE_NAME, NT_PRSTATUS, prstatus,
Index: linux-2.6/drivers/base/cpu.c
===================================================================
--- linux-2.6.orig/drivers/base/cpu.c	2008-05-28 17:25:23.000000000 -0700
+++ linux-2.6/drivers/base/cpu.c	2008-05-28 18:27:04.000000000 -0700
@@ -95,7 +95,7 @@
 	 * boot up and this data does not change there after. Hence this
 	 * operation should be safe. No locking required.
 	 */
-	addr = __pa(per_cpu_ptr(crash_notes, cpunum));
+	addr = __pa(CPU_PTR(crash_notes, cpunum));
 	rc = sprintf(buf, "%Lx\n", addr);
 	return rc;
 }
Index: linux-2.6/kernel/kexec.c
===================================================================
--- linux-2.6.orig/kernel/kexec.c	2008-05-28 17:25:23.000000000 -0700
+++ linux-2.6/kernel/kexec.c	2008-05-28 18:27:05.000000000 -0700
@@ -1121,7 +1121,7 @@
 	 * squirrelled away.  ELF notes happen to provide
 	 * all of that, so there is no need to invent something new.
 	 */
-	buf = (u32*)per_cpu_ptr(crash_notes, cpu);
+	buf = (u32 *)CPU_PTR(crash_notes, cpu);
 	if (!buf)
 		return;
 	memset(&prstatus, 0, sizeof(prstatus));
@@ -1135,7 +1135,7 @@
 static int __init crash_notes_memory_init(void)
 {
 	/* Allocate memory for saving cpu registers. */
-	crash_notes = alloc_percpu(note_buf_t);
+	crash_notes = CPU_ALLOC(note_buf_t, GFP_KERNEL|__GFP_ZERO);
 	if (!crash_notes) {
 		printk("Kexec: Memory allocation for saving cpu register"
 		" states failed\n");

-- 


* [patch 07/41] cpu alloc: Workqueue conversion
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (5 preceding siblings ...)
  2008-05-30  3:56 ` [patch 06/41] cpu alloc: crash_notes conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 08/41] cpu alloc: ACPI cstate handling conversion Christoph Lameter
                   ` (34 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_workqueue_conversion --]
[-- Type: text/plain, Size: 4397 bytes --]

Convert the workqueue per cpu handling to cpu alloc.

The second parameter to wq_per_cpu() is always the current processor
id. So drop the parameter and use THIS_CPU in wq_per_cpu(), which we
rename to wq_this_cpu().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 kernel/workqueue.c |   27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)

Index: linux-2.6/kernel/workqueue.c
===================================================================
--- linux-2.6.orig/kernel/workqueue.c	2008-05-28 22:02:19.000000000 -0700
+++ linux-2.6/kernel/workqueue.c	2008-05-28 22:52:29.000000000 -0700
@@ -95,11 +95,11 @@
 }
 
 static
-struct cpu_workqueue_struct *wq_per_cpu(struct workqueue_struct *wq, int cpu)
+struct cpu_workqueue_struct *wq_this_cpu(struct workqueue_struct *wq)
 {
 	if (unlikely(is_single_threaded(wq)))
-		cpu = singlethread_cpu;
-	return per_cpu_ptr(wq->cpu_wq, cpu);
+		return CPU_PTR(wq->cpu_wq, singlethread_cpu);
+	return THIS_CPU(wq->cpu_wq);
 }
 
 /*
@@ -167,8 +167,9 @@
 
 	if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
 		BUG_ON(!list_empty(&work->entry));
-		__queue_work(wq_per_cpu(wq, get_cpu()), work);
-		put_cpu();
+		preempt_disable();
+		__queue_work(wq_this_cpu(wq), work);
+		preempt_enable();
 		ret = 1;
 	}
 	return ret;
@@ -181,7 +182,7 @@
 	struct cpu_workqueue_struct *cwq = get_wq_data(&dwork->work);
 	struct workqueue_struct *wq = cwq->wq;
 
-	__queue_work(wq_per_cpu(wq, smp_processor_id()), &dwork->work);
+	__queue_work(wq_this_cpu(wq), &dwork->work);
 }
 
 /**
@@ -225,7 +226,7 @@
 		timer_stats_timer_set_start_info(&dwork->timer);
 
 		/* This stores cwq for the moment, for the timer_fn */
-		set_wq_data(work, wq_per_cpu(wq, raw_smp_processor_id()));
+		set_wq_data(work, wq_this_cpu(wq));
 		timer->expires = jiffies + delay;
 		timer->data = (unsigned long)dwork;
 		timer->function = delayed_work_timer_fn;
@@ -398,7 +399,7 @@
 	lock_acquire(&wq->lockdep_map, 0, 0, 0, 2, _THIS_IP_);
 	lock_release(&wq->lockdep_map, 1, _THIS_IP_);
 	for_each_cpu_mask(cpu, *cpu_map)
-		flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu));
+		flush_cpu_workqueue(CPU_PTR(wq->cpu_wq, cpu));
 }
 EXPORT_SYMBOL_GPL(flush_workqueue);
 
@@ -478,7 +479,7 @@
 	cpu_map = wq_cpu_map(wq);
 
 	for_each_cpu_mask(cpu, *cpu_map)
-		wait_on_cpu_work(per_cpu_ptr(wq->cpu_wq, cpu), work);
+		wait_on_cpu_work(CPU_PTR(wq->cpu_wq, cpu), work);
 }
 
 static int __cancel_work_timer(struct work_struct *work,
@@ -598,21 +599,21 @@
 	int cpu;
 	struct work_struct *works;
 
-	works = alloc_percpu(struct work_struct);
+	works = CPU_ALLOC(struct work_struct, GFP_KERNEL);
 	if (!works)
 		return -ENOMEM;
 
 	get_online_cpus();
 	for_each_online_cpu(cpu) {
-		struct work_struct *work = per_cpu_ptr(works, cpu);
+		struct work_struct *work = CPU_PTR(works, cpu);
 
 		INIT_WORK(work, func);
 		set_bit(WORK_STRUCT_PENDING, work_data_bits(work));
-		__queue_work(per_cpu_ptr(keventd_wq->cpu_wq, cpu), work);
+		__queue_work(CPU_PTR(keventd_wq->cpu_wq, cpu), work);
 	}
 	flush_workqueue(keventd_wq);
 	put_online_cpus();
-	free_percpu(works);
+	CPU_FREE(works);
 	return 0;
 }
 
@@ -661,7 +662,7 @@
 
 	BUG_ON(!keventd_wq);
 
-	cwq = per_cpu_ptr(keventd_wq->cpu_wq, cpu);
+	cwq = CPU_PTR(keventd_wq->cpu_wq, cpu);
 	if (current == cwq->thread)
 		ret = 1;
 
@@ -672,7 +673,7 @@
 static struct cpu_workqueue_struct *
 init_cpu_workqueue(struct workqueue_struct *wq, int cpu)
 {
-	struct cpu_workqueue_struct *cwq = per_cpu_ptr(wq->cpu_wq, cpu);
+	struct cpu_workqueue_struct *cwq = CPU_PTR(wq->cpu_wq, cpu);
 
 	cwq->wq = wq;
 	spin_lock_init(&cwq->lock);
@@ -730,7 +731,8 @@
 	if (!wq)
 		return NULL;
 
-	wq->cpu_wq = alloc_percpu(struct cpu_workqueue_struct);
+	wq->cpu_wq = CPU_ALLOC(struct cpu_workqueue_struct,
+					GFP_KERNEL|__GFP_ZERO);
 	if (!wq->cpu_wq) {
 		kfree(wq);
 		return NULL;
@@ -814,10 +816,10 @@
 	spin_unlock(&workqueue_lock);
 
 	for_each_cpu_mask(cpu, *cpu_map)
-		cleanup_workqueue_thread(per_cpu_ptr(wq->cpu_wq, cpu));
+		cleanup_workqueue_thread(CPU_PTR(wq->cpu_wq, cpu));
 	put_online_cpus();
 
-	free_percpu(wq->cpu_wq);
+	CPU_FREE(wq->cpu_wq);
 	kfree(wq);
 }
 EXPORT_SYMBOL_GPL(destroy_workqueue);
@@ -838,7 +840,7 @@
 	}
 
 	list_for_each_entry(wq, &workqueues, list) {
-		cwq = per_cpu_ptr(wq->cpu_wq, cpu);
+		cwq = CPU_PTR(wq->cpu_wq, cpu);
 
 		switch (action) {
 		case CPU_UP_PREPARE:

-- 


* [patch 08/41] cpu alloc: ACPI cstate handling conversion
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (6 preceding siblings ...)
  2008-05-30  3:56 ` [patch 07/41] cpu alloc: Workqueue conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 09/41] cpu alloc: Genhd statistics conversion Christoph Lameter
                   ` (33 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_acpi_cstate_handling_conversion --]
[-- Type: text/plain, Size: 3311 bytes --]

Convert ACPI per cpu handling to cpu alloc.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 arch/x86/kernel/acpi/cstate.c              |    9 +++++----
 arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c |    7 ++++---
 drivers/acpi/processor_perflib.c           |    4 ++--
 3 files changed, 11 insertions(+), 9 deletions(-)

Index: linux-2.6/arch/x86/kernel/acpi/cstate.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/acpi/cstate.c	2008-04-29 14:55:48.000000000 -0700
+++ linux-2.6/arch/x86/kernel/acpi/cstate.c	2008-05-21 21:43:07.000000000 -0700
@@ -85,7 +85,7 @@
 	if (reg->bit_offset != NATIVE_CSTATE_BEYOND_HALT)
 		return -1;
 
-	percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu);
+	percpu_entry = CPU_PTR(cpu_cstate_entry, cpu);
 	percpu_entry->states[cx->index].eax = 0;
 	percpu_entry->states[cx->index].ecx = 0;
 
@@ -138,7 +138,7 @@
 	unsigned int cpu = smp_processor_id();
 	struct cstate_entry *percpu_entry;
 
-	percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu);
+	percpu_entry = CPU_PTR(cpu_cstate_entry, cpu);
 	mwait_idle_with_hints(percpu_entry->states[cx->index].eax,
 	                      percpu_entry->states[cx->index].ecx);
 }
@@ -150,13 +150,14 @@
 	if (c->x86_vendor != X86_VENDOR_INTEL)
 		return -1;
 
-	cpu_cstate_entry = alloc_percpu(struct cstate_entry);
+	cpu_cstate_entry = CPU_ALLOC(struct cstate_entry,
+					GFP_KERNEL|__GFP_ZERO);
 	return 0;
 }
 
 static void __exit ffh_cstate_exit(void)
 {
-	free_percpu(cpu_cstate_entry);
+	CPU_FREE(cpu_cstate_entry);
 	cpu_cstate_entry = NULL;
 }
 
Index: linux-2.6/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c	2008-04-29 14:55:48.000000000 -0700
+++ linux-2.6/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c	2008-05-21 21:44:02.000000000 -0700
@@ -524,7 +524,8 @@
 {
 	dprintk("acpi_cpufreq_early_init\n");
 
-	acpi_perf_data = alloc_percpu(struct acpi_processor_performance);
+	acpi_perf_data = CPU_ALLOC(struct acpi_processor_performance,
+						GFP_KERNEL|__GFP_ZERO);
 	if (!acpi_perf_data) {
 		dprintk("Memory allocation error for acpi_perf_data.\n");
 		return -ENOMEM;
@@ -580,7 +581,7 @@
 	if (!data)
 		return -ENOMEM;
 
-	data->acpi_data = percpu_ptr(acpi_perf_data, cpu);
+	data->acpi_data = CPU_PTR(acpi_perf_data, cpu);
 	per_cpu(drv_data, cpu) = data;
 
 	if (cpu_has(c, X86_FEATURE_CONSTANT_TSC))
@@ -794,7 +795,7 @@
 
 	cpufreq_unregister_driver(&acpi_cpufreq_driver);
 
-	free_percpu(acpi_perf_data);
+	CPU_FREE(acpi_perf_data);
 
 	return;
 }
Index: linux-2.6/drivers/acpi/processor_perflib.c
===================================================================
--- linux-2.6.orig/drivers/acpi/processor_perflib.c	2008-04-29 14:55:49.000000000 -0700
+++ linux-2.6/drivers/acpi/processor_perflib.c	2008-05-21 21:43:07.000000000 -0700
@@ -583,12 +583,12 @@
 			continue;
 		}
 
-		if (!performance || !percpu_ptr(performance, i)) {
+		if (!performance || !CPU_PTR(performance, i)) {
 			retval = -EINVAL;
 			continue;
 		}
 
-		pr->performance = percpu_ptr(performance, i);
+		pr->performance = CPU_PTR(performance, i);
 		cpu_set(i, pr->performance->shared_cpu_map);
 		if (acpi_processor_get_psd(pr)) {
 			retval = -EINVAL;

-- 


* [patch 09/41] cpu alloc: Genhd statistics conversion
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (7 preceding siblings ...)
  2008-05-30  3:56 ` [patch 08/41] cpu alloc: ACPI cstate handling conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 10/41] cpu alloc: blktrace conversion Christoph Lameter
                   ` (32 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_genhd_statistics_conversion --]
[-- Type: text/plain, Size: 7166 bytes --]

Convert genhd statistics to cpu alloc. The patch also drops the UP special
casing of the statistics.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/linux/genhd.h |  113 +++++++++++---------------------------------------
 1 file changed, 25 insertions(+), 88 deletions(-)

Index: linux-2.6/include/linux/genhd.h
===================================================================
--- linux-2.6.orig/include/linux/genhd.h	2008-05-26 09:35:31.626487665 -0700
+++ linux-2.6/include/linux/genhd.h	2008-05-26 11:09:00.740248203 -0700
@@ -95,11 +95,7 @@ struct hd_struct {
 #endif
 	unsigned long stamp;
 	int in_flight;
-#ifdef	CONFIG_SMP
 	struct disk_stats *dkstats;
-#else
-	struct disk_stats dkstats;
-#endif
 };
 
 #define GENHD_FL_REMOVABLE			1
@@ -135,11 +131,7 @@ struct gendisk {
 	atomic_t sync_io;		/* RAID */
 	unsigned long stamp;
 	int in_flight;
-#ifdef	CONFIG_SMP
 	struct disk_stats *dkstats;
-#else
-	struct disk_stats dkstats;
-#endif
 	struct work_struct async_notify;
 };
 
@@ -163,16 +155,15 @@ static inline struct hd_struct *get_part
 	return NULL;
 }
 
-#ifdef	CONFIG_SMP
 #define __disk_stat_add(gendiskp, field, addnd) 	\
-	(per_cpu_ptr(gendiskp->dkstats, smp_processor_id())->field += addnd)
+	__CPU_ADD(gendiskp->dkstats->field, addnd)
 
 #define disk_stat_read(gendiskp, field)					\
 ({									\
 	typeof(gendiskp->dkstats->field) res = 0;			\
 	int i;								\
 	for_each_possible_cpu(i)					\
-		res += per_cpu_ptr(gendiskp->dkstats, i)->field;	\
+		res += CPU_PTR(gendiskp->dkstats, i)->field;		\
 	res;								\
 })
 
@@ -180,12 +171,12 @@ static inline void disk_stat_set_all(str
 	int i;
 
 	for_each_possible_cpu(i)
-		memset(per_cpu_ptr(gendiskp->dkstats, i), value,
+		memset(CPU_PTR(gendiskp->dkstats, i), value,
 				sizeof(struct disk_stats));
-}		
+}
 
-#define __part_stat_add(part, field, addnd)				\
-	(per_cpu_ptr(part->dkstats, smp_processor_id())->field += addnd)
+#define __part_stat_add(part, field, addnd)			\
+	__CPU_ADD(part->dkstats->field, addnd)
 
 #define __all_stat_add(gendiskp, part, field, addnd, sector)	\
 ({								\
@@ -199,7 +190,7 @@ static inline void disk_stat_set_all(str
 	typeof(part->dkstats->field) res = 0;				\
 	int i;								\
 	for_each_possible_cpu(i)					\
-		res += per_cpu_ptr(part->dkstats, i)->field;		\
+		res += CPU_PTR(part->dkstats, i)->field;		\
 	res;								\
 })
 
@@ -208,56 +199,23 @@ static inline void part_stat_set_all(str
 	int i;
 
 	for_each_possible_cpu(i)
-		memset(per_cpu_ptr(part->dkstats, i), value,
+		memset(CPU_PTR(part->dkstats, i), value,
 				sizeof(struct disk_stats));
 }
-				
-#else /* !CONFIG_SMP */
-#define __disk_stat_add(gendiskp, field, addnd) \
-				(gendiskp->dkstats.field += addnd)
-#define disk_stat_read(gendiskp, field)	(gendiskp->dkstats.field)
-
-static inline void disk_stat_set_all(struct gendisk *gendiskp, int value)
-{
-	memset(&gendiskp->dkstats, value, sizeof (struct disk_stats));
-}
-
-#define __part_stat_add(part, field, addnd) \
-	(part->dkstats.field += addnd)
-
-#define __all_stat_add(gendiskp, part, field, addnd, sector)	\
-({								\
-	if (part)						\
-		part->dkstats.field += addnd;			\
-	__disk_stat_add(gendiskp, field, addnd);		\
-})
-
-#define part_stat_read(part, field)	(part->dkstats.field)
-
-static inline void part_stat_set_all(struct hd_struct *part, int value)
-{
-	memset(&part->dkstats, value, sizeof(struct disk_stats));
-}
-
-#endif /* CONFIG_SMP */
 
 #define disk_stat_add(gendiskp, field, addnd)			\
-	do {							\
-		preempt_disable();				\
-		__disk_stat_add(gendiskp, field, addnd);	\
-		preempt_enable();				\
-	} while (0)
+	_CPU_ADD(gendiskp->dkstats->field, addnd)
 
-#define __disk_stat_dec(gendiskp, field) __disk_stat_add(gendiskp, field, -1)
-#define disk_stat_dec(gendiskp, field) disk_stat_add(gendiskp, field, -1)
+#define __disk_stat_dec(gendiskp, field) __CPU_DEC(gendiskp->dkstats->field)
+#define disk_stat_dec(gendiskp, field) _CPU_DEC(gendiskp->dkstats->field)
 
-#define __disk_stat_inc(gendiskp, field) __disk_stat_add(gendiskp, field, 1)
-#define disk_stat_inc(gendiskp, field) disk_stat_add(gendiskp, field, 1)
+#define __disk_stat_inc(gendiskp, field) __CPU_INC(gendiskp->dkstats->field)
+#define disk_stat_inc(gendiskp, field) _CPU_INC(gendiskp->dkstats->field)
 
 #define __disk_stat_sub(gendiskp, field, subnd) \
-		__disk_stat_add(gendiskp, field, -subnd)
+		__CPU_SUB(gendiskp->dkstats->field, subnd)
 #define disk_stat_sub(gendiskp, field, subnd) \
-		disk_stat_add(gendiskp, field, -subnd)
+		_CPU_SUB(gendiskp->dkstats->field, subnd)
 
 #define part_stat_add(gendiskp, field, addnd)		\
 	do {						\
@@ -266,16 +224,16 @@ static inline void part_stat_set_all(str
 		preempt_enable();			\
 	} while (0)
 
-#define __part_stat_dec(gendiskp, field) __part_stat_add(gendiskp, field, -1)
-#define part_stat_dec(gendiskp, field) part_stat_add(gendiskp, field, -1)
+#define __part_stat_dec(gendiskp, field) __CPU_DEC(gendiskp->dkstats->field)
+#define part_stat_dec(gendiskp, field) _CPU_DEC(gendiskp->dkstats->field)
 
-#define __part_stat_inc(gendiskp, field) __part_stat_add(gendiskp, field, 1)
-#define part_stat_inc(gendiskp, field) part_stat_add(gendiskp, field, 1)
+#define __part_stat_inc(gendiskp, field) __CPU_INC(gendiskp->dkstats->field)
+#define part_stat_inc(gendiskp, field) _CPU_INC(gendiskp->dkstats->field)
 
 #define __part_stat_sub(gendiskp, field, subnd) \
-		__part_stat_add(gendiskp, field, -subnd)
+		__CPU_SUB(gendiskp->dkstats->field, subnd)
 #define part_stat_sub(gendiskp, field, subnd) \
-		part_stat_add(gendiskp, field, -subnd)
+		_CPU_SUB(gendiskp->dkstats->field, subnd)
 
 #define all_stat_add(gendiskp, part, field, addnd, sector)	\
 	do {							\
@@ -300,10 +258,9 @@ static inline void part_stat_set_all(str
 		all_stat_add(gendiskp, part, field, -subnd, sector)
 
 /* Inlines to alloc and free disk stats in struct gendisk */
-#ifdef  CONFIG_SMP
 static inline int init_disk_stats(struct gendisk *disk)
 {
-	disk->dkstats = alloc_percpu(struct disk_stats);
+	disk->dkstats = CPU_ALLOC(struct disk_stats, GFP_KERNEL | __GFP_ZERO);
 	if (!disk->dkstats)
 		return 0;
 	return 1;
@@ -311,12 +268,12 @@ static inline int init_disk_stats(struct
 
 static inline void free_disk_stats(struct gendisk *disk)
 {
-	free_percpu(disk->dkstats);
+	CPU_FREE(disk->dkstats);
 }
 
 static inline int init_part_stats(struct hd_struct *part)
 {
-	part->dkstats = alloc_percpu(struct disk_stats);
+	part->dkstats = CPU_ALLOC(struct disk_stats, GFP_KERNEL|__GFP_ZERO);
 	if (!part->dkstats)
 		return 0;
 	return 1;
@@ -324,28 +281,8 @@ static inline int init_part_stats(struct
 
 static inline void free_part_stats(struct hd_struct *part)
 {
-	free_percpu(part->dkstats);
-}
-
-#else	/* CONFIG_SMP */
-static inline int init_disk_stats(struct gendisk *disk)
-{
-	return 1;
-}
-
-static inline void free_disk_stats(struct gendisk *disk)
-{
-}
-
-static inline int init_part_stats(struct hd_struct *part)
-{
-	return 1;
-}
-
-static inline void free_part_stats(struct hd_struct *part)
-{
+	CPU_FREE(part->dkstats);
 }
-#endif	/* CONFIG_SMP */
 
 /* drivers/block/ll_rw_blk.c */
 extern void disk_round_stats(struct gendisk *disk);

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 10/41] cpu alloc: blktrace conversion
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (8 preceding siblings ...)
  2008-05-30  3:56 ` [patch 09/41] cpu alloc: Genhd statistics conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 11/41] cpu alloc: SRCU cpu alloc conversion Christoph Lameter
                   ` (31 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_blktrace_conversion --]
[-- Type: text/plain, Size: 2545 bytes --]

Convert blktrace percpu handling to cpu_alloc.
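
Besides the typed CPU_ALLOC() macro this conversion also uses the raw
cpu_alloc(size, gfp, align) call for the untyped message buffer and
THIS_CPU() to address the current processor's instance. A rough sketch of
that pattern, mirroring the hunks below with invented names and a made-up
buffer size:

	#define MY_MSG_MAX	128	/* stand-in for BLK_TN_MAX_MSG */

	static unsigned long *my_seq;	/* per cpu sequence counter */
	static char *my_msg;		/* per cpu message buffer */

	static int my_trace_setup(void)
	{
		my_seq = CPU_ALLOC(unsigned long, GFP_KERNEL | __GFP_ZERO);
		/* Untyped allocation: explicit size, gfp flags, alignment */
		my_msg = cpu_alloc(MY_MSG_MAX, GFP_KERNEL | __GFP_ZERO, 0);
		if (!my_seq || !my_msg) {
			CPU_FREE(my_seq);
			CPU_FREE(my_msg);
			return -ENOMEM;
		}
		return 0;
	}

	static void my_trace_note(const char *msg)
	{
		char *buf;

		preempt_disable();
		buf = THIS_CPU(my_msg);		/* this cpu's buffer */
		/* Same calling convention as the bt->sequence hunks below */
		__CPU_INC(my_seq);
		snprintf(buf, MY_MSG_MAX, "%lu: %s", __CPU_READ(my_seq), msg);
		preempt_enable();
	}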

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 block/blktrace.c |   24 ++++++++++--------------
 1 file changed, 10 insertions(+), 14 deletions(-)

Index: linux-2.6/block/blktrace.c
===================================================================
--- linux-2.6.orig/block/blktrace.c	2008-05-28 18:45:08.580239875 -0700
+++ linux-2.6/block/blktrace.c	2008-05-29 00:02:19.570236238 -0700
@@ -82,7 +82,7 @@ void __trace_note_message(struct blk_tra
 	char *buf;
 
 	preempt_disable();
-	buf = per_cpu_ptr(bt->msg_data, smp_processor_id());
+	buf = THIS_CPU(bt->msg_data);
 	va_start(args, fmt);
 	n = vscnprintf(buf, BLK_TN_MAX_MSG, fmt, args);
 	va_end(args);
@@ -138,9 +138,7 @@ void __blk_add_trace(struct blk_trace *b
 	struct task_struct *tsk = current;
 	struct blk_io_trace *t;
 	unsigned long flags;
-	unsigned long *sequence;
 	pid_t pid;
-	int cpu;
 
 	if (unlikely(bt->trace_state != Blktrace_running))
 		return;
@@ -170,18 +168,16 @@ void __blk_add_trace(struct blk_trace *b
 
 	t = relay_reserve(bt->rchan, sizeof(*t) + pdu_len);
 	if (t) {
-		cpu = smp_processor_id();
-		sequence = per_cpu_ptr(bt->sequence, cpu);
-
+		__CPU_INC(bt->sequence);
 		t->magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION;
-		t->sequence = ++(*sequence);
+		t->sequence = __CPU_READ(bt->sequence);
 		t->time = ktime_to_ns(ktime_get());
 		t->sector = sector;
 		t->bytes = bytes;
 		t->action = what;
 		t->pid = pid;
 		t->device = bt->dev;
-		t->cpu = cpu;
+		t->cpu = smp_processor_id();
 		t->error = error;
 		t->pdu_len = pdu_len;
 
@@ -248,8 +244,8 @@ static void blk_trace_cleanup(struct blk
 	relay_close(bt->rchan);
 	debugfs_remove(bt->dropped_file);
 	blk_remove_tree(bt->dir);
-	free_percpu(bt->sequence);
-	free_percpu(bt->msg_data);
+	CPU_FREE(bt->sequence);
+	CPU_FREE(bt->msg_data);
 	kfree(bt);
 }
 
@@ -360,11 +356,11 @@ int do_blk_trace_setup(struct request_qu
 	if (!bt)
 		goto err;
 
-	bt->sequence = alloc_percpu(unsigned long);
+	bt->sequence = CPU_ALLOC(unsigned long, GFP_KERNEL | __GFP_ZERO);
 	if (!bt->sequence)
 		goto err;
 
-	bt->msg_data = __alloc_percpu(BLK_TN_MAX_MSG);
+	bt->msg_data = cpu_alloc(BLK_TN_MAX_MSG, GFP_KERNEL | __GFP_ZERO, 0);
 	if (!bt->msg_data)
 		goto err;
 
@@ -413,8 +409,8 @@ err:
 	if (bt) {
 		if (bt->dropped_file)
 			debugfs_remove(bt->dropped_file);
-		free_percpu(bt->sequence);
-		free_percpu(bt->msg_data);
+		CPU_FREE(bt->sequence);
+		CPU_FREE(bt->msg_data);
 		if (bt->rchan)
 			relay_close(bt->rchan);
 		kfree(bt);

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 11/41] cpu alloc: SRCU cpu alloc conversion
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (9 preceding siblings ...)
  2008-05-30  3:56 ` [patch 10/41] cpu alloc: blktrace conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 12/41] cpu alloc: XFS counter conversion Christoph Lameter
                   ` (30 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_srcu_conversion --]
[-- Type: text/plain, Size: 2459 bytes --]

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 kernel/rcutorture.c |    4 ++--
 kernel/srcu.c       |   20 ++++++++------------
 2 files changed, 10 insertions(+), 14 deletions(-)

Index: linux-2.6/kernel/rcutorture.c
===================================================================
--- linux-2.6.orig/kernel/rcutorture.c	2008-04-29 14:55:55.000000000 -0700
+++ linux-2.6/kernel/rcutorture.c	2008-05-21 21:46:19.000000000 -0700
@@ -442,8 +442,8 @@
 		       torture_type, TORTURE_FLAG, idx);
 	for_each_possible_cpu(cpu) {
 		cnt += sprintf(&page[cnt], " %d(%d,%d)", cpu,
-			       per_cpu_ptr(srcu_ctl.per_cpu_ref, cpu)->c[!idx],
-			       per_cpu_ptr(srcu_ctl.per_cpu_ref, cpu)->c[idx]);
+			       CPU_PTR(srcu_ctl.per_cpu_ref, cpu)->c[!idx],
+			       CPU_PTR(srcu_ctl.per_cpu_ref, cpu)->c[idx]);
 	}
 	cnt += sprintf(&page[cnt], "\n");
 	return cnt;
Index: linux-2.6/kernel/srcu.c
===================================================================
--- linux-2.6.orig/kernel/srcu.c	2008-02-16 20:28:44.000000000 -0800
+++ linux-2.6/kernel/srcu.c	2008-05-21 21:46:19.000000000 -0700
@@ -46,7 +46,8 @@
 {
 	sp->completed = 0;
 	mutex_init(&sp->mutex);
-	sp->per_cpu_ref = alloc_percpu(struct srcu_struct_array);
+	sp->per_cpu_ref = CPU_ALLOC(struct srcu_struct_array,
+						GFP_KERNEL|__GFP_ZERO);
 	return (sp->per_cpu_ref ? 0 : -ENOMEM);
 }
 
@@ -62,7 +63,7 @@
 
 	sum = 0;
 	for_each_possible_cpu(cpu)
-		sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
+		sum += CPU_PTR(sp->per_cpu_ref, cpu)->c[idx];
 	return sum;
 }
 
@@ -94,7 +95,7 @@
 	WARN_ON(sum);  /* Leakage unless caller handles error. */
 	if (sum != 0)
 		return;
-	free_percpu(sp->per_cpu_ref);
+	CPU_FREE(sp->per_cpu_ref);
 	sp->per_cpu_ref = NULL;
 }
 
@@ -110,12 +111,9 @@
 {
 	int idx;
 
-	preempt_disable();
 	idx = sp->completed & 0x1;
-	barrier();  /* ensure compiler looks -once- at sp->completed. */
-	per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
-	srcu_barrier();  /* ensure compiler won't misorder critical section. */
-	preempt_enable();
+	srcu_barrier();
+	_CPU_INC(sp->per_cpu_ref->c[idx]);
 	return idx;
 }
 
@@ -131,10 +129,8 @@
  */
 void srcu_read_unlock(struct srcu_struct *sp, int idx)
 {
-	preempt_disable();
-	srcu_barrier();  /* ensure compiler won't misorder critical section. */
-	per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]--;
-	preempt_enable();
+	srcu_barrier();
+	_CPU_DEC(sp->per_cpu_ref->c[idx]);
 }
 
 /**

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 12/41] cpu alloc: XFS counter conversion.
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (10 preceding siblings ...)
  2008-05-30  3:56 ` [patch 11/41] cpu alloc: SRCU cpu alloc conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 13/41] cpu alloc: NFS statistics Christoph Lameter
                   ` (29 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_xfs_counter_conversion --]
[-- Type: text/plain, Size: 2829 bytes --]

Also remove the now redundant zeroing after allocation: allocpercpu() already
returned zeroed objects, and the CPU_ALLOC() call below passes __GFP_ZERO to
get the same behaviour.
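
The before/after of that zeroing, condensed from the hunks in
xfs_icsb_init_counters() below:

	/* Before: allocate, then clear every online cpu's copy by hand */
	mp->m_sb_cnts = alloc_percpu(xfs_icsb_cnts_t);
	for_each_online_cpu(i) {
		cntp = (xfs_icsb_cnts_t *)per_cpu_ptr(mp->m_sb_cnts, i);
		memset(cntp, 0, sizeof(xfs_icsb_cnts_t));
	}

	/* After: __GFP_ZERO hands back already cleared per cpu memory */
	mp->m_sb_cnts = CPU_ALLOC(xfs_icsb_cnts_t, GFP_KERNEL | __GFP_ZERO);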

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/xfs/xfs_mount.c |   24 ++++++++----------------
 1 file changed, 8 insertions(+), 16 deletions(-)

Index: linux-2.6/fs/xfs/xfs_mount.c
===================================================================
--- linux-2.6.orig/fs/xfs/xfs_mount.c	2008-05-08 22:10:42.000000000 -0700
+++ linux-2.6/fs/xfs/xfs_mount.c	2008-05-21 21:46:27.000000000 -0700
@@ -2016,7 +2016,7 @@
 
 	mp = (xfs_mount_t *)container_of(nfb, xfs_mount_t, m_icsb_notifier);
 	cntp = (xfs_icsb_cnts_t *)
-			per_cpu_ptr(mp->m_sb_cnts, (unsigned long)hcpu);
+			CPU_PTR(mp->m_sb_cnts, (unsigned long)hcpu);
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
@@ -2065,10 +2065,7 @@
 xfs_icsb_init_counters(
 	xfs_mount_t	*mp)
 {
-	xfs_icsb_cnts_t *cntp;
-	int		i;
-
-	mp->m_sb_cnts = alloc_percpu(xfs_icsb_cnts_t);
+	mp->m_sb_cnts = CPU_ALLOC(xfs_icsb_cnts_t, GFP_KERNEL | __GFP_ZERO);
 	if (mp->m_sb_cnts == NULL)
 		return -ENOMEM;
 
@@ -2078,11 +2075,6 @@
 	register_hotcpu_notifier(&mp->m_icsb_notifier);
 #endif /* CONFIG_HOTPLUG_CPU */
 
-	for_each_online_cpu(i) {
-		cntp = (xfs_icsb_cnts_t *)per_cpu_ptr(mp->m_sb_cnts, i);
-		memset(cntp, 0, sizeof(xfs_icsb_cnts_t));
-	}
-
 	mutex_init(&mp->m_icsb_mutex);
 
 	/*
@@ -2115,7 +2107,7 @@
 {
 	if (mp->m_sb_cnts) {
 		unregister_hotcpu_notifier(&mp->m_icsb_notifier);
-		free_percpu(mp->m_sb_cnts);
+		CPU_FREE(mp->m_sb_cnts);
 	}
 	mutex_destroy(&mp->m_icsb_mutex);
 }
@@ -2145,7 +2137,7 @@
 	int		i;
 
 	for_each_online_cpu(i) {
-		cntp = (xfs_icsb_cnts_t *)per_cpu_ptr(mp->m_sb_cnts, i);
+		cntp = (xfs_icsb_cnts_t *)CPU_PTR(mp->m_sb_cnts, i);
 		xfs_icsb_lock_cntr(cntp);
 	}
 }
@@ -2158,7 +2150,7 @@
 	int		i;
 
 	for_each_online_cpu(i) {
-		cntp = (xfs_icsb_cnts_t *)per_cpu_ptr(mp->m_sb_cnts, i);
+		cntp = (xfs_icsb_cnts_t *)CPU_PTR(mp->m_sb_cnts, i);
 		xfs_icsb_unlock_cntr(cntp);
 	}
 }
@@ -2178,7 +2170,7 @@
 		xfs_icsb_lock_all_counters(mp);
 
 	for_each_online_cpu(i) {
-		cntp = (xfs_icsb_cnts_t *)per_cpu_ptr(mp->m_sb_cnts, i);
+		cntp = (xfs_icsb_cnts_t *)CPU_PTR(mp->m_sb_cnts, i);
 		cnt->icsb_icount += cntp->icsb_icount;
 		cnt->icsb_ifree += cntp->icsb_ifree;
 		cnt->icsb_fdblocks += cntp->icsb_fdblocks;
@@ -2254,7 +2246,7 @@
 
 	xfs_icsb_lock_all_counters(mp);
 	for_each_online_cpu(i) {
-		cntp = per_cpu_ptr(mp->m_sb_cnts, i);
+		cntp = CPU_PTR(mp->m_sb_cnts, i);
 		switch (field) {
 		case XFS_SBS_ICOUNT:
 			cntp->icsb_icount = count + resid;
@@ -2391,7 +2383,7 @@
 	might_sleep();
 again:
 	cpu = get_cpu();
-	icsbp = (xfs_icsb_cnts_t *)per_cpu_ptr(mp->m_sb_cnts, cpu);
+	icsbp = (xfs_icsb_cnts_t *)CPU_PTR(mp->m_sb_cnts, cpu);
 
 	/*
 	 * if the counter is disabled, go to slow path

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 13/41] cpu alloc: NFS statistics
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (11 preceding siblings ...)
  2008-05-30  3:56 ` [patch 12/41] cpu alloc: XFS counter conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 14/41] cpu alloc: Neighbour statistics Christoph Lameter
                   ` (28 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_nfs_statistics_conversion --]
[-- Type: text/plain, Size: 2562 bytes --]

Convert NFS statistics to cpu alloc.
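
The byte counters are 64 bit, which is why the conversion keeps an explicit
preemption disabled path for 32 bit kernels. The same idea in isolation,
with an invented structure standing in for struct nfs_iostats (the argument
is the per cpu base pointer):

	struct my_iostats {
		unsigned long long bytes;	/* 64 bit even on 32 bit */
	};

	static void my_add_bytes(struct my_iostats *iostats, unsigned long addend)
	{
	#ifdef CONFIG_64BIT
		/* A single preempt safe per cpu add is enough */
		_CPU_ADD(iostats->bytes, addend);
	#else
		/* No atomic 64 bit per cpu add on 32 bit: pin the cpu */
		preempt_disable();
		THIS_CPU(iostats)->bytes += addend;
		preempt_enable_no_resched();
	#endif
	}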

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/nfs/iostat.h |   27 +++++++++++----------------
 fs/nfs/super.c  |    2 +-
 2 files changed, 12 insertions(+), 17 deletions(-)

Index: linux-2.6/fs/nfs/iostat.h
===================================================================
--- linux-2.6.orig/fs/nfs/iostat.h	2008-05-26 09:35:31.077736336 -0700
+++ linux-2.6/fs/nfs/iostat.h	2008-05-26 10:01:26.178776581 -0700
@@ -119,13 +119,7 @@ struct nfs_iostats {
 
 static inline void nfs_inc_server_stats(struct nfs_server *server, enum nfs_stat_eventcounters stat)
 {
-	struct nfs_iostats *iostats;
-	int cpu;
-
-	cpu = get_cpu();
-	iostats = per_cpu_ptr(server->io_stats, cpu);
-	iostats->events[stat] ++;
-	put_cpu_no_resched();
+	_CPU_INC(server->io_stats->events[stat]);
 }
 
 static inline void nfs_inc_stats(struct inode *inode, enum nfs_stat_eventcounters stat)
@@ -135,13 +129,14 @@ static inline void nfs_inc_stats(struct 
 
 static inline void nfs_add_server_stats(struct nfs_server *server, enum nfs_stat_bytecounters stat, unsigned long addend)
 {
-	struct nfs_iostats *iostats;
-	int cpu;
-
-	cpu = get_cpu();
-	iostats = per_cpu_ptr(server->io_stats, cpu);
-	iostats->bytes[stat] += addend;
-	put_cpu_no_resched();
+#ifdef CONFIG_64BIT
+	_CPU_ADD(server->io_stats->bytes[stat], addend);
+#else
+	/* 32 bit cannot perform an atomic 64 bit add, so disable preemption */
+	preempt_disable();
+	THIS_CPU(server->io_stats)->bytes[stat] += addend;
+	preempt_enable_no_resched();
+#endif
 }
 
 static inline void nfs_add_stats(struct inode *inode, enum nfs_stat_bytecounters stat, unsigned long addend)
@@ -151,13 +146,13 @@ static inline void nfs_add_stats(struct 
 
 static inline struct nfs_iostats *nfs_alloc_iostats(void)
 {
-	return alloc_percpu(struct nfs_iostats);
+	return CPU_ALLOC(struct nfs_iostats, GFP_KERNEL | __GFP_ZERO);
 }
 
 static inline void nfs_free_iostats(struct nfs_iostats *stats)
 {
 	if (stats != NULL)
-		free_percpu(stats);
+		CPU_FREE(stats);
 }
 
 #endif
Index: linux-2.6/fs/nfs/super.c
===================================================================
--- linux-2.6.orig/fs/nfs/super.c	2008-05-26 09:35:31.097738734 -0700
+++ linux-2.6/fs/nfs/super.c	2008-05-26 09:36:15.900248119 -0700
@@ -620,7 +620,7 @@ static int nfs_show_stats(struct seq_fil
 		struct nfs_iostats *stats;
 
 		preempt_disable();
-		stats = per_cpu_ptr(nfss->io_stats, cpu);
+		stats = CPU_PTR(nfss->io_stats, cpu);
 
 		for (i = 0; i < __NFSIOS_COUNTSMAX; i++)
 			totals.events[i] += stats->events[i];

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 14/41] cpu alloc: Neighbour statistics
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (12 preceding siblings ...)
  2008-05-30  3:56 ` [patch 13/41] cpu alloc: NFS statistics Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 15/41] cpu_alloc: Convert ip route statistics Christoph Lameter
                   ` (27 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_neighbor_statistics_conversion --]
[-- Type: text/plain, Size: 2301 bytes --]

Convert neighbor stats to cpu alloc.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/net/neighbour.h |    6 +-----
 net/core/neighbour.c    |   11 ++++++-----
 2 files changed, 7 insertions(+), 10 deletions(-)

Index: linux-2.6/include/net/neighbour.h
===================================================================
--- linux-2.6.orig/include/net/neighbour.h	2008-05-21 23:59:14.000000000 -0700
+++ linux-2.6/include/net/neighbour.h	2008-05-22 07:40:37.000000000 -0700
@@ -87,12 +87,7 @@
 	unsigned long forced_gc_runs;	/* number of forced GC runs */
 };
 
-#define NEIGH_CACHE_STAT_INC(tbl, field)				\
-	do {								\
-		preempt_disable();					\
-		(per_cpu_ptr((tbl)->stats, smp_processor_id())->field)++; \
-		preempt_enable();					\
-	} while (0)
+#define NEIGH_CACHE_STAT_INC(tbl, field) _CPU_INC((tbl)->stats->field)
 
 struct neighbour
 {
Index: linux-2.6/net/core/neighbour.c
===================================================================
--- linux-2.6.orig/net/core/neighbour.c	2008-05-21 23:59:14.000000000 -0700
+++ linux-2.6/net/core/neighbour.c	2008-05-22 00:00:06.000000000 -0700
@@ -1425,7 +1425,8 @@
 			kmem_cache_create(tbl->id, tbl->entry_size, 0,
 					  SLAB_HWCACHE_ALIGN|SLAB_PANIC,
 					  NULL);
-	tbl->stats = alloc_percpu(struct neigh_statistics);
+	tbl->stats = CPU_ALLOC(struct neigh_statistics,
+					GFP_KERNEL | __GFP_ZERO);
 	if (!tbl->stats)
 		panic("cannot create neighbour cache statistics");
 
@@ -1511,7 +1512,7 @@
 
 	remove_proc_entry(tbl->id, init_net.proc_net_stat);
 
-	free_percpu(tbl->stats);
+	CPU_FREE(tbl->stats);
 	tbl->stats = NULL;
 
 	kmem_cache_destroy(tbl->kmem_cachep);
@@ -1769,7 +1770,7 @@
 		for_each_possible_cpu(cpu) {
 			struct neigh_statistics	*st;
 
-			st = per_cpu_ptr(tbl->stats, cpu);
+			st = CPU_PTR(tbl->stats, cpu);
 			ndst.ndts_allocs		+= st->allocs;
 			ndst.ndts_destroys		+= st->destroys;
 			ndst.ndts_hash_grows		+= st->hash_grows;
@@ -2429,7 +2430,7 @@
 		if (!cpu_possible(cpu))
 			continue;
 		*pos = cpu+1;
-		return per_cpu_ptr(tbl->stats, cpu);
+		return CPU_PTR(tbl->stats, cpu);
 	}
 	return NULL;
 }
@@ -2444,7 +2445,7 @@
 		if (!cpu_possible(cpu))
 			continue;
 		*pos = cpu+1;
-		return per_cpu_ptr(tbl->stats, cpu);
+		return CPU_PTR(tbl->stats, cpu);
 	}
 	return NULL;
 }

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 15/41] cpu_alloc: Convert ip route statistics
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (13 preceding siblings ...)
  2008-05-30  3:56 ` [patch 14/41] cpu alloc: Neighbour statistics Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 16/41] cpu alloc: Tcp statistics conversion Christoph Lameter
                   ` (26 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_ip_rt_act_conversion --]
[-- Type: text/plain, Size: 1472 bytes --]

Convert IP route stats to cpu alloc.
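
This one allocates an array of 256 entries per cpu rather than a single
object, so it uses the raw cpu_alloc() call with an explicit size and
alignment instead of the CPU_ALLOC() type macro. The pattern in isolation,
with an invented element type:

	struct my_acct {
		u32 o_bytes;
		u32 o_packets;
	};

	static struct my_acct *my_acct;		/* 256 entries per cpu */

	static int my_acct_init(void)
	{
		my_acct = cpu_alloc(256 * sizeof(struct my_acct),
				GFP_KERNEL | __GFP_ZERO,
				__alignof__(struct my_acct));
		return my_acct ? 0 : -ENOMEM;
	}

	static void my_acct_charge(u32 idx, unsigned int len)
	{
		/* Index into this cpu's copy; BHs are assumed off here */
		struct my_acct *st = THIS_CPU(my_acct);

		st[idx & 0xFF].o_packets++;
		st[idx & 0xFF].o_bytes += len;
	}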

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6/net/ipv4/ip_input.c
===================================================================
--- linux-2.6.orig/net/ipv4/ip_input.c	2008-05-21 22:37:24.000000000 -0700
+++ linux-2.6/net/ipv4/ip_input.c	2008-05-21 22:38:39.000000000 -0700
@@ -345,7 +345,7 @@
 
 #ifdef CONFIG_NET_CLS_ROUTE
 	if (unlikely(skb->dst->tclassid)) {
-		struct ip_rt_acct *st = per_cpu_ptr(ip_rt_acct, smp_processor_id());
+		struct ip_rt_acct *st = THIS_CPU(ip_rt_acct);
 		u32 idx = skb->dst->tclassid;
 		st[idx&0xFF].o_packets++;
 		st[idx&0xFF].o_bytes+=skb->len;
Index: linux-2.6/net/ipv4/route.c
===================================================================
--- linux-2.6.orig/net/ipv4/route.c	2008-05-21 22:37:24.000000000 -0700
+++ linux-2.6/net/ipv4/route.c	2008-05-21 22:38:39.000000000 -0700
@@ -534,7 +534,7 @@
 			unsigned int j;
 			u32 *src;
 
-			src = ((u32 *) per_cpu_ptr(ip_rt_acct, i)) + offset;
+			src = ((u32 *) CPU_PTR(ip_rt_acct, i)) + offset;
 			for (j = 0; j < length/4; j++)
 				dst[j] += src[j];
 		}
@@ -3035,7 +3035,8 @@
 			     (jiffies ^ (jiffies >> 7))));
 
 #ifdef CONFIG_NET_CLS_ROUTE
-	ip_rt_acct = __alloc_percpu(256 * sizeof(struct ip_rt_acct));
+	ip_rt_acct = cpu_alloc(256 * sizeof(struct ip_rt_acct),
+		GFP_KERNEL|__GFP_ZERO, __alignof__(struct ip_rt_acct));
 	if (!ip_rt_acct)
 		panic("IP: failed to allocate ip_rt_acct\n");
 #endif

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 16/41] cpu alloc: Tcp statistics conversion
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (14 preceding siblings ...)
  2008-05-30  3:56 ` [patch 15/41] cpu_alloc: Convert ip route statistics Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 17/41] cpu alloc: Convert scratches to cpu alloc Christoph Lameter
                   ` (25 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_tcp_statistics_conversion --]
[-- Type: text/plain, Size: 1462 bytes --]

Convert tcp statistics to cpu alloc.
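
The md5sig pool is a per cpu array of pointers, so CPU_PTR() is dereferenced
once more to reach the slot that holds each cpu's separately kmalloc'ed
object. A sketch of that pattern with invented names (zeroing the slots
keeps the unwind path safe if one of the kmallocs fails):

	struct my_obj {
		int dummy;
	};

	static struct my_obj **my_pool_alloc(void)
	{
		int cpu;
		struct my_obj **pool = CPU_ALLOC(struct my_obj *,
						GFP_KERNEL | __GFP_ZERO);

		if (!pool)
			return NULL;

		for_each_possible_cpu(cpu) {
			struct my_obj *p = kzalloc(sizeof(*p), GFP_KERNEL);

			if (!p)
				goto out_free;
			*CPU_PTR(pool, cpu) = p;	/* fill this cpu's slot */
		}
		return pool;

	out_free:
		for_each_possible_cpu(cpu)
			kfree(*CPU_PTR(pool, cpu));
		CPU_FREE(pool);
		return NULL;
	}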

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 net/ipv4/tcp.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Index: linux-2.6/net/ipv4/tcp.c
===================================================================
--- linux-2.6.orig/net/ipv4/tcp.c	2008-04-29 14:55:55.000000000 -0700
+++ linux-2.6/net/ipv4/tcp.c	2008-05-21 21:46:54.000000000 -0700
@@ -2456,7 +2456,7 @@
 {
 	int cpu;
 	for_each_possible_cpu(cpu) {
-		struct tcp_md5sig_pool *p = *per_cpu_ptr(pool, cpu);
+		struct tcp_md5sig_pool *p = *CPU_PTR(pool, cpu);
 		if (p) {
 			if (p->md5_desc.tfm)
 				crypto_free_hash(p->md5_desc.tfm);
@@ -2464,7 +2464,7 @@
 			p = NULL;
 		}
 	}
-	free_percpu(pool);
+	CPU_FREE(pool);
 }
 
 void tcp_free_md5sig_pool(void)
@@ -2488,7 +2488,7 @@
 	int cpu;
 	struct tcp_md5sig_pool **pool;
 
-	pool = alloc_percpu(struct tcp_md5sig_pool *);
+	pool = CPU_ALLOC(struct tcp_md5sig_pool *, GFP_KERNEL);
 	if (!pool)
 		return NULL;
 
@@ -2499,7 +2499,7 @@
 		p = kzalloc(sizeof(*p), GFP_KERNEL);
 		if (!p)
 			goto out_free;
-		*per_cpu_ptr(pool, cpu) = p;
+		*CPU_PTR(pool, cpu) = p;
 
 		hash = crypto_alloc_hash("md5", 0, CRYPTO_ALG_ASYNC);
 		if (!hash || IS_ERR(hash))
@@ -2564,7 +2564,7 @@
 	if (p)
 		tcp_md5sig_users++;
 	spin_unlock_bh(&tcp_md5sig_pool_lock);
-	return (p ? *per_cpu_ptr(p, cpu) : NULL);
+	return (p ? *CPU_PTR(p, cpu) : NULL);
 }
 
 EXPORT_SYMBOL(__tcp_get_md5sig_pool);

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 17/41] cpu alloc: Convert scratches to cpu alloc
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (15 preceding siblings ...)
  2008-05-30  3:56 ` [patch 16/41] cpu alloc: Tcp statistics conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 18/41] cpu alloc: Dmaengine conversion Christoph Lameter
                   ` (24 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_scratches_convert --]
[-- Type: text/plain, Size: 5375 bytes --]

Convert scratch handling in the network stack to cpu alloc.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 net/ipv4/ipcomp.c  |   26 +++++++++++++-------------
 net/ipv6/ipcomp6.c |   26 +++++++++++++-------------
 2 files changed, 26 insertions(+), 26 deletions(-)

Index: linux-2.6/net/ipv4/ipcomp.c
===================================================================
--- linux-2.6.orig/net/ipv4/ipcomp.c	2008-04-29 14:55:55.000000000 -0700
+++ linux-2.6/net/ipv4/ipcomp.c	2008-05-21 21:48:14.000000000 -0700
@@ -47,8 +47,8 @@
 	int dlen = IPCOMP_SCRATCH_SIZE;
 	const u8 *start = skb->data;
 	const int cpu = get_cpu();
-	u8 *scratch = *per_cpu_ptr(ipcomp_scratches, cpu);
-	struct crypto_comp *tfm = *per_cpu_ptr(ipcd->tfms, cpu);
+	u8 *scratch = *CPU_PTR(ipcomp_scratches, cpu);
+	struct crypto_comp *tfm = *CPU_PTR(ipcd->tfms, cpu);
 	int err = crypto_comp_decompress(tfm, start, plen, scratch, &dlen);
 
 	if (err)
@@ -105,8 +105,8 @@
 	int dlen = IPCOMP_SCRATCH_SIZE;
 	u8 *start = skb->data;
 	const int cpu = get_cpu();
-	u8 *scratch = *per_cpu_ptr(ipcomp_scratches, cpu);
-	struct crypto_comp *tfm = *per_cpu_ptr(ipcd->tfms, cpu);
+	u8 *scratch = *CPU_PTR(ipcomp_scratches, cpu);
+	struct crypto_comp *tfm = *CPU_PTR(ipcd->tfms, cpu);
 	int err;
 
 	local_bh_disable();
@@ -254,9 +254,9 @@
 		return;
 
 	for_each_possible_cpu(i)
-		vfree(*per_cpu_ptr(scratches, i));
+		vfree(*CPU_PTR(scratches, i));
 
-	free_percpu(scratches);
+	CPU_FREE(scratches);
 }
 
 static void **ipcomp_alloc_scratches(void)
@@ -267,7 +267,7 @@
 	if (ipcomp_scratch_users++)
 		return ipcomp_scratches;
 
-	scratches = alloc_percpu(void *);
+	scratches = CPU_ALLOC(void *, GFP_KERNEL);
 	if (!scratches)
 		return NULL;
 
@@ -277,7 +277,7 @@
 		void *scratch = vmalloc(IPCOMP_SCRATCH_SIZE);
 		if (!scratch)
 			return NULL;
-		*per_cpu_ptr(scratches, i) = scratch;
+		*CPU_PTR(scratches, i) = scratch;
 	}
 
 	return scratches;
@@ -305,10 +305,10 @@
 		return;
 
 	for_each_possible_cpu(cpu) {
-		struct crypto_comp *tfm = *per_cpu_ptr(tfms, cpu);
+		struct crypto_comp *tfm = *CPU_PTR(tfms, cpu);
 		crypto_free_comp(tfm);
 	}
-	free_percpu(tfms);
+	CPU_FREE(tfms);
 }
 
 static struct crypto_comp **ipcomp_alloc_tfms(const char *alg_name)
@@ -324,7 +324,7 @@
 		struct crypto_comp *tfm;
 
 		tfms = pos->tfms;
-		tfm = *per_cpu_ptr(tfms, cpu);
+		tfm = *CPU_PTR(tfms, cpu);
 
 		if (!strcmp(crypto_comp_name(tfm), alg_name)) {
 			pos->users++;
@@ -340,7 +340,7 @@
 	INIT_LIST_HEAD(&pos->list);
 	list_add(&pos->list, &ipcomp_tfms_list);
 
-	pos->tfms = tfms = alloc_percpu(struct crypto_comp *);
+	pos->tfms = tfms = CPU_ALLOC(struct crypto_comp *, GFP_KERNEL);
 	if (!tfms)
 		goto error;
 
@@ -349,7 +349,7 @@
 							    CRYPTO_ALG_ASYNC);
 		if (IS_ERR(tfm))
 			goto error;
-		*per_cpu_ptr(tfms, cpu) = tfm;
+		*CPU_PTR(tfms, cpu) = tfm;
 	}
 
 	return tfms;
Index: linux-2.6/net/ipv6/ipcomp6.c
===================================================================
--- linux-2.6.orig/net/ipv6/ipcomp6.c	2008-04-29 14:55:55.000000000 -0700
+++ linux-2.6/net/ipv6/ipcomp6.c	2008-05-21 21:47:09.000000000 -0700
@@ -90,8 +90,8 @@
 	start = skb->data;
 
 	cpu = get_cpu();
-	scratch = *per_cpu_ptr(ipcomp6_scratches, cpu);
-	tfm = *per_cpu_ptr(ipcd->tfms, cpu);
+	scratch = *CPU_PTR(ipcomp6_scratches, cpu);
+	tfm = *CPU_PTR(ipcd->tfms, cpu);
 
 	err = crypto_comp_decompress(tfm, start, plen, scratch, &dlen);
 	if (err)
@@ -142,8 +142,8 @@
 	start = skb->data;
 
 	cpu = get_cpu();
-	scratch = *per_cpu_ptr(ipcomp6_scratches, cpu);
-	tfm = *per_cpu_ptr(ipcd->tfms, cpu);
+	scratch = *CPU_PTR(ipcomp6_scratches, cpu);
+	tfm = *CPU_PTR(ipcd->tfms, cpu);
 
 	local_bh_disable();
 	err = crypto_comp_compress(tfm, start, plen, scratch, &dlen);
@@ -264,12 +264,12 @@
 		return;
 
 	for_each_possible_cpu(i) {
-		void *scratch = *per_cpu_ptr(scratches, i);
+		void *scratch = *CPU_PTR(scratches, i);
 
 		vfree(scratch);
 	}
 
-	free_percpu(scratches);
+	CPU_FREE(scratches);
 }
 
 static void **ipcomp6_alloc_scratches(void)
@@ -280,7 +280,7 @@
 	if (ipcomp6_scratch_users++)
 		return ipcomp6_scratches;
 
-	scratches = alloc_percpu(void *);
+	scratches = CPU_ALLOC(void *, GFP_KERNEL);
 	if (!scratches)
 		return NULL;
 
@@ -290,7 +290,7 @@
 		void *scratch = vmalloc(IPCOMP_SCRATCH_SIZE);
 		if (!scratch)
 			return NULL;
-		*per_cpu_ptr(scratches, i) = scratch;
+		*CPU_PTR(scratches, i) = scratch;
 	}
 
 	return scratches;
@@ -318,10 +318,10 @@
 		return;
 
 	for_each_possible_cpu(cpu) {
-		struct crypto_comp *tfm = *per_cpu_ptr(tfms, cpu);
+		struct crypto_comp *tfm = *CPU_PTR(tfms, cpu);
 		crypto_free_comp(tfm);
 	}
-	free_percpu(tfms);
+	CPU_FREE(tfms);
 }
 
 static struct crypto_comp **ipcomp6_alloc_tfms(const char *alg_name)
@@ -337,7 +337,7 @@
 		struct crypto_comp *tfm;
 
 		tfms = pos->tfms;
-		tfm = *per_cpu_ptr(tfms, cpu);
+		tfm = *CPU_PTR(tfms, cpu);
 
 		if (!strcmp(crypto_comp_name(tfm), alg_name)) {
 			pos->users++;
@@ -353,7 +353,7 @@
 	INIT_LIST_HEAD(&pos->list);
 	list_add(&pos->list, &ipcomp6_tfms_list);
 
-	pos->tfms = tfms = alloc_percpu(struct crypto_comp *);
+	pos->tfms = tfms = CPU_ALLOC(struct crypto_comp *, GFP_KERNEL);
 	if (!tfms)
 		goto error;
 
@@ -362,7 +362,7 @@
 							    CRYPTO_ALG_ASYNC);
 		if (IS_ERR(tfm))
 			goto error;
-		*per_cpu_ptr(tfms, cpu) = tfm;
+		*CPU_PTR(tfms, cpu) = tfm;
 	}
 
 	return tfms;

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 18/41] cpu alloc: Dmaengine conversion
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (16 preceding siblings ...)
  2008-05-30  3:56 ` [patch 17/41] cpu alloc: Convert scratches to cpu alloc Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 19/41] cpu alloc: Convert loopback statistics Christoph Lameter
                   ` (23 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_dmaengine_conversion --]
[-- Type: text/plain, Size: 4634 bytes --]

Convert DMA engine to use CPU_xx operations. This also removes the use of local_t
from the dmaengine.
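
With cpu ops the bigref counter no longer needs local_t: a plain int per cpu
plus _CPU_INC/_CPU_DEC gives the same cheap cpu local reference counting,
and CPU_PTR() serves for the slow path fold. The idea in isolation, with
invented names:

	struct my_chan_percpu {
		int refcount;			/* was local_t before this series */
	};

	struct my_chan {
		struct my_chan_percpu *local;	/* per cpu base pointer */
	};

	static void my_chan_get(struct my_chan *chan)
	{
		/* Fast path: preempt safe increment of this cpu's counter */
		_CPU_INC(chan->local->refcount);
	}

	static void my_chan_put(struct my_chan *chan)
	{
		_CPU_DEC(chan->local->refcount);
	}

	static int my_chan_refs(struct my_chan *chan)
	{
		int cpu, refs = 0;

		/* Slow path: fold the counters of all possible cpus */
		for_each_possible_cpu(cpu)
			refs += CPU_PTR(chan->local, cpu)->refcount;
		return refs;
	}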

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 drivers/dma/dmaengine.c   |   38 ++++++++++++++------------------------
 include/linux/dmaengine.h |   16 ++++++----------
 2 files changed, 20 insertions(+), 34 deletions(-)

Index: linux-2.6/drivers/dma/dmaengine.c
===================================================================
--- linux-2.6.orig/drivers/dma/dmaengine.c	2008-04-29 14:55:49.000000000 -0700
+++ linux-2.6/drivers/dma/dmaengine.c	2008-05-21 21:48:25.000000000 -0700
@@ -84,7 +84,7 @@
 	int i;
 
 	for_each_possible_cpu(i)
-		count += per_cpu_ptr(chan->local, i)->memcpy_count;
+		count += CPU_PTR(chan->local, i)->memcpy_count;
 
 	return sprintf(buf, "%lu\n", count);
 }
@@ -97,7 +97,7 @@
 	int i;
 
 	for_each_possible_cpu(i)
-		count += per_cpu_ptr(chan->local, i)->bytes_transferred;
+		count += CPU_PTR(chan->local, i)->bytes_transferred;
 
 	return sprintf(buf, "%lu\n", count);
 }
@@ -111,10 +111,8 @@
 		atomic_read(&chan->refcount.refcount) > 1)
 		in_use = 1;
 	else {
-		if (local_read(&(per_cpu_ptr(chan->local,
-			get_cpu())->refcount)) > 0)
+		if (_CPU_READ(chan->local->refcount) > 0)
 			in_use = 1;
-		put_cpu();
 	}
 
 	return sprintf(buf, "%d\n", in_use);
@@ -227,7 +225,7 @@
 	int bias = 0x7FFFFFFF;
 	int i;
 	for_each_possible_cpu(i)
-		bias -= local_read(&per_cpu_ptr(chan->local, i)->refcount);
+		bias -= CPU_PTR(chan->local, i)->refcount;
 	atomic_sub(bias, &chan->refcount.refcount);
 	kref_put(&chan->refcount, dma_chan_cleanup);
 }
@@ -372,7 +370,8 @@
 
 	/* represent channels in sysfs. Probably want devs too */
 	list_for_each_entry(chan, &device->channels, device_node) {
-		chan->local = alloc_percpu(typeof(*chan->local));
+		chan->local = CPU_ALLOC(typeof(*chan->local),
+					GFP_KERNEL | __GFP_ZERO);
 		if (chan->local == NULL)
 			continue;
 
@@ -385,7 +384,7 @@
 		rc = device_register(&chan->dev);
 		if (rc) {
 			chancnt--;
-			free_percpu(chan->local);
+			CPU_FREE(chan->local);
 			chan->local = NULL;
 			goto err_out;
 		}
@@ -413,7 +412,7 @@
 		kref_put(&device->refcount, dma_async_device_cleanup);
 		device_unregister(&chan->dev);
 		chancnt--;
-		free_percpu(chan->local);
+		CPU_FREE(chan->local);
 	}
 	return rc;
 }
@@ -490,11 +489,8 @@
 	tx->callback = NULL;
 	cookie = tx->tx_submit(tx);
 
-	cpu = get_cpu();
-	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
-	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
-	put_cpu();
-
+	__CPU_ADD(chan->local->bytes_transferred, len);
+	__CPU_INC(chan->local->memcpy_count);
 	return cookie;
 }
 EXPORT_SYMBOL(dma_async_memcpy_buf_to_buf);
@@ -536,11 +532,8 @@
 	tx->callback = NULL;
 	cookie = tx->tx_submit(tx);
 
-	cpu = get_cpu();
-	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
-	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
-	put_cpu();
-
+	_CPU_ADD(chan->local->bytes_transferred, len);
+	_CPU_INC(chan->local->memcpy_count);
 	return cookie;
 }
 EXPORT_SYMBOL(dma_async_memcpy_buf_to_pg);
@@ -585,11 +578,8 @@
 	tx->callback = NULL;
 	cookie = tx->tx_submit(tx);
 
-	cpu = get_cpu();
-	per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
-	per_cpu_ptr(chan->local, cpu)->memcpy_count++;
-	put_cpu();
-
+	_CPU_ADD(chan->local->bytes_transferred, len);
+	_CPU_INC(chan->local->memcpy_count);
 	return cookie;
 }
 EXPORT_SYMBOL(dma_async_memcpy_pg_to_pg);
Index: linux-2.6/include/linux/dmaengine.h
===================================================================
--- linux-2.6.orig/include/linux/dmaengine.h	2008-04-29 14:55:54.000000000 -0700
+++ linux-2.6/include/linux/dmaengine.h	2008-05-21 21:48:25.000000000 -0700
@@ -116,13 +116,13 @@
 
 /**
  * struct dma_chan_percpu - the per-CPU part of struct dma_chan
- * @refcount: local_t used for open-coded "bigref" counting
+ * @refcount: int used for open-coded "bigref" counting
  * @memcpy_count: transaction counter
  * @bytes_transferred: byte counter
  */
 
 struct dma_chan_percpu {
-	local_t refcount;
+	int refcount;
 	/* stats */
 	unsigned long memcpy_count;
 	unsigned long bytes_transferred;
@@ -164,20 +164,16 @@
 {
 	if (unlikely(chan->slow_ref))
 		kref_get(&chan->refcount);
-	else {
-		local_inc(&(per_cpu_ptr(chan->local, get_cpu())->refcount));
-		put_cpu();
-	}
+	else
+		_CPU_INC(chan->local->refcount);
 }
 
 static inline void dma_chan_put(struct dma_chan *chan)
 {
 	if (unlikely(chan->slow_ref))
 		kref_put(&chan->refcount, dma_chan_cleanup);
-	else {
-		local_dec(&(per_cpu_ptr(chan->local, get_cpu())->refcount));
-		put_cpu();
-	}
+	else
+		_CPU_DEC(chan->local->refcount);
 }
 
 /*

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 19/41] cpu alloc: Convert loopback statistics
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (17 preceding siblings ...)
  2008-05-30  3:56 ` [patch 18/41] cpu alloc: Dmaengine conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 20/41] cpu alloc: Veth conversion Christoph Lameter
                   ` (22 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_loopback_statistics_conversion --]
[-- Type: text/plain, Size: 1672 bytes --]

Convert loopback statistics to cpu alloc.
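
The comment change below is the interesting part: with bottom halves off the
cpu cannot change, so the cheaper __CPU_xx forms are sufficient, while the
_CPU_xx forms used elsewhere in this series replace the old
preempt_disable()/preempt_enable() pairs. The distinction in isolation, with
an invented statistics structure:

	struct my_lstats {
		unsigned long packets;
		unsigned long bytes;
	};

	/* Transmit path: BHs are off, the cpu cannot change under us */
	static void my_xmit_account(struct my_lstats *lstats, unsigned int len)
	{
		__CPU_ADD(lstats->bytes, len);
		__CPU_INC(lstats->packets);
	}

	/* Arbitrary context: the _ forms cover the preemption window */
	static void my_account(struct my_lstats *lstats, unsigned int len)
	{
		_CPU_ADD(lstats->bytes, len);
		_CPU_INC(lstats->packets);
	}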

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 drivers/net/loopback.c |   14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

Index: linux-2.6/drivers/net/loopback.c
===================================================================
--- linux-2.6.orig/drivers/net/loopback.c	2008-05-28 22:02:18.000000000 -0700
+++ linux-2.6/drivers/net/loopback.c	2008-05-28 23:14:11.000000000 -0700
@@ -132,7 +132,7 @@
  */
 static int loopback_xmit(struct sk_buff *skb, struct net_device *dev)
 {
-	struct pcpu_lstats *pcpu_lstats, *lb_stats;
+	struct pcpu_lstats *pcpu_lstats;
 
 	skb_orphan(skb);
 
@@ -152,11 +152,10 @@
 #endif
 	dev->last_rx = jiffies;
 
-	/* it's OK to use per_cpu_ptr() because BHs are off */
+	/* it's OK to use __xxx cpu operations because BHs are off */
 	pcpu_lstats = netdev_priv(dev);
-	lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
-	lb_stats->bytes += skb->len;
-	lb_stats->packets++;
+	__CPU_ADD(pcpu_lstats->bytes, skb->len);
+	__CPU_INC(pcpu_lstats->packets);
 
 	netif_rx(skb);
 
@@ -175,7 +174,7 @@
 	for_each_possible_cpu(i) {
 		const struct pcpu_lstats *lb_stats;
 
-		lb_stats = per_cpu_ptr(pcpu_lstats, i);
+		lb_stats = CPU_PTR(pcpu_lstats, i);
 		bytes   += lb_stats->bytes;
 		packets += lb_stats->packets;
 	}
@@ -203,7 +202,7 @@
 {
 	struct pcpu_lstats *lstats;
 
-	lstats = alloc_percpu(struct pcpu_lstats);
+	lstats = CPU_ALLOC(struct pcpu_lstats, GFP_KERNEL | __GFP_ZERO);
 	if (!lstats)
 		return -ENOMEM;
 
@@ -215,7 +214,7 @@
 {
 	struct pcpu_lstats *lstats = netdev_priv(dev);
 
-	free_percpu(lstats);
+	CPU_FREE(lstats);
 	free_netdev(dev);
 }
 

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 20/41] cpu alloc: Veth conversion
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (18 preceding siblings ...)
  2008-05-30  3:56 ` [patch 19/41] cpu alloc: Convert loopback statistics Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 21/41] cpu alloc: Chelsio statistics conversion Christoph Lameter
                   ` (21 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_veth_conversion --]
[-- Type: text/plain, Size: 1948 bytes --]

Convert veth statistics to cpu alloc.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 drivers/net/veth.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Index: linux-2.6/drivers/net/veth.c
===================================================================
--- linux-2.6.orig/drivers/net/veth.c	2008-05-28 22:02:18.000000000 -0700
+++ linux-2.6/drivers/net/veth.c	2008-05-28 23:18:47.000000000 -0700
@@ -152,8 +152,7 @@
 {
 	struct net_device *rcv = NULL;
 	struct veth_priv *priv, *rcv_priv;
-	struct veth_net_stats *stats;
-	int length, cpu;
+	int length;
 
 	skb_orphan(skb);
 
@@ -161,9 +160,6 @@
 	rcv = priv->peer;
 	rcv_priv = netdev_priv(rcv);
 
-	cpu = smp_processor_id();
-	stats = per_cpu_ptr(priv->stats, cpu);
-
 	if (!(rcv->flags & IFF_UP))
 		goto outf;
 
@@ -180,19 +176,18 @@
 
 	length = skb->len;
 
-	stats->tx_bytes += length;
-	stats->tx_packets++;
+	__CPU_ADD(priv->stats->tx_bytes, length);
+	__CPU_INC(priv->stats->tx_packets);
 
-	stats = per_cpu_ptr(rcv_priv->stats, cpu);
-	stats->rx_bytes += length;
-	stats->rx_packets++;
+	__CPU_ADD(rcv_priv->stats->rx_bytes, length);
+	__CPU_INC(rcv_priv->stats->rx_packets);
 
 	netif_rx(skb);
 	return 0;
 
 outf:
 	kfree_skb(skb);
-	stats->tx_dropped++;
+	__CPU_INC(priv->stats->tx_dropped);
 	return 0;
 }
 
@@ -217,7 +212,7 @@
 	dev_stats->tx_dropped = 0;
 
 	for_each_online_cpu(cpu) {
-		stats = per_cpu_ptr(priv->stats, cpu);
+		stats = CPU_PTR(priv->stats, cpu);
 
 		dev_stats->rx_packets += stats->rx_packets;
 		dev_stats->tx_packets += stats->tx_packets;
@@ -249,7 +244,7 @@
 	struct veth_net_stats *stats;
 	struct veth_priv *priv;
 
-	stats = alloc_percpu(struct veth_net_stats);
+	stats = CPU_ALLOC(struct veth_net_stats, GFP_KERNEL | __GFP_ZERO);
 	if (stats == NULL)
 		return -ENOMEM;
 
@@ -263,7 +258,7 @@
 	struct veth_priv *priv;
 
 	priv = netdev_priv(dev);
-	free_percpu(priv->stats);
+	CPU_FREE(priv->stats);
 	free_netdev(dev);
 }
 

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 21/41] cpu alloc: Chelsio statistics conversion
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (19 preceding siblings ...)
  2008-05-30  3:56 ` [patch 20/41] cpu alloc: Veth conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 22/41] cpu alloc: Convert network sockets inuse counter Christoph Lameter
                   ` (20 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_chelsio_statistics_conversion --]
[-- Type: text/plain, Size: 2793 bytes --]

Convert chelsio statistics to cpu alloc.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 drivers/net/chelsio/sge.c |   17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

Index: linux-2.6/drivers/net/chelsio/sge.c
===================================================================
--- linux-2.6.orig/drivers/net/chelsio/sge.c	2008-05-28 22:02:18.000000000 -0700
+++ linux-2.6/drivers/net/chelsio/sge.c	2008-05-28 23:22:28.000000000 -0700
@@ -809,7 +809,7 @@
 	int i;
 
 	for_each_port(sge->adapter, i)
-		free_percpu(sge->port_stats[i]);
+		CPU_FREE(sge->port_stats[i]);
 
 	kfree(sge->tx_sched);
 	free_tx_resources(sge);
@@ -988,7 +988,7 @@
 
 	memset(ss, 0, sizeof(*ss));
 	for_each_possible_cpu(cpu) {
-		struct sge_port_stats *st = per_cpu_ptr(sge->port_stats[port], cpu);
+		struct sge_port_stats *st = CPU_PTR(sge->port_stats[port], cpu);
 
 		ss->rx_cso_good += st->rx_cso_good;
 		ss->tx_cso += st->tx_cso;
@@ -1367,7 +1367,6 @@
 	struct sk_buff *skb;
 	const struct cpl_rx_pkt *p;
 	struct adapter *adapter = sge->adapter;
-	struct sge_port_stats *st;
 
 	skb = get_packet(adapter->pdev, fl, len - sge->rx_pkt_pad);
 	if (unlikely(!skb)) {
@@ -1382,20 +1381,18 @@
 	}
 	__skb_pull(skb, sizeof(*p));
 
-	st = per_cpu_ptr(sge->port_stats[p->iff], smp_processor_id());
-
 	skb->protocol = eth_type_trans(skb, adapter->port[p->iff].dev);
 	skb->dev->last_rx = jiffies;
 	if ((adapter->flags & RX_CSUM_ENABLED) && p->csum == 0xffff &&
 	    skb->protocol == htons(ETH_P_IP) &&
 	    (skb->data[9] == IPPROTO_TCP || skb->data[9] == IPPROTO_UDP)) {
-		++st->rx_cso_good;
+		__CPU_INC(sge->port_stats[p->iff]->rx_cso_good);
 		skb->ip_summed = CHECKSUM_UNNECESSARY;
 	} else
 		skb->ip_summed = CHECKSUM_NONE;
 
 	if (unlikely(adapter->vlan_grp && p->vlan_valid)) {
-		st->vlan_xtract++;
+		__CPU_INC(sge->port_stats[p->iff]->vlan_xtract);
 #ifdef CONFIG_CHELSIO_T1_NAPI
 			vlan_hwaccel_receive_skb(skb, adapter->vlan_grp,
 						 ntohs(p->vlan));
@@ -1848,8 +1845,7 @@
 {
 	struct adapter *adapter = dev->priv;
 	struct sge *sge = adapter->sge;
-	struct sge_port_stats *st = per_cpu_ptr(sge->port_stats[dev->if_port],
-						smp_processor_id());
+	struct sge_port_stats *st = THIS_CPU(sge->port_stats[dev->if_port]);
 	struct cpl_tx_pkt *cpl;
 	struct sk_buff *orig_skb = skb;
 	int ret;
@@ -2159,7 +2155,8 @@
 	sge->jumbo_fl = t1_is_T1B(adapter) ? 1 : 0;
 
 	for_each_port(adapter, i) {
-		sge->port_stats[i] = alloc_percpu(struct sge_port_stats);
+		sge->port_stats[i] = CPU_ALLOC(struct sge_port_stats,
+					GFP_KERNEL | __GFP_ZERO);
 		if (!sge->port_stats[i])
 			goto nomem_port;
 	}
@@ -2203,7 +2200,7 @@
 	return sge;
 nomem_port:
 	while (i >= 0) {
-		free_percpu(sge->port_stats[i]);
+		CPU_FREE(sge->port_stats[i]);
 		--i;
 	}
 	kfree(sge);

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 22/41] cpu alloc: Convert network sockets inuse counter
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (20 preceding siblings ...)
  2008-05-30  3:56 ` [patch 21/41] cpu alloc: Chelsio statistics conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 23/41] cpu alloc: Use it for infiniband Christoph Lameter
                   ` (19 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_network_sockets_conversion --]
[-- Type: text/plain, Size: 1383 bytes --]

Convert handling of the inuse counters to cpu alloc.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 net/core/sock.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6/net/core/sock.c
===================================================================
--- linux-2.6.orig/net/core/sock.c	2008-05-14 19:40:34.000000000 -0700
+++ linux-2.6/net/core/sock.c	2008-05-21 22:00:56.000000000 -0700
@@ -1943,8 +1943,7 @@
 #ifdef CONFIG_NET_NS
 void sock_prot_inuse_add(struct net *net, struct proto *prot, int val)
 {
-	int cpu = smp_processor_id();
-	per_cpu_ptr(net->core.inuse, cpu)->val[prot->inuse_idx] += val;
+	__CPU_ADD(net->core.inuse->val[prot->inuse_idx], val);
 }
 EXPORT_SYMBOL_GPL(sock_prot_inuse_add);
 
@@ -1954,7 +1953,7 @@
 	int res = 0;
 
 	for_each_possible_cpu(cpu)
-		res += per_cpu_ptr(net->core.inuse, cpu)->val[idx];
+		res += CPU_PTR(net->core.inuse, cpu)->val[idx];
 
 	return res >= 0 ? res : 0;
 }
@@ -1962,13 +1961,13 @@
 
 static int sock_inuse_init_net(struct net *net)
 {
-	net->core.inuse = alloc_percpu(struct prot_inuse);
+	net->core.inuse = CPU_ALLOC(struct prot_inuse, GFP_KERNEL|__GFP_ZERO);
 	return net->core.inuse ? 0 : -ENOMEM;
 }
 
 static void sock_inuse_exit_net(struct net *net)
 {
-	free_percpu(net->core.inuse);
+	CPU_FREE(net->core.inuse);
 }
 
 static struct pernet_operations net_inuse_ops = {

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 23/41] cpu alloc: Use it for infiniband
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (21 preceding siblings ...)
  2008-05-30  3:56 ` [patch 22/41] cpu alloc: Convert network sockets inuse counter Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 24/41] cpu alloc: Use in the crypto subsystem Christoph Lameter
                   ` (18 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_infiniband_conversion --]
[-- Type: text/plain, Size: 3233 bytes --]

Use cpu alloc for infiniband.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 drivers/infiniband/hw/ehca/ehca_irq.c |   22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

Index: linux-2.6/drivers/infiniband/hw/ehca/ehca_irq.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/ehca/ehca_irq.c	2008-05-14 19:40:32.000000000 -0700
+++ linux-2.6/drivers/infiniband/hw/ehca/ehca_irq.c	2008-05-21 22:01:32.000000000 -0700
@@ -680,7 +680,7 @@
 	cpu_id = find_next_online_cpu(pool);
 	BUG_ON(!cpu_online(cpu_id));
 
-	cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu_id);
+	cct = CPU_PTR(pool->cpu_comp_tasks, cpu_id);
 	BUG_ON(!cct);
 
 	spin_lock_irqsave(&cct->task_lock, flags);
@@ -688,7 +688,7 @@
 	spin_unlock_irqrestore(&cct->task_lock, flags);
 	if (cq_jobs > 0) {
 		cpu_id = find_next_online_cpu(pool);
-		cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu_id);
+		cct = CPU_PTR(pool->cpu_comp_tasks, cpu_id);
 		BUG_ON(!cct);
 	}
 
@@ -761,7 +761,7 @@
 {
 	struct ehca_cpu_comp_task *cct;
 
-	cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu);
+	cct = CPU_PTR(pool->cpu_comp_tasks, cpu);
 	spin_lock_init(&cct->task_lock);
 	INIT_LIST_HEAD(&cct->cq_list);
 	init_waitqueue_head(&cct->wait_queue);
@@ -777,7 +777,7 @@
 	struct task_struct *task;
 	unsigned long flags_cct;
 
-	cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu);
+	cct = CPU_PTR(pool->cpu_comp_tasks, cpu);
 
 	spin_lock_irqsave(&cct->task_lock, flags_cct);
 
@@ -793,7 +793,7 @@
 
 static void __cpuinit take_over_work(struct ehca_comp_pool *pool, int cpu)
 {
-	struct ehca_cpu_comp_task *cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu);
+	struct ehca_cpu_comp_task *cct = CPU_PTR(pool->cpu_comp_tasks, cpu);
 	LIST_HEAD(list);
 	struct ehca_cq *cq;
 	unsigned long flags_cct;
@@ -806,8 +806,7 @@
 		cq = list_entry(cct->cq_list.next, struct ehca_cq, entry);
 
 		list_del(&cq->entry);
-		__queue_comp_task(cq, per_cpu_ptr(pool->cpu_comp_tasks,
-						  smp_processor_id()));
+		__queue_comp_task(cq, THIS_CPU(pool->cpu_comp_tasks));
 	}
 
 	spin_unlock_irqrestore(&cct->task_lock, flags_cct);
@@ -833,14 +832,14 @@
 	case CPU_UP_CANCELED:
 	case CPU_UP_CANCELED_FROZEN:
 		ehca_gen_dbg("CPU: %x (CPU_CANCELED)", cpu);
-		cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu);
+		cct = CPU_PTR(pool->cpu_comp_tasks, cpu);
 		kthread_bind(cct->task, any_online_cpu(cpu_online_map));
 		destroy_comp_task(pool, cpu);
 		break;
 	case CPU_ONLINE:
 	case CPU_ONLINE_FROZEN:
 		ehca_gen_dbg("CPU: %x (CPU_ONLINE)", cpu);
-		cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu);
+		cct = CPU_PTR(pool->cpu_comp_tasks, cpu);
 		kthread_bind(cct->task, cpu);
 		wake_up_process(cct->task);
 		break;
@@ -883,7 +882,8 @@
 	spin_lock_init(&pool->last_cpu_lock);
 	pool->last_cpu = any_online_cpu(cpu_online_map);
 
-	pool->cpu_comp_tasks = alloc_percpu(struct ehca_cpu_comp_task);
+	pool->cpu_comp_tasks = CPU_ALLOC(struct ehca_cpu_comp_task,
+						GFP_KERNEL | __GFP_ZERO);
 	if (pool->cpu_comp_tasks == NULL) {
 		kfree(pool);
 		return -EINVAL;
@@ -917,6 +917,6 @@
 		if (cpu_online(i))
 			destroy_comp_task(pool, i);
 	}
-	free_percpu(pool->cpu_comp_tasks);
+	CPU_FREE(pool->cpu_comp_tasks);
 	kfree(pool);
 }

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 24/41] cpu alloc: Use in the crypto subsystem.
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (22 preceding siblings ...)
  2008-05-30  3:56 ` [patch 23/41] cpu alloc: Use it for infiniband Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 25/41] cpu alloc: scheduler: Convert cpuusage to cpu_alloc Christoph Lameter
                   ` (17 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_crypto_conversion --]
[-- Type: text/plain, Size: 2239 bytes --]

Use cpu alloc for the crypto subsystem.
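
The channel lookup only needs to read one per cpu pointer, so the
get_cpu()/put_cpu() pair around it collapses into a single _CPU_READ. In
isolation, with invented names (table is a per cpu base pointer returned by
CPU_ALLOC()):

	struct my_ref {
		void *chan;
	};

	struct my_table {
		struct my_ref *ref;	/* filled in per cpu elsewhere */
	};

	static void *my_current_chan(struct my_table *table)
	{
		/* One preempt safe read of this cpu's ref pointer */
		struct my_ref *ref = _CPU_READ(table->ref);

		return ref ? ref->chan : NULL;
	}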

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 crypto/async_tx/async_tx.c |   19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

Index: linux-2.6/crypto/async_tx/async_tx.c
===================================================================
--- linux-2.6.orig/crypto/async_tx/async_tx.c	2008-04-29 14:55:49.000000000 -0700
+++ linux-2.6/crypto/async_tx/async_tx.c	2008-05-21 22:01:42.000000000 -0700
@@ -221,10 +221,10 @@
 	for_each_dma_cap_mask(cap, dma_cap_mask_all)
 		for_each_possible_cpu(cpu) {
 			struct dma_chan_ref *ref =
-				per_cpu_ptr(channel_table[cap], cpu)->ref;
+				CPU_PTR(channel_table[cap], cpu)->ref;
 			if (ref) {
 				atomic_set(&ref->count, 0);
-				per_cpu_ptr(channel_table[cap], cpu)->ref =
+				CPU_PTR(channel_table[cap], cpu)->ref =
 									NULL;
 			}
 		}
@@ -237,7 +237,7 @@
 			else
 				new = get_chan_ref_by_cap(cap, -1);
 
-			per_cpu_ptr(channel_table[cap], cpu)->ref = new;
+			CPU_PTR(channel_table[cap], cpu)->ref = new;
 		}
 
 	spin_unlock_irqrestore(&async_tx_lock, flags);
@@ -341,7 +341,8 @@
 	clear_bit(DMA_INTERRUPT, dma_cap_mask_all.bits);
 
 	for_each_dma_cap_mask(cap, dma_cap_mask_all) {
-		channel_table[cap] = alloc_percpu(struct chan_ref_percpu);
+		channel_table[cap] = CPU_ALLOC(struct chan_ref_percpu,
+						GFP_KERNEL | __GFP_ZERO);
 		if (!channel_table[cap])
 			goto err;
 	}
@@ -357,7 +358,7 @@
 	printk(KERN_ERR "async_tx: initialization failure\n");
 
 	while (--cap >= 0)
-		free_percpu(channel_table[cap]);
+		CPU_FREE(channel_table[cap]);
 
 	return 1;
 }
@@ -370,7 +371,7 @@
 
 	for_each_dma_cap_mask(cap, dma_cap_mask_all)
 		if (channel_table[cap])
-			free_percpu(channel_table[cap]);
+			CPU_FREE(channel_table[cap]);
 
 	dma_async_client_unregister(&async_tx_dma);
 }
@@ -390,10 +391,8 @@
 		dma_has_cap(tx_type, depend_tx->chan->device->cap_mask))
 		return depend_tx->chan;
 	else if (likely(channel_table_initialized)) {
-		struct dma_chan_ref *ref;
-		int cpu = get_cpu();
-		ref = per_cpu_ptr(channel_table[tx_type], cpu)->ref;
-		put_cpu();
+		struct dma_chan_ref *ref =
+			_CPU_READ(channel_table[tx_type]->ref);
 		return ref ? ref->chan : NULL;
 	} else
 		return NULL;

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 25/41] cpu alloc: scheduler: Convert cpuusage to cpu_alloc.
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (23 preceding siblings ...)
  2008-05-30  3:56 ` [patch 24/41] cpu alloc: Use in the crypto subsystem Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 26/41] cpu alloc: Convert mib handling to cpu alloc Christoph Lameter
                   ` (16 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_scheduler_usage_conversion --]
[-- Type: text/plain, Size: 1529 bytes --]

Convert scheduler handling of cpuusage to cpu alloc.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 arch/x86/kernel/setup.c |    2 +-
 kernel/sched.c          |   10 +++++-----
 mm/cpu_alloc.c          |   25 +++++++++++++++++++------
 3 files changed, 25 insertions(+), 12 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c	2008-05-26 20:38:49.000000000 -0700
+++ linux-2.6/kernel/sched.c	2008-05-26 20:38:53.000000000 -0700
@@ -9090,7 +9090,7 @@
 	if (!ca)
 		return ERR_PTR(-ENOMEM);
 
-	ca->cpuusage = alloc_percpu(u64);
+	ca->cpuusage = CPU_ALLOC(u64, GFP_KERNEL|__GFP_ZERO);
 	if (!ca->cpuusage) {
 		kfree(ca);
 		return ERR_PTR(-ENOMEM);
@@ -9105,7 +9105,7 @@
 {
 	struct cpuacct *ca = cgroup_ca(cgrp);
 
-	free_percpu(ca->cpuusage);
+	CPU_FREE(ca->cpuusage);
 	kfree(ca);
 }
 
@@ -9117,7 +9117,7 @@
 	int i;
 
 	for_each_possible_cpu(i) {
-		u64 *cpuusage = percpu_ptr(ca->cpuusage, i);
+		u64 *cpuusage = CPU_PTR(ca->cpuusage, i);
 
 		/*
 		 * Take rq->lock to make 64-bit addition safe on 32-bit
@@ -9144,7 +9144,7 @@
 	}
 
 	for_each_possible_cpu(i) {
-		u64 *cpuusage = percpu_ptr(ca->cpuusage, i);
+		u64 *cpuusage = CPU_PTR(ca->cpuusage, i);
 
 		spin_lock_irq(&cpu_rq(i)->lock);
 		*cpuusage = 0;
@@ -9181,7 +9181,7 @@
 
 	ca = task_ca(tsk);
 	if (ca) {
-		u64 *cpuusage = percpu_ptr(ca->cpuusage, task_cpu(tsk));
+		u64 *cpuusage = CPU_PTR(ca->cpuusage, task_cpu(tsk));
 
 		*cpuusage += cputime;
 	}

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 26/41] cpu alloc: Convert mib handling to cpu alloc
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (24 preceding siblings ...)
  2008-05-30  3:56 ` [patch 25/41] cpu alloc: scheduler: Convert cpuusage to cpu_alloc Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  6:47   ` Eric Dumazet
  2008-05-30  3:56 ` [patch 27/41] cpu alloc: Remove the allocpercpu functionality Christoph Lameter
                   ` (15 subsequent siblings)
  41 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_mib_handling_conversion --]
[-- Type: text/plain, Size: 9614 bytes --]

Use the cpu alloc functions for the mib handling in the net layer. The
snmp_mib_free() API gains a size parameter because cpu_free() needs to
know the size of the object being freed.
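
A caller-side sketch of the changed interface (condensed from the error
paths converted below):

	/* Before: allocpercpu kept its own bookkeeping, so no size was needed. */
	snmp_mib_free((void **)udp_statistics);

	/* After: the caller supplies the object size so cpu_free() can return
	 * exactly that much space to the cpu area.
	 */
	snmp_mib_free((void **)udp_statistics, sizeof(struct udp_mib));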

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/net/ip.h     |    2 +-
 include/net/snmp.h   |   32 ++++++++------------------------
 net/dccp/proto.c     |    2 +-
 net/ipv4/af_inet.c   |   31 +++++++++++++++++--------------
 net/ipv6/addrconf.c  |   11 ++++++-----
 net/ipv6/af_inet6.c  |   20 +++++++++++---------
 net/sctp/protocol.c  |    2 +-
 net/xfrm/xfrm_proc.c |    4 ++--
 8 files changed, 47 insertions(+), 57 deletions(-)

Index: linux-2.6/include/net/ip.h
===================================================================
--- linux-2.6.orig/include/net/ip.h	2008-05-29 19:41:20.000000000 -0700
+++ linux-2.6/include/net/ip.h	2008-05-29 20:15:34.000000000 -0700
@@ -170,7 +170,7 @@ DECLARE_SNMP_STAT(struct linux_mib, net_
 
 extern unsigned long snmp_fold_field(void *mib[], int offt);
 extern int snmp_mib_init(void *ptr[2], size_t mibsize);
-extern void snmp_mib_free(void *ptr[2]);
+extern void snmp_mib_free(void *ptr[2], size_t mibsize);
 
 extern void inet_get_local_port_range(int *low, int *high);
 
Index: linux-2.6/include/net/snmp.h
===================================================================
--- linux-2.6.orig/include/net/snmp.h	2008-05-29 19:41:20.000000000 -0700
+++ linux-2.6/include/net/snmp.h	2008-05-29 20:15:34.000000000 -0700
@@ -138,29 +138,13 @@ struct linux_xfrm_mib {
 #define SNMP_STAT_BHPTR(name)	(name[0])
 #define SNMP_STAT_USRPTR(name)	(name[1])
 
-#define SNMP_INC_STATS_BH(mib, field) 	\
-	(per_cpu_ptr(mib[0], raw_smp_processor_id())->mibs[field]++)
-#define SNMP_INC_STATS_USER(mib, field) \
-	do { \
-		per_cpu_ptr(mib[1], get_cpu())->mibs[field]++; \
-		put_cpu(); \
-	} while (0)
-#define SNMP_INC_STATS(mib, field) 	\
-	do { \
-		per_cpu_ptr(mib[!in_softirq()], get_cpu())->mibs[field]++; \
-		put_cpu(); \
-	} while (0)
-#define SNMP_DEC_STATS(mib, field) 	\
-	do { \
-		per_cpu_ptr(mib[!in_softirq()], get_cpu())->mibs[field]--; \
-		put_cpu(); \
-	} while (0)
-#define SNMP_ADD_STATS_BH(mib, field, addend) 	\
-	(per_cpu_ptr(mib[0], raw_smp_processor_id())->mibs[field] += addend)
-#define SNMP_ADD_STATS_USER(mib, field, addend) 	\
-	do { \
-		per_cpu_ptr(mib[1], get_cpu())->mibs[field] += addend; \
-		put_cpu(); \
-	} while (0)
+#define SNMP_INC_STATS_BH(mib, field) __CPU_INC(mib[0]->mibs[field])
+#define SNMP_INC_STATS_USER(mib, field) _CPU_INC(mib[1]->mibs[field])
+#define SNMP_INC_STATS(mib, field) _CPU_INC(mib[!in_softirq()]->mibs[field])
+#define SNMP_DEC_STATS(mib, field) _CPU_DEC(mib[!in_softirq()]->mibs[field])
+#define SNMP_ADD_STATS_BH(mib, field, addend) \
+				__CPU_ADD(mib[0]->mibs[field], addend)
+#define SNMP_ADD_STATS_USER(mib, field, addend) \
+				_CPU_ADD(mib[1]->mibs[field], addend)
 
 #endif
Index: linux-2.6/net/ipv4/af_inet.c
===================================================================
--- linux-2.6.orig/net/ipv4/af_inet.c	2008-05-29 19:41:20.000000000 -0700
+++ linux-2.6/net/ipv4/af_inet.c	2008-05-29 20:15:34.000000000 -0700
@@ -1279,8 +1279,8 @@ unsigned long snmp_fold_field(void *mib[
 	int i;
 
 	for_each_possible_cpu(i) {
-		res += *(((unsigned long *) per_cpu_ptr(mib[0], i)) + offt);
-		res += *(((unsigned long *) per_cpu_ptr(mib[1], i)) + offt);
+		res += *(((unsigned long *) CPU_PTR(mib[0], i)) + offt);
+		res += *(((unsigned long *) CPU_PTR(mib[1], i)) + offt);
 	}
 	return res;
 }
@@ -1289,26 +1289,28 @@ EXPORT_SYMBOL_GPL(snmp_fold_field);
 int snmp_mib_init(void *ptr[2], size_t mibsize)
 {
 	BUG_ON(ptr == NULL);
-	ptr[0] = __alloc_percpu(mibsize);
+	ptr[0] = cpu_alloc(mibsize, GFP_KERNEL | __GFP_ZERO,
+					L1_CACHE_BYTES);
 	if (!ptr[0])
 		goto err0;
-	ptr[1] = __alloc_percpu(mibsize);
+	ptr[1] = cpu_alloc(mibsize, GFP_KERNEL | __GFP_ZERO,
+					L1_CACHE_BYTES);
 	if (!ptr[1])
 		goto err1;
 	return 0;
 err1:
-	free_percpu(ptr[0]);
+	cpu_free(ptr[0], mibsize);
 	ptr[0] = NULL;
 err0:
 	return -ENOMEM;
 }
 EXPORT_SYMBOL_GPL(snmp_mib_init);
 
-void snmp_mib_free(void *ptr[2])
+void snmp_mib_free(void *ptr[2], size_t mibsize)
 {
 	BUG_ON(ptr == NULL);
-	free_percpu(ptr[0]);
-	free_percpu(ptr[1]);
+	cpu_free(ptr[0], mibsize);
+	cpu_free(ptr[1], mibsize);
 	ptr[0] = ptr[1] = NULL;
 }
 EXPORT_SYMBOL_GPL(snmp_mib_free);
@@ -1370,17 +1372,18 @@ static int __init init_ipv4_mibs(void)
 	return 0;
 
 err_udplite_mib:
-	snmp_mib_free((void **)udp_statistics);
+	snmp_mib_free((void **)udp_statistics, sizeof(struct udp_mib));
 err_udp_mib:
-	snmp_mib_free((void **)tcp_statistics);
+	snmp_mib_free((void **)tcp_statistics, sizeof(struct tcp_mib));
 err_tcp_mib:
-	snmp_mib_free((void **)icmpmsg_statistics);
+	snmp_mib_free((void **)icmpmsg_statistics,
+					sizeof(struct icmpmsg_mib));
 err_icmpmsg_mib:
-	snmp_mib_free((void **)icmp_statistics);
+	snmp_mib_free((void **)icmp_statistics, sizeof(struct icmp_mib));
 err_icmp_mib:
-	snmp_mib_free((void **)ip_statistics);
+	snmp_mib_free((void **)ip_statistics, sizeof(struct ipstats_mib));
 err_ip_mib:
-	snmp_mib_free((void **)net_statistics);
+	snmp_mib_free((void **)net_statistics, sizeof(struct linux_mib));
 err_net_mib:
 	return -ENOMEM;
 }
Index: linux-2.6/net/ipv6/addrconf.c
===================================================================
--- linux-2.6.orig/net/ipv6/addrconf.c	2008-05-29 19:41:20.000000000 -0700
+++ linux-2.6/net/ipv6/addrconf.c	2008-05-29 20:16:35.000000000 -0700
@@ -279,18 +279,19 @@ static int snmp6_alloc_dev(struct inet6_
 	return 0;
 
 err_icmpmsg:
-	snmp_mib_free((void **)idev->stats.icmpv6);
+	snmp_mib_free((void **)idev->stats.icmpv6, sizeof(struct icmpv6_mib));
 err_icmp:
-	snmp_mib_free((void **)idev->stats.ipv6);
+	snmp_mib_free((void **)idev->stats.ipv6, sizeof(struct ipstats_mib));
 err_ip:
 	return -ENOMEM;
 }
 
 static void snmp6_free_dev(struct inet6_dev *idev)
 {
-	snmp_mib_free((void **)idev->stats.icmpv6msg);
-	snmp_mib_free((void **)idev->stats.icmpv6);
-	snmp_mib_free((void **)idev->stats.ipv6);
+	snmp_mib_free((void **)idev->stats.icmpv6msg,
+						sizeof(struct icmpv6msg_mib));
+	snmp_mib_free((void **)idev->stats.icmpv6, sizeof(struct icmpv6_mib));
+	snmp_mib_free((void **)idev->stats.ipv6, sizeof(struct ipstats_mib));
 }
 
 /* Nobody refers to this device, we may destroy it. */
Index: linux-2.6/net/ipv6/af_inet6.c
===================================================================
--- linux-2.6.orig/net/ipv6/af_inet6.c	2008-05-29 19:41:20.000000000 -0700
+++ linux-2.6/net/ipv6/af_inet6.c	2008-05-29 20:17:39.000000000 -0700
@@ -822,13 +822,14 @@ static int __init init_ipv6_mibs(void)
 	return 0;
 
 err_udplite_mib:
-	snmp_mib_free((void **)udp_stats_in6);
+	snmp_mib_free((void **)udp_stats_in6, sizeof(struct udp_mib));
 err_udp_mib:
-	snmp_mib_free((void **)icmpv6msg_statistics);
+	snmp_mib_free((void **)icmpv6msg_statistics,
+					sizeof(struct icmpv6msg_mib));
 err_icmpmsg_mib:
-	snmp_mib_free((void **)icmpv6_statistics);
+	snmp_mib_free((void **)icmpv6_statistics, sizeof(struct icmpv6_mib));
 err_icmp_mib:
-	snmp_mib_free((void **)ipv6_statistics);
+	snmp_mib_free((void **)ipv6_statistics, sizeof(struct ipstats_mib));
 err_ip_mib:
 	return -ENOMEM;
 
@@ -836,11 +837,12 @@ err_ip_mib:
 
 static void cleanup_ipv6_mibs(void)
 {
-	snmp_mib_free((void **)ipv6_statistics);
-	snmp_mib_free((void **)icmpv6_statistics);
-	snmp_mib_free((void **)icmpv6msg_statistics);
-	snmp_mib_free((void **)udp_stats_in6);
-	snmp_mib_free((void **)udplite_stats_in6);
+	snmp_mib_free((void **)ipv6_statistics, sizeof(struct ipstats_mib));
+	snmp_mib_free((void **)icmpv6_statistics, sizeof(struct icmpv6_mib));
+	snmp_mib_free((void **)icmpv6msg_statistics,
+						sizeof(struct icmpv6msg_mib));
+	snmp_mib_free((void **)udp_stats_in6, sizeof(struct udp_mib));
+	snmp_mib_free((void **)udplite_stats_in6, sizeof(struct udp_mib));
 }
 
 static int inet6_net_init(struct net *net)
Index: linux-2.6/net/dccp/proto.c
===================================================================
--- linux-2.6.orig/net/dccp/proto.c	2008-05-29 19:41:20.000000000 -0700
+++ linux-2.6/net/dccp/proto.c	2008-05-29 20:18:05.000000000 -0700
@@ -1016,7 +1016,7 @@ static inline int dccp_mib_init(void)
 
 static inline void dccp_mib_exit(void)
 {
-	snmp_mib_free((void**)dccp_statistics);
+	snmp_mib_free((void **)dccp_statistics, sizeof(struct dccp_mib));
 }
 
 static int thash_entries;
Index: linux-2.6/net/sctp/protocol.c
===================================================================
--- linux-2.6.orig/net/sctp/protocol.c	2008-05-29 19:41:20.000000000 -0700
+++ linux-2.6/net/sctp/protocol.c	2008-05-29 20:18:21.000000000 -0700
@@ -981,7 +981,7 @@ static inline int init_sctp_mibs(void)
 
 static inline void cleanup_sctp_mibs(void)
 {
-	snmp_mib_free((void**)sctp_statistics);
+	snmp_mib_free((void **)sctp_statistics, sizeof(struct sctp_mib));
 }
 
 static void sctp_v4_pf_init(void)
Index: linux-2.6/net/xfrm/xfrm_proc.c
===================================================================
--- linux-2.6.orig/net/xfrm/xfrm_proc.c	2008-05-29 19:41:20.000000000 -0700
+++ linux-2.6/net/xfrm/xfrm_proc.c	2008-05-29 20:19:10.000000000 -0700
@@ -51,8 +51,8 @@ fold_field(void *mib[], int offt)
         int i;
 
         for_each_possible_cpu(i) {
-                res += *(((unsigned long *)per_cpu_ptr(mib[0], i)) + offt);
-                res += *(((unsigned long *)per_cpu_ptr(mib[1], i)) + offt);
+		res += *(((unsigned long *)CPU_PTR(mib[0], i)) + offt);
+		res += *(((unsigned long *)CPU_PTR(mib[1], i)) + offt);
         }
         return res;
 }

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 27/41] cpu alloc: Remove the allocpercpu functionality
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (25 preceding siblings ...)
  2008-05-30  3:56 ` [patch 26/41] cpu alloc: Convert mib handling to cpu alloc Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  4:58   ` Andrew Morton
  2008-05-30  3:56 ` [patch 28/41] Module handling: Use CPU_xx ops to dynamically allocate counters Christoph Lameter
                   ` (14 subsequent siblings)
  41 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_remove_allocpercpu --]
[-- Type: text/plain, Size: 8176 bytes --]

There is no user of allocpercpu left after all the earlier patches were
applied. Remove the allocpercpu code.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/linux/percpu.h |   80 ------------------------------
 mm/Makefile            |    1 
 mm/allocpercpu.c       |  127 -------------------------------------------------
 3 files changed, 208 deletions(-)
 delete mode 100644 mm/allocpercpu.c

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2008-05-21 21:42:55.000000000 -0700
+++ linux-2.6/include/linux/percpu.h	2008-05-21 22:03:19.000000000 -0700
@@ -53,86 +53,6 @@
 	&__get_cpu_var(var); }))
 #define put_cpu_var(var) preempt_enable()
 
-#ifdef CONFIG_SMP
-
-struct percpu_data {
-	void *ptrs[1];
-};
-
-#define __percpu_disguise(pdata) (struct percpu_data *)~(unsigned long)(pdata)
-/* 
- * Use this to get to a cpu's version of the per-cpu object dynamically
- * allocated. Non-atomic access to the current CPU's version should
- * probably be combined with get_cpu()/put_cpu().
- */ 
-#define percpu_ptr(ptr, cpu)                              \
-({                                                        \
-        struct percpu_data *__p = __percpu_disguise(ptr); \
-        (__typeof__(ptr))__p->ptrs[(cpu)];	          \
-})
-
-extern void *percpu_populate(void *__pdata, size_t size, gfp_t gfp, int cpu);
-extern void percpu_depopulate(void *__pdata, int cpu);
-extern int __percpu_populate_mask(void *__pdata, size_t size, gfp_t gfp,
-				  cpumask_t *mask);
-extern void __percpu_depopulate_mask(void *__pdata, cpumask_t *mask);
-extern void *__percpu_alloc_mask(size_t size, gfp_t gfp, cpumask_t *mask);
-extern void percpu_free(void *__pdata);
-
-#else /* CONFIG_SMP */
-
-#define percpu_ptr(ptr, cpu) ({ (void)(cpu); (ptr); })
-
-static inline void percpu_depopulate(void *__pdata, int cpu)
-{
-}
-
-static inline void __percpu_depopulate_mask(void *__pdata, cpumask_t *mask)
-{
-}
-
-static inline void *percpu_populate(void *__pdata, size_t size, gfp_t gfp,
-				    int cpu)
-{
-	return percpu_ptr(__pdata, cpu);
-}
-
-static inline int __percpu_populate_mask(void *__pdata, size_t size, gfp_t gfp,
-					 cpumask_t *mask)
-{
-	return 0;
-}
-
-static __always_inline void *__percpu_alloc_mask(size_t size, gfp_t gfp, cpumask_t *mask)
-{
-	return kzalloc(size, gfp);
-}
-
-static inline void percpu_free(void *__pdata)
-{
-	kfree(__pdata);
-}
-
-#endif /* CONFIG_SMP */
-
-#define percpu_populate_mask(__pdata, size, gfp, mask) \
-	__percpu_populate_mask((__pdata), (size), (gfp), &(mask))
-#define percpu_depopulate_mask(__pdata, mask) \
-	__percpu_depopulate_mask((__pdata), &(mask))
-#define percpu_alloc_mask(size, gfp, mask) \
-	__percpu_alloc_mask((size), (gfp), &(mask))
-
-#define percpu_alloc(size, gfp) percpu_alloc_mask((size), (gfp), cpu_online_map)
-
-/* (legacy) interface for use without CPU hotplug handling */
-
-#define __alloc_percpu(size)	percpu_alloc_mask((size), GFP_KERNEL, \
-						  cpu_possible_map)
-#define alloc_percpu(type)	(type *)__alloc_percpu(sizeof(type))
-#define free_percpu(ptr)	percpu_free((ptr))
-#define per_cpu_ptr(ptr, cpu)	percpu_ptr((ptr), (cpu))
-
-
 /*
  * cpu allocator definitions
  *
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile	2008-05-21 21:35:21.000000000 -0700
+++ linux-2.6/mm/Makefile	2008-05-21 22:02:05.000000000 -0700
@@ -30,7 +30,6 @@
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
-obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
 
Index: linux-2.6/mm/allocpercpu.c
===================================================================
--- linux-2.6.orig/mm/allocpercpu.c	2008-04-29 14:55:55.000000000 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,141 +0,0 @@
-/*
- * linux/mm/allocpercpu.c
- *
- * Separated from slab.c August 11, 2006 Christoph Lameter <clameter@sgi.com>
- */
-#include <linux/mm.h>
-#include <linux/module.h>
-
-#ifndef cache_line_size
-#define cache_line_size()	L1_CACHE_BYTES
-#endif
-
-/**
- * percpu_depopulate - depopulate per-cpu data for given cpu
- * @__pdata: per-cpu data to depopulate
- * @cpu: depopulate per-cpu data for this cpu
- *
- * Depopulating per-cpu data for a cpu going offline would be a typical
- * use case. You need to register a cpu hotplug handler for that purpose.
- */
-void percpu_depopulate(void *__pdata, int cpu)
-{
-	struct percpu_data *pdata = __percpu_disguise(__pdata);
-
-	kfree(pdata->ptrs[cpu]);
-	pdata->ptrs[cpu] = NULL;
-}
-EXPORT_SYMBOL_GPL(percpu_depopulate);
-
-/**
- * percpu_depopulate_mask - depopulate per-cpu data for some cpu's
- * @__pdata: per-cpu data to depopulate
- * @mask: depopulate per-cpu data for cpu's selected through mask bits
- */
-void __percpu_depopulate_mask(void *__pdata, cpumask_t *mask)
-{
-	int cpu;
-	for_each_cpu_mask(cpu, *mask)
-		percpu_depopulate(__pdata, cpu);
-}
-EXPORT_SYMBOL_GPL(__percpu_depopulate_mask);
-
-/**
- * percpu_populate - populate per-cpu data for given cpu
- * @__pdata: per-cpu data to populate further
- * @size: size of per-cpu object
- * @gfp: may sleep or not etc.
- * @cpu: populate per-data for this cpu
- *
- * Populating per-cpu data for a cpu coming online would be a typical
- * use case. You need to register a cpu hotplug handler for that purpose.
- * Per-cpu object is populated with zeroed buffer.
- */
-void *percpu_populate(void *__pdata, size_t size, gfp_t gfp, int cpu)
-{
-	struct percpu_data *pdata = __percpu_disguise(__pdata);
-	int node = cpu_to_node(cpu);
-
-	/*
-	 * We should make sure each CPU gets private memory.
-	 */
-	size = roundup(size, cache_line_size());
-
-	BUG_ON(pdata->ptrs[cpu]);
-	if (node_online(node))
-		pdata->ptrs[cpu] = kmalloc_node(size, gfp|__GFP_ZERO, node);
-	else
-		pdata->ptrs[cpu] = kzalloc(size, gfp);
-	return pdata->ptrs[cpu];
-}
-EXPORT_SYMBOL_GPL(percpu_populate);
-
-/**
- * percpu_populate_mask - populate per-cpu data for more cpu's
- * @__pdata: per-cpu data to populate further
- * @size: size of per-cpu object
- * @gfp: may sleep or not etc.
- * @mask: populate per-cpu data for cpu's selected through mask bits
- *
- * Per-cpu objects are populated with zeroed buffers.
- */
-int __percpu_populate_mask(void *__pdata, size_t size, gfp_t gfp,
-			   cpumask_t *mask)
-{
-	cpumask_t populated;
-	int cpu;
-
-	cpus_clear(populated);
-	for_each_cpu_mask(cpu, *mask)
-		if (unlikely(!percpu_populate(__pdata, size, gfp, cpu))) {
-			__percpu_depopulate_mask(__pdata, &populated);
-			return -ENOMEM;
-		} else
-			cpu_set(cpu, populated);
-	return 0;
-}
-EXPORT_SYMBOL_GPL(__percpu_populate_mask);
-
-/**
- * percpu_alloc_mask - initial setup of per-cpu data
- * @size: size of per-cpu object
- * @gfp: may sleep or not etc.
- * @mask: populate per-data for cpu's selected through mask bits
- *
- * Populating per-cpu data for all online cpu's would be a typical use case,
- * which is simplified by the percpu_alloc() wrapper.
- * Per-cpu objects are populated with zeroed buffers.
- */
-void *__percpu_alloc_mask(size_t size, gfp_t gfp, cpumask_t *mask)
-{
-	/*
-	 * We allocate whole cache lines to avoid false sharing
-	 */
-	size_t sz = roundup(nr_cpu_ids * sizeof(void *), cache_line_size());
-	void *pdata = kzalloc(sz, gfp);
-	void *__pdata = __percpu_disguise(pdata);
-
-	if (unlikely(!pdata))
-		return NULL;
-	if (likely(!__percpu_populate_mask(__pdata, size, gfp, mask)))
-		return __pdata;
-	kfree(pdata);
-	return NULL;
-}
-EXPORT_SYMBOL_GPL(__percpu_alloc_mask);
-
-/**
- * percpu_free - final cleanup of per-cpu data
- * @__pdata: object to clean up
- *
- * We simply clean up any per-cpu object left. No need for the client to
- * track and specify through a bis mask which per-cpu objects are to free.
- */
-void percpu_free(void *__pdata)
-{
-	if (unlikely(!__pdata))
-		return;
-	__percpu_depopulate_mask(__pdata, &cpu_possible_map);
-	kfree(__percpu_disguise(__pdata));
-}
-EXPORT_SYMBOL_GPL(percpu_free);

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 28/41] Module handling: Use CPU_xx ops to dynamically allocate counters
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (26 preceding siblings ...)
  2008-05-30  3:56 ` [patch 27/41] cpu alloc: Remove the allocpercpu functionality Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 29/41] x86_64: Use CPU ops for nmi alert counter Christoph Lameter
                   ` (13 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_remove_local_t_in_kernel_module --]
[-- Type: text/plain, Size: 3133 bytes --]

Use cpu ops to deal with the per cpu data instead of a local_t. This
reduces memory requirements and cache footprint and decreases cycle counts.

Avoid a loop over all NR_CPUS entries here; use the possible map instead.
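
Condensed from the hunks below, the life cycle of the reference counter
after the conversion looks roughly like this (illustrative sketch only;
declarations and error handling omitted):

	mod->ref = CPU_ALLOC(struct module_ref, GFP_KERNEL | __GFP_ZERO);
	__CPU_WRITE(mod->ref->count, 1);	/* module_unload_init(): hold a ref during init */
	_CPU_INC(mod->ref->count);		/* __module_get(): preemption safe increment */
	for_each_online_cpu(cpu)		/* module_refcount(): fold the per cpu counts */
		total += CPU_PTR(mod->ref, cpu)->count;
	CPU_FREE(mod->ref);			/* on unload */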

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/module.h |   13 +++++--------
 kernel/module.c        |   17 +++++++----------
 2 files changed, 12 insertions(+), 18 deletions(-)

Index: linux-2.6/include/linux/module.h
===================================================================
--- linux-2.6.orig/include/linux/module.h	2008-05-21 22:41:03.000000000 -0700
+++ linux-2.6/include/linux/module.h	2008-05-21 23:19:39.000000000 -0700
@@ -219,8 +219,8 @@
 
 struct module_ref
 {
-	local_t count;
-} ____cacheline_aligned;
+	int count;
+};
 
 enum module_state
 {
@@ -307,7 +307,7 @@
 
 #ifdef CONFIG_MODULE_UNLOAD
 	/* Reference counts */
-	struct module_ref ref[NR_CPUS];
+	struct module_ref *ref;
 
 	/* What modules depend on me? */
 	struct list_head modules_which_use_me;
@@ -385,8 +385,7 @@
 {
 	if (module) {
 		BUG_ON(module_refcount(module) == 0);
-		local_inc(&module->ref[get_cpu()].count);
-		put_cpu();
+		_CPU_INC(module->ref->count);
 	}
 }
 
@@ -395,12 +394,12 @@
 	int ret = 1;
 
 	if (module) {
-		unsigned int cpu = get_cpu();
+		preempt_disable();
 		if (likely(module_is_live(module)))
-			local_inc(&module->ref[cpu].count);
+			__CPU_INC(module->ref->count);
 		else
 			ret = 0;
-		put_cpu();
+		preempt_enable();
 	}
 	return ret;
 }
Index: linux-2.6/kernel/module.c
===================================================================
--- linux-2.6.orig/kernel/module.c	2008-05-21 22:41:03.000000000 -0700
+++ linux-2.6/kernel/module.c	2008-05-21 23:17:20.000000000 -0700
@@ -366,13 +366,11 @@
 /* Init the unload section of the module. */
 static void module_unload_init(struct module *mod)
 {
-	unsigned int i;
-
 	INIT_LIST_HEAD(&mod->modules_which_use_me);
-	for (i = 0; i < NR_CPUS; i++)
-		local_set(&mod->ref[i].count, 0);
+	mod->ref = CPU_ALLOC(struct module_ref, GFP_KERNEL | __GFP_ZERO);
+
 	/* Hold reference count during initialization. */
-	local_set(&mod->ref[raw_smp_processor_id()].count, 1);
+	__CPU_WRITE(mod->ref->count, 1);
 	/* Backwards compatibility macros put refcount during init. */
 	mod->waiter = current;
 }
@@ -450,6 +448,7 @@
 				kfree(use);
 				sysfs_remove_link(i->holders_dir, mod->name);
 				/* There can be at most one match. */
+				CPU_FREE(i->ref);
 				break;
 			}
 		}
@@ -505,8 +504,8 @@
 {
 	unsigned int i, total = 0;
 
-	for (i = 0; i < NR_CPUS; i++)
-		total += local_read(&mod->ref[i].count);
+	for_each_online_cpu(i)
+		total += CPU_PTR(mod->ref, i)->count;
 	return total;
 }
 EXPORT_SYMBOL(module_refcount);
@@ -667,12 +666,12 @@
 void module_put(struct module *module)
 {
 	if (module) {
-		unsigned int cpu = get_cpu();
-		local_dec(&module->ref[cpu].count);
+		preempt_disable();
+		_CPU_DEC(module->ref->count);
 		/* Maybe they're waiting for us to drop reference? */
 		if (unlikely(!module_is_live(module)))
 			wake_up_process(module->waiter);
-		put_cpu();
+		preempt_enable();
 	}
 }
 EXPORT_SYMBOL(module_put);

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 29/41] x86_64: Use CPU ops for nmi alert counter
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (27 preceding siblings ...)
  2008-05-30  3:56 ` [patch 28/41] Module handling: Use CPU_xx ops to dynamically allocate counters Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 30/41] Remove local_t support Christoph Lameter
                   ` (12 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_remove_local_t_in_nmi --]
[-- Type: text/plain, Size: 1383 bytes --]

These are critical fast paths. Reduce overhead by using a segment override
instead of an address calculation.
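
For illustration only (the exact instruction and symbol name are
assumptions, not part of the patch): on x86_64 the per cpu increment is
expected to become a single instruction with a gs segment prefix, whereas
the local_t version first had to compute the address of this cpu's
element:

	/* Old: __get_cpu_var() computes the address, then local_inc() updates it. */
	local_inc(&__get_cpu_var(alert_counter));

	/* New: one segment-prefixed instruction, roughly
	 *	incl %gs:per_cpu__alert_counter
	 */
	CPU_INC(per_cpu_var(alert_counter));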

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 arch/x86/kernel/nmi_64.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6/arch/x86/kernel/nmi_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/nmi_64.c	2008-04-29 14:55:48.000000000 -0700
+++ linux-2.6/arch/x86/kernel/nmi_64.c	2008-05-21 22:49:33.000000000 -0700
@@ -290,7 +290,7 @@
  */
 
 static DEFINE_PER_CPU(unsigned, last_irq_sum);
-static DEFINE_PER_CPU(local_t, alert_counter);
+static DEFINE_PER_CPU(int, alert_counter);
 static DEFINE_PER_CPU(int, nmi_touch);
 
 void touch_nmi_watchdog(void)
@@ -356,13 +356,13 @@
 		 * Ayiee, looks like this CPU is stuck ...
 		 * wait a few IRQs (5 seconds) before doing the oops ...
 		 */
-		local_inc(&__get_cpu_var(alert_counter));
-		if (local_read(&__get_cpu_var(alert_counter)) == 5*nmi_hz)
+		CPU_INC(per_cpu_var(alert_counter));
+		if (CPU_READ(per_cpu_var(alert_counter)) == 5*nmi_hz)
 			die_nmi("NMI Watchdog detected LOCKUP on CPU %d\n", regs,
 				panic_on_timeout);
 	} else {
 		__get_cpu_var(last_irq_sum) = sum;
-		local_set(&__get_cpu_var(alert_counter), 0);
+		CPU_WRITE(per_cpu_var(alert_counter), 0);
 	}
 
 	/* see if the nmi watchdog went off */

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 30/41] Remove local_t support
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (28 preceding siblings ...)
  2008-05-30  3:56 ` [patch 29/41] x86_64: Use CPU ops for nmi alert counter Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 31/41] VM statistics: Use CPU ops Christoph Lameter
                   ` (11 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_remove_local_t --]
[-- Type: text/plain, Size: 52632 bytes --]

There is no user of local_t remaining after the cpu ops patchset. local_t
always suffered from the problem that its operations could not perform the
relocation of the pointer to the target processor and the atomic update in
one step. Preemption and/or interrupts therefore had to be disabled around
each use, which made local_t awkward to use.
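
A minimal before/after sketch of the pattern that replaced local_t (the
counter name is illustrative; the real conversions are in the preceding
patches of this series):

	/* Before: a per cpu local_t; relocation and update are separate steps,
	 * so preemption must stay disabled across both.
	 */
	static DEFINE_PER_CPU(local_t, counter);

	local_inc(&get_cpu_var(counter));	/* get_cpu_var() pins the cpu */
	put_cpu_var(counter);

	/* After: a plain per cpu int; the cpu op does relocation and update
	 * in one operation.
	 */
	static DEFINE_PER_CPU(int, counter);

	CPU_INC(per_cpu_var(counter));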

Quirk:
- linux/module.h needs to include hardirq.h now since asm-generic/local.h did
  and some arches now depend on it (sparc64 for example).

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 Documentation/local_ops.txt   |  186 ---------------------
 arch/frv/kernel/local.h       |   59 ------
 include/asm-alpha/local.h     |  118 -------------
 include/asm-arm/local.h       |    1 
 include/asm-avr32/local.h     |    6 
 include/asm-blackfin/local.h  |    6 
 include/asm-cris/local.h      |    1 
 include/asm-frv/local.h       |    6 
 include/asm-generic/local.h   |   75 --------
 include/asm-h8300/local.h     |    6 
 include/asm-ia64/local.h      |    1 
 include/asm-m32r/local.h      |  366 ------------------------------------------
 include/asm-m68k/local.h      |    6 
 include/asm-m68knommu/local.h |    6 
 include/asm-mips/local.h      |  221 -------------------------
 include/asm-mn10300/local.h   |    1 
 include/asm-parisc/local.h    |    1 
 include/asm-powerpc/local.h   |  200 ----------------------
 include/asm-s390/local.h      |    1 
 include/asm-sh/local.h        |    7 
 include/asm-sparc/local.h     |    6 
 include/asm-sparc64/local.h   |    1 
 include/asm-um/local.h        |    6 
 include/asm-v850/local.h      |    6 
 include/asm-x86/local.h       |  235 --------------------------
 include/asm-xtensa/local.h    |   16 -
 include/linux/module.h        |    4 
 27 files changed, 3 insertions(+), 1545 deletions(-)

Index: linux-2.6/Documentation/local_ops.txt
===================================================================
--- linux-2.6.orig/Documentation/local_ops.txt	2008-05-29 10:57:34.640237763 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,186 +0,0 @@
-	     Semantics and Behavior of Local Atomic Operations
-
-			    Mathieu Desnoyers
-
-
-	This document explains the purpose of the local atomic operations, how
-to implement them for any given architecture and shows how they can be used
-properly. It also stresses on the precautions that must be taken when reading
-those local variables across CPUs when the order of memory writes matters.
-
-
-
-* Purpose of local atomic operations
-
-Local atomic operations are meant to provide fast and highly reentrant per CPU
-counters. They minimize the performance cost of standard atomic operations by
-removing the LOCK prefix and memory barriers normally required to synchronize
-across CPUs.
-
-Having fast per CPU atomic counters is interesting in many cases : it does not
-require disabling interrupts to protect from interrupt handlers and it permits
-coherent counters in NMI handlers. It is especially useful for tracing purposes
-and for various performance monitoring counters.
-
-Local atomic operations only guarantee variable modification atomicity wrt the
-CPU which owns the data. Therefore, care must taken to make sure that only one
-CPU writes to the local_t data. This is done by using per cpu data and making
-sure that we modify it from within a preemption safe context. It is however
-permitted to read local_t data from any CPU : it will then appear to be written
-out of order wrt other memory writes by the owner CPU.
-
-
-* Implementation for a given architecture
-
-It can be done by slightly modifying the standard atomic operations : only
-their UP variant must be kept. It typically means removing LOCK prefix (on
-i386 and x86_64) and any SMP sychronization barrier. If the architecture does
-not have a different behavior between SMP and UP, including asm-generic/local.h
-in your archtecture's local.h is sufficient.
-
-The local_t type is defined as an opaque signed long by embedding an
-atomic_long_t inside a structure. This is made so a cast from this type to a
-long fails. The definition looks like :
-
-typedef struct { atomic_long_t a; } local_t;
-
-
-* Rules to follow when using local atomic operations
-
-- Variables touched by local ops must be per cpu variables.
-- _Only_ the CPU owner of these variables must write to them.
-- This CPU can use local ops from any context (process, irq, softirq, nmi, ...)
-  to update its local_t variables.
-- Preemption (or interrupts) must be disabled when using local ops in
-  process context to   make sure the process won't be migrated to a
-  different CPU between getting the per-cpu variable and doing the
-  actual local op.
-- When using local ops in interrupt context, no special care must be
-  taken on a mainline kernel, since they will run on the local CPU with
-  preemption already disabled. I suggest, however, to explicitly
-  disable preemption anyway to make sure it will still work correctly on
-  -rt kernels.
-- Reading the local cpu variable will provide the current copy of the
-  variable.
-- Reads of these variables can be done from any CPU, because updates to
-  "long", aligned, variables are always atomic. Since no memory
-  synchronization is done by the writer CPU, an outdated copy of the
-  variable can be read when reading some _other_ cpu's variables.
-
-
-* How to use local atomic operations
-
-#include <linux/percpu.h>
-#include <asm/local.h>
-
-static DEFINE_PER_CPU(local_t, counters) = LOCAL_INIT(0);
-
-
-* Counting
-
-Counting is done on all the bits of a signed long.
-
-In preemptible context, use get_cpu_var() and put_cpu_var() around local atomic
-operations : it makes sure that preemption is disabled around write access to
-the per cpu variable. For instance :
-
-	local_inc(&get_cpu_var(counters));
-	put_cpu_var(counters);
-
-If you are already in a preemption-safe context, you can directly use
-__get_cpu_var() instead.
-
-	local_inc(&__get_cpu_var(counters));
-
-
-
-* Reading the counters
-
-Those local counters can be read from foreign CPUs to sum the count. Note that
-the data seen by local_read across CPUs must be considered to be out of order
-relatively to other memory writes happening on the CPU that owns the data.
-
-	long sum = 0;
-	for_each_online_cpu(cpu)
-		sum += local_read(&per_cpu(counters, cpu));
-
-If you want to use a remote local_read to synchronize access to a resource
-between CPUs, explicit smp_wmb() and smp_rmb() memory barriers must be used
-respectively on the writer and the reader CPUs. It would be the case if you use
-the local_t variable as a counter of bytes written in a buffer : there should
-be a smp_wmb() between the buffer write and the counter increment and also a
-smp_rmb() between the counter read and the buffer read.
-
-
-Here is a sample module which implements a basic per cpu counter using local.h.
-
---- BEGIN ---
-/* test-local.c
- *
- * Sample module for local.h usage.
- */
-
-
-#include <asm/local.h>
-#include <linux/module.h>
-#include <linux/timer.h>
-
-static DEFINE_PER_CPU(local_t, counters) = LOCAL_INIT(0);
-
-static struct timer_list test_timer;
-
-/* IPI called on each CPU. */
-static void test_each(void *info)
-{
-	/* Increment the counter from a non preemptible context */
-	printk("Increment on cpu %d\n", smp_processor_id());
-	local_inc(&__get_cpu_var(counters));
-
-	/* This is what incrementing the variable would look like within a
-	 * preemptible context (it disables preemption) :
-	 *
-	 * local_inc(&get_cpu_var(counters));
-	 * put_cpu_var(counters);
-	 */
-}
-
-static void do_test_timer(unsigned long data)
-{
-	int cpu;
-
-	/* Increment the counters */
-	on_each_cpu(test_each, NULL, 0, 1);
-	/* Read all the counters */
-	printk("Counters read from CPU %d\n", smp_processor_id());
-	for_each_online_cpu(cpu) {
-		printk("Read : CPU %d, count %ld\n", cpu,
-			local_read(&per_cpu(counters, cpu)));
-	}
-	del_timer(&test_timer);
-	test_timer.expires = jiffies + 1000;
-	add_timer(&test_timer);
-}
-
-static int __init test_init(void)
-{
-	/* initialize the timer that will increment the counter */
-	init_timer(&test_timer);
-	test_timer.function = do_test_timer;
-	test_timer.expires = jiffies + 1;
-	add_timer(&test_timer);
-
-	return 0;
-}
-
-static void __exit test_exit(void)
-{
-	del_timer_sync(&test_timer);
-}
-
-module_init(test_init);
-module_exit(test_exit);
-
-MODULE_LICENSE("GPL");
-MODULE_AUTHOR("Mathieu Desnoyers");
-MODULE_DESCRIPTION("Local Atomic Ops");
---- END ---
Index: linux-2.6/include/asm-x86/local.h
===================================================================
--- linux-2.6.orig/include/asm-x86/local.h	2008-05-29 10:57:34.670237432 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,235 +0,0 @@
-#ifndef _ARCH_LOCAL_H
-#define _ARCH_LOCAL_H
-
-#include <linux/percpu.h>
-
-#include <asm/system.h>
-#include <asm/atomic.h>
-#include <asm/asm.h>
-
-typedef struct {
-	atomic_long_t a;
-} local_t;
-
-#define LOCAL_INIT(i)	{ ATOMIC_LONG_INIT(i) }
-
-#define local_read(l)	atomic_long_read(&(l)->a)
-#define local_set(l, i)	atomic_long_set(&(l)->a, (i))
-
-static inline void local_inc(local_t *l)
-{
-	asm volatile(_ASM_INC "%0"
-		     : "+m" (l->a.counter));
-}
-
-static inline void local_dec(local_t *l)
-{
-	asm volatile(_ASM_DEC "%0"
-		     : "+m" (l->a.counter));
-}
-
-static inline void local_add(long i, local_t *l)
-{
-	asm volatile(_ASM_ADD "%1,%0"
-		     : "+m" (l->a.counter)
-		     : "ir" (i));
-}
-
-static inline void local_sub(long i, local_t *l)
-{
-	asm volatile(_ASM_SUB "%1,%0"
-		     : "+m" (l->a.counter)
-		     : "ir" (i));
-}
-
-/**
- * local_sub_and_test - subtract value from variable and test result
- * @i: integer value to subtract
- * @l: pointer to type local_t
- *
- * Atomically subtracts @i from @l and returns
- * true if the result is zero, or false for all
- * other cases.
- */
-static inline int local_sub_and_test(long i, local_t *l)
-{
-	unsigned char c;
-
-	asm volatile(_ASM_SUB "%2,%0; sete %1"
-		     : "+m" (l->a.counter), "=qm" (c)
-		     : "ir" (i) : "memory");
-	return c;
-}
-
-/**
- * local_dec_and_test - decrement and test
- * @l: pointer to type local_t
- *
- * Atomically decrements @l by 1 and
- * returns true if the result is 0, or false for all other
- * cases.
- */
-static inline int local_dec_and_test(local_t *l)
-{
-	unsigned char c;
-
-	asm volatile(_ASM_DEC "%0; sete %1"
-		     : "+m" (l->a.counter), "=qm" (c)
-		     : : "memory");
-	return c != 0;
-}
-
-/**
- * local_inc_and_test - increment and test
- * @l: pointer to type local_t
- *
- * Atomically increments @l by 1
- * and returns true if the result is zero, or false for all
- * other cases.
- */
-static inline int local_inc_and_test(local_t *l)
-{
-	unsigned char c;
-
-	asm volatile(_ASM_INC "%0; sete %1"
-		     : "+m" (l->a.counter), "=qm" (c)
-		     : : "memory");
-	return c != 0;
-}
-
-/**
- * local_add_negative - add and test if negative
- * @i: integer value to add
- * @l: pointer to type local_t
- *
- * Atomically adds @i to @l and returns true
- * if the result is negative, or false when
- * result is greater than or equal to zero.
- */
-static inline int local_add_negative(long i, local_t *l)
-{
-	unsigned char c;
-
-	asm volatile(_ASM_ADD "%2,%0; sets %1"
-		     : "+m" (l->a.counter), "=qm" (c)
-		     : "ir" (i) : "memory");
-	return c;
-}
-
-/**
- * local_add_return - add and return
- * @i: integer value to add
- * @l: pointer to type local_t
- *
- * Atomically adds @i to @l and returns @i + @l
- */
-static inline long local_add_return(long i, local_t *l)
-{
-	long __i;
-#ifdef CONFIG_M386
-	unsigned long flags;
-	if (unlikely(boot_cpu_data.x86 <= 3))
-		goto no_xadd;
-#endif
-	/* Modern 486+ processor */
-	__i = i;
-	asm volatile(_ASM_XADD "%0, %1;"
-		     : "+r" (i), "+m" (l->a.counter)
-		     : : "memory");
-	return i + __i;
-
-#ifdef CONFIG_M386
-no_xadd: /* Legacy 386 processor */
-	local_irq_save(flags);
-	__i = local_read(l);
-	local_set(l, i + __i);
-	local_irq_restore(flags);
-	return i + __i;
-#endif
-}
-
-static inline long local_sub_return(long i, local_t *l)
-{
-	return local_add_return(-i, l);
-}
-
-#define local_inc_return(l)  (local_add_return(1, l))
-#define local_dec_return(l)  (local_sub_return(1, l))
-
-#define local_cmpxchg(l, o, n) \
-	(cmpxchg_local(&((l)->a.counter), (o), (n)))
-/* Always has a lock prefix */
-#define local_xchg(l, n) (xchg(&((l)->a.counter), (n)))
-
-/**
- * local_add_unless - add unless the number is a given value
- * @l: pointer of type local_t
- * @a: the amount to add to l...
- * @u: ...unless l is equal to u.
- *
- * Atomically adds @a to @l, so long as it was not @u.
- * Returns non-zero if @l was not @u, and zero otherwise.
- */
-#define local_add_unless(l, a, u)				\
-({								\
-	long c, old;						\
-	c = local_read((l));					\
-	for (;;) {						\
-		if (unlikely(c == (u)))				\
-			break;					\
-		old = local_cmpxchg((l), c, c + (a));		\
-		if (likely(old == c))				\
-			break;					\
-		c = old;					\
-	}							\
-	c != (u);						\
-})
-#define local_inc_not_zero(l) local_add_unless((l), 1, 0)
-
-/* On x86_32, these are no better than the atomic variants.
- * On x86-64 these are better than the atomic variants on SMP kernels
- * because they dont use a lock prefix.
- */
-#define __local_inc(l)		local_inc(l)
-#define __local_dec(l)		local_dec(l)
-#define __local_add(i, l)	local_add((i), (l))
-#define __local_sub(i, l)	local_sub((i), (l))
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations.  Note they take
- * a variable, not an address.
- *
- * X86_64: This could be done better if we moved the per cpu data directly
- * after GS.
- */
-
-/* Need to disable preemption for the cpu local counters otherwise we could
-   still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l)		\
-({					\
-	local_t res__;			\
-	preempt_disable(); 		\
-	res__ = (l);			\
-	preempt_enable();		\
-	res__;				\
-})
-#define cpu_local_wrap(l)		\
-({					\
-	preempt_disable();		\
-	(l);				\
-	preempt_enable();		\
-})					\
-
-#define cpu_local_read(l)    cpu_local_wrap_v(local_read(&__get_cpu_var((l))))
-#define cpu_local_set(l, i)  cpu_local_wrap(local_set(&__get_cpu_var((l)), (i)))
-#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var((l))))
-#define cpu_local_dec(l)     cpu_local_wrap(local_dec(&__get_cpu_var((l))))
-#define cpu_local_add(i, l)  cpu_local_wrap(local_add((i), &__get_cpu_var((l))))
-#define cpu_local_sub(i, l)  cpu_local_wrap(local_sub((i), &__get_cpu_var((l))))
-
-#define __cpu_local_inc(l)	cpu_local_inc((l))
-#define __cpu_local_dec(l)	cpu_local_dec((l))
-#define __cpu_local_add(i, l)	cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l)	cpu_local_sub((i), (l))
-
-#endif /* _ARCH_LOCAL_H */
Index: linux-2.6/arch/frv/kernel/local.h
===================================================================
--- linux-2.6.orig/arch/frv/kernel/local.h	2008-05-29 10:57:35.606486730 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,59 +0,0 @@
-/* local.h: local definitions
- *
- * Copyright (C) 2004 Red Hat, Inc. All Rights Reserved.
- * Written by David Howells (dhowells@redhat.com)
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version
- * 2 of the License, or (at your option) any later version.
- */
-
-#ifndef _FRV_LOCAL_H
-#define _FRV_LOCAL_H
-
-#include <asm/sections.h>
-
-#ifndef __ASSEMBLY__
-
-/* dma.c */
-extern unsigned long frv_dma_inprogress;
-
-extern void frv_dma_pause_all(void);
-extern void frv_dma_resume_all(void);
-
-/* sleep.S */
-extern asmlinkage void frv_cpu_suspend(unsigned long);
-extern asmlinkage void frv_cpu_core_sleep(void);
-
-/* setup.c */
-extern unsigned long __nongprelbss pdm_suspend_mode;
-extern void determine_clocks(int verbose);
-extern int __nongprelbss clock_p0_current;
-extern int __nongprelbss clock_cm_current;
-extern int __nongprelbss clock_cmode_current;
-
-#ifdef CONFIG_PM
-extern int __nongprelbss clock_cmodes_permitted;
-extern unsigned long __nongprelbss clock_bits_settable;
-#define CLOCK_BIT_CM		0x0000000f
-#define CLOCK_BIT_CM_H		0x00000001	/* CLKC.CM can be set to 0 */
-#define CLOCK_BIT_CM_M		0x00000002	/* CLKC.CM can be set to 1 */
-#define CLOCK_BIT_CM_L		0x00000004	/* CLKC.CM can be set to 2 */
-#define CLOCK_BIT_P0		0x00000010	/* CLKC.P0 can be changed */
-#define CLOCK_BIT_CMODE		0x00000020	/* CLKC.CMODE can be changed */
-
-extern void (*__power_switch_wake_setup)(void);
-extern int  (*__power_switch_wake_check)(void);
-extern void (*__power_switch_wake_cleanup)(void);
-#endif
-
-/* time.c */
-extern void time_divisor_init(void);
-
-/* cmode.S */
-extern asmlinkage void frv_change_cmode(int);
-
-
-#endif /* __ASSEMBLY__ */
-#endif /* _FRV_LOCAL_H */
Index: linux-2.6/include/asm-alpha/local.h
===================================================================
--- linux-2.6.orig/include/asm-alpha/local.h	2008-05-29 10:57:34.700237406 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,118 +0,0 @@
-#ifndef _ALPHA_LOCAL_H
-#define _ALPHA_LOCAL_H
-
-#include <linux/percpu.h>
-#include <asm/atomic.h>
-
-typedef struct
-{
-	atomic_long_t a;
-} local_t;
-
-#define LOCAL_INIT(i)	{ ATOMIC_LONG_INIT(i) }
-#define local_read(l)	atomic_long_read(&(l)->a)
-#define local_set(l,i)	atomic_long_set(&(l)->a, (i))
-#define local_inc(l)	atomic_long_inc(&(l)->a)
-#define local_dec(l)	atomic_long_dec(&(l)->a)
-#define local_add(i,l)	atomic_long_add((i),(&(l)->a))
-#define local_sub(i,l)	atomic_long_sub((i),(&(l)->a))
-
-static __inline__ long local_add_return(long i, local_t * l)
-{
-	long temp, result;
-	__asm__ __volatile__(
-	"1:	ldq_l %0,%1\n"
-	"	addq %0,%3,%2\n"
-	"	addq %0,%3,%0\n"
-	"	stq_c %0,%1\n"
-	"	beq %0,2f\n"
-	".subsection 2\n"
-	"2:	br 1b\n"
-	".previous"
-	:"=&r" (temp), "=m" (l->a.counter), "=&r" (result)
-	:"Ir" (i), "m" (l->a.counter) : "memory");
-	return result;
-}
-
-static __inline__ long local_sub_return(long i, local_t * l)
-{
-	long temp, result;
-	__asm__ __volatile__(
-	"1:	ldq_l %0,%1\n"
-	"	subq %0,%3,%2\n"
-	"	subq %0,%3,%0\n"
-	"	stq_c %0,%1\n"
-	"	beq %0,2f\n"
-	".subsection 2\n"
-	"2:	br 1b\n"
-	".previous"
-	:"=&r" (temp), "=m" (l->a.counter), "=&r" (result)
-	:"Ir" (i), "m" (l->a.counter) : "memory");
-	return result;
-}
-
-#define local_cmpxchg(l, o, n) \
-	(cmpxchg_local(&((l)->a.counter), (o), (n)))
-#define local_xchg(l, n) (xchg_local(&((l)->a.counter), (n)))
-
-/**
- * local_add_unless - add unless the number is a given value
- * @l: pointer of type local_t
- * @a: the amount to add to l...
- * @u: ...unless l is equal to u.
- *
- * Atomically adds @a to @l, so long as it was not @u.
- * Returns non-zero if @l was not @u, and zero otherwise.
- */
-#define local_add_unless(l, a, u)				\
-({								\
-	long c, old;						\
-	c = local_read(l);					\
-	for (;;) {						\
-		if (unlikely(c == (u)))				\
-			break;					\
-		old = local_cmpxchg((l), c, c + (a));	\
-		if (likely(old == c))				\
-			break;					\
-		c = old;					\
-	}							\
-	c != (u);						\
-})
-#define local_inc_not_zero(l) local_add_unless((l), 1, 0)
-
-#define local_add_negative(a, l) (local_add_return((a), (l)) < 0)
-
-#define local_dec_return(l) local_sub_return(1,(l))
-
-#define local_inc_return(l) local_add_return(1,(l))
-
-#define local_sub_and_test(i,l) (local_sub_return((i), (l)) == 0)
-
-#define local_inc_and_test(l) (local_add_return(1, (l)) == 0)
-
-#define local_dec_and_test(l) (local_sub_return(1, (l)) == 0)
-
-/* Verify if faster than atomic ops */
-#define __local_inc(l)		((l)->a.counter++)
-#define __local_dec(l)		((l)->a.counter++)
-#define __local_add(i,l)	((l)->a.counter+=(i))
-#define __local_sub(i,l)	((l)->a.counter-=(i))
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations.  Note they take
- * a variable, not an address.
- */
-#define cpu_local_read(l)	local_read(&__get_cpu_var(l))
-#define cpu_local_set(l, i)	local_set(&__get_cpu_var(l), (i))
-
-#define cpu_local_inc(l)	local_inc(&__get_cpu_var(l))
-#define cpu_local_dec(l)	local_dec(&__get_cpu_var(l))
-#define cpu_local_add(i, l)	local_add((i), &__get_cpu_var(l))
-#define cpu_local_sub(i, l)	local_sub((i), &__get_cpu_var(l))
-
-#define __cpu_local_inc(l)	__local_inc(&__get_cpu_var(l))
-#define __cpu_local_dec(l)	__local_dec(&__get_cpu_var(l))
-#define __cpu_local_add(i, l)	__local_add((i), &__get_cpu_var(l))
-#define __cpu_local_sub(i, l)	__local_sub((i), &__get_cpu_var(l))
-
-#endif /* _ALPHA_LOCAL_H */
Index: linux-2.6/include/asm-arm/local.h
===================================================================
--- linux-2.6.orig/include/asm-arm/local.h	2008-05-29 10:57:34.836486364 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>
Index: linux-2.6/include/asm-avr32/local.h
===================================================================
--- linux-2.6.orig/include/asm-avr32/local.h	2008-05-29 10:57:34.846486525 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef __ASM_AVR32_LOCAL_H
-#define __ASM_AVR32_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* __ASM_AVR32_LOCAL_H */
Index: linux-2.6/include/asm-blackfin/local.h
===================================================================
--- linux-2.6.orig/include/asm-blackfin/local.h	2008-05-29 10:57:34.876486219 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef __BLACKFIN_LOCAL_H
-#define __BLACKFIN_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif				/* __BLACKFIN_LOCAL_H */
Index: linux-2.6/include/asm-cris/local.h
===================================================================
--- linux-2.6.orig/include/asm-cris/local.h	2008-05-29 10:57:34.886488493 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>
Index: linux-2.6/include/asm-frv/local.h
===================================================================
--- linux-2.6.orig/include/asm-frv/local.h	2008-05-29 10:57:34.896486992 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef _ASM_LOCAL_H
-#define _ASM_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* _ASM_LOCAL_H */
Index: linux-2.6/include/asm-generic/local.h
===================================================================
--- linux-2.6.orig/include/asm-generic/local.h	2008-05-29 10:57:34.906487888 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,75 +0,0 @@
-#ifndef _ASM_GENERIC_LOCAL_H
-#define _ASM_GENERIC_LOCAL_H
-
-#include <linux/percpu.h>
-#include <linux/hardirq.h>
-#include <asm/atomic.h>
-#include <asm/types.h>
-
-/*
- * A signed long type for operations which are atomic for a single CPU.
- * Usually used in combination with per-cpu variables.
- *
- * This is the default implementation, which uses atomic_long_t.  Which is
- * rather pointless.  The whole point behind local_t is that some processors
- * can perform atomic adds and subtracts in a manner which is atomic wrt IRQs
- * running on this CPU.  local_t allows exploitation of such capabilities.
- */
-
-/* Implement in terms of atomics. */
-
-/* Don't use typedef: don't want them to be mixed with atomic_t's. */
-typedef struct
-{
-	atomic_long_t a;
-} local_t;
-
-#define LOCAL_INIT(i)	{ ATOMIC_LONG_INIT(i) }
-
-#define local_read(l)	atomic_long_read(&(l)->a)
-#define local_set(l,i)	atomic_long_set((&(l)->a),(i))
-#define local_inc(l)	atomic_long_inc(&(l)->a)
-#define local_dec(l)	atomic_long_dec(&(l)->a)
-#define local_add(i,l)	atomic_long_add((i),(&(l)->a))
-#define local_sub(i,l)	atomic_long_sub((i),(&(l)->a))
-
-#define local_sub_and_test(i, l) atomic_long_sub_and_test((i), (&(l)->a))
-#define local_dec_and_test(l) atomic_long_dec_and_test(&(l)->a)
-#define local_inc_and_test(l) atomic_long_inc_and_test(&(l)->a)
-#define local_add_negative(i, l) atomic_long_add_negative((i), (&(l)->a))
-#define local_add_return(i, l) atomic_long_add_return((i), (&(l)->a))
-#define local_sub_return(i, l) atomic_long_sub_return((i), (&(l)->a))
-#define local_inc_return(l) atomic_long_inc_return(&(l)->a)
-
-#define local_cmpxchg(l, o, n) atomic_long_cmpxchg((&(l)->a), (o), (n))
-#define local_xchg(l, n) atomic_long_xchg((&(l)->a), (n))
-#define local_add_unless(l, a, u) atomic_long_add_unless((&(l)->a), (a), (u))
-#define local_inc_not_zero(l) atomic_long_inc_not_zero(&(l)->a)
-
-/* Non-atomic variants, ie. preemption disabled and won't be touched
- * in interrupt, etc.  Some archs can optimize this case well. */
-#define __local_inc(l)		local_set((l), local_read(l) + 1)
-#define __local_dec(l)		local_set((l), local_read(l) - 1)
-#define __local_add(i,l)	local_set((l), local_read(l) + (i))
-#define __local_sub(i,l)	local_set((l), local_read(l) - (i))
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations.  Note they take
- * a variable (eg. mystruct.foo), not an address.
- */
-#define cpu_local_read(l)	local_read(&__get_cpu_var(l))
-#define cpu_local_set(l, i)	local_set(&__get_cpu_var(l), (i))
-#define cpu_local_inc(l)	local_inc(&__get_cpu_var(l))
-#define cpu_local_dec(l)	local_dec(&__get_cpu_var(l))
-#define cpu_local_add(i, l)	local_add((i), &__get_cpu_var(l))
-#define cpu_local_sub(i, l)	local_sub((i), &__get_cpu_var(l))
-
-/* Non-atomic increments, ie. preemption disabled and won't be touched
- * in interrupt, etc.  Some archs can optimize this case well.
- */
-#define __cpu_local_inc(l)	__local_inc(&__get_cpu_var(l))
-#define __cpu_local_dec(l)	__local_dec(&__get_cpu_var(l))
-#define __cpu_local_add(i, l)	__local_add((i), &__get_cpu_var(l))
-#define __cpu_local_sub(i, l)	__local_sub((i), &__get_cpu_var(l))
-
-#endif /* _ASM_GENERIC_LOCAL_H */
Index: linux-2.6/include/asm-h8300/local.h
===================================================================
--- linux-2.6.orig/include/asm-h8300/local.h	2008-05-29 10:57:34.916488227 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef _H8300_LOCAL_H_
-#define _H8300_LOCAL_H_
-
-#include <asm-generic/local.h>
-
-#endif
Index: linux-2.6/include/asm-ia64/local.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/local.h	2008-05-29 10:57:34.976486190 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>
Index: linux-2.6/include/asm-m32r/local.h
===================================================================
--- linux-2.6.orig/include/asm-m32r/local.h	2008-05-29 10:57:34.986488047 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,366 +0,0 @@
-#ifndef __M32R_LOCAL_H
-#define __M32R_LOCAL_H
-
-/*
- *  linux/include/asm-m32r/local.h
- *
- *  M32R version:
- *    Copyright (C) 2001, 2002  Hitoshi Yamamoto
- *    Copyright (C) 2004  Hirokazu Takata <takata at linux-m32r.org>
- *    Copyright (C) 2007  Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
- */
-
-#include <linux/percpu.h>
-#include <asm/assembler.h>
-#include <asm/system.h>
-#include <asm/local.h>
-
-/*
- * Atomic operations that C can't guarantee us.  Useful for
- * resource counting etc..
- */
-
-/*
- * Make sure gcc doesn't try to be clever and move things around
- * on us. We need to use _exactly_ the address the user gave us,
- * not some alias that contains the same information.
- */
-typedef struct { volatile int counter; } local_t;
-
-#define LOCAL_INIT(i)	{ (i) }
-
-/**
- * local_read - read local variable
- * @l: pointer of type local_t
- *
- * Atomically reads the value of @l.
- */
-#define local_read(l)	((l)->counter)
-
-/**
- * local_set - set local variable
- * @l: pointer of type local_t
- * @i: required value
- *
- * Atomically sets the value of @l to @i.
- */
-#define local_set(l, i)	(((l)->counter) = (i))
-
-/**
- * local_add_return - add long to local variable and return it
- * @i: long value to add
- * @l: pointer of type local_t
- *
- * Atomically adds @i to @l and return (@i + @l).
- */
-static inline long local_add_return(long i, local_t *l)
-{
-	unsigned long flags;
-	long result;
-
-	local_irq_save(flags);
-	__asm__ __volatile__ (
-		"# local_add_return		\n\t"
-		DCACHE_CLEAR("%0", "r4", "%1")
-		"ld %0, @%1;			\n\t"
-		"add	%0, %2;			\n\t"
-		"st %0, @%1;			\n\t"
-		: "=&r" (result)
-		: "r" (&l->counter), "r" (i)
-		: "memory"
-#ifdef CONFIG_CHIP_M32700_TS1
-		, "r4"
-#endif	/* CONFIG_CHIP_M32700_TS1 */
-	);
-	local_irq_restore(flags);
-
-	return result;
-}
-
-/**
- * local_sub_return - subtract long from local variable and return it
- * @i: long value to subtract
- * @l: pointer of type local_t
- *
- * Atomically subtracts @i from @l and return (@l - @i).
- */
-static inline long local_sub_return(long i, local_t *l)
-{
-	unsigned long flags;
-	long result;
-
-	local_irq_save(flags);
-	__asm__ __volatile__ (
-		"# local_sub_return		\n\t"
-		DCACHE_CLEAR("%0", "r4", "%1")
-		"ld %0, @%1;			\n\t"
-		"sub	%0, %2;			\n\t"
-		"st %0, @%1;			\n\t"
-		: "=&r" (result)
-		: "r" (&l->counter), "r" (i)
-		: "memory"
-#ifdef CONFIG_CHIP_M32700_TS1
-		, "r4"
-#endif	/* CONFIG_CHIP_M32700_TS1 */
-	);
-	local_irq_restore(flags);
-
-	return result;
-}
-
-/**
- * local_add - add long to local variable
- * @i: long value to add
- * @l: pointer of type local_t
- *
- * Atomically adds @i to @l.
- */
-#define local_add(i, l) ((void) local_add_return((i), (l)))
-
-/**
- * local_sub - subtract the local variable
- * @i: long value to subtract
- * @l: pointer of type local_t
- *
- * Atomically subtracts @i from @l.
- */
-#define local_sub(i, l) ((void) local_sub_return((i), (l)))
-
-/**
- * local_sub_and_test - subtract value from variable and test result
- * @i: integer value to subtract
- * @l: pointer of type local_t
- *
- * Atomically subtracts @i from @l and returns
- * true if the result is zero, or false for all
- * other cases.
- */
-#define local_sub_and_test(i, l) (local_sub_return((i), (l)) == 0)
-
-/**
- * local_inc_return - increment local variable and return it
- * @l: pointer of type local_t
- *
- * Atomically increments @l by 1 and returns the result.
- */
-static inline long local_inc_return(local_t *l)
-{
-	unsigned long flags;
-	long result;
-
-	local_irq_save(flags);
-	__asm__ __volatile__ (
-		"# local_inc_return		\n\t"
-		DCACHE_CLEAR("%0", "r4", "%1")
-		"ld %0, @%1;			\n\t"
-		"addi	%0, #1;			\n\t"
-		"st %0, @%1;			\n\t"
-		: "=&r" (result)
-		: "r" (&l->counter)
-		: "memory"
-#ifdef CONFIG_CHIP_M32700_TS1
-		, "r4"
-#endif	/* CONFIG_CHIP_M32700_TS1 */
-	);
-	local_irq_restore(flags);
-
-	return result;
-}
-
-/**
- * local_dec_return - decrement local variable and return it
- * @l: pointer of type local_t
- *
- * Atomically decrements @l by 1 and returns the result.
- */
-static inline long local_dec_return(local_t *l)
-{
-	unsigned long flags;
-	long result;
-
-	local_irq_save(flags);
-	__asm__ __volatile__ (
-		"# local_dec_return		\n\t"
-		DCACHE_CLEAR("%0", "r4", "%1")
-		"ld %0, @%1;			\n\t"
-		"addi	%0, #-1;		\n\t"
-		"st %0, @%1;			\n\t"
-		: "=&r" (result)
-		: "r" (&l->counter)
-		: "memory"
-#ifdef CONFIG_CHIP_M32700_TS1
-		, "r4"
-#endif	/* CONFIG_CHIP_M32700_TS1 */
-	);
-	local_irq_restore(flags);
-
-	return result;
-}
-
-/**
- * local_inc - increment local variable
- * @l: pointer of type local_t
- *
- * Atomically increments @l by 1.
- */
-#define local_inc(l) ((void)local_inc_return(l))
-
-/**
- * local_dec - decrement local variable
- * @l: pointer of type local_t
- *
- * Atomically decrements @l by 1.
- */
-#define local_dec(l) ((void)local_dec_return(l))
-
-/**
- * local_inc_and_test - increment and test
- * @l: pointer of type local_t
- *
- * Atomically increments @l by 1
- * and returns true if the result is zero, or false for all
- * other cases.
- */
-#define local_inc_and_test(l) (local_inc_return(l) == 0)
-
-/**
- * local_dec_and_test - decrement and test
- * @l: pointer of type local_t
- *
- * Atomically decrements @l by 1 and
- * returns true if the result is 0, or false for all
- * other cases.
- */
-#define local_dec_and_test(l) (local_dec_return(l) == 0)
-
-/**
- * local_add_negative - add and test if negative
- * @l: pointer of type local_t
- * @i: integer value to add
- *
- * Atomically adds @i to @l and returns true
- * if the result is negative, or false when
- * result is greater than or equal to zero.
- */
-#define local_add_negative(i, l) (local_add_return((i), (l)) < 0)
-
-#define local_cmpxchg(l, o, n) (cmpxchg_local(&((l)->counter), (o), (n)))
-#define local_xchg(v, new) (xchg_local(&((l)->counter), new))
-
-/**
- * local_add_unless - add unless the number is a given value
- * @l: pointer of type local_t
- * @a: the amount to add to l...
- * @u: ...unless l is equal to u.
- *
- * Atomically adds @a to @l, so long as it was not @u.
- * Returns non-zero if @l was not @u, and zero otherwise.
- */
-static inline int local_add_unless(local_t *l, long a, long u)
-{
-	long c, old;
-	c = local_read(l);
-	for (;;) {
-		if (unlikely(c == (u)))
-			break;
-		old = local_cmpxchg((l), c, c + (a));
-		if (likely(old == c))
-			break;
-		c = old;
-	}
-	return c != (u);
-}
-
-#define local_inc_not_zero(l) local_add_unless((l), 1, 0)
-
-static inline void local_clear_mask(unsigned long  mask, local_t *addr)
-{
-	unsigned long flags;
-	unsigned long tmp;
-
-	local_irq_save(flags);
-	__asm__ __volatile__ (
-		"# local_clear_mask		\n\t"
-		DCACHE_CLEAR("%0", "r5", "%1")
-		"ld %0, @%1;			\n\t"
-		"and	%0, %2;			\n\t"
-		"st %0, @%1;			\n\t"
-		: "=&r" (tmp)
-		: "r" (addr), "r" (~mask)
-		: "memory"
-#ifdef CONFIG_CHIP_M32700_TS1
-		, "r5"
-#endif	/* CONFIG_CHIP_M32700_TS1 */
-	);
-	local_irq_restore(flags);
-}
-
-static inline void local_set_mask(unsigned long  mask, local_t *addr)
-{
-	unsigned long flags;
-	unsigned long tmp;
-
-	local_irq_save(flags);
-	__asm__ __volatile__ (
-		"# local_set_mask		\n\t"
-		DCACHE_CLEAR("%0", "r5", "%1")
-		"ld %0, @%1;			\n\t"
-		"or	%0, %2;			\n\t"
-		"st %0, @%1;			\n\t"
-		: "=&r" (tmp)
-		: "r" (addr), "r" (mask)
-		: "memory"
-#ifdef CONFIG_CHIP_M32700_TS1
-		, "r5"
-#endif	/* CONFIG_CHIP_M32700_TS1 */
-	);
-	local_irq_restore(flags);
-}
-
-/* Atomic operations are already serializing on m32r */
-#define smp_mb__before_local_dec()	barrier()
-#define smp_mb__after_local_dec()	barrier()
-#define smp_mb__before_local_inc()	barrier()
-#define smp_mb__after_local_inc()	barrier()
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations.  Note they take
- * a variable, not an address.
- */
-
-#define __local_inc(l)		((l)->a.counter++)
-#define __local_dec(l)		((l)->a.counter++)
-#define __local_add(i, l)	((l)->a.counter += (i))
-#define __local_sub(i, l)	((l)->a.counter -= (i))
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations.  Note they take
- * a variable, not an address.
- */
-
-/* Need to disable preemption for the cpu local counters otherwise we could
-   still access a variable of a previous CPU in a non local way. */
-#define cpu_local_wrap_v(l)	 	\
-	({ local_t res__;		\
-	   preempt_disable(); 		\
-	   res__ = (l);			\
-	   preempt_enable();		\
-	   res__; })
-#define cpu_local_wrap(l)		\
-	({ preempt_disable();		\
-	   l;				\
-	   preempt_enable(); })		\
-
-#define cpu_local_read(l)    cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i)  cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l)     cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l)  cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l)  cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l)	cpu_local_inc(l)
-#define __cpu_local_dec(l)	cpu_local_dec(l)
-#define __cpu_local_add(i, l)	cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l)	cpu_local_sub((i), (l))
-
-#endif /* __M32R_LOCAL_H */
Index: linux-2.6/include/asm-m68k/local.h
===================================================================
--- linux-2.6.orig/include/asm-m68k/local.h	2008-05-29 10:57:35.016486932 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef _ASM_M68K_LOCAL_H
-#define _ASM_M68K_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* _ASM_M68K_LOCAL_H */
Index: linux-2.6/include/asm-m68knommu/local.h
===================================================================
--- linux-2.6.orig/include/asm-m68knommu/local.h	2008-05-29 10:57:35.036486582 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef __M68KNOMMU_LOCAL_H
-#define __M68KNOMMU_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* __M68KNOMMU_LOCAL_H */
Index: linux-2.6/include/asm-mips/local.h
===================================================================
--- linux-2.6.orig/include/asm-mips/local.h	2008-05-29 10:57:35.066486235 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,221 +0,0 @@
-#ifndef _ARCH_MIPS_LOCAL_H
-#define _ARCH_MIPS_LOCAL_H
-
-#include <linux/percpu.h>
-#include <linux/bitops.h>
-#include <asm/atomic.h>
-#include <asm/cmpxchg.h>
-#include <asm/war.h>
-
-typedef struct
-{
-	atomic_long_t a;
-} local_t;
-
-#define LOCAL_INIT(i)	{ ATOMIC_LONG_INIT(i) }
-
-#define local_read(l)	atomic_long_read(&(l)->a)
-#define local_set(l, i)	atomic_long_set(&(l)->a, (i))
-
-#define local_add(i, l)	atomic_long_add((i), (&(l)->a))
-#define local_sub(i, l)	atomic_long_sub((i), (&(l)->a))
-#define local_inc(l)	atomic_long_inc(&(l)->a)
-#define local_dec(l)	atomic_long_dec(&(l)->a)
-
-/*
- * Same as above, but return the result value
- */
-static __inline__ long local_add_return(long i, local_t * l)
-{
-	unsigned long result;
-
-	if (cpu_has_llsc && R10000_LLSC_WAR) {
-		unsigned long temp;
-
-		__asm__ __volatile__(
-		"	.set	mips3					\n"
-		"1:"	__LL	"%1, %2		# local_add_return	\n"
-		"	addu	%0, %1, %3				\n"
-			__SC	"%0, %2					\n"
-		"	beqzl	%0, 1b					\n"
-		"	addu	%0, %1, %3				\n"
-		"	.set	mips0					\n"
-		: "=&r" (result), "=&r" (temp), "=m" (l->a.counter)
-		: "Ir" (i), "m" (l->a.counter)
-		: "memory");
-	} else if (cpu_has_llsc) {
-		unsigned long temp;
-
-		__asm__ __volatile__(
-		"	.set	mips3					\n"
-		"1:"	__LL	"%1, %2		# local_add_return	\n"
-		"	addu	%0, %1, %3				\n"
-			__SC	"%0, %2					\n"
-		"	beqz	%0, 1b					\n"
-		"	addu	%0, %1, %3				\n"
-		"	.set	mips0					\n"
-		: "=&r" (result), "=&r" (temp), "=m" (l->a.counter)
-		: "Ir" (i), "m" (l->a.counter)
-		: "memory");
-	} else {
-		unsigned long flags;
-
-		local_irq_save(flags);
-		result = l->a.counter;
-		result += i;
-		l->a.counter = result;
-		local_irq_restore(flags);
-	}
-
-	return result;
-}
-
-static __inline__ long local_sub_return(long i, local_t * l)
-{
-	unsigned long result;
-
-	if (cpu_has_llsc && R10000_LLSC_WAR) {
-		unsigned long temp;
-
-		__asm__ __volatile__(
-		"	.set	mips3					\n"
-		"1:"	__LL	"%1, %2		# local_sub_return	\n"
-		"	subu	%0, %1, %3				\n"
-			__SC	"%0, %2					\n"
-		"	beqzl	%0, 1b					\n"
-		"	subu	%0, %1, %3				\n"
-		"	.set	mips0					\n"
-		: "=&r" (result), "=&r" (temp), "=m" (l->a.counter)
-		: "Ir" (i), "m" (l->a.counter)
-		: "memory");
-	} else if (cpu_has_llsc) {
-		unsigned long temp;
-
-		__asm__ __volatile__(
-		"	.set	mips3					\n"
-		"1:"	__LL	"%1, %2		# local_sub_return	\n"
-		"	subu	%0, %1, %3				\n"
-			__SC	"%0, %2					\n"
-		"	beqz	%0, 1b					\n"
-		"	subu	%0, %1, %3				\n"
-		"	.set	mips0					\n"
-		: "=&r" (result), "=&r" (temp), "=m" (l->a.counter)
-		: "Ir" (i), "m" (l->a.counter)
-		: "memory");
-	} else {
-		unsigned long flags;
-
-		local_irq_save(flags);
-		result = l->a.counter;
-		result -= i;
-		l->a.counter = result;
-		local_irq_restore(flags);
-	}
-
-	return result;
-}
-
-#define local_cmpxchg(l, o, n) \
-	((long)cmpxchg_local(&((l)->a.counter), (o), (n)))
-#define local_xchg(l, n) (xchg_local(&((l)->a.counter), (n)))
-
-/**
- * local_add_unless - add unless the number is a given value
- * @l: pointer of type local_t
- * @a: the amount to add to l...
- * @u: ...unless l is equal to u.
- *
- * Atomically adds @a to @l, so long as it was not @u.
- * Returns non-zero if @l was not @u, and zero otherwise.
- */
-#define local_add_unless(l, a, u)				\
-({								\
-	long c, old;						\
-	c = local_read(l);					\
-	while (c != (u) && (old = local_cmpxchg((l), c, c + (a))) != c) \
-		c = old;					\
-	c != (u);						\
-})
-#define local_inc_not_zero(l) local_add_unless((l), 1, 0)
-
-#define local_dec_return(l) local_sub_return(1, (l))
-#define local_inc_return(l) local_add_return(1, (l))
-
-/*
- * local_sub_and_test - subtract value from variable and test result
- * @i: integer value to subtract
- * @l: pointer of type local_t
- *
- * Atomically subtracts @i from @l and returns
- * true if the result is zero, or false for all
- * other cases.
- */
-#define local_sub_and_test(i, l) (local_sub_return((i), (l)) == 0)
-
-/*
- * local_inc_and_test - increment and test
- * @l: pointer of type local_t
- *
- * Atomically increments @l by 1
- * and returns true if the result is zero, or false for all
- * other cases.
- */
-#define local_inc_and_test(l) (local_inc_return(l) == 0)
-
-/*
- * local_dec_and_test - decrement by 1 and test
- * @l: pointer of type local_t
- *
- * Atomically decrements @l by 1 and
- * returns true if the result is 0, or false for all other
- * cases.
- */
-#define local_dec_and_test(l) (local_sub_return(1, (l)) == 0)
-
-/*
- * local_add_negative - add and test if negative
- * @l: pointer of type local_t
- * @i: integer value to add
- *
- * Atomically adds @i to @l and returns true
- * if the result is negative, or false when
- * result is greater than or equal to zero.
- */
-#define local_add_negative(i, l) (local_add_return(i, (l)) < 0)
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations.  Note they take
- * a variable, not an address.
- */
-
-#define __local_inc(l)		((l)->a.counter++)
-#define __local_dec(l)		((l)->a.counter++)
-#define __local_add(i, l)	((l)->a.counter+=(i))
-#define __local_sub(i, l)	((l)->a.counter-=(i))
-
-/* Need to disable preemption for the cpu local counters otherwise we could
-   still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l)	 	\
-	({ local_t res__;		\
-	   preempt_disable(); 		\
-	   res__ = (l);			\
-	   preempt_enable();		\
-	   res__; })
-#define cpu_local_wrap(l)		\
-	({ preempt_disable();		\
-	   l;				\
-	   preempt_enable(); })		\
-
-#define cpu_local_read(l)    cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i)  cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l)     cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l)  cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l)  cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l)	cpu_local_inc(l)
-#define __cpu_local_dec(l)	cpu_local_dec(l)
-#define __cpu_local_add(i, l)	cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l)	cpu_local_sub((i), (l))
-
-#endif /* _ARCH_MIPS_LOCAL_H */
Index: linux-2.6/include/asm-parisc/local.h
===================================================================
--- linux-2.6.orig/include/asm-parisc/local.h	2008-05-29 10:57:35.096488491 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>
Index: linux-2.6/include/asm-powerpc/local.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/local.h	2008-05-29 10:57:35.346487556 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,200 +0,0 @@
-#ifndef _ARCH_POWERPC_LOCAL_H
-#define _ARCH_POWERPC_LOCAL_H
-
-#include <linux/percpu.h>
-#include <asm/atomic.h>
-
-typedef struct
-{
-	atomic_long_t a;
-} local_t;
-
-#define LOCAL_INIT(i)	{ ATOMIC_LONG_INIT(i) }
-
-#define local_read(l)	atomic_long_read(&(l)->a)
-#define local_set(l,i)	atomic_long_set(&(l)->a, (i))
-
-#define local_add(i,l)	atomic_long_add((i),(&(l)->a))
-#define local_sub(i,l)	atomic_long_sub((i),(&(l)->a))
-#define local_inc(l)	atomic_long_inc(&(l)->a)
-#define local_dec(l)	atomic_long_dec(&(l)->a)
-
-static __inline__ long local_add_return(long a, local_t *l)
-{
-	long t;
-
-	__asm__ __volatile__(
-"1:"	PPC_LLARX	"%0,0,%2		# local_add_return\n\
-	add	%0,%1,%0\n"
-	PPC405_ERR77(0,%2)
-	PPC_STLCX	"%0,0,%2 \n\
-	bne-	1b"
-	: "=&r" (t)
-	: "r" (a), "r" (&(l->a.counter))
-	: "cc", "memory");
-
-	return t;
-}
-
-#define local_add_negative(a, l)	(local_add_return((a), (l)) < 0)
-
-static __inline__ long local_sub_return(long a, local_t *l)
-{
-	long t;
-
-	__asm__ __volatile__(
-"1:"	PPC_LLARX	"%0,0,%2		# local_sub_return\n\
-	subf	%0,%1,%0\n"
-	PPC405_ERR77(0,%2)
-	PPC_STLCX	"%0,0,%2 \n\
-	bne-	1b"
-	: "=&r" (t)
-	: "r" (a), "r" (&(l->a.counter))
-	: "cc", "memory");
-
-	return t;
-}
-
-static __inline__ long local_inc_return(local_t *l)
-{
-	long t;
-
-	__asm__ __volatile__(
-"1:"	PPC_LLARX	"%0,0,%1		# local_inc_return\n\
-	addic	%0,%0,1\n"
-	PPC405_ERR77(0,%1)
-	PPC_STLCX	"%0,0,%1 \n\
-	bne-	1b"
-	: "=&r" (t)
-	: "r" (&(l->a.counter))
-	: "cc", "memory");
-
-	return t;
-}
-
-/*
- * local_inc_and_test - increment and test
- * @l: pointer of type local_t
- *
- * Atomically increments @l by 1
- * and returns true if the result is zero, or false for all
- * other cases.
- */
-#define local_inc_and_test(l) (local_inc_return(l) == 0)
-
-static __inline__ long local_dec_return(local_t *l)
-{
-	long t;
-
-	__asm__ __volatile__(
-"1:"	PPC_LLARX	"%0,0,%1		# local_dec_return\n\
-	addic	%0,%0,-1\n"
-	PPC405_ERR77(0,%1)
-	PPC_STLCX	"%0,0,%1\n\
-	bne-	1b"
-	: "=&r" (t)
-	: "r" (&(l->a.counter))
-	: "cc", "memory");
-
-	return t;
-}
-
-#define local_cmpxchg(l, o, n) \
-	(cmpxchg_local(&((l)->a.counter), (o), (n)))
-#define local_xchg(l, n) (xchg_local(&((l)->a.counter), (n)))
-
-/**
- * local_add_unless - add unless the number is a given value
- * @l: pointer of type local_t
- * @a: the amount to add to v...
- * @u: ...unless v is equal to u.
- *
- * Atomically adds @a to @l, so long as it was not @u.
- * Returns non-zero if @l was not @u, and zero otherwise.
- */
-static __inline__ int local_add_unless(local_t *l, long a, long u)
-{
-	long t;
-
-	__asm__ __volatile__ (
-"1:"	PPC_LLARX	"%0,0,%1		# local_add_unless\n\
-	cmpw	0,%0,%3 \n\
-	beq-	2f \n\
-	add	%0,%2,%0 \n"
-	PPC405_ERR77(0,%2)
-	PPC_STLCX	"%0,0,%1 \n\
-	bne-	1b \n"
-"	subf	%0,%2,%0 \n\
-2:"
-	: "=&r" (t)
-	: "r" (&(l->a.counter)), "r" (a), "r" (u)
-	: "cc", "memory");
-
-	return t != u;
-}
-
-#define local_inc_not_zero(l) local_add_unless((l), 1, 0)
-
-#define local_sub_and_test(a, l)	(local_sub_return((a), (l)) == 0)
-#define local_dec_and_test(l)		(local_dec_return((l)) == 0)
-
-/*
- * Atomically test *l and decrement if it is greater than 0.
- * The function returns the old value of *l minus 1.
- */
-static __inline__ long local_dec_if_positive(local_t *l)
-{
-	long t;
-
-	__asm__ __volatile__(
-"1:"	PPC_LLARX	"%0,0,%1		# local_dec_if_positive\n\
-	cmpwi	%0,1\n\
-	addi	%0,%0,-1\n\
-	blt-	2f\n"
-	PPC405_ERR77(0,%1)
-	PPC_STLCX	"%0,0,%1\n\
-	bne-	1b"
-	"\n\
-2:"	: "=&b" (t)
-	: "r" (&(l->a.counter))
-	: "cc", "memory");
-
-	return t;
-}
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations.  Note they take
- * a variable, not an address.
- */
-
-#define __local_inc(l)		((l)->a.counter++)
-#define __local_dec(l)		((l)->a.counter++)
-#define __local_add(i,l)	((l)->a.counter+=(i))
-#define __local_sub(i,l)	((l)->a.counter-=(i))
-
-/* Need to disable preemption for the cpu local counters otherwise we could
-   still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l)	 	\
-	({ local_t res__;		\
-	   preempt_disable(); 		\
-	   res__ = (l);			\
-	   preempt_enable();		\
-	   res__; })
-#define cpu_local_wrap(l)		\
-	({ preempt_disable();		\
-	   l;				\
-	   preempt_enable(); })		\
-
-#define cpu_local_read(l)    cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i)  cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l)     cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l)  cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l)  cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l)	cpu_local_inc(l)
-#define __cpu_local_dec(l)	cpu_local_dec(l)
-#define __cpu_local_add(i, l)	cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l)	cpu_local_sub((i), (l))
-
-#endif /* _ARCH_POWERPC_LOCAL_H */
Index: linux-2.6/include/asm-s390/local.h
===================================================================
--- linux-2.6.orig/include/asm-s390/local.h	2008-05-29 10:57:35.366488125 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>
Index: linux-2.6/include/asm-sh/local.h
===================================================================
--- linux-2.6.orig/include/asm-sh/local.h	2008-05-29 10:57:35.386488139 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,7 +0,0 @@
-#ifndef __ASM_SH_LOCAL_H
-#define __ASM_SH_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* __ASM_SH_LOCAL_H */
-
Index: linux-2.6/include/asm-sparc/local.h
===================================================================
--- linux-2.6.orig/include/asm-sparc/local.h	2008-05-29 10:57:35.406486837 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef _SPARC_LOCAL_H
-#define _SPARC_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif
Index: linux-2.6/include/asm-sparc64/local.h
===================================================================
--- linux-2.6.orig/include/asm-sparc64/local.h	2008-05-29 10:57:35.416487655 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>
Index: linux-2.6/include/asm-um/local.h
===================================================================
--- linux-2.6.orig/include/asm-um/local.h	2008-05-29 10:57:35.516486346 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef __UM_LOCAL_H
-#define __UM_LOCAL_H
-
-#include "asm/arch/local.h"
-
-#endif
Index: linux-2.6/include/asm-v850/local.h
===================================================================
--- linux-2.6.orig/include/asm-v850/local.h	2008-05-29 10:57:35.536486897 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef __V850_LOCAL_H__
-#define __V850_LOCAL_H__
-
-#include <asm-generic/local.h>
-
-#endif /* __V850_LOCAL_H__ */
Index: linux-2.6/include/asm-xtensa/local.h
===================================================================
--- linux-2.6.orig/include/asm-xtensa/local.h	2008-05-29 10:57:35.546488200 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,16 +0,0 @@
-/*
- * include/asm-xtensa/local.h
- *
- * This file is subject to the terms and conditions of the GNU General Public
- * License.  See the file "COPYING" in the main directory of this archive
- * for more details.
- *
- * Copyright (C) 2001 - 2005 Tensilica Inc.
- */
-
-#ifndef _XTENSA_LOCAL_H
-#define _XTENSA_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* _XTENSA_LOCAL_H */
Index: linux-2.6/include/linux/module.h
===================================================================
--- linux-2.6.orig/include/linux/module.h	2008-05-29 10:57:35.576486417 -0700
+++ linux-2.6/include/linux/module.h	2008-05-29 11:25:28.333434424 -0700
@@ -16,10 +16,12 @@
 #include <linux/kobject.h>
 #include <linux/moduleparam.h>
 #include <linux/marker.h>
-#include <asm/local.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
 
 #include <asm/module.h>
 
+
 /* Not Yet Implemented */
 #define MODULE_SUPPORTED_DEVICE(name)
 
Index: linux-2.6/include/asm-mn10300/local.h
===================================================================
--- linux-2.6.orig/include/asm-mn10300/local.h	2008-05-29 10:57:35.586487972 -0700
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 31/41] VM statistics: Use CPU ops
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (29 preceding siblings ...)
  2008-05-30  3:56 ` [patch 30/41] Remove local_t support Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 32/41] cpu alloc: Use in slub Christoph Lameter
                   ` (10 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_ops_vmstat --]
[-- Type: text/plain, Size: 1654 bytes --]

The use of CPU ops here avoids the offset calculations that we used to have
to do with per cpu operations. The result of this patch is that event counters
are updated with a single instruction, in the following way:

	incq   %gs:offset(%rip)

Without these patches this was:

	mov    %gs:0x8,%rdx
	mov    %eax,0x38(%rsp)
	mov    xxx(%rip),%eax
	mov    %eax,0x48(%rsp)
	mov    varoffset,%rax
	incq   0x110(%rax,%rdx,1)
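
As a rough illustration of why the new form needs no pointer chasing, here
is a user-space sketch of the two access patterns. All names below
(NR_CPUS_DEMO, struct vm_events_demo, cpu_offset(), count_old(),
count_new()) are invented for the example and only model the behaviour of
the cpu ops; this is not kernel code.

#include <stdio.h>
#include <stdlib.h>

#define NR_CPUS_DEMO 4

struct vm_events_demo { unsigned long event[8]; };

/* Old scheme: a per cpu pointer array that has to be chased. */
static struct vm_events_demo *events_ptr[NR_CPUS_DEMO];

/* New scheme: one linear area, each cpu at a fixed offset from the base. */
static struct vm_events_demo events_area[NR_CPUS_DEMO];

static unsigned long cpu_offset(int cpu)
{
	/* In the kernel this is the per cpu offset (%gs based on x86_64). */
	return cpu * sizeof(struct vm_events_demo);
}

static void count_old(int cpu, int item)
{
	/* Load the per cpu pointer, then dereference it: dependent accesses. */
	events_ptr[cpu]->event[item]++;
}

static void count_new(int cpu, int item)
{
	/* Pure address calculation: base + per cpu offset + field offset. */
	struct vm_events_demo *v = (struct vm_events_demo *)
			((char *)events_area + cpu_offset(cpu));
	v->event[item]++;
}

int main(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_DEMO; cpu++)
		events_ptr[cpu] = calloc(1, sizeof(struct vm_events_demo));

	count_old(1, 3);
	count_new(1, 3);
	printf("old: %lu new: %lu\n",
			events_ptr[1]->event[3], events_area[1].event[3]);
	return 0;
}

Both counters end up at 1; the difference is that count_new() performs only
an address calculation before the increment, which is what lets the kernel
version collapse into a single %gs based incq.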

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/vmstat.h |   10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h	2008-05-20 19:43:43.000000000 -0700
+++ linux-2.6/include/linux/vmstat.h	2008-05-20 21:40:32.000000000 -0700
@@ -63,24 +63,22 @@
 
 static inline void __count_vm_event(enum vm_event_item item)
 {
-	__get_cpu_var(vm_event_states).event[item]++;
+	__CPU_INC(per_cpu_var(vm_event_states).event[item]);
 }
 
 static inline void count_vm_event(enum vm_event_item item)
 {
-	get_cpu_var(vm_event_states).event[item]++;
-	put_cpu();
+	_CPU_INC(per_cpu_var(vm_event_states).event[item]);
 }
 
 static inline void __count_vm_events(enum vm_event_item item, long delta)
 {
-	__get_cpu_var(vm_event_states).event[item] += delta;
+	__CPU_ADD(per_cpu_var(vm_event_states).event[item], delta);
 }
 
 static inline void count_vm_events(enum vm_event_item item, long delta)
 {
-	get_cpu_var(vm_event_states).event[item] += delta;
-	put_cpu();
+	_CPU_ADD(per_cpu_var(vm_event_states).event[item], delta);
 }
 
 extern void all_vm_events(unsigned long *);

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 32/41] cpu alloc: Use in slub
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (30 preceding siblings ...)
  2008-05-30  3:56 ` [patch 31/41] VM statistics: Use CPU ops Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 33/41] cpu alloc: Remove slub fields Christoph Lameter
                   ` (9 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_slub_conversion --]
[-- Type: text/plain, Size: 11012 bytes --]

Using cpu alloc removes the need for the per cpu arrays in the kmem_cache struct.
These could get quite big if we have to support systems with up to thousands of cpus.
The use of cpu_alloc means that:

1. The size of kmem_cache for SMP configurations shrinks since we only
   need one pointer instead of NR_CPUS pointers. The same pointer can be
   used by all processors. This reduces the cache footprint of the allocator.

2. We can dynamically size kmem_cache according to the nodes actually
   present in the system, meaning less memory overhead for configurations
   that potentially support up to 1k NUMA nodes.

3. We can remove the fiddling with allocating and releasing kmem_cache_cpu
   structures when bringing up and shutting down cpus. The cpu alloc logic
   does it all for us. This removes some portions of the cpu hotplug
   functionality.

4. Fastpath performance increases by 20%.
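
A minimal sketch of the data structure change, using simplified stand-in
types (kmem_cache_cpu_demo, NR_CPUS_DEMO and the *_before/*_after structs
are invented for illustration; only the fields relevant here are shown):

#include <stdio.h>

#define NR_CPUS_DEMO 4096		/* dimensioned like NR_CPUS */

struct kmem_cache_cpu_demo {
	void **freelist;
	int node;
};

/* Before: one kmem_cache_cpu pointer per possible processor embedded
 * in every kmem_cache. */
struct kmem_cache_before {
	unsigned long flags;
	int size;
	struct kmem_cache_cpu_demo *cpu_slab[NR_CPUS_DEMO];
};

/* After: a single pointer into a cpu_alloc'ed area shared by all
 * processors; each cpu reaches its own kmem_cache_cpu by adding its
 * per cpu offset (THIS_CPU()/CPU_PTR() in the patch). */
struct kmem_cache_after {
	struct kmem_cache_cpu_demo *cpu_slab;
	unsigned long flags;
	int size;
};

int main(void)
{
	printf("kmem_cache before: %zu bytes, after: %zu bytes\n",
			sizeof(struct kmem_cache_before),
			sizeof(struct kmem_cache_after));
	return 0;
}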

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 arch/x86/Kconfig         |    4 
 include/linux/slub_def.h |    6 -
 mm/slub.c                |  226 ++++++++++-------------------------------------
 3 files changed, 50 insertions(+), 186 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2008-05-27 23:56:24.000000000 -0700
+++ linux-2.6/include/linux/slub_def.h	2008-05-28 00:00:27.000000000 -0700
@@ -67,6 +67,7 @@
  * Slab cache management.
  */
 struct kmem_cache {
+	struct kmem_cache_cpu *cpu_slab;
 	/* Used for retriving partial slabs etc */
 	unsigned long flags;
 	int size;		/* The size of an object including meta data */
@@ -101,11 +102,6 @@
 	int remote_node_defrag_ratio;
 	struct kmem_cache_node *node[MAX_NUMNODES];
 #endif
-#ifdef CONFIG_SMP
-	struct kmem_cache_cpu *cpu_slab[NR_CPUS];
-#else
-	struct kmem_cache_cpu cpu_slab;
-#endif
 };
 
 /*
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2008-05-27 23:56:24.000000000 -0700
+++ linux-2.6/mm/slub.c	2008-05-28 00:00:27.000000000 -0700
@@ -258,15 +258,6 @@
 #endif
 }
 
-static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
-{
-#ifdef CONFIG_SMP
-	return s->cpu_slab[cpu];
-#else
-	return &s->cpu_slab;
-#endif
-}
-
 /* Verify that a pointer has an address that is valid within a slab page */
 static inline int check_valid_pointer(struct kmem_cache *s,
 				struct page *page, const void *object)
@@ -1120,7 +1111,7 @@
 		if (!page)
 			return NULL;
 
-		stat(get_cpu_slab(s, raw_smp_processor_id()), ORDER_FALLBACK);
+		stat(THIS_CPU(s->cpu_slab), ORDER_FALLBACK);
 	}
 	page->objects = oo_objects(oo);
 	mod_zone_page_state(page_zone(page),
@@ -1397,7 +1388,7 @@
 static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
 {
 	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
-	struct kmem_cache_cpu *c = get_cpu_slab(s, smp_processor_id());
+	struct kmem_cache_cpu *c = THIS_CPU(s->cpu_slab);
 
 	ClearSlabFrozen(page);
 	if (page->inuse) {
@@ -1428,7 +1419,7 @@
 			slab_unlock(page);
 		} else {
 			slab_unlock(page);
-			stat(get_cpu_slab(s, raw_smp_processor_id()), FREE_SLAB);
+			stat(__THIS_CPU(s->cpu_slab), FREE_SLAB);
 			discard_slab(s, page);
 		}
 	}
@@ -1481,7 +1472,7 @@
  */
 static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 {
-	struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+	struct kmem_cache_cpu *c = CPU_PTR(s->cpu_slab, cpu);
 
 	if (likely(c && c->page))
 		flush_slab(s, c);
@@ -1496,15 +1487,7 @@
 
 static void flush_all(struct kmem_cache *s)
 {
-#ifdef CONFIG_SMP
 	on_each_cpu(flush_cpu_slab, s, 1, 1);
-#else
-	unsigned long flags;
-
-	local_irq_save(flags);
-	flush_cpu_slab(s);
-	local_irq_restore(flags);
-#endif
 }
 
 /*
@@ -1520,6 +1503,15 @@
 	return 1;
 }
 
+static inline int cpu_node_match(struct kmem_cache_cpu *c, int node)
+{
+#ifdef CONFIG_NUMA
+	if (node != -1 && __CPU_READ(c->node) != node)
+		return 0;
+#endif
+	return 1;
+}
+
 /*
  * Slow path. The lockless freelist is empty or we need to perform
  * debugging duties.
@@ -1592,7 +1584,7 @@
 		local_irq_disable();
 
 	if (new) {
-		c = get_cpu_slab(s, smp_processor_id());
+		c = __THIS_CPU(s->cpu_slab);
 		stat(c, ALLOC_SLAB);
 		if (c->page)
 			flush_slab(s, c);
@@ -1630,20 +1622,20 @@
 	unsigned long flags;
 
 	local_irq_save(flags);
-	c = get_cpu_slab(s, smp_processor_id());
-	if (unlikely(!c->freelist || !node_match(c, node)))
+	c = __THIS_CPU(s->cpu_slab);
+	object = c->freelist;
+	if (unlikely(!object || !node_match(c, node)))
 
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-		object = c->freelist;
 		c->freelist = object[c->offset];
 		stat(c, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
 
 	if (unlikely((gfpflags & __GFP_ZERO) && object))
-		memset(object, 0, c->objsize);
+		memset(object, 0, s->objsize);
 
 	return object;
 }
@@ -1677,7 +1669,7 @@
 	void **object = (void *)x;
 	struct kmem_cache_cpu *c;
 
-	c = get_cpu_slab(s, raw_smp_processor_id());
+	c = __THIS_CPU(s->cpu_slab);
 	stat(c, FREE_SLOWPATH);
 	slab_lock(page);
 
@@ -1748,7 +1740,7 @@
 	unsigned long flags;
 
 	local_irq_save(flags);
-	c = get_cpu_slab(s, smp_processor_id());
+	c = __THIS_CPU(s->cpu_slab);
 	debug_check_no_locks_freed(object, c->objsize);
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
 		debug_check_no_obj_freed(object, s->objsize);
@@ -1962,130 +1954,19 @@
 #endif
 }
 
-#ifdef CONFIG_SMP
-/*
- * Per cpu array for per cpu structures.
- *
- * The per cpu array places all kmem_cache_cpu structures from one processor
- * close together meaning that it becomes possible that multiple per cpu
- * structures are contained in one cacheline. This may be particularly
- * beneficial for the kmalloc caches.
- *
- * A desktop system typically has around 60-80 slabs. With 100 here we are
- * likely able to get per cpu structures for all caches from the array defined
- * here. We must be able to cover all kmalloc caches during bootstrap.
- *
- * If the per cpu array is exhausted then fall back to kmalloc
- * of individual cachelines. No sharing is possible then.
- */
-#define NR_KMEM_CACHE_CPU 100
-
-static DEFINE_PER_CPU(struct kmem_cache_cpu,
-				kmem_cache_cpu)[NR_KMEM_CACHE_CPU];
-
-static DEFINE_PER_CPU(struct kmem_cache_cpu *, kmem_cache_cpu_free);
-static cpumask_t kmem_cach_cpu_free_init_once = CPU_MASK_NONE;
-
-static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s,
-							int cpu, gfp_t flags)
-{
-	struct kmem_cache_cpu *c = per_cpu(kmem_cache_cpu_free, cpu);
-
-	if (c)
-		per_cpu(kmem_cache_cpu_free, cpu) =
-				(void *)c->freelist;
-	else {
-		/* Table overflow: So allocate ourselves */
-		c = kmalloc_node(
-			ALIGN(sizeof(struct kmem_cache_cpu), cache_line_size()),
-			flags, cpu_to_node(cpu));
-		if (!c)
-			return NULL;
-	}
-
-	init_kmem_cache_cpu(s, c);
-	return c;
-}
-
-static void free_kmem_cache_cpu(struct kmem_cache_cpu *c, int cpu)
-{
-	if (c < per_cpu(kmem_cache_cpu, cpu) ||
-			c > per_cpu(kmem_cache_cpu, cpu) + NR_KMEM_CACHE_CPU) {
-		kfree(c);
-		return;
-	}
-	c->freelist = (void *)per_cpu(kmem_cache_cpu_free, cpu);
-	per_cpu(kmem_cache_cpu_free, cpu) = c;
-}
-
-static void free_kmem_cache_cpus(struct kmem_cache *s)
-{
-	int cpu;
-
-	for_each_online_cpu(cpu) {
-		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
-		if (c) {
-			s->cpu_slab[cpu] = NULL;
-			free_kmem_cache_cpu(c, cpu);
-		}
-	}
-}
-
 static int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
 {
 	int cpu;
 
-	for_each_online_cpu(cpu) {
-		struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
-		if (c)
-			continue;
-
-		c = alloc_kmem_cache_cpu(s, cpu, flags);
-		if (!c) {
-			free_kmem_cache_cpus(s);
-			return 0;
-		}
-		s->cpu_slab[cpu] = c;
-	}
-	return 1;
-}
-
-/*
- * Initialize the per cpu array.
- */
-static void init_alloc_cpu_cpu(int cpu)
-{
-	int i;
-
-	if (cpu_isset(cpu, kmem_cach_cpu_free_init_once))
-		return;
-
-	for (i = NR_KMEM_CACHE_CPU - 1; i >= 0; i--)
-		free_kmem_cache_cpu(&per_cpu(kmem_cache_cpu, cpu)[i], cpu);
+	s->cpu_slab = CPU_ALLOC(struct kmem_cache_cpu, flags);
 
-	cpu_set(cpu, kmem_cach_cpu_free_init_once);
-}
-
-static void __init init_alloc_cpu(void)
-{
-	int cpu;
+	if (!s->cpu_slab)
+		return 0;
 
 	for_each_online_cpu(cpu)
-		init_alloc_cpu_cpu(cpu);
-  }
-
-#else
-static inline void free_kmem_cache_cpus(struct kmem_cache *s) {}
-static inline void init_alloc_cpu(void) {}
-
-static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
-{
-	init_kmem_cache_cpu(s, &s->cpu_slab);
+		init_kmem_cache_cpu(s, CPU_PTR(s->cpu_slab, cpu));
 	return 1;
 }
-#endif
 
 #ifdef CONFIG_NUMA
 /*
@@ -2446,9 +2327,8 @@
 	int node;
 
 	flush_all(s);
-
+	CPU_FREE(s->cpu_slab);
 	/* Attempt to free all objects */
-	free_kmem_cache_cpus(s);
 	for_each_node_state(node, N_NORMAL_MEMORY) {
 		struct kmem_cache_node *n = get_node(s, node);
 
@@ -2966,8 +2846,6 @@
 	int i;
 	int caches = 0;
 
-	init_alloc_cpu();
-
 #ifdef CONFIG_NUMA
 	/*
 	 * Must first have the slab cache available for the allocations of the
@@ -3027,11 +2905,12 @@
 	for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++)
 		kmalloc_caches[i]. name =
 			kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
-
 #ifdef CONFIG_SMP
 	register_cpu_notifier(&slab_notifier);
-	kmem_size = offsetof(struct kmem_cache, cpu_slab) +
-				nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
+#endif
+#ifdef CONFIG_NUMA
+	kmem_size = offsetof(struct kmem_cache, node) +
+				nr_node_ids * sizeof(struct kmem_cache_node *);
 #else
 	kmem_size = sizeof(struct kmem_cache);
 #endif
@@ -3128,7 +3007,7 @@
 		 * per cpu structures
 		 */
 		for_each_online_cpu(cpu)
-			get_cpu_slab(s, cpu)->objsize = s->objsize;
+			CPU_PTR(s->cpu_slab, cpu)->objsize = s->objsize;
 
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
 		up_write(&slub_lock);
@@ -3176,11 +3055,9 @@
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		init_alloc_cpu_cpu(cpu);
 		down_read(&slub_lock);
 		list_for_each_entry(s, &slab_caches, list)
-			s->cpu_slab[cpu] = alloc_kmem_cache_cpu(s, cpu,
-							GFP_KERNEL);
+			init_kmem_cache_cpu(s, CPU_PTR(s->cpu_slab, cpu));
 		up_read(&slub_lock);
 		break;
 
@@ -3190,13 +3067,9 @@
 	case CPU_DEAD_FROZEN:
 		down_read(&slub_lock);
 		list_for_each_entry(s, &slab_caches, list) {
-			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
 			local_irq_save(flags);
 			__flush_cpu_slab(s, cpu);
 			local_irq_restore(flags);
-			free_kmem_cache_cpu(c, cpu);
-			s->cpu_slab[cpu] = NULL;
 		}
 		up_read(&slub_lock);
 		break;
@@ -3687,7 +3560,7 @@
 		int cpu;
 
 		for_each_possible_cpu(cpu) {
-			struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+			struct kmem_cache_cpu *c = CPU_PTR(s->cpu_slab, cpu);
 
 			if (!c || c->node < 0)
 				continue;
@@ -4092,7 +3965,7 @@
 		return -ENOMEM;
 
 	for_each_online_cpu(cpu) {
-		unsigned x = get_cpu_slab(s, cpu)->stat[si];
+		unsigned x = CPU_PTR(s->cpu_slab, cpu)->stat[si];
 
 		data[cpu] = x;
 		sum += x;

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 33/41] cpu alloc: Remove slub fields
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (31 preceding siblings ...)
  2008-05-30  3:56 ` [patch 32/41] cpu alloc: Use in slub Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 34/41] cpu alloc: Page allocator conversion Christoph Lameter
                   ` (8 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_remove_slub_fields --]
[-- Type: text/plain, Size: 4947 bytes --]

Remove the fields in kmem_cache_cpu that were used to cache data from
kmem_cache while the two structures were in different cachelines. The
cacheline that holds the per cpu array pointer now also holds these values,
so the size of kmem_cache_cpu can be cut almost in half.

The get_freepointer() and set_freepointer() functions that used to be
intended only for the slow path are now also useful for the hot path, since
accessing the field no longer requires touching an additional cacheline.
This results in consistent handling of object free pointers throughout
SLUB.
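
For reference, the free pointer handling that now serves both paths can be
modelled as in the following self-contained sketch. The kmem_cache_demo
type is a stand-in; the real get_freepointer()/set_freepointer() take a
struct kmem_cache and live in mm/slub.c.

#include <stdio.h>

/* Minimal model: the free pointer of an object is stored at a fixed
 * byte offset inside the object, recorded in the cache structure. */
struct kmem_cache_demo {
	unsigned long offset;		/* like kmem_cache->offset */
};

static void *get_freepointer(struct kmem_cache_demo *s, void *object)
{
	return *(void **)((char *)object + s->offset);
}

static void set_freepointer(struct kmem_cache_demo *s, void *object, void *fp)
{
	*(void **)((char *)object + s->offset) = fp;
}

int main(void)
{
	struct kmem_cache_demo s = { .offset = 0 };
	void *obj_a[4], *obj_b[4];	/* two fake objects, 32 bytes each */

	/* Build a two element freelist: obj_a -> obj_b -> NULL */
	set_freepointer(&s, obj_b, NULL);
	set_freepointer(&s, obj_a, obj_b);

	printf("next after obj_a is obj_b: %d\n",
			get_freepointer(&s, obj_a) == (void *)obj_b);
	return 0;
}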

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/linux/slub_def.h |    3 --
 mm/slub.c                |   48 +++++++++++++++--------------------------------
 2 files changed, 16 insertions(+), 35 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2008-05-28 00:00:27.000000000 -0700
+++ linux-2.6/include/linux/slub_def.h	2008-05-28 00:00:33.000000000 -0700
@@ -36,8 +36,6 @@
 	void **freelist;	/* Pointer to first free per cpu object */
 	struct page *page;	/* The slab from which we are allocating */
 	int node;		/* The node of the page (or -1 for debug) */
-	unsigned int offset;	/* Freepointer offset (in word units) */
-	unsigned int objsize;	/* Size of an object (from kmem_cache) */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2008-05-28 00:00:27.000000000 -0700
+++ linux-2.6/mm/slub.c	2008-05-28 00:00:33.000000000 -0700
@@ -276,13 +276,6 @@
 	return 1;
 }
 
-/*
- * Slow version of get and set free pointer.
- *
- * This version requires touching the cache lines of kmem_cache which
- * we avoid to do in the fast alloc free paths. There we obtain the offset
- * from the page struct.
- */
 static inline void *get_freepointer(struct kmem_cache *s, void *object)
 {
 	return *(void **)(object + s->offset);
@@ -1447,10 +1440,10 @@
 
 		/* Retrieve object from cpu_freelist */
 		object = c->freelist;
-		c->freelist = c->freelist[c->offset];
+		c->freelist = get_freepointer(s, c->freelist);
 
 		/* And put onto the regular freelist */
-		object[c->offset] = page->freelist;
+		set_freepointer(s, object, page->freelist);
 		page->freelist = object;
 		page->inuse--;
 	}
@@ -1555,7 +1548,7 @@
 	if (unlikely(SlabDebug(c->page)))
 		goto debug;
 
-	c->freelist = object[c->offset];
+	c->freelist = get_freepointer(s, object);
 	c->page->inuse = c->page->objects;
 	c->page->freelist = NULL;
 	c->node = page_to_nid(c->page);
@@ -1599,7 +1592,7 @@
 		goto another_slab;
 
 	c->page->inuse++;
-	c->page->freelist = object[c->offset];
+	c->page->freelist = get_freepointer(s, object);
 	c->node = -1;
 	goto unlock_out;
 }
@@ -1629,7 +1622,7 @@
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-		c->freelist = object[c->offset];
+		c->freelist = get_freepointer(s, object);
 		stat(c, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
@@ -1663,7 +1656,7 @@
  * handling required then we can return immediately.
  */
 static void __slab_free(struct kmem_cache *s, struct page *page,
-				void *x, void *addr, unsigned int offset)
+				void *x, void *addr)
 {
 	void *prior;
 	void **object = (void *)x;
@@ -1677,7 +1670,8 @@
 		goto debug;
 
 checks_ok:
-	prior = object[offset] = page->freelist;
+	prior = page->freelist;
+	set_freepointer(s, object, prior);
 	page->freelist = object;
 	page->inuse--;
 
@@ -1741,15 +1735,15 @@
 
 	local_irq_save(flags);
 	c = __THIS_CPU(s->cpu_slab);
-	debug_check_no_locks_freed(object, c->objsize);
+	debug_check_no_locks_freed(object, s->objsize);
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
 		debug_check_no_obj_freed(object, s->objsize);
 	if (likely(page == c->page && c->node >= 0)) {
-		object[c->offset] = c->freelist;
+		set_freepointer(s, object, c->freelist);
 		c->freelist = object;
 		stat(c, FREE_FASTPATH);
 	} else
-		__slab_free(s, page, x, addr, c->offset);
+		__slab_free(s, page, x, addr);
 
 	local_irq_restore(flags);
 }
@@ -1936,8 +1930,6 @@
 	c->page = NULL;
 	c->freelist = NULL;
 	c->node = 0;
-	c->offset = s->offset / sizeof(void *);
-	c->objsize = s->objsize;
 #ifdef CONFIG_SLUB_STATS
 	memset(c->stat, 0, NR_SLUB_STAT_ITEMS * sizeof(unsigned));
 #endif
@@ -2993,8 +2985,6 @@
 	down_write(&slub_lock);
 	s = find_mergeable(size, align, flags, name, ctor);
 	if (s) {
-		int cpu;
-
 		s->refcount++;
 		/*
 		 * Adjust the object sizes so that we clear
@@ -3002,13 +2992,6 @@
 		 */
 		s->objsize = max(s->objsize, (int)size);
 
-		/*
-		 * And then we need to update the object size in the
-		 * per cpu structures
-		 */
-		for_each_online_cpu(cpu)
-			CPU_PTR(s->cpu_slab, cpu)->objsize = s->objsize;
-
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
 		up_write(&slub_lock);
 

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 34/41] cpu alloc: Page allocator conversion
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (32 preceding siblings ...)
  2008-05-30  3:56 ` [patch 33/41] cpu alloc: Remove slub fields Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 35/41] Support for CPU ops Christoph Lameter
                   ` (7 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_page_allocator_conversion --]
[-- Type: text/plain, Size: 15188 bytes --]

Use the new cpu_alloc functionality to avoid per cpu arrays in struct zone.
This drastically reduces the size of struct zone for systems with a large
number of processors and allows placement of critical variables of struct
zone in one cacheline even on very large systems.

Another effect is that the pagesets of one processor are placed near one
another. If multiple pagesets from different zones fit into one cacheline
then additional cacheline fetches can be avoided on the hot paths when
allocating memory from multiple zones.

Surprisingly this clears up much of the painful NUMA bringup. Bootstrap
becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs are
reduced and we can drop the zone_pcp macro.

Hotplug handling is also simplified since hotplug already brings up a
percpu area which comes with a per cpu alloc area. So there is no need to
allocate or free individual pagesets anymore.
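
The resulting access pattern can be sketched in user space as follows.
THIS_CPU_DEMO()/CPU_PTR_DEMO() are stand-ins for the cpu_alloc accessors,
implemented here as plain array indexing; all *_demo names are invented for
the example.

#include <stdio.h>

#define NR_CPUS_DEMO 4

struct per_cpu_pages_demo { int count, high, batch; };
struct per_cpu_pageset_demo { struct per_cpu_pages_demo pcp; };

/* Stand-ins for the cpu_alloc accessors: one contiguous area with one
 * per_cpu_pageset per cpu, reached by a fixed stride from the base. */
static int current_cpu(void) { return 0; }	/* fake smp_processor_id() */
#define CPU_PTR_DEMO(base, cpu)	(&(base)[cpu])
#define THIS_CPU_DEMO(base)	CPU_PTR_DEMO(base, current_cpu())

static struct per_cpu_pageset_demo pageset_area[NR_CPUS_DEMO];

struct zone_demo {
	/* one pointer instead of NR_CPUS pointers/structs in struct zone */
	struct per_cpu_pageset_demo *pageset;
};

int main(void)
{
	struct zone_demo zone = { .pageset = pageset_area };
	struct per_cpu_pages_demo *pcp;

	/* free_hot_cold_page() style access on the local cpu */
	pcp = &THIS_CPU_DEMO(zone.pageset)->pcp;
	pcp->count++;

	/* show_free_areas() style access for a specific cpu */
	printf("cpu0 pcp count: %d\n",
			CPU_PTR_DEMO(zone.pageset, 0)->pcp.count);
	return 0;
}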

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 arch/x86/kernel/setup.c |    6 +
 include/linux/gfp.h     |    1 
 include/linux/mm.h      |    4 -
 include/linux/mmzone.h  |   12 ---
 init/main.c             |    5 +
 mm/page_alloc.c         |  172 +++++++++++++++++++-----------------------------
 mm/vmstat.c             |   14 ++-
 7 files changed, 93 insertions(+), 121 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2008-05-09 18:46:19.000000000 -0700
+++ linux-2.6/include/linux/mm.h	2008-05-29 19:13:54.000000000 -0700
@@ -1024,11 +1024,7 @@ extern void show_mem(void);
 extern void si_meminfo(struct sysinfo * val);
 extern void si_meminfo_node(struct sysinfo *val, int nid);
 
-#ifdef CONFIG_NUMA
 extern void setup_per_cpu_pageset(void);
-#else
-static inline void setup_per_cpu_pageset(void) {}
-#endif
 
 /* prio_tree.c */
 void vma_prio_tree_add(struct vm_area_struct *, struct vm_area_struct *old);
Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h	2008-05-28 11:16:24.000000000 -0700
+++ linux-2.6/include/linux/mmzone.h	2008-05-29 19:13:54.000000000 -0700
@@ -123,13 +123,7 @@ struct per_cpu_pageset {
 	s8 stat_threshold;
 	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
 #endif
-} ____cacheline_aligned_in_smp;
-
-#ifdef CONFIG_NUMA
-#define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)])
-#else
-#define zone_pcp(__z, __cpu) (&(__z)->pageset[(__cpu)])
-#endif
+};
 
 #endif /* !__GENERATING_BOUNDS.H */
 
@@ -224,10 +218,8 @@ struct zone {
 	 */
 	unsigned long		min_unmapped_pages;
 	unsigned long		min_slab_pages;
-	struct per_cpu_pageset	*pageset[NR_CPUS];
-#else
-	struct per_cpu_pageset	pageset[NR_CPUS];
 #endif
+	struct per_cpu_pageset	*pageset;
 	/*
 	 * free areas of different sizes
 	 */
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2008-05-28 11:16:24.000000000 -0700
+++ linux-2.6/mm/page_alloc.c	2008-05-29 19:39:58.000000000 -0700
@@ -923,7 +923,7 @@ static void drain_pages(unsigned int cpu
 		if (!populated_zone(zone))
 			continue;
 
-		pset = zone_pcp(zone, cpu);
+		pset = CPU_PTR(zone->pageset, cpu);
 
 		pcp = &pset->pcp;
 		local_irq_save(flags);
@@ -1006,8 +1006,8 @@ static void free_hot_cold_page(struct pa
 	arch_free_page(page, 0);
 	kernel_map_pages(page, 1, 0);
 
-	pcp = &zone_pcp(zone, get_cpu())->pcp;
 	local_irq_save(flags);
+	pcp = &THIS_CPU(zone->pageset)->pcp;
 	__count_vm_event(PGFREE);
 	if (cold)
 		list_add_tail(&page->lru, &pcp->list);
@@ -1020,7 +1020,6 @@ static void free_hot_cold_page(struct pa
 		pcp->count -= pcp->batch;
 	}
 	local_irq_restore(flags);
-	put_cpu();
 }
 
 void free_hot_page(struct page *page)
@@ -1062,16 +1061,14 @@ static struct page *buffered_rmqueue(str
 	unsigned long flags;
 	struct page *page;
 	int cold = !!(gfp_flags & __GFP_COLD);
-	int cpu;
 	int migratetype = allocflags_to_migratetype(gfp_flags);
 
 again:
-	cpu  = get_cpu();
 	if (likely(order == 0)) {
 		struct per_cpu_pages *pcp;
 
-		pcp = &zone_pcp(zone, cpu)->pcp;
 		local_irq_save(flags);
+		pcp = &THIS_CPU(zone->pageset)->pcp;
 		if (!pcp->count) {
 			pcp->count = rmqueue_bulk(zone, 0,
 					pcp->batch, &pcp->list, migratetype);
@@ -1110,7 +1107,6 @@ again:
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone);
 	local_irq_restore(flags);
-	put_cpu();
 
 	VM_BUG_ON(bad_range(zone, page));
 	if (prep_new_page(page, order, gfp_flags))
@@ -1119,7 +1115,6 @@ again:
 
 failed:
 	local_irq_restore(flags);
-	put_cpu();
 	return NULL;
 }
 
@@ -1836,7 +1831,7 @@ void show_free_areas(void)
 		for_each_online_cpu(cpu) {
 			struct per_cpu_pageset *pageset;
 
-			pageset = zone_pcp(zone, cpu);
+			pageset = CPU_PTR(zone->pageset, cpu);
 
 			printk("CPU %4d: hi:%5d, btch:%4d usd:%4d\n",
 			       cpu, pageset->pcp.high,
@@ -2670,82 +2665,77 @@ static void setup_pagelist_highmark(stru
 		pcp->batch = PAGE_SHIFT * 8;
 }
 
-
-#ifdef CONFIG_NUMA
 /*
- * Boot pageset table. One per cpu which is going to be used for all
- * zones and all nodes. The parameters will be set in such a way
- * that an item put on a list will immediately be handed over to
- * the buddy list. This is safe since pageset manipulation is done
- * with interrupts disabled.
- *
- * Some NUMA counter updates may also be caught by the boot pagesets.
+ * The boot_pageset enables bootstrapping of the page allocator
+ * before pagesets can be allocated.
  *
- * The boot_pagesets must be kept even after bootup is complete for
- * unused processors and/or zones. They do play a role for bootstrapping
- * hotplugged processors.
- *
- * zoneinfo_show() and maybe other functions do
- * not check if the processor is online before following the pageset pointer.
- * Other parts of the kernel may not check if the zone is available.
+ * The boot pageset is configured in such a way that there will be no pages
+ * permanently queued. A page is added to the list and then we reach the
+ * highwater mark and the queue is drained.
+ *
+ * All zone pageset pointers for zones not activated by process_zones() point
+ * to the boot_pageset. Only one processor may be using the pageset at a time
+ * though. So only a single processor may perform bootstrap.
+ */
+static struct per_cpu_pageset boot_pageset = {
+	{
+		.count = 0,
+		.high = 1,
+		.batch = 1,
+		.list = LIST_HEAD_INIT(boot_pageset.pcp.list)
+	}
+};
+
+/*
+ * Initialize a pageset pointer during early boot.
+ * We need to undo the effect that THIS_CPU() would have in order to
+ * have CPU_PTR() return a pointer to the boot pageset.
  */
-static struct per_cpu_pageset boot_pageset[NR_CPUS];
+static void setup_zone_boot_pageset(struct zone *zone)
+{
+	zone->pageset = SHIFT_PERCPU_PTR(&boot_pageset, -my_cpu_offset);
+}
+
+void __cpuinit setup_boot_pagesets(void)
+{
+	struct zone *zone;
+
+	for_each_zone(zone)
+		if (populated_zone(zone))
+			setup_zone_boot_pageset(zone);
+}
 
 /*
- * Dynamically allocate memory for the
- * per cpu pageset array in struct zone.
+ * Prepare the pagesets in struct zone.
  */
-static int __cpuinit process_zones(int cpu)
+static void __cpuinit process_zones(int cpu)
 {
-	struct zone *zone, *dzone;
+	struct zone *zone;
 	int node = cpu_to_node(cpu);
 
 	node_set_state(node, N_CPU);	/* this node has a cpu */
 
 	for_each_zone(zone) {
+		struct per_cpu_pageset *pcp;
 
 		if (!populated_zone(zone))
 			continue;
 
-		zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
-					 GFP_KERNEL, node);
-		if (!zone_pcp(zone, cpu))
-			goto bad;
+		if (CPU_PTR(zone->pageset, cpu) == &boot_pageset)
+			zone->pageset = CPU_ALLOC(struct per_cpu_pageset,
+						GFP_KERNEL|__GFP_ZERO);
 
-		setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
+		pcp = CPU_PTR(zone->pageset, cpu);
+		setup_pageset(pcp, zone_batchsize(zone));
 
 		if (percpu_pagelist_fraction)
-			setup_pagelist_highmark(zone_pcp(zone, cpu),
-			 	(zone->present_pages / percpu_pagelist_fraction));
-	}
-
-	return 0;
-bad:
-	for_each_zone(dzone) {
-		if (!populated_zone(dzone))
-			continue;
-		if (dzone == zone)
-			break;
-		kfree(zone_pcp(dzone, cpu));
-		zone_pcp(dzone, cpu) = NULL;
-	}
-	return -ENOMEM;
-}
-
-static inline void free_zone_pagesets(int cpu)
-{
-	struct zone *zone;
-
-	for_each_zone(zone) {
-		struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
+			setup_pagelist_highmark(pcp, zone->present_pages /
+						percpu_pagelist_fraction);
 
-		/* Free per_cpu_pageset if it is slab allocated */
-		if (pset != &boot_pageset[cpu])
-			kfree(pset);
-		zone_pcp(zone, cpu) = NULL;
 	}
 }
 
+#ifdef CONFIG_SMP
 static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
 		unsigned long action,
 		void *hcpu)
@@ -2756,14 +2746,7 @@ static int __cpuinit pageset_cpuup_callb
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
-		if (process_zones(cpu))
-			ret = NOTIFY_BAD;
-		break;
-	case CPU_UP_CANCELED:
-	case CPU_UP_CANCELED_FROZEN:
-	case CPU_DEAD:
-	case CPU_DEAD_FROZEN:
-		free_zone_pagesets(cpu);
+		process_zones(cpu);
 		break;
 	default:
 		break;
@@ -2773,21 +2756,20 @@ static int __cpuinit pageset_cpuup_callb
 
 static struct notifier_block __cpuinitdata pageset_notifier =
 	{ &pageset_cpuup_callback, NULL, 0 };
+#endif
 
 void __init setup_per_cpu_pageset(void)
 {
-	int err;
-
-	/* Initialize per_cpu_pageset for cpu 0.
+	/*
+	 * Initialize per_cpu settings for the boot cpu.
 	 * A cpuup callback will do this for every cpu
-	 * as it comes online
+	 * as it comes online.
 	 */
-	err = process_zones(smp_processor_id());
-	BUG_ON(err);
+	process_zones(smp_processor_id());
+#ifdef CONFIG_SMP
 	register_cpu_notifier(&pageset_notifier);
-}
-
 #endif
+}
 
 static noinline __init_refok
 int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
@@ -2832,25 +2814,6 @@ int zone_wait_table_init(struct zone *zo
 	return 0;
 }
 
-static __meminit void zone_pcp_init(struct zone *zone)
-{
-	int cpu;
-	unsigned long batch = zone_batchsize(zone);
-
-	for (cpu = 0; cpu < NR_CPUS; cpu++) {
-#ifdef CONFIG_NUMA
-		/* Early boot. Slab allocator not functional yet */
-		zone_pcp(zone, cpu) = &boot_pageset[cpu];
-		setup_pageset(&boot_pageset[cpu],0);
-#else
-		setup_pageset(zone_pcp(zone,cpu), batch);
-#endif
-	}
-	if (zone->present_pages)
-		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
-			zone->name, zone->present_pages, batch);
-}
-
 __meminit int init_currently_empty_zone(struct zone *zone,
 					unsigned long zone_start_pfn,
 					unsigned long size,
@@ -3420,7 +3383,12 @@ static void __paginginit free_area_init_
 
 		zone->prev_priority = DEF_PRIORITY;
 
-		zone_pcp_init(zone);
+		setup_zone_boot_pageset(zone);
+		if (zone->present_pages)
+			printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%u\n",
+				zone->name, zone->present_pages,
+				zone_batchsize(zone));
+
 		INIT_LIST_HEAD(&zone->active_list);
 		INIT_LIST_HEAD(&zone->inactive_list);
 		zone->nr_scan_active = 0;
@@ -4295,11 +4263,13 @@ int percpu_pagelist_fraction_sysctl_hand
 	ret = proc_dointvec_minmax(table, write, file, buffer, length, ppos);
 	if (!write || (ret == -EINVAL))
 		return ret;
-	for_each_zone(zone) {
-		for_each_online_cpu(cpu) {
+	for_each_online_cpu(cpu) {
+		for_each_zone(zone) {
 			unsigned long  high;
+
 			high = zone->present_pages / percpu_pagelist_fraction;
-			setup_pagelist_highmark(zone_pcp(zone, cpu), high);
+			setup_pagelist_highmark(CPU_PTR(zone->pageset, cpu),
+									high);
 		}
 	}
 	return 0;
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2008-05-29 19:13:53.000000000 -0700
+++ linux-2.6/mm/vmstat.c	2008-05-29 19:13:54.000000000 -0700
@@ -142,7 +142,8 @@ static void refresh_zone_stat_thresholds
 		threshold = calculate_threshold(zone);
 
 		for_each_online_cpu(cpu)
-			zone_pcp(zone, cpu)->stat_threshold = threshold;
+			CPU_PTR(zone->pageset, cpu)->stat_threshold
+							= threshold;
 	}
 }
 
@@ -152,7 +153,8 @@ static void refresh_zone_stat_thresholds
 void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
 				int delta)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = THIS_CPU(zone->pageset);
+
 	s8 *p = pcp->vm_stat_diff + item;
 	long x;
 
@@ -205,7 +207,7 @@ EXPORT_SYMBOL(mod_zone_page_state);
  */
 void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = THIS_CPU(zone->pageset);
 	s8 *p = pcp->vm_stat_diff + item;
 
 	(*p)++;
@@ -226,7 +228,7 @@ EXPORT_SYMBOL(__inc_zone_page_state);
 
 void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
-	struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+	struct per_cpu_pageset *pcp = THIS_CPU(zone->pageset);
 	s8 *p = pcp->vm_stat_diff + item;
 
 	(*p)--;
@@ -306,7 +308,7 @@ void refresh_cpu_vm_stats(int cpu)
 		if (!populated_zone(zone))
 			continue;
 
-		p = zone_pcp(zone, cpu);
+		p = CPU_PTR(zone->pageset, cpu);
 
 		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
 			if (p->vm_stat_diff[i]) {
@@ -698,7 +700,7 @@ static void zoneinfo_show_print(struct s
 	for_each_online_cpu(i) {
 		struct per_cpu_pageset *pageset;
 
-		pageset = zone_pcp(zone, i);
+		pageset = CPU_PTR(zone->pageset, i);
 		seq_printf(m,
 			   "\n    cpu: %i"
 			   "\n              count: %i"
Index: linux-2.6/arch/x86/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup.c	2008-05-29 19:13:53.000000000 -0700
+++ linux-2.6/arch/x86/kernel/setup.c	2008-05-29 19:40:08.000000000 -0700
@@ -125,6 +125,12 @@ void __init setup_per_cpu_areas(void)
 		highest_cpu = i;
 	}
 
+	/*
+	 * The per_cpu offsets have changed and therefore the pageset
+	 * pointers need to be updated.
+	 */
+	setup_boot_pagesets();
+
 	nr_cpu_ids = highest_cpu + 1;
 	printk(KERN_DEBUG "NR_CPUS: %d, nr_cpu_ids: %d\n", NR_CPUS, nr_cpu_ids);
 
Index: linux-2.6/include/linux/gfp.h
===================================================================
--- linux-2.6.orig/include/linux/gfp.h	2008-04-29 12:13:29.000000000 -0700
+++ linux-2.6/include/linux/gfp.h	2008-05-29 19:13:54.000000000 -0700
@@ -233,5 +233,6 @@ void page_alloc_init(void);
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
 void drain_all_pages(void);
 void drain_local_pages(void *dummy);
+void setup_boot_pagesets(void);
 
 #endif /* __LINUX_GFP_H */
Index: linux-2.6/init/main.c
===================================================================
--- linux-2.6.orig/init/main.c	2008-05-29 19:13:53.000000000 -0700
+++ linux-2.6/init/main.c	2008-05-29 19:13:54.000000000 -0700
@@ -405,6 +405,11 @@ static void __init setup_per_cpu_areas(v
 		memcpy(ptr, __per_cpu_start, __per_cpu_size);
 		ptr += __per_cpu_size;
 	}
+	/*
+	 * __per_cpu_offset[] have changed. Need to update the
+	 * pointers to the boot page set.
+	 */
+	setup_boot_pagesets();
 }
 #endif /* CONFIG_HAVE_SETUP_PER_CPU_AREA */
 

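Not part of the patch: a small standalone C sketch of the pointer trick used by
setup_zone_boot_pageset() above. SHIFT_PTR(), THIS_CPU() and my_cpu_offset are
simplified stand-ins for SHIFT_PERCPU_PTR(), THIS_CPU() and the boot cpu's real
per cpu offset; the point is only that pre-shifting by -my_cpu_offset makes the
later THIS_CPU() lookup land back on &boot_pageset.

/*
 * Standalone sketch, not kernel code.
 */
#include <stdio.h>

static int boot_pageset;		/* stands in for the real structure */
static long my_cpu_offset = 0x1000;	/* whatever the boot cpu's offset is */

#define SHIFT_PTR(p, off)	((void *)((char *)(p) + (off)))
#define THIS_CPU(p)		SHIFT_PTR((p), my_cpu_offset)

int main(void)
{
	/* what setup_zone_boot_pageset() stores in zone->pageset */
	void *pageset = SHIFT_PTR(&boot_pageset, -my_cpu_offset);

	/* THIS_CPU() adds the offset back: we land on &boot_pageset again */
	printf("%d\n", THIS_CPU(pageset) == (void *)&boot_pageset);
	return 0;
}
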
-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 35/41] Support for CPU ops
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (33 preceding siblings ...)
  2008-05-30  3:56 ` [patch 34/41] cpu alloc: Page allocator conversion Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  4:58   ` Andrew Morton
  2008-05-30  3:56 ` [patch 36/41] Zero based percpu: Infrastructure to rebase the per cpu area to zero Christoph Lameter
                   ` (6 subsequent siblings)
  41 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, Tony.Luck, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: ia64_cpu_ops --]
[-- Type: text/plain, Size: 10388 bytes --]

IA64 has no efficient atomic operations. But we can get rid of the need to
add my_percpu_offset(): the address of a per cpu variable can be used directly
on IA64 since it is mapped to a per processor area.

This also allows us to kill off the __ia64_get_cpu_var() macro. It is nothing
but per_cpu_var().

Cc: Tony.Luck@intel.com
Signed-off-by: Christoph Lameter <clameter@sgi.com>
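
As an aside, a standalone sketch (not from this patch) of how the three groups
of ops below are meant to layer: __CPU_* assumes the caller already prevents
rescheduling, _CPU_* disables preemption itself, and CPU_* is additionally
interrupt safe. The preempt/irq macros here are userspace stand-ins for the
kernel primitives.

/*
 * Standalone illustration, not kernel code.
 */
#include <stdio.h>

#define preempt_disable()		do { } while (0)	/* stand-in */
#define preempt_enable()		do { } while (0)	/* stand-in */
#define local_irq_save(flags)		do { (flags) = 0; } while (0)
#define local_irq_restore(flags)	do { (void)(flags); } while (0)

#define __CPU_ADD(var, value)	((var) += (value))	/* caller serializes */

#define _CPU_ADD(var, value)		\
do {					\
	preempt_disable();		\
	__CPU_ADD((var), (value));	\
	preempt_enable();		\
} while (0)

#define CPU_ADD(var, value)		\
do {					\
	unsigned long flags;		\
	local_irq_save(flags);		\
	__CPU_ADD((var), (value));	\
	local_irq_restore(flags);	\
} while (0)

int main(void)
{
	long nr_events = 0;		/* stands in for a per cpu counter */

	__CPU_ADD(nr_events, 1);	/* caller already holds off preemption */
	_CPU_ADD(nr_events, 2);		/* preempt safe */
	CPU_ADD(nr_events, 3);		/* also interrupt safe */

	printf("nr_events = %ld\n", nr_events);	/* prints 6 */
	return 0;
}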

---
 arch/ia64/Kconfig              |    3 
 arch/ia64/kernel/perfmon.c     |    2 
 arch/ia64/kernel/setup.c       |    2 
 arch/ia64/kernel/smp.c         |    4 -
 arch/ia64/sn/kernel/setup.c    |    4 -
 include/asm-ia64/mmu_context.h |    6 -
 include/asm-ia64/percpu.h      |  133 ++++++++++++++++++++++++++++++++++++++---
 include/asm-ia64/processor.h   |    2 
 include/asm-ia64/sn/pda.h      |    2 
 9 files changed, 138 insertions(+), 20 deletions(-)

Index: linux-2.6/include/asm-ia64/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/percpu.h	2008-05-29 19:35:10.000000000 -0700
+++ linux-2.6/include/asm-ia64/percpu.h	2008-05-29 19:35:11.000000000 -0700
@@ -19,7 +19,7 @@
 # define PER_CPU_ATTRIBUTES	__attribute__((__model__ (__small__)))
 #endif
 
-#define __my_cpu_offset	__ia64_per_cpu_var(local_per_cpu_offset)
+#define __my_cpu_offset	CPU_READ(per_cpu_var(local_per_cpu_offset))
 
 extern void *per_cpu_init(void);
 
@@ -31,14 +31,6 @@ extern void *per_cpu_init(void);
 
 #endif	/* SMP */
 
-/*
- * Be extremely careful when taking the address of this variable!  Due to virtual
- * remapping, it is different from the canonical address returned by __get_cpu_var(var)!
- * On the positive side, using __ia64_per_cpu_var() instead of __get_cpu_var() is slightly
- * more efficient.
- */
-#define __ia64_per_cpu_var(var)	per_cpu__##var
-
 #include <asm-generic/percpu.h>
 
 /* Equal to __per_cpu_offset[smp_processor_id()], but faster to access: */
@@ -46,4 +38,127 @@ DECLARE_PER_CPU(unsigned long, local_per
 
 #endif /* !__ASSEMBLY__ */
 
+/*
+ * Per cpu ops.
+ *
+ * IA64 has no instructions that would allow light weight RMW operations.
+ *
+ * However, the canonical address of a per cpu variable is mapped via
+ * a processor specific TLB entry to the per cpu area of the respective
+ * processor. The THIS_CPU() macro is therefore not necessary here
+ * since the canonical address of the per cpu variable allows access
+ * to the instance of the per cpu variable for the current processor.
+ *
+ * Sadly we cannot simply define THIS_CPU() to return an address in
+ * the per processor mapping space since the address acquired by THIS_CPU()
+ * may be passed to another processor.
+ */
+#define __CPU_READ(var)				\
+({						\
+	(var);					\
+})
+
+#define __CPU_WRITE(var, value)			\
+({						\
+	(var) = (value);			\
+})
+
+#define __CPU_ADD(var, value)			\
+({						\
+	(var) += (value);			\
+})
+
+#define __CPU_INC(var) __CPU_ADD((var), 1)
+#define __CPU_DEC(var) __CPU_ADD((var), -1)
+#define __CPU_SUB(var, value) __CPU_ADD((var), -(value))
+
+#define __CPU_CMPXCHG(var, old, new)		\
+({						\
+	typeof(var) x;				\
+	typeof(var) *p = &(var);		\
+	x = *p;					\
+	if (x == (old))				\
+		*p = (new);			\
+	(x);					\
+})
+
+#define __CPU_XCHG(obj, new)			\
+({						\
+	typeof(obj) x;				\
+	typeof(obj) *p = &(obj);		\
+	x = *p;					\
+	*p = (new);				\
+	(x);					\
+})
+
+#define _CPU_READ __CPU_READ
+#define _CPU_WRITE __CPU_WRITE
+
+#define _CPU_ADD(var, value)			\
+({						\
+	preempt_disable();			\
+	__CPU_ADD((var), (value));		\
+	preempt_enable();			\
+})
+
+#define _CPU_INC(var) _CPU_ADD((var), 1)
+#define _CPU_DEC(var) _CPU_ADD((var), -1)
+#define _CPU_SUB(var, value) _CPU_ADD((var), -(value))
+
+#define _CPU_CMPXCHG(var, old, new)		\
+({						\
+	typeof(var) x;				\
+	preempt_disable();			\
+	x = __CPU_CMPXCHG((var), (old), (new));	\
+	preempt_enable();			\
+	(x);					\
+})
+
+#define _CPU_XCHG(var, new)			\
+({						\
+	typeof(var) x;				\
+	preempt_disable();			\
+	x = __CPU_XCHG((var), (new));		\
+	preempt_enable();			\
+	(x);					\
+})
+
+/*
+ * Third group: Interrupt safe CPU functions
+ */
+#define CPU_READ __CPU_READ
+#define CPU_WRITE __CPU_WRITE
+
+#define CPU_ADD(var, value)			\
+({						\
+	unsigned long flags;			\
+	local_irq_save(flags);			\
+	__CPU_ADD((var), (value));		\
+	local_irq_restore(flags);		\
+})
+
+#define CPU_INC(var) CPU_ADD((var), 1)
+#define CPU_DEC(var) CPU_ADD((var), -1)
+#define CPU_SUB(var, value) CPU_ADD((var), -(value))
+
+#define CPU_CMPXCHG(var, old, new)		\
+({						\
+	unsigned long flags;			\
+	typeof(var) x;				\
+	local_irq_save(flags);			\
+	x = __CPU_CMPXCHG((var), (old), (new));	\
+	local_irq_restore(flags);		\
+	(x);					\
+})
+
+#define CPU_XCHG(var, new)			\
+({						\
+	unsigned long flags;			\
+	typeof(var) x;				\
+	local_irq_save(flags);			\
+	x = __CPU_XCHG((var), (new));		\
+	local_irq_restore(flags);		\
+	(x);					\
+})
+
 #endif /* _ASM_IA64_PERCPU_H */
Index: linux-2.6/arch/ia64/kernel/perfmon.c
===================================================================
--- linux-2.6.orig/arch/ia64/kernel/perfmon.c	2008-05-29 19:35:09.000000000 -0700
+++ linux-2.6/arch/ia64/kernel/perfmon.c	2008-05-29 19:35:13.000000000 -0700
@@ -576,7 +576,7 @@ static struct ctl_table_header *pfm_sysc
 
 static int pfm_context_unload(pfm_context_t *ctx, void *arg, int count, struct pt_regs *regs);
 
-#define pfm_get_cpu_var(v)		__ia64_per_cpu_var(v)
+#define pfm_get_cpu_var(v)		per_cpu_var(v)
 #define pfm_get_cpu_data(a,b)		per_cpu(a, b)
 
 static inline void
Index: linux-2.6/arch/ia64/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/ia64/kernel/setup.c	2008-05-29 19:35:09.000000000 -0700
+++ linux-2.6/arch/ia64/kernel/setup.c	2008-05-29 19:35:11.000000000 -0700
@@ -925,7 +925,7 @@ cpu_init (void)
 	 * depends on the data returned by identify_cpu().  We break the dependency by
 	 * accessing cpu_data() through the canonical per-CPU address.
 	 */
-	cpu_info = cpu_data + ((char *) &__ia64_per_cpu_var(cpu_info) - __per_cpu_start);
+	cpu_info = cpu_data + ((char *)&per_cpu_var(cpu_info) - __per_cpu_start);
 	identify_cpu(cpu_info);
 
 #ifdef CONFIG_MCKINLEY
Index: linux-2.6/arch/ia64/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/ia64/kernel/smp.c	2008-05-29 19:35:09.000000000 -0700
+++ linux-2.6/arch/ia64/kernel/smp.c	2008-05-29 19:35:11.000000000 -0700
@@ -150,7 +150,7 @@ irqreturn_t
 handle_IPI (int irq, void *dev_id)
 {
 	int this_cpu = get_cpu();
-	unsigned long *pending_ipis = &__ia64_per_cpu_var(ipi_operation);
+	unsigned long *pending_ipis = &per_cpu_var(ipi_operation);
 	unsigned long ops;
 
 	mb();	/* Order interrupt and bit testing. */
@@ -303,7 +303,7 @@ smp_local_flush_tlb(void)
 void
 smp_flush_tlb_cpumask(cpumask_t xcpumask)
 {
-	unsigned int *counts = __ia64_per_cpu_var(shadow_flush_counts);
+	unsigned int *counts = per_cpu_var(shadow_flush_counts);
 	cpumask_t cpumask = xcpumask;
 	int mycpu, cpu, flush_mycpu = 0;
 
Index: linux-2.6/arch/ia64/sn/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/ia64/sn/kernel/setup.c	2008-05-29 19:35:09.000000000 -0700
+++ linux-2.6/arch/ia64/sn/kernel/setup.c	2008-05-29 19:35:11.000000000 -0700
@@ -645,7 +645,7 @@ void __cpuinit sn_cpu_init(void)
 		/* copy cpu 0's sn_cnodeid_to_nasid table to this cpu's */
 		memcpy(sn_cnodeid_to_nasid,
 		       (&per_cpu(__sn_cnodeid_to_nasid, 0)),
-		       sizeof(__ia64_per_cpu_var(__sn_cnodeid_to_nasid)));
+		       sizeof(per_cpu_var(__sn_cnodeid_to_nasid)));
 	}
 
 	/*
@@ -706,7 +706,7 @@ void __init build_cnode_tables(void)
 
 	memset(physical_node_map, -1, sizeof(physical_node_map));
 	memset(sn_cnodeid_to_nasid, -1,
-			sizeof(__ia64_per_cpu_var(__sn_cnodeid_to_nasid)));
+			sizeof(per_cpu_var(__sn_cnodeid_to_nasid)));
 
 	/*
 	 * First populate the tables with C/M bricks. This ensures that
Index: linux-2.6/include/asm-ia64/mmu_context.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/mmu_context.h	2008-05-29 19:35:10.000000000 -0700
+++ linux-2.6/include/asm-ia64/mmu_context.h	2008-05-29 19:35:13.000000000 -0700
@@ -64,11 +64,11 @@ delayed_tlb_flush (void)
 	extern void local_flush_tlb_all (void);
 	unsigned long flags;
 
-	if (unlikely(__ia64_per_cpu_var(ia64_need_tlb_flush))) {
+	if (unlikely(CPU_READ(per_cpu_var(ia64_need_tlb_flush)))) {
 		spin_lock_irqsave(&ia64_ctx.lock, flags);
-		if (__ia64_per_cpu_var(ia64_need_tlb_flush)) {
+		if (CPU_READ(per_cpu_var(ia64_need_tlb_flush))) {
 			local_flush_tlb_all();
-			__ia64_per_cpu_var(ia64_need_tlb_flush) = 0;
+			CPU_WRITE(per_cpu_var(ia64_need_tlb_flush), 0);
 		}
 		spin_unlock_irqrestore(&ia64_ctx.lock, flags);
 	}
Index: linux-2.6/include/asm-ia64/processor.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/processor.h	2008-05-29 19:35:09.000000000 -0700
+++ linux-2.6/include/asm-ia64/processor.h	2008-05-29 19:35:11.000000000 -0700
@@ -237,7 +237,7 @@ DECLARE_PER_CPU(struct cpuinfo_ia64, cpu
  * Do not use the address of local_cpu_data, since it will be different from
  * cpu_data(smp_processor_id())!
  */
-#define local_cpu_data		(&__ia64_per_cpu_var(cpu_info))
+#define local_cpu_data		(&per_cpu_var(cpu_info))
 #define cpu_data(cpu)		(&per_cpu(cpu_info, cpu))
 
 extern void print_cpu_info (struct cpuinfo_ia64 *);
Index: linux-2.6/include/asm-ia64/sn/pda.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/sn/pda.h	2008-05-29 19:35:10.000000000 -0700
+++ linux-2.6/include/asm-ia64/sn/pda.h	2008-05-29 19:35:11.000000000 -0700
@@ -62,7 +62,7 @@ typedef struct pda_s {
  */
 DECLARE_PER_CPU(struct pda_s, pda_percpu);
 
-#define pda		(&__ia64_per_cpu_var(pda_percpu))
+#define pda		(&per_cpu_var(pda_percpu))
 
 #define pdacpu(cpu)	(&per_cpu(pda_percpu, cpu))
 
Index: linux-2.6/arch/ia64/Kconfig
===================================================================
--- linux-2.6.orig/arch/ia64/Kconfig	2008-05-29 19:35:09.000000000 -0700
+++ linux-2.6/arch/ia64/Kconfig	2008-05-29 19:35:11.000000000 -0700
@@ -92,6 +92,9 @@ config GENERIC_TIME_VSYSCALL
 config HAVE_SETUP_PER_CPU_AREA
 	def_bool y
 
+config HAVE_CPU_OPS
+	def_bool y
+
 config DMI
 	bool
 	default y

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 36/41] Zero based percpu: Infrastructure to rebase the per cpu area to zero
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (34 preceding siblings ...)
  2008-05-30  3:56 ` [patch 35/41] Support for CPU ops Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 37/41] x86_64: Fold pda into per cpu area Christoph Lameter
                   ` (5 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, Mike Travis, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Rusty Russell

[-- Attachment #1: zero_based_infrastructure --]
[-- Type: text/plain, Size: 6216 bytes --]

    * Support an option

	CONFIG_HAVE_ZERO_BASED_PER_CPU

      to make offsets for per cpu variables start at zero.

      If a percpu area starts at zero then:

	-  We do not need RELOC_HIDE anymore

	-  Provides for the future capability of architectures providing
	   a per cpu allocator that returns offsets instead of pointers.
	   The offsets would be independent of the processor so that
	   address calculations can be done in a processor independent way.
	   Per cpu instructions can then add the processor specific offset
	   at the last minute, possibly in an atomic instruction.

      The data the linker provides is different for zero based percpu segments:

	__per_cpu_load	-> The address at which the percpu area was loaded
	__per_cpu_size	-> The length of the per cpu area

    * Removes the &__per_cpu_x in lockdep. The __per_cpu_x are already
      pointers. There is no need to take the address.

    * Updates kernel/module.c to be able to deal with a percpu area that
      is loaded at __per_cpu_load but is accessed at __per_cpu_start.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
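
For illustration, a userspace model (not part of the patch) of what zero based
percpu buys: the "address" of a per cpu variable is just its offset into the
section, so SHIFT_PERCPU_PTR() becomes plain base-plus-offset arithmetic.
__per_cpu_load, __per_cpu_size and __per_cpu_offset[] below are stand-ins for
the real linker and kernel symbols.

/*
 * Userspace model, not kernel code.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NR_CPUS 4

static char __per_cpu_load[64];			/* initial section image */
#define __per_cpu_size sizeof(__per_cpu_load)
static unsigned long __per_cpu_offset[NR_CPUS];	/* one base per cpu */

/* zero based variant: just add the cpu's base to the (small) offset __p */
#define SHIFT_PERCPU_PTR(__p, __offset) \
	((void *)((char *)(__p) + (__offset)))

int main(void)
{
	size_t counter = 16;	/* offset of a per cpu variable in the section */
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		char *area = malloc(__per_cpu_size);

		memcpy(area, __per_cpu_load, __per_cpu_size);
		__per_cpu_offset[cpu] = (unsigned long)area;
	}

	/* per_cpu(counter, 2): add cpu 2's base to the variable's offset */
	*(int *)SHIFT_PERCPU_PTR(counter, __per_cpu_offset[2]) = 42;
	printf("%d\n", *(int *)(__per_cpu_offset[2] + counter));
	return 0;
}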
---
 include/asm-generic/percpu.h      |    9 ++++++++-
 include/asm-generic/sections.h    |   10 ++++++++++
 include/asm-generic/vmlinux.lds.h |   16 ++++++++++++++++
 include/linux/percpu.h            |    8 ++++++++
 kernel/lockdep.c                  |    4 ++--
 5 files changed, 44 insertions(+), 3 deletions(-)

Index: linux-2.6/include/asm-generic/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-generic/percpu.h	2008-05-29 17:57:39.788714222 -0700
+++ linux-2.6/include/asm-generic/percpu.h	2008-05-29 18:03:12.432714383 -0700
@@ -45,7 +45,12 @@ extern unsigned long __per_cpu_offset[NR
  * Only S390 provides its own means of moving the pointer.
  */
 #ifndef SHIFT_PERCPU_PTR
-#define SHIFT_PERCPU_PTR(__p, __offset)	RELOC_HIDE((__p), (__offset))
+# ifdef CONFIG_HAVE_ZERO_BASED_PER_CPU
+#  define SHIFT_PERCPU_PTR(__p, __offset) \
+	((__typeof(__p))(((void *)(__p)) + (__offset)))
+# else
+#  define SHIFT_PERCPU_PTR(__p, __offset)	RELOC_HIDE((__p), (__offset))
+# endif /* CONFIG_HAVE_ZERO_BASED_PER_CPU */
 #endif
 
 /*
@@ -70,6 +75,8 @@ extern void setup_per_cpu_areas(void);
 #define per_cpu(var, cpu)			(*((void)(cpu), &per_cpu_var(var)))
 #define __get_cpu_var(var)			per_cpu_var(var)
 #define __raw_get_cpu_var(var)			per_cpu_var(var)
+#define SHIFT_PERCPU_PTR(__p, __offset)		(__p)
+#define per_cpu_offset(x)			0L
 
 #endif	/* SMP */
 
Index: linux-2.6/include/asm-generic/sections.h
===================================================================
--- linux-2.6.orig/include/asm-generic/sections.h	2008-05-29 17:57:39.792714337 -0700
+++ linux-2.6/include/asm-generic/sections.h	2008-05-29 18:03:12.432714383 -0700
@@ -9,7 +9,17 @@ extern char __bss_start[], __bss_stop[];
 extern char __init_begin[], __init_end[];
 extern char _sinittext[], _einittext[];
 extern char _end[];
+#ifdef CONFIG_HAVE_ZERO_BASED_PER_CPU
+extern char __per_cpu_load[];
+extern char ____per_cpu_size[];
+#define __per_cpu_size ((unsigned long)&____per_cpu_size)
+#define __per_cpu_start ((char *)0)
+#define __per_cpu_end ((char *)__per_cpu_size)
+#else
 extern char __per_cpu_start[], __per_cpu_end[];
+#define __per_cpu_load __per_cpu_start
+#define __per_cpu_size (__per_cpu_end - __per_cpu_start)
+#endif
 extern char __kprobes_text_start[], __kprobes_text_end[];
 extern char __initdata_begin[], __initdata_end[];
 extern char __start_rodata[], __end_rodata[];
Index: linux-2.6/include/asm-generic/vmlinux.lds.h
===================================================================
--- linux-2.6.orig/include/asm-generic/vmlinux.lds.h	2008-05-29 17:57:39.800714018 -0700
+++ linux-2.6/include/asm-generic/vmlinux.lds.h	2008-05-29 18:03:12.432714383 -0700
@@ -344,6 +344,21 @@
   	*(.initcall7.init)						\
   	*(.initcall7s.init)
 
+#ifdef CONFIG_HAVE_ZERO_BASED_PER_CPU
+#define PERCPU(align)							\
+	. = ALIGN(align);						\
+	percpu : { } :percpu						\
+	__per_cpu_load = .;						\
+	.data.percpu 0 : AT(__per_cpu_load - LOAD_OFFSET) {		\
+		*(.data.percpu.first)					\
+		*(.data.percpu.shared_aligned)				\
+		*(.data.percpu)						\
+		*(.data.percpu.page_aligned)				\
+		____per_cpu_size = .;					\
+	}								\
+	. = __per_cpu_load + ____per_cpu_size;				\
+	data : { } :data
+#else
 #define PERCPU(align)							\
 	. = ALIGN(align);						\
 	__per_cpu_start = .;						\
@@ -352,3 +367,4 @@
 		*(.data.percpu.shared_aligned)				\
 	}								\
 	__per_cpu_end = .;
+#endif
Index: linux-2.6/kernel/lockdep.c
===================================================================
--- linux-2.6.orig/kernel/lockdep.c	2008-05-29 17:59:22.697422432 -0700
+++ linux-2.6/kernel/lockdep.c	2008-05-29 18:03:33.013702733 -0700
@@ -609,8 +609,8 @@ static int static_obj(void *obj)
 	 * percpu var?
 	 */
 	for_each_possible_cpu(i) {
-		start = (unsigned long) &__per_cpu_start + per_cpu_offset(i);
-		end   = (unsigned long) &__per_cpu_start + __per_cpu_size
+		start = (unsigned long) __per_cpu_start + per_cpu_offset(i);
+		end   = (unsigned long) __per_cpu_start + __per_cpu_size
 					+ per_cpu_offset(i);
 
 		if ((addr >= start) && (addr < end))
Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2008-05-29 18:01:22.260714623 -0700
+++ linux-2.6/include/linux/percpu.h	2008-05-29 18:03:12.436714003 -0700
@@ -23,12 +23,20 @@
 	__attribute__((__section__(SHARED_ALIGNED_SECTION)))		\
 	PER_CPU_ATTRIBUTES __typeof__(type) per_cpu__##name		\
 	____cacheline_aligned_in_smp
+
+#define DEFINE_PER_CPU_FIRST(type, name)				\
+	__attribute__((__section__(".data.percpu.first")))		\
+	PER_CPU_ATTRIBUTES __typeof__(type) per_cpu__##name
+
 #else
 #define DEFINE_PER_CPU(type, name)					\
 	PER_CPU_ATTRIBUTES __typeof__(type) per_cpu__##name
 
 #define DEFINE_PER_CPU_SHARED_ALIGNED(type, name)		      \
 	DEFINE_PER_CPU(type, name)
+
+#define DEFINE_PER_CPU_FIRST(type, name)			      \
+	DEFINE_PER_CPU(type, name)
 #endif
 
 #define EXPORT_PER_CPU_SYMBOL(var) EXPORT_SYMBOL(per_cpu__##var)

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 37/41] x86_64: Fold pda into per cpu area
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (35 preceding siblings ...)
  2008-05-30  3:56 ` [patch 36/41] Zero based percpu: Infrastructure to rebase the per cpu area to zero Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 38/41] x86: Extend percpu ops to 64 bit Christoph Lameter
                   ` (4 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, Mike Travis, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Rusty Russell

[-- Attachment #1: zero_based_fold --]
[-- Type: text/plain, Size: 5788 bytes --]

  * Declare the pda as a per cpu variable.
  * Make the x86_64 per cpu area start at zero.

  * Since %gs is pointing to the pda, it will then also point to the per cpu
    variables, which can then be accessed as:

	%gs:[&per_cpu_xxxx - __per_cpu_start]

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
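
A standalone sketch (not from the patch) of what %gs:[&per_cpu_xxxx -
__per_cpu_start] amounts to once the pda sits first in a zero based per cpu
area: a segment base plus the variable's offset within the section. gs_base,
struct percpu_area and my_cpu_var() are illustrative stand-ins, not kernel
symbols.

/*
 * Standalone sketch, not kernel code (uses the GNU typeof extension).
 */
#include <stdio.h>
#include <stddef.h>

struct x8664_pda {			/* reduced stand-in for the real pda */
	unsigned long data_offset;
	unsigned int __nmi_count;
};

struct percpu_area {			/* pda first, other per cpu data after */
	struct x8664_pda pda;
	long some_counter;
};

static struct percpu_area cpu0_area;
static char *gs_base = (char *)&cpu0_area;	/* what %gs would point at */

/* gs base + offset of the variable inside the per cpu section */
#define my_cpu_var(field)						\
	(*(typeof(((struct percpu_area *)0)->field) *)			\
		(gs_base + offsetof(struct percpu_area, field)))

int main(void)
{
	my_cpu_var(pda.__nmi_count) = 3;	/* pda access ... */
	my_cpu_var(some_counter) = 7;		/* ... and percpu access alike */

	printf("%u %ld\n", cpu0_area.pda.__nmi_count, cpu0_area.some_counter);
	return 0;
}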
---
 arch/x86/Kconfig                 |    3 +++
 arch/x86/kernel/setup.c          |   22 ++++++++++++++++++++--
 arch/x86/kernel/smpboot.c        |   16 ----------------
 arch/x86/kernel/vmlinux_64.lds.S |    1 +
 include/asm-x86/percpu.h         |   19 +++++++++----------
 5 files changed, 33 insertions(+), 28 deletions(-)

Index: linux-2.6/arch/x86/Kconfig
===================================================================
--- linux-2.6.orig/arch/x86/Kconfig	2008-05-29 17:57:39.588714025 -0700
+++ linux-2.6/arch/x86/Kconfig	2008-05-29 18:16:38.743452832 -0700
@@ -126,6 +126,9 @@ config HAVE_SETUP_PER_CPU_AREA
 config HAVE_CPUMASK_OF_CPU_MAP
 	def_bool X86_64_SMP
 
+config HAVE_ZERO_BASED_PER_CPU
+	def_bool X86_64 && SMP
+
 config ARCH_HIBERNATION_POSSIBLE
 	def_bool y
 	depends on !SMP || !X86_VOYAGER
Index: linux-2.6/arch/x86/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup.c	2008-05-29 18:02:56.229889675 -0700
+++ linux-2.6/arch/x86/kernel/setup.c	2008-05-29 18:17:35.835953108 -0700
@@ -26,6 +26,11 @@ EXPORT_PER_CPU_SYMBOL(x86_cpu_to_apicid)
 physid_mask_t phys_cpu_present_map;
 #endif
 
+#ifdef CONFIG_X86_64
+DEFINE_PER_CPU_FIRST(struct x8664_pda, pda);
+EXPORT_PER_CPU_SYMBOL(pda);
+#endif
+
 #if defined(CONFIG_HAVE_SETUP_PER_CPU_AREA) && defined(CONFIG_X86_SMP)
 /*
  * Copy data used in early init routines from the initial arrays to the
@@ -115,13 +120,20 @@ void __init setup_per_cpu_areas(void)
 #endif
 		if (!ptr)
 			panic("Cannot allocate cpu data for CPU %d\n", i);
+
+		memcpy(ptr, __per_cpu_load, __per_cpu_size);
+
 #ifdef CONFIG_X86_64
+		/*
+		 * So far an embryonic per cpu area was used containing only
+		 * the pda. Move the pda contents into the full per cpu area.
+		  */
 		cpu_pda(i)->data_offset = ptr - __per_cpu_start;
+		memcpy(ptr, cpu_pda(i), sizeof(struct x8664_pda));
+		cpu_pda(i) = (struct x8664_pda *)ptr;
 #else
 		__per_cpu_offset[i] = ptr - __per_cpu_start;
 #endif
-		memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
-
 		highest_cpu = i;
 	}
 
@@ -132,6 +144,12 @@ void __init setup_per_cpu_areas(void)
 	setup_boot_pagesets();
 
 	nr_cpu_ids = highest_cpu + 1;
+
+#ifdef CONFIG_X86_64
+	/* Fix up pda for boot processor */
+	pda_init(0);
+#endif
+
 	printk(KERN_DEBUG "NR_CPUS: %d, nr_cpu_ids: %d\n", NR_CPUS, nr_cpu_ids);
 
 	/* Setup percpu data maps */
Index: linux-2.6/arch/x86/kernel/vmlinux_64.lds.S
===================================================================
--- linux-2.6.orig/arch/x86/kernel/vmlinux_64.lds.S	2008-05-29 17:57:39.600964822 -0700
+++ linux-2.6/arch/x86/kernel/vmlinux_64.lds.S	2008-05-29 18:05:08.514214613 -0700
@@ -16,6 +16,7 @@ jiffies_64 = jiffies;
 _proxy_pda = 1;
 PHDRS {
 	text PT_LOAD FLAGS(5);	/* R_E */
+	percpu PT_LOAD FLAGS(4);	/* R__ */
 	data PT_LOAD FLAGS(7);	/* RWE */
 	user PT_LOAD FLAGS(7);	/* RWE */
 	data.init PT_LOAD FLAGS(7);	/* RWE */
Index: linux-2.6/include/asm-x86/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-x86/percpu.h	2008-05-29 17:57:39.616964037 -0700
+++ linux-2.6/include/asm-x86/percpu.h	2008-05-29 18:17:20.419452945 -0700
@@ -3,21 +3,16 @@
 
 #ifdef CONFIG_X86_64
 #include <linux/compiler.h>
-
-/* Same as asm-generic/percpu.h, except that we store the per cpu offset
-   in the PDA. Longer term the PDA and every per cpu variable
-   should be just put into a single section and referenced directly
-   from %gs */
-
-#ifdef CONFIG_SMP
 #include <asm/pda.h>
 
+#ifdef CONFIG_SMP
 #define __per_cpu_offset(cpu) (cpu_pda(cpu)->data_offset)
 #define __my_cpu_offset read_pda(data_offset)
-
 #define per_cpu_offset(x) (__per_cpu_offset(x))
-
 #endif
+
+#define __percpu_seg "%%gs:"
+
 #include <asm-generic/percpu.h>
 
 DECLARE_PER_CPU(struct x8664_pda, pda);
@@ -81,6 +76,11 @@ DECLARE_PER_CPU(struct x8664_pda, pda);
 /* We can use this directly for local CPU (faster). */
 DECLARE_PER_CPU(unsigned long, this_cpu_off);
 
+#endif /* __ASSEMBLY__ */
+#endif /* !CONFIG_X86_64 */
+
+#ifndef __ASSEMBLY__
+
 /* For arch-specific code, we can use direct single-insn ops (they
  * don't give an lvalue though). */
 extern void __bad_percpu_size(void);
@@ -142,5 +142,4 @@ do {							\
 #define x86_sub_percpu(var, val) percpu_to_op("sub", per_cpu__##var, val)
 #define x86_or_percpu(var, val) percpu_to_op("or", per_cpu__##var, val)
 #endif /* !__ASSEMBLY__ */
-#endif /* !CONFIG_X86_64 */
 #endif /* _ASM_X86_PERCPU_H_ */
Index: linux-2.6/arch/x86/kernel/smpboot.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smpboot.c	2008-05-29 17:57:39.608964052 -0700
+++ linux-2.6/arch/x86/kernel/smpboot.c	2008-05-29 18:17:18.539452880 -0700
@@ -855,22 +855,6 @@ static int __cpuinit do_boot_cpu(int api
 		printk(KERN_ERR "Failed to allocate GDT for CPU %d\n", cpu);
 		return -1;
 	}
-
-	/* Allocate node local memory for AP pdas */
-	if (cpu_pda(cpu) == &boot_cpu_pda[cpu]) {
-		struct x8664_pda *newpda, *pda;
-		int node = cpu_to_node(cpu);
-		pda = cpu_pda(cpu);
-		newpda = kmalloc_node(sizeof(struct x8664_pda), GFP_ATOMIC,
-				      node);
-		if (newpda) {
-			memcpy(newpda, pda, sizeof(struct x8664_pda));
-			cpu_pda(cpu) = newpda;
-		} else
-			printk(KERN_ERR
-		"Could not allocate node local PDA for CPU %d on node %d\n",
-				cpu, node);
-	}
 #endif
 
 	alternatives_smp_switch(1);

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 38/41] x86: Extend percpu ops to 64 bit
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (36 preceding siblings ...)
  2008-05-30  3:56 ` [patch 37/41] x86_64: Fold pda into per cpu area Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:56 ` [patch 39/41] x86: Replace cpu_pda() using percpu logic and get rid of _cpu_pda() Christoph Lameter
                   ` (3 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: zero_based_percpu_64bit --]
[-- Type: text/plain, Size: 3964 bytes --]

x86 percpu ops will now work on 64 bit too. So add the missing 8 byte cases.
Also add a number of atomic ops that will be useful in the future:
x86_xchg_percpu() and x86_cmpxchg_percpu().

Add x86_inc_percpu and x86_dec_percpu. Increment by one can generate more
efficient instructions and inc/dec will be supported by cpu ops later.

Also use per_cpu_var() instead of per_cpu__##xxx.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
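
For reference, a userspace model (not part of the patch) of the cmpxchg
semantics percpu_cmpxchg_op() is expected to provide: return the previous
value, and only store the new value when the previous value matched "old".
model_cmpxchg() is a stand-in; the retry loop shows how x86_cmpxchg_percpu()
would typically be used.

/*
 * Userspace model, not kernel code.
 */
#include <stdio.h>

static long model_cmpxchg(long *var, long old, long new)
{
	long prev = *var;

	if (prev == old)
		*var = new;
	return prev;		/* caller retries when prev != old */
}

int main(void)
{
	long counter = 5;	/* stands in for a per cpu variable */
	long old, new;

	do {			/* typical lock free update loop */
		old = counter;
		new = old + 3;
	} while (model_cmpxchg(&counter, old, new) != old);

	printf("%ld\n", counter);	/* prints 8 */
	return 0;
}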

---
 include/asm-x86/percpu.h |   83 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 78 insertions(+), 5 deletions(-)

Index: linux-2.6/include/asm-x86/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-x86/percpu.h	2008-05-29 20:29:40.000000000 -0700
+++ linux-2.6/include/asm-x86/percpu.h	2008-05-29 20:32:03.000000000 -0700
@@ -108,6 +108,11 @@ do {							\
 		    : "+m" (var)			\
 		    : "ri" ((T__)val));			\
 		break;					\
+	case 8:						\
+		asm(op "q %1,"__percpu_seg"%0"		\
+		    : "+m" (var)			\
+		    : "ri" ((T__)val));			\
+		break;					\
 	default: __bad_percpu_size();			\
 	}						\
 } while (0)
@@ -131,15 +136,83 @@ do {							\
 		    : "=r" (ret__)			\
 		    : "m" (var));			\
 		break;					\
+	case 8:						\
+		asm(op "q "__percpu_seg"%1,%0"		\
+		    : "=r" (ret__)			\
+		    : "m" (var));			\
+		break;					\
 	default: __bad_percpu_size();			\
 	}						\
 	ret__;						\
 })
 
-#define x86_read_percpu(var) percpu_from_op("mov", per_cpu__##var)
-#define x86_write_percpu(var, val) percpu_to_op("mov", per_cpu__##var, val)
-#define x86_add_percpu(var, val) percpu_to_op("add", per_cpu__##var, val)
-#define x86_sub_percpu(var, val) percpu_to_op("sub", per_cpu__##var, val)
-#define x86_or_percpu(var, val) percpu_to_op("or", per_cpu__##var, val)
+#define percpu_addr_op(op, var)				\
+({							\
+	switch (sizeof(var)) {				\
+	case 1:						\
+		asm(op "b "__percpu_seg"%0"		\
+				: : "m"(var));		\
+		break;					\
+	case 2:						\
+		asm(op "w "__percpu_seg"%0"		\
+				: : "m"(var));		\
+		break;					\
+	case 4:						\
+		asm(op "l "__percpu_seg"%0"		\
+				: : "m"(var));		\
+		break;					\
+	case 8:						\
+		asm(op "q "__percpu_seg"%0"		\
+				: : "m"(var));		\
+		break;					\
+	default: __bad_percpu_size();			\
+	}						\
+})
+
+#define percpu_cmpxchg_op(var, old, new)				\
+({									\
+	typeof(var) prev;						\
+	switch (sizeof(var)) {						\
+	case 1:								\
+		asm("cmpxchgb %b1, "__percpu_seg"%2"			\
+				     : "=a"(prev)			\
+				     : "q"(new), "m"(var), "0"(old)	\
+				     : "memory");			\
+		break;							\
+	case 2:								\
+		asm("cmpxchgw %w1, "__percpu_seg"%2"			\
+				     : "=a"(prev)			\
+				     : "r"(new), "m"(var), "0"(old)	\
+				     : "memory");			\
+		break;							\
+	case 4:								\
+		asm("cmpxchgl %k1, "__percpu_seg"%2"			\
+				     : "=a"(prev)			\
+				     : "r"(new), "m"(var), "0"(old)	\
+				     : "memory");			\
+		break;							\
+	case 8:								\
+		asm("cmpxchgq %1, "__percpu_seg"%2"			\
+				     : "=a"(prev)			\
+				     : "r"(new), "m"(var), "0"(old)	\
+				     : "memory");			\
+		break;							\
+	default:							\
+		__bad_percpu_size();					\
+	}								\
+	prev;								\
+})
+
+#define x86_read_percpu(var) percpu_from_op("mov", per_cpu_var(var))
+#define x86_write_percpu(var, val) percpu_to_op("mov", per_cpu_var(var), val)
+#define x86_add_percpu(var, val) percpu_to_op("add", per_cpu_var(var), val)
+#define x86_sub_percpu(var, val) percpu_to_op("sub", per_cpu_var(var), val)
+#define x86_inc_percpu(var) percpu_addr_op("inc", per_cpu_var(var))
+#define x86_dec_percpu(var) percpu_addr_op("dec", per_cpu_var(var))
+#define x86_or_percpu(var, val) percpu_to_op("or", per_cpu_var(var), val)
+#define x86_xchg_percpu(var, val) percpu_to_op("xchg", per_cpu_var(var), val)
+#define x86_cmpxchg_percpu(var, old, new) \
+				percpu_cmpxchg_op(per_cpu_var(var), old, new)
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_PERCPU_H_ */

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 39/41] x86: Replace cpu_pda() using percpu logic and get rid of _cpu_pda()
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (37 preceding siblings ...)
  2008-05-30  3:56 ` [patch 38/41] x86: Extend percpu ops to 64 bit Christoph Lameter
@ 2008-05-30  3:56 ` Christoph Lameter
  2008-05-30  3:57 ` [patch 40/41] x86: Replace xxx_pda() operations with x86_xx_percpu() Christoph Lameter
                   ` (2 subsequent siblings)
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:56 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: zero_based_replace_pda_with_percpu_ops --]
[-- Type: text/plain, Size: 11316 bytes --]

cpu_pda() points to the pda, which is now at the beginning of the per cpu area.
This means that cpu_pda() and _cpu_pda[] are both pointing at the percpu area!
per_cpu() can be used instead of cpu_pda() when accessing pda fields.

Typically the offsets to the per cpu areas are stored in an array __per_cpu_offset[]
(generic per cpu support can then provide more functionality).
Use that array for x86_64 as well and get rid of the pda pointers.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
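
A standalone model (not from the patch) of why per_cpu(pda.field, cpu) reaches
the same storage the old cpu_pda(cpu)->field did, once each cpu's area starts
with its pda. The per_cpu() macro, __per_cpu_offset[] and struct x8664_pda here
are reduced stand-ins for the kernel versions.

/*
 * Userspace model, not kernel code (uses the GNU typeof extension).
 */
#include <stdio.h>

#define NR_CPUS 2

struct x8664_pda {
	unsigned int __nmi_count;
	unsigned int apic_timer_irqs;
};

static struct x8664_pda pda;			/* the "template" instance */
static struct x8664_pda cpu_areas[NR_CPUS];	/* each cpu's area starts with its pda */
static long __per_cpu_offset[NR_CPUS];

#define per_cpu(var, cpu) \
	(*(typeof(&(var)))((char *)&(var) + __per_cpu_offset[cpu]))

int main(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		__per_cpu_offset[cpu] =
			(char *)&cpu_areas[cpu] - (char *)&pda;

	per_cpu(pda.__nmi_count, 1) = 7;	/* was: cpu_pda(1)->__nmi_count = 7 */
	printf("%u\n", cpu_areas[1].__nmi_count);	/* prints 7 */
	return 0;
}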

---
 arch/x86/kernel/head64.c   |    7 ++++++-
 arch/x86/kernel/irq_64.c   |   16 ++++++++--------
 arch/x86/kernel/nmi_64.c   |    6 +++---
 arch/x86/kernel/setup.c    |   15 ++++-----------
 arch/x86/kernel/setup64.c  |    6 +-----
 arch/x86/kernel/smpboot.c  |    2 +-
 arch/x86/kernel/traps_64.c |    9 +++++----
 include/asm-x86/pda.h      |    4 ----
 include/asm-x86/percpu.h   |   32 +++++++-------------------------
 9 files changed, 35 insertions(+), 62 deletions(-)

Index: linux-2.6/arch/x86/kernel/head64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/head64.c	2008-05-29 20:29:55.000000000 -0700
+++ linux-2.6/arch/x86/kernel/head64.c	2008-05-29 20:48:18.000000000 -0700
@@ -119,8 +119,13 @@ static void __init reserve_setup_data(vo
 	}
 }
 
+static struct x8664_pda boot_cpu_pda[NR_CPUS] __cacheline_aligned;
+
 void __init x86_64_start_kernel(char * real_mode_data)
 {
+#ifndef CONFIG_SMP
+	unsigned long __per_cpu_offset[1];
+#endif
 	int i;
 
 	/*
@@ -157,7 +162,7 @@ void __init x86_64_start_kernel(char * r
 	early_printk("Kernel alive\n");
 
  	for (i = 0; i < NR_CPUS; i++)
- 		cpu_pda(i) = &boot_cpu_pda[i];
+ 		__per_cpu_offset[i] = (unsigned long)&boot_cpu_pda[i];
 
 	pda_init(0);
 	copy_bootdata(__va(real_mode_data));
Index: linux-2.6/arch/x86/kernel/irq_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/irq_64.c	2008-05-29 20:29:55.000000000 -0700
+++ linux-2.6/arch/x86/kernel/irq_64.c	2008-05-29 20:48:18.000000000 -0700
@@ -115,37 +115,37 @@ skip:
 	} else if (i == NR_IRQS) {
 		seq_printf(p, "NMI: ");
 		for_each_online_cpu(j)
-			seq_printf(p, "%10u ", cpu_pda(j)->__nmi_count);
+			seq_printf(p, "%10u ", per_cpu(pda.__nmi_count, j));
 		seq_printf(p, "  Non-maskable interrupts\n");
 		seq_printf(p, "LOC: ");
 		for_each_online_cpu(j)
-			seq_printf(p, "%10u ", cpu_pda(j)->apic_timer_irqs);
+			seq_printf(p, "%10u ", per_cpu(pda.apic_timer_irqs, j));
 		seq_printf(p, "  Local timer interrupts\n");
 #ifdef CONFIG_SMP
 		seq_printf(p, "RES: ");
 		for_each_online_cpu(j)
-			seq_printf(p, "%10u ", cpu_pda(j)->irq_resched_count);
+			seq_printf(p, "%10u ", per_cpu(pda.irq_resched_count, j));
 		seq_printf(p, "  Rescheduling interrupts\n");
 		seq_printf(p, "CAL: ");
 		for_each_online_cpu(j)
-			seq_printf(p, "%10u ", cpu_pda(j)->irq_call_count);
+			seq_printf(p, "%10u ", per_cpu(pda.irq_call_count, j));
 		seq_printf(p, "  function call interrupts\n");
 		seq_printf(p, "TLB: ");
 		for_each_online_cpu(j)
-			seq_printf(p, "%10u ", cpu_pda(j)->irq_tlb_count);
+			seq_printf(p, "%10u ", per_cpu(pda.irq_tlb_count, j));
 		seq_printf(p, "  TLB shootdowns\n");
 #endif
 		seq_printf(p, "TRM: ");
 		for_each_online_cpu(j)
-			seq_printf(p, "%10u ", cpu_pda(j)->irq_thermal_count);
+			seq_printf(p, "%10u ", per_cpu(pda.irq_thermal_count, j));
 		seq_printf(p, "  Thermal event interrupts\n");
 		seq_printf(p, "THR: ");
 		for_each_online_cpu(j)
-			seq_printf(p, "%10u ", cpu_pda(j)->irq_threshold_count);
+			seq_printf(p, "%10u ", per_cpu(pda.irq_threshold_count, j));
 		seq_printf(p, "  Threshold APIC interrupts\n");
 		seq_printf(p, "SPU: ");
 		for_each_online_cpu(j)
-			seq_printf(p, "%10u ", cpu_pda(j)->irq_spurious_count);
+			seq_printf(p, "%10u ", per_cpu(pda.irq_spurious_count, j));
 		seq_printf(p, "  Spurious interrupts\n");
 		seq_printf(p, "ERR: %10u\n", atomic_read(&irq_err_count));
 	}
Index: linux-2.6/arch/x86/kernel/nmi_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/nmi_64.c	2008-05-29 20:29:55.000000000 -0700
+++ linux-2.6/arch/x86/kernel/nmi_64.c	2008-05-29 20:48:18.000000000 -0700
@@ -100,19 +100,19 @@ int __init check_nmi_watchdog(void)
 #endif
 
 	for (cpu = 0; cpu < NR_CPUS; cpu++)
-		prev_nmi_count[cpu] = cpu_pda(cpu)->__nmi_count;
+		prev_nmi_count[cpu] = per_cpu(pda.__nmi_count, cpu);
 	local_irq_enable();
 	mdelay((20*1000)/nmi_hz); // wait 20 ticks
 
 	for_each_online_cpu(cpu) {
 		if (!per_cpu(wd_enabled, cpu))
 			continue;
-		if (cpu_pda(cpu)->__nmi_count - prev_nmi_count[cpu] <= 5) {
+		if (per_cpu(pda.__nmi_count, cpu) - prev_nmi_count[cpu] <= 5) {
 			printk(KERN_WARNING "WARNING: CPU#%d: NMI "
 			       "appears to be stuck (%d->%d)!\n",
 				cpu,
 				prev_nmi_count[cpu],
-				cpu_pda(cpu)->__nmi_count);
+				per_cpu(pda.__nmi_count, cpu));
 			per_cpu(wd_enabled, cpu) = 0;
 			atomic_dec(&nmi_active);
 		}
Index: linux-2.6/arch/x86/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup.c	2008-05-29 20:29:55.000000000 -0700
+++ linux-2.6/arch/x86/kernel/setup.c	2008-05-29 20:48:18.000000000 -0700
@@ -77,14 +77,8 @@ static void __init setup_cpumask_of_cpu(
 static inline void setup_cpumask_of_cpu(void) { }
 #endif
 
-#ifdef CONFIG_X86_32
-/*
- * Great future not-so-futuristic plan: make i386 and x86_64 do it
- * the same way
- */
 unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
 EXPORT_SYMBOL(__per_cpu_offset);
-#endif
 
 /*
  * Great future plan:
@@ -128,12 +122,11 @@ void __init setup_per_cpu_areas(void)
 		 * So far an embryonic per cpu area was used containing only
 		 * the pda. Move the pda contents into the full per cpu area.
 		  */
-		cpu_pda(i)->data_offset = ptr - __per_cpu_start;
-		memcpy(ptr, cpu_pda(i), sizeof(struct x8664_pda));
-		cpu_pda(i) = (struct x8664_pda *)ptr;
-#else
-		__per_cpu_offset[i] = ptr - __per_cpu_start;
+		per_cpu(pda.data_offset, i) = ptr - __per_cpu_start;
+		memcpy(ptr, &per_cpu(pda, i), sizeof(struct x8664_pda));
 #endif
+		__per_cpu_offset[i] = ptr - __per_cpu_start;
+
 		highest_cpu = i;
 	}
 
Index: linux-2.6/arch/x86/kernel/setup64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup64.c	2008-05-29 20:29:55.000000000 -0700
+++ linux-2.6/arch/x86/kernel/setup64.c	2008-05-29 20:48:18.000000000 -0700
@@ -34,10 +34,6 @@ struct boot_params boot_params;
 
 cpumask_t cpu_initialized __cpuinitdata = CPU_MASK_NONE;
 
-struct x8664_pda *_cpu_pda[NR_CPUS] __read_mostly;
-EXPORT_SYMBOL(_cpu_pda);
-struct x8664_pda boot_cpu_pda[NR_CPUS] __cacheline_aligned;
-
 struct desc_ptr idt_descr = { 256 * 16 - 1, (unsigned long) idt_table };
 
 char boot_cpu_stack[IRQSTACKSIZE] __attribute__((section(".bss.page_aligned")));
@@ -89,7 +85,7 @@ __setup("noexec32=", nonx32_setup);
 
 void pda_init(int cpu)
 { 
-	struct x8664_pda *pda = cpu_pda(cpu);
+	struct x8664_pda *pda = &per_cpu(pda, cpu);
 
 	/* Setup up data that may be needed in __get_free_pages early */
 	asm volatile("movl %0,%%fs ; movl %0,%%gs" :: "r" (0)); 
Index: linux-2.6/arch/x86/kernel/smpboot.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smpboot.c	2008-05-29 20:29:55.000000000 -0700
+++ linux-2.6/arch/x86/kernel/smpboot.c	2008-05-29 20:48:18.000000000 -0700
@@ -895,7 +895,7 @@ do_rest:
 	stack_start.sp = (void *) c_idle.idle->thread.sp;
 	irq_ctx_init(cpu);
 #else
-	cpu_pda(cpu)->pcurrent = c_idle.idle;
+	per_cpu(pda.pcurrent, cpu) = c_idle.idle;
 	init_rsp = c_idle.idle->thread.sp;
 	load_sp0(&per_cpu(init_tss, cpu), &c_idle.idle->thread);
 	initial_code = (unsigned long)start_secondary;
Index: linux-2.6/arch/x86/kernel/traps_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/traps_64.c	2008-05-29 20:29:56.000000000 -0700
+++ linux-2.6/arch/x86/kernel/traps_64.c	2008-05-29 20:48:18.000000000 -0700
@@ -263,7 +263,8 @@ void dump_trace(struct task_struct *tsk,
 		const struct stacktrace_ops *ops, void *data)
 {
 	const unsigned cpu = get_cpu();
-	unsigned long *irqstack_end = (unsigned long*)cpu_pda(cpu)->irqstackptr;
+	unsigned long *irqstack_end =
+		(unsigned long*)per_cpu(pda.irqstackptr, cpu);
 	unsigned used = 0;
 	struct thread_info *tinfo;
 
@@ -397,8 +398,8 @@ _show_stack(struct task_struct *tsk, str
 	unsigned long *stack;
 	int i;
 	const int cpu = smp_processor_id();
-	unsigned long *irqstack_end = (unsigned long *) (cpu_pda(cpu)->irqstackptr);
-	unsigned long *irqstack = (unsigned long *) (cpu_pda(cpu)->irqstackptr - IRQSTACKSIZE);
+	unsigned long *irqstack_end = (unsigned long *)per_cpu(pda.irqstackptr, cpu);
+	unsigned long *irqstack = (unsigned long *)(per_cpu(pda.irqstackptr, cpu) - IRQSTACKSIZE);
 
 	// debugging aid: "show_stack(NULL, NULL);" prints the
 	// back trace for this cpu.
@@ -462,7 +463,7 @@ void show_registers(struct pt_regs *regs
 	int i;
 	unsigned long sp;
 	const int cpu = smp_processor_id();
-	struct task_struct *cur = cpu_pda(cpu)->pcurrent;
+	struct task_struct *cur = per_cpu(pda.pcurrent, cpu);
 	u8 *ip;
 	unsigned int code_prologue = code_bytes * 43 / 64;
 	unsigned int code_len = code_bytes;
Index: linux-2.6/include/asm-x86/pda.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pda.h	2008-05-29 20:29:56.000000000 -0700
+++ linux-2.6/include/asm-x86/pda.h	2008-05-29 20:48:18.000000000 -0700
@@ -37,12 +37,8 @@ struct x8664_pda {
 	unsigned irq_spurious_count;
 } ____cacheline_aligned_in_smp;
 
-extern struct x8664_pda *_cpu_pda[];
-extern struct x8664_pda boot_cpu_pda[];
 extern void pda_init(int);
 
-#define cpu_pda(i) (_cpu_pda[i])
-
 /*
  * There is no fast way to get the base address of the PDA, all the accesses
  * have to mention %fs/%gs.  So it needs to be done this Torvaldian way.
Index: linux-2.6/include/asm-x86/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-x86/percpu.h	2008-05-29 20:32:03.000000000 -0700
+++ linux-2.6/include/asm-x86/percpu.h	2008-05-29 20:48:18.000000000 -0700
@@ -6,12 +6,12 @@
 #include <asm/pda.h>
 
 #ifdef CONFIG_SMP
-#define __per_cpu_offset(cpu) (cpu_pda(cpu)->data_offset)
-#define __my_cpu_offset read_pda(data_offset)
-#define per_cpu_offset(x) (__per_cpu_offset(x))
-#endif
-
+#define __my_cpu_offset x86_read_percpu(pda.data_offset)
 #define __percpu_seg "%%gs:"
+#else
+#define __percpu_seg ""
+
+#endif
 
 #include <asm-generic/percpu.h>
 
@@ -46,30 +46,12 @@ DECLARE_PER_CPU(struct x8664_pda, pda);
 
 #else /* ...!ASSEMBLY */
 
-/*
- * PER_CPU finds an address of a per-cpu variable.
- *
- * Args:
- *    var - variable name
- *    cpu - 32bit register containing the current CPU number
- *
- * The resulting address is stored in the "cpu" argument.
- *
- * Example:
- *    PER_CPU(cpu_gdt_descr, %ebx)
- */
 #ifdef CONFIG_SMP
-
 #define __my_cpu_offset x86_read_percpu(this_cpu_off)
-
-/* fs segment starts at (positive) offset == __per_cpu_offset[cpu] */
 #define __percpu_seg "%%fs:"
-
-#else  /* !SMP */
-
+#else
 #define __percpu_seg ""
-
-#endif	/* SMP */
+#endif
 
 #include <asm-generic/percpu.h>
 

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 40/41] x86: Replace xxx_pda() operations with x86_xx_percpu().
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (38 preceding siblings ...)
  2008-05-30  3:56 ` [patch 39/41] x86: Replace cpu_pda() using percpu logic and get rid of _cpu_pda() Christoph Lameter
@ 2008-05-30  3:57 ` Christoph Lameter
  2008-05-30  3:57 ` [patch 41/41] x86_64: Support for cpu ops Christoph Lameter
  2008-05-30  4:58 ` [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Andrew Morton
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:57 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: zero_based_replace_pda_operations --]
[-- Type: text/plain, Size: 15521 bytes --]

It is now possible to use percpu operations for pda access since the pda is
in the percpu area. Drop the pda operations.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
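
The conversion pattern in a nutshell, shown with lines taken from the hunks
below (the field names are just the ones those hunks happen to touch):

	/* before */
	add_pda(irq_resched_count, 1);
	prev->usersp = read_pda(oldrsp);

	/* after */
	x86_inc_percpu(pda.irq_resched_count);
	prev->usersp = x86_read_percpu(pda.oldrsp);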

---
 arch/x86/kernel/apic_64.c                 |    4 -
 arch/x86/kernel/cpu/mcheck/mce_amd_64.c   |    2 
 arch/x86/kernel/cpu/mcheck/mce_intel_64.c |    2 
 arch/x86/kernel/nmi_64.c                  |    5 +
 arch/x86/kernel/process_64.c              |   16 ++---
 arch/x86/kernel/smp.c                     |    4 -
 arch/x86/kernel/time_64.c                 |    2 
 arch/x86/kernel/tlb_64.c                  |   12 ++--
 arch/x86/kernel/x8664_ksyms_64.c          |    2 
 include/asm-x86/current_64.h              |    4 -
 include/asm-x86/hardirq_64.h              |    6 +-
 include/asm-x86/mmu_context_64.h          |   12 ++--
 include/asm-x86/pda.h                     |   86 ------------------------------
 include/asm-x86/smp.h                     |    2 
 include/asm-x86/thread_info_64.h          |    2 
 15 files changed, 37 insertions(+), 124 deletions(-)

Index: linux-2.6/arch/x86/kernel/apic_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/apic_64.c	2008-05-24 00:14:00.936487393 -0700
+++ linux-2.6/arch/x86/kernel/apic_64.c	2008-05-24 00:15:07.877737074 -0700
@@ -481,7 +481,7 @@ static void local_apic_timer_interrupt(v
 	/*
 	 * the NMI deadlock-detector uses this.
 	 */
-	add_pda(apic_timer_irqs, 1);
+	x86_inc_percpu(pda.apic_timer_irqs);
 
 	evt->event_handler(evt);
 }
@@ -986,7 +986,7 @@ asmlinkage void smp_spurious_interrupt(v
 	if (v & (1 << (SPURIOUS_APIC_VECTOR & 0x1f)))
 		ack_APIC_irq();
 
-	add_pda(irq_spurious_count, 1);
+	x86_inc_percpu(pda.irq_spurious_count);
 	irq_exit();
 }
 
Index: linux-2.6/arch/x86/kernel/cpu/mcheck/mce_amd_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/mcheck/mce_amd_64.c	2008-05-24 00:14:00.946486866 -0700
+++ linux-2.6/arch/x86/kernel/cpu/mcheck/mce_amd_64.c	2008-05-24 00:15:07.877737074 -0700
@@ -237,7 +237,7 @@ asmlinkage void mce_threshold_interrupt(
 		}
 	}
 out:
-	add_pda(irq_threshold_count, 1);
+	x86_inc_percpu(pda.irq_threshold_count);
 	irq_exit();
 }
 
Index: linux-2.6/arch/x86/kernel/cpu/mcheck/mce_intel_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/mcheck/mce_intel_64.c	2008-05-24 00:14:00.956487483 -0700
+++ linux-2.6/arch/x86/kernel/cpu/mcheck/mce_intel_64.c	2008-05-24 00:15:07.907736191 -0700
@@ -26,7 +26,7 @@ asmlinkage void smp_thermal_interrupt(vo
 	if (therm_throt_process(msr_val & 1))
 		mce_log_therm_throt_event(smp_processor_id(), msr_val);
 
-	add_pda(irq_thermal_count, 1);
+	x86_inc_percpu(pda.irq_thermal_count);
 	irq_exit();
 }
 
Index: linux-2.6/arch/x86/kernel/nmi_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/nmi_64.c	2008-05-24 00:14:00.966487850 -0700
+++ linux-2.6/arch/x86/kernel/nmi_64.c	2008-05-24 00:15:07.937736041 -0700
@@ -328,7 +328,8 @@ nmi_watchdog_tick(struct pt_regs *regs, 
 		touched = 1;
 	}
 
-	sum = read_pda(apic_timer_irqs) + read_pda(irq0_irqs);
+	sum = x86_read_percpu(pda.apic_timer_irqs) +
+					x86_read_percpu(pda.irq0_irqs);
 	if (__get_cpu_var(nmi_touch)) {
 		__get_cpu_var(nmi_touch) = 0;
 		touched = 1;
@@ -389,7 +390,7 @@ asmlinkage notrace __kprobes void
 do_nmi(struct pt_regs *regs, long error_code)
 {
 	nmi_enter();
-	add_pda(__nmi_count,1);
+	x86_inc_percpu(pda.__nmi_count);
 	if (!ignore_nmis)
 		default_do_nmi(regs);
 	nmi_exit();
Index: linux-2.6/arch/x86/kernel/process_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/process_64.c	2008-05-24 00:14:00.976488140 -0700
+++ linux-2.6/arch/x86/kernel/process_64.c	2008-05-24 00:15:22.687263321 -0700
@@ -74,13 +74,13 @@ void idle_notifier_register(struct notif
 
 void enter_idle(void)
 {
-	write_pda(isidle, 1);
+	x86_write_percpu(pda.isidle, 1);
 	atomic_notifier_call_chain(&idle_notifier, IDLE_START, NULL);
 }
 
 static void __exit_idle(void)
 {
-	if (test_and_clear_bit_pda(0, isidle) == 0)
+	if (test_and_clear_bit(0, &per_cpu(pda.isidle, smp_processor_id())) == 0)
 		return;
 	atomic_notifier_call_chain(&idle_notifier, IDLE_END, NULL);
 }
@@ -411,7 +411,7 @@ start_thread(struct pt_regs *regs, unsig
 	load_gs_index(0);
 	regs->ip		= new_ip;
 	regs->sp		= new_sp;
-	write_pda(oldrsp, new_sp);
+	x86_write_percpu(pda.oldrsp, new_sp);
 	regs->cs		= __USER_CS;
 	regs->ss		= __USER_DS;
 	regs->flags		= 0x200;
@@ -633,14 +633,14 @@ __switch_to(struct task_struct *prev_p, 
 	/* 
 	 * Switch the PDA and FPU contexts.
 	 */
-	prev->usersp = read_pda(oldrsp);
-	write_pda(oldrsp, next->usersp);
-	write_pda(pcurrent, next_p); 
+	prev->usersp = x86_read_percpu(pda.oldrsp);
+	x86_write_percpu(pda.oldrsp, next->usersp);
+	x86_write_percpu(pda.pcurrent, next_p);
 
-	write_pda(kernelstack,
+	x86_write_percpu(pda.kernelstack,
 	(unsigned long)task_stack_page(next_p) + THREAD_SIZE - PDA_STACKOFFSET);
 #ifdef CONFIG_CC_STACKPROTECTOR
-	write_pda(stack_canary, next_p->stack_canary);
+	x86_write_percpu(pda.stack_canary, next_p->stack_canary);
 	/*
 	 * Build time only check to make sure the stack_canary is at
 	 * offset 40 in the pda; this is a gcc ABI requirement
Index: linux-2.6/arch/x86/kernel/smp.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smp.c	2008-05-24 00:14:00.986488228 -0700
+++ linux-2.6/arch/x86/kernel/smp.c	2008-05-24 00:15:07.977736114 -0700
@@ -295,7 +295,7 @@ void smp_reschedule_interrupt(struct pt_
 #ifdef CONFIG_X86_32
 	__get_cpu_var(irq_stat).irq_resched_count++;
 #else
-	add_pda(irq_resched_count, 1);
+	x86_inc_percpu(pda.irq_resched_count);
 #endif
 }
 
@@ -320,7 +320,7 @@ void smp_call_function_interrupt(struct 
 #ifdef CONFIG_X86_32
 	__get_cpu_var(irq_stat).irq_call_count++;
 #else
-	add_pda(irq_call_count, 1);
+	x86_inc_percpu(pda.irq_call_count);
 #endif
 	irq_exit();
 
Index: linux-2.6/arch/x86/kernel/time_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/time_64.c	2008-05-24 00:14:00.996488252 -0700
+++ linux-2.6/arch/x86/kernel/time_64.c	2008-05-24 00:15:07.997736055 -0700
@@ -46,7 +46,7 @@ EXPORT_SYMBOL(profile_pc);
 
 static irqreturn_t timer_event_interrupt(int irq, void *dev_id)
 {
-	add_pda(irq0_irqs, 1);
+	x86_inc_percpu(pda.irq0_irqs);
 
 	global_clock_event->event_handler(global_clock_event);
 
Index: linux-2.6/arch/x86/kernel/tlb_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/tlb_64.c	2008-05-24 00:14:01.006487889 -0700
+++ linux-2.6/arch/x86/kernel/tlb_64.c	2008-05-24 00:15:08.037736066 -0700
@@ -60,9 +60,9 @@ static DEFINE_PER_CPU(union smp_flush_st
  */
 void leave_mm(int cpu)
 {
-	if (read_pda(mmu_state) == TLBSTATE_OK)
+	if (x86_read_percpu(pda.mmu_state) == TLBSTATE_OK)
 		BUG();
-	cpu_clear(cpu, read_pda(active_mm)->cpu_vm_mask);
+	cpu_clear(cpu, x86_read_percpu(pda.active_mm)->cpu_vm_mask);
 	load_cr3(swapper_pg_dir);
 }
 EXPORT_SYMBOL_GPL(leave_mm);
@@ -140,8 +140,8 @@ asmlinkage void smp_invalidate_interrupt
 		 * BUG();
 		 */
 
-	if (f->flush_mm == read_pda(active_mm)) {
-		if (read_pda(mmu_state) == TLBSTATE_OK) {
+	if (f->flush_mm == x86_read_percpu(pda.active_mm)) {
+		if (x86_read_percpu(pda.mmu_state) == TLBSTATE_OK) {
 			if (f->flush_va == TLB_FLUSH_ALL)
 				local_flush_tlb();
 			else
@@ -152,7 +152,7 @@ asmlinkage void smp_invalidate_interrupt
 out:
 	ack_APIC_irq();
 	cpu_clear(cpu, f->flush_cpumask);
-	add_pda(irq_tlb_count, 1);
+	x86_inc_percpu(pda.irq_tlb_count);
 }
 
 void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
@@ -264,7 +264,7 @@ static void do_flush_tlb_all(void *info)
 	unsigned long cpu = smp_processor_id();
 
 	__flush_tlb_all();
-	if (read_pda(mmu_state) == TLBSTATE_LAZY)
+	if (x86_read_percpu(pda.mmu_state) == TLBSTATE_LAZY)
 		leave_mm(cpu);
 }
 
Index: linux-2.6/include/asm-x86/pda.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pda.h	2008-05-24 00:14:01.016487470 -0700
+++ linux-2.6/include/asm-x86/pda.h	2008-05-24 00:15:22.598997235 -0700
@@ -39,92 +39,6 @@ struct x8664_pda {
 
 extern void pda_init(int);
 
-/*
- * There is no fast way to get the base address of the PDA, all the accesses
- * have to mention %fs/%gs.  So it needs to be done this Torvaldian way.
- */
-extern void __bad_pda_field(void) __attribute__((noreturn));
-
-/*
- * proxy_pda doesn't actually exist, but tell gcc it is accessed for
- * all PDA accesses so it gets read/write dependencies right.
- */
-extern struct x8664_pda _proxy_pda;
-
-#define pda_offset(field) offsetof(struct x8664_pda, field)
-
-#define pda_to_op(op, field, val)					\
-do {									\
-	typedef typeof(_proxy_pda.field) T__;				\
-	if (0) { T__ tmp__; tmp__ = (val); }	/* type checking */	\
-	switch (sizeof(_proxy_pda.field)) {				\
-	case 2:								\
-		asm(op "w %1,%%gs:%c2" :				\
-		    "+m" (_proxy_pda.field) :				\
-		    "ri" ((T__)val),					\
-		    "i"(pda_offset(field)));				\
-		break;							\
-	case 4:								\
-		asm(op "l %1,%%gs:%c2" :				\
-		    "+m" (_proxy_pda.field) :				\
-		    "ri" ((T__)val),					\
-		    "i" (pda_offset(field)));				\
-		break;							\
-	case 8:								\
-		asm(op "q %1,%%gs:%c2":					\
-		    "+m" (_proxy_pda.field) :				\
-		    "ri" ((T__)val),					\
-		    "i"(pda_offset(field)));				\
-		break;							\
-	default:							\
-		__bad_pda_field();					\
-	}								\
-} while (0)
-
-#define pda_from_op(op, field)			\
-({						\
-	typeof(_proxy_pda.field) ret__;		\
-	switch (sizeof(_proxy_pda.field)) {	\
-	case 2:					\
-		asm(op "w %%gs:%c1,%0" :	\
-		    "=r" (ret__) :		\
-		    "i" (pda_offset(field)),	\
-		    "m" (_proxy_pda.field));	\
-		break;				\
-	case 4:					\
-		asm(op "l %%gs:%c1,%0":		\
-		    "=r" (ret__):		\
-		    "i" (pda_offset(field)),	\
-		    "m" (_proxy_pda.field));	\
-		break;				\
-	case 8:					\
-		asm(op "q %%gs:%c1,%0":		\
-		    "=r" (ret__) :		\
-		    "i" (pda_offset(field)),	\
-		    "m" (_proxy_pda.field));	\
-		break;				\
-	default:				\
-		__bad_pda_field();		\
-	}					\
-	ret__;					\
-})
-
-#define read_pda(field)		pda_from_op("mov", field)
-#define write_pda(field, val)	pda_to_op("mov", field, val)
-#define add_pda(field, val)	pda_to_op("add", field, val)
-#define sub_pda(field, val)	pda_to_op("sub", field, val)
-#define or_pda(field, val)	pda_to_op("or", field, val)
-
-/* This is not atomic against other CPUs -- CPU preemption needs to be off */
-#define test_and_clear_bit_pda(bit, field)				\
-({									\
-	int old__;							\
-	asm volatile("btr %2,%%gs:%c3\n\tsbbl %0,%0"			\
-		     : "=r" (old__), "+m" (_proxy_pda.field)		\
-		     : "dIr" (bit), "i" (pda_offset(field)) : "memory");\
-	old__;								\
-})
-
 #endif
 
 #define PDA_STACKOFFSET (5*8)
Index: linux-2.6/arch/x86/kernel/x8664_ksyms_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/x8664_ksyms_64.c	2008-05-24 00:13:59.416488421 -0700
+++ linux-2.6/arch/x86/kernel/x8664_ksyms_64.c	2008-05-24 00:15:22.668997592 -0700
@@ -52,8 +52,6 @@ EXPORT_SYMBOL(empty_zero_page);
 EXPORT_SYMBOL(init_level4_pgt);
 EXPORT_SYMBOL(load_gs_index);
 
-EXPORT_SYMBOL(_proxy_pda);
-
 #ifdef CONFIG_PARAVIRT
 /* Virtualized guests may want to use it */
 EXPORT_SYMBOL_GPL(cpu_gdt_descr);
Index: linux-2.6/include/asm-x86/current_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/current_64.h	2008-05-24 00:13:59.356487241 -0700
+++ linux-2.6/include/asm-x86/current_64.h	2008-05-24 00:15:22.528988082 -0700
@@ -5,11 +5,11 @@
 struct task_struct;
 
 #include <asm/pda.h>
+#include <asm/percpu.h>
 
 static inline struct task_struct *get_current(void)
 {
-	struct task_struct *t = read_pda(pcurrent);
-	return t;
+	return x86_read_percpu(pda.pcurrent);
 }
 
 #define current get_current()
Index: linux-2.6/include/asm-x86/hardirq_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/hardirq_64.h	2008-05-24 00:13:59.366487645 -0700
+++ linux-2.6/include/asm-x86/hardirq_64.h	2008-05-24 00:15:22.528988082 -0700
@@ -11,12 +11,12 @@
 
 #define __ARCH_IRQ_STAT 1
 
-#define local_softirq_pending() read_pda(__softirq_pending)
+#define local_softirq_pending() x86_read_percpu(pda.__softirq_pending)
 
 #define __ARCH_SET_SOFTIRQ_PENDING 1
 
-#define set_softirq_pending(x) write_pda(__softirq_pending, (x))
-#define or_softirq_pending(x)  or_pda(__softirq_pending, (x))
+#define set_softirq_pending(x) x86_write_percpu(pda.__softirq_pending, (x))
+#define or_softirq_pending(x)  x86_or_percpu(pda.__softirq_pending, (x))
 
 extern void ack_bad_irq(unsigned int irq);
 
Index: linux-2.6/include/asm-x86/mmu_context_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/mmu_context_64.h	2008-05-24 00:13:59.376487037 -0700
+++ linux-2.6/include/asm-x86/mmu_context_64.h	2008-05-24 00:15:22.558986281 -0700
@@ -20,8 +20,8 @@ void destroy_context(struct mm_struct *m
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
 #ifdef CONFIG_SMP
-	if (read_pda(mmu_state) == TLBSTATE_OK)
-		write_pda(mmu_state, TLBSTATE_LAZY);
+	if (x86_read_percpu(pda.mmu_state) == TLBSTATE_OK)
+		x86_write_percpu(pda.mmu_state, TLBSTATE_LAZY);
 #endif
 }
 
@@ -33,8 +33,8 @@ static inline void switch_mm(struct mm_s
 		/* stop flush ipis for the previous mm */
 		cpu_clear(cpu, prev->cpu_vm_mask);
 #ifdef CONFIG_SMP
-		write_pda(mmu_state, TLBSTATE_OK);
-		write_pda(active_mm, next);
+		x86_write_percpu(pda.mmu_state, TLBSTATE_OK);
+		x86_write_percpu(pda.active_mm, next);
 #endif
 		cpu_set(cpu, next->cpu_vm_mask);
 		load_cr3(next->pgd);
@@ -44,8 +44,8 @@ static inline void switch_mm(struct mm_s
 	}
 #ifdef CONFIG_SMP
 	else {
-		write_pda(mmu_state, TLBSTATE_OK);
-		if (read_pda(active_mm) != next)
+		x86_write_percpu(pda.mmu_state, TLBSTATE_OK);
+		if (x86_read_percpu(pda.active_mm) != next)
 			BUG();
 		if (!cpu_test_and_set(cpu, next->cpu_vm_mask)) {
 			/* We were in lazy tlb mode and leave_mm disabled
Index: linux-2.6/include/asm-x86/smp.h
===================================================================
--- linux-2.6.orig/include/asm-x86/smp.h	2008-05-24 00:13:59.386487983 -0700
+++ linux-2.6/include/asm-x86/smp.h	2008-05-24 00:15:22.628996584 -0700
@@ -143,7 +143,7 @@ DECLARE_PER_CPU(int, cpu_number);
 extern int safe_smp_processor_id(void);
 
 #elif defined(CONFIG_X86_64_SMP)
-#define raw_smp_processor_id()	read_pda(cpunumber)
+#define raw_smp_processor_id()	x86_read_percpu(pda.cpunumber)
 
 #define stack_smp_processor_id()					\
 ({								\
Index: linux-2.6/include/asm-x86/thread_info_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/thread_info_64.h	2008-05-24 00:13:59.406488336 -0700
+++ linux-2.6/include/asm-x86/thread_info_64.h	2008-05-24 00:15:22.648997403 -0700
@@ -63,7 +63,7 @@ struct thread_info {
 static inline struct thread_info *current_thread_info(void)
 {
 	struct thread_info *ti;
-	ti = (void *)(read_pda(kernelstack) + PDA_STACKOFFSET - THREAD_SIZE);
+	ti = (void *)(x86_read_percpu(pda.kernelstack) + PDA_STACKOFFSET - THREAD_SIZE);
 	return ti;
 }
 

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 41/41] x86_64: Support for cpu ops
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (39 preceding siblings ...)
  2008-05-30  3:57 ` [patch 40/41] x86: Replace xxx_pda() operations with x86_xx_percpu() Christoph Lameter
@ 2008-05-30  3:57 ` Christoph Lameter
  2008-05-30  4:58 ` [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Andrew Morton
  41 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  3:57 UTC (permalink / raw)
  To: akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

[-- Attachment #1: cpu_alloc_ops_x86 --]
[-- Type: text/plain, Size: 2393 bytes --]

Support fast cpu ops in x86_64 by providing a series of functions that
generate the proper instructions.

Define CONFIG_HAVE_CPU_OPS so that core code
can exploit the availability of fast per cpu operations.
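
A rough usage sketch (illustrative only; the "stats" structure and its field
are made up here, and CPU_ALLOC/CPU_INC come from the earlier patches in this
series):

	struct stats {
		long events;
	};
	struct stats *stats;

	stats = CPU_ALLOC(struct stats, GFP_KERNEL | __GFP_ZERO);
	...
	CPU_INC(stats->events);	/* a single "inc %gs:..." on x86_64 */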

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 arch/x86/Kconfig         |    4 ++++
 include/asm-x86/percpu.h |   31 +++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

Index: linux-2.6/arch/x86/Kconfig
===================================================================
--- linux-2.6.orig/arch/x86/Kconfig	2008-05-29 18:05:08.514214613 -0700
+++ linux-2.6/arch/x86/Kconfig	2008-05-29 18:09:55.889464792 -0700
@@ -168,6 +168,10 @@ config GENERIC_PENDING_IRQ
 	depends on GENERIC_HARDIRQS && SMP
 	default y
 
+config HAVE_CPU_OPS
+	bool
+	default y
+
 config X86_SMP
 	bool
 	depends on SMP && ((X86_32 && !X86_VOYAGER) || X86_64)
Index: linux-2.6/include/asm-x86/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-x86/percpu.h	2008-05-29 18:08:28.513214585 -0700
+++ linux-2.6/include/asm-x86/percpu.h	2008-05-29 18:09:55.889464792 -0700
@@ -196,5 +196,36 @@ do {							\
 #define x86_cmpxchg_percpu(var, old, new) \
 				percpu_cmpxchg_op(per_cpu_var(var), old, new)
 
+#define CPU_READ(obj)		percpu_from_op("mov", obj)
+#define CPU_WRITE(obj, val)	percpu_to_op("mov", obj, val)
+#define CPU_ADD(obj, val)	percpu_to_op("add", obj, val)
+#define CPU_SUB(obj, val)	percpu_to_op("sub", obj, val)
+#define CPU_INC(obj)		percpu_addr_op("inc", obj)
+#define CPU_DEC(obj)		percpu_addr_op("dec", obj)
+#define CPU_XCHG(obj, val)	percpu_to_op("xchg", obj, val)
+#define CPU_CMPXCHG(obj, old, new) percpu_cmpxchg_op(obj, old, new)
+
+/*
+ * All cpu operations are interrupt safe and do not need to disable
+ * preempt. So the other variants all reduce to the same instruction.
+ */
+#define _CPU_READ CPU_READ
+#define _CPU_WRITE CPU_WRITE
+#define _CPU_ADD CPU_ADD
+#define _CPU_SUB CPU_SUB
+#define _CPU_INC CPU_INC
+#define _CPU_DEC CPU_DEC
+#define _CPU_XCHG CPU_XCHG
+#define _CPU_CMPXCHG CPU_CMPXCHG
+
+#define __CPU_READ CPU_READ
+#define __CPU_WRITE CPU_WRITE
+#define __CPU_ADD CPU_ADD
+#define __CPU_SUB CPU_SUB
+#define __CPU_INC CPU_INC
+#define __CPU_DEC CPU_DEC
+#define __CPU_XCHG CPU_XCHG
+#define __CPU_CMPXCHG CPU_CMPXCHG
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_PERCPU_H_ */

-- 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
                   ` (40 preceding siblings ...)
  2008-05-30  3:57 ` [patch 41/41] x86_64: Support for cpu ops Christoph Lameter
@ 2008-05-30  4:58 ` Andrew Morton
  2008-05-30  5:03   ` Christoph Lameter
  2008-06-04 15:07   ` Mike Travis
  41 siblings, 2 replies; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  4:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 20:56:20 -0700 Christoph Lameter <clameter@sgi.com> wrote:

> In various places the kernel maintains arrays of pointers indexed by
> processor numbers. These are used to locate objects that need to be used
> when executing on a specirfic processor. Both the slab allocator
> and the page allocator use these arrays and there the arrays are used in
> performance critical code. The allocpercpu functionality is a simple
> allocator to provide these arrays.

All seems reasonable to me.  The obvious question is "how do we size
the arena".  We either waste memory or, much worse, run out.

And running out is a real possibility, I think.  Most people will only
mount a handful of XFS filesystems.  But some customer will come along
who wants to mount 5,000, and distributors will need to cater for that,
but how can they?

I wonder if we can arrange for the default to be overridden via a
kernel boot option?
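
Something along these lines, perhaps (an untested sketch; the parameter name
is made up, and it assumes the area would be carved out at boot time rather
than being a fixed-size static per cpu array as in this patch):

	static unsigned long cpu_alloc_size = CONFIG_CPU_ALLOC_SIZE;

	static int __init cpu_alloc_size_setup(char *str)
	{
		cpu_alloc_size = memparse(str, &str);
		return 1;
	}
	__setup("cpu_alloc_size=", cpu_alloc_size_setup);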


Another obvious question is "how much of a problem will we have with
internal fragmentation"?  This might be a drop-dead showstopper.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  3:56 ` [patch 02/41] cpu alloc: The allocator Christoph Lameter
@ 2008-05-30  4:58   ` Andrew Morton
  2008-05-30  5:10     ` Christoph Lameter
  2008-06-04 14:48     ` Mike Travis
  2008-05-30  5:04   ` Eric Dumazet
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  4:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 20:56:22 -0700 Christoph Lameter <clameter@sgi.com> wrote:

> The per cpu allocator allows dynamic allocation of memory on all
> processors simultaneously. A bitmap is used to track used areas.
> The allocator implements tight packing to reduce the cache footprint
> and increase speed since cacheline contention is typically not a concern
> for memory mainly used by a single cpu. Small objects will fill up gaps
> left by larger allocations that required alignments.
> 
> The size of the cpu_alloc area can be changed via make menuconfig.
> 
> ...
>
> +config CPU_ALLOC_SIZE
> +	int "Size of cpu alloc area"
> +	default "30000"

strange choice of a default?  I guess it makes it clear that there's no
particular advantage in making it a power-of-two or anything like that.

> +	help
> +	  Sets the maximum amount of memory that can be allocated via cpu_alloc
> Index: linux-2.6/mm/Makefile
> ===================================================================
> --- linux-2.6.orig/mm/Makefile	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/mm/Makefile	2008-05-29 20:15:41.000000000 -0700
> @@ -11,7 +11,7 @@ obj-y			:= bootmem.o filemap.o mempool.o
>  			   maccess.o page_alloc.o page-writeback.o pdflush.o \
>  			   readahead.o swap.o truncate.o vmscan.o \
>  			   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
> -			   page_isolation.o $(mmu-y)
> +			   page_isolation.o cpu_alloc.o $(mmu-y)
>  
>  obj-$(CONFIG_PROC_PAGE_MONITOR) += pagewalk.o
>  obj-$(CONFIG_BOUNCE)	+= bounce.o
> Index: linux-2.6/mm/cpu_alloc.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/mm/cpu_alloc.c	2008-05-29 20:13:39.000000000 -0700
> @@ -0,0 +1,167 @@
> +/*
> + * Cpu allocator - Manage objects allocated for each processor
> + *
> + * (C) 2008 SGI, Christoph Lameter <clameter@sgi.com>
> + * 	Basic implementation with allocation and free from a dedicated per
> + * 	cpu area.
> + *
> + * The per cpu allocator allows dynamic allocation of memory on all
> + * processors simultaneously. A bitmap is used to track used areas.
> + * The allocator implements tight packing to reduce the cache footprint
> + * and increase speed since cacheline contention is typically not a concern
> + * for memory mainly used by a single cpu. Small objects will fill up gaps
> + * left by larger allocations that required alignments.
> + */
> +#include <linux/mm.h>
> +#include <linux/mmzone.h>
> +#include <linux/module.h>
> +#include <linux/percpu.h>
> +#include <linux/bitmap.h>
> +#include <asm/sections.h>
> +
> +/*
> + * Basic allocation unit. A bit map is created to track the use of each
> + * UNIT_SIZE element in the cpu area.
> + */
> +#define UNIT_TYPE int
> +#define UNIT_SIZE sizeof(UNIT_TYPE)
> +#define UNITS (CONFIG_CPU_ALLOC_SIZE / UNIT_SIZE)
> +
> +static DEFINE_PER_CPU(UNIT_TYPE, area[UNITS]);
> +
> +/*
> + * How many units are needed for an object of a given size
> + */
> +static int size_to_units(unsigned long size)
> +{
> +	return DIV_ROUND_UP(size, UNIT_SIZE);
> +}

Perhaps it should return UNIT_TYPE? (ugh).

I guess there's no need to ever change that type, so no?

> +/*
> + * Lock to protect the bitmap and the meta data for the cpu allocator.
> + */
> +static DEFINE_SPINLOCK(cpu_alloc_map_lock);
> +static DECLARE_BITMAP(cpu_alloc_map, UNITS);
> +static int first_free;		/* First known free unit */

Would be nicer to move these above size_to_units(), IMO.

> +/*
> + * Mark an object as used in the cpu_alloc_map
> + *
> + * Must hold cpu_alloc_map_lock
> + */
> +static void set_map(int start, int length)
> +{
> +	while (length-- > 0)
> +		__set_bit(start++, cpu_alloc_map);
> +}

bitmap_fill()?

> +/*
> + * Mark an area as freed.
> + *
> + * Must hold cpu_alloc_map_lock
> + */
> +static void clear_map(int start, int length)
> +{
> +	while (length-- > 0)
> +		__clear_bit(start++, cpu_alloc_map);
> +}

bitmap_zero()?

> +/*
> + * Allocate an object of a certain size
> + *
> + * Returns a special pointer that can be used with CPU_PTR to find the
> + * address of the object for a certain cpu.
> + */

Should be kerneldoc, I guess.
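
I.e. something like (a sketch only):

	/**
	 * cpu_alloc - allocate an instance of an object for every processor
	 * @size:	size of the object
	 * @gfpflags:	allocation flags (__GFP_ZERO zeroes all instances)
	 * @align:	alignment requirement for the object
	 *
	 * Returns a cpu pointer that can be passed to CPU_PTR() to obtain the
	 * address of the instance for a particular cpu, or NULL if the cpu
	 * area is exhausted.
	 */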

> +void *cpu_alloc(unsigned long size, gfp_t gfpflags, unsigned long align)
> +{
> +	unsigned long start;
> +	int units = size_to_units(size);
> +	void *ptr;
> +	int first;
> +	unsigned long flags;
> +
> +	if (!size)
> +		return ZERO_SIZE_PTR;

OK, so we reuse ZERO_SIZE_PTR from kmalloc.

> +	spin_lock_irqsave(&cpu_alloc_map_lock, flags);
> +
> +	first = 1;
> +	start = first_free;
> +
> +	for ( ; ; ) {
> +
> +		start = find_next_zero_bit(cpu_alloc_map, UNITS, start);
> +		if (start >= UNITS)
> +			goto out_of_memory;
> +
> +		if (first)
> +			first_free = start;
> +
> +		/*
> +		 * Check alignment and that there is enough space after
> +		 * the starting unit.
> +		 */
> +		if (start % (align / UNIT_SIZE) == 0 &&
> +			find_next_bit(cpu_alloc_map, UNITS, start + 1)
> +							>= start + units)
> +				break;
> +		start++;
> +		first = 0;
> +	}

This is kinda bitmap_find_free_region(), only bitmap_find_free_region()
isn't quite strong enough.

Generally I think it would have been better if you had added new
primitives to the bitmap library (or enhanced existing ones) and used
them here, rather than implementing private functionality.

> +	if (first)
> +		first_free = start + units;
> +
> +	if (start + units > UNITS)
> +		goto out_of_memory;
> +
> +	set_map(start, units);
> +	__count_vm_events(CPU_BYTES, units * UNIT_SIZE);
> +
> +	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
> +
> +	ptr = per_cpu_var(area) + start;
> +
> +	if (gfpflags & __GFP_ZERO) {
> +		int cpu;
> +
> +		for_each_possible_cpu(cpu)
> +			memset(CPU_PTR(ptr, cpu), 0, size);
> +	}
> +
> +	return ptr;
> +
> +out_of_memory:
> +	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
> +	return NULL;
> +}
> +EXPORT_SYMBOL(cpu_alloc);
> +
> +/*
> + * Free an object. The pointer must be a cpu pointer allocated
> + * via cpu_alloc.
> + */
> +void cpu_free(void *start, unsigned long size)
> +{
> +	unsigned long units = size_to_units(size);
> +	unsigned long index = (int *)start - per_cpu_var(area);
> +	unsigned long flags;
> +
> +	if (!start || start == ZERO_SIZE_PTR)
> +		return;
> +
> +	BUG_ON(index >= UNITS ||
> +		!test_bit(index, cpu_alloc_map) ||
> +		!test_bit(index + units - 1, cpu_alloc_map));

If this assertion triggers for someone, you'll wish like hell that it
had been implemented as three separate BUG_ONs.
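
I.e. something like:

	BUG_ON(index >= UNITS);
	BUG_ON(!test_bit(index, cpu_alloc_map));
	BUG_ON(!test_bit(index + units - 1, cpu_alloc_map));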

> +	spin_lock_irqsave(&cpu_alloc_map_lock, flags);
> +
> +	clear_map(index, units);
> +	__count_vm_events(CPU_BYTES, -units * UNIT_SIZE);
> +
> +	if (index < first_free)
> +		first_free = index;
> +
> +	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
> +}
> +EXPORT_SYMBOL(cpu_free);
> Index: linux-2.6/mm/vmstat.c
> ===================================================================
> --- linux-2.6.orig/mm/vmstat.c	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/mm/vmstat.c	2008-05-29 20:13:39.000000000 -0700
> @@ -653,6 +653,7 @@ static const char * const vmstat_text[] 
>  	"allocstall",
>  
>  	"pgrotated",
> +	"cpu_bytes",
>  #ifdef CONFIG_HUGETLB_PAGE
>  	"htlb_buddy_alloc_success",
>  	"htlb_buddy_alloc_fail",
> Index: linux-2.6/include/linux/percpu.h
> ===================================================================
> --- linux-2.6.orig/include/linux/percpu.h	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/include/linux/percpu.h	2008-05-29 20:29:12.000000000 -0700
> @@ -135,4 +135,50 @@ static inline void percpu_free(void *__p
>  #define free_percpu(ptr)	percpu_free((ptr))
>  #define per_cpu_ptr(ptr, cpu)	percpu_ptr((ptr), (cpu))
>  
> +
> +/*
> + * cpu allocator definitions
> + *
> + * The cpu allocator allows allocating an instance of an object for each
> + * processor and the use of a single pointer to access all instances
> + * of the object. cpu_alloc provides optimized means for accessing the
> + * instance of the object belonging to the currently executing processor
> + * as well as special atomic operations on fields of objects of the
> + * currently executing processor.
> + *
> + * Cpu objects are typically small. The allocator packs them tightly
> + * to increase the chance on each access that a per cpu object is already
> + * cached. Alignments may be specified but the intent is to align the data
> + * properly due to cpu alignment constraints and not to avoid cacheline
> + * contention. Any holes left by aligning objects are filled up with smaller
> + * objects that are allocated later.
> + *
> + * Cpu data can be allocated using CPU_ALLOC. The resulting pointer is
> + * pointing to the instance of the variable in the per cpu area provided
> + * by the loader. It is generally an error to use the pointer directly
> + * unless we are booting the system.
> + *
> + * __GFP_ZERO may be passed as a flag to zero the allocated memory.
> + */
> +
> +/* Return a pointer to the instance of a object for a particular processor */
> +#define CPU_PTR(__p, __cpu)	SHIFT_PERCPU_PTR((__p), per_cpu_offset(__cpu))

eek, a major interface function which is ALL IN CAPS!

can we do this in lower-case?  In a C function?

> +/*
> + * Return a pointer to the instance of the object belonging to the processor
> + * running the current code.
> + */
> +#define THIS_CPU(__p)	SHIFT_PERCPU_PTR((__p), my_cpu_offset)
> +#define __THIS_CPU(__p)	SHIFT_PERCPU_PTR((__p), __my_cpu_offset)
> +
> +#define CPU_ALLOC(type, flags)	((typeof(type) *)cpu_alloc(sizeof(type), \
> +					(flags), __alignof__(type)))
> +#define CPU_FREE(pointer)	cpu_free((pointer), sizeof(*(pointer)))

Dittoes.

> +/*
> + * Raw calls
> + */
> +void *cpu_alloc(unsigned long size, gfp_t flags, unsigned long align);
> +void cpu_free(void *cpu_pointer, unsigned long size);
> +
>  #endif /* __LINUX_PERCPU_H */


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 03/41] cpu alloc: Use cpu allocator instead of the builtin modules per cpu allocator
  2008-05-30  3:56 ` [patch 03/41] cpu alloc: Use cpu allocator instead of the builtin modules per cpu allocator Christoph Lameter
@ 2008-05-30  4:58   ` Andrew Morton
  2008-05-30  5:14     ` Christoph Lameter
  2008-05-30  6:08   ` Rusty Russell
  1 sibling, 1 reply; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  4:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 20:56:23 -0700 Christoph Lameter <clameter@sgi.com> wrote:

> Remove the builtin per cpu allocator from modules.c and use cpu_alloc instead.
> 
> The patch also removes PERCPU_ENOUGH_ROOM. The size of the cpu_alloc area is
> determined by CONFIG_CPU_AREA_SIZE. PERCPU_ENOUGH_ROOMs default was 8k.
> CONFIG_CPU_AREA_SIZE defaults to 30k. Thus we have more space to load modules.
> 
>
> ...
>
> +		unsigned long align = sechdrs[pcpuindex].sh_addralign;
> +		unsigned long size = sechdrs[pcpuindex].sh_size;
> +
> +		if (align > PAGE_SIZE) {
> +			printk(KERN_WARNING "%s: per-cpu alignment %li > %li\n",
> +			mod->name, align, PAGE_SIZE);

Indenting broke.

Alas, PAGE_SIZE has, iirc, unsigned type on some architectures and
unsigned long on others.  I suspect you'll need to cast it to be able
to print it.

> +			align = PAGE_SIZE;
> +		}
> +		percpu = cpu_alloc(size, GFP_KERNEL|__GFP_ZERO, align);
> +		if (!percpu)
> +			printk(KERN_WARNING "Could not allocate %lu bytes percpu data\n",

80-col bustage.

A printk like this should, I think, identify what part of the kernel it
came from.

But really, I don't think any printk should be present here. 
cpu_alloc() itself should dump the warning and the backtrace when it
runs out.  Because a cpu_alloc() failure is a major catastrophe.  It
probably means a reconfigure-and-reboot cycle.

Right now it means a
reconfigure-kernel-rebuild-kernel-reinstall-kernel-then-reboot cycle. 
Or a call-vendor-complain-pay-money-and-wait cycle.  But I hope we can
fix that with the boot parameter thing?



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-05-30  3:56 ` [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations Christoph Lameter
@ 2008-05-30  4:58   ` Andrew Morton
  2008-05-30  5:17     ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  4:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 20:56:24 -0700 Christoph Lameter <clameter@sgi.com> wrote:

> Currently the per cpu subsystem is not able to use the atomic capabilities
> that are provided by many of the available processors.
> 
> This patch adds new functionality that allows the optimizing of per cpu
> variable handling. In particular it provides a simple way to exploit
> atomic operations in order to avoid having to disable interrupts or
> performing address calculation to access per cpu data.
> 
> F.e. Using our current methods we may do
> 
> 	unsigned long flags;
> 	struct stat_struct *p;
> 
> 	local_irq_save(flags);
> 	/* Calculate address of per processor area */
> 	p = CPU_PTR(stat, smp_processor_id());
> 	p->counter++;
> 	local_irq_restore(flags);

eh?  That's what local_t is for?

> The segment can be replaced by a single atomic CPU operation:
> 
> 	CPU_INC(stat->counter);

hm, I guess this _has_ to be implemented as a macro.  ho hum.  But
please: "cpu_inc"?

> Most processors have instructions to perform the increment using a
> a single atomic instruction. Processors may have segment registers,
> global registers or per cpu mappings of per cpu areas that can be used
> to generate atomic instructions that combine the following in a single
> operation:
> 
> 1. Adding of an offset / register to a base address
> 2. Read modify write operation on the address calculated by
>    the instruction.
> 
> If 1+2 are combined in an instruction then the instruction is atomic
> vs interrupts. This means that percpu atomic operations do not need
> to disable interrupts to increments counters etc.
> 
> The existing methods in use in the kernel cannot utilize the power of
> these atomic instructions. local_t is not really addressing the issue
> since the offset calculation is performed before the atomic operation. The
> operation is therefore not atomic. Disabling interrupts or preemption is
> required in order to use local_t.

Your terminology is totally confusing here.

To me, an "atomic operation" is one which is atomic wrt other CPUs:
atomic_t, for example.

Here we're talking about atomic-wrt-this-cpu-only, yes?

If so, we should invent a new term for that different concept and stick
to it like glue.  How about "self-atomic"?  Or "locally-atomic" in
deference to the existing local_t?

> local_t is also very specific to the x86 processor.

And alpha, m32r, mips and powerpc, methinks.  Probably others, but
people just haven't got around to it.

> The solution here can
> utilize other methods than just those provided by the x86 instruction set.
> 
> 
> 
> On x86 the above CPU_INC translated into a single instruction:
> 
> 	inc %%gs:(&stat->counter)
> 
> This instruction is interrupt safe since it can either be completed
> or not. Both adding of the offset and the read modify write are combined
> in one instruction.
> 
> The determination of the correct per cpu area for the current processor
> does not require access to smp_processor_id() (expensive...). The gs
> register is used to provide a processor specific offset to the respective
> per cpu area where the per cpu variable resides.
> 
> Note that the counter offset into the struct was added *before* the segment
> selector was added. This is necessary to avoid calculations.  In the past
> we first determine the address of the stats structure on the respective
> processor and then added the field offset. However, the offset may as
> well be added earlier. The adding of the per cpu offset (here through the
> gs register) must be done by the instruction used for atomic per cpu
> access.
> 
> 
> 
> If "stat" was declared via DECLARE_PER_CPU then this patchset is capable of
> convincing the linker to provide the proper base address. In that case
> no calculations are necessary.
> 
> Should the stat structure be reachable via a register then the address
> calculation capabilities can be leveraged to avoid calculations.
> 
> On IA64 we can get the same combination of operations in a single instruction
> by using the virtual address that always maps to the local per cpu area:
> 
> 	fetchadd &stat->counter + (VCPU_BASE - __per_cpu_start)
> 
> The access is forced into the per cpu address reachable via the virtualized
> address. IA64 allows the embedding of an offset into the instruction. So the
> fetchadd can perform both the relocation of the pointer into the per cpu
> area as well as the atomic read modify write cycle.
> 
> 
> 
> In order to be able to exploit the atomicity of these instructions we
> introduce a series of new functions that take either:
> 
> 1. A per cpu pointer as returned by cpu_alloc() or CPU_ALLOC().
> 
> 2. A per cpu variable address as returned by per_cpu_var(<percpuvarname>).
> 
> CPU_READ()
> CPU_WRITE()
> CPU_INC
> CPU_DEC
> CPU_ADD
> CPU_SUB
> CPU_XCHG
> CPU_CMPXCHG
> 

I think I'll need to come back another time to understand all that ;)

Thanks for writing it up carefully.

> 
> ---
>  include/linux/percpu.h |  135 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 135 insertions(+)
> 
> Index: linux-2.6/include/linux/percpu.h
> ===================================================================
> --- linux-2.6.orig/include/linux/percpu.h	2008-05-28 22:31:43.000000000 -0700
> +++ linux-2.6/include/linux/percpu.h	2008-05-28 23:38:17.000000000 -0700

I wonder if all this stuff should be in a new header file.

We could get lazy and include that header from percpu.h if needed.

> @@ -179,4 +179,139 @@
>  void *cpu_alloc(unsigned long size, gfp_t flags, unsigned long align);
>  void cpu_free(void *cpu_pointer, unsigned long size);
>  
> +/*
> + * Fast atomic per cpu operations.
> + *
> + * The following operations can be overridden by arches to implement fast
> + * and efficient operations. The operations are atomic meaning that the
> + * determination of the processor, the calculation of the address and the
> + * operation on the data is an atomic operation.
> + *
> + * The parameter passed to the atomic per cpu operations is an lvalue not a
> + * pointer to the object.
> + */
> +#ifndef CONFIG_HAVE_CPU_OPS

If you move this functionality into a new cpu_alloc.h then the below
code goes into include/asm-generic/cpu_alloc.h and most architectures'
include/asm/cpu_alloc.h will include asm-generic/cpu_alloc.h.

include/linux/percpu.h can still include linux/cpu_alloc.h (which
includes asm/cpu_alloc.h) if needed.  But it would be better to just
teach the .c files to include <linux/cpu_alloc.h>
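
Roughly this layering (a sketch of the suggestion; file names assumed):

	/* include/asm-generic/cpu_alloc.h: the fallback __CPU_*/_CPU_*/CPU_* macros below */

	/* include/asm-<arch>/cpu_alloc.h, for most architectures: */
	#include <asm-generic/cpu_alloc.h>

	/* include/linux/cpu_alloc.h: */
	#include <asm/cpu_alloc.h>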

> +/*
> + * Fallback in case the arch does not provide for atomic per cpu operations.
> + *
> + * The first group of macros is used when it is safe to update the per
> + * cpu variable because preemption is off (per cpu variables that are not
> + * updated from interrupt context) or because interrupts are already off.
> + */
> +#define __CPU_READ(var)				\
> +({						\
> +	(*THIS_CPU(&(var)));			\
> +})
> +
> +#define __CPU_WRITE(var, value)			\
> +({						\
> +	*THIS_CPU(&(var)) = (value);		\
> +})
> +
> +#define __CPU_ADD(var, value)			\
> +({						\
> +	*THIS_CPU(&(var)) += (value);		\
> +})
> +
> +#define __CPU_INC(var) __CPU_ADD((var), 1)
> +#define __CPU_DEC(var) __CPU_ADD((var), -1)
> +#define __CPU_SUB(var, value) __CPU_ADD((var), -(value))
> +
> +#define __CPU_CMPXCHG(var, old, new)		\
> +({						\
> +	typeof(var) x;				\
> +	typeof(var) *p = THIS_CPU(&(var));	\
> +	x = *p;					\
> +	if (x == (old))				\
> +		*p = (new);			\
> +	(x);					\
> +})
> +
> +#define __CPU_XCHG(obj, new)			\
> +({						\
> +	typeof(obj) x;				\
> +	typeof(obj) *p = THIS_CPU(&(obj));	\
> +	x = *p;					\
> +	*p = (new);				\
> +	(x);					\
> +})
> +
> +/*
> + * Second group used for per cpu variables that are not updated from an
> + * interrupt context. In that case we can simply disable preemption which
> + * may be free if the kernel is compiled without support for preemption.
> + */
> +#define _CPU_READ __CPU_READ
> +#define _CPU_WRITE __CPU_WRITE
> +
> +#define _CPU_ADD(var, value)			\
> +({						\
> +	preempt_disable();			\
> +	__CPU_ADD((var), (value));		\
> +	preempt_enable();			\
> +})
> +
> +#define _CPU_INC(var) _CPU_ADD((var), 1)
> +#define _CPU_DEC(var) _CPU_ADD((var), -1)
> +#define _CPU_SUB(var, value) _CPU_ADD((var), -(value))
> +
> +#define _CPU_CMPXCHG(var, old, new)		\
> +({						\
> +	typeof(var) x;				\
> +	preempt_disable();			\
> +	x = __CPU_CMPXCHG((var), (old), (new));	\
> +	preempt_enable();			\
> +	(x);					\
> +})
> +
> +#define _CPU_XCHG(var, new)			\
> +({						\
> +	typeof(var) x;				\
> +	preempt_disable();			\
> +	x = __CPU_XCHG((var), (new));		\
> +	preempt_enable();			\
> +	(x);					\
> +})
> +
> +/*
> + * Third group: Interrupt safe CPU functions
> + */
> +#define CPU_READ __CPU_READ
> +#define CPU_WRITE __CPU_WRITE
> +
> +#define CPU_ADD(var, value)			\
> +({						\
> +	unsigned long flags;			\
> +	local_irq_save(flags);			\
> +	__CPU_ADD((var), (value));		\
> +	local_irq_restore(flags);		\
> +})
> +
> +#define CPU_INC(var) CPU_ADD((var), 1)
> +#define CPU_DEC(var) CPU_ADD((var), -1)
> +#define CPU_SUB(var, value) CPU_ADD((var), -(value))
> +
> +#define CPU_CMPXCHG(var, old, new)		\
> +({						\
> +	unsigned long flags;			\
> +	typeof(var) x;				\
> +	local_irq_save(flags);			\
> +	x = __CPU_CMPXCHG((var), (old), (new));	\
> +	local_irq_restore(flags);		\
> +	(x);					\
> +})
> +
> +#define CPU_XCHG(var, new)			\
> +({						\
> +	unsigned long flags;			\
> +	typeof(var) x;				\
> +	local_irq_save(flags);			\
> +	x = __CPU_XCHG((var), (new));		\
> +	local_irq_restore(flags);		\
> +	(x);					\
> +})
> +
> +#endif /* CONFIG_HAVE_CPU_OPS */
> +
>  #endif /* __LINUX_PERCPU_H */


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 27/41] cpu alloc: Remove the allocpercpu functionality
  2008-05-30  3:56 ` [patch 27/41] cpu alloc: Remove the allocpercpu functionality Christoph Lameter
@ 2008-05-30  4:58   ` Andrew Morton
  0 siblings, 0 replies; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  4:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 20:56:47 -0700 Christoph Lameter <clameter@sgi.com> wrote:

> There is no user of allocpercpu left after all the earlier patches were
> applied. Remove the allocpercpu code.

Wow.

y:/usr/src/25> grep alloc_percpu patches/*.patch
y:/usr/src/25> 

we might just be able to get away with doing this.

But it might make life easier to defer the removal of the old stuff for
a while.  This can be worked out later.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 35/41] Support for CPU ops
  2008-05-30  3:56 ` [patch 35/41] Support for CPU ops Christoph Lameter
@ 2008-05-30  4:58   ` Andrew Morton
  2008-05-30  5:18     ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  4:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-arch, Tony.Luck, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 20:56:55 -0700 Christoph Lameter <clameter@sgi.com> wrote:

> Subject: [patch 35/41] Support for CPU ops

Should be called "ia64: support for CPU ops", please.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  4:58 ` [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Andrew Morton
@ 2008-05-30  5:03   ` Christoph Lameter
  2008-05-30  5:21     ` Andrew Morton
  2008-06-04 15:07   ` Mike Travis
  1 sibling, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  5:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008, Andrew Morton wrote:

> All seems reasonable to me.  The obvious question is "how do we size
> the arena".  We either waste memory or, much worse, run out.

The per cpu memory use by subsystems is typically quite small. We already 
have an 8k limitation for percpu space for modules. And that does not seem 
to be a problem.

> And running out is a real possibility, I think.  Most people will only
> mount a handful of XFS filesystems.  But some customer will come along
> who wants to mount 5,000, and distributors will need to cater for that,
> but how can they?

Typically these are fairly small: 8 bytes * 5000 is only about 40k.

> I wonder if we can arrange for the default to be overridden via a
> kernel boot option?

We could do that yes.
 
> Another obvious question is "how much of a problem will we have with
> internal fragmentation"?  This might be a drop-dead showstopper.

But then per cpu data is not frequently allocated and freed.

Going away from allocpercpu saves a lot of memory. We could make this 
128k or so to be safe?



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  3:56 ` [patch 02/41] cpu alloc: The allocator Christoph Lameter
  2008-05-30  4:58   ` Andrew Morton
@ 2008-05-30  5:04   ` Eric Dumazet
  2008-05-30  5:20     ` Christoph Lameter
  2008-05-30  5:46   ` Rusty Russell
  2008-05-31 20:58   ` Pavel Machek
  3 siblings, 1 reply; 139+ messages in thread
From: Eric Dumazet @ 2008-05-30  5:04 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-arch, linux-kernel, David Miller, Peter Zijlstra,
	Rusty Russell, Mike Travis

Christoph Lameter wrote:
> The per cpu allocator allows dynamic allocation of memory on all
> processors simultaneously. A bitmap is used to track used areas.
> The allocator implements tight packing to reduce the cache footprint
> and increase speed since cacheline contention is typically not a concern
> for memory mainly used by a single cpu. Small objects will fill up gaps
> left by larger allocations that required alignments.
>
> The size of the cpu_alloc area can be changed via make menuconfig.
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> ---
>  include/linux/percpu.h |   46 +++++++++++++
>  include/linux/vmstat.h |    2 
>  mm/Kconfig             |    6 +
>  mm/Makefile            |    2 
>  mm/cpu_alloc.c         |  167 +++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/vmstat.c            |    1 
>  6 files changed, 222 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/include/linux/vmstat.h
> ===================================================================
> --- linux-2.6.orig/include/linux/vmstat.h	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/include/linux/vmstat.h	2008-05-29 20:15:37.000000000 -0700
> @@ -37,7 +37,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
>  		FOR_ALL_ZONES(PGSCAN_KSWAPD),
>  		FOR_ALL_ZONES(PGSCAN_DIRECT),
>  		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
> -		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> +		PAGEOUTRUN, ALLOCSTALL, PGROTATED, CPU_BYTES,
>  #ifdef CONFIG_HUGETLB_PAGE
>  		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
>  #endif
> Index: linux-2.6/mm/Kconfig
> ===================================================================
> --- linux-2.6.orig/mm/Kconfig	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/mm/Kconfig	2008-05-29 20:13:39.000000000 -0700
> @@ -205,3 +205,9 @@ config NR_QUICK
>  config VIRT_TO_BUS
>  	def_bool y
>  	depends on !ARCH_NO_VIRT_TO_BUS
> +
> +config CPU_ALLOC_SIZE
> +	int "Size of cpu alloc area"
> +	default "30000"
> +	help
> +	  Sets the maximum amount of memory that can be allocated via cpu_alloc
> Index: linux-2.6/mm/Makefile
> ===================================================================
> --- linux-2.6.orig/mm/Makefile	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/mm/Makefile	2008-05-29 20:15:41.000000000 -0700
> @@ -11,7 +11,7 @@ obj-y			:= bootmem.o filemap.o mempool.o
>  			   maccess.o page_alloc.o page-writeback.o pdflush.o \
>  			   readahead.o swap.o truncate.o vmscan.o \
>  			   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
> -			   page_isolation.o $(mmu-y)
> +			   page_isolation.o cpu_alloc.o $(mmu-y)
>  
>  obj-$(CONFIG_PROC_PAGE_MONITOR) += pagewalk.o
>  obj-$(CONFIG_BOUNCE)	+= bounce.o
> Index: linux-2.6/mm/cpu_alloc.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/mm/cpu_alloc.c	2008-05-29 20:13:39.000000000 -0700
> @@ -0,0 +1,167 @@
> +/*
> + * Cpu allocator - Manage objects allocated for each processor
> + *
> + * (C) 2008 SGI, Christoph Lameter <clameter@sgi.com>
> + * 	Basic implementation with allocation and free from a dedicated per
> + * 	cpu area.
> + *
> + * The per cpu allocator allows dynamic allocation of memory on all
> + * processors simultaneously. A bitmap is used to track used areas.
> + * The allocator implements tight packing to reduce the cache footprint
> + * and increase speed since cacheline contention is typically not a concern
> + * for memory mainly used by a single cpu. Small objects will fill up gaps
> + * left by larger allocations that required alignments.
> + */
> +#include <linux/mm.h>
> +#include <linux/mmzone.h>
> +#include <linux/module.h>
> +#include <linux/percpu.h>
> +#include <linux/bitmap.h>
> +#include <asm/sections.h>
> +
> +/*
> + * Basic allocation unit. A bit map is created to track the use of each
> + * UNIT_SIZE element in the cpu area.
> + */
> +#define UNIT_TYPE int
> +#define UNIT_SIZE sizeof(UNIT_TYPE)
> +#define UNITS (CONFIG_CPU_ALLOC_SIZE / UNIT_SIZE)
> +
> +static DEFINE_PER_CPU(UNIT_TYPE, area[UNITS]);
>   
area[] is not guaranteed to be aligned on anything but 4 bytes.

If someone then needs to call cpu_alloc(8, GFP_KERNEL, 8), it might get
a non-aligned result.

Either you should add an __attribute__((__aligned__(PAGE_SIZE))), or take
into account the real address of area[] in cpu_alloc() to avoid wasting up
to PAGE_SIZE bytes per cpu.
> +
> +/*
> + * How many units are needed for an object of a given size
> + */
> +static int size_to_units(unsigned long size)
> +{
> +	return DIV_ROUND_UP(size, UNIT_SIZE);
> +}
> +
> +/*
> + * Lock to protect the bitmap and the meta data for the cpu allocator.
> + */
> +static DEFINE_SPINLOCK(cpu_alloc_map_lock);
> +static DECLARE_BITMAP(cpu_alloc_map, UNITS);
> +static int first_free;		/* First known free unit */
> +
> +/*
> + * Mark an object as used in the cpu_alloc_map
> + *
> + * Must hold cpu_alloc_map_lock
> + */
> +static void set_map(int start, int length)
> +{
> +	while (length-- > 0)
> +		__set_bit(start++, cpu_alloc_map);
> +}
> +
> +/*
> + * Mark an area as freed.
> + *
> + * Must hold cpu_alloc_map_lock
> + */
> +static void clear_map(int start, int length)
> +{
> +	while (length-- > 0)
> +		__clear_bit(start++, cpu_alloc_map);
> +}
> +
> +/*
> + * Allocate an object of a certain size
> + *
> + * Returns a special pointer that can be used with CPU_PTR to find the
> + * address of the object for a certain cpu.
> + */
> +void *cpu_alloc(unsigned long size, gfp_t gfpflags, unsigned long align)
> +{
> +	unsigned long start;
> +	int units = size_to_units(size);
> +	void *ptr;
> +	int first;
> +	unsigned long flags;
> +
> +	if (!size)
> +		return ZERO_SIZE_PTR;
> +
> +	spin_lock_irqsave(&cpu_alloc_map_lock, flags);
> +
> +	first = 1;
> +	start = first_free;
> +
> +	for ( ; ; ) {
> +
> +		start = find_next_zero_bit(cpu_alloc_map, UNITS, start);
> +		if (start >= UNITS)
> +			goto out_of_memory;
> +
> +		if (first)
> +			first_free = start;
> +
> +		/*
> +		 * Check alignment and that there is enough space after
> +		 * the starting unit.
> +		 */
> +		if (start % (align / UNIT_SIZE) == 0 &&
> +			find_next_bit(cpu_alloc_map, UNITS, start + 1)
> +							>= start + units)
> +				break;
> +		start++;
> +		first = 0;
> +	}
> +
> +	if (first)
> +		first_free = start + units;
> +
> +	if (start + units > UNITS)
> +		goto out_of_memory;
> +
> +	set_map(start, units);
> +	__count_vm_events(CPU_BYTES, units * UNIT_SIZE);
> +
> +	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
> +
> +	ptr = per_cpu_var(area) + start;
> +
> +	if (gfpflags & __GFP_ZERO) {
> +		int cpu;
> +
> +		for_each_possible_cpu(cpu)
> +			memset(CPU_PTR(ptr, cpu), 0, size);
> +	}
> +
> +	return ptr;
> +
> +out_of_memory:
> +	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
> +	return NULL;
> +}
> +EXPORT_SYMBOL(cpu_alloc);
> +
> +/*
> + * Free an object. The pointer must be a cpu pointer allocated
> + * via cpu_alloc.
> + */
> +void cpu_free(void *start, unsigned long size)
> +{
> +	unsigned long units = size_to_units(size);
> +	unsigned long index = (int *)start - per_cpu_var(area);
> +	unsigned long flags;
> +
> +	if (!start || start == ZERO_SIZE_PTR)
> +		return;
> +
> +	BUG_ON(index >= UNITS ||
> +		!test_bit(index, cpu_alloc_map) ||
> +		!test_bit(index + units - 1, cpu_alloc_map));
> +
> +	spin_lock_irqsave(&cpu_alloc_map_lock, flags);
> +
> +	clear_map(index, units);
> +	__count_vm_events(CPU_BYTES, -units * UNIT_SIZE);
> +
> +	if (index < first_free)
> +		first_free = index;
> +
> +	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
> +}
> +EXPORT_SYMBOL(cpu_free);
> Index: linux-2.6/mm/vmstat.c
> ===================================================================
> --- linux-2.6.orig/mm/vmstat.c	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/mm/vmstat.c	2008-05-29 20:13:39.000000000 -0700
> @@ -653,6 +653,7 @@ static const char * const vmstat_text[] 
>  	"allocstall",
>  
>  	"pgrotated",
> +	"cpu_bytes",
>  #ifdef CONFIG_HUGETLB_PAGE
>  	"htlb_buddy_alloc_success",
>  	"htlb_buddy_alloc_fail",
> Index: linux-2.6/include/linux/percpu.h
> ===================================================================
> --- linux-2.6.orig/include/linux/percpu.h	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/include/linux/percpu.h	2008-05-29 20:29:12.000000000 -0700
> @@ -135,4 +135,50 @@ static inline void percpu_free(void *__p
>  #define free_percpu(ptr)	percpu_free((ptr))
>  #define per_cpu_ptr(ptr, cpu)	percpu_ptr((ptr), (cpu))
>  
> +
> +/*
> + * cpu allocator definitions
> + *
> + * The cpu allocator allows allocating an instance of an object for each
> + * processor and the use of a single pointer to access all instances
> + * of the object. cpu_alloc provides optimized means for accessing the
> + * instance of the object belonging to the currently executing processor
> + * as well as special atomic operations on fields of objects of the
> + * currently executing processor.
> + *
> + * Cpu objects are typically small. The allocator packs them tightly
> + * to increase the chance on each access that a per cpu object is already
> + * cached. Alignments may be specified but the intent is to align the data
> + * properly due to cpu alignment constraints and not to avoid cacheline
> + * contention. Any holes left by aligning objects are filled up with smaller
> + * objects that are allocated later.
> + *
> + * Cpu data can be allocated using CPU_ALLOC. The resulting pointer is
> + * pointing to the instance of the variable in the per cpu area provided
> + * by the loader. It is generally an error to use the pointer directly
> + * unless we are booting the system.
> + *
> + * __GFP_ZERO may be passed as a flag to zero the allocated memory.
> + */
> +
> +/* Return a pointer to the instance of a object for a particular processor */
> +#define CPU_PTR(__p, __cpu)	SHIFT_PERCPU_PTR((__p), per_cpu_offset(__cpu))
> +
> +/*
> + * Return a pointer to the instance of the object belonging to the processor
> + * running the current code.
> + */
> +#define THIS_CPU(__p)	SHIFT_PERCPU_PTR((__p), my_cpu_offset)
> +#define __THIS_CPU(__p)	SHIFT_PERCPU_PTR((__p), __my_cpu_offset)
> +
> +#define CPU_ALLOC(type, flags)	((typeof(type) *)cpu_alloc(sizeof(type), \
> +					(flags), __alignof__(type)))
> +#define CPU_FREE(pointer)	cpu_free((pointer), sizeof(*(pointer)))
> +
> +/*
> + * Raw calls
> + */
> +void *cpu_alloc(unsigned long size, gfp_t flags, unsigned long align);
> +void cpu_free(void *cpu_pointer, unsigned long size);
> +
>  #endif /* __LINUX_PERCPU_H */
>
>   





^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  4:58   ` Andrew Morton
@ 2008-05-30  5:10     ` Christoph Lameter
  2008-05-30  5:31       ` Andrew Morton
  2008-05-30  5:56       ` KAMEZAWA Hiroyuki
  2008-06-04 14:48     ` Mike Travis
  1 sibling, 2 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  5:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008, Andrew Morton wrote:

> > +config CPU_ALLOC_SIZE
> > +	int "Size of cpu alloc area"
> > +	default "30000"
> 
> strange choice of a default?  I guess it makes it clear that there's no
> particular advantage in making it a power-of-two or anything like that.

The cpu alloc has a cpu_bytes field in vmstat that shows how much memory 
is being used. 30000 seemed to be reasonable after staring at these 
numbers for awhile.

> > +static int size_to_units(unsigned long size)
> > +{
> > +	return DIV_ROUND_UP(size, UNIT_SIZE);
> > +}
> 
> Perhaps it should return UNIT_TYPE? (ugh).

No. The UNIT_TYPE is the basic allocation unit. This returns the number of 
allocation units.
 
> I guess there's no need to ever change that type, so no?

We could go to finer or coarser grained someday? Maybe if the area becomes 
1M of size of so we could go to 8 bytes?

> > +static DEFINE_SPINLOCK(cpu_alloc_map_lock);
> > +static DECLARE_BITMAP(cpu_alloc_map, UNITS);
> > +static int first_free;		/* First known free unit */
> 
> Would be nicer to move these above size_to_units(), IMO.

size_to_units() is fairly basic for most of the logic. These are variables
that manage the allocator state.

> > +static void set_map(int start, int length)
> > +{
> > +	while (length-- > 0)
> > +		__set_bit(start++, cpu_alloc_map);
> > +}
> 
> bitmap_fill()?

Good idea.

> > + */
> > +static void clear_map(int start, int length)
> > +{
> > +	while (length-- > 0)
> > +		__clear_bit(start++, cpu_alloc_map);
> > +}
> 
> bitmap_zero()?

Ditto.

> > +	if (!size)
> > +		return ZERO_SIZE_PTR;
> 
> OK, so we reuse ZERO_SIZE_PTR from kmalloc.

Well yes slab convention...

> > +		start++;
> > +		first = 0;
> > +	}
> 
> This is kinda bitmap_find_free_region(), only bitmap_find_free_region()
> isn't quite strong enough.
> 
> Generally I think it would have been better if you had added new
> primitives to the bitmap library (or enhanced existing ones) and used
> them here, rather than implementing private functionality.

The scope of the patchset is already fairly large. The search here is 
different and not performance critical. Not sure if this is useful for 
other purposes.

> > +
> > +	BUG_ON(index >= UNITS ||
> > +		!test_bit(index, cpu_alloc_map) ||
> > +		!test_bit(index + units - 1, cpu_alloc_map));
> 
> If this assertion triggers for someone, you'll wish like hell that it
> had been implemented as three separate BUG_ONs.

Ok. But in all cases we have an invalid index.

> > + */
> > +
> > +/* Return a pointer to the instance of a object for a particular processor */
> > +#define CPU_PTR(__p, __cpu)	SHIFT_PERCPU_PTR((__p), per_cpu_offset(__cpu))
> 
> eek, a major interface function which is ALL IN CAPS!
> 
> can we do this in lower-case?  In a C function?

No. This is a macro and therefore uppercase (there is macro magic going on
that people need to be aware of). AFAICR you wanted it this way last year. A C
function is not possible because of the type checking.
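
E.g. (illustrative):

	struct stats *stats = CPU_ALLOC(struct stats, GFP_KERNEL);
	struct stats *p = CPU_PTR(stats, cpu);	/* still a struct stats * */

A C function would have to take and return void *, losing the type
information that the macros preserve.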


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 03/41] cpu alloc: Use cpu allocator instead of the builtin modules per cpu allocator
  2008-05-30  4:58   ` Andrew Morton
@ 2008-05-30  5:14     ` Christoph Lameter
  2008-05-30  5:34       ` Andrew Morton
  0 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  5:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008, Andrew Morton wrote:

> > +			printk(KERN_WARNING "%s: per-cpu alignment %li > %li\n",
> > +			mod->name, align, PAGE_SIZE);
> 
> Indenting broke.

Hmmm. Okay.
 
> Alas, PAGE_SIZE has, iirc, unsigned type on some architectures and
> unsigned long on others.  I suspect you'll need to cast it to be able
> to print it.

This is code that was moved.

> > +		percpu = cpu_alloc(size, GFP_KERNEL|__GFP_ZERO, align);
> > +		if (!percpu)
> > +			printk(KERN_WARNING "Could not allocate %lu bytes percpu data\n",
> 
> 80-col bustage,.
> 
> A printk like this should, I think, identify what part of the kernel it
> came from.

Again moved code. Should I really do string separations for code 
that is moved?

> But really, I don't think any printk should be present here. 
> cpu_alloc() itself should dump the warning and the backtrace when it
> runs out.  Because a cpu_alloc() failure is a major catastrophe.  It
> probably means a reconfigure-and-reboot cycle.

The code has been able to deal with an allocpercpu failure in the 
past. Why would it have trouble with a cpu_alloc failure here?


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-05-30  4:58   ` Andrew Morton
@ 2008-05-30  5:17     ` Christoph Lameter
  2008-05-30  5:38       ` Andrew Morton
  2008-05-30  6:32       ` Rusty Russell
  0 siblings, 2 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  5:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008, Andrew Morton wrote:

> > 	local_irq_save(flags);
> > 	/* Calculate address of per processor area */
> > 	p = CPU_PTR(stat, smp_processor_id());
> > 	p->counter++;
> > 	local_irq_restore(flags);
> 
> eh?  That's what local_t is for?

No, that is exactly what local_t cannot do.

> > The segment can be replaced by a single atomic CPU operation:
> > 
> > 	CPU_INC(stat->counter);
> 
> hm, I guess this _has_ to be implemented as a macro.  ho hum.  But
> please: "cpu_inc"?

A lowercase macro?

> > The existing methods in use in the kernel cannot utilize the power of
> > these atomic instructions. local_t is not really addressing the issue
> > since the offset calculation is performed before the atomic operation. The
> > operation is therefore not atomic. Disabling interrupts or preemption is
> > required in order to use local_t.
> 
> Your terminology is totally confusing here.
> 
> To me, an "atomic operation" is one which is atomic wrt other CPUs:
> atomic_t, for example.
> 
> Here we're talking about atomic-wrt-this-cpu-only, yes?

Right.
 
> > local_t is also very specific to the x86 processor.
> 
> And alpha, m32r, mips and powerpc, methinks.  Probably others, but
> people just haven't got around to it.

No, local_t does not do the relocation of the address to the correct percpu
area. It requires disabling of interrupts etc. It is not atomic (wrt
interrupts) because of that.
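
I.e. with local_t one still ends up with something like (a sketch; the
counter here is a hypothetical DEFINE_PER_CPU(local_t, ...) variable):

	preempt_disable();
	local_inc(&__get_cpu_var(my_local_counter));
	preempt_enable();

The address calculation in __get_cpu_var() is a separate step from the
increment, so preemption (or interrupts) must stay off across it. CPU_INC
folds the relocation and the read-modify-write into one instruction.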
 
> I think I'll need to come back another time to understand all that ;)
> 
> Thanks for writing it up carefully.

Well this stuff is so large in scope that I have difficulties keeping 
everything straight.

> I wonder if all this stuff should be in a new header file.
> 
> We could get lazy and include that header from percpu.h if needed.

But then it's related to percpu operations and relies extensively on the 
various percpu.h files in asm-generic, asm-arch and include/linux.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 35/41] Support for CPU ops
  2008-05-30  4:58   ` Andrew Morton
@ 2008-05-30  5:18     ` Christoph Lameter
  0 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  5:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-arch, Tony.Luck, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008, Andrew Morton wrote:

> On Thu, 29 May 2008 20:56:55 -0700 Christoph Lameter <clameter@sgi.com> wrote:
> 
> > Subject: [patch 35/41] Support for CPU ops
> 
> Should be called "ia64: support for CPU ops", please.

Argh. Gazillions of details on gazillions of arches over all sorts of 
kernel subsystems.
 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  5:04   ` Eric Dumazet
@ 2008-05-30  5:20     ` Christoph Lameter
  2008-05-30  5:52       ` Rusty Russell
                         ` (2 more replies)
  0 siblings, 3 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  5:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: akpm, linux-arch, linux-kernel, David Miller, Peter Zijlstra,
	Rusty Russell, Mike Travis

On Fri, 30 May 2008, Eric Dumazet wrote:

> > +static DEFINE_PER_CPU(UNIT_TYPE, area[UNITS]);
> >   
> area[] is not guaranteed to be aligned on anything but 4 bytes.
> 
> If someone then needs to call cpu_alloc(8, GFP_KERNEL, 8), it might get a
> non-aligned result.
> 
> Either you should add an __attribute__((__aligned__(PAGE_SIZE))),
> or take into account the real address of area[] in cpu_alloc() to avoid waste
> of up to PAGE_SIZE bytes
> per cpu.

I think cacheline aligning should be sufficient. People should not 
allocate large page aligned objects here.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  5:03   ` Christoph Lameter
@ 2008-05-30  5:21     ` Andrew Morton
  2008-05-30  5:27       ` Christoph Lameter
  2008-05-30  6:01       ` Eric Dumazet
  0 siblings, 2 replies; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  5:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 22:03:14 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:

> On Thu, 29 May 2008, Andrew Morton wrote:
> 
> > All seems reasonable to me.  The obvious question is "how do we size
> > the arena".  We either waste memory or, much worse, run out.
> 
> The per cpu memory use by subsystems is typically quite small. We already 
> have an 8k limitation for percpu space for modules. And that does not seem 
> to be a problem.

eh?  That's DEFINE_PER_CPU memory, not alloc_percpu() memory?

> > And running out is a real possibility, I think.  Most people will only
> > mount a handful of XFS filesystems.  But some customer will come along
> > who wants to mount 5,000, and distributors will need to cater for that,
> > but how can they?
> 
> Typically these are fairly small: 8 bytes * 5000 is only 40k.

It was just an example.  There will be others.

	tcp_v4_md5_do_add
	->tcp_alloc_md5sig_pool
	  ->__tcp_alloc_md5sig_pool

does an alloc_percpu for each md5-capable TCP connection.  I think - it
doesn't matter really, because something _could_.  And if something
_does_, we're screwed.

> > I wonder if we can arrange for the default to be overridden via a
> > kernel boot option?
> 
> We could do that yes.

Phew.

> > Another obvious question is "how much of a problem will we have with
> > internal fragmentation"?  This might be a drop-dead showstopper.
> 
> But then per cpu data is not frequently allocated and freed.

I think it is, in the TCP case.  And that's the only one I looked at.

Plus who knows what lies ahead of us?

> Going away from allocpercpu saves a lot of memory. We could make this 
> 128k or so to be safe?

("alloc_percpu" - please be careful about getting this stuff right)

I don't think there is presently any upper limit on alloc_percpu()?  It
uses kmalloc() and kmalloc_node()?

Even if there is some limit, is it an unfixable one?

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  5:21     ` Andrew Morton
@ 2008-05-30  5:27       ` Christoph Lameter
  2008-05-30  5:49         ` Andrew Morton
  2008-05-30 14:38         ` Mike Travis
  2008-05-30  6:01       ` Eric Dumazet
  1 sibling, 2 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  5:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008, Andrew Morton wrote:

> > The per cpu memory use by subsystems is typically quite small. We already 
> > have an 8k limitation for percpu space for modules. And that does not seem 
> > to be a problem.
> 
> eh?  That's DEFINE_PER_CPU memory, not alloc_percpu() memory?

No. The module subsystem has its own alloc_percpu subsystem that the 
cpu_alloc replaces.

> > We could do that yes.
> 
> Phew.

But it's going to be even more complicated and I have a hard time managing 
the complexity here. Could someone take pieces off my hands?

> > But then per cpu data is not frequently allocated and freed.
> 
> I think it is, in the TCP case.  And that's the only one I looked at.

Which tcp case?

> Plus who knows what lies ahead of us?

Well invariably we will end up with cpu area defragmentation.... Sigh.

> I don't think there is presently any upper limit on alloc_percpu()?  It
> uses kmalloc() and kmalloc_node()?
> 
> Even if there is some limit, is it an unfixable one?

No there is no limit. It just wastes lots of space (pointer arrays, 
alignment etc) that we could use to configure sufficiently large per cpu 
areas.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  5:10     ` Christoph Lameter
@ 2008-05-30  5:31       ` Andrew Morton
  2008-06-02  9:29         ` Paul Jackson
  2008-05-30  5:56       ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  5:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 22:10:25 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:

> > > +		start++;
> > > +		first = 0;
> > > +	}
> > 
> > This is kinda bitmap_find_free_region(), only bitmap_find_free_region()
> > isn't quite strong enough.
> > 
> > Generally I think it would have been better if you had added new
> > primitives to the bitmap library (or enhanced existing ones) and used
> > them here, rather than implementing private functionality.
> 
> The scope of the patchset is already fairly large.

It would be a relatively small incremental effort ;)

> The search here is 
> different and not performance critical. Not sure if this is useful for 
> other purposes.

I think that strengthening bitmap_find_free_region() would end up
giving us a better kernel than open-coding something similar here.
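
For illustration, a minimal sketch of the kind of generalised helper this
could become (name and signature are hypothetical, not an existing bitmap
API here): find a run of nr clear bits whose start index is a multiple of
align_units.

static unsigned long find_zero_area_sketch(const unsigned long *map,
					   unsigned long size,
					   unsigned long start,
					   unsigned int nr,
					   unsigned long align_units)
{
	unsigned long index, end, i;

	/* align_units is assumed to be a power of two >= 1 */
	for (;;) {
		index = ALIGN(find_next_zero_bit(map, size, start), align_units);
		end = index + nr;
		if (end > size)
			return size;		/* no suitable run found */
		i = find_next_bit(map, end, index);
		if (i >= end)
			return index;		/* [index, end) is all clear */
		start = i + 1;			/* retry past the set bit */
	}
}

cpu_alloc() could then simply set_map() the returned range under its lock.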

> > > + */
> > > +
> > > +/* Return a pointer to the instance of a object for a particular processor */
> > > +#define CPU_PTR(__p, __cpu)	SHIFT_PERCPU_PTR((__p), per_cpu_offset(__cpu))
> > 
> > eek, a major interface function which is ALL IN CAPS!
> > 
> > can we do this in lower-case?  In a C function?
> 
No. This is a macro and therefore uppercase (there is macro magic going on 
that people need to be aware of). AFAICR you wanted it this way last year. A C 
function is not possible because of the type checking.

urgh.  This is a C-convention versus kernel-convention thing.  The C
convention exists for very good reasons.  But it sure does suck.

What do others think?

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 03/41] cpu alloc: Use cpu allocator instead of the builtin modules per cpu allocator
  2008-05-30  5:14     ` Christoph Lameter
@ 2008-05-30  5:34       ` Andrew Morton
  0 siblings, 0 replies; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  5:34 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 22:14:17 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:

> On Thu, 29 May 2008, Andrew Morton wrote:
> 
> > > +			printk(KERN_WARNING "%s: per-cpu alignment %li > %li\n",
> > > +			mod->name, align, PAGE_SIZE);
> > 
> > Indenting broke.
> 
> Hmmm. Okay.
>  
> > Alas, PAGE_SIZE has, iirc, unsigned type on some architectures and
> > unsigned long on others.  I suspect you'll need to cast it to be able
> > to print it.
> 
> This is code that was moved.
> 
> > > +		percpu = cpu_alloc(size, GFP_KERNEL|__GFP_ZERO, align);
> > > +		if (!percpu)
> > > +			printk(KERN_WARNING "Could not allocate %lu bytes percpu data\n",
> > 
> > 80-col bustage.
> > 
> > A printk like this should, I think, identify what part of the kernel it
> > came from.
> 
> Again moved code. Should I really do string separations for code 
> that is moved?

That's not a string separation - it is a functional improvement.

Sure, why not fix these little things while we're there?

> > But really, I don't think any printk should be present here. 
> > cpu_alloc() itself should dump the warning and the backtrace when it
> > runs out.  Because a cpu_alloc() failure is a major catastrophe.  It
> > probably means a reconfigure-and-reboot cycle.
> 
> The code has been able to deal with an allocpercpu failure in the 
> past. Why would it have trouble with a cpu_alloc failure here?

Because an alloc_percpu failure is a page allocator failure.  This is a
well-known situation which we know basically never happens, or at least
happens under well-known circumstances.

Whereas a cpu_alloc() failure is a dead box.  We cannot fix it via
running page reclaim.  We cannot fix it via oom-killing someone.  We
are dead.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-05-30  5:17     ` Christoph Lameter
@ 2008-05-30  5:38       ` Andrew Morton
  2008-05-30  6:12         ` Christoph Lameter
  2008-05-30  7:05         ` Rusty Russell
  2008-05-30  6:32       ` Rusty Russell
  1 sibling, 2 replies; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  5:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 22:17:55 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:

> > > local_t is also very specific to the x86 processor.
> > 
> > And alpha, m32r, mips and powerpc, methinks.  Probably others, but
> > people just haven't got around to it.
> 
> No local_t does not do the relocation of the address to the correct percpu 
> area. It requies disabling of interrupts etc.

No it doesn't.  Look:

static inline void local_inc(local_t *l)
{
	asm volatile(_ASM_INC "%0"
		     : "+m" (l->a.counter));
}

> Its not atomic (wrt 
> interrupts) because of that.
>

Yes it is.

> > I think I'll need to come back another time to understand all that ;)
> > 
> > Thanks for writing it up carefully.
> 
> Well this stuff is so large in scope that I have difficulties keeping 
> everything straight.
> 
> > I wonder if all this stuff should be in a new header file.
> > 
> > We could get lazy and include that header from percpu.h if needed.
> 
> But then its related to percpu operations and relies extensively on the 
> various percpu.h files in asm-generic and asm-arch and include/linux

Well that should be fixed.  We should never have mixed the
alloc_percpu() and DEFINE_PER_CPU things in the same header.  They're
different.

otoh as you propose removing the old alloc_percpu() I guess the end
result is no worse than what we presently have.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  3:56 ` [patch 02/41] cpu alloc: The allocator Christoph Lameter
  2008-05-30  4:58   ` Andrew Morton
  2008-05-30  5:04   ` Eric Dumazet
@ 2008-05-30  5:46   ` Rusty Russell
  2008-06-04 15:04     ` Mike Travis
  2008-05-31 20:58   ` Pavel Machek
  3 siblings, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-05-30  5:46 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Mike Travis

On Friday 30 May 2008 13:56:22 Christoph Lameter wrote:
> The per cpu allocator allows dynamic allocation of memory on all
> processors simultaneously. A bitmap is used to track used areas.
> The allocator implements tight packing to reduce the cache footprint
> and increase speed since cacheline contention is typically not a concern
> for memory mainly used by a single cpu. Small objects will fill up gaps
> left by larger allocations that required alignments.

Allocator seems nice and simple, similar to existing one in module.c (which 
predates cool bitmap operators).

Being able to do per-cpu allocations in an interrupt handler seems like 
encouraging a Bad Idea though: I'd be tempted to avoid the flags word, always 
zero, and use a mutex instead of a spinlock.

Cheers,
Rusty.


>
> The size of the cpu_alloc area can be changed via make menuconfig.
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> ---
>  include/linux/percpu.h |   46 +++++++++++++
>  include/linux/vmstat.h |    2
>  mm/Kconfig             |    6 +
>  mm/Makefile            |    2
>  mm/cpu_alloc.c         |  167 +++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/vmstat.c            |    1
>  6 files changed, 222 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/include/linux/vmstat.h
> ===================================================================
> --- linux-2.6.orig/include/linux/vmstat.h	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/include/linux/vmstat.h	2008-05-29 20:15:37.000000000 -0700
> @@ -37,7 +37,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
>  		FOR_ALL_ZONES(PGSCAN_KSWAPD),
>  		FOR_ALL_ZONES(PGSCAN_DIRECT),
>  		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
> -		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> +		PAGEOUTRUN, ALLOCSTALL, PGROTATED, CPU_BYTES,
>  #ifdef CONFIG_HUGETLB_PAGE
>  		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
>  #endif
> Index: linux-2.6/mm/Kconfig
> ===================================================================
> --- linux-2.6.orig/mm/Kconfig	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/mm/Kconfig	2008-05-29 20:13:39.000000000 -0700
> @@ -205,3 +205,9 @@ config NR_QUICK
>  config VIRT_TO_BUS
>  	def_bool y
>  	depends on !ARCH_NO_VIRT_TO_BUS
> +
> +config CPU_ALLOC_SIZE
> +	int "Size of cpu alloc area"
> +	default "30000"
> +	help
> +	  Sets the maximum amount of memory that can be allocated via cpu_alloc
> Index: linux-2.6/mm/Makefile
> ===================================================================
> --- linux-2.6.orig/mm/Makefile	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/mm/Makefile	2008-05-29 20:15:41.000000000 -0700
> @@ -11,7 +11,7 @@ obj-y			:= bootmem.o filemap.o mempool.o
>  			   maccess.o page_alloc.o page-writeback.o pdflush.o \
>  			   readahead.o swap.o truncate.o vmscan.o \
>  			   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
> -			   page_isolation.o $(mmu-y)
> +			   page_isolation.o cpu_alloc.o $(mmu-y)
>
>  obj-$(CONFIG_PROC_PAGE_MONITOR) += pagewalk.o
>  obj-$(CONFIG_BOUNCE)	+= bounce.o
> Index: linux-2.6/mm/cpu_alloc.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/mm/cpu_alloc.c	2008-05-29 20:13:39.000000000 -0700
> @@ -0,0 +1,167 @@
> +/*
> + * Cpu allocator - Manage objects allocated for each processor
> + *
> + * (C) 2008 SGI, Christoph Lameter <clameter@sgi.com>
> + * 	Basic implementation with allocation and free from a dedicated per
> + * 	cpu area.
> + *
> + * The per cpu allocator allows dynamic allocation of memory on all
> + * processor simultaneously. A bitmap is used to track used areas.
> + * The allocator implements tight packing to reduce the cache footprint
> + * and increase speed since cacheline contention is typically not a concern
> + * for memory mainly used by a single cpu. Small objects will fill up gaps
> + * left by larger allocations that required alignments.
> + */
> +#include <linux/mm.h>
> +#include <linux/mmzone.h>
> +#include <linux/module.h>
> +#include <linux/percpu.h>
> +#include <linux/bitmap.h>
> +#include <asm/sections.h>
> +
> +/*
> + * Basic allocation unit. A bit map is created to track the use of each
> + * UNIT_SIZE element in the cpu area.
> + */
> +#define UNIT_TYPE int
> +#define UNIT_SIZE sizeof(UNIT_TYPE)
> +#define UNITS (CONFIG_CPU_ALLOC_SIZE / UNIT_SIZE)
> +
> +static DEFINE_PER_CPU(UNIT_TYPE, area[UNITS]);
> +
> +/*
> + * How many units are needed for an object of a given size
> + */
> +static int size_to_units(unsigned long size)
> +{
> +	return DIV_ROUND_UP(size, UNIT_SIZE);
> +}
> +
> +/*
> + * Lock to protect the bitmap and the meta data for the cpu allocator.
> + */
> +static DEFINE_SPINLOCK(cpu_alloc_map_lock);
> +static DECLARE_BITMAP(cpu_alloc_map, UNITS);
> +static int first_free;		/* First known free unit */
> +
> +/*
> + * Mark an object as used in the cpu_alloc_map
> + *
> + * Must hold cpu_alloc_map_lock
> + */
> +static void set_map(int start, int length)
> +{
> +	while (length-- > 0)
> +		__set_bit(start++, cpu_alloc_map);
> +}
> +
> +/*
> + * Mark an area as freed.
> + *
> + * Must hold cpu_alloc_map_lock
> + */
> +static void clear_map(int start, int length)
> +{
> +	while (length-- > 0)
> +		__clear_bit(start++, cpu_alloc_map);
> +}
> +
> +/*
> + * Allocate an object of a certain size
> + *
> + * Returns a special pointer that can be used with CPU_PTR to find the
> + * address of the object for a certain cpu.
> + */
> +void *cpu_alloc(unsigned long size, gfp_t gfpflags, unsigned long align)
> +{
> +	unsigned long start;
> +	int units = size_to_units(size);
> +	void *ptr;
> +	int first;
> +	unsigned long flags;
> +
> +	if (!size)
> +		return ZERO_SIZE_PTR;
> +
> +	spin_lock_irqsave(&cpu_alloc_map_lock, flags);
> +
> +	first = 1;
> +	start = first_free;
> +
> +	for ( ; ; ) {
> +
> +		start = find_next_zero_bit(cpu_alloc_map, UNITS, start);
> +		if (start >= UNITS)
> +			goto out_of_memory;
> +
> +		if (first)
> +			first_free = start;
> +
> +		/*
> +		 * Check alignment and that there is enough space after
> +		 * the starting unit.
> +		 */
> +		if (start % (align / UNIT_SIZE) == 0 &&
> +			find_next_bit(cpu_alloc_map, UNITS, start + 1)
> +							>= start + units)
> +				break;
> +		start++;
> +		first = 0;
> +	}
> +
> +	if (first)
> +		first_free = start + units;
> +
> +	if (start + units > UNITS)
> +		goto out_of_memory;
> +
> +	set_map(start, units);
> +	__count_vm_events(CPU_BYTES, units * UNIT_SIZE);
> +
> +	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
> +
> +	ptr = per_cpu_var(area) + start;
> +
> +	if (gfpflags & __GFP_ZERO) {
> +		int cpu;
> +
> +		for_each_possible_cpu(cpu)
> +			memset(CPU_PTR(ptr, cpu), 0, size);
> +	}
> +
> +	return ptr;
> +
> +out_of_memory:
> +	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
> +	return NULL;
> +}
> +EXPORT_SYMBOL(cpu_alloc);
> +
> +/*
> + * Free an object. The pointer must be a cpu pointer allocated
> + * via cpu_alloc.
> + */
> +void cpu_free(void *start, unsigned long size)
> +{
> +	unsigned long units = size_to_units(size);
> +	unsigned long index = (int *)start - per_cpu_var(area);
> +	unsigned long flags;
> +
> +	if (!start || start == ZERO_SIZE_PTR)
> +		return;
> +
> +	BUG_ON(index >= UNITS ||
> +		!test_bit(index, cpu_alloc_map) ||
> +		!test_bit(index + units - 1, cpu_alloc_map));
> +
> +	spin_lock_irqsave(&cpu_alloc_map_lock, flags);
> +
> +	clear_map(index, units);
> +	__count_vm_events(CPU_BYTES, -units * UNIT_SIZE);
> +
> +	if (index < first_free)
> +		first_free = index;
> +
> +	spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
> +}
> +EXPORT_SYMBOL(cpu_free);
> Index: linux-2.6/mm/vmstat.c
> ===================================================================
> --- linux-2.6.orig/mm/vmstat.c	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/mm/vmstat.c	2008-05-29 20:13:39.000000000 -0700
> @@ -653,6 +653,7 @@ static const char * const vmstat_text[]
>  	"allocstall",
>
>  	"pgrotated",
> +	"cpu_bytes",
>  #ifdef CONFIG_HUGETLB_PAGE
>  	"htlb_buddy_alloc_success",
>  	"htlb_buddy_alloc_fail",
> Index: linux-2.6/include/linux/percpu.h
> ===================================================================
> --- linux-2.6.orig/include/linux/percpu.h	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/include/linux/percpu.h	2008-05-29 20:29:12.000000000 -0700
> @@ -135,4 +135,50 @@ static inline void percpu_free(void *__p
>  #define free_percpu(ptr)	percpu_free((ptr))
>  #define per_cpu_ptr(ptr, cpu)	percpu_ptr((ptr), (cpu))
>
> +
> +/*
> + * cpu allocator definitions
> + *
> + * The cpu allocator allows allocating an instance of an object for each
> + * processor and the use of a single pointer to access all instances
> + * of the object. cpu_alloc provides optimized means for accessing the
> + * instance of the object belonging to the currently executing processor
> + * as well as special atomic operations on fields of objects of the
> + * currently executing processor.
> + *
> + * Cpu objects are typically small. The allocator packs them tightly
> + * to increase the chance on each access that a per cpu object is already
> + * cached. Alignments may be specified but the intent is to align the data
> + * properly due to cpu alignment constraints and not to avoid cacheline
> + * contention. Any holes left by aligning objects are filled up with smaller
> + * objects that are allocated later.
> + *
> + * Cpu data can be allocated using CPU_ALLOC. The resulting pointer is
> + * pointing to the instance of the variable in the per cpu area provided
> + * by the loader. It is generally an error to use the pointer directly
> + * unless we are booting the system.
> + *
> + * __GFP_ZERO may be passed as a flag to zero the allocated memory.
> + */
> +
> +/* Return a pointer to the instance of a object for a particular processor */
> +#define CPU_PTR(__p, __cpu)	SHIFT_PERCPU_PTR((__p), per_cpu_offset(__cpu))
> +
> +/*
> + * Return a pointer to the instance of the object belonging to the processor
> + * running the current code.
> + */
> +#define THIS_CPU(__p)	SHIFT_PERCPU_PTR((__p), my_cpu_offset)
> +#define __THIS_CPU(__p)	SHIFT_PERCPU_PTR((__p), __my_cpu_offset)
> +
> +#define CPU_ALLOC(type, flags)	((typeof(type) *)cpu_alloc(sizeof(type), \
> +					(flags), __alignof__(type)))
> +#define CPU_FREE(pointer)	cpu_free((pointer), sizeof(*(pointer)))
> +
> +/*
> + * Raw calls
> + */
> +void *cpu_alloc(unsigned long size, gfp_t flags, unsigned long align);
> +void cpu_free(void *cpu_pointer, unsigned long size);
> +
>  #endif /* __LINUX_PERCPU_H */



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  5:27       ` Christoph Lameter
@ 2008-05-30  5:49         ` Andrew Morton
  2008-05-30  6:16           ` Christoph Lameter
  2008-05-30 14:38         ` Mike Travis
  1 sibling, 1 reply; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  5:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 22:27:53 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:

> On Thu, 29 May 2008, Andrew Morton wrote:
> 
> > > The per cpu memory use by subsystems is typically quite small. We already 
> > > have an 8k limitation for percpu space for modules. And that does not seem 
> > > to be a problem.
> > 
> > eh?  That's DEFINE_PER_CPU memory, not alloc_percpu() memory?
> 
> No. The module subsystem has its own alloc_percpu subsystem that the 
> cpu_alloc replaces.

That is to support DEFINE_PER_CPU, not alloc_percpu().

> > > We could do that yes.
> > 
> > Phew.
> 
> But it's going to be even more complicated and I have a hard time managing 
> the complexity here. Could someone take pieces off my hands?

It could be done later on.

> > > But then per cpu data is not frequently allocated and freed.
> > 
> > I think it is, in the TCP case.  And that's the only one I looked at.
> 
> Which tcp case?

The one you just deleted from my reply :(

> > Plus who knows what lies ahead of us?
> 
> Well invariably we will end up with cpu area defragmentation.... Sigh.
> 
> > I don't think there is presently any upper limit on alloc_percpu()?  It
> > uses kmalloc() and kmalloc_node()?
> > 
> > Even if there is some limit, is it an unfixable one?
> 
> No there is no limit. It just wastes lots of space (pointer arrays, 
> alignment etc) that we could use to configure sufficiently large per cpu 
> areas.

Christoph, please.  An allocator which is of fixed size and which is
vulnerable to internal fragmentation is a huge problem!  The kernel is
subject to wildly varying workloads both between different users and in
the hands of a single user.

If we were to merge all this code and then run into the problems which
I fear then we are tremendously screwed.  We must examine this
exhaustively, in the most paranoid fashion.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  5:20     ` Christoph Lameter
@ 2008-05-30  5:52       ` Rusty Russell
  2008-06-04 15:30         ` Mike Travis
  2008-05-30  5:54       ` Eric Dumazet
  2008-06-04 14:58       ` Mike Travis
  2 siblings, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-05-30  5:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Eric Dumazet, akpm, linux-arch, linux-kernel, David Miller,
	Peter Zijlstra, Mike Travis

On Friday 30 May 2008 15:20:45 Christoph Lameter wrote:
> On Fri, 30 May 2008, Eric Dumazet wrote:
> > > +static DEFINE_PER_CPU(UNIT_TYPE, area[UNITS]);
> >
> > area[] is not guaranteed to be aligned on anything but 4 bytes.
> >
> > If someone then needs to call cpu_alloc(8, GFP_KERNEL, 8), it might get
> > a non-aligned result.
> >
> > Either you should add an __attribute__((__aligned__(PAGE_SIZE))),
> > or take into account the real address of area[] in cpu_alloc() to avoid
> > waste of up to PAGE_SIZE bytes
> > per cpu.
>
> I think cacheline aligning should be sufficient. People should not
> allocate large page aligned objects here.

I vaguely recall there were issues with this in the module code.  They might 
be gone now, but failing to meet alignment constraints without a big warning 
would suck.

But modifying your code to consider the actual alignment is actually pretty 
trivial, AFAICT.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  5:20     ` Christoph Lameter
  2008-05-30  5:52       ` Rusty Russell
@ 2008-05-30  5:54       ` Eric Dumazet
  2008-06-04 14:58       ` Mike Travis
  2 siblings, 0 replies; 139+ messages in thread
From: Eric Dumazet @ 2008-05-30  5:54 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-arch, linux-kernel, David Miller, Peter Zijlstra,
	Rusty Russell, Mike Travis

Christoph Lameter a écrit :
> On Fri, 30 May 2008, Eric Dumazet wrote:
>
>   
>>> +static DEFINE_PER_CPU(UNIT_TYPE, area[UNITS]);
>>>   
>>>       
>> area[] is not guaranteed to be aligned on anything but 4 bytes.
>>
>> If someone then needs to call cpu_alloc(8, GFP_KERNEL, 8), it might get a
>> non-aligned result.
>>
>> Either you should add an __attribute__((__aligned__(PAGE_SIZE))),
>> or take into account the real address of area[] in cpu_alloc() to avoid waste
>> of up to PAGE_SIZE bytes
>> per cpu.
>>     
>
> I think cacheline aligning should be sufficient. People should not 
> allocate large page aligned objects here.
>
>
>   
Hmm, maybe, but then we'd break modules that might request up to PAGE_SIZE
alignment for their percpu section, if I read your 3rd patch correctly.

Taking into account the ((unsigned long)area & (PAGE_SIZE-1)) offset in
cpu_alloc() should give up to PAGE_SIZE alignment for free.
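
For illustration, a minimal sketch of that adjustment (the helper name is
invented here, and it assumes the requested alignments are powers of two,
as the current callers use):

/* would replace the "start % (align / UNIT_SIZE) == 0" test in cpu_alloc() */
static int start_is_aligned(unsigned long start, unsigned long align)
{
	/* real address this unit index maps to in the base copy of area[] */
	unsigned long addr = (unsigned long)per_cpu_var(area) + start * UNIT_SIZE;

	return (addr & (align - 1)) == 0;
}

With that, a PAGE_SIZE alignment request is honoured without forcing area[]
itself to be page aligned, as long as each cpu's copy of the area sits at a
suitably aligned per cpu offset.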






^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  5:10     ` Christoph Lameter
  2008-05-30  5:31       ` Andrew Morton
@ 2008-05-30  5:56       ` KAMEZAWA Hiroyuki
  2008-05-30  6:16         ` Christoph Lameter
  1 sibling, 1 reply; 139+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-05-30  5:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 22:10:25 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> On Thu, 29 May 2008, Andrew Morton wrote:
> 
> > > +config CPU_ALLOC_SIZE
> > > +	int "Size of cpu alloc area"
> > > +	default "30000"
> > 
> > strange choice of a default?  I guess it makes it clear that there's no
> > particular advantage in making it a power-of-two or anything like that.
> 
> The cpu alloc has a cpu_bytes field in vmstat that shows how much memory 
> is being used. 30000 seemed to be reasonable after staring at these 
> numbers for awhile.
> 

Is 30000 suitable for both 32-bit and 64-bit arches?

Thanks,
-kame


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  5:21     ` Andrew Morton
  2008-05-30  5:27       ` Christoph Lameter
@ 2008-05-30  6:01       ` Eric Dumazet
  2008-05-30  6:16         ` Andrew Morton
  1 sibling, 1 reply; 139+ messages in thread
From: Eric Dumazet @ 2008-05-30  6:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, linux-arch, linux-kernel, David Miller,
	Peter Zijlstra, Rusty Russell, Mike Travis

Andrew Morton a écrit :
>
> It was just an example.  There will be others.
>
> 	tcp_v4_md5_do_add
> 	->tcp_alloc_md5sig_pool
> 	  ->__tcp_alloc_md5sig_pool
>
> does an alloc_percpu for each md5-capable TCP connection.  I think - it
> doesn't matter really, because something _could_.  And if something
> _does_, we're screwed.
>   
Last time I took a look at this stuff, this was a percpu allocation for 
all connections, not for each TCP session.
(It should be static, instead of dynamic.)

Really, percpu allocations are currently not frequent at all.

vmalloc()/vfree() are way more frequent and still use a list.






^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 03/41] cpu alloc: Use cpu allocator instead of the builtin modules per cpu allocator
  2008-05-30  3:56 ` [patch 03/41] cpu alloc: Use cpu allocator instead of the builtin modules per cpu allocator Christoph Lameter
  2008-05-30  4:58   ` Andrew Morton
@ 2008-05-30  6:08   ` Rusty Russell
  2008-05-30  6:21     ` Christoph Lameter
  1 sibling, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-05-30  6:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Mike Travis

On Friday 30 May 2008 13:56:23 Christoph Lameter wrote:
> --- linux-2.6.orig/kernel/module.c	2008-05-29 17:57:39.825214766 -0700
> +++ linux-2.6/kernel/module.c	2008-05-29 18:00:50.496815514 -0700
> @@ -314,121 +314,6 @@ static struct module *find_module(const
>  	return NULL;
>  }

115 lines removed... This is my favourite part of the series so far :)

>  	if (mod->percpu)
> -		percpu_modfree(mod->percpu);
> +		cpu_free(mod->percpu, mod->percpu_size);

Hmm, does cpu_free(NULL, 0) do something?  Seems like it shouldn't, for 
symmetry with free().

> +		if (align > PAGE_SIZE) {
> +			printk(KERN_WARNING "%s: per-cpu alignment %li > %li\n",
> +			mod->name, align, PAGE_SIZE);
> +			align = PAGE_SIZE;
> +		}
> +		percpu = cpu_alloc(size, GFP_KERNEL|__GFP_ZERO, align);
> +		if (!percpu)
> +			printk(KERN_WARNING "Could not allocate %lu bytes percpu data\n",
> +										size);
>  		if (!percpu) {
>  			err = -ENOMEM;
>  			goto free_mod;

OK, we've *never* had a report of the per-cpu alignment message, so I'd be 
happy to pass that through to cpu_alloc() and have it fail.  Also, the if 
(!percpu) cases should be combined.
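
Combined, that hunk might look roughly like this (a sketch of the suggestion
only, not an actual follow-up patch; it also adds the module name to the
printk as suggested earlier in the thread):

		percpu = cpu_alloc(size, GFP_KERNEL|__GFP_ZERO, align);
		if (!percpu) {
			printk(KERN_WARNING
			       "%s: could not allocate %lu bytes percpu data\n",
			       mod->name, size);
			err = -ENOMEM;
			goto free_mod;
		}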

>   free_percpu:
>  	if (percpu)
> -		percpu_modfree(percpu);
> +		cpu_free(percpu, percpu_size);

As above.

> +	goal = __per_cpu_size;

Where did __per_cpu_size come from?  I missed it in the earlier patches...

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-05-30  5:38       ` Andrew Morton
@ 2008-05-30  6:12         ` Christoph Lameter
  2008-05-30  7:08           ` Rusty Russell
  2008-05-30  7:05         ` Rusty Russell
  1 sibling, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  6:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008, Andrew Morton wrote:

> > area. It requies disabling of interrupts etc.
> 
> No it doesn't.  Look:
> 
> static inline void local_inc(local_t *l)
> {
> 	asm volatile(_ASM_INC "%0"
> 		     : "+m" (l->a.counter));
> }
> 
> > Its not atomic (wrt 
> > interrupts) because of that.
> >
> 
> Yes it is.

No, it's not! In order to increment a per cpu value you need to calculate 
the per cpu pointer address in the current per cpu segment. local_t 
cannot do that in an atomic (wrt interrupt/preempt) fashion. cpu 
ops can use a segment prefix and thus the instructions can calculate the 
per cpu address and perform the atomic inc without disabling preempt or 
interrupts.

Otherwise local_t is only useful when you disable interrupts or preemption. 
But then you could also use a regular increment.

> > But then its related to percpu operations and relies extensively on the 
> > various percpu.h files in asm-generic and asm-arch and include/linux
> 
> Well that should be fixed.  We should never have mixed the
> alloc_percpu() and DEFINE_PER_CPU things inthe same header.  They're
> different.

With cpu_alloc they are the same. They allocate from the same per cpu 
area.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  5:49         ` Andrew Morton
@ 2008-05-30  6:16           ` Christoph Lameter
  2008-05-30  6:51             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  6:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008, Andrew Morton wrote:

> > No. The module subsystem has its own alloc_percpu subsystem that the 
> > cpu_alloc replaces.
> 
> That is to support DEFINE_PER_CPU, not alloc_percpu().

Right but it needs to have its own section of the percpu space from which 
it allocates the percpu segments for the modules. So it effectively 
implements an allocator.

> If we were to merge all this code and then run into the problems which
> I fear then we are tremendously screwed.  We must examine this
> exhaustively, in the most paranoid fashion.

Well V2 virtually mapped the cpu alloc area which allowed extending it 
arbitrarily. But that made things very complicated.

The number of per cpu resources needed is mostly fixed. The number of 
zones, nodes, slab caches, network interfaces etc etc does not change much 
during typical operations.
 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  6:01       ` Eric Dumazet
@ 2008-05-30  6:16         ` Andrew Morton
  2008-05-30  6:22           ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  6:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Lameter, linux-arch, linux-kernel, David Miller,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Fri, 30 May 2008 08:01:02 +0200 Eric Dumazet <dada1@cosmosbay.com> wrote:

> Really, percpu allocations are currently not frequent at all.
> 
> vmalloc()/vfreee() are way more frequent and still use a list.

Sure it's hard to conceive how anyone could go and do a per-cpu
allocation on a fastpath.

But this has nothing to do with the frequency!  The problems surround
the _amount_ of allocated memory and the allocation/freeing patterns.

Here's another example.  And it's only an example!  Generalise!

ext3 maintains three percpu_counters per mount.  Each percpu_counter
does one percpu_alloc.  People can mount an arbitrary number of ext3
filesystems!


Another: there are two percpu_counters (and hence two percpu_alloc()s)
per backing_dev_info.  One backing_dev_info per disk and people have
been known to have thousands (iirc ~10,000) disks online.

And those examples were plucked only from today's kernel.  Who knows
what other problems will be in 2.6.45?

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  5:56       ` KAMEZAWA Hiroyuki
@ 2008-05-30  6:16         ` Christoph Lameter
  0 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  6:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Rusty Russell, Mike Travis

On Fri, 30 May 2008, KAMEZAWA Hiroyuki wrote:

> 30000 is suitable for both of 32bits/64bits arch ?

It was developed on 64 bit. As usual, 32 bit was an afterthought.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 03/41] cpu alloc: Use cpu allocator instead of the builtin modules per cpu allocator
  2008-05-30  6:08   ` Rusty Russell
@ 2008-05-30  6:21     ` Christoph Lameter
  0 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  6:21 UTC (permalink / raw)
  To: Rusty Russell
  Cc: akpm, linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Mike Travis

On Fri, 30 May 2008, Rusty Russell wrote:

> Hmm, does cpu_free(NULL, 0) do something?  Seems like it shouldn't, for 
> symmetry with free().

No it just returns.

> > +		percpu = cpu_alloc(size, GFP_KERNEL|__GFP_ZERO, align);
> > +		if (!percpu)
> > +			printk(KERN_WARNING "Could not allocate %lu bytes percpu data\n",
> > +										size);
> >  		if (!percpu) {
> >  			err = -ENOMEM;
> >  			goto free_mod;
> 
> OK, we've *never* had a report of the per-cpu alignment message, so I'd be 
> happy to pass that through to cpu_alloc() and have it fail.  Also, the if 
> (!percpu) cases should be combined.

Ack.

> >   free_percpu:
> >  	if (percpu)
> > -		percpu_modfree(percpu);
> > +		cpu_free(percpu, percpu_size);
> 
> As above.

The if can be dropped.

> 
> > +	goal = __per_cpu_size;
> 
> Where did __per_cpu_size come from?  I missed it in the earlier patches...

It's __per_cpu_end - __per_cpu_start.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  6:16         ` Andrew Morton
@ 2008-05-30  6:22           ` Christoph Lameter
  2008-05-30  6:37             ` Andrew Morton
  0 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30  6:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric Dumazet, linux-arch, linux-kernel, David Miller,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008, Andrew Morton wrote:

> ext3 maintains three percpu_counters per mount.  Each percpu_counter
> does one percpu_alloc.  People can mount an arbitrary number of ext3
> filesystems!

But it's 4 bytes per alloc, right?

> Another: there are two percpu_counters (and hence two percpu_alloc()s)
> per backing_dev_info.  One backing_dev_info per disk and people have
> been known to have thousands (iirc ~10,000) disks online.

8 bytes per backing device. 80000 bytes for 10000 disks.

> And those examples were plucked only from today's kernel.  Who knows
> what other problems will be in 2.6.45?

We can always increase the sizes.
 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-05-30  5:17     ` Christoph Lameter
  2008-05-30  5:38       ` Andrew Morton
@ 2008-05-30  6:32       ` Rusty Russell
  1 sibling, 0 replies; 139+ messages in thread
From: Rusty Russell @ 2008-05-30  6:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Mike Travis

On Friday 30 May 2008 15:17:55 Christoph Lameter wrote:
> On Thu, 29 May 2008, Andrew Morton wrote:
> > > 	local_irq_save(flags);
> > > 	/* Calculate address of per processor area */
> > > 	p = CPU_PTR(stat, smp_processor_id());
> > > 	p->counter++;
> > > 	local_irq_restore(flags);
> >
> > eh?  That's what local_t is for?
>
> No that is what local_t exactly cannot do.

Yes, but this is local_t for dynamically allocated per-cpu vars.  You've lost 
potential symmetry and invented a whole new nomenclature :(

local_ptr_inc() etc would be far preferable IMHO.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  6:22           ` Christoph Lameter
@ 2008-05-30  6:37             ` Andrew Morton
  2008-05-30 11:32               ` Matthew Wilcox
  0 siblings, 1 reply; 139+ messages in thread
From: Andrew Morton @ 2008-05-30  6:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Eric Dumazet, linux-arch, linux-kernel, David Miller,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 23:22:31 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:

> On Thu, 29 May 2008, Andrew Morton wrote:
> 
> > ext3 maintains three percpu_counters per mount.  Each percpu_counter
> > does one percpu_alloc.  People can mount an arbitrary number of ext3
> > filesystems!
> 
> But its 4 bytes per alloc right?

It could be 4000.  The present alloc_percpu() would support that.

And struct nfs_iostats is 264 bytes and nfs does an alloc_percpu() of
one of those per server and mounting thousands of servers per client
is, I believe, a real-world operation.

Plus, for the umpteenth time: saying that this code will probably work
acceptably for most people in 2.6.26 is not sufficient!

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 05/41] cpu alloc: Percpu_counter conversion
  2008-05-30  3:56 ` [patch 05/41] cpu alloc: Percpu_counter conversion Christoph Lameter
@ 2008-05-30  6:47   ` Rusty Russell
  2008-05-30 17:54     ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-05-30  6:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Mike Travis

On Friday 30 May 2008 13:56:25 Christoph Lameter wrote:
> Use cpu_alloc instead of allocpercpu.

These patches seem like useless churn.

Plus, the new code is uglier than the old code :(

Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 26/41] cpu alloc: Convert mib handling to cpu alloc
  2008-05-30  3:56 ` [patch 26/41] cpu alloc: Convert mib handling to cpu alloc Christoph Lameter
@ 2008-05-30  6:47   ` Eric Dumazet
  2008-05-30 18:01     ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Eric Dumazet @ 2008-05-30  6:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-arch, linux-kernel, David Miller, Peter Zijlstra,
	Rusty Russell, Mike Travis

Christoph Lameter a écrit :
> Use the cpu alloc functions for the mib handling functions in the net
> layer. The API for snmp_mib_free() is changed to add a size parameter
> since cpu_free() requires a size parameter.
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> ---
>  include/net/ip.h     |    2 +-
>  include/net/snmp.h   |   32 ++++++++------------------------
>  net/dccp/proto.c     |    2 +-
>  net/ipv4/af_inet.c   |   31 +++++++++++++++++--------------
>  net/ipv6/addrconf.c  |   11 ++++++-----
>  net/ipv6/af_inet6.c  |   20 +++++++++++---------
>  net/sctp/protocol.c  |    2 +-
>  net/xfrm/xfrm_proc.c |    4 ++--
>  8 files changed, 47 insertions(+), 57 deletions(-)
>   
We can also avoid the use of two arrays when CONFIG_HAVE_CPU_OPS is set,
since _CPU_INC() and __CPU_INC() are both interrupt safe.
This would reduce the size of the mibs by 50% and the complexity (no need to sum).
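
For illustration, a rough sketch of that direction (the macro shapes and the
mibs[] field are assumptions about the snmp code, not quoted from it):

/*
 * Sketch only: with interrupt safe per cpu increments a single per cpu
 * mib array is enough, so the BH and USER variants collapse into one
 * and folding no longer has to sum two arrays.
 */
#define SNMP_INC_STATS_SKETCH(mib, field)	_CPU_INC((mib)->mibs[field])
#define SNMP_INC_STATS_BH_SKETCH(mib, field)	__CPU_INC((mib)->mibs[field])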





^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  6:16           ` Christoph Lameter
@ 2008-05-30  6:51             ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 139+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-05-30  6:51 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, 29 May 2008 23:16:11 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> On Thu, 29 May 2008, Andrew Morton wrote:
> 
> > > No. The module subsystem has its own alloc_percpu subsystem that the 
> > > cpu_alloc replaces.
> > 
> > That is to support DEFINE_PER_CPU, not alloc_percpu().
> 
> Right but it needs to have its own section of the percpu space from which 
> it allocates the percpu segments for the modules. So it effectively 
> implements an allocator.
> 

Could you add text to explain "This interface is for careful use of a
pre-allocated, limited area (see Documentation/xxxx). Please use this only
when you need very fast access to per-cpu objects and you can estimate the amount
which you will finally need. If unsure, please use the generic allocator."

for the moment?

At first look, I thought of using this in the memory resource controller, but it seems
I shouldn't do so because thousands of cgroups can be used in theory...

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-05-30  5:38       ` Andrew Morton
  2008-05-30  6:12         ` Christoph Lameter
@ 2008-05-30  7:05         ` Rusty Russell
  1 sibling, 0 replies; 139+ messages in thread
From: Rusty Russell @ 2008-05-30  7:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Mike Travis

On Friday 30 May 2008 15:38:25 Andrew Morton wrote:
> On Thu, 29 May 2008 22:17:55 -0700 (PDT) Christoph Lameter 
<clameter@sgi.com> wrote:
> > But then its related to percpu operations and relies extensively on the
> > various percpu.h files in asm-generic and asm-arch and include/linux
>
> Well that should be fixed.  We should never have mixed the
> alloc_percpu() and DEFINE_PER_CPU things in the same header.  They're
> different.
>
> otoh as you propose removing the old alloc_percpu() I guess the end
> result is no worse than what we presently have.

No, the worst thing is that this is a great deal of churn which doesn't 
actually fix the "running out of per-cpu memory" problem.

It can, and should, be fixed, before changing dynamic percpu alloc to use the 
same percpu pool.

Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-05-30  6:12         ` Christoph Lameter
@ 2008-05-30  7:08           ` Rusty Russell
  2008-05-30 18:00             ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-05-30  7:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Mike Travis

On Friday 30 May 2008 16:12:59 Christoph Lameter wrote:
> On Thu, 29 May 2008, Andrew Morton wrote:
> > > area. It requies disabling of interrupts etc.
> >
> > No it doesn't.  Look:
> >
> > static inline void local_inc(local_t *l)
> > {
> > 	asm volatile(_ASM_INC "%0"
> >
> > 		     : "+m" (l->a.counter));
> >
> > }
> >
> > > Its not atomic (wrt
> > > interrupts) because of that.
> >
> > Yes it is.
>
> No its not! In order to increment a per cpu value you need to calculate
> the per cpu pointer address in the current per cpu segment.

Christoph, you just missed it, that's all.  Look at cpu_local_read et al in 
include/asm-i386/local.h (ie. before the x86 mergers chose the lowest common 
denominator one).

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  6:37             ` Andrew Morton
@ 2008-05-30 11:32               ` Matthew Wilcox
  0 siblings, 0 replies; 139+ messages in thread
From: Matthew Wilcox @ 2008-05-30 11:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Eric Dumazet, linux-arch, linux-kernel,
	David Miller, Peter Zijlstra, Rusty Russell, Mike Travis

On Thu, May 29, 2008 at 11:37:58PM -0700, Andrew Morton wrote:
> And struct nfs_iostats is 264 bytes and nfs does an alloc_percpu() of
> one of those per server and mounting thousands of servers per client
> is, I believe, a real-world operation.

Another example, not as extreme, there's an alloc_percpu(struct
disk_stats) [80 bytes on 64-bit machines] for every disk and every
partition in the machine.  The TPC system has 3000 disks, each with 14
partitions on it.  That's 15 * 80 * 3000 = 3,600,000 bytes.

Even if you're only putting a pointer to each allocation in the percpu
area, that's still 360,000 bytes, 12x as much as you think is sufficient
for the entire system.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  5:27       ` Christoph Lameter
  2008-05-30  5:49         ` Andrew Morton
@ 2008-05-30 14:38         ` Mike Travis
  2008-05-30 17:50           ` Christoph Lameter
  1 sibling, 1 reply; 139+ messages in thread
From: Mike Travis @ 2008-05-30 14:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Rusty Russell

Christoph Lameter wrote:
> On Thu, 29 May 2008, Andrew Morton wrote:
...
>> Plus who knows what lies ahead of us?
> 
> Well invariably we will end up with cpu area defragmentation.... Sigh.
> 
>> I don't think there is presently any upper limit on alloc_percpu()?  It
>> uses kmalloc() and kmalloc_node()?
>>
>> Even if there is some limit, is it an unfixable one?
> 
> No there is no limit. It just wastes lots of space (pointer arrays, 
> alignment etc) that we could use to configure sufficiently large per cpu 
> areas.

Is there any reason why the per_cpu area couldn't be made extensible?  Maybe
a simple linked list of available areas?  (And use a config variable and/or
boot param for initial size and increment size?)  [Ignoring the problem of
reclaiming the space...]
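
For illustration only, the bookkeeping for such an extensible space might
look roughly like this (every name below is invented for the sketch, and
reclaim is ignored as noted above):

struct cpu_area_chunk {
	struct list_head list;		/* chain of per cpu space chunks */
	unsigned long	 base;		/* offset of this chunk in cpu space */
	unsigned long	 units;		/* number of allocation units */
	unsigned long	*map;		/* allocation bitmap for this chunk */
};

static LIST_HEAD(cpu_area_chunks);	/* boot-time chunk plus later additions */

cpu_alloc() would walk the list and fall back to grabbing a new chunk when
all existing ones are full.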

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30 14:38         ` Mike Travis
@ 2008-05-30 17:50           ` Christoph Lameter
  2008-05-30 18:00             ` Matthew Wilcox
  0 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30 17:50 UTC (permalink / raw)
  To: Mike Travis
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Rusty Russell

On Fri, 30 May 2008, Mike Travis wrote:

> > No there is no limit. It just wastes lots of space (pointer arrays, 
> > alignment etc) that we could use to configure sufficiently large per cpu 
> > areas.
> 
> Is there any reason why the per_cpu area couldn't be made extensible?  Maybe
> a simple linked list of available areas?  (And use a config variable and/or
> boot param for initial size and increment size?)  [Ignoring the problem of
> reclaiming the space...]

cpu alloc v2 had an extendable per cpu space. You have the patches. We 
could put this on top of this patchset if necessary. But then it's not so 
nice and simple anymore. Maybe we can restrict the use of cpu alloc 
instead to users with objects < cache_line_size() or so?


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 05/41] cpu alloc: Percpu_counter conversion
  2008-05-30  6:47   ` Rusty Russell
@ 2008-05-30 17:54     ` Christoph Lameter
  0 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30 17:54 UTC (permalink / raw)
  To: Rusty Russell
  Cc: akpm, linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Mike Travis

On Fri, 30 May 2008, Rusty Russell wrote:

> On Friday 30 May 2008 13:56:25 Christoph Lameter wrote:
> > Use cpu_alloc instead of allocpercpu.
> 
> These patches seem like useless churn.
> 
> Plus, the new code is uglier than the old code :(

It drastically reduces the memory size: a 4 byte allocation f.e. requires 
SLAB to allocate a 32 byte chunk, so this reduces memory requirements by
32/4 = 8 times.

Plus the per cpu counters allocated in order are likely placed in the same 
cacheline (whereas the slab allocators avoid placing multiple objects in 
the same cacheline). Reduces cache footprint.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30 17:50           ` Christoph Lameter
@ 2008-05-30 18:00             ` Matthew Wilcox
  2008-05-30 18:12               ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Matthew Wilcox @ 2008-05-30 18:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mike Travis, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra, Rusty Russell

On Fri, May 30, 2008 at 10:50:04AM -0700, Christoph Lameter wrote:
> cpu alloc v2 had an extendable per cpu space. You have the patches. We 
> could put this on top of this patchset if necessary. But then it not so 
> nice and simple anymore. Maybe we can rstrict the use of cpu alloc 
> instead to users with objects < cache_line_size() or so?

Restricting the use of cpu_alloc based on size of object is no good when
you're trying to allocate 45,000 objects.  Extending the per CPU space
is the only option.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-05-30  7:08           ` Rusty Russell
@ 2008-05-30 18:00             ` Christoph Lameter
  2008-06-02  2:00               ` Rusty Russell
  0 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30 18:00 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Mike Travis

On Fri, 30 May 2008, Rusty Russell wrote:

> > No its not! In order to increment a per cpu value you need to calculate
> > the per cpu pointer address in the current per cpu segment.
> 
> Christoph, you just missed it, that's all.  Look at cpu_local_read et al in 
> include/asm-i386/local.h (ie. before the x86 mergers chose the lowest common 
> denominator one).

There is no doubt that local_t does perform an atomic (wrt interrupts) inc, 
for example. But it's not usable, because you need to determine the address 
of the local_t belonging to the current processor first. As soon as you 
have loaded a processor specific address you can no longer be preempted, 
because preemption may change the processor and then the wrong address may be 
incremented (and then we have a race again since now we are incrementing 
counters belonging to other processors). So local_t at minimum requires 
disabling preempt.
 
Believe me, I have tried to use local_t repeatedly, for vm statistics etc.
It always fails on that issue. See, for example, the patch that converts
vmstat to cpu alloc and compare it with my initial local_t based
implementation from two years ago, which bombed out because I assumed that
local_t would work right.

cpu ops does both

1. The determination of the address of the object belonging to the local 
processor.

and

2. The RMW

in one instruction. That avoids having to disable preemption or interrupts,
and it shortens the instruction sequence significantly.
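
Roughly, as a sketch (THIS_CPU() is the helper from this patchset;
CPU_INC_GENERIC is just a label for the fallback shown here):

	/*
	 * Generic fallback: two separate steps, so preemption must be off
	 * in between, or a reschedule would make us increment another
	 * cpu's copy.
	 */
	#define CPU_INC_GENERIC(ptr)			\
	do {						\
		preempt_disable();			\
		(*THIS_CPU(ptr))++;			\
		preempt_enable();			\
	} while (0)

	/*
	 * x86 with cpu ops: the segment override folds the address
	 * selection into the RMW itself, roughly "incq %gs:(%reg)", so
	 * neither preemption nor interrupts need to be disabled.
	 */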


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 26/41] cpu alloc: Convert mib handling to cpu alloc
  2008-05-30  6:47   ` Eric Dumazet
@ 2008-05-30 18:01     ` Christoph Lameter
  0 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30 18:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: akpm, linux-arch, linux-kernel, David Miller, Peter Zijlstra,
	Rusty Russell, Mike Travis

On Fri, 30 May 2008, Eric Dumazet wrote:

> We also can avoid the use of two arrays when CONFIG_HAVE_CPU_OPS
> since _CPU_INC() and __CPU_INC() are both interrupt safe.
> This would reduce size of mibs by 50% and complexity (no need to sum)

Right.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30 18:00             ` Matthew Wilcox
@ 2008-05-30 18:12               ` Christoph Lameter
  0 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-05-30 18:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mike Travis, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra, Rusty Russell

On Fri, 30 May 2008, Matthew Wilcox wrote:

> On Fri, May 30, 2008 at 10:50:04AM -0700, Christoph Lameter wrote:
> > cpu alloc v2 had an extendable per cpu space. You have the patches. We 
> > could put this on top of this patchset if necessary. But then it not so 
> > nice and simple anymore. Maybe we can rstrict the use of cpu alloc 
> > instead to users with objects < cache_line_size() or so?
> 
> Restricting the use of cpu_alloc based on size of object is no good when
> you're trying to allocate 45,000 objects.  Extending the per CPU space
> is the only option.

OK, I guess we need to bring the virtually mapped per cpu patches forward.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  3:56 ` [patch 02/41] cpu alloc: The allocator Christoph Lameter
                     ` (2 preceding siblings ...)
  2008-05-30  5:46   ` Rusty Russell
@ 2008-05-31 20:58   ` Pavel Machek
  3 siblings, 0 replies; 139+ messages in thread
From: Pavel Machek @ 2008-05-31 20:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

Hi!

> Index: linux-2.6/mm/Kconfig
> ===================================================================
> --- linux-2.6.orig/mm/Kconfig	2008-05-29 19:41:21.000000000 -0700
> +++ linux-2.6/mm/Kconfig	2008-05-29 20:13:39.000000000 -0700
> @@ -205,3 +205,9 @@ config NR_QUICK
>  config VIRT_TO_BUS
>  	def_bool y
>  	depends on !ARCH_NO_VIRT_TO_BUS
> +
> +config CPU_ALLOC_SIZE
> +	int "Size of cpu alloc area"
> +	default "30000"
> +	help
> +	  Sets the maximum amount of memory that can be allocated via cpu_alloc

Missing '.' at the end of the line, but more importantly:

How do you expect a user to answer this?!

I mean, can they put 0 there and expect the kernel to work? If not, you
should disallow that value. And you should really explain how much
memory is needed... and perhaps mention that the size is in bytes?
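
For illustration, help text along these lines would answer most of that
(the range and numbers below are only placeholders, not a recommendation):

	config CPU_ALLOC_SIZE
		int "Size of cpu alloc area (bytes)"
		range 8192 1048576
		default "30000"
		help
		  Maximum amount of per cpu memory, in bytes, that cpu_alloc()
		  can hand out for each processor.  The kernel's own per cpu
		  counters already need several kilobytes; setting this too
		  low will make allocations fail at runtime.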

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-05-30 18:00             ` Christoph Lameter
@ 2008-06-02  2:00               ` Rusty Russell
  2008-06-04 18:18                 ` Mike Travis
  2008-06-10 17:42                 ` Christoph Lameter
  0 siblings, 2 replies; 139+ messages in thread
From: Rusty Russell @ 2008-06-02  2:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Mike Travis

On Saturday 31 May 2008 04:00:40 Christoph Lameter wrote:
> On Fri, 30 May 2008, Rusty Russell wrote:
> > > No its not! In order to increment a per cpu value you need to calculate
> > > the per cpu pointer address in the current per cpu segment.
> >
> > Christoph, you just missed it, that's all.  Look at cpu_local_read et al
> > in include/asm-i386/local.h (ie. before the x86 mergers chose the lowest
> > common denominator one).
>
> There is no doubt that local_t does perform an atomic vs. interrupt inc
> for example. But its not usable. Because you need to determine the address
> of the local_t belonging to the current processor first.

Christoph!

STOP typing, and START reading.

cpu_local_inc() does all this: it takes the name of a local_t var, and is 
expected to increment this cpu's version of that.  You ripped this out and 
called it CPU_INC().

Do not make me explain it a third time.

> As soon as you 
> have loaded a processor specific address you can no longer be preempted
> because that may change the processor and then the wrong address may be
> increment (and then we have a race again since now we are incrementing
> counters belonging to other processors). So local_t at mininum requires
> disabling preempt.

Think for a moment.  What are the chances that I didn't understand this when I 
wrote the multiple implementations of local_t?

You are wasting my time explaining the obvious, and wasting your own.

> Believe me I have tried to use local_t repeatedly for vm statistics etc.
> It always fails on that issue.

Frankly, I am finding it increasingly easy to believe that you failed.  But 
you are blaming the wrong thing.

There are three implementations of local_t which are obvious.  The best is for 
architectures which can locate and increment a per-cpu var in one instruction 
(eg. x86).  Otherwise, using atomic_t/atomic64_t for local_t provides a 
general solution.  The other general solution would involve 
local_irq_disable()/increment/local_irq_enable().
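
In sketch form (generic versions only; the names here are illustrative,
and the x86 case is a single instruction in asm):

	/* (a) atomic based: safe anywhere, no preemption games. */
	typedef struct { atomic_long_t a; } local_t_sketch;
	#define local_inc_atomic(l)	atomic_long_inc(&(l)->a)

	/* (b) irq based: plain increment bracketed by irq save/restore. */
	static inline void local_inc_irqoff(local_t_sketch *l)
	{
		unsigned long flags;

		local_irq_save(flags);
		l->a.counter++;
		local_irq_restore(flags);
	}

	/* (c) x86: "incl %fs:offset" (or %gs:) locates this cpu's copy and
	 *     increments it in a single instruction. */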

My (fading) hope is that this idiocy is an aberration,
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  5:31       ` Andrew Morton
@ 2008-06-02  9:29         ` Paul Jackson
  0 siblings, 0 replies; 139+ messages in thread
From: Paul Jackson @ 2008-06-02  9:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: clameter, linux-arch, linux-kernel, davem, dada1, peterz, rusty, travis

Andrew wrote:
> > > > +#define CPU_PTR(__p, __cpu)	SHIFT_PERCPU_PTR((__p), per_cpu_offset(__cpu))
> > > 
> > > eek, a major interface function which is ALL IN CAPS!
> > > 
> > > can we do this in lower-case?  In a C function?
> > 
> > No. This is a macro and therefore uppercase (there is macro magic going on 
> > that ppl need to be aware of). AFAICR you wanted it this way last year. C 
> > function not possible because of the type checking.
> 
> urgh.  This is a C-convention versus kernel-convention thing.  The C
> convention exists for very good reasons.  But it sure does suck.
> 
> What do others think?

A few key symbols get to be special ... short but distinctive names
that become (in)famous.  The classic was "u", for the per-user
structure, aka the "user area", in old Unix kernels.  In people's
names, a few one word or first names such as "Ike", "Madonna", "Ali",
"Tiger", "Cher", "Mao", "OJ", "Plato", "Linus", ... have become
distinctive and well known to many people.

How about "_pcpu", instead of CPU_PTR?  "_pcpu" is a short, unique
(not currently in use) symbol that, tersely, says what we want to say.

Yes - it violates multiple conventions.  "The Boss" (Bruce Springsteen)
gets to do that.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [patch 01/41] cpu_alloc: Increase percpu area size to 128k
  2008-05-30  3:56 ` [patch 01/41] cpu_alloc: Increase percpu area size to 128k Christoph Lameter
@ 2008-06-02 17:58   ` Luck, Tony
  2008-06-02 23:48     ` Rusty Russell
  2008-06-10 17:22     ` Christoph Lameter
  0 siblings, 2 replies; 139+ messages in thread
From: Luck, Tony @ 2008-06-02 17:58 UTC (permalink / raw)
  To: Christoph Lameter, akpm
  Cc: linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

> The per cpu allocator requires more per cpu space and we are already near
> the limit on IA64. Increase the maximum size of the IA64 per cpu area from
> 64K to 128K.

> -#define PERCPU_PAGE_SHIFT	16	/* log2() of max. size of per-CPU area */
> +#define PERCPU_PAGE_SHIFT	17	/* log2() of max. size of per-CPU area */

Don't you need some more changes to the alt_dtlb_miss handler in
ivt.S for this to work?  128K is not a supported pagesize on any
processor model.

-Tony

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 01/41] cpu_alloc: Increase percpu area size to 128k
  2008-06-02 17:58   ` Luck, Tony
@ 2008-06-02 23:48     ` Rusty Russell
  2008-06-10 17:22     ` Christoph Lameter
  1 sibling, 0 replies; 139+ messages in thread
From: Rusty Russell @ 2008-06-02 23:48 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Christoph Lameter, akpm, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Mike Travis

On Tuesday 03 June 2008 03:58:17 Luck, Tony wrote:
> > The per cpu allocator requires more per cpu space and we are already near
> > the limit on IA64. Increase the maximum size of the IA64 per cpu area
> > from 64K to 128K.
> >
> > -#define PERCPU_PAGE_SHIFT	16	/* log2() of max. size of per-CPU area */
> > +#define PERCPU_PAGE_SHIFT	17	/* log2() of max. size of per-CPU area */
>
> Don't you need some more changes to the alt_dtlb_miss handler in
> ivt.S for this to work?  128K is not a supported pagesize on any
> processor model.

Yes, this was one of the issues with IA64 and extending the per-cpu area.  
It's probable that the IA64 TLB nailing trick might have to give way for 
dynamic per-cpu...

Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  4:58   ` Andrew Morton
  2008-05-30  5:10     ` Christoph Lameter
@ 2008-06-04 14:48     ` Mike Travis
  1 sibling, 0 replies; 139+ messages in thread
From: Mike Travis @ 2008-06-04 14:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Rusty Russell


Andrew Morton wrote:
...
>> +/*
>> + * Mark an object as used in the cpu_alloc_map
>> + *
>> + * Must hold cpu_alloc_map_lock
>> + */
>> +static void set_map(int start, int length)
>> +{
>> +	while (length-- > 0)
>> +		__set_bit(start++, cpu_alloc_map);
>> +}
> 
> bitmap_fill()?
> 
>> +/*
>> + * Mark an area as freed.
>> + *
>> + * Must hold cpu_alloc_map_lock
>> + */
>> +static void clear_map(int start, int length)
>> +{
>> +	while (length-- > 0)
>> +		__clear_bit(start++, cpu_alloc_map);
>> +}
> 
> bitmap_zero()?
...
>> +void *cpu_alloc(unsigned long size, gfp_t gfpflags, unsigned long align)
>> +{
>> +	unsigned long start;
>> +	int units = size_to_units(size);
>> +	void *ptr;
>> +	int first;
>> +	unsigned long flags;
>> +
>> +	if (!size)
>> +		return ZERO_SIZE_PTR;
> 
> OK, so we reuse ZERO_SIZE_PTR from kmalloc.
> 
>> +	spin_lock_irqsave(&cpu_alloc_map_lock, flags);
>> +
>> +	first = 1;
>> +	start = first_free;
>> +
>> +	for ( ; ; ) {
>> +
>> +		start = find_next_zero_bit(cpu_alloc_map, UNITS, start);
>> +		if (start >= UNITS)
>> +			goto out_of_memory;
>> +
>> +		if (first)
>> +			first_free = start;
>> +
>> +		/*
>> +		 * Check alignment and that there is enough space after
>> +		 * the starting unit.
>> +		 */
>> +		if (start % (align / UNIT_SIZE) == 0 &&
>> +			find_next_bit(cpu_alloc_map, UNITS, start + 1)
>> +							>= start + units)
>> +				break;
>> +		start++;
>> +		first = 0;
>> +	}
> 
> This is kinda bitmap_find_free_region(), only bitmap_find_free_region()
> isn't quite strong enough.
> 
> Generally I think it would have been better if you had added new
> primitives to the bitmap library (or enhanced existing ones) and used
> them here, rather than implementing private functionality.

Hi Andrew,

I've sort of inherited this now...

So are you suggesting that we add new bitmap primitives such as:

	bitmap_fill_offset(bitmap, start, nbits)   /* start at bitmap[start] */
	bitmap_zero_offset(bitmap, start, nbits)
	bitmap_find_free_area(bitmap, nbits, size, align)  /* size not order */
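
For the third one, something along these lines, essentially the search
loop from cpu_alloc() above with the alignment expressed in map units
(sketch only; the return-'bits'-on-failure convention is assumed, and the
caller still has to mark the bits used):

	static unsigned long bitmap_find_free_area(const unsigned long *map,
						   unsigned long bits,
						   unsigned long len,
						   unsigned long align)
	{
		unsigned long start = 0;

		for (;;) {
			/* first candidate: next zero bit at or after start */
			start = find_next_zero_bit(map, bits, start);
			if (start >= bits)
				return bits;	/* no suitable run found */
			/* aligned and followed by at least len free bits? */
			if (start % align == 0 &&
			    find_next_bit(map, bits, start + 1) >= start + len)
				return start;
			start++;
		}
	}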

Thanks,
Mike 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  5:20     ` Christoph Lameter
  2008-05-30  5:52       ` Rusty Russell
  2008-05-30  5:54       ` Eric Dumazet
@ 2008-06-04 14:58       ` Mike Travis
  2008-06-04 15:11         ` Eric Dumazet
  2008-06-10 17:33         ` Christoph Lameter
  2 siblings, 2 replies; 139+ messages in thread
From: Mike Travis @ 2008-06-04 14:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Eric Dumazet, akpm, linux-arch, linux-kernel, David Miller,
	Peter Zijlstra, Rusty Russell

Christoph Lameter wrote:
> On Fri, 30 May 2008, Eric Dumazet wrote:
> 
>>> +static DEFINE_PER_CPU(UNIT_TYPE, area[UNITS]);
>>>   
>> area[] is not guaranteed to be aligned on anything but 4 bytes.
>>
>> If someone then needs to call cpu_alloc(8, GFP_KERNEL, 8), it might get an non
>> aligned result.
>>
>> Either you should add an __attribute__((__aligned__(PAGE_SIZE))),
>> or take into account the real address of area[] in cpu_alloc() to avoid waste
>> of up to PAGE_SIZE bytes
>> per cpu.
> 
> I think cacheline aligning should be sufficient. People should not 
> allocate large page aligned objects here.

I'm a bit confused.  Why is DEFINE_PER_CPU_SHARED_ALIGNED() conditioned on
ifdef MODULE?

        #ifdef MODULE
        #define SHARED_ALIGNED_SECTION ".data.percpu"
        #else
        #define SHARED_ALIGNED_SECTION ".data.percpu.shared_aligned"
        #endif

        #define DEFINE_PER_CPU_SHARED_ALIGNED(type, name)                       \
                __attribute__((__section__(SHARED_ALIGNED_SECTION)))            \
                PER_CPU_ATTRIBUTES __typeof__(type) per_cpu__##name             \
                ____cacheline_aligned_in_smp

Thanks,
Mike

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  5:46   ` Rusty Russell
@ 2008-06-04 15:04     ` Mike Travis
  2008-06-10 17:34       ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Mike Travis @ 2008-06-04 15:04 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Christoph Lameter, akpm, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra

Rusty Russell wrote:
> On Friday 30 May 2008 13:56:22 Christoph Lameter wrote:
>> The per cpu allocator allows dynamic allocation of memory on all
>> processors simultaneously. A bitmap is used to track used areas.
>> The allocator implements tight packing to reduce the cache footprint
>> and increase speed since cacheline contention is typically not a concern
>> for memory mainly used by a single cpu. Small objects will fill up gaps
>> left by larger allocations that required alignments.
> 
> Allocator seems nice and simple, similar to existing one in module.c (which 
> predates cool bitmap operators).
> 
> Being able to do per-cpu allocations in an interrupt handler seems like 
> encouraging a Bad Idea though: I'd be tempted to avoid the flags word, always 
> zero, and use a mutex instead of a spinlock.
> 
> Cheers,
> Rusty.

I haven't seen any further discussion on these aspects... is there a consensus
to remove the flags from CPU_ALLOC() and use a mutex?

Thanks,
Mike

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-05-30  4:58 ` [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Andrew Morton
  2008-05-30  5:03   ` Christoph Lameter
@ 2008-06-04 15:07   ` Mike Travis
  2008-06-06  5:33     ` Eric Dumazet
  1 sibling, 1 reply; 139+ messages in thread
From: Mike Travis @ 2008-06-04 15:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Rusty Russell

Andrew Morton wrote:
> On Thu, 29 May 2008 20:56:20 -0700 Christoph Lameter <clameter@sgi.com> wrote:
> 
>> In various places the kernel maintains arrays of pointers indexed by
>> processor numbers. These are used to locate objects that need to be used
>> when executing on a specirfic processor. Both the slab allocator
>> and the page allocator use these arrays and there the arrays are used in
>> performance critical code. The allocpercpu functionality is a simple
>> allocator to provide these arrays.
> 
> All seems reasonable to me.  The obvious question is "how do we size
> the arena".  We either waste memory or, much worse, run out.
> 
> And running out is a real possibility, I think.  Most people will only
> mount a handful of XFS filesystems.  But some customer will come along
> who wants to mount 5,000, and distributors will need to cater for that,
> but how can they?
> 
> I wonder if we can arrange for the default to be overridden via a
> kernel boot option?
> 
> 
> Another obvious question is "how much of a problem will we have with
> internal fragmentation"?  This might be a drop-dead showstopper.

One problem with a variable sized cpu_alloc area is this comment in bitmap.h:

 * Note that nbits should be always a compile time evaluable constant.
 * Otherwise many inlines will generate horrible code.

I'm guessing that since this will be low use and not performance critical,
we can ignore the "horrible code"?  ;-)

Thanks,
Mike


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-06-04 14:58       ` Mike Travis
@ 2008-06-04 15:11         ` Eric Dumazet
  2008-06-06  0:32           ` Rusty Russell
  2008-06-10 17:33         ` Christoph Lameter
  1 sibling, 1 reply; 139+ messages in thread
From: Eric Dumazet @ 2008-06-04 15:11 UTC (permalink / raw)
  To: Mike Travis
  Cc: Christoph Lameter, akpm, linux-arch, linux-kernel, David Miller,
	Peter Zijlstra, Rusty Russell

Mike Travis wrote:
> Christoph Lameter wrote:
>> On Fri, 30 May 2008, Eric Dumazet wrote:
>>
>>>> +static DEFINE_PER_CPU(UNIT_TYPE, area[UNITS]);
>>>>   
>>> area[] is not guaranteed to be aligned on anything but 4 bytes.
>>>
>>> If someone then needs to call cpu_alloc(8, GFP_KERNEL, 8), it might get an non
>>> aligned result.
>>>
>>> Either you should add an __attribute__((__aligned__(PAGE_SIZE))),
>>> or take into account the real address of area[] in cpu_alloc() to avoid waste
>>> of up to PAGE_SIZE bytes
>>> per cpu.
>> I think cacheline aligning should be sufficient. People should not 
>> allocate large page aligned objects here.
> 
> I'm a bit confused.  Why is DEFINE_PER_CPU_SHARED_ALIGNED() conditioned on
> ifdef MODULE?
> 
>         #ifdef MODULE
>         #define SHARED_ALIGNED_SECTION ".data.percpu"
>         #else
>         #define SHARED_ALIGNED_SECTION ".data.percpu.shared_aligned"
>         #endif
> 
>         #define DEFINE_PER_CPU_SHARED_ALIGNED(type, name)                       \
>                 __attribute__((__section__(SHARED_ALIGNED_SECTION)))            \
>                 PER_CPU_ATTRIBUTES __typeof__(type) per_cpu__##name             \
>                 ____cacheline_aligned_in_smp
> 
> Thanks,
> Mike
> 
> 

Because we had crashes when loading the oprofile module, when a previous
version of oprofile used a DEFINE_PER_CPU_SHARED_ALIGNED variable.

The module loader only takes into account the special section ".data.percpu"
and ignores ".data.percpu.shared_aligned".

I therefore submitted two patches:

1) commit 8b8b498836942c0c855333d357d121c0adeefbd9
oprofile: don't request cache line alignment for cpu_buffer

Alignment was previously requested because cpu_buffer was an [NR_CPUS]
array, to avoid cache line sharing between CPUS.

After commit 608dfddd845da5ab6accef70154c8910529699f7 (oprofile: change
cpu_buffer from array to per_cpu variable ), we dont need to force an
alignement anymore since cpu_buffer sits in per_cpu zone.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Mike Travis <travis@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


2) and commit 	44c81433e8b05dbc85985d939046f10f95901184
per_cpu: fix DEFINE_PER_CPU_SHARED_ALIGNED for modules

Current module loader lookups ".data.percpu" ELF section to perform
per_cpu relocation.  But DEFINE_PER_CPU_SHARED_ALIGNED() uses another
section (".data.percpu.shared_aligned"), currently only handled in
vmlinux.lds, not by module loader.

To correct this problem, instead of adding logic into module loader, or
using at build time a module.lds file for all arches to group
".data.percpu.shared_aligned" into ".data.percpu", just use ".data.percpu"
for modules.

Alignment requirements are correctly handled by ld and module loader.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>




^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-05-30  5:52       ` Rusty Russell
@ 2008-06-04 15:30         ` Mike Travis
  2008-06-05 23:48           ` Rusty Russell
  0 siblings, 1 reply; 139+ messages in thread
From: Mike Travis @ 2008-06-04 15:30 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Christoph Lameter, Eric Dumazet, akpm, linux-arch, linux-kernel,
	David Miller, Peter Zijlstra

Rusty Russell wrote:
> On Friday 30 May 2008 15:20:45 Christoph Lameter wrote:
>> On Fri, 30 May 2008, Eric Dumazet wrote:
>>>> +static DEFINE_PER_CPU(UNIT_TYPE, area[UNITS]);
>>> area[] is not guaranteed to be aligned on anything but 4 bytes.
>>>
>>> If someone then needs to call cpu_alloc(8, GFP_KERNEL, 8), it might get
>>> an non aligned result.
>>>
>>> Either you should add an __attribute__((__aligned__(PAGE_SIZE))),
>>> or take into account the real address of area[] in cpu_alloc() to avoid
>>> waste of up to PAGE_SIZE bytes
>>> per cpu.
>> I think cacheline aligning should be sufficient. People should not
>> allocate large page aligned objects here.
> 
> I vaguely recall there were issues with this in the module code.  They might 
> be gone now, but failing to meet alignment contraints without a big warning 
> would suck.
> 
> But modifying your code to consider the actual alignment is actually pretty 
> trivial, AFAICT.
> 
> Cheers,
> Rusty.

So paraphrasing my earlier email, we should add:

	bitmap_find_free_area(bitmap, nbits, size, align, alignbase)

so that > cacheline alignment is possible?

My thinking is that if we do go to a truly dynamically sized cpu_alloc area,
then allocating PAGE_SIZE units may be both practical and worthwhile...?

Thanks,
Mike

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-02  2:00               ` Rusty Russell
@ 2008-06-04 18:18                 ` Mike Travis
  2008-06-05 23:59                   ` Rusty Russell
  2008-06-09 23:09                   ` Christoph Lameter
  2008-06-10 17:42                 ` Christoph Lameter
  1 sibling, 2 replies; 139+ messages in thread
From: Mike Travis @ 2008-06-04 18:18 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Christoph Lameter, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra


> cpu_local_inc() does all this: it takes the name of a local_t var, and is 
> expected to increment this cpu's version of that.  You ripped this out and 
> called it CPU_INC().

Hi,

I'm attempting to test both approaches to compare the generated object code,
in order to understand the issues involved here.  Here's my code:

        void test_cpu_inc(int *s)
        {
                __CPU_INC(s);
        }

        void test_local_inc(local_t *t)
        {
                __local_inc(THIS_CPU(t));
        }

        void test_cpu_local_inc(local_t *t)
        {
                __cpu_local_inc(t);
        }

But I don't know how I can use cpu_local_inc because the pointer to the object
is not &__get_cpu_var(l):

	#define __cpu_local_inc(l)      cpu_local_inc((l))
	#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var((l))))

At the minimum, we would need a new local_t op to get the correct CPU_ALLOC'd
pointer value for the increment.  These new local_t ops for CPU_ALLOC'd
variables could be implemented with the CPU_XXX primitives, or with just a
base val_to_ptr primitive to replace __get_cpu_var().

I did notice this in local.h:

	 * X86_64: This could be done better if we moved the per cpu data directly
	 * after GS.

... which it now is, so true per_cpu variables could be optimized better as well.

Also, the above cpu_local_wrap(...) adds:

	#define cpu_local_wrap(l)               \
	({                                      \
	        preempt_disable();              \
	        (l);                            \
	        preempt_enable();               \
	})                                      \

... and there isn't a non-preemption version that I can find.

Here is the generated object code:

0000000000000000 <test_cpu_inc>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   48 83 ec 08             sub    $0x8,%rsp
   8:   48 89 7d f8             mov    %rdi,0xfffffffffffffff8(%rbp)
   c:   65 48 ff 45 f8          incq   %gs:0xfffffffffffffff8(%rbp)
  11:   c9                      leaveq
  12:   c3                      retq

0000000000000013 <test_local_inc>:
  13:   55                      push   %rbp
  14:   65 48 8b 05 00 00 00    mov    %gs:0(%rip),%rax        # 1c <test_local_inc+0x9>
  1b:   00
  1c:   48 89 e5                mov    %rsp,%rbp
  1f:   48 ff 04 07             incq   (%rdi,%rax,1)
  23:   c9                      leaveq
  24:   c3                      retq


With a new local_t op, test_local_inc could probably be optimized to the
same instructions as test_cpu_inc.

One other distinction is that CPU_INC increments an arbitrarily sized
variable while local_inc requires a local_t variable, so local_inc may not
be usable in all cases.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-06-04 15:30         ` Mike Travis
@ 2008-06-05 23:48           ` Rusty Russell
  0 siblings, 0 replies; 139+ messages in thread
From: Rusty Russell @ 2008-06-05 23:48 UTC (permalink / raw)
  To: Mike Travis
  Cc: Christoph Lameter, Eric Dumazet, akpm, linux-arch, linux-kernel,
	David Miller, Peter Zijlstra

On Thursday 05 June 2008 01:30:23 Mike Travis wrote:
> Rusty Russell wrote:
> > On Friday 30 May 2008 15:20:45 Christoph Lameter wrote:
> >> On Fri, 30 May 2008, Eric Dumazet wrote:
> >>>> +static DEFINE_PER_CPU(UNIT_TYPE, area[UNITS]);
> >>>
> >>> area[] is not guaranteed to be aligned on anything but 4 bytes.
> >>>
> >>> If someone then needs to call cpu_alloc(8, GFP_KERNEL, 8), it might get
> >>> an non aligned result.
> >>>
> >>> Either you should add an __attribute__((__aligned__(PAGE_SIZE))),
> >>> or take into account the real address of area[] in cpu_alloc() to avoid
> >>> waste of up to PAGE_SIZE bytes
> >>> per cpu.
> >>
> >> I think cacheline aligning should be sufficient. People should not
> >> allocate large page aligned objects here.
> >
> > I vaguely recall there were issues with this in the module code.  They
> > might be gone now, but failing to meet alignment contraints without a big
> > warning would suck.
> >
> > But modifying your code to consider the actual alignment is actually
> > pretty trivial, AFAICT.
> >
> > Cheers,
> > Rusty.
>
> So paraphrasing my earlier email, we should add:
>
> 	bitmap_find_free_area(bitmap, nbits, size, align, alignbase)
>
> so that > cacheline alignment is possible?
>
> My thinking is that if we do go to true dynamically sized cpu_alloc area
> then allocating PAGE_SIZE units may be both practical and worthwhile...?
>
> Thanks,
> Mike

Well, my thinking is that unless we do true dynamic per-cpu, this entire patch 
series is a non-starter :(

Once we have that, we can reopen this.  Then we'll discuss why we're writing a 
new allocator rather than using the existing one :)

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-04 18:18                 ` Mike Travis
@ 2008-06-05 23:59                   ` Rusty Russell
  2008-06-09 19:00                     ` Christoph Lameter
  2008-06-09 23:09                   ` Christoph Lameter
  1 sibling, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-06-05 23:59 UTC (permalink / raw)
  To: Mike Travis
  Cc: Christoph Lameter, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra

On Thursday 05 June 2008 04:18:19 Mike Travis wrote:
> > cpu_local_inc() does all this: it takes the name of a local_t var, and is
> > expected to increment this cpu's version of that.  You ripped this out
> > and called it CPU_INC().
>
> Hi,
>
> I'm attempting to test both approaches to compare the object generated in
> order to understand the issues involved here.  Here's my code:
>
>         void test_cpu_inc(int *s)
>         {
>                 __CPU_INC(s);
>         }
>
>         void test_local_inc(local_t *t)
>         {
>                 __local_inc(THIS_CPU(t));
>         }
>
>         void test_cpu_local_inc(local_t *t)
>         {
>                 __cpu_local_inc(t);
>         }
>
> But I don't know how I can use cpu_local_inc because the pointer to the
> object is not &__get_cpu_var(l):

Yes.  Because the only true per-cpu vars are the static ones, cpu_local_inc() 
only works on identifiers, not arbitrary pointers.  Once this is fixed, we 
should be enhancing the infrastructure to allow that (AFAICT it's not too 
hard, but we should add an __percpu marker for sparse).

> At the minimum, we would need a new local_t op to get the correct
> CPU_ALLOC'd pointer value for the increment.  These new local_t ops for
> CPU_ALLOC'd variables could use CPU_XXX primitives to implement them, or
> just a base val_to_ptr primitive to replace __get_cpu_var().

I think the latter: __get_cpu_ptr() perhaps?
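
Something like this, as a sketch built from helpers already mentioned in
this thread (SHIFT_PERCPU_PTR and my_cpu_offset; the name is the one
suggested above):

	#define __get_cpu_ptr(ptr)	SHIFT_PERCPU_PTR((ptr), my_cpu_offset)

cpu_local_inc() on a cpu_alloc'd pointer would then become
local_inc(__get_cpu_ptr(l)) under the usual preemption protection.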

> I did notice this in local.h:
>
> 	 * X86_64: This could be done better if we moved the per cpu data directly
> 	 * after GS.
>
> ... which it now is, so true per_cpu variables could be optimized better as
> well.

Indeed.

>
> Also, the above cpu_local_wrap(...) adds:
>
> 	#define cpu_local_wrap(l)               \
> 	({                                      \
> 	        preempt_disable();              \
> 	        (l);                            \
> 	        preempt_enable();               \
> 	})                                      \
>
> ... and there isn't a non-preemption version that I can find.

Yes, this should be fixed.  I thought i386 had optimized versions pre-merge,
but I was wrong (%gs for per-cpu came later, and no one cleaned up these
naive versions).  Did you want me to write them?

I actually think that using local_t == atomic_t is better than 
preempt_disable/enable for most archs which can't do atomic deref-and-inc.

> One other distinction is CPU_INC increments an arbitrary sized variable
> while local_inc requires a local_t variable.  This may not make it usable
> in all cases.

You might be right, but note that local_t is 64 bit on 64-bit platforms.  And 
speculation of possible use cases isn't a good reason to rip out working 
infrastructure :)

Cheers,
Rusty.


>
> Thanks,
> Mike



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-06-04 15:11         ` Eric Dumazet
@ 2008-06-06  0:32           ` Rusty Russell
  0 siblings, 0 replies; 139+ messages in thread
From: Rusty Russell @ 2008-06-06  0:32 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Mike Travis, Christoph Lameter, akpm, linux-arch, linux-kernel,
	David Miller, Peter Zijlstra

On Thursday 05 June 2008 01:11:00 Eric Dumazet wrote:
> Mike Travis a écrit :
> > I'm a bit confused.  Why is DEFINE_PER_CPU_SHARED_ALIGNED() conditioned
> > on ifdef MODULE?
> Because we had crashes when loading oprofile module, when a previous
> version of oprofile used to use DEFINE_PER_CPU_SHARED_ALIGNED variable
>
> module loader only takes into account the special section ".data.percpu"
> and ignores ".data.percpu.shared_aligned"
>
> I therefore submitted two patches :

Put one way, putting page-aligned per-cpu data in a separate section is a 
space-saving hack: one which is not really required for modules because of 
the low frequency of such variables.  Put another way, not respecting 
the .data.percpu.shared_aligned section in modules is a bug.

But a comment would probably be nice!
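
Something along these lines, right next to the #ifdef (wording is only a
suggestion):

	#ifdef MODULE
	/*
	 * The module loader only relocates ".data.percpu" and knows nothing
	 * about ".data.percpu.shared_aligned", so modules keep their
	 * cacheline-aligned per cpu variables in the plain per cpu section
	 * and rely on the alignment attribute alone.
	 */
	#define SHARED_ALIGNED_SECTION ".data.percpu"
	#else
	#define SHARED_ALIGNED_SECTION ".data.percpu.shared_aligned"
	#endif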

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-06-04 15:07   ` Mike Travis
@ 2008-06-06  5:33     ` Eric Dumazet
  2008-06-06 13:08       ` Mike Travis
                         ` (2 more replies)
  0 siblings, 3 replies; 139+ messages in thread
From: Eric Dumazet @ 2008-06-06  5:33 UTC (permalink / raw)
  To: Mike Travis, Christoph Lameter
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Peter Zijlstra, Rusty Russell

Mike Travis wrote:
> Andrew Morton wrote:
>> On Thu, 29 May 2008 20:56:20 -0700 Christoph Lameter <clameter@sgi.com> wrote:
>>
>>> In various places the kernel maintains arrays of pointers indexed by
>>> processor numbers. These are used to locate objects that need to be used
>>> when executing on a specirfic processor. Both the slab allocator
>>> and the page allocator use these arrays and there the arrays are used in
>>> performance critical code. The allocpercpu functionality is a simple
>>> allocator to provide these arrays.
>> All seems reasonable to me.  The obvious question is "how do we size
>> the arena".  We either waste memory or, much worse, run out.
>>
>> And running out is a real possibility, I think.  Most people will only
>> mount a handful of XFS filesystems.  But some customer will come along
>> who wants to mount 5,000, and distributors will need to cater for that,
>> but how can they?
>>
>> I wonder if we can arrange for the default to be overridden via a
>> kernel boot option?
>>
>>
>> Another obvious question is "how much of a problem will we have with
>> internal fragmentation"?  This might be a drop-dead showstopper.

Christoph & Mike,

Please forgive me if I beat a dead horse, but this percpu stuff
should find its way.

I wonder why you absolutely want to have only one chunk holding
all percpu variables, static(vmlinux) & static(modules)
& dynamically allocated.

It's *not* possible to put an arbitrary limit on this global zone.
You'll always find somebody to break that limit. This is the point
we must solve before coding anything.

Have you considered using a list of fixed size chunks, each chunk
having its own bitmap?

We only want fixed offsets between CPU locations. For a given variable,
we MUST find the addresses for all CPUs using the same offset table.
(Then we can optimize things on x86, using the %gs/%fs register instead
of a table lookup.)

We could choose the chunk size at compile time, depending on various
parameters (32/64 bit arches, or hugepage sizes on NUMA),
and a minimum value (ABI guarantee).

On x86_64 && NUMA we could use 2 Mbyte chunks, while
on x86_32 or non NUMA we should probably use 64 Kbyte chunks.

At boot time, we set up the first chunk (chunk 0), copy
.data.percpu into this chunk for each possible cpu, and
build the bitmap for future dynamic/module percpu allocations.
So we still have the restriction that sizeofsection(.data.percpu)
should fit in chunk 0. Not a problem in practice.

Then, if we need to expand the percpu zone for a heavy duty machine
and chunk 0 is already filled, we can add as many 2M / 64K
chunks as we need.

This would limit a single dynamic percpu allocation to 64 kbytes,
so huge users should probably still use a different allocator
(like the oprofile alloc_cpu_buffers() function).
But at least we no longer limit the total size of the percpu area.

I understand you want percpu data offset to 0, but that is for
static percpu data (the pda being included in it, to share %gs).

For dynamically allocated percpu variables (including module
".data.percpu"), nothing forces you to have low offsets
relative to the %gs/%fs register. Access to these variables
will be register indirect anyway (e.g. %gs:(%rax)).


1) NUMA case

For a 64 bit NUMA arch, use a chunk size of 2 Mbytes.

Allocate 2Mb for each possible processor (on its preferred memory
node), and compute values to set up the offset_of_cpu[NR_CPUS] array.

Chunk 0
CPU 0 : virtual address XXXXXX        
CPU 1 : virtual address XXXXXX + offset_of_cpu[1]
...
CPU n : virtual address XXXXXX + offset_of_cpu[n]
+ a shared bitmap


For the next chunks, we could use the vmalloc() zone to find
nr_possible_cpus virtual address ranges where we can map
a 2Mb page per possible cpu, as long as we respect the relative
delta between the cpu blocks that was computed when
chunk 0 was set up.

Chunk 1..n
CPU 0 : virtual address YYYYYYYYYYYYYY   
CPU 1 : virtual address YYYYYYYYYYYYYY + offset_of_cpu[1]
...
CPU n : virtual address YYYYYYYYYYYYYY + offset_of_cpu[n]
+ a shared bitmap (32Kbytes if 8 bytes granularity in allocator)

For a variable located in chunk 0, its 'address' relative to current
cpu %gs will be some number between [0 and 2^20-1]

For a variable located in chunk 1, its 'address' relative to current
cpu %gs will be some number between
[YYYYYYYYYYYYYY - XXXXXX  and YYYYYYYYYYYYYY - XXXXXX + 2^20 - 1],
not necessarily [2^20 to 2^21 - 1]


Chunk 0 would use normal memory (no vmap TLB cost); only the next ones need vmalloc().

So the extra TLB cost would only be taken for very special NUMA setups
(only if using a lot of percpu allocations)

Also, using a 2Mb page granularity probably wastes about 2Mb per cpu, but
this is nothing for NUMA machines :)

2) SMP && !NUMA

On non NUMA machines, we don't need vmalloc games, since we can allocate
the chunk space using contiguous memory (size = nr_possible_cpus * 64 Kbytes).

offset_of_cpu[N] = N * CHUNK_SIZE

(On a 4 CPU x86_32 machine, allocate a 256 Kbyte block then divide it into
64 kb blocks.)
If this order-6 allocation fails, then fall back to vmalloc(), but most
percpu allocations happen at boot time, when memory is not yet fragmented...


3) UP case: fall back to standard allocators. No need for bitmaps.


NUMA special casing can be implemented later of course...
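
A minimal sketch of the lookup this layout implies (structure and names
here are only for illustration: every chunk records CPU 0's base address,
and all chunks reuse the offset table computed when chunk 0 was set up):

	struct pcpu_chunk {
		void		*cpu0_base;	/* CPU 0's copy of this chunk */
		unsigned long	*map;		/* shared allocation bitmap   */
	};

	static unsigned long offset_of_cpu[NR_CPUS];

	static inline void *chunk_ptr(struct pcpu_chunk *chunk,
				      unsigned long off, int cpu)
	{
		/* same 'off' for every cpu; only the per-cpu delta differs */
		return chunk->cpu0_base + off + offset_of_cpu[cpu];
	}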

Thanks for reading



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-06-06  5:33     ` Eric Dumazet
@ 2008-06-06 13:08       ` Mike Travis
  2008-06-08  6:00       ` Rusty Russell
  2008-06-09 18:44       ` Christoph Lameter
  2 siblings, 0 replies; 139+ messages in thread
From: Mike Travis @ 2008-06-06 13:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Lameter, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Peter Zijlstra, Rusty Russell, Ingo Molnar,
	Jack Steiner

Eric Dumazet wrote:
> Mike Travis a écrit :
>> Andrew Morton wrote:
>>> On Thu, 29 May 2008 20:56:20 -0700 Christoph Lameter <clameter@sgi.com> wrote:
>>>
>>>> In various places the kernel maintains arrays of pointers indexed by
>>>> processor numbers. These are used to locate objects that need to be used
>>>> when executing on a specirfic processor. Both the slab allocator
>>>> and the page allocator use these arrays and there the arrays are used in
>>>> performance critical code. The allocpercpu functionality is a simple
>>>> allocator to provide these arrays.
>>> All seems reasonable to me.  The obvious question is "how do we size
>>> the arena".  We either waste memory or, much worse, run out.
>>>
>>> And running out is a real possibility, I think.  Most people will only
>>> mount a handful of XFS filesystems.  But some customer will come along
>>> who wants to mount 5,000, and distributors will need to cater for that,
>>> but how can they?
>>>
>>> I wonder if we can arrange for the default to be overridden via a
>>> kernel boot option?
>>>
>>>
>>> Another obvious question is "how much of a problem will we have with
>>> internal fragmentation"?  This might be a drop-dead showstopper.
> 
> Christoph & Mike,
> 
> Please forgive me if I beat a dead horse, but this percpu stuff
> should find its way.
> 
> I wonder why you absolutely want to have only one chunk holding
> all percpu variables, static(vmlinux) & static(modules)
> & dynamically allocated.
> 
> Its *not* possible to put an arbitrary limit to this global zone.
> You'll allways find somebody to break this limit. This is the point
> we must solve, before coding anything.
> 
> Have you considered using a list of fixed size chunks, each chunk
> having its own bitmap ?
> 
> We only want fix offsets between CPU locations. For a given variable,
> we MUST find addresses for all CPUS looking at the same offset table.
> (Then we can optimize things on x86, using %gs/%fs register, instead
> of a table lookup)
> 
> We could chose chunk size at compile time, depending on various
> parameters (32/64 bit arches, or hugepage sizes on NUMA),
> and a minimum value (ABI guarantee)
> 
> On x86_64 && NUMA we could use 2 Mbytes chunks, while
> on x86_32 or non NUMA we should probably use 64 Kbytes.
> 
> At boot time, we setup the first chunk (chunk 0) and copy 
> .data.percpu on this chunk, for each possible cpu, and we
> build the bitmap for future dynamic/module percpu allocations.
> So we still have the restriction that sizeofsection(.data.percpu)
> should fit in the chunk 0. Not a problem in practice.
> 
> Then if we need to expand percpu zone for heavy duty machine,
> and chunk 0 is already filled, we can add as many 2 M/ 64K 
> chunks we need.
> 
> This would limit the dynamic percpu allocation to 64 kbytes for
> a given variable, so huge users should probably still use a
> different allocator (like oprofile alloc_cpu_buffers() function)
> But at least we dont anymore limit the total size of percpu area.
> 
> I understand you want to offset percpu data to 0, but for
> static percpu data. (pda being included in, to share %gs)
> 
> For dynamically allocated percpu variables (including modules
> ".data.percpu"), nothing forces you to have low offsets,
> relative to %gs/%fs register. Access to these variables
> will be register indirect based anyway (eg %gs:(%rax) )
> 
> 
> 1) NUMA case
> 
> For a 64 bit NUMA arch, chunk size of 2Mbytes
> 
> Allocates 2Mb for each possible processor (on its preferred memory
> node), and compute values to setup offset_of_cpu[NR_CPUS] array.
> 
> Chunk 0
> CPU 0 : virtual address XXXXXX        
> CPU 1 : virtual address XXXXXX + offset_of_cpu[1]
> ...
> CPU n : virtual address XXXXXX + offset_of_cpu[n]
> + a shared bitmap
> 
> 
> For next chunks, we could use vmalloc() zone to find 
> nr_possible_cpus virtual addresses ranges where you can map
> a 2Mb page per possible cpu, as long as we respect the relative
> delta between each cpu block, that was computed when
> chunk 0 was setup.
> 
> Chunk 1..n
> CPU 0 : virtual address YYYYYYYYYYYYYY   
> CPU 1 : virtual address YYYYYYYYYYYYYY + offset_of_cpu[1]
> ...
> CPU n : virtual address YYYYYYYYYYYYYY + offset_of_cpu[n]
> + a shared bitmap (32Kbytes if 8 bytes granularity in allocator)
> 
> For a variable located in chunk 0, its 'address' relative to current
> cpu %gs will be some number between [0 and 2^20-1]
> 
> For a variable located in chunk 1, its 'address' relative to current
> cpu %gs will be some number between
> [YYYYYYYYYYYYYY - XXXXXX  and YYYYYYYYYYYYYY - XXXXXX + 2^20 - 1],
> not necessarly [2^20 to 2^21 - 1]
> 
> 
> Chunk 0 would use normal memory (no vmap TLB cost), only next ones need vmalloc().
> 
> So the extra TLB cost would only be taken for very special NUMA setups
> (only if using a lot of percpu allocations)
> 
> Also, using a 2Mb page granularity probably wastes about 2Mb per cpu, but
> this is nothing for NUMA machines :)
> 
> 2) SMP && !NUMA
> 
> On non NUMA machines, we dont need vmalloc games, since we can allocate
> chunk space using contiguous memory, (size = nr_possible_cpus*64Kbytes)
> 
> offset_of_cpu[N] = N * CHUNK_SIZE
> 
> (On a 4 CPU x86_32 machine, allocate a 256 Kbyte bloc then divide it in
> 64 kb blocs)
> If this order-6 allocation fails, then fallback to vmalloc(), but most
> percpu allocations happens at boot time, when memory is not yet fragmented...
> 
> 
> 3) UP case : fallback to standard allocators. No need for bitmaps.
> 
> 
> NUMA special casing can be implemented later of course...
> 
> Thanks for reading
> 
> 

Wow!  Thanks for the detail!  It's extremely useful (to me at least)
to see it spelled out.

Since Christoph is still on vacation I'll try to summarize where we are
at the moment.  (Besides being stuck on a boot up problem with the
%gs based percpu variables, that is. ;-)

Yes, the problem is we need to use virtual addresses to expand the
percpu areas since each cpu needs the same fixed offset to the newly
allocated variables.  This was in the prior (v2) version of cpu_alloc
so I'm looking at pulling that forward.  And I also figured that the
size of the expansion allocations should be based on the system size
to minimize the effect on small systems (seems to be my life the
past 6 months... ;-)

I'm also looking at integrating more into the already present
infrastructure (thanks Rusty!) so there are less "diffs" (and less
new testing needed.)  And of course, there's the complexities
of submitting patches to many architectures simultaneously.

Hopefully, I'll have something for review soon.

Thanks again,
Mike


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-06-06  5:33     ` Eric Dumazet
  2008-06-06 13:08       ` Mike Travis
@ 2008-06-08  6:00       ` Rusty Russell
  2008-06-09 18:44       ` Christoph Lameter
  2 siblings, 0 replies; 139+ messages in thread
From: Rusty Russell @ 2008-06-08  6:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Mike Travis, Christoph Lameter, Andrew Morton, linux-arch,
	linux-kernel, David Miller, Peter Zijlstra

On Friday 06 June 2008 15:33:22 Eric Dumazet wrote:
> 1) NUMA case
>
> For a 64 bit NUMA arch, chunk size of 2Mbytes
>
> Allocates 2Mb for each possible processor (on its preferred memory
> node), and compute values to setup offset_of_cpu[NR_CPUS] array.
>
> Chunk 0
> CPU 0 : virtual address XXXXXX
> CPU 1 : virtual address XXXXXX + offset_of_cpu[1]
> ...
> CPU n : virtual address XXXXXX + offset_of_cpu[n]
> + a shared bitmap
>
>
> For next chunks, we could use vmalloc() zone to find
> nr_possible_cpus virtual addresses ranges where you can map
> a 2Mb page per possible cpu, as long as we respect the relative
> delta between each cpu block, that was computed when
> chunk 0 was setup.
>
> Chunk 1..n
> CPU 0 : virtual address YYYYYYYYYYYYYY
> CPU 1 : virtual address YYYYYYYYYYYYYY + offset_of_cpu[1]
> ...
> CPU n : virtual address YYYYYYYYYYYYYY + offset_of_cpu[n]
> + a shared bitmap (32Kbytes if 8 bytes granularity in allocator)
>
> For a variable located in chunk 0, its 'address' relative to current
> cpu %gs will be some number between [0 and 2^20-1]
>
> For a variable located in chunk 1, its 'address' relative to current
> cpu %gs will be some number between
> [YYYYYYYYYYYYYY - XXXXXX  and YYYYYYYYYYYYYY - XXXXXX + 2^20 - 1],
> not necessarly [2^20 to 2^21 - 1]
>
>
> Chunk 0 would use normal memory (no vmap TLB cost), only next ones need
> vmalloc().
>
> So the extra TLB cost would only be taken for very special NUMA setups
> (only if using a lot of percpu allocations)
>
> Also, using a 2Mb page granularity probably wastes about 2Mb per cpu, but
> this is nothing for NUMA machines :)

If you're prepared to have mappings for chunk 0, you can simply make it 
virtually linear and creating a new chunk is simple.  If not, you need to 
reserve the virtual address space(s) for future mappings.  Otherwise you're 
unlikely to get the same layout for allocations.

This is not a show-stopper: we've lived with limited vmalloc room since 
forever.  It just has to be sufficient.

Otherwise, your analysis is correct, if a little verbose :)

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-06-06  5:33     ` Eric Dumazet
  2008-06-06 13:08       ` Mike Travis
  2008-06-08  6:00       ` Rusty Russell
@ 2008-06-09 18:44       ` Christoph Lameter
  2008-06-09 19:11         ` Andi Kleen
  2 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-06-09 18:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Mike Travis, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Peter Zijlstra, Rusty Russell

On Fri, 6 Jun 2008, Eric Dumazet wrote:

> Please forgive me if I beat a dead horse, but this percpu stuff
> should find its way.

Definitely, and it's very complex, so any more eyes on this are appreciated.

> I wonder why you absolutely want to have only one chunk holding
> all percpu variables, static(vmlinux) & static(modules)
> & dynamically allocated.
> 
> Its *not* possible to put an arbitrary limit to this global zone.
> You'll allways find somebody to break this limit. This is the point
> we must solve, before coding anything.

The problem is that offsets relative to %gs or %fs are limited by the
small memory model that is chosen. We cannot have an offset larger than
2GB. So we must have a linear address range and cannot use separate chunks
of memory. If we do not use the segment register then we cannot do atomic
(wrt interrupt) cpu ops.

> Have you considered using a list of fixed size chunks, each chunk
> having its own bitmap ?

Mike has done so and then I had to tell him what I just told you.

> On x86_64 && NUMA we could use 2 Mbytes chunks, while
> on x86_32 or non NUMA we should probably use 64 Kbytes.

Right that is what cpu_alloc v2 did. It created a virtual mapping and 
populated it on demand with 2MB PMD entries.

> I understand you want to offset percpu data to 0, but for
> static percpu data. (pda being included in, to share %gs)
> 
> For dynamically allocated percpu variables (including modules
> ".data.percpu"), nothing forces you to have low offsets,
> relative to %gs/%fs register. Access to these variables
> will be register indirect based anyway (eg %gs:(%rax) )

The relative-to-0 stuff comes in at the x86_64 level because we want to
unify pda and percpu accesses. pda accesses have been relative to 0, and in
particular the stack canary in glibc directly accesses the pda at a
certain offset. So we must be zero based in order to preserve
compatibility with glibc.

> Chunk 0 would use normal memory (no vmap TLB cost), only next ones need vmalloc().

Normal memory uses 2MB TLB entries. There is therefore no overhead in mapping
the percpu areas with 2MB TLB entries. So we do not need to be that complicated.

What v2 did was allocate an area of n * MAX_VIRT_PER_CPU_SIZE in vmalloc
space and then dynamically populate 2MB segments as needed. The MAX
size was 128MB or so.

We could either do the same on i386 or use 4kb mappings (then we can 
directly use the vmalloc functionality). But then there would be 
additional TLB overhead.

We have similar 2MB virtual mapping tricks for the virtual memmap. 
Basically we can copy the functions and customize them for the virtual per 
cpu areas (Mike is hopefully listening and reading the V2 patch ....)


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-05 23:59                   ` Rusty Russell
@ 2008-06-09 19:00                     ` Christoph Lameter
  2008-06-09 23:27                       ` Rusty Russell
  0 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-06-09 19:00 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Mike Travis, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra

On Fri, 6 Jun 2008, Rusty Russell wrote:

> > Also, the above cpu_local_wrap(...) adds:
> >
> > 	#define cpu_local_wrap(l)               \
> > 	({                                      \
> > 	        preempt_disable();              \
> > 	        (l);                            \
> > 	        preempt_enable();               \
> > 	})                                      \
> >
> > ... and there isn't a non-preemption version that I can find.
> 
> Yes, this should be fixed.  I thought i386 had optimized versions pre-merge, 
> but I was wrong (%gs for per-cpu came later, and noone cleaned up these naive 
> versions).  Did you want me to write them?

How can that be fixed? You have no atomic instruction that calculates the 
per cpu address in one go. And as long as that is the case you need to 
disable preempt. Otherwise you may increment the per cpu variable of 
another processor because the process was rescheduled after the address 
was calculated but before the increment was done.

> > One other distinction is CPU_INC increments an arbitrary sized variable
> > while local_inc requires a local_t variable.  This may not make it usable
> > in all cases.
> 
> You might be right, but note that local_t is 64 bit on 64-bit platforms.  And 
> speculation of possible use cases isn't a good reason to rip out working 
> infrastructure :)

It's fundamentally broken because of the preemption issue. This is
also why local_t is rarely used.



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-06-09 18:44       ` Christoph Lameter
@ 2008-06-09 19:11         ` Andi Kleen
  2008-06-09 20:15           ` Eric Dumazet
  0 siblings, 1 reply; 139+ messages in thread
From: Andi Kleen @ 2008-06-09 19:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Eric Dumazet, Mike Travis, Andrew Morton, linux-arch,
	linux-kernel, David Miller, Peter Zijlstra, Rusty Russell

Christoph Lameter <clameter@sgi.com> writes:

> The problem is that offsets relative to %gs or %fs are limited by the 
> small memory model that is chosen.

Actually they are not. If you really want you can do 
movabs $64bit,%reg ; op ...,%gs:(%reg) 
It's just not very efficient compared to small (or rather kernel) model
and also older binutils didn't support large model.

-Andi

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access
  2008-06-09 19:11         ` Andi Kleen
@ 2008-06-09 20:15           ` Eric Dumazet
  0 siblings, 0 replies; 139+ messages in thread
From: Eric Dumazet @ 2008-06-09 20:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Christoph Lameter, Mike Travis, Andrew Morton, linux-arch,
	linux-kernel, David Miller, Peter Zijlstra, Rusty Russell

Andi Kleen wrote:
> Christoph Lameter <clameter@sgi.com> writes:
> 
>> The problem is that offsets relative to %gs or %fs are limited by the 
>> small memory model that is chosen.
> 
> Actually they are not. If you really want you can do 
> movabs $64bit,%reg ; op ...,%gs:(%reg) 
> It's just not very efficient compared to small (or rather kernel) model
> and also older binutils didn't support large model.
> 

I am not sure Christoph was referring to actual instructions.

I was suggesting using for static percpu (vmlinux or modules) :

vmlinux : (offset31 computed by linker at vmlinux link edit time)
incl  %gs:offset31

modules : (offset31 computed at module load time by module loader)
incl %gs:offset31

(If we make sure all this stuff is allocated in first chunk)

And for dynamic percpu :

movq   field(%rdi),%rax
incl    %gs:(%rax)   /* full 64bits 'offsets' */

I understood (but might be wrong again) that %gs itself could not be used with an offset > 2GB, because
of the way the %gs segment is set up. So in the 'dynamic percpu' case, %rax should not exceed 2^31.
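
At the C level the dynamic case corresponds roughly to the sketch below
(names are illustrative, not the patchset's API; a real object would get
its per cpu offset from the cpu allocator):

struct my_obj {
	unsigned long pcpu_offset;	/* the 'field' in the asm above */
};

static inline void my_obj_inc_stat(struct my_obj *obj)
{
	/* movq field(%rdi),%rax ; incl %gs:(%rax) */
	asm volatile("incl %%gs:(%0)"
		     : : "r" (obj->pcpu_offset)
		     : "memory");
}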






^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-04 18:18                 ` Mike Travis
  2008-06-05 23:59                   ` Rusty Russell
@ 2008-06-09 23:09                   ` Christoph Lameter
  1 sibling, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-06-09 23:09 UTC (permalink / raw)
  To: Mike Travis
  Cc: Rusty Russell, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra

On Wed, 4 Jun 2008, Mike Travis wrote:

> 0000000000000013 <test_local_inc>:
>   13:   55                      push   %rbp
>   14:   65 48 8b 05 00 00 00    mov    %gs:0(%rip),%rax        # 1c <test_local_inc+0x9>
>   1b:   00
>   1c:   48 89 e5                mov    %rsp,%rbp
>   1f:   48 ff 04 07             incq   (%rdi,%rax,1)
>   23:   c9                      leaveq
>   24:   c3                      retq

Note also that the address calculation occurs before the incq. That is why 
disabling preemption is required; otherwise the processor may change 
between the determination of the per cpu area address and the increment.

The local_t operations could be modified to avoid the preemption issues 
with the zero based patches applied. Then there would still be the 
inflexibility of not being able to increment an arbitrary variable.

I think it is also bad to treat a per cpu variable like an atomic. It's not 
truly atomic, nor are strictly atomic accesses used. It is fine to use 
regular operations on the per cpu variable provided one has either 
disabled preemption or interrupts. The per cpu atomic-wrt-interrupt ops 
are only useful when preemption and/or interrupts are off.
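
To illustrate the point about regular operations (a sketch; my_stat is a
hypothetical per cpu variable):

#include <linux/percpu.h>

DEFINE_PER_CPU(unsigned long, my_stat);		/* hypothetical */

/*
 * In a section that already runs with preemption (or interrupts) disabled,
 * a plain access to the per cpu variable is fine and needs no special
 * macro.
 */
static void update_stat(void)		/* caller has preemption disabled */
{
	__get_cpu_var(my_stat)++;
}

/*
 * Only in preemptible context is something more needed: either an explicit
 * preempt_disable()/preempt_enable() pair or a single-instruction operation
 * like the patchset's CPU_INC().
 */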


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-09 19:00                     ` Christoph Lameter
@ 2008-06-09 23:27                       ` Rusty Russell
  2008-06-09 23:54                         ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-06-09 23:27 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mike Travis, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra

On Tuesday 10 June 2008 05:00:36 Christoph Lameter wrote:
> On Fri, 6 Jun 2008, Rusty Russell wrote:
> > > Also, the above cpu_local_wrap(...) adds:
> > >
> > > 	#define cpu_local_wrap(l)               \
> > > 	({                                      \
> > > 	        preempt_disable();              \
> > > 	        (l);                            \
> > > 	        preempt_enable();               \
> > > 	})                                      \
> > >
> > > ... and there isn't a non-preemption version that I can find.
> >
> > Yes, this should be fixed.  I thought i386 had optimized versions
> > pre-merge, but I was wrong (%gs for per-cpu came later, and noone cleaned
> > up these naive versions).  Did you want me to write them?
>
> How can that be fixed? You have no atomic instruction that calculates the
> per cpu address in one go.

Huh?  "incl %fs:varname" does exactly this.

> And as long as that is the case you need to 
> disable preempt. Otherwise you may increment the per cpu variable of
> another processor because the process was rescheduled after the address
> was calculated but before the increment was done.

But of course, that is not a problem.  You make local_t an atomic_t, and then 
it doesn't matter which CPU you incremented.
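
For reference, the generic implementation already does essentially this
(a trimmed sketch in the style of asm-generic/local.h of that era):

#include <asm/atomic.h>

typedef struct {
	atomic_long_t a;
} local_t;

#define local_inc(l)	atomic_long_inc(&(l)->a)

/*
 * With this representation it does not matter which CPU the increment
 * lands on: the operation is atomic, so a counter summed over all CPUs
 * stays correct even if the task migrated between address calculation
 * and increment.
 */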

By definition if the caller cared, they would have had preemption disabled.

Hope that clarifies,
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-09 23:27                       ` Rusty Russell
@ 2008-06-09 23:54                         ` Christoph Lameter
  2008-06-10  2:56                           ` Rusty Russell
  0 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-06-09 23:54 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Mike Travis, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra

On Tue, 10 Jun 2008, Rusty Russell wrote:

> > > Yes, this should be fixed.  I thought i386 had optimized versions
> > > pre-merge, but I was wrong (%gs for per-cpu came later, and noone cleaned
> > > up these naive versions).  Did you want me to write them?
> >
> > How can that be fixed? You have no atomic instruction that calculates the
> > per cpu address in one go.
> 
> Huh?  "incl %fs:varname" does exactly this.

Right, that is what the cpu alloc patches do. So you could implement 
cpu_local_inc on top of some of the cpu alloc patches.

> > And as long as that is the case you need to 
> > disable preempt. Otherwise you may increment the per cpu variable of
> > another processor because the process was rescheduled after the address
> > was calculated but before the increment was done.
> 
> But of course, that is not a problem.  You make local_t an atomic_t, and then 
> it doesn't matter which CPU you incremented.

But then the whole point of local_t is gone. Why not use atomic_t in the 
first place?
 
> By definition if the caller cared, they would have had premption disabled.

There are numerous instances where the caller does not care about 
preemption. It's just important that a per cpu counter is incremented in 
the least intrusive way. See e.g. the VM event counters.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-09 23:54                         ` Christoph Lameter
@ 2008-06-10  2:56                           ` Rusty Russell
  2008-06-10  3:18                             ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-06-10  2:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mike Travis, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra

On Tuesday 10 June 2008 09:54:09 Christoph Lameter wrote:
> On Tue, 10 Jun 2008, Rusty Russell wrote:
> > > > Yes, this should be fixed.  I thought i386 had optimized versions
> > > > pre-merge, but I was wrong (%gs for per-cpu came later, and noone
> > > > cleaned up these naive versions).  Did you want me to write them?
> > >
> > > How can that be fixed? You have no atomic instruction that calculates
> > > the per cpu address in one go.
> >
> > Huh?  "incl %fs:varname" does exactly this.
>
> Right that is what the cpu alloc patches do. So you could implement
> cpu_local_inc on top of some of the cpu alloc patches.

Or you could just implement it today as a standalone patch.

> > > And as long as that is the case you need to
> > > disable preempt. Otherwise you may increment the per cpu variable of
> > > another processor because the process was rescheduled after the address
> > > was calculated but before the increment was done.
> >
> > But of course, that is not a problem.  You make local_t an atomic_t, and
> > then it doesn't matter which CPU you incremented.
>
> But then the whole point of local_t is gone. Why not use atomic_t in the
> first place?

Because some archs can do better.

> > By definition if the caller cared, they would have had premption
> > disabled.
>
> There are numerous instances where the caller does not care about
> preemption. Its just important that one per cpu counter is increment in
> the least intrusive way. See f.e. the VM event counters.

Yes, and that's exactly the point.  The VM event counters are exactly a case 
where you should have used cpu_local_inc.

Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-10  2:56                           ` Rusty Russell
@ 2008-06-10  3:18                             ` Christoph Lameter
  2008-06-11  0:03                               ` Rusty Russell
  0 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-06-10  3:18 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Mike Travis, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra

On Tue, 10 Jun 2008, Rusty Russell wrote:

> > Right that is what the cpu alloc patches do. So you could implement
> > cpu_local_inc on top of some of the cpu alloc patches.
> 
> Or you could just implement it today as a standalone patch.

You need at least the zero basing to enable the use of the segment 
register on x86_64.

> > But then the whole point of local_t is gone. Why not use atomic_t in the
> > first place?
> 
> Because some archs can do better.

The argument does not make any sense. First you want to use atomic_t, then 
you do not?

> > > By definition if the caller cared, they would have had premption
> > > disabled.
> >
> > There are numerous instances where the caller does not care about
> > preemption. Its just important that one per cpu counter is increment in
> > the least intrusive way. See f.e. the VM event counters.
> 
> Yes, and that's exactly the point.  The VM event counters are exactly a case 
> where you should have used cpu_local_inc.

I tried it and it did not give any benefit; at first it even failed due to 
bugs because local_t did not disable preemption... This led to Andi fixing 
local_t.

But with the preempt disabling I could not discern what the benefit 
would be.

CPU_INC does not require disabling of preempt and the cpu alloc patches 
shorten the code sequence to increment a VM counter significantly.

Here is the header from the patch. How would cpu_local_inc be able to do 
better unless you adopt this patchset and add a shim layer?

Subject: VM statistics: Use CPU ops

The use of CPU ops here avoids the offset calculations that we used to 
have to do with per cpu operations. The result of this patch is that event 
counters are coded with a single instruction the following way:

        incq   %gs:offset(%rip)

Without these patches this was:

        mov    %gs:0x8,%rdx
        mov    %eax,0x38(%rsp)
        mov    xxx(%rip),%eax
        mov    %eax,0x48(%rsp)
        mov    varoffset,%rax
        incq   0x110(%rax,%rdx,1)
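
At the C level the conversion is roughly the following (a hedged sketch
paraphrasing mm/vmstat.c and the patch; the exact spelling of the CPU_INC
invocation in the patchset may differ):

#include <linux/percpu.h>
#include <linux/vmstat.h>	/* enum vm_event_item, vm_event_states */

/* Before: per cpu offset lookup plus explicit preemption handling */
static inline void count_vm_event_old(enum vm_event_item item)
{
	get_cpu_var(vm_event_states).event[item]++;
	put_cpu_var(vm_event_states);
}

/* After: one segment-relative increment, no preempt_disable() needed */
static inline void count_vm_event_new(enum vm_event_item item)
{
	CPU_INC(vm_event_states.event[item]);	/* incq %gs:offset(%rip) */
}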



^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [patch 01/41] cpu_alloc: Increase percpu area size to 128k
  2008-06-02 17:58   ` Luck, Tony
  2008-06-02 23:48     ` Rusty Russell
@ 2008-06-10 17:22     ` Christoph Lameter
  2008-06-10 19:54       ` Luck, Tony
  1 sibling, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-06-10 17:22 UTC (permalink / raw)
  To: Luck, Tony
  Cc: akpm, linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

On Mon, 2 Jun 2008, Luck, Tony wrote:

> > The per cpu allocator requires more per cpu space and we are already near
> > the limit on IA64. Increase the maximum size of the IA64 per cpu area from
> > 64K to 128K.
> 
> > -#define PERCPU_PAGE_SHIFT	16	/* log2() of max. size of per-CPU area */
> > +#define PERCPU_PAGE_SHIFT	17	/* log2() of max. size of per-CPU area */
> 
> Don't you need some more changes to the alt_dtlb_miss handler in
> ivt.S for this to work?  128K is not a supported pagesize on any
> processor model.

Ok so this needs to be 18?


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-06-04 14:58       ` Mike Travis
  2008-06-04 15:11         ` Eric Dumazet
@ 2008-06-10 17:33         ` Christoph Lameter
  2008-06-10 18:05           ` Eric Dumazet
  1 sibling, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-06-10 17:33 UTC (permalink / raw)
  To: Mike Travis
  Cc: Eric Dumazet, akpm, linux-arch, linux-kernel, David Miller,
	Peter Zijlstra, Rusty Russell

On Wed, 4 Jun 2008, Mike Travis wrote:

> I'm a bit confused.  Why is DEFINE_PER_CPU_SHARED_ALIGNED() conditioned on
> ifdef MODULE?
> 
>         #ifdef MODULE
>         #define SHARED_ALIGNED_SECTION ".data.percpu"
>         #else
>         #define SHARED_ALIGNED_SECTION ".data.percpu.shared_aligned"
>         #endif
> 
>         #define DEFINE_PER_CPU_SHARED_ALIGNED(type, name)                       \
>                 __attribute__((__section__(SHARED_ALIGNED_SECTION)))            \
>                 PER_CPU_ATTRIBUTES __typeof__(type) per_cpu__##name             \
>                 ____cacheline_aligned_in_smp

Looks wrong to me. There can be shared objects even without modules.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-06-04 15:04     ` Mike Travis
@ 2008-06-10 17:34       ` Christoph Lameter
  0 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-06-10 17:34 UTC (permalink / raw)
  To: Mike Travis
  Cc: Rusty Russell, akpm, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra

On Wed, 4 Jun 2008, Mike Travis wrote:

> I haven't seen any further discussion on these aspects... is there a consensus
> to remove the flags from CPU_ALLOC() and use a mutex?

We want to have extensible per cpu areas. This means you need an 
allocation context. So we need to keep the flags. A mutex is not a bad idea.
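
To make the point about the allocation context concrete, a hypothetical
sketch (cpu_alloc_mutex and find_or_extend_percpu_area are invented names
for illustration; the real allocator in patch 02/41 is structured
differently):

#include <linux/gfp.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(cpu_alloc_mutex);		/* the suggested mutex */

void *cpu_alloc_sketch(unsigned long size, gfp_t gfpflags, unsigned long align)
{
	void *ret;

	mutex_lock(&cpu_alloc_mutex);
	/*
	 * Growing the per cpu area may require allocating and mapping new
	 * backing pages, which is why the caller's gfp flags have to be
	 * passed down rather than dropped.
	 */
	ret = find_or_extend_percpu_area(size, align, gfpflags); /* hypothetical */
	mutex_unlock(&cpu_alloc_mutex);

	return ret;
}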



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-02  2:00               ` Rusty Russell
  2008-06-04 18:18                 ` Mike Travis
@ 2008-06-10 17:42                 ` Christoph Lameter
  2008-06-11 11:10                   ` Rusty Russell
  1 sibling, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-06-10 17:42 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Mike Travis

On Mon, 2 Jun 2008, Rusty Russell wrote:

> > Believe me I have tried to use local_t repeatedly for vm statistics etc.
> > It always fails on that issue.
> 
> Frankly, I am finding it increasingly easy to believe that you failed.  But 
> you are blaming the wrong thing.
> 
> There are three implementations of local_t which are obvious.  The best is for 
> architectures which can locate and increment a per-cpu var in one instruction 
> (eg. x86).  Otherwise, using atomic_t/atomic64_t for local_t provides a 
> general solution.  The other general solution would involve 
> local_irq_disable()/increment/local_irq_enable().
> 
> My (fading) hope is that this idiocy is an abberation,

1. The x86 implementation does not exist because the segment register has 
   so far not been available on x86_64, so that solution was not possible.
   You need the zero basing; then you can use per_xxx_add in cpu_inc
   (see the sketch after this list).

2. The general solution created overhead that is often not needed. If we
   had done the vm event counters with local_t then we would have had
   atomic overhead for each increment on e.g. IA64. That was not
   acceptable. cpu_alloc never falls back to atomic operations.

3. local_t is based on the atomic logic. But percpu handling is 
   fundamentally different in that accesses without the special macros
   are okay provided you are in a non-preemptible or irq context!
   A local_t declaration makes such accesses impossible.

4. The modeling of local_t on atomic_t limits it to 32bit! There is no
   way to use this with pointers or 64 bit entities. Adding that would 
   duplicate the API for each type added.
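
A minimal sketch of what the zero basing in point 1 enables on x86_64
(illustrative macros, not the patchset's actual cpu ops):

/*
 * With a zero-based per cpu area, %gs points at the current CPU's area and
 * the variable's link-time address doubles as its %gs-relative offset, so
 * the read-modify-write is a single instruction and needs no
 * preempt_disable().
 */
#define my_cpu_inc(var)						\
	asm volatile("incq %%gs:%0" : "+m" (var))

#define my_cpu_add(var, val)					\
	asm volatile("addq %1, %%gs:%0"				\
		     : "+m" (var)				\
		     : "er" ((unsigned long)(val)))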

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-06-10 17:33         ` Christoph Lameter
@ 2008-06-10 18:05           ` Eric Dumazet
  2008-06-10 18:28             ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Eric Dumazet @ 2008-06-10 18:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mike Travis, akpm, linux-arch, linux-kernel, David Miller,
	Peter Zijlstra, Rusty Russell

Christoph Lameter a écrit :
> On Wed, 4 Jun 2008, Mike Travis wrote:
> 
>> I'm a bit confused.  Why is DEFINE_PER_CPU_SHARED_ALIGNED() conditioned on
>> ifdef MODULE?
>>
>>         #ifdef MODULE
>>         #define SHARED_ALIGNED_SECTION ".data.percpu"
>>         #else
>>         #define SHARED_ALIGNED_SECTION ".data.percpu.shared_aligned"
>>         #endif
>>
>>         #define DEFINE_PER_CPU_SHARED_ALIGNED(type, name)                       \
>>                 __attribute__((__section__(SHARED_ALIGNED_SECTION)))            \
>>                 PER_CPU_ATTRIBUTES __typeof__(type) per_cpu__##name             \
>>                 ____cacheline_aligned_in_smp
> 
> Looks wrong to me. There can be shared objects even without modules.
> 
> 

Well, MODULE is not CONFIG_MODULES :)

If compiling an object that is going to be statically linked into the kernel, 
MODULE is not defined, so we have shared objects.

When compiling a module, we cannot *yet* use the .data.percpu.shared_aligned 
section, since the module loader won't handle that section.

The alternative is to change module linking for all arches to merge the 
.data.percpu{*} subsections correctly, or to tell the module loader to take 
all .data.percpu sections into account.

AFAIK no module uses DEFINE_PER_CPU_SHARED_ALIGNED() yet...




^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 02/41] cpu alloc: The allocator
  2008-06-10 18:05           ` Eric Dumazet
@ 2008-06-10 18:28             ` Christoph Lameter
  0 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-06-10 18:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Mike Travis, akpm, linux-arch, linux-kernel, David Miller,
	Peter Zijlstra, Rusty Russell

On Tue, 10 Jun 2008, Eric Dumazet wrote:

> Well, MODULE is not CONFIG_MODULES :)
> 
> If compiling an object that is going to be statically linked to kernel, MODULE
> is not defined, so we have shared objects.
> 
> When compiling a module, we cannot *yet* use .data.percpu.shared_aligned
> section, since module loader wont handle this section.
> 
> Alternative is to change modules linking for all arches to merge
> .data.percpu{*} subsections correctly, or tell module loader to take into
> account all .data.percpu sections.
> 
> AFAIK no module uses DEFINE_PER_CPU_SHARED_ALIGNED() yet...

Ahhh. Makes sense. Add a comment to explain this?
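
Something like the following would presumably do (a sketch against the
macro quoted above):

/*
 * Modules are linked with only a plain .data.percpu section; the module
 * loader does not (yet) know about the .data.percpu.shared_aligned
 * subsection, so fall back to the plain section when building a module.
 * Once module linking/loading handles the subsection, this #ifdef can go.
 */
#ifdef MODULE
#define SHARED_ALIGNED_SECTION ".data.percpu"
#else
#define SHARED_ALIGNED_SECTION ".data.percpu.shared_aligned"
#endif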


^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [patch 01/41] cpu_alloc: Increase percpu area size to 128k
  2008-06-10 17:22     ` Christoph Lameter
@ 2008-06-10 19:54       ` Luck, Tony
  0 siblings, 0 replies; 139+ messages in thread
From: Luck, Tony @ 2008-06-10 19:54 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: akpm, linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra, Rusty Russell, Mike Travis

> > > -#define PERCPU_PAGE_SHIFT	16	/* log2() of max. size of per-CPU area */
> > > +#define PERCPU_PAGE_SHIFT	17	/* log2() of max. size of per-CPU area */
> > 
> > Don't you need some more changes to the alt_dtlb_miss handler in
> > ivt.S for this to work?  128K is not a supported pagesize on any
> > processor model.
>
> Ok so this needs to be 18?

Yes. 18 will work (256K is an architected page size, guaranteed to be supported
by all processor models ... SDM 2:52 table 4-4).
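
So the fixed-up define would presumably read (sketch):

/* 256K is an architected IA64 page size, so the alt_dtlb_miss handler can
 * still map the per cpu area with a single translation. */
#define PERCPU_PAGE_SHIFT	18	/* log2() of max. size of per-CPU area */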

-Tony


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-10  3:18                             ` Christoph Lameter
@ 2008-06-11  0:03                               ` Rusty Russell
  2008-06-11  0:15                                 ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-06-11  0:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mike Travis, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra

On Tuesday 10 June 2008 13:18:25 Christoph Lameter wrote:
> On Tue, 10 Jun 2008, Rusty Russell wrote:
> > > Right that is what the cpu alloc patches do. So you could implement
> > > cpu_local_inc on top of some of the cpu alloc patches.
> >
> > Or you could just implement it today as a standalone patch.
>
> You need at least the zero basing to enable the use of the segment
> register on x86_64.

Indeed.  Works for i386 as is, but 64 bit will need that patch.

> > > But then the whole point of local_t is gone. Why not use atomic_t in
> > > the first place?
> >
> > Because some archs can do better.
>
> The argument does not make any sense. First you want to use atomic_t then
> not?

You're being obtuse.  See previous mail about the three possible 
implementations of local_t, and the comment in asm-generic/local.h.

The paths forward are clear:
1) Improve x86 local_t (mostly orthogonal to the others, but useful).
2) Implement extensible per-cpu areas.
3) Generalize per-cpu accessors.
4) Extend or replace the module.c per-cpu allocator to alloc from the other
   areas.
5) Convert alloc_percpu et al. to use the new code.

Hope that clarifies,
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-11  0:03                               ` Rusty Russell
@ 2008-06-11  0:15                                 ` Christoph Lameter
  0 siblings, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-06-11  0:15 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Mike Travis, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra

On Wed, 11 Jun 2008, Rusty Russell wrote:

> You're being obtuse.  See previous mail about the three possible 
> implementations of local_t, and the comment in asm-generic/local.h.

OK. I hope I responded in the other email in a more intelligent 
fashion.

> The paths forward are clear:
> 1) Improve x86 local_t (mostly orthogonal to the others, but useful).

Not sure about that. It's rarely used, and the more general cpu alloc stuff 
can be used in lots of places, as is evident from the rest of the patchset. 
But let's leave it if it's important for some reason.

> 2) Implement extensible per-cpu areas.
> 3) Generalize per-cpu accessors.
> 4) Extend or replace the module.c per-cpu allocator to alloc from the other
>    areas.
> 5) Convert alloc_percpu et al. to use the new code.

Yes thanks. We are mostly on the same wavelength.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-10 17:42                 ` Christoph Lameter
@ 2008-06-11 11:10                   ` Rusty Russell
  2008-06-11 23:39                     ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-06-11 11:10 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Mike Travis

On Wednesday 11 June 2008 03:42:15 Christoph Lameter wrote:
> 1. The x86 implementation does not exist because the segment register has
>    so far not been available on x86_64. So you could not do the solution.
>    You need the zero basing. Then you can use per_xxx_add in cpu_inc.

Yes: for 64 bit x86, getting rid of the PDA or zero-basing is required.

> 2. The general solution created overhead that is often not needed. If we
>    would have done vm event counters with local_t then we would have
>    atomic overhead for each increment on f.e. IA64. That was not
>    acceptable. cpu_alloc never falls back to atomic operations.

You can implement it either way.  I've said that three times now.  The current 
generic one uses atomics, but preempt disable/enable is possible.

> 3. local_t is based on the atomic logic. But percpu handling is
>    fundamentally different in that accesses without the special macros
>    are okay provided you are in a non preemptible or irq context!
>    A local_t declaration makes such accesses impossible.

Again, untrue.  The interface is already there.  So feel free to implement 
__cpu_local_inc et al in terms of preempt enable and disable so it doesn't 
need to use atomics.  

> 4. The modeling of local_t on atomic_t limits it to 32bit!

Again wrong.  And adding an exclamation mark doesn't make it true.

Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-11 11:10                   ` Rusty Russell
@ 2008-06-11 23:39                     ` Christoph Lameter
  2008-06-12  0:58                       ` Nick Piggin
  0 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-06-11 23:39 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Andrew Morton, linux-arch, linux-kernel, David Miller,
	Eric Dumazet, Peter Zijlstra, Mike Travis

On Wed, 11 Jun 2008, Rusty Russell wrote:

> > 4. The modeling of local_t on atomic_t limits it to 32bit!
> 
> Again wrong.  And adding an exclamation mark doesn't make it true.

Ewww ... It's atomic_long_t, ahh. OK, then there is no 32 bit support. What about pointers?


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-11 23:39                     ` Christoph Lameter
@ 2008-06-12  0:58                       ` Nick Piggin
  2008-06-12  2:44                         ` Rusty Russell
  0 siblings, 1 reply; 139+ messages in thread
From: Nick Piggin @ 2008-06-12  0:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Rusty Russell, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra, Mike Travis

On Thursday 12 June 2008 09:39, Christoph Lameter wrote:
> On Wed, 11 Jun 2008, Rusty Russell wrote:
> > > 4. The modeling of local_t on atomic_t limits it to 32bit!
> >
> > Again wrong.  And adding an exclamation mark doesn't make it true.
>
> Ewww ... Its atomic_long_t ahh. Ok then there no 32 bit support. What about
> pointers?

sizeof(long) == sizeof(void *) in Linux, right?

If you were to support just a single data type, long would probably
be the most useful. Still, it might be more consistent to support
int and long, same as atomic.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-12  0:58                       ` Nick Piggin
@ 2008-06-12  2:44                         ` Rusty Russell
  2008-06-12  3:40                           ` Nick Piggin
  0 siblings, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-06-12  2:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Lameter, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra, Mike Travis,
	Martin Peschke

On Thursday 12 June 2008 10:58:01 Nick Piggin wrote:
> On Thursday 12 June 2008 09:39, Christoph Lameter wrote:
> > On Wed, 11 Jun 2008, Rusty Russell wrote:
> > > > 4. The modeling of local_t on atomic_t limits it to 32bit!
> > >
> > > Again wrong.  And adding an exclamation mark doesn't make it true.
> >
> > Ewww ... Its atomic_long_t ahh. Ok then there no 32 bit support. What
> > about pointers?
>
> sizeof(long) == sizeof(void *) in Linux, right?
>
> If you were to support just a single data type, long would probably
> be the most useful. Still, it might be more consistent to support
> int and long, same as atomic.

Sure, but in practice these tend to be simple counters: that could well change 
when dynamic percpu allocs become first class citizens, but let's not put the 
cart before the horse...

Per-cpu seems to be particularly prone to over-engineering: see commit 
7ff6f08295d90ab20d25200ef485ebb45b1b8d71 from almost two years ago.  Grepping 
here reveals that this infrastructure is still not used.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-12  2:44                         ` Rusty Russell
@ 2008-06-12  3:40                           ` Nick Piggin
  2008-06-12  9:37                             ` Martin Peschke
  0 siblings, 1 reply; 139+ messages in thread
From: Nick Piggin @ 2008-06-12  3:40 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Christoph Lameter, Andrew Morton, linux-arch, linux-kernel,
	David Miller, Eric Dumazet, Peter Zijlstra, Mike Travis,
	Martin Peschke

On Thursday 12 June 2008 12:44, Rusty Russell wrote:
> On Thursday 12 June 2008 10:58:01 Nick Piggin wrote:
> > On Thursday 12 June 2008 09:39, Christoph Lameter wrote:
> > > On Wed, 11 Jun 2008, Rusty Russell wrote:
> > > > > 4. The modeling of local_t on atomic_t limits it to 32bit!
> > > >
> > > > Again wrong.  And adding an exclamation mark doesn't make it true.
> > >
> > > Ewww ... Its atomic_long_t ahh. Ok then there no 32 bit support. What
> > > about pointers?
> >
> > sizeof(long) == sizeof(void *) in Linux, right?
> >
> > If you were to support just a single data type, long would probably
> > be the most useful. Still, it might be more consistent to support
> > int and long, same as atomic.
>
> Sure, but in practice these tend to be simple counters: that could well
> change when dynamic percpu allocs become first class citizens, but let's
> not put the cart before the horse...

Right, I was just responding to Christoph's puzzling question.


> Per-cpu seems to be particularly prone to over-engineering: see commit
> 7ff6f08295d90ab20d25200ef485ebb45b1b8d71 from almost two years ago. 
> Grepping here reveals that this infrastructure is still not used.

Hmm. Something like that needs the question asked "who uses this?"
before it is merged I guess. If it were a trivial patch maybe not,
but something like this that sits untested for so long is almost
broken by definition ;)

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-12  3:40                           ` Nick Piggin
@ 2008-06-12  9:37                             ` Martin Peschke
  2008-06-12 11:21                               ` Nick Piggin
  0 siblings, 1 reply; 139+ messages in thread
From: Martin Peschke @ 2008-06-12  9:37 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Rusty Russell, Christoph Lameter, Andrew Morton, linux-arch,
	linux-kernel, David Miller, Eric Dumazet, Peter Zijlstra,
	Mike Travis

On Thu, 2008-06-12 at 13:40 +1000, Nick Piggin wrote:
> On Thursday 12 June 2008 12:44, Rusty Russell wrote:
> > On Thursday 12 June 2008 10:58:01 Nick Piggin wrote:
> > > On Thursday 12 June 2008 09:39, Christoph Lameter wrote:
> > > > On Wed, 11 Jun 2008, Rusty Russell wrote:
> > > > > > 4. The modeling of local_t on atomic_t limits it to 32bit!
> > > > >
> > > > > Again wrong.  And adding an exclamation mark doesn't make it true.
> > > >
> > > > Ewww ... Its atomic_long_t ahh. Ok then there no 32 bit support. What
> > > > about pointers?
> > >
> > > sizeof(long) == sizeof(void *) in Linux, right?
> > >
> > > If you were to support just a single data type, long would probably
> > > be the most useful. Still, it might be more consistent to support
> > > int and long, same as atomic.
> >
> > Sure, but in practice these tend to be simple counters: that could well
> > change when dynamic percpu allocs become first class citizens, but let's
> > not put the cart before the horse...
> 
> Right, I was just responding to Christoph's puzzling question.
> 
> 
> > Per-cpu seems to be particularly prone to over-engineering: see commit
> > 7ff6f08295d90ab20d25200ef485ebb45b1b8d71 from almost two years ago. 
> > Grepping here reveals that this infrastructure is still not used.
> 
> Hmm. Something like that needs the question asked "who uses this?"
> before it is merged I guess. If it were a trivial patch maybe not,
> but something like this that sits untested for so long is almost
> broken by definition ;)

Some code of mine which didn't make it beyond -mm used this small
per-cpu extension. So the commit you refer to was tested.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-12  9:37                             ` Martin Peschke
@ 2008-06-12 11:21                               ` Nick Piggin
  2008-06-12 17:19                                 ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Nick Piggin @ 2008-06-12 11:21 UTC (permalink / raw)
  To: Martin Peschke
  Cc: Rusty Russell, Christoph Lameter, Andrew Morton, linux-arch,
	linux-kernel, David Miller, Eric Dumazet, Peter Zijlstra,
	Mike Travis

On Thursday 12 June 2008 19:37, Martin Peschke wrote:
> On Thu, 2008-06-12 at 13:40 +1000, Nick Piggin wrote:
> > On Thursday 12 June 2008 12:44, Rusty Russell wrote:

> > > Per-cpu seems to be particularly prone to over-engineering: see commit
> > > 7ff6f08295d90ab20d25200ef485ebb45b1b8d71 from almost two years ago.
> > > Grepping here reveals that this infrastructure is still not used.
> >
> > Hmm. Something like that needs the question asked "who uses this?"
> > before it is merged I guess. If it were a trivial patch maybe not,
> > but something like this that sits untested for so long is almost
> > broken by definition ;)
>
> Some code of mine which didn't make it beyond -mm used this small
> per-cpu extension. So the commit you refer to was tested.

Right, but it can easily rot after initial testing if it isn't
continually used.

Maybe this isn't the best example because maybe it still works
fine. But in general, unused, non-trivial code isn't good just
to leave around "just in case" IMO.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-12 11:21                               ` Nick Piggin
@ 2008-06-12 17:19                                 ` Christoph Lameter
  2008-06-13  0:38                                   ` Rusty Russell
  0 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-06-12 17:19 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Martin Peschke, Rusty Russell, Andrew Morton, linux-arch,
	linux-kernel, David Miller, Eric Dumazet, Peter Zijlstra,
	Mike Travis

Populating the per cpu areas on demand is a good thing, especially for 
configurations with a large number of processors. If we really go to 
support 4k processors by default then we need to allocate the smallest 
amount of per cpu structures necessary. Maybe ACPI or so can tell us how 
many processors are possible and we allocate only those. But it would be 
best if the percpu structures were only allocated for actually active 
processors.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-12 17:19                                 ` Christoph Lameter
@ 2008-06-13  0:38                                   ` Rusty Russell
  2008-06-13  2:27                                     ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-06-13  0:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Martin Peschke, Andrew Morton, linux-arch,
	linux-kernel, David Miller, Eric Dumazet, Peter Zijlstra,
	Mike Travis

On Friday 13 June 2008 03:19:51 Christoph Lameter wrote:
> Populating the per cpu areas on demand is a good thing especially for
> configurations with a large number of processors. If we really go to 
> support 4k processor by default then we need to allocate the smallest
> amount of per cpu structures necessary.  Maybe ACPI or so can tell us how 
> many processors are possible and we only allocate those. But it would be
> best if the percpu structures are only allocated for actually active
> processors.

cpu_possible_map should definitely be minimal, but your point is well made: 
dynamic percpu could actually cut memory allocation.  If we go for a hybrid 
scheme where static percpu is always allocated from the initial chunk, 
however, we still need the current pessimistic overallocation.

Mike's a clever guy, I'm sure he'll think of something :)
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-13  0:38                                   ` Rusty Russell
@ 2008-06-13  2:27                                     ` Christoph Lameter
  2008-06-15 10:33                                       ` Rusty Russell
  0 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-06-13  2:27 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Nick Piggin, Martin Peschke, Andrew Morton, linux-arch,
	linux-kernel, David Miller, Eric Dumazet, Peter Zijlstra,
	Mike Travis

On Fri, 13 Jun 2008, Rusty Russell wrote:

> cpu_possible_map should definitely be minimal, but your point is well made: 
> dynamic percpu could actually cut memory allocation.  If we go for a hybrid 
> scheme where static percpu is always allocated from the initial chunk, 
> however, we still need the current pessimistic overallocation.

The initial chunk would mean that the percpu areas all come from the same 
NUMA node. We really need to allocate from the node that is nearest to a 
processor (not all processors have processor local memory!).

It would be good to standardize the way that percpu areas are allocated. 
We have various ways of allocating them now in various arches. 
init/main.c:setup_per_cpu_areas() needs to be generalized (a rough sketch 
follows the list below):

1. Allocate the per cpu areas in a NUMA aware fashion.

2. Have a function for instantiating a single per cpu area that 
   can be used during cpu hotplug.

3. Some hooks for arches to override particular behavior as needed.
   E.g. IA64 allocates percpu structures in a special way and x86_64
   needs to do some tricks for the pda, etc.
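
A rough sketch of the direction points 1 and 3 suggest
(arch_alloc_percpu_area and setup_per_cpu_areas_sketch are illustrative
names, not existing code):

#include <linux/bootmem.h>
#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/string.h>
#include <linux/topology.h>
#include <asm/sections.h>

/* Default NUMA-aware backing allocation; arches would override this hook
 * (e.g. via a __weak definition) to apply their special requirements. */
void * __init arch_alloc_percpu_area(int cpu, unsigned long size)
{
	return __alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu)), size,
				    PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
}

void __init setup_per_cpu_areas_sketch(void)
{
	unsigned long size = PERCPU_ENOUGH_ROOM;
	int cpu;

	for_each_possible_cpu(cpu) {
		void *area = arch_alloc_percpu_area(cpu, size);

		/* copy the static per cpu data into this CPU's area */
		memcpy(area, __per_cpu_start, __per_cpu_end - __per_cpu_start);
		__per_cpu_offset[cpu] = (unsigned long)area -
					(unsigned long)__per_cpu_start;
	}
}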

> Mike's a clever guy, I'm sure he'll think of something :)

Hopefully. Otherwise he will ask me =-).


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-13  2:27                                     ` Christoph Lameter
@ 2008-06-15 10:33                                       ` Rusty Russell
  2008-06-16 14:52                                         ` Christoph Lameter
  0 siblings, 1 reply; 139+ messages in thread
From: Rusty Russell @ 2008-06-15 10:33 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Martin Peschke, Andrew Morton, linux-arch,
	linux-kernel, David Miller, Eric Dumazet, Peter Zijlstra,
	Mike Travis

On Friday 13 June 2008 12:27:07 Christoph Lameter wrote:
> On Fri, 13 Jun 2008, Rusty Russell wrote:
> > cpu_possible_map should definitely be minimal, but your point is well
> > made: dynamic percpu could actually cut memory allocation.  If we go for
> > a hybrid scheme where static percpu is always allocated from the initial
> > chunk, however, we still need the current pessimistic overallocation.
>
> The initial chunk would mean that the percpu areas all come from the same
> NUMA node. We really need to allocate from the node that is nearest to a
> processor (not all processors have processor local memory!).

Yes, this is where it gets nasty.  We shouldn't even allocate the initial 
chunk in a non-NUMA aware way (I'm using the term chunk loosely, it's a chunk 
per cpu of course).

> It would be good to standardize the way that percpu areas are allocated.
> We have various ways of allocation now in various arches.
> init/main.c:setup_per_cpu_ares() needs to be generalized:
>
> 1. Allocate the per cpu areas in a NUMA aware fashions.

Definitely.  We also need to reserve virtual address space to create more 
areas with congruent mappings; that's the fun part.

Maybe a simpler non-NUMA variant too, but it's trivial if we want it.

> 2. Have a function for instantiating a single per cpu area that
>    can be used during cpu hotplug.

Unfortunately this breaks the current percpu semantics: that if you iterate 
over all possible cpus you can access percpu vars.  This means you don't need 
to have hotplug CPU notifiers for simple percpu counters.  We could do this 
with helpers, but AFAICT it's orthogonal to the other plans.
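
Concretely, the semantics being relied on (a sketch with a hypothetical
counter):

#include <linux/cpumask.h>
#include <linux/percpu.h>

DEFINE_PER_CPU(unsigned long, my_events);	/* hypothetical */

/* Works without any hotplug notifier precisely because every possible
 * CPU's slot exists, whether or not that CPU ever comes online. */
static unsigned long total_my_events(void)
{
	unsigned long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += per_cpu(my_events, cpu);

	return sum;
}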

> 3. Some hooks for arches to override particular behavior as needed.
>    F.e. IA64 allocates percpu structures in a special way. x86_64
>    needs to do some tricks for the pda etc etc.

IA64 is going to need some work, since dynamic percpu addresses won't be able 
to use their pinned TLB trick to get the local version.

> > Mike's a clever guy, I'm sure he'll think of something :)
>
> Hopefully. Otherwise he will ask me =-).

And as always, lkml will offer feedback; useful and otherwise :)

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-15 10:33                                       ` Rusty Russell
@ 2008-06-16 14:52                                         ` Christoph Lameter
  2008-06-17  0:24                                           ` Rusty Russell
  0 siblings, 1 reply; 139+ messages in thread
From: Christoph Lameter @ 2008-06-16 14:52 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Nick Piggin, Martin Peschke, Andrew Morton, linux-arch,
	linux-kernel, David Miller, Eric Dumazet, Peter Zijlstra,
	Mike Travis

On Sun, 15 Jun 2008, Rusty Russell wrote:

> > 3. Some hooks for arches to override particular behavior as needed.
> >    F.e. IA64 allocates percpu structures in a special way. x86_64
> >    needs to do some tricks for the pda etc etc.
> 
> IA64 is going to need some work, since dynamic percpu addresses won't be able 
> to use their pinned TLB trick to get the local version.

The ia64 hook could simply return the address of the percpu area that 
was reserved when the per-node memory layout was generated (which happens 
very early during node bootstrap).



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-16 14:52                                         ` Christoph Lameter
@ 2008-06-17  0:24                                           ` Rusty Russell
  2008-06-17  2:29                                             ` Christoph Lameter
  2008-06-17 14:21                                             ` Mike Travis
  0 siblings, 2 replies; 139+ messages in thread
From: Rusty Russell @ 2008-06-17  0:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nick Piggin, Martin Peschke, Andrew Morton, linux-arch,
	linux-kernel, David Miller, Eric Dumazet, Peter Zijlstra,
	Mike Travis

On Tuesday 17 June 2008 00:52:08 Christoph Lameter wrote:
> On Sun, 15 Jun 2008, Rusty Russell wrote:
> > > 3. Some hooks for arches to override particular behavior as needed.
> > >    F.e. IA64 allocates percpu structures in a special way. x86_64
> > >    needs to do some tricks for the pda etc etc.
> >
> > IA64 is going to need some work, since dynamic percpu addresses won't be
> > able to use their pinned TLB trick to get the local version.
>
> The ia64 hook could simply return the address of percpu area that
> was reserved when the per node memory layout was generated (which happens
> very early during node bootstrap).

Apologies, this time I read the code.  I thought IA64 used the pinned TLB area 
to access per-cpu vars under some circumstances, but they only do that via an 
arch-specific macro.

So creating new congruent mappings to expand the percpu area(s) is our main 
concern now?

Rusty.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-17  0:24                                           ` Rusty Russell
@ 2008-06-17  2:29                                             ` Christoph Lameter
  2008-06-17 14:21                                             ` Mike Travis
  1 sibling, 0 replies; 139+ messages in thread
From: Christoph Lameter @ 2008-06-17  2:29 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Nick Piggin, Martin Peschke, Andrew Morton, linux-arch,
	linux-kernel, David Miller, Eric Dumazet, Peter Zijlstra,
	Mike Travis

On Tue, 17 Jun 2008, Rusty Russell wrote:

> > The ia64 hook could simply return the address of percpu area that
> > was reserved when the per node memory layout was generated (which happens
> > very early during node bootstrap).
> 
> Apologies, this time I read the code.  I thought IA64 used the pinned TLB area 
> to access per-cpu vars under some circumstances, but they only do that via an 
> arch-specific macro.
> 
> So creating new congruent mappings to expand the percpu area(s) is our main 
> concern now?

The concern here was just consolidating the setup code for the per cpu 
areas.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations
  2008-06-17  0:24                                           ` Rusty Russell
  2008-06-17  2:29                                             ` Christoph Lameter
@ 2008-06-17 14:21                                             ` Mike Travis
  1 sibling, 0 replies; 139+ messages in thread
From: Mike Travis @ 2008-06-17 14:21 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Christoph Lameter, Nick Piggin, Martin Peschke, Andrew Morton,
	linux-arch, linux-kernel, David Miller, Eric Dumazet,
	Peter Zijlstra

Rusty Russell wrote:
> On Tuesday 17 June 2008 00:52:08 Christoph Lameter wrote:
>> On Sun, 15 Jun 2008, Rusty Russell wrote:
>>>> 3. Some hooks for arches to override particular behavior as needed.
>>>>    F.e. IA64 allocates percpu structures in a special way. x86_64
>>>>    needs to do some tricks for the pda etc etc.
>>> IA64 is going to need some work, since dynamic percpu addresses won't be
>>> able to use their pinned TLB trick to get the local version.
>> The ia64 hook could simply return the address of percpu area that
>> was reserved when the per node memory layout was generated (which happens
>> very early during node bootstrap).
> 
> Apologies, this time I read the code.  I thought IA64 used the pinned TLB area 
> to access per-cpu vars under some circumstances, but they only do that via an 
> arch-specific macro.
> 
> So creating new congruent mappings to expand the percpu area(s) is our main 
> concern now?
> 
> Rusty.

Not exactly.  Getting the system not to panic early in the boot (before
x86_64_start_kernel()) is the primary problem right now.  This happens
in the tip tree with the change to use zero-based percpu offsets.  It
gets much farther in the linux-next tree.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 139+ messages in thread

end of thread, other threads:[~2008-06-17 14:21 UTC | newest]

Thread overview: 139+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-05-30  3:56 [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Christoph Lameter
2008-05-30  3:56 ` [patch 01/41] cpu_alloc: Increase percpu area size to 128k Christoph Lameter
2008-06-02 17:58   ` Luck, Tony
2008-06-02 23:48     ` Rusty Russell
2008-06-10 17:22     ` Christoph Lameter
2008-06-10 19:54       ` Luck, Tony
2008-05-30  3:56 ` [patch 02/41] cpu alloc: The allocator Christoph Lameter
2008-05-30  4:58   ` Andrew Morton
2008-05-30  5:10     ` Christoph Lameter
2008-05-30  5:31       ` Andrew Morton
2008-06-02  9:29         ` Paul Jackson
2008-05-30  5:56       ` KAMEZAWA Hiroyuki
2008-05-30  6:16         ` Christoph Lameter
2008-06-04 14:48     ` Mike Travis
2008-05-30  5:04   ` Eric Dumazet
2008-05-30  5:20     ` Christoph Lameter
2008-05-30  5:52       ` Rusty Russell
2008-06-04 15:30         ` Mike Travis
2008-06-05 23:48           ` Rusty Russell
2008-05-30  5:54       ` Eric Dumazet
2008-06-04 14:58       ` Mike Travis
2008-06-04 15:11         ` Eric Dumazet
2008-06-06  0:32           ` Rusty Russell
2008-06-10 17:33         ` Christoph Lameter
2008-06-10 18:05           ` Eric Dumazet
2008-06-10 18:28             ` Christoph Lameter
2008-05-30  5:46   ` Rusty Russell
2008-06-04 15:04     ` Mike Travis
2008-06-10 17:34       ` Christoph Lameter
2008-05-31 20:58   ` Pavel Machek
2008-05-30  3:56 ` [patch 03/41] cpu alloc: Use cpu allocator instead of the builtin modules per cpu allocator Christoph Lameter
2008-05-30  4:58   ` Andrew Morton
2008-05-30  5:14     ` Christoph Lameter
2008-05-30  5:34       ` Andrew Morton
2008-05-30  6:08   ` Rusty Russell
2008-05-30  6:21     ` Christoph Lameter
2008-05-30  3:56 ` [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations Christoph Lameter
2008-05-30  4:58   ` Andrew Morton
2008-05-30  5:17     ` Christoph Lameter
2008-05-30  5:38       ` Andrew Morton
2008-05-30  6:12         ` Christoph Lameter
2008-05-30  7:08           ` Rusty Russell
2008-05-30 18:00             ` Christoph Lameter
2008-06-02  2:00               ` Rusty Russell
2008-06-04 18:18                 ` Mike Travis
2008-06-05 23:59                   ` Rusty Russell
2008-06-09 19:00                     ` Christoph Lameter
2008-06-09 23:27                       ` Rusty Russell
2008-06-09 23:54                         ` Christoph Lameter
2008-06-10  2:56                           ` Rusty Russell
2008-06-10  3:18                             ` Christoph Lameter
2008-06-11  0:03                               ` Rusty Russell
2008-06-11  0:15                                 ` Christoph Lameter
2008-06-09 23:09                   ` Christoph Lameter
2008-06-10 17:42                 ` Christoph Lameter
2008-06-11 11:10                   ` Rusty Russell
2008-06-11 23:39                     ` Christoph Lameter
2008-06-12  0:58                       ` Nick Piggin
2008-06-12  2:44                         ` Rusty Russell
2008-06-12  3:40                           ` Nick Piggin
2008-06-12  9:37                             ` Martin Peschke
2008-06-12 11:21                               ` Nick Piggin
2008-06-12 17:19                                 ` Christoph Lameter
2008-06-13  0:38                                   ` Rusty Russell
2008-06-13  2:27                                     ` Christoph Lameter
2008-06-15 10:33                                       ` Rusty Russell
2008-06-16 14:52                                         ` Christoph Lameter
2008-06-17  0:24                                           ` Rusty Russell
2008-06-17  2:29                                             ` Christoph Lameter
2008-06-17 14:21                                             ` Mike Travis
2008-05-30  7:05         ` Rusty Russell
2008-05-30  6:32       ` Rusty Russell
2008-05-30  3:56 ` [patch 05/41] cpu alloc: Percpu_counter conversion Christoph Lameter
2008-05-30  6:47   ` Rusty Russell
2008-05-30 17:54     ` Christoph Lameter
2008-05-30  3:56 ` [patch 06/41] cpu alloc: crash_notes conversion Christoph Lameter
2008-05-30  3:56 ` [patch 07/41] cpu alloc: Workqueue conversion Christoph Lameter
2008-05-30  3:56 ` [patch 08/41] cpu alloc: ACPI cstate handling conversion Christoph Lameter
2008-05-30  3:56 ` [patch 09/41] cpu alloc: Genhd statistics conversion Christoph Lameter
2008-05-30  3:56 ` [patch 10/41] cpu alloc: blktrace conversion Christoph Lameter
2008-05-30  3:56 ` [patch 11/41] cpu alloc: SRCU cpu alloc conversion Christoph Lameter
2008-05-30  3:56 ` [patch 12/41] cpu alloc: XFS counter conversion Christoph Lameter
2008-05-30  3:56 ` [patch 13/41] cpu alloc: NFS statistics Christoph Lameter
2008-05-30  3:56 ` [patch 14/41] cpu alloc: Neigbour statistics Christoph Lameter
2008-05-30  3:56 ` [patch 15/41] cpu_alloc: Convert ip route statistics Christoph Lameter
2008-05-30  3:56 ` [patch 16/41] cpu alloc: Tcp statistics conversion Christoph Lameter
2008-05-30  3:56 ` [patch 17/41] cpu alloc: Convert scratches to cpu alloc Christoph Lameter
2008-05-30  3:56 ` [patch 18/41] cpu alloc: Dmaengine conversion Christoph Lameter
2008-05-30  3:56 ` [patch 19/41] cpu alloc: Convert loopback statistics Christoph Lameter
2008-05-30  3:56 ` [patch 20/41] cpu alloc: Veth conversion Christoph Lameter
2008-05-30  3:56 ` [patch 21/41] cpu alloc: Chelsio statistics conversion Christoph Lameter
2008-05-30  3:56 ` [patch 22/41] cpu alloc: Convert network sockets inuse counter Christoph Lameter
2008-05-30  3:56 ` [patch 23/41] cpu alloc: Use it for infiniband Christoph Lameter
2008-05-30  3:56 ` [patch 24/41] cpu alloc: Use in the crypto subsystem Christoph Lameter
2008-05-30  3:56 ` [patch 25/41] cpu alloc: scheduler: Convert cpuusage to cpu_alloc Christoph Lameter
2008-05-30  3:56 ` [patch 26/41] cpu alloc: Convert mib handling to cpu alloc Christoph Lameter
2008-05-30  6:47   ` Eric Dumazet
2008-05-30 18:01     ` Christoph Lameter
2008-05-30  3:56 ` [patch 27/41] cpu alloc: Remove the allocpercpu functionality Christoph Lameter
2008-05-30  4:58   ` Andrew Morton
2008-05-30  3:56 ` [patch 28/41] Module handling: Use CPU_xx ops to dynamically allocate counters Christoph Lameter
2008-05-30  3:56 ` [patch 29/41] x86_64: Use CPU ops for nmi alert counter Christoph Lameter
2008-05-30  3:56 ` [patch 30/41] Remove local_t support Christoph Lameter
2008-05-30  3:56 ` [patch 31/41] VM statistics: Use CPU ops Christoph Lameter
2008-05-30  3:56 ` [patch 32/41] cpu alloc: Use in slub Christoph Lameter
2008-05-30  3:56 ` [patch 33/41] cpu alloc: Remove slub fields Christoph Lameter
2008-05-30  3:56 ` [patch 34/41] cpu alloc: Page allocator conversion Christoph Lameter
2008-05-30  3:56 ` [patch 35/41] Support for CPU ops Christoph Lameter
2008-05-30  4:58   ` Andrew Morton
2008-05-30  5:18     ` Christoph Lameter
2008-05-30  3:56 ` [patch 36/41] Zero based percpu: Infrastructure to rebase the per cpu area to zero Christoph Lameter
2008-05-30  3:56 ` [patch 37/41] x86_64: Fold pda into per cpu area Christoph Lameter
2008-05-30  3:56 ` [patch 38/41] x86: Extend percpu ops to 64 bit Christoph Lameter
2008-05-30  3:56 ` [patch 39/41] x86: Replace cpu_pda() using percpu logic and get rid of _cpu_pda() Christoph Lameter
2008-05-30  3:57 ` [patch 40/41] x86: Replace xxx_pda() operations with x86_xx_percpu() Christoph Lameter
2008-05-30  3:57 ` [patch 41/41] x86_64: Support for cpu ops Christoph Lameter
2008-05-30  4:58 ` [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access Andrew Morton
2008-05-30  5:03   ` Christoph Lameter
2008-05-30  5:21     ` Andrew Morton
2008-05-30  5:27       ` Christoph Lameter
2008-05-30  5:49         ` Andrew Morton
2008-05-30  6:16           ` Christoph Lameter
2008-05-30  6:51             ` KAMEZAWA Hiroyuki
2008-05-30 14:38         ` Mike Travis
2008-05-30 17:50           ` Christoph Lameter
2008-05-30 18:00             ` Matthew Wilcox
2008-05-30 18:12               ` Christoph Lameter
2008-05-30  6:01       ` Eric Dumazet
2008-05-30  6:16         ` Andrew Morton
2008-05-30  6:22           ` Christoph Lameter
2008-05-30  6:37             ` Andrew Morton
2008-05-30 11:32               ` Matthew Wilcox
2008-06-04 15:07   ` Mike Travis
2008-06-06  5:33     ` Eric Dumazet
2008-06-06 13:08       ` Mike Travis
2008-06-08  6:00       ` Rusty Russell
2008-06-09 18:44       ` Christoph Lameter
2008-06-09 19:11         ` Andi Kleen
2008-06-09 20:15           ` Eric Dumazet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).