linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/4] percpu: Optimize percpu accesses
@ 2008-02-01 19:14 travis
  2008-02-01 19:14 ` [PATCH 1/4] generic: Percpu infrastructure to rebase the per cpu area to zero travis
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: travis @ 2008-02-01 19:14 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen, Ingo Molnar, Thomas Gleixner
  Cc: Jeremy Fitzhardinge, Christoph Lameter, Jack Steiner, linux-mm,
	linux-kernel


This patchset provides the following:

  * Generic: Percpu infrastructure to rebase the per cpu area to zero

    This provides the capability of accessing the percpu variables
    using a local register instead of having to go through a table
    on node 0 to find the cpu-specific offsets.  It would also allow
    atomic operations on percpu variables to reduce the required locking.

  * Init: Move setup of nr_cpu_ids to as early as possible, for use
    by early boot functions.

  * x86_64: Fold pda into per cpu area

    Declare the pda as a per cpu variable. This will move the pda
    area to an address accessible by the x86_64 per cpu macros.
    Subtracting __per_cpu_start makes the offset relative to the
    beginning of the per cpu area.  Since %gs points to the
    pda, it will then also point to the per cpu variables, which can
    be accessed thusly:

	%gs:[&per_cpu_xxxx - __per_cpu_start]

  * x86_64: Rebase per cpu variables to zero

    Take advantage of the zero-based per cpu area provided above,
    so that the x86_32 percpu operations can be used directly.
    x86_32 offsets %fs by __per_cpu_start; x86_64 has %gs pointing
    directly to the pda and the per cpu area, thereby allowing
    access to the pda with the x86_64 pda operations and access
    to the per cpu variables using the x86_32 percpu operations.
    After rebasing, the access becomes:

	%gs:[&per_cpu_xxxx]

    Introduces a new DEFINE_PER_CPU_FIRST to locate the percpu
    variable (pda in this case) at the beginning of the
    .data.percpu section.  (An illustrative sketch of the resulting
    code generation follows this list.)

  * x86_64: Cleanup non-smp usage of cpu maps

    Clean up references to the early cpu maps for the non-SMP
    configuration and remove some functions that are called only
    for SMP configurations.
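
As an illustrative sketch (not from the patches; the variable name and
wrapper are invented), this is roughly what a percpu read on x86_64
reduces to once the area is zero based.  It is essentially what
x86_read_percpu() expands to after this series, using the __percpu_seg
and per_cpu__ naming conventions the patches define:

	DEFINE_PER_CPU(unsigned long, example_counter);

	static inline unsigned long read_example_counter(void)
	{
		unsigned long ret;

		/* emits roughly: movq %gs:per_cpu__example_counter, %rax */
		asm("movq "__percpu_seg"%1, %0"
		    : "=r" (ret)
		    : "m" (per_cpu__example_counter));
		return ret;
	}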

Based on linux-2.6.git + x86.git

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
---
Notes:

(1 - had to disable CONFIG_SIS190 to build)
(2 - no modules)

Configs built and booted:

    x86_64-default
    x86_64-defconfig (2)
    x86_64-nonuma (2)
    x86_64-nosmp (2)
    x86_64-"Ingo Stress Test" (1,2)

Configs built with no errors:

    arm-default
    i386-allyesconfig (1)
    i386-allmodconfig (1)
    i386-defconfig
    i386-nosmp
    ppc-pmac32
    ppc-smp
    sparc64-default
    sparc64-smp
    x86_64-allmodconfig (1)
    x86_64-allyesconfig (1)
    x86_64-maxsmp (NR_CPUS=4k, MAXNODES=512)

Configs with errors prior to patch (preventing full build checkout):

    ia64-sn2: undefined reference to `mem_map' (more)
    ia64-default: (same error)
    ia64-nosmp: `per_cpu__kstat' truncated in .bss (more)
    s390-default: implicit declaration of '__raw_spin_is_contended'
    sparc-default: include/asm/pgtable.h: syntax '___f___swp_entry'

Memory Effects (using x86_64-maxsmp config):

    Note that 1/2MB has been moved from permanent data to
    the init data section (which is discarded after bootup),
    while the per cpu section is increased by only 128 bytes
    per cpu.  Text size is also reduced, improving cache
    performance.

    before (bytes) section (4k-cpu config)    delta  change
       6588928 .data.cacheline_alig     -524288 -7%
	 48072 .data.percpu                +128 +0%
       4804576 .data.read_mostly         -32656 +0%
	854048 .init.data               +557056 +65%
	160382 .init.text                   +62 +0%
       1254214 .rodata                     +274 +0%
       3915552 .text                      -1632 +0%
	 11040 __param                     -272 -2%

       3915552 Text                       -1632 +0%
       1085440 InitData                 +557056 +51%
      11454056 OtherData                -557056 -4%
	 48072 PerCpu                      +128 +0%
      20459748 Total                      -1330 +0%

-- 


* [PATCH 1/4] generic: Percpu infrastructure to rebase the per cpu area to zero
  2008-02-01 19:14 [PATCH 0/4] percpu: Optimize percpu accesses travis
@ 2008-02-01 19:14 ` travis
  2008-02-01 19:14 ` [PATCH 2/4] init: move setup of nr_cpu_ids to as early as possible travis
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: travis @ 2008-02-01 19:14 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen, Ingo Molnar, Thomas Gleixner
  Cc: Jeremy Fitzhardinge, Christoph Lameter, Jack Steiner, linux-mm,
	linux-kernel

[-- Attachment #1: zero_based_infrastructure --]
[-- Type: text/plain, Size: 5517 bytes --]

    * Support an option

	CONFIG_HAVE_ZERO_BASED_PER_CPU

      that makes the offsets of per cpu variables start at zero.

      If a percpu area starts at zero then:

	-  We do not need RELOC_HIDE anymore

	-  Provides for the future capability of architectures supplying
	   a per cpu allocator that returns offsets instead of pointers.
	   The offsets would be independent of the processor so that
	   address calculations can be done in a processor-independent
	   way.  Per cpu instructions can then add the processor-specific
	   offset at the last moment, possibly in an atomic instruction.
	   (A rough user-space model of this scheme follows this list.)

      The data the linker provides is different for zero-based percpu segments:

	__per_cpu_load	-> The address at which the percpu area was loaded
	__per_cpu_size	-> The length of the per cpu area

      For non-zero-based percpu segments, the above symbols are adjusted to
      maintain compatibility with existing architectures.

    * Removes the &__per_cpu_x in lockdep. The __per_cpu_x are already
      pointers. There is no need to take the address.
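
As a rough user-space model of the zero-based scheme (illustrative
only; the names are invented, not kernel API), note how the per cpu
"variable" is just an offset and the lookup is a plain addition:

	#include <stdio.h>
	#include <string.h>

	#define NCPUS 4

	static char percpu_template[64];	/* the .data.percpu image */
	static char *cpu_base[NCPUS];		/* one copy per cpu */

	/* zero-based SHIFT_PERCPU_PTR: plain addition, no RELOC_HIDE */
	static void *shift_percpu_ptr(size_t offset, int cpu)
	{
		return cpu_base[cpu] + offset;
	}

	int main(void)
	{
		static char storage[NCPUS][64];
		size_t counter_offset = 16;	/* offset the "linker" assigned */
		int cpu;

		for (cpu = 0; cpu < NCPUS; cpu++) {
			memcpy(storage[cpu], percpu_template, 64);
			cpu_base[cpu] = storage[cpu];
		}

		*(int *)shift_percpu_ptr(counter_offset, 2) = 42;
		printf("cpu 2 counter = %d\n",
		       *(int *)shift_percpu_ptr(counter_offset, 2));
		return 0;
	}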

Based on linux-2.6.git + x86.git

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
---
 include/asm-generic/percpu.h      |    5 +++++
 include/asm-generic/sections.h    |   10 ++++++++++
 include/asm-generic/vmlinux.lds.h |   14 ++++++++++++++
 kernel/lockdep.c                  |    4 ++--
 kernel/module.c                   |   10 ++++------
 5 files changed, 35 insertions(+), 8 deletions(-)

--- a/include/asm-generic/percpu.h
+++ b/include/asm-generic/percpu.h
@@ -43,7 +43,12 @@ extern unsigned long __per_cpu_offset[NR
  * Only S390 provides its own means of moving the pointer.
  */
 #ifndef SHIFT_PERCPU_PTR
+#ifndef CONFIG_HAVE_ZERO_BASED_PER_CPU
 #define SHIFT_PERCPU_PTR(__p, __offset)	RELOC_HIDE((__p), (__offset))
+#else
+#define SHIFT_PERCPU_PTR(__p, __offset) \
+	((__typeof(__p))(((void *)(__p)) + (__offset)))
+#endif
 #endif
 
 /*
--- a/include/asm-generic/sections.h
+++ b/include/asm-generic/sections.h
@@ -11,7 +11,17 @@ extern char _sinittext[], _einittext[];
 extern char _sextratext[] __attribute__((weak));
 extern char _eextratext[] __attribute__((weak));
 extern char _end[];
+#ifdef CONFIG_HAVE_ZERO_BASED_PER_CPU
+extern char __per_cpu_load[];
+extern char ____per_cpu_size[];
+#define __per_cpu_size ((unsigned long)&____per_cpu_size)
+#define __per_cpu_start ((char *)0)
+#define __per_cpu_end ((char *)__per_cpu_size)
+#else
 extern char __per_cpu_start[], __per_cpu_end[];
+#define __per_cpu_load __per_cpu_start
+#define __per_cpu_size (__per_cpu_end - __per_cpu_start)
+#endif
 extern char __kprobes_text_start[], __kprobes_text_end[];
 extern char __initdata_begin[], __initdata_end[];
 extern char __start_rodata[], __end_rodata[];
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -341,6 +341,19 @@
   	*(.initcall7.init)						\
   	*(.initcall7s.init)
 
+#ifdef CONFIG_HAVE_ZERO_BASED_PER_CPU
+#define PERCPU(align)							\
+	. = ALIGN(align);						\
+	percpu : { } :percpu						\
+	__per_cpu_load = .;						\
+	.data.percpu 0 : AT(__per_cpu_load - LOAD_OFFSET) {		\
+		*(.data.percpu)						\
+		*(.data.percpu.shared_aligned)				\
+		____per_cpu_size = .;					\
+	}								\
+	. = __per_cpu_load + ____per_cpu_size;				\
+	data : { } :data
+#else
 #define PERCPU(align)							\
 	. = ALIGN(align);						\
 	__per_cpu_start = .;						\
@@ -349,3 +362,4 @@
 		*(.data.percpu.shared_aligned)				\
 	}								\
 	__per_cpu_end = .;
+#endif
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -609,8 +609,8 @@ static int static_obj(void *obj)
 	 * percpu var?
 	 */
 	for_each_possible_cpu(i) {
-		start = (unsigned long) &__per_cpu_start + per_cpu_offset(i);
-		end   = (unsigned long) &__per_cpu_start + PERCPU_ENOUGH_ROOM
+		start = (unsigned long) __per_cpu_start + per_cpu_offset(i);
+		end   = (unsigned long) __per_cpu_start + PERCPU_ENOUGH_ROOM
 					+ per_cpu_offset(i);
 
 		if ((addr >= start) && (addr < end))
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -45,6 +45,7 @@
 #include <asm/uaccess.h>
 #include <asm/semaphore.h>
 #include <asm/cacheflush.h>
+#include <asm/sections.h>
 #include <linux/license.h>
 
 #if 0
@@ -344,9 +345,6 @@ static inline unsigned int block_size(in
 	return val;
 }
 
-/* Created by linker magic */
-extern char __per_cpu_start[], __per_cpu_end[];
-
 static void *percpu_modalloc(unsigned long size, unsigned long align,
 			     const char *name)
 {
@@ -360,7 +358,7 @@ static void *percpu_modalloc(unsigned lo
 		align = PAGE_SIZE;
 	}
 
-	ptr = __per_cpu_start;
+	ptr = __per_cpu_load;
 	for (i = 0; i < pcpu_num_used; ptr += block_size(pcpu_size[i]), i++) {
 		/* Extra for alignment requirement. */
 		extra = ALIGN((unsigned long)ptr, align) - (unsigned long)ptr;
@@ -395,7 +393,7 @@ static void *percpu_modalloc(unsigned lo
 static void percpu_modfree(void *freeme)
 {
 	unsigned int i;
-	void *ptr = __per_cpu_start + block_size(pcpu_size[0]);
+	void *ptr = __per_cpu_load + block_size(pcpu_size[0]);
 
 	/* First entry is core kernel percpu data. */
 	for (i = 1; i < pcpu_num_used; ptr += block_size(pcpu_size[i]), i++) {
@@ -446,7 +444,7 @@ static int percpu_modinit(void)
 	pcpu_size = kmalloc(sizeof(pcpu_size[0]) * pcpu_num_allocated,
 			    GFP_KERNEL);
 	/* Static in-kernel percpu data (used). */
-	pcpu_size[0] = -(__per_cpu_end-__per_cpu_start);
+	pcpu_size[0] = -__per_cpu_size;
 	/* Free room. */
 	pcpu_size[1] = PERCPU_ENOUGH_ROOM + pcpu_size[0];
 	if (pcpu_size[1] < 0) {

-- 


* [PATCH 2/4] init: move setup of nr_cpu_ids to as early as possible
  2008-02-01 19:14 [PATCH 0/4] percpu: Optimize percpu accesses travis
  2008-02-01 19:14 ` [PATCH 1/4] generic: Percpu infrastructure to rebase the per cpu area to zero travis
@ 2008-02-01 19:14 ` travis
  2008-02-01 19:14 ` [PATCH 3/4] x86_64: Fold pda into per cpu area travis
  2008-02-01 19:14 ` [PATCH 4/4] x86_64: Cleanup non-smp usage of cpu maps travis
  3 siblings, 0 replies; 17+ messages in thread
From: travis @ 2008-02-01 19:14 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen, Ingo Molnar, Thomas Gleixner
  Cc: Jeremy Fitzhardinge, Christoph Lameter, Jack Steiner, linux-mm,
	linux-kernel

[-- Attachment #1: mv-set-nr_cpu_ids --]
[-- Type: text/plain, Size: 2251 bytes --]

Move the setting of nr_cpu_ids from sched_init() to init/main.c,
so that it's available as early as possible.
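
As an illustration (user-space model, not kernel code): when the
possible map has holes due to disabled cores, nr_cpu_ids must span up
to the highest possible cpu number, which can exceed the count of
possible cpus:

	#include <stdio.h>

	int main(void)
	{
		unsigned long cpu_possible_map = 0x0b;	/* cpus 0, 1, 3 */
		int cpu, highest_cpu = 0, count = 0;

		for (cpu = 0; cpu < 64; cpu++)
			if (cpu_possible_map & (1UL << cpu)) {
				highest_cpu = cpu;
				count++;
			}

		/* prints: possible cpus: 3, nr_cpu_ids: 4 */
		printf("possible cpus: %d, nr_cpu_ids: %d\n",
		       count, highest_cpu + 1);
		return 0;
	}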

Based on linux-2.6.git + x86.git

Signed-off-by: Mike Travis <travis@sgi.com>
---
 init/main.c    |   21 +++++++++++++++++++++
 kernel/sched.c |    7 -------
 2 files changed, 21 insertions(+), 7 deletions(-)

--- a/init/main.c
+++ b/init/main.c
@@ -363,10 +363,30 @@ static void __init smp_init(void)
 #endif
 
 static inline void setup_per_cpu_areas(void) { }
+static inline void setup_nr_cpu_ids(void) { }
 static inline void smp_prepare_cpus(unsigned int maxcpus) { }
 
 #else
 
+/*
+ * Setup number of possible processor ids.
+ * This is different than setup_max_cpus as it accounts
+ * for zero bits embedded between one bits in the cpu
+ * possible map due to disabled cpu cores.
+ */
+int nr_cpu_ids __read_mostly = NR_CPUS;
+EXPORT_SYMBOL(nr_cpu_ids);
+
+static void __init setup_nr_cpu_ids(void)
+{
+	int cpu, highest_cpu = 0;
+
+	for_each_possible_cpu(cpu)
+		highest_cpu = cpu;
+
+	nr_cpu_ids = highest_cpu + 1;
+}
+
 #ifndef CONFIG_HAVE_SETUP_PER_CPU_AREA
 unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
 
@@ -542,6 +562,7 @@ asmlinkage void __init start_kernel(void
 	setup_arch(&command_line);
 	setup_command_line(command_line);
 	unwind_setup();
+	setup_nr_cpu_ids();
 	setup_per_cpu_areas();
 	smp_prepare_boot_cpu();	/* arch-specific boot-cpu hooks */
 
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5925,10 +5925,6 @@ void __init migration_init(void)
 
 #ifdef CONFIG_SMP
 
-/* Number of possible processor ids */
-int nr_cpu_ids __read_mostly = NR_CPUS;
-EXPORT_SYMBOL(nr_cpu_ids);
-
 #ifdef CONFIG_SCHED_DEBUG
 
 static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level)
@@ -7161,7 +7157,6 @@ static void init_tg_rt_entry(struct rq *
 
 void __init sched_init(void)
 {
-	int highest_cpu = 0;
 	int i, j;
 
 #ifdef CONFIG_SMP
@@ -7213,7 +7208,6 @@ void __init sched_init(void)
 #endif
 		init_rq_hrtick(rq);
 		atomic_set(&rq->nr_iowait, 0);
-		highest_cpu = i;
 	}
 
 	set_load_weight(&init_task);
@@ -7223,7 +7217,6 @@ void __init sched_init(void)
 #endif
 
 #ifdef CONFIG_SMP
-	nr_cpu_ids = highest_cpu + 1;
 	open_softirq(SCHED_SOFTIRQ, run_rebalance_domains, NULL);
 #endif
 

-- 


* [PATCH 3/4] x86_64: Fold pda into per cpu area
  2008-02-01 19:14 [PATCH 0/4] percpu: Optimize percpu accesses travis
  2008-02-01 19:14 ` [PATCH 1/4] generic: Percpu infrastructure to rebase the per cpu area to zero travis
  2008-02-01 19:14 ` [PATCH 2/4] init: move setup of nr_cpu_ids to as early as possible travis
@ 2008-02-01 19:14 ` travis
  2008-02-15 20:16   ` Ingo Molnar
  2008-02-01 19:14 ` [PATCH 4/4] x86_64: Cleanup non-smp usage of cpu maps travis
  3 siblings, 1 reply; 17+ messages in thread
From: travis @ 2008-02-01 19:14 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen, Ingo Molnar, Thomas Gleixner
  Cc: Jeremy Fitzhardinge, Christoph Lameter, Jack Steiner, linux-mm,
	linux-kernel

[-- Attachment #1: x86_64_fold_pda --]
[-- Type: text/plain, Size: 12490 bytes --]

  * Declare the pda as a per cpu variable. This will move the pda area
    to an address accessible by the x86_64 per cpu macros.  Subtracting
    __per_cpu_start makes the offset relative to the beginning
    of the per cpu area.  Since %gs points to the pda, it will
    then also point to the per cpu variables and can be accessed thusly:

	%gs:[&per_cpu_xxxx - __per_cpu_start]

  * The boot_pdas are needed only in head64.c, so move the declaration
    over there.  And since the boot_cpu_pda is used only during
    bootup and then copied to the per_cpu areas during init, it can
    be discarded afterwards.  In addition, the initial cpu_pda pointer
    table is reallocated to the correct size for the number of cpus.

  * Remove the code that allocates special pda data structures.
    Since the percpu area is currently maintained for all possible
    cpus, the pda regions will stay intact in case cpus are
    hotplugged off and then back on.

  * Relocate the x86_64 percpu variables to begin at zero, so that
    the x86_32 percpu operations can be used directly.  x86_32
    offsets %fs by __per_cpu_start; x86_64 has %gs pointing
    directly to the pda and the per cpu area, thereby allowing
    access to the pda with the x86_64 pda operations and access
    to the per cpu variables using the x86_32 percpu operations.

  * Introduces a new DEFINE_PER_CPU_FIRST to locate the percpu
    variable (cpu_pda in this case) at the beginning of the
    .data.percpu section.  (A small model of the resulting layout
    invariant follows this list.)

  * This also supports further integration of x86_32/64.
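
A small user-space model (invented struct; the kernel relies on the
.data.percpu.first linker section rather than a C struct) of the
layout invariant DEFINE_PER_CPU_FIRST provides:

	#include <assert.h>
	#include <stddef.h>

	struct pda { unsigned long data_offset; };

	struct percpu_image {
		struct pda pda;		/* from .data.percpu.first */
		int other_var;		/* from .data.percpu */
	};

	int main(void)
	{
		static struct percpu_image cpu_area;

		/* the cpu's base address and its pda address coincide,
		   which is what lets %gs serve both roles */
		assert(offsetof(struct percpu_image, pda) == 0);
		assert((void *)&cpu_area.pda == (void *)&cpu_area);
		return 0;
	}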

Based on linux-2.6.git + x86.git

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
---
 arch/x86/Kconfig                  |    3 +
 arch/x86/kernel/head64.c          |   41 +++++++++++++++++++++++
 arch/x86/kernel/setup64.c         |   67 +++++++++++++++++++++++---------------
 arch/x86/kernel/smpboot_64.c      |   16 ---------
 arch/x86/kernel/vmlinux_64.lds.S  |    1 
 include/asm-generic/vmlinux.lds.h |    2 +
 include/asm-x86/pda.h             |   13 +++++--
 include/asm-x86/percpu.h          |   33 +++++++++++-------
 include/linux/percpu.h            |    9 ++++-
 9 files changed, 126 insertions(+), 59 deletions(-)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -103,6 +103,9 @@ config GENERIC_TIME_VSYSCALL
 config HAVE_SETUP_PER_CPU_AREA
 	def_bool X86_64
 
+config HAVE_ZERO_BASED_PER_CPU
+	def_bool X86_64
+
 config ARCH_SUPPORTS_OPROFILE
 	bool
 	default y
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -11,6 +11,7 @@
 #include <linux/string.h>
 #include <linux/percpu.h>
 #include <linux/start_kernel.h>
+#include <linux/bootmem.h>
 
 #include <asm/processor.h>
 #include <asm/proto.h>
@@ -23,6 +24,12 @@
 #include <asm/kdebug.h>
 #include <asm/e820.h>
 
+#ifdef CONFIG_SMP
+/* Only used before the per cpu areas are setup. */
+static struct x8664_pda boot_cpu_pda[NR_CPUS] __initdata;
+static struct x8664_pda *_cpu_pda_init[NR_CPUS] __initdata;
+#endif
+
 static void __init zap_identity_mappings(void)
 {
 	pgd_t *pgd = pgd_offset_k(0UL);
@@ -99,8 +106,14 @@ void __init x86_64_start_kernel(char * r
 
 	early_printk("Kernel alive\n");
 
+#ifdef CONFIG_SMP
+	_cpu_pda = (void *)_cpu_pda_init;
  	for (i = 0; i < NR_CPUS; i++)
  		cpu_pda(i) = &boot_cpu_pda[i];
+#endif
+
+	/* setup percpu segment offset for cpu 0 */
+	cpu_pda(0)->data_offset = (unsigned long)__per_cpu_load;
 
 	pda_init(0);
 	copy_bootdata(__va(real_mode_data));
@@ -125,3 +138,31 @@ void __init x86_64_start_kernel(char * r
 
 	start_kernel();
 }
+
+#ifdef	CONFIG_SMP
+/*
+ * Remove initial boot_cpu_pda array and cpu_pda pointer table.
+ *
+ * This depends on setup_per_cpu_areas relocating the pda to the beginning
+ * of the per_cpu area so that (_cpu_pda[i] != &boot_cpu_pda[i]).  If it
+ * is equal then the new pda has not been setup for this cpu, and the pda
+ * table will have a NULL address for this cpu.
+ */
+void __init x86_64_cleanup_pda(void)
+{
+	int i;
+
+	_cpu_pda = alloc_bootmem_low(nr_cpu_ids * sizeof(void *));
+
+	if (!_cpu_pda)
+		panic("Cannot allocate cpu pda table\n");
+
+	/* cpu_pda() now points to allocated cpu_pda_table */
+
+	for (i = 0; i < NR_CPUS; i++)
+ 		if (_cpu_pda_init[i] == &boot_cpu_pda[i])
+			cpu_pda(i) = NULL;
+		else
+			cpu_pda(i) = _cpu_pda_init[i];
+}
+#endif
--- a/arch/x86/kernel/setup64.c
+++ b/arch/x86/kernel/setup64.c
@@ -32,9 +32,13 @@ struct boot_params boot_params;
 
 cpumask_t cpu_initialized __cpuinitdata = CPU_MASK_NONE;
 
-struct x8664_pda *_cpu_pda[NR_CPUS] __read_mostly;
+#ifdef CONFIG_SMP
+struct x8664_pda **_cpu_pda __read_mostly;
 EXPORT_SYMBOL(_cpu_pda);
-struct x8664_pda boot_cpu_pda[NR_CPUS] __cacheline_aligned;
+#endif
+
+DEFINE_PER_CPU_FIRST(struct x8664_pda, pda);
+EXPORT_PER_CPU_SYMBOL(pda);
 
 struct desc_ptr idt_descr = { 256 * 16 - 1, (unsigned long) idt_table };
 
@@ -95,22 +99,14 @@ static void __init setup_per_cpu_maps(vo
 	int cpu;
 
 	for_each_possible_cpu(cpu) {
-#ifdef CONFIG_SMP
-		if (per_cpu_offset(cpu)) {
-#endif
-			per_cpu(x86_cpu_to_apicid, cpu) =
-						x86_cpu_to_apicid_init[cpu];
-			per_cpu(x86_bios_cpu_apicid, cpu) =
-						x86_bios_cpu_apicid_init[cpu];
+		per_cpu(x86_cpu_to_apicid, cpu) =
+					x86_cpu_to_apicid_init[cpu];
+
+		per_cpu(x86_bios_cpu_apicid, cpu) =
+					x86_bios_cpu_apicid_init[cpu];
 #ifdef CONFIG_NUMA
-			per_cpu(x86_cpu_to_node_map, cpu) =
-						x86_cpu_to_node_map_init[cpu];
-#endif
-#ifdef CONFIG_SMP
-		}
-		else
-			printk(KERN_NOTICE "per_cpu_offset zero for cpu %d\n",
-									cpu);
+		per_cpu(x86_cpu_to_node_map, cpu) =
+					x86_cpu_to_node_map_init[cpu];
 #endif
 	}
 
@@ -139,25 +135,46 @@ void __init setup_per_cpu_areas(void)
 	/* Copy section for each CPU (we discard the original) */
 	size = PERCPU_ENOUGH_ROOM;
 
-	printk(KERN_INFO "PERCPU: Allocating %lu bytes of per cpu data\n", size);
-	for_each_cpu_mask (i, cpu_possible_map) {
+	printk(KERN_INFO
+		"PERCPU: Allocating %lu bytes of per cpu data\n", size);
+
+	for_each_possible_cpu(i) {
+
+#ifndef CONFIG_NEED_MULTIPLE_NODES
+		char *ptr = alloc_bootmem_pages(size);
+#else
 		char *ptr;
 
-		if (!NODE_DATA(early_cpu_to_node(i))) {
-			printk("cpu with no node %d, num_online_nodes %d\n",
-			       i, num_online_nodes());
+		if (NODE_DATA(early_cpu_to_node(i)))
+			ptr = alloc_bootmem_pages_node
+				(NODE_DATA(early_cpu_to_node(i)), size);
+
+		else {
+			printk(KERN_INFO
+				"cpu %d has no node, num_online_nodes %d\n",
+				i, num_online_nodes());
 			ptr = alloc_bootmem_pages(size);
-		} else { 
-			ptr = alloc_bootmem_pages_node(NODE_DATA(early_cpu_to_node(i)), size);
 		}
+#endif
 		if (!ptr)
 			panic("Cannot allocate cpu data for CPU %d\n", i);
+
+		memcpy(ptr, __per_cpu_load, __per_cpu_size);
+
+		/* Relocate the pda */
+		memcpy(ptr, cpu_pda(i), sizeof(struct x8664_pda));
+		cpu_pda(i) = (struct x8664_pda *)ptr;
 		cpu_pda(i)->data_offset = ptr - __per_cpu_start;
-		memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
 	}
 
 	/* setup percpu data maps early */
 	setup_per_cpu_maps();
+
+	/* clean up early cpu_pda pointer array */
+	x86_64_cleanup_pda();
+
+	/* Fix up pda for this processor .... */
+	pda_init(0);
 } 
 
 void pda_init(int cpu)
--- a/arch/x86/kernel/smpboot_64.c
+++ b/arch/x86/kernel/smpboot_64.c
@@ -566,22 +566,6 @@ static int __cpuinit do_boot_cpu(int cpu
 		return -1;
 	}
 
-	/* Allocate node local memory for AP pdas */
-	if (cpu_pda(cpu) == &boot_cpu_pda[cpu]) {
-		struct x8664_pda *newpda, *pda;
-		int node = cpu_to_node(cpu);
-		pda = cpu_pda(cpu);
-		newpda = kmalloc_node(sizeof (struct x8664_pda), GFP_ATOMIC,
-				      node);
-		if (newpda) {
-			memcpy(newpda, pda, sizeof (struct x8664_pda));
-			cpu_pda(cpu) = newpda;
-		} else
-			printk(KERN_ERR
-		"Could not allocate node local PDA for CPU %d on node %d\n",
-				cpu, node);
-	}
-
 	alternatives_smp_switch(1);
 
 	c_idle.idle = get_idle_for_cpu(cpu);
--- a/arch/x86/kernel/vmlinux_64.lds.S
+++ b/arch/x86/kernel/vmlinux_64.lds.S
@@ -16,6 +16,7 @@ jiffies_64 = jiffies;
 _proxy_pda = 1;
 PHDRS {
 	text PT_LOAD FLAGS(5);	/* R_E */
+	percpu PT_LOAD FLAGS(4);	/* R__ */
 	data PT_LOAD FLAGS(7);	/* RWE */
 	user PT_LOAD FLAGS(7);	/* RWE */
 	data.init PT_LOAD FLAGS(7);	/* RWE */
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -347,6 +347,7 @@
 	percpu : { } :percpu						\
 	__per_cpu_load = .;						\
 	.data.percpu 0 : AT(__per_cpu_load - LOAD_OFFSET) {		\
+		*(.data.percpu.first)					\
 		*(.data.percpu)						\
 		*(.data.percpu.shared_aligned)				\
 		____per_cpu_size = .;					\
@@ -358,6 +359,7 @@
 	. = ALIGN(align);						\
 	__per_cpu_start = .;						\
 	.data.percpu  : AT(ADDR(.data.percpu) - LOAD_OFFSET) {		\
+		*(.data.percpu.first)					\
 		*(.data.percpu)						\
 		*(.data.percpu.shared_aligned)				\
 	}								\
--- a/include/asm-x86/pda.h
+++ b/include/asm-x86/pda.h
@@ -38,11 +38,16 @@ struct x8664_pda {
 	unsigned irq_spurious_count;
 } ____cacheline_aligned_in_smp;
 
-extern struct x8664_pda *_cpu_pda[];
-extern struct x8664_pda boot_cpu_pda[];
-extern void pda_init(int);
-
+#ifdef CONFIG_SMP
 #define cpu_pda(i) (_cpu_pda[i])
+extern struct x8664_pda **_cpu_pda;
+extern void x86_64_cleanup_pda(void);
+#else
+#define	cpu_pda(i)	(&per_cpu(pda, i))
+static inline void x86_64_cleanup_pda(void) { }
+#endif
+
+extern void pda_init(int);
 
 /*
  * There is no fast way to get the base address of the PDA, all the accesses
--- a/include/asm-x86/percpu.h
+++ b/include/asm-x86/percpu.h
@@ -13,13 +13,19 @@
 #include <asm/pda.h>
 
 #define __per_cpu_offset(cpu) (cpu_pda(cpu)->data_offset)
-#define __my_cpu_offset read_pda(data_offset)
-
 #define per_cpu_offset(x) (__per_cpu_offset(x))
 
+#define __my_cpu_offset read_pda(data_offset)
+#define __percpu_seg "%%gs:"
+
+#else
+#define __percpu_seg ""
 #endif
 #include <asm-generic/percpu.h>
 
+/* Calculate the offset to use with the segment register */
+#define seg_offset(name)   per_cpu_var(name)
+
 DECLARE_PER_CPU(struct x8664_pda, pda);
 
 #else /* CONFIG_X86_64 */
@@ -64,16 +70,11 @@ DECLARE_PER_CPU(struct x8664_pda, pda);
  *    PER_CPU(cpu_gdt_descr, %ebx)
  */
 #ifdef CONFIG_SMP
-
 #define __my_cpu_offset x86_read_percpu(this_cpu_off)
-
 /* fs segment starts at (positive) offset == __per_cpu_offset[cpu] */
 #define __percpu_seg "%%fs:"
-
 #else  /* !SMP */
-
 #define __percpu_seg ""
-
 #endif	/* SMP */
 
 #include <asm-generic/percpu.h>
@@ -81,6 +82,13 @@ DECLARE_PER_CPU(struct x8664_pda, pda);
 /* We can use this directly for local CPU (faster). */
 DECLARE_PER_CPU(unsigned long, this_cpu_off);
 
+#define seg_offset(name)	per_cpu_var(name)
+
+#endif /* __ASSEMBLY__ */
+#endif /* !CONFIG_X86_64 */
+
+#ifndef __ASSEMBLY__
+
 /* For arch-specific code, we can use direct single-insn ops (they
  * don't give an lvalue though). */
 extern void __bad_percpu_size(void);
@@ -132,11 +140,10 @@ extern void __bad_percpu_size(void);
 		}						\
 		ret__; })
 
-#define x86_read_percpu(var) percpu_from_op("mov", per_cpu__##var)
-#define x86_write_percpu(var,val) percpu_to_op("mov", per_cpu__##var, val)
-#define x86_add_percpu(var,val) percpu_to_op("add", per_cpu__##var, val)
-#define x86_sub_percpu(var,val) percpu_to_op("sub", per_cpu__##var, val)
-#define x86_or_percpu(var,val) percpu_to_op("or", per_cpu__##var, val)
+#define x86_read_percpu(var) percpu_from_op("mov", seg_offset(var))
+#define x86_write_percpu(var,val) percpu_to_op("mov", seg_offset(var), val)
+#define x86_add_percpu(var,val) percpu_to_op("add", seg_offset(var), val)
+#define x86_sub_percpu(var,val) percpu_to_op("sub", seg_offset(var), val)
+#define x86_or_percpu(var,val) percpu_to_op("or", seg_offset(var), val)
 #endif /* !__ASSEMBLY__ */
-#endif /* !CONFIG_X86_64 */
 #endif /* _ASM_X86_PERCPU_H_ */
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -18,11 +18,18 @@
 	__attribute__((__section__(".data.percpu.shared_aligned")))	\
 	PER_CPU_ATTRIBUTES __typeof__(type) per_cpu__##name		\
 	____cacheline_aligned_in_smp
+
+#define DEFINE_PER_CPU_FIRST(type, name)				\
+	__attribute__((__section__(".data.percpu.first")))		\
+	PER_CPU_ATTRIBUTES __typeof__(type) per_cpu__##name
 #else
 #define DEFINE_PER_CPU(type, name)					\
 	PER_CPU_ATTRIBUTES __typeof__(type) per_cpu__##name
 
-#define DEFINE_PER_CPU_SHARED_ALIGNED(type, name)		      \
+#define DEFINE_PER_CPU_SHARED_ALIGNED(type, name)			\
+	DEFINE_PER_CPU(type, name)
+
+#define DEFINE_PER_CPU_FIRST(type, name)				\
 	DEFINE_PER_CPU(type, name)
 #endif
 

-- 


* [PATCH 4/4] x86_64: Cleanup non-smp usage of cpu maps
  2008-02-01 19:14 [PATCH 0/4] percpu: Optimize percpu accesses travis
                   ` (2 preceding siblings ...)
  2008-02-01 19:14 ` [PATCH 3/4] x86_64: Fold pda into per cpu area travis
@ 2008-02-01 19:14 ` travis
  2008-02-15 20:17   ` Ingo Molnar
  3 siblings, 1 reply; 17+ messages in thread
From: travis @ 2008-02-01 19:14 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen, Ingo Molnar, Thomas Gleixner
  Cc: Jeremy Fitzhardinge, Christoph Lameter, Jack Steiner, linux-mm,
	linux-kernel

[-- Attachment #1: cleanup_nonsmp_maps --]
[-- Type: text/plain, Size: 6165 bytes --]

Clean up references to the early cpu maps for the non-SMP configuration
and remove some functions that are called only for SMP configurations.
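
A sketch (user-space model; CONFIG_SMP stands in for the kernel config
symbol) of the fallback pattern used here: defining the early pointer
as a compile-time NULL on !SMP lets common code keep a single runtime
check instead of sprinkling #ifdefs at every call site:

	#include <stdio.h>

	#ifdef CONFIG_SMP
	extern int x86_cpu_to_node_map_init[];		/* real table */
	#define x86_cpu_to_node_map_early_ptr x86_cpu_to_node_map_init
	#else
	#define x86_cpu_to_node_map_early_ptr NULL	/* folds away */
	#endif

	static void numa_set_node_model(int cpu, int node)
	{
		int *map = x86_cpu_to_node_map_early_ptr;

		if (map)		/* constant false on !SMP builds */
			map[cpu] = node;
		else
			printf("no early map for cpu %d\n", cpu);
	}

	int main(void)
	{
		numa_set_node_model(0, 1);
		return 0;
	}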

Based on linux-2.6.git + x86.git

Signed-off-by: Mike Travis <travis@sgi.com>
---
 arch/x86/kernel/genapic_64.c |    2 ++
 arch/x86/kernel/mpparse_64.c |    2 ++
 arch/x86/kernel/setup64.c    |    3 +++
 arch/x86/kernel/smpboot_32.c |    2 ++
 arch/x86/mm/numa_64.c        |   12 ++++++------
 include/asm-x86/pda.h        |    1 -
 include/asm-x86/smp_32.h     |    4 ++++
 include/asm-x86/smp_64.h     |    5 +++++
 include/asm-x86/topology.h   |   16 ++++++++++++----
 9 files changed, 36 insertions(+), 11 deletions(-)

--- a/arch/x86/kernel/genapic_64.c
+++ b/arch/x86/kernel/genapic_64.c
@@ -25,9 +25,11 @@
 #endif
 
 /* which logical CPU number maps to which CPU (physical APIC ID) */
+#ifdef CONFIG_SMP
 u16 x86_cpu_to_apicid_init[NR_CPUS] __initdata
 					= { [0 ... NR_CPUS-1] = BAD_APICID };
 void *x86_cpu_to_apicid_early_ptr;
+#endif
 DEFINE_PER_CPU(u16, x86_cpu_to_apicid) = BAD_APICID;
 EXPORT_PER_CPU_SYMBOL(x86_cpu_to_apicid);
 
--- a/arch/x86/kernel/mpparse_64.c
+++ b/arch/x86/kernel/mpparse_64.c
@@ -67,9 +67,11 @@ unsigned disabled_cpus __cpuinitdata;
 /* Bitmask of physically existing CPUs */
 physid_mask_t phys_cpu_present_map = PHYSID_MASK_NONE;
 
+#ifdef CONFIG_SMP
 u16 x86_bios_cpu_apicid_init[NR_CPUS] __initdata
 				= { [0 ... NR_CPUS-1] = BAD_APICID };
 void *x86_bios_cpu_apicid_early_ptr;
+#endif
 DEFINE_PER_CPU(u16, x86_bios_cpu_apicid) = BAD_APICID;
 EXPORT_PER_CPU_SYMBOL(x86_bios_cpu_apicid);
 
--- a/arch/x86/kernel/setup64.c
+++ b/arch/x86/kernel/setup64.c
@@ -89,6 +89,8 @@ static int __init nonx32_setup(char *str
 }
 __setup("noexec32=", nonx32_setup);
 
+
+#ifdef CONFIG_SMP
 /*
  * Copy data used in early init routines from the initial arrays to the
  * per cpu data areas.  These arrays then become expendable and the
@@ -176,6 +178,7 @@ void __init setup_per_cpu_areas(void)
 	/* Fix up pda for this processor .... */
 	pda_init(0);
 } 
+#endif /* CONFIG_SMP */
 
 void pda_init(int cpu)
 { 
--- a/arch/x86/kernel/smpboot_32.c
+++ b/arch/x86/kernel/smpboot_32.c
@@ -92,9 +92,11 @@ DEFINE_PER_CPU_SHARED_ALIGNED(struct cpu
 EXPORT_PER_CPU_SYMBOL(cpu_info);
 
 /* which logical CPU number maps to which CPU (physical APIC ID) */
+#ifdef CONFIG_SMP
 u8 x86_cpu_to_apicid_init[NR_CPUS] __initdata =
 			{ [0 ... NR_CPUS-1] = BAD_APICID };
 void *x86_cpu_to_apicid_early_ptr;
+#endif
 DEFINE_PER_CPU(u8, x86_cpu_to_apicid) = BAD_APICID;
 EXPORT_PER_CPU_SYMBOL(x86_cpu_to_apicid);
 
--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -31,13 +31,15 @@ bootmem_data_t plat_node_bdata[MAX_NUMNO
 
 struct memnode memnode;
 
+#ifdef CONFIG_SMP
 int x86_cpu_to_node_map_init[NR_CPUS] = {
 	[0 ... NR_CPUS-1] = NUMA_NO_NODE
 };
 void *x86_cpu_to_node_map_early_ptr;
+EXPORT_SYMBOL(x86_cpu_to_node_map_early_ptr);
+#endif
 DEFINE_PER_CPU(int, x86_cpu_to_node_map) = NUMA_NO_NODE;
 EXPORT_PER_CPU_SYMBOL(x86_cpu_to_node_map);
-EXPORT_SYMBOL(x86_cpu_to_node_map_early_ptr);
 
 s16 apicid_to_node[MAX_LOCAL_APIC] __cpuinitdata = {
 	[0 ... MAX_LOCAL_APIC-1] = NUMA_NO_NODE
@@ -536,13 +538,11 @@ void __cpuinit numa_set_node(int cpu, in
 {
 	int *cpu_to_node_map = x86_cpu_to_node_map_early_ptr;
 
-	cpu_pda(cpu)->nodenumber = node;
-
-	if(cpu_to_node_map)
+	if (cpu_to_node_map)
 		cpu_to_node_map[cpu] = node;
-	else if(per_cpu_offset(cpu))
+	else if (per_cpu_offset(cpu))
 		per_cpu(x86_cpu_to_node_map, cpu) = node;
-	else
+ 	else
 		Dprintk(KERN_INFO "Setting node for non-present cpu %d\n", cpu);
 }
 
--- a/include/asm-x86/pda.h
+++ b/include/asm-x86/pda.h
@@ -22,7 +22,6 @@ struct x8664_pda {
 					   offset 40!!! */
 #endif
 	char *irqstackptr;
-	unsigned int nodenumber;	/* number of current node */
 	unsigned int __softirq_pending;
 	unsigned int __nmi_count;	/* number of NMI on this CPUs */
 	short mmu_state;
--- a/include/asm-x86/smp_32.h
+++ b/include/asm-x86/smp_32.h
@@ -29,8 +29,12 @@ extern void unlock_ipi_call_lock(void);
 extern void (*mtrr_hook) (void);
 extern void zap_low_mappings (void);
 
+#ifdef CONFIG_SMP
 extern u8 __initdata x86_cpu_to_apicid_init[];
 extern void *x86_cpu_to_apicid_early_ptr;
+#else
+#define x86_cpu_to_apicid_early_ptr NULL
+#endif
 
 DECLARE_PER_CPU(cpumask_t, cpu_sibling_map);
 DECLARE_PER_CPU(cpumask_t, cpu_core_map);
--- a/include/asm-x86/smp_64.h
+++ b/include/asm-x86/smp_64.h
@@ -26,10 +26,15 @@ extern void unlock_ipi_call_lock(void);
 extern int smp_call_function_mask(cpumask_t mask, void (*func)(void *),
 				  void *info, int wait);
 
+#ifdef CONFIG_SMP
 extern u16 __initdata x86_cpu_to_apicid_init[];
 extern u16 __initdata x86_bios_cpu_apicid_init[];
 extern void *x86_cpu_to_apicid_early_ptr;
 extern void *x86_bios_cpu_apicid_early_ptr;
+#else
+#define x86_cpu_to_apicid_early_ptr NULL
+#define x86_bios_cpu_apicid_early_ptr NULL
+#endif
 
 DECLARE_PER_CPU(cpumask_t, cpu_sibling_map);
 DECLARE_PER_CPU(cpumask_t, cpu_core_map);
--- a/include/asm-x86/topology.h
+++ b/include/asm-x86/topology.h
@@ -35,8 +35,14 @@ extern int cpu_to_node_map[];
 
 #else
 DECLARE_PER_CPU(int, x86_cpu_to_node_map);
+
+#ifdef CONFIG_SMP
 extern int x86_cpu_to_node_map_init[];
 extern void *x86_cpu_to_node_map_early_ptr;
+#else
+#define x86_cpu_to_node_map_early_ptr NULL
+#endif
+
 /* Returns the number of the current Node. */
 #define numa_node_id()		(early_cpu_to_node(raw_smp_processor_id()))
 #endif
@@ -54,6 +60,8 @@ static inline int cpu_to_node(int cpu)
 }
 
 #else /* CONFIG_X86_64 */
+
+#ifdef CONFIG_SMP
 static inline int early_cpu_to_node(int cpu)
 {
 	int *cpu_to_node_map = x86_cpu_to_node_map_early_ptr;
@@ -65,6 +73,9 @@ static inline int early_cpu_to_node(int 
 	else
 		return NUMA_NO_NODE;
 }
+#else
+#define	early_cpu_to_node(cpu)	cpu_to_node(cpu)
+#endif
 
 static inline int cpu_to_node(int cpu)
 {
@@ -76,10 +87,7 @@ static inline int cpu_to_node(int cpu)
 		return ((int *)x86_cpu_to_node_map_early_ptr)[cpu];
 	}
 #endif
-	if (per_cpu_offset(cpu))
-		return per_cpu(x86_cpu_to_node_map, cpu);
-	else
-		return NUMA_NO_NODE;
+	return per_cpu(x86_cpu_to_node_map, cpu);
 }
 #endif /* CONFIG_X86_64 */
 

-- 


* Re: [PATCH 3/4] x86_64: Fold pda into per cpu area
  2008-02-01 19:14 ` [PATCH 3/4] x86_64: Fold pda into per cpu area travis
@ 2008-02-15 20:16   ` Ingo Molnar
  2008-02-15 22:43     ` Christoph Lameter
  2008-02-17  6:22     ` Yinghai Lu
  0 siblings, 2 replies; 17+ messages in thread
From: Ingo Molnar @ 2008-02-15 20:16 UTC (permalink / raw)
  To: travis
  Cc: Andrew Morton, Andi Kleen, Thomas Gleixner, Jeremy Fitzhardinge,
	Christoph Lameter, Jack Steiner, linux-mm, linux-kernel


* travis@sgi.com <travis@sgi.com> wrote:

>  include/asm-generic/vmlinux.lds.h |    2 +
>  include/linux/percpu.h            |    9 ++++-

couldn't these two generic bits be done separately (perhaps a preparatory
but otherwise NOP patch pushed upstream straight away) to make
subsequent patches only touch x86 architecture files?

	Ingo


* Re: [PATCH 4/4] x86_64: Cleanup non-smp usage of cpu maps
  2008-02-01 19:14 ` [PATCH 4/4] x86_64: Cleanup non-smp usage of cpu maps travis
@ 2008-02-15 20:17   ` Ingo Molnar
  0 siblings, 0 replies; 17+ messages in thread
From: Ingo Molnar @ 2008-02-15 20:17 UTC (permalink / raw)
  To: travis
  Cc: Andrew Morton, Andi Kleen, Thomas Gleixner, Jeremy Fitzhardinge,
	Christoph Lameter, Jack Steiner, linux-mm, linux-kernel


* travis@sgi.com <travis@sgi.com> wrote:

> Clean up references to the early cpu maps for the non-SMP configuration
> and remove some functions that are called only for SMP configurations.

thanks, applied.

	Ingo


* Re: [PATCH 3/4] x86_64: Fold pda into per cpu area
  2008-02-15 20:16   ` Ingo Molnar
@ 2008-02-15 22:43     ` Christoph Lameter
  2008-02-17  6:22     ` Yinghai Lu
  1 sibling, 0 replies; 17+ messages in thread
From: Christoph Lameter @ 2008-02-15 22:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: travis, Andrew Morton, Andi Kleen, Thomas Gleixner,
	Jeremy Fitzhardinge, Jack Steiner, linux-mm, linux-kernel

On Fri, 15 Feb 2008, Ingo Molnar wrote:

> 
> * travis@sgi.com <travis@sgi.com> wrote:
> 
> >  include/asm-generic/vmlinux.lds.h |    2 +
> >  include/linux/percpu.h            |    9 ++++-
> 
> couldn't these two generic bits be done separately (perhaps a preparatory
> but otherwise NOP patch pushed upstream straight away) to make
> subsequent patches only touch x86 architecture files?

Yes, those modifications could be folded into the generic patch for
zero-based percpu configurations.


* Re: [PATCH 3/4] x86_64: Fold pda into per cpu area
  2008-02-15 20:16   ` Ingo Molnar
  2008-02-15 22:43     ` Christoph Lameter
@ 2008-02-17  6:22     ` Yinghai Lu
  2008-02-17  7:36       ` Yinghai Lu
  1 sibling, 1 reply; 17+ messages in thread
From: Yinghai Lu @ 2008-02-17  6:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: travis, Andrew Morton, Andi Kleen, Thomas Gleixner,
	Jeremy Fitzhardinge, Christoph Lameter, Jack Steiner, linux-mm,
	linux-kernel

On Feb 15, 2008 12:16 PM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * travis@sgi.com <travis@sgi.com> wrote:
>
> >  include/asm-generic/vmlinux.lds.h |    2 +
> >  include/linux/percpu.h            |    9 ++++-
>
> couldn't these two generic bits be done separately (perhaps a preparatory
> but otherwise NOP patch pushed upstream straight away) to make
> subsequent patches only touch x86 architecture files?

this patch needs to be applied to mainline asap.

or you need to revert the patch touching include/asm-x86/percpu.h

+#ifdef CONFIG_X86_64
+#include <linux/compiler.h>
+
+/* Same as asm-generic/percpu.h, except that we store the per cpu offset
+   in the PDA. Longer term the PDA and every per cpu variable
+   should be just put into a single section and referenced directly
+   from %gs */
+
+#ifdef CONFIG_SMP
+#include <asm/pda.h>
+
+#define __per_cpu_offset(cpu) (cpu_pda(cpu)->data_offset)
+#define __my_cpu_offset read_pda(data_offset)
+
+#define per_cpu_offset(x) (__per_cpu_offset(x))
+
 #endif
+#include <asm-generic/percpu.h>
+
+DECLARE_PER_CPU(struct x8664_pda, pda);
+
+#else /* CONFIG_X86_64 */

because the current tree, in setup_per_cpu_areas(), has
     cpu_pda(i)->data_offset = ptr - __per_cpu_start;

but at that time all APs are still using the boot cpu's cpu_pda, and
APs only get their own pda in do_boot_cpu().

the result is that all cpus will have the same data_offset and will
share one per_cpu data area... that is totally wrong!!

that could explain a lot of the strange panics seen recently with NUMA...

YH


* Re: [PATCH 3/4] x86_64: Fold pda into per cpu area
  2008-02-17  6:22     ` Yinghai Lu
@ 2008-02-17  7:36       ` Yinghai Lu
  0 siblings, 0 replies; 17+ messages in thread
From: Yinghai Lu @ 2008-02-17  7:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: travis, Andrew Morton, Andi Kleen, Thomas Gleixner,
	Jeremy Fitzhardinge, Christoph Lameter, Jack Steiner, linux-mm,
	linux-kernel

On Feb 16, 2008 10:22 PM, Yinghai Lu <yhlu.kernel@gmail.com> wrote:
> On Feb 15, 2008 12:16 PM, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * travis@sgi.com <travis@sgi.com> wrote:
> >
> > >  include/asm-generic/vmlinux.lds.h |    2 +
> > >  include/linux/percpu.h            |    9 ++++-
> >
> > couldn't these two generic bits be done separately (perhaps a preparatory
> > but otherwise NOP patch pushed upstream straight away) to make
> > subsequent patches only touch x86 architecture files?
>
> this patch need to apply to mainline asap.
>
> or you need revert to the patch about include/asm-x86/percpu.h
>
> +#ifdef CONFIG_X86_64
> +#include <linux/compiler.h>
> +
> +/* Same as asm-generic/percpu.h, except that we store the per cpu offset
> +   in the PDA. Longer term the PDA and every per cpu variable
> +   should be just put into a single section and referenced directly
> +   from %gs */
> +
> +#ifdef CONFIG_SMP
> +#include <asm/pda.h>
> +
> +#define __per_cpu_offset(cpu) (cpu_pda(cpu)->data_offset)
> +#define __my_cpu_offset read_pda(data_offset)
> +
> +#define per_cpu_offset(x) (__per_cpu_offset(x))
> +
>  #endif
> +#include <asm-generic/percpu.h>
> +
> +DECLARE_PER_CPU(struct x8664_pda, pda);
> +
> +#else /* CONFIG_X86_64 */
>
> because the current tree, in setup_per_cpu_areas(), has
>      cpu_pda(i)->data_offset = ptr - __per_cpu_start;
>
> but at that time all APs are still using the boot cpu's cpu_pda, and
> APs only get their own pda in do_boot_cpu().

sorry, boot_cpu_pda is an array... so that is safe.

YH


* [PATCH 3/4] x86_64: Fold pda into per cpu area
  2008-07-25 21:11 [PATCH 0/4] x86_64: Optimize percpu accesses Mike Travis
@ 2008-07-25 21:11 ` Mike Travis
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Travis @ 2008-07-25 21:11 UTC (permalink / raw)
  To: Ingo Molnar, Andrew Morton
  Cc: Eric W. Biederman, Hugh Dickins, Jack Steiner,
	Jeremy Fitzhardinge, H. Peter Anvin, linux-kernel,
	Christoph Lameter

[-- Attachment #1: fold_pda_into_percpu --]
[-- Type: text/plain, Size: 13594 bytes --]

WARNING: there are two FIXME's in arch/x86/xen/enlighten.c
	 and arch/x86/xen/smp.c that I'm not sure how to handle...?

  * Declare the pda as a per cpu variable.

  * Relocate the initial pda in head_64.S for the boot cpu (0).
    For secondary cpus, do_boot_cpu() sets up the correct initial pda.

Based on linux-2.6.tip/master

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Mike Travis <travis@sgi.com>
---
 arch/x86/kernel/cpu/common_64.c |    4 -
 arch/x86/kernel/head64.c        |   29 +-----------
 arch/x86/kernel/head_64.S       |   19 ++++++--
 arch/x86/kernel/setup_percpu.c  |   93 +++++++++++-----------------------------
 arch/x86/kernel/smpboot.c       |   53 ----------------------
 arch/x86/xen/enlighten.c        |   10 ++++
 arch/x86/xen/smp.c              |   11 +---
 include/asm-x86/desc.h          |    5 ++
 include/asm-x86/pda.h           |    3 -
 include/asm-x86/percpu.h        |   13 -----
 include/asm-x86/setup.h         |    1 
 include/asm-x86/smp.h           |    2 
 include/asm-x86/trampoline.h    |    1 
 13 files changed, 72 insertions(+), 172 deletions(-)

--- linux-2.6.tip.orig/arch/x86/kernel/cpu/common_64.c
+++ linux-2.6.tip/arch/x86/kernel/cpu/common_64.c
@@ -418,8 +418,8 @@ __setup("clearcpuid=", setup_disablecpui
 
 cpumask_t cpu_initialized __cpuinitdata = CPU_MASK_NONE;
 
-struct x8664_pda **_cpu_pda __read_mostly;
-EXPORT_SYMBOL(_cpu_pda);
+DEFINE_PER_CPU_FIRST(struct x8664_pda, pda);
+EXPORT_PER_CPU_SYMBOL(pda);
 
 struct desc_ptr idt_descr = { 256 * 16 - 1, (unsigned long) idt_table };
 
--- linux-2.6.tip.orig/arch/x86/kernel/head64.c
+++ linux-2.6.tip/arch/x86/kernel/head64.c
@@ -25,27 +25,6 @@
 #include <asm/e820.h>
 #include <asm/bios_ebda.h>
 
-/* boot cpu pda */
-static struct x8664_pda _boot_cpu_pda __read_mostly;
-
-#ifdef CONFIG_SMP
-/*
- * We install an empty cpu_pda pointer table to indicate to early users
- * (numa_set_node) that the cpu_pda pointer table for cpus other than
- * the boot cpu is not yet setup.
- */
-static struct x8664_pda *__cpu_pda[NR_CPUS] __initdata;
-#else
-static struct x8664_pda *__cpu_pda[NR_CPUS] __read_mostly;
-#endif
-
-void __init x86_64_init_pda(void)
-{
-	_cpu_pda = __cpu_pda;
-	cpu_pda(0) = &_boot_cpu_pda;
-	pda_init(0);
-}
-
 static void __init zap_identity_mappings(void)
 {
 	pgd_t *pgd = pgd_offset_k(0UL);
@@ -98,6 +77,10 @@ void __init x86_64_start_kernel(char * r
 	/* Cleanup the over mapped high alias */
 	cleanup_highmap();
 
+	/* Initialize boot cpu_pda data */
+	/* (See head_64.S for earlier pda/gdt initialization) */
+	pda_init(0);
+
 	for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
 #ifdef CONFIG_EARLY_PRINTK
 		set_intr_gate(i, &early_idt_handlers[i]);
@@ -109,10 +92,6 @@ void __init x86_64_start_kernel(char * r
 
 	early_printk("Kernel alive\n");
 
-	x86_64_init_pda();
-
-	early_printk("Kernel really alive\n");
-
 	x86_64_start_reservations(real_mode_data);
 }
 
--- linux-2.6.tip.orig/arch/x86/kernel/head_64.S
+++ linux-2.6.tip/arch/x86/kernel/head_64.S
@@ -248,14 +248,21 @@ ENTRY(secondary_startup_64)
 	movl %eax,%gs
 
 	/* 
-	 * Setup up a dummy PDA. this is just for some early bootup code
-	 * that does in_interrupt() 
+	 * Setup up the real PDA.
+	 *
+	 * For SMP, the boot cpu (0) uses the static pda which is the first
+	 * element in the percpu area (@__per_cpu_load).  This pda is moved
+	 * to the real percpu area once that is allocated.  Secondary cpus
+	 * will use the initial_pda value setup in do_boot_cpu().
 	 */ 
 	movl	$MSR_GS_BASE,%ecx
-	movq	$empty_zero_page,%rax
+	movq	initial_pda(%rip), %rax
 	movq    %rax,%rdx
 	shrq	$32,%rdx
 	wrmsr	
+#ifdef CONFIG_SMP
+	movq	%rax, %gs:pda_data_offset
+#endif
 
 	/* esi is pointer to real mode structure with interesting info.
 	   pass it to C */
@@ -278,6 +285,12 @@ ENTRY(secondary_startup_64)
 	.align	8
 	ENTRY(initial_code)
 	.quad	x86_64_start_kernel
+	ENTRY(initial_pda)
+#ifdef CONFIG_SMP
+	.quad	__per_cpu_load		# Overwritten for secondary CPUs
+#else
+	.quad	per_cpu__pda
+#endif
 	__FINITDATA
 
 	ENTRY(stack_start)
--- linux-2.6.tip.orig/arch/x86/kernel/setup_percpu.c
+++ linux-2.6.tip/arch/x86/kernel/setup_percpu.c
@@ -134,56 +134,8 @@ unsigned long __per_cpu_offset[NR_CPUS] 
 #endif
 EXPORT_SYMBOL(__per_cpu_offset);
 
-#if !defined(CONFIG_SMP) || !defined(CONFIG_X86_64)
-static inline void setup_cpu_pda_map(void) { }
-
-#else /* CONFIG_SMP && CONFIG_X86_64 */
-
-/*
- * Allocate cpu_pda pointer table and array via alloc_bootmem.
- */
-static void __init setup_cpu_pda_map(void)
-{
-	char *pda;
-	struct x8664_pda **new_cpu_pda;
-	unsigned long size;
-	int cpu;
-
-	size = roundup(sizeof(struct x8664_pda), cache_line_size());
-
-	/* allocate cpu_pda array and pointer table */
-	{
-		unsigned long tsize = nr_cpu_ids * sizeof(void *);
-		unsigned long asize = size * (nr_cpu_ids - 1);
-
-		tsize = roundup(tsize, cache_line_size());
-		new_cpu_pda = alloc_bootmem(tsize + asize);
-		pda = (char *)new_cpu_pda + tsize;
-	}
-
-	/* initialize pointer table to static pda's */
-	for_each_possible_cpu(cpu) {
-		if (cpu == 0) {
-			/* leave boot cpu pda in place */
-			new_cpu_pda[0] = cpu_pda(0);
-			DBG("cpu %4d pda %p\n", cpu, cpu_pda(0));
-			continue;
-		}
-		DBG("cpu %4d pda %p\n", cpu, pda);
-		new_cpu_pda[cpu] = (struct x8664_pda *)pda;
-		new_cpu_pda[cpu]->in_bootmem = 1;
-		pda += size;
-	}
-
-	/* point to new pointer table */
-	_cpu_pda = new_cpu_pda;
-}
-#endif
-
 /*
- * Great future plan:
- * Declare PDA itself and support (irqstack,tss,pgd) as per cpu data.
- * Always point %gs to its beginning
+ * Allocate and initialize the per cpu areas which include the PDAs.
  */
 void __init setup_per_cpu_areas(void)
 {
@@ -191,16 +143,11 @@ void __init setup_per_cpu_areas(void)
 	char *ptr;
 	int cpu;
 
-	/* Setup cpu_pda map */
-	setup_cpu_pda_map();
-
 	/* Copy section for each CPU (we discard the original) */
 	size = PERCPU_ENOUGH_ROOM;
 	printk(KERN_INFO "PERCPU: Allocating %zd bytes of per cpu data\n",
 			  size);
 
-	DBG("PERCPU: __per_cpu_start %p\n", __per_cpu_start);
-
 	for_each_possible_cpu(cpu) {
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 		ptr = alloc_bootmem_pages(size);
@@ -215,26 +162,38 @@ void __init setup_per_cpu_areas(void)
 		else
 			ptr = alloc_bootmem_pages_node(NODE_DATA(node), size);
 #endif
-		DBG("PERCPU: cpu %4d %p pda %p %p\n",
-			     cpu, ptr, _cpu_pda[cpu], cpu_pda(cpu));
-
 		/* Initialize each cpu's per_cpu area and save pointer */
 		memcpy(ptr, __per_cpu_load, __per_cpu_size);
 		per_cpu_offset(cpu) = ptr - __per_cpu_start;
 
-#ifdef CONFIG_X86_64
-		/* save for __my_cpu_offset() */
-		cpu_pda(cpu)->data_offset = (unsigned long)ptr;
+		DBG("PERCPU: cpu %4d %p\n", cpu, ptr);
 
+#ifdef CONFIG_X86_64
 		/*
-		 * The boot cpu gdt page must be reloaded as we moved it
-		 * from the static per cpu area to the newly allocated area.
+		 * Note the boot cpu (0) has been using the static per_cpu load
+		 * area for it's pda.  We need to zero out the pdas for the
+		 * other cpus that are coming online.
+		 *
+		 * Additionally, for the boot cpu the gdt page must be reloaded
+		 * as we moved it from the static per cpu area to the newly
+		 * allocated area.
 		 */
-		if (cpu == 0) {
-			struct desc_ptr	gdt_descr = early_gdt_descr;
-
-			gdt_descr.address = (unsigned long)get_cpu_gdt_table(0);
-			native_load_gdt(&gdt_descr);
+		{
+			/* We rely on the fact that pda is the first element */
+			struct x8664_pda *pda = (struct x8664_pda *)ptr;
+
+			if (cpu) {
+				memset(pda, 0, sizeof(*pda));
+				pda->data_offset = (unsigned long)ptr;
+			} else {
+				struct desc_ptr	gdt_descr = early_gdt_descr;
+
+				pda->data_offset = (unsigned long)ptr;
+				gdt_descr.address =
+					(unsigned long)get_cpu_gdt_table(0);
+				native_load_gdt(&gdt_descr);
+				pda_init(0);
+			}
 		}
 #endif
 	}
--- linux-2.6.tip.orig/arch/x86/kernel/smpboot.c
+++ linux-2.6.tip/arch/x86/kernel/smpboot.c
@@ -744,45 +744,6 @@ static void __cpuinit do_fork_idle(struc
 	complete(&c_idle->done);
 }
 
-#ifdef CONFIG_X86_64
-/*
- * Allocate node local memory for the AP pda.
- *
- * Must be called after the _cpu_pda pointer table is initialized.
- */
-int __cpuinit get_local_pda(int cpu)
-{
-	struct x8664_pda *oldpda, *newpda;
-	unsigned long size = sizeof(struct x8664_pda);
-	int node = cpu_to_node(cpu);
-
-	if (cpu_pda(cpu) && !cpu_pda(cpu)->in_bootmem)
-		return 0;
-
-	oldpda = cpu_pda(cpu);
-	newpda = kmalloc_node(size, GFP_ATOMIC, node);
-	if (!newpda) {
-		printk(KERN_ERR "Could not allocate node local PDA "
-			"for CPU %d on node %d\n", cpu, node);
-
-		if (oldpda)
-			return 0;	/* have a usable pda */
-		else
-			return -1;
-	}
-
-	if (oldpda) {
-		memcpy(newpda, oldpda, size);
-		if (!after_bootmem)
-			free_bootmem((unsigned long)oldpda, size);
-	}
-
-	newpda->in_bootmem = 0;
-	cpu_pda(cpu) = newpda;
-	return 0;
-}
-#endif /* CONFIG_X86_64 */
-
 static int __cpuinit do_boot_cpu(int apicid, int cpu)
 /*
  * NOTE - on most systems this is a PHYSICAL apic ID, but on multiquad
@@ -800,16 +761,6 @@ static int __cpuinit do_boot_cpu(int api
 	};
 	INIT_WORK(&c_idle.work, do_fork_idle);
 
-#ifdef CONFIG_X86_64
-	/* Allocate node local memory for AP pdas */
-	if (cpu > 0) {
-		boot_error = get_local_pda(cpu);
-		if (boot_error)
-			goto restore_state;
-			/* if can't get pda memory, can't start cpu */
-	}
-#endif
-
 	alternatives_smp_switch(1);
 
 	c_idle.idle = get_idle_for_cpu(cpu);
@@ -847,6 +798,7 @@ do_rest:
 #else
 	cpu_pda(cpu)->pcurrent = c_idle.idle;
 	clear_tsk_thread_flag(c_idle.idle, TIF_FORK);
+	initial_pda = (unsigned long)get_cpu_pda(cpu);
 #endif
 	early_gdt_descr.address = (unsigned long)get_cpu_gdt_table(cpu);
 	initial_code = (unsigned long)start_secondary;
@@ -921,9 +873,6 @@ do_rest:
 				inquire_remote_apic(apicid);
 		}
 	}
-#ifdef CONFIG_X86_64
-restore_state:
-#endif
 	if (boot_error) {
 		/* Try to put things back the way they were before ... */
 		numa_remove_cpu(cpu); /* was set by numa_add_cpu */
--- linux-2.6.tip.orig/arch/x86/xen/enlighten.c
+++ linux-2.6.tip/arch/x86/xen/enlighten.c
@@ -1748,8 +1748,18 @@ asmlinkage void __init xen_start_kernel(
 #ifdef CONFIG_X86_64
 	/* Disable until direct per-cpu data access. */
 	have_vcpu_info_placement = 0;
+#if 0
+	/*
+	 * FIXME: is the above still true?
+	 * Also, x86_64_init_pda() has been removed...
+	 *   should anything replace it?
+	 *   (The offset for cpu_pda(0) is statically initialized
+	 *   to __per_cpu_load, while the remaining pda's come online
+	 *   in setup_per_cpu_areas().)
+	 */
 	x86_64_init_pda();
 #endif
+#endif
 
 	xen_smp_init();
 
--- linux-2.6.tip.orig/arch/x86/xen/smp.c
+++ linux-2.6.tip/arch/x86/xen/smp.c
@@ -285,13 +285,10 @@ static int __cpuinit xen_cpu_up(unsigned
 #endif
 
 #ifdef CONFIG_X86_64
-	/* Allocate node local memory for AP pdas */
-	WARN_ON(cpu == 0);
-	if (cpu > 0) {
-		rc = get_local_pda(cpu);
-		if (rc)
-			return rc;
-	}
+	/*
+	 * FIXME: I don't believe that calling get_local_pda() is
+	 * required any more...?
+	 */
 #endif
 
 #ifdef CONFIG_X86_32
--- linux-2.6.tip.orig/include/asm-x86/desc.h
+++ linux-2.6.tip/include/asm-x86/desc.h
@@ -41,6 +41,11 @@ static inline struct desc_struct *get_cp
 
 #ifdef CONFIG_X86_64
 
+static inline struct x8664_pda *get_cpu_pda(unsigned int cpu)
+{
+	return &per_cpu(pda, cpu);
+}
+
 static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func,
 			     unsigned dpl, unsigned ist, unsigned seg)
 {
--- linux-2.6.tip.orig/include/asm-x86/pda.h
+++ linux-2.6.tip/include/asm-x86/pda.h
@@ -37,10 +37,9 @@ struct x8664_pda {
 	unsigned irq_spurious_count;
 } ____cacheline_aligned_in_smp;
 
-extern struct x8664_pda **_cpu_pda;
 extern void pda_init(int);
 
-#define cpu_pda(i) (_cpu_pda[i])
+#define cpu_pda(cpu) (&per_cpu(pda, cpu))
 
 /*
  * There is no fast way to get the base address of the PDA, all the accesses
--- linux-2.6.tip.orig/include/asm-x86/percpu.h
+++ linux-2.6.tip/include/asm-x86/percpu.h
@@ -3,20 +3,11 @@
 
 #ifdef CONFIG_X86_64
 #include <linux/compiler.h>
-
-/* Same as asm-generic/percpu.h, except that we store the per cpu offset
-   in the PDA. Longer term the PDA and every per cpu variable
-   should be just put into a single section and referenced directly
-   from %gs */
-
-#ifdef CONFIG_SMP
 #include <asm/pda.h>
 
-#define __per_cpu_offset(cpu) (cpu_pda(cpu)->data_offset)
+/* Same as asm-generic/percpu.h */
+#ifdef CONFIG_SMP
 #define __my_cpu_offset read_pda(data_offset)
-
-#define per_cpu_offset(x) (__per_cpu_offset(x))
-
 #endif
 #include <asm-generic/percpu.h>
 
--- linux-2.6.tip.orig/include/asm-x86/setup.h
+++ linux-2.6.tip/include/asm-x86/setup.h
@@ -92,7 +92,6 @@ extern unsigned long init_pg_tables_star
 extern unsigned long init_pg_tables_end;
 
 #else
-void __init x86_64_init_pda(void);
 void __init x86_64_start_kernel(char *real_mode);
 void __init x86_64_start_reservations(char *real_mode_data);
 
--- linux-2.6.tip.orig/include/asm-x86/smp.h
+++ linux-2.6.tip/include/asm-x86/smp.h
@@ -25,8 +25,6 @@ extern cpumask_t cpu_callin_map;
 extern void (*mtrr_hook)(void);
 extern void zap_low_mappings(void);
 
-extern int __cpuinit get_local_pda(int cpu);
-
 extern int smp_num_siblings;
 extern unsigned int num_processors;
 extern cpumask_t cpu_initialized;
--- linux-2.6.tip.orig/include/asm-x86/trampoline.h
+++ linux-2.6.tip/include/asm-x86/trampoline.h
@@ -12,6 +12,7 @@ extern unsigned char *trampoline_base;
 
 extern unsigned long init_rsp;
 extern unsigned long initial_code;
+extern unsigned long initial_pda;
 
 #define TRAMPOLINE_BASE 0x6000
 extern unsigned long setup_trampoline(void);

-- 


* Re: [PATCH 3/4] x86_64: Fold pda into per cpu area
  2008-06-04 12:59   ` Jeremy Fitzhardinge
  2008-06-04 13:48     ` Mike Travis
@ 2008-06-09 23:18     ` Christoph Lameter
  1 sibling, 0 replies; 17+ messages in thread
From: Christoph Lameter @ 2008-06-09 23:18 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Mike Travis, Ingo Molnar, Andrew Morton, David Miller,
	Eric Dumazet, linux-kernel, Rusty Russell

On Wed, 4 Jun 2008, Jeremy Fitzhardinge wrote:

> > 	%gs:[&per_cpu_xxxx - __per_cpu_start]
> >   
> 
> Unfortunately that doesn't actually work, because you can't have a reloc with
> two variables.

That is just a conceptual discussion. __per_cpu_start is 0 with the
zero-based patch, and thus this reduces to

%gs:[&per_cpu_xxx]



* Re: [PATCH 3/4] x86_64: Fold pda into per cpu area
  2008-06-04 13:58       ` Jeremy Fitzhardinge
@ 2008-06-04 14:17         ` Mike Travis
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Travis @ 2008-06-04 14:17 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Ingo Molnar, Andrew Morton, Christoph Lameter, David Miller,
	Eric Dumazet, linux-kernel, Rusty Russell

Jeremy Fitzhardinge wrote:
> Mike Travis wrote:
>> Jeremy Fitzhardinge wrote:
>>  
>>> Mike Travis wrote:
>>>    
>>>>   * Declare the pda as a per cpu variable.
>>>>
>>>>   * Make the x86_64 per cpu area start at zero.
>>>>
>>>>   * Since the pda is now the first element of the per_cpu area,
>>>> cpu_pda()
>>>>     is no longer needed and per_cpu() can be used instead.  This also
>>>> makes
>>>>     the _cpu_pda[] table obsolete.
>>>>
>>>>   * Since %gs is pointing to the pda, it will then also point to the
>>>> per cpu
>>>>     variables and can be accessed thusly:
>>>>
>>>>     %gs:[&per_cpu_xxxx - __per_cpu_start]
>>>>
>>>>       
>>
>>   The above is only a partial story (I folded the two patches but didn't
> update the comments correctly.)  The variables are already offset from
>> __per_cpu_start by virtue of the .data.percpu section being based at
>> zero.  Therefore only the %gs register needs to be set to the base of
>> each cpu's percpu section to resolve the target address:
>>
>>         %gs:&per_cpu_xxxx
>>   
> 
> Oh, good.  I'd played with trying to make that work at one point, and
> got lost in linker bugs and/or random version-specific strangeness.
>    J

Incidentally, this is why the following load is needed in x86_64_start_kernel():

        pda = (struct x8664_pda *)__per_cpu_load;
        pda->data_offset = per_cpu_offset(0) = (unsigned long)pda;

        /* initialize boot cpu_pda data */
        pda_init(0);

pda_init() loads the %gs register so that early accesses to the static
per_cpu section can be executed before the percpu areas are allocated.
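
In rough outline, that early setup amounts to the following sketch (not
the exact head64.c/setup64.c code; boot_cpu_gs_setup is a made-up name,
shown for illustration only):

	#include <asm/pda.h>
	#include <asm/msr.h>

	extern char __per_cpu_load[];	/* linker symbol from the patchset */

	/* hypothetical helper: make %gs-relative percpu accesses work
	 * against the static load image before the real areas exist */
	static void __init boot_cpu_gs_setup(void)
	{
		/* the static per cpu load image doubles as cpu 0's area */
		struct x8664_pda *pda = (struct x8664_pda *)__per_cpu_load;

		pda->data_offset = (unsigned long)pda;

		/* clear the %gs selector, then point GS.base at the pda
		 * so that %gs:per_cpu__xxx hits the static image */
		asm volatile("movl %0,%%fs ; movl %0,%%gs" :: "r" (0));
		wrmsrl(MSR_GS_BASE, (unsigned long)pda);
	}

When setup_per_cpu_areas() later switches GS.base to each cpu's
allocated copy, the same %gs-relative instructions keep working
unchanged.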


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] x86_64: Fold pda into per cpu area
  2008-06-04 13:48     ` Mike Travis
@ 2008-06-04 13:58       ` Jeremy Fitzhardinge
  2008-06-04 14:17         ` Mike Travis
  0 siblings, 1 reply; 17+ messages in thread
From: Jeremy Fitzhardinge @ 2008-06-04 13:58 UTC (permalink / raw)
  To: Mike Travis
  Cc: Ingo Molnar, Andrew Morton, Christoph Lameter, David Miller,
	Eric Dumazet, linux-kernel, Rusty Russell

Mike Travis wrote:
> Jeremy Fitzhardinge wrote:
>   
>> Mike Travis wrote:
>>     
>>>   * Declare the pda as a per cpu variable.
>>>
>>>   * Make the x86_64 per cpu area start at zero.
>>>
>>>   * Since the pda is now the first element of the per_cpu area, cpu_pda()
>>>     is no longer needed and per_cpu() can be used instead.  This also
>>> makes
>>>     the _cpu_pda[] table obsolete.
>>>
>>>   * Since %gs is pointing to the pda, it will then also point to the
>>> per cpu
>>>     variables and can be accessed thusly:
>>>
>>>     %gs:[&per_cpu_xxxx - __per_cpu_start]
>>>
>>>       
>
>   
> The above is only a partial story (I folded the two patches but didn't
> update the comments correctly.)  The variables are already offset from
> __per_cpu_start by virtue of the .data.percpu section being based at
> zero.  Therefore only the %gs register needs to be set to the base of
> each cpu's percpu section to resolve the target address:
>
>         %gs:&per_cpu_xxxx
>   

Oh, good.  I'd played with trying to make that work at one point, and 
got lost in linker bugs and/or random version-specific strangeness. 

    J

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] x86_64: Fold pda into per cpu area
  2008-06-04 12:59   ` Jeremy Fitzhardinge
@ 2008-06-04 13:48     ` Mike Travis
  2008-06-04 13:58       ` Jeremy Fitzhardinge
  2008-06-09 23:18     ` Christoph Lameter
  1 sibling, 1 reply; 17+ messages in thread
From: Mike Travis @ 2008-06-04 13:48 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Ingo Molnar, Andrew Morton, Christoph Lameter, David Miller,
	Eric Dumazet, linux-kernel, Rusty Russell

Jeremy Fitzhardinge wrote:
> Mike Travis wrote:
>>   * Declare the pda as a per cpu variable.
>>
>>   * Make the x86_64 per cpu area start at zero.
>>
>>   * Since the pda is now the first element of the per_cpu area, cpu_pda()
>>     is no longer needed and per_cpu() can be used instead.  This also
>> makes
>>     the _cpu_pda[] table obsolete.
>>
>>   * Since %gs is pointing to the pda, it will then also point to the
>> per cpu
>>     variables and can be accessed thusly:
>>
>>     %gs:[&per_cpu_xxxx - __per_cpu_start]
>> 

  
The above is only a partial story (I folded the two patches but didn't
update the comments correctly.)  The variables are already offset from
__per_cpu_start by virtue of the .data.percpu section being based at
zero.  Therefore only the %gs register needs to be set to the base of
each cpu's percpu section to resolve the target address:

        %gs:&per_cpu_xxxx

And the .data.percpu.first forces the pda percpu variable to the front.
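
Roughly, the mechanism is a dedicated input section plus link order; a
sketch following the patchset's naming (the exact macro is in patch 1/4):

	/* definition side: place the variable in its own input section */
	#define DEFINE_PER_CPU_FIRST(type, name)			\
		__attribute__((__section__(".data.percpu.first")))	\
		__typeof__(type) per_cpu__##name

	/* linker script (illustrative): .first is emitted ahead of the
	 * rest, and the output section's VMA starts at 0 */
	.data.percpu 0 : {
		*(.data.percpu.first)
		*(.data.percpu)
	}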


> 
> Unfortunately that doesn't actually work, because you can't have a reloc
> with two variables.
> 
> In something like:
> 
>     mov %gs:per_cpu__foo - 12345, %rax
>     mov %gs:per_cpu__foo, %rax
>     mov %gs:per_cpu__foo - 12345(%rip), %rax
>     mov %gs:per_cpu__foo(%rip), %rax
>     mov %gs:per_cpu__foo - __per_cpu_start, %rax
>     mov %gs:per_cpu__foo - __per_cpu_start(%rip), %rax
> 
> the last two lines will not assemble:
> 
> t.S:5: Error: can't resolve `per_cpu__foo' {*UND* section} -
> `__per_cpu_start' {*UND* section}
> t.S:6: Error: can't resolve `per_cpu__foo' {*UND* section} -
> `__per_cpu_start' {*UND* section}
> 
> Unfortunately, the only way I can think of fixing this is to compute the
> offset into a temp register, then use that:
> 
>     lea per_cpu__foo(%rip), %rax
>     mov %gs:__per_cpu_offset(%rax), %rax
> 
> (where __per_cpu_offset is defined in the linker script as
> -__per_cpu_start).
> 
> This seems to be a general problem with zero-offset per-cpu.  And it's
> unfortunate, because no-register access to per-cpu variables is nice to
> have.
> 
> The other alternative - and I have no idea whether this is practical or
> possible - is to define a complete set of pre-offset per_cpu symbols.
> 
>    J


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] x86_64: Fold pda into per cpu area
  2008-06-04  0:30 ` [PATCH 3/4] x86_64: Fold pda into per cpu area Mike Travis
@ 2008-06-04 12:59   ` Jeremy Fitzhardinge
  2008-06-04 13:48     ` Mike Travis
  2008-06-09 23:18     ` Christoph Lameter
  0 siblings, 2 replies; 17+ messages in thread
From: Jeremy Fitzhardinge @ 2008-06-04 12:59 UTC (permalink / raw)
  To: Mike Travis
  Cc: Ingo Molnar, Andrew Morton, Christoph Lameter, David Miller,
	Eric Dumazet, linux-kernel, Rusty Russell

Mike Travis wrote:
>   * Declare the pda as a per cpu variable.
>
>   * Make the x86_64 per cpu area start at zero.
>
>   * Since the pda is now the first element of the per_cpu area, cpu_pda()
>     is no longer needed and per_cpu() can be used instead.  This also makes
>     the _cpu_pda[] table obsolete.
>
>   * Since %gs is pointing to the pda, it will then also point to the per cpu
>     variables and can be accessed thusly:
>
> 	%gs:[&per_cpu_xxxx - __per_cpu_start]
>   

Unfortunately that doesn't actually work, because you can't have a reloc 
with two variables.

In something like:

	mov %gs:per_cpu__foo - 12345, %rax
	mov %gs:per_cpu__foo, %rax
	mov %gs:per_cpu__foo - 12345(%rip), %rax
	mov %gs:per_cpu__foo(%rip), %rax
	mov %gs:per_cpu__foo - __per_cpu_start, %rax
	mov %gs:per_cpu__foo - __per_cpu_start(%rip), %rax

the last two lines will not assemble:

t.S:5: Error: can't resolve `per_cpu__foo' {*UND* section} - `__per_cpu_start' {*UND* section}
t.S:6: Error: can't resolve `per_cpu__foo' {*UND* section} - `__per_cpu_start' {*UND* section}

Unfortunately, the only way I can think of fixing this is to compute the 
offset into a temp register, then use that:

	lea per_cpu__foo(%rip), %rax
	mov %gs:__per_cpu_offset(%rax), %rax

(where __per_cpu_offset is defined in the linker script as 
-__per_cpu_start).

This seems to be a general problem with zero-offset per-cpu.  And it's
unfortunate, because no-register access to per-cpu variables is nice to
have.

The other alternative - and I have no idea whether this is practical or
possible - is to define a complete set of pre-offset per_cpu symbols.
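
Purely as an illustration of that idea (foo is hypothetical, and the
open question is how such aliases would be generated and maintained),
the aliases could live in the linker script, where a difference of two
link-time-defined symbols is legal:

	/* one absolute alias per percpu variable */
	per_cpu_abs__foo = per_cpu__foo - __per_cpu_start;

Code would then reference the alias with a plain absolute reloc:

	mov %gs:per_cpu_abs__foo, %rax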

    J

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 3/4] x86_64: Fold pda into per cpu area
  2008-06-04  0:30 [PATCH 0/4] percpu: Optimize percpu accesses Mike Travis
@ 2008-06-04  0:30 ` Mike Travis
  2008-06-04 12:59   ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Travis @ 2008-06-04  0:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Christoph Lameter, David Miller, Eric Dumazet,
	Jeremy Fitzhardinge, linux-kernel

[-- Attachment #1: zero_based_fold --]
[-- Type: text/plain, Size: 16555 bytes --]

  * Declare the pda as a per cpu variable.

  * Make the x86_64 per cpu area start at zero.

  * Since the pda is now the first element of the per_cpu area, cpu_pda()
    is no longer needed and per_cpu() can be used instead.  This also makes
    the _cpu_pda[] table obsolete.

  * Since %gs is pointing to the pda, it will then also point to the per cpu
    variables and can be accessed thusly:

	%gs:[&per_cpu_xxxx - __per_cpu_start]

Based on linux-2.6.tip

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
---
 arch/x86/Kconfig                 |    3 +
 arch/x86/kernel/head64.c         |   34 ++++++--------
 arch/x86/kernel/irq_64.c         |   36 ++++++++-------
 arch/x86/kernel/setup.c          |   90 ++++++++++++---------------------------
 arch/x86/kernel/setup64.c        |    5 --
 arch/x86/kernel/smpboot.c        |   51 ----------------------
 arch/x86/kernel/traps_64.c       |   11 +++-
 arch/x86/kernel/vmlinux_64.lds.S |    1 
 include/asm-x86/percpu.h         |   48 ++++++--------------
 9 files changed, 89 insertions(+), 190 deletions(-)

--- linux-2.6.tip.orig/arch/x86/Kconfig
+++ linux-2.6.tip/arch/x86/Kconfig
@@ -129,6 +129,9 @@ config HAVE_SETUP_PER_CPU_AREA
 config HAVE_CPUMASK_OF_CPU_MAP
 	def_bool X86_64_SMP
 
+config HAVE_ZERO_BASED_PER_CPU
+	def_bool X86_64_SMP
+
 config ARCH_HIBERNATION_POSSIBLE
 	def_bool y
 	depends on !SMP || !X86_VOYAGER
--- linux-2.6.tip.orig/arch/x86/kernel/head64.c
+++ linux-2.6.tip/arch/x86/kernel/head64.c
@@ -25,20 +25,6 @@
 #include <asm/e820.h>
 #include <asm/bios_ebda.h>
 
-/* boot cpu pda */
-static struct x8664_pda _boot_cpu_pda __read_mostly;
-
-#ifdef CONFIG_SMP
-/*
- * We install an empty cpu_pda pointer table to indicate to early users
- * (numa_set_node) that the cpu_pda pointer table for cpus other than
- * the boot cpu is not yet setup.
- */
-static struct x8664_pda *__cpu_pda[NR_CPUS] __initdata;
-#else
-static struct x8664_pda *__cpu_pda[NR_CPUS] __read_mostly;
-#endif
-
 static void __init zap_identity_mappings(void)
 {
 	pgd_t *pgd = pgd_offset_k(0UL);
@@ -159,6 +145,20 @@ void __init x86_64_start_kernel(char * r
 	/* Cleanup the over mapped high alias */
 	cleanup_highmap();
 
+	/* point to boot pda which is the first element in the percpu area */
+	{
+		struct x8664_pda *pda;
+#ifdef CONFIG_SMP
+		pda = (struct x8664_pda *)__per_cpu_load;
+		pda->data_offset = per_cpu_offset(0) = (unsigned long)pda;
+#else
+		pda = &per_cpu(pda, 0);
+		pda->data_offset = (unsigned long)pda;
+#endif
+	}
+	/* initialize boot cpu_pda data */
+	pda_init(0);
+
 	for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
 #ifdef CONFIG_EARLY_PRINTK
 		set_intr_gate(i, &early_idt_handlers[i]);
@@ -170,12 +170,6 @@ void __init x86_64_start_kernel(char * r
 
 	early_printk("Kernel alive\n");
 
-	_cpu_pda = __cpu_pda;
-	cpu_pda(0) = &_boot_cpu_pda;
-	pda_init(0);
-
-	early_printk("Kernel really alive\n");
-
 	copy_bootdata(__va(real_mode_data));
 
 	reserve_early(__pa_symbol(&_text), __pa_symbol(&_end), "TEXT DATA BSS");
--- linux-2.6.tip.orig/arch/x86/kernel/irq_64.c
+++ linux-2.6.tip/arch/x86/kernel/irq_64.c
@@ -115,39 +115,43 @@ skip:
 	} else if (i == NR_IRQS) {
 		seq_printf(p, "NMI: ");
 		for_each_online_cpu(j)
-			seq_printf(p, "%10u ", cpu_pda(j)->__nmi_count);
+			seq_printf(p, "%10u ", per_cpu(pda.__nmi_count, j));
 		seq_printf(p, "  Non-maskable interrupts\n");
 		seq_printf(p, "LOC: ");
 		for_each_online_cpu(j)
-			seq_printf(p, "%10u ", cpu_pda(j)->apic_timer_irqs);
+			seq_printf(p, "%10u ", per_cpu(pda.apic_timer_irqs, j));
 		seq_printf(p, "  Local timer interrupts\n");
 #ifdef CONFIG_SMP
 		seq_printf(p, "RES: ");
 		for_each_online_cpu(j)
-			seq_printf(p, "%10u ", cpu_pda(j)->irq_resched_count);
+			seq_printf(p, "%10u ",
+				per_cpu(pda.irq_resched_count, j));
 		seq_printf(p, "  Rescheduling interrupts\n");
 		seq_printf(p, "CAL: ");
 		for_each_online_cpu(j)
-			seq_printf(p, "%10u ", cpu_pda(j)->irq_call_count);
+			seq_printf(p, "%10u ", per_cpu(pda.irq_call_count, j));
 		seq_printf(p, "  function call interrupts\n");
 		seq_printf(p, "TLB: ");
 		for_each_online_cpu(j)
-			seq_printf(p, "%10u ", cpu_pda(j)->irq_tlb_count);
+			seq_printf(p, "%10u ", per_cpu(pda.irq_tlb_count, j));
 		seq_printf(p, "  TLB shootdowns\n");
 #endif
 #ifdef CONFIG_X86_MCE
 		seq_printf(p, "TRM: ");
 		for_each_online_cpu(j)
-			seq_printf(p, "%10u ", cpu_pda(j)->irq_thermal_count);
+			seq_printf(p, "%10u ",
+				per_cpu(pda.irq_thermal_count, j));
 		seq_printf(p, "  Thermal event interrupts\n");
 		seq_printf(p, "THR: ");
 		for_each_online_cpu(j)
-			seq_printf(p, "%10u ", cpu_pda(j)->irq_threshold_count);
+			seq_printf(p, "%10u ",
+				per_cpu(pda.irq_threshold_count, j));
 		seq_printf(p, "  Threshold APIC interrupts\n");
 #endif
 		seq_printf(p, "SPU: ");
 		for_each_online_cpu(j)
-			seq_printf(p, "%10u ", cpu_pda(j)->irq_spurious_count);
+			seq_printf(p, "%10u ",
+				per_cpu(pda.irq_spurious_count, j));
 		seq_printf(p, "  Spurious interrupts\n");
 		seq_printf(p, "ERR: %10u\n", atomic_read(&irq_err_count));
 	}
@@ -159,19 +163,19 @@ skip:
  */
 u64 arch_irq_stat_cpu(unsigned int cpu)
 {
-	u64 sum = cpu_pda(cpu)->__nmi_count;
+	u64 sum = per_cpu(pda.__nmi_count, cpu);
 
-	sum += cpu_pda(cpu)->apic_timer_irqs;
+	sum += per_cpu(pda.apic_timer_irqs, cpu);
 #ifdef CONFIG_SMP
-	sum += cpu_pda(cpu)->irq_resched_count;
-	sum += cpu_pda(cpu)->irq_call_count;
-	sum += cpu_pda(cpu)->irq_tlb_count;
+	sum += per_cpu(pda.irq_resched_count, cpu);
+	sum += per_cpu(pda.irq_call_count, cpu);
+	sum += per_cpu(pda.irq_tlb_count, cpu);
 #endif
 #ifdef CONFIG_X86_MCE
-	sum += cpu_pda(cpu)->irq_thermal_count;
-	sum += cpu_pda(cpu)->irq_threshold_count;
+	sum += per_cpu(pda.irq_thermal_count, cpu);
+	sum += per_cpu(pda.irq_threshold_count, cpu);
 #endif
-	sum += cpu_pda(cpu)->irq_spurious_count;
+	sum += per_cpu(pda.irq_spurious_count, cpu);
 	return sum;
 }
 
--- linux-2.6.tip.orig/arch/x86/kernel/setup.c
+++ linux-2.6.tip/arch/x86/kernel/setup.c
@@ -29,6 +29,11 @@ DEFINE_EARLY_PER_CPU(u16, x86_bios_cpu_a
 EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_apicid);
 EXPORT_EARLY_PER_CPU_SYMBOL(x86_bios_cpu_apicid);
 
+#ifdef CONFIG_X86_64
+DEFINE_PER_CPU_FIRST(struct x8664_pda, pda);
+EXPORT_PER_CPU_SYMBOL(pda);
+#endif
+
 #if defined(CONFIG_NUMA) && defined(CONFIG_X86_64)
 #define	X86_64_NUMA	1
 
@@ -47,7 +52,7 @@ static void __init setup_node_to_cpumask
 static inline void setup_node_to_cpumask_map(void) { }
 #endif
 
-#if defined(CONFIG_HAVE_SETUP_PER_CPU_AREA) && defined(CONFIG_SMP)
+#ifdef CONFIG_HAVE_SETUP_PER_CPU_AREA
 /*
  * Copy data used in early init routines from the initial arrays to the
  * per cpu data areas.  These arrays then become expendable and the
@@ -94,64 +99,9 @@ static void __init setup_cpumask_of_cpu(
 static inline void setup_cpumask_of_cpu(void) { }
 #endif
 
-#ifdef CONFIG_X86_32
-/*
- * Great future not-so-futuristic plan: make i386 and x86_64 do it
- * the same way
- */
 unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
 EXPORT_SYMBOL(__per_cpu_offset);
-static inline void setup_cpu_pda_map(void) { }
-
-#elif !defined(CONFIG_SMP)
-static inline void setup_cpu_pda_map(void) { }
-
-#else /* CONFIG_SMP && CONFIG_X86_64 */
-
-/*
- * Allocate cpu_pda pointer table and array via alloc_bootmem.
- */
-static void __init setup_cpu_pda_map(void)
-{
-	char *pda;
-	struct x8664_pda **new_cpu_pda;
-	unsigned long size;
-	int cpu;
-
-	size = roundup(sizeof(struct x8664_pda), cache_line_size());
-
-	/* allocate cpu_pda array and pointer table */
-	{
-		unsigned long tsize = nr_cpu_ids * sizeof(void *);
-		unsigned long asize = size * (nr_cpu_ids - 1);
-
-		tsize = roundup(tsize, cache_line_size());
-		new_cpu_pda = alloc_bootmem(tsize + asize);
-		pda = (char *)new_cpu_pda + tsize;
-	}
 
-	/* initialize pointer table to static pda's */
-	for_each_possible_cpu(cpu) {
-		if (cpu == 0) {
-			/* leave boot cpu pda in place */
-			new_cpu_pda[0] = cpu_pda(0);
-			continue;
-		}
-		new_cpu_pda[cpu] = (struct x8664_pda *)pda;
-		new_cpu_pda[cpu]->in_bootmem = 1;
-		pda += size;
-	}
-
-	/* point to new pointer table */
-	_cpu_pda = new_cpu_pda;
-}
-#endif
-
-/*
- * Great future plan:
- * Declare PDA itself and support (irqstack,tss,pgd) as per cpu data.
- * Always point %gs to its beginning
- */
 void __init setup_per_cpu_areas(void)
 {
 	ssize_t size = PERCPU_ENOUGH_ROOM;
@@ -164,9 +114,6 @@ void __init setup_per_cpu_areas(void)
 	nr_cpu_ids = num_processors;
 #endif
 
-	/* Setup cpu_pda map */
-	setup_cpu_pda_map();
-
 	/* Copy section for each CPU (we discard the original) */
 	size = PERCPU_ENOUGH_ROOM;
 	printk(KERN_INFO "PERCPU: Allocating %lu bytes of per cpu data\n",
@@ -186,9 +133,28 @@ void __init setup_per_cpu_areas(void)
 		else
 			ptr = alloc_bootmem_pages_node(NODE_DATA(node), size);
 #endif
+		/* Initialize each cpu's per_cpu area and save pointer */
+		memcpy(ptr, __per_cpu_load, __per_cpu_size);
 		per_cpu_offset(cpu) = ptr - __per_cpu_start;
-		memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
 
+#ifdef CONFIG_X86_64
+		/*
+		 * Note the boot cpu has been using the static per_cpu load
+		 * area for its pda.  We need to zero out the pdas for the
+		 * other cpus that are coming online.
+		 */
+		{
+			/* we rely on the fact that pda is the first element */
+			struct x8664_pda *pda = (struct x8664_pda *)ptr;
+
+			if (cpu)
+				memset(pda, 0, sizeof(struct x8664_pda));
+			else
+				pda_init(0);
+
+			pda->data_offset = (unsigned long)ptr;
+		}
+#endif
 	}
 
 	printk(KERN_DEBUG "NR_CPUS: %d, nr_cpu_ids: %d, nr_node_ids %d\n",
@@ -240,8 +206,8 @@ void __cpuinit numa_set_node(int cpu, in
 {
 	int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map);
 
-	if (cpu_pda(cpu) && node != NUMA_NO_NODE)
-		cpu_pda(cpu)->nodenumber = node;
+	if (per_cpu_offset(cpu))
+		per_cpu(pda.nodenumber, cpu) = node;
 
 	if (cpu_to_node_map)
 		cpu_to_node_map[cpu] = node;
--- linux-2.6.tip.orig/arch/x86/kernel/setup64.c
+++ linux-2.6.tip/arch/x86/kernel/setup64.c
@@ -35,9 +35,6 @@ struct boot_params boot_params;
 
 cpumask_t cpu_initialized __cpuinitdata = CPU_MASK_NONE;
 
-struct x8664_pda **_cpu_pda __read_mostly;
-EXPORT_SYMBOL(_cpu_pda);
-
 struct desc_ptr idt_descr = { 256 * 16 - 1, (unsigned long) idt_table };
 
 char boot_cpu_stack[IRQSTACKSIZE] __attribute__((section(".bss.page_aligned")));
@@ -89,7 +86,7 @@ __setup("noexec32=", nonx32_setup);
 
 void pda_init(int cpu)
 { 
-	struct x8664_pda *pda = cpu_pda(cpu);
+	struct x8664_pda *pda = &per_cpu(pda, cpu);
 
 	/* Setup up data that may be needed in __get_free_pages early */
 	asm volatile("movl %0,%%fs ; movl %0,%%gs" :: "r" (0)); 
--- linux-2.6.tip.orig/arch/x86/kernel/smpboot.c
+++ linux-2.6.tip/arch/x86/kernel/smpboot.c
@@ -798,45 +798,6 @@ static void __cpuinit do_fork_idle(struc
 	complete(&c_idle->done);
 }
 
-#ifdef CONFIG_X86_64
-/*
- * Allocate node local memory for the AP pda.
- *
- * Must be called after the _cpu_pda pointer table is initialized.
- */
-static int __cpuinit get_local_pda(int cpu)
-{
-	struct x8664_pda *oldpda, *newpda;
-	unsigned long size = sizeof(struct x8664_pda);
-	int node = cpu_to_node(cpu);
-
-	if (cpu_pda(cpu) && !cpu_pda(cpu)->in_bootmem)
-		return 0;
-
-	oldpda = cpu_pda(cpu);
-	newpda = kmalloc_node(size, GFP_ATOMIC, node);
-	if (!newpda) {
-		printk(KERN_ERR "Could not allocate node local PDA "
-			"for CPU %d on node %d\n", cpu, node);
-
-		if (oldpda)
-			return 0;	/* have a usable pda */
-		else
-			return -1;
-	}
-
-	if (oldpda) {
-		memcpy(newpda, oldpda, size);
-		if (!after_bootmem)
-			free_bootmem((unsigned long)oldpda, size);
-	}
-
-	newpda->in_bootmem = 0;
-	cpu_pda(cpu) = newpda;
-	return 0;
-}
-#endif /* CONFIG_X86_64 */
-
 static int __cpuinit do_boot_cpu(int apicid, int cpu)
 /*
  * NOTE - on most systems this is a PHYSICAL apic ID, but on multiquad
@@ -860,14 +821,6 @@ static int __cpuinit do_boot_cpu(int api
 		printk(KERN_ERR "Failed to allocate GDT for CPU %d\n", cpu);
 		return -1;
 	}
-
-	/* Allocate node local memory for AP pdas */
-	if (cpu > 0) {
-		boot_error = get_local_pda(cpu);
-		if (boot_error)
-			goto restore_state;
-			/* if can't get pda memory, can't start cpu */
-	}
 #endif
 
 	alternatives_smp_switch(1);
@@ -908,7 +861,7 @@ do_rest:
 	stack_start.sp = (void *) c_idle.idle->thread.sp;
 	irq_ctx_init(cpu);
 #else
-	cpu_pda(cpu)->pcurrent = c_idle.idle;
+	per_cpu(pda.pcurrent, cpu) = c_idle.idle;
 	init_rsp = c_idle.idle->thread.sp;
 	load_sp0(&per_cpu(init_tss, cpu), &c_idle.idle->thread);
 	initial_code = (unsigned long)start_secondary;
@@ -985,8 +938,6 @@ do_rest:
 		}
 	}
 
-restore_state:
-
 	if (boot_error) {
 		/* Try to put things back the way they were before ... */
 		unmap_cpu_to_logical_apicid(cpu);
--- linux-2.6.tip.orig/arch/x86/kernel/traps_64.c
+++ linux-2.6.tip/arch/x86/kernel/traps_64.c
@@ -265,7 +265,8 @@ void dump_trace(struct task_struct *tsk,
 		const struct stacktrace_ops *ops, void *data)
 {
 	const unsigned cpu = get_cpu();
-	unsigned long *irqstack_end = (unsigned long*)cpu_pda(cpu)->irqstackptr;
+	unsigned long *irqstack_end =
+		(unsigned long*)per_cpu(pda.irqstackptr, cpu);
 	unsigned used = 0;
 	struct thread_info *tinfo;
 
@@ -399,8 +400,10 @@ _show_stack(struct task_struct *tsk, str
 	unsigned long *stack;
 	int i;
 	const int cpu = smp_processor_id();
-	unsigned long *irqstack_end = (unsigned long *) (cpu_pda(cpu)->irqstackptr);
-	unsigned long *irqstack = (unsigned long *) (cpu_pda(cpu)->irqstackptr - IRQSTACKSIZE);
+	unsigned long *irqstack_end =
+		(unsigned long *)per_cpu(pda.irqstackptr, cpu);
+	unsigned long *irqstack =
+		(unsigned long *)(per_cpu(pda.irqstackptr, cpu) - IRQSTACKSIZE);
 
 	// debugging aid: "show_stack(NULL, NULL);" prints the
 	// back trace for this cpu.
@@ -464,7 +467,7 @@ void show_registers(struct pt_regs *regs
 	int i;
 	unsigned long sp;
 	const int cpu = smp_processor_id();
-	struct task_struct *cur = cpu_pda(cpu)->pcurrent;
+	struct task_struct *cur = __get_cpu_var(pda.pcurrent);
 	u8 *ip;
 	unsigned int code_prologue = code_bytes * 43 / 64;
 	unsigned int code_len = code_bytes;
--- linux-2.6.tip.orig/arch/x86/kernel/vmlinux_64.lds.S
+++ linux-2.6.tip/arch/x86/kernel/vmlinux_64.lds.S
@@ -16,6 +16,7 @@ jiffies_64 = jiffies;
 _proxy_pda = 1;
 PHDRS {
 	text PT_LOAD FLAGS(5);	/* R_E */
+	percpu PT_LOAD FLAGS(4);	/* R__ */
 	data PT_LOAD FLAGS(7);	/* RWE */
 	user PT_LOAD FLAGS(7);	/* RWE */
 	data.init PT_LOAD FLAGS(7);	/* RWE */
--- linux-2.6.tip.orig/include/asm-x86/percpu.h
+++ linux-2.6.tip/include/asm-x86/percpu.h
@@ -3,26 +3,20 @@
 
 #ifdef CONFIG_X86_64
 #include <linux/compiler.h>
-
-/* Same as asm-generic/percpu.h, except that we store the per cpu offset
-   in the PDA. Longer term the PDA and every per cpu variable
-   should be just put into a single section and referenced directly
-   from %gs */
-
-#ifdef CONFIG_SMP
 #include <asm/pda.h>
 
-#define __per_cpu_offset(cpu) (cpu_pda(cpu)->data_offset)
-#define __my_cpu_offset read_pda(data_offset)
-
-#define per_cpu_offset(x) (__per_cpu_offset(x))
-
+#ifdef CONFIG_SMP
+#define __my_cpu_offset (x86_read_percpu(pda.data_offset))
+#define __percpu_seg "%%gs:"
+#else
+#define __percpu_seg ""
 #endif
+
 #include <asm-generic/percpu.h>
 
 DECLARE_PER_CPU(struct x8664_pda, pda);
 
-#else /* CONFIG_X86_64 */
+#else /* !CONFIG_X86_64 */
 
 #ifdef __ASSEMBLY__
 
@@ -51,36 +45,23 @@ DECLARE_PER_CPU(struct x8664_pda, pda);
 
 #else /* ...!ASSEMBLY */
 
-/*
- * PER_CPU finds an address of a per-cpu variable.
- *
- * Args:
- *    var - variable name
- *    cpu - 32bit register containing the current CPU number
- *
- * The resulting address is stored in the "cpu" argument.
- *
- * Example:
- *    PER_CPU(cpu_gdt_descr, %ebx)
- */
 #ifdef CONFIG_SMP
-
 #define __my_cpu_offset x86_read_percpu(this_cpu_off)
-
-/* fs segment starts at (positive) offset == __per_cpu_offset[cpu] */
 #define __percpu_seg "%%fs:"
-
-#else  /* !SMP */
-
+#else
 #define __percpu_seg ""
-
-#endif	/* SMP */
+#endif
 
 #include <asm-generic/percpu.h>
 
 /* We can use this directly for local CPU (faster). */
 DECLARE_PER_CPU(unsigned long, this_cpu_off);
 
+#endif /* __ASSEMBLY__ */
+#endif /* !CONFIG_X86_64 */
+
+#ifndef __ASSEMBLY__
+
 /* For arch-specific code, we can use direct single-insn ops (they
  * don't give an lvalue though). */
 extern void __bad_percpu_size(void);
@@ -215,7 +196,6 @@ do {							\
 				percpu_cmpxchg_op(per_cpu_var(var), old, new)
 
 #endif /* !__ASSEMBLY__ */
-#endif /* !CONFIG_X86_64 */
 
 #ifdef CONFIG_SMP
 

-- 

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2008-07-25 21:12 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-02-01 19:14 [PATCH 0/4] percpu: Optimize percpu accesses travis
2008-02-01 19:14 ` [PATCH 1/4] generic: Percpu infrastructure to rebase the per cpu area to zero travis
2008-02-01 19:14 ` [PATCH 2/4] init: move setup of nr_cpu_ids to as early as possible travis
2008-02-01 19:14 ` [PATCH 3/4] x86_64: Fold pda into per cpu area travis
2008-02-15 20:16   ` Ingo Molnar
2008-02-15 22:43     ` Christoph Lameter
2008-02-17  6:22     ` Yinghai Lu
2008-02-17  7:36       ` Yinghai Lu
2008-02-01 19:14 ` [PATCH 4/4] x86_64: Cleanup non-smp usage of cpu maps travis
2008-02-15 20:17   ` Ingo Molnar
2008-06-04  0:30 [PATCH 0/4] percpu: Optimize percpu accesses Mike Travis
2008-06-04  0:30 ` [PATCH 3/4] x86_64: Fold pda into per cpu area Mike Travis
2008-06-04 12:59   ` Jeremy Fitzhardinge
2008-06-04 13:48     ` Mike Travis
2008-06-04 13:58       ` Jeremy Fitzhardinge
2008-06-04 14:17         ` Mike Travis
2008-06-09 23:18     ` Christoph Lameter
2008-07-25 21:11 [PATCH 0/4] x86_64: Optimize percpu accesses Mike Travis
2008-07-25 21:11 ` [PATCH 3/4] x86_64: Fold pda into per cpu area Mike Travis
