* [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs V3
@ 2008-01-16 17:09 travis
  2008-01-16 17:09 ` [PATCH 01/10] x86: Change size of APICIDs from u8 to u16 V3 travis
                   ` (10 more replies)
  0 siblings, 11 replies; 17+ messages in thread
From: travis @ 2008-01-16 17:09 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen, mingo
  Cc: Eric Dumazet, Christoph Lameter, Jack Steiner, linux-mm, linux-kernel


This patchset addresses the kernel bloat that occurs when NR_CPUS is
increased.  The memory numbers below are with NR_CPUS = 1024, which I've
been testing with 4 and 32 real processors and the rest marked "possible"
via the additional_cpus boot option.  These changes are all specific to
the x86 architecture; non-arch-specific changes will follow.
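
The common conversion used throughout the series (patches 03-06, 09 and 10)
is replacing statically sized NR_CPUS arrays with per_cpu variables, so
only the per-cpu template carries the data in the image.  A minimal sketch
with made-up names (illustrative only, not code from the patches):

        /*
         * before: NR_CPUS pointers are compiled into the image, e.g.
         *
         *      static struct foo *foo_data[NR_CPUS];
         *
         * after: one slot in the per-cpu template, instantiated per
         * possible cpu at boot:
         */
        struct foo;
        static DEFINE_PER_CPU(struct foo *, foo_data);

        static void set_foo(int cpu, struct foo *f)
        {
                per_cpu(foo_data, cpu) = f;     /* was: foo_data[cpu] = f */
        }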

Based on 2.6.24-rc6-mm1

Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
---
V1->V2:
    - Remove extraneous casts
    - Add comment about node memory < NODE_MIN_SIZE
    - changed pxm_to_node_map to u16
    - changed memnode map entries to u16
    - Fix !NUMA builds with '#ifdef CONFIG_NUMA'
    - Add slight optimization to apic_is_clustered_box()

V2->V3:
    - add early_cpu_to_node function to keep cpu_to_node efficient
    - move and rename smp_set_apicids() to setup_percpu_maps()
    - call setup_percpu_maps() as early as possible
    - changed memnode.embedded_map from [64-16] to [64-8]
      (and size comment to 128 bytes)
---

[Updated memory usage table.]

The following columns compare the SuSE default x86_64 config
[which has NR_CPUS=128] with the same config at NR_CPUS=1024.  There
are no modules in either build.  Each column's percentage is relative
to the first column.
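
(For example, the .bss row: (2260296 - 866632) / 866632 is about 1.6,
shown as +160%.)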

128cpus			1kcpus-before		1kcpus-after
       228 .altinstr_re        228 +0%		       228 +0%
      1219 .altinstruct       1219 +0%		      1219 +0%
    866632 .bss		   2260296 +160%	   2112968 +143%
     61374 .comment	     61374 +0%		     61374 +0%
	16 .con_initcal 	16 +0%			16 +0%
    427560 .data	    445480 +4%		    444200 +3%
   1165824 .data.cachel   13076992 +1021%	  13076992 +1021%
      8192 .data.init_t       8192 +0%		      8192 +0%
      4096 .data.page_a       4096 +0%		      4096 +0%
     39296 .data.percpu     155776 +296%	    155904 +296%
    188992 .data.read_m    8751776 +4530%	   8747680 +4528%
	 4 .data_nosave 	 4 +0%			 4 +0%
      5141 .exit.text	      5150 +0%		      5149 +0%
    138576 .init.data	    139472 +0%		    145424 +4%
       134 .init.ramfs	       134 +0%		       134 +0%
      3192 .init.setup	      3192 +0%		      3192 +0%
    160143 .init.text	    160643 +0%		    160914 +0%
      2288 .initcall.in       2288 +0%		      2288 +0%
	 8 .jiffies		 8 +0%			 8 +0%
      4512 .pci_fixup	      4512 +0%		      4512 +0%
   1314318 .rodata	   1315630 +0%		   1315305 +0%
     36856 .smp_locks	     36808 +0%		     36800 +0%
   3975021 .text	   3984829 +0%		   3987389 +0%
      3368 .vdso	      3368 +0%		      3368 +0%
	 4 .vgetcpu_mod 	 4 +0%			 4 +0%
       218 .vsyscall_0	       218 +0%		       218 +0%
	52 .vsyscall_1		52 +0%			52 +0%
	91 .vsyscall_2		91 +0%			91 +0%
	 8 .vsyscall_3		 8 +0%			 8 +0%
	54 .vsyscall_fn 	54 +0%			54 +0%
	80 .vsyscall_gt 	80 +0%			80 +0%
     39480 __bug_table	     39480 +0%		     39480 +0%
     16320 __ex_table	     16320 +0%		     16320 +0%
      9160 __param	      9160 +0%		      9160 +0%
						
   1818299 Text		   1834603 +0%		   1834859 +0%
   3975021 Data		   3984829 +0%		   3987389 +0%
    866632 Bss		   2260296 +160%	   2112968 +143%
    360448 InitData	    483328 +34%		    487424 +35%
   1415640 OtherData	  21885912 +1446%	  21883096 +1445%
     39296 PerCpu	    155776 +296%	    155904 +296%
   8472457 Total	  30486950 +259%	  30342823 +258%


[Note that the sum of the last 6 lines may not equal "Total", as some
sections contain others (e.g., InitData contains the PerCpu data.)]

-- 


* [PATCH 01/10] x86: Change size of APICIDs from u8 to u16 V3
  2008-01-16 17:09 [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs V3 travis
@ 2008-01-16 17:09 ` travis
  2008-01-16 17:09 ` [PATCH 02/10] x86: Change size of node ids " travis
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: travis @ 2008-01-16 17:09 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen, mingo
  Cc: Eric Dumazet, Christoph Lameter, Jack Steiner, linux-mm, linux-kernel

[-- Attachment #1: big_apicids --]
[-- Type: text/plain, Size: 6733 bytes --]

Change the size of APICIDs from u8 to u16.  This partially
supports the new x2apic mode that will be present on future
processor chips.  (The chips actually support 32-bit APICIDs, but
that change is more intrusive; supporting 16 bits is sufficient
for now.)

Signed-off-by: Jack Steiner <steiner@sgi.com>

I've included just the partial change from u8 to u16 apicids.  The
remaining x2apic changes will be in a separate patch.

In addition, the fake_node_to_pxm_map[] and fake_apicid_to_node[]
tables have been moved from local (on-stack) data to the __initdata
section, reducing stack pressure when MAX_NUMNODES and MAX_LOCAL_APIC
are increased in size.
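
For scale (a hypothetical value -- the real one depends on the config):
with a MAX_LOCAL_APIC of, say, 32768, the local table

        unsigned char fake_apicid_to_node[32768];  /* 32 KB on an 8 KB kernel stack */

would overflow the stack, whereas as __initdata it lives in the init
image and is freed after boot.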

Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
---
V1->V2:
    - Remove extraneous casts
    - Add comment about node memory < NODE_MIN_SIZE
---
 arch/x86/kernel/genapic_64.c |    4 ++--
 arch/x86/kernel/mpparse_64.c |    4 ++--
 arch/x86/kernel/smpboot_64.c |    2 +-
 arch/x86/mm/numa_64.c        |    2 +-
 arch/x86/mm/srat_64.c        |   26 +++++++++++++++++---------
 include/asm-x86/processor.h  |   14 +++++++-------
 include/asm-x86/smp_64.h     |    8 ++++----
 7 files changed, 34 insertions(+), 26 deletions(-)

--- a/arch/x86/kernel/genapic_64.c
+++ b/arch/x86/kernel/genapic_64.c
@@ -32,10 +32,10 @@
  * array during this time.  Is it zeroed when the per_cpu
  * data area is removed.
  */
-u8 x86_cpu_to_apicid_init[NR_CPUS] __initdata
+u16 x86_cpu_to_apicid_init[NR_CPUS] __initdata
 					= { [0 ... NR_CPUS-1] = BAD_APICID };
 void *x86_cpu_to_apicid_ptr;
-DEFINE_PER_CPU(u8, x86_cpu_to_apicid) = BAD_APICID;
+DEFINE_PER_CPU(u16, x86_cpu_to_apicid) = BAD_APICID;
 EXPORT_PER_CPU_SYMBOL(x86_cpu_to_apicid);
 
 struct genapic __read_mostly *genapic = &apic_flat;
--- a/arch/x86/kernel/mpparse_64.c
+++ b/arch/x86/kernel/mpparse_64.c
@@ -67,7 +67,7 @@ unsigned disabled_cpus __cpuinitdata;
 /* Bitmask of physically existing CPUs */
 physid_mask_t phys_cpu_present_map = PHYSID_MASK_NONE;
 
-u8 bios_cpu_apicid[NR_CPUS] = { [0 ... NR_CPUS-1] = BAD_APICID };
+u16 bios_cpu_apicid[NR_CPUS] = { [0 ... NR_CPUS-1] = BAD_APICID };
 
 
 /*
@@ -132,7 +132,7 @@ static void __cpuinit MP_processor_info(
 	 * area is created.
 	 */
 	if (x86_cpu_to_apicid_ptr) {
-		u8 *x86_cpu_to_apicid = (u8 *)x86_cpu_to_apicid_ptr;
+		u16 *x86_cpu_to_apicid = x86_cpu_to_apicid_ptr;
 		x86_cpu_to_apicid[cpu] = m->mpc_apicid;
 	} else {
 		per_cpu(x86_cpu_to_apicid, cpu) = m->mpc_apicid;
--- a/arch/x86/kernel/smpboot_64.c
+++ b/arch/x86/kernel/smpboot_64.c
@@ -65,7 +65,7 @@ int smp_num_siblings = 1;
 EXPORT_SYMBOL(smp_num_siblings);
 
 /* Last level cache ID of each logical CPU */
-DEFINE_PER_CPU(u8, cpu_llc_id) = BAD_APICID;
+DEFINE_PER_CPU(u16, cpu_llc_id) = BAD_APICID;
 
 /* Bitmask of currently online CPUs */
 cpumask_t cpu_online_map __read_mostly;
--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -627,7 +627,7 @@ void __init init_cpu_to_node(void)
 	int i;
 
 	for (i = 0; i < NR_CPUS; i++) {
-		u8 apicid = x86_cpu_to_apicid_init[i];
+		u16 apicid = x86_cpu_to_apicid_init[i];
 
 		if (apicid == BAD_APICID)
 			continue;
--- a/arch/x86/mm/srat_64.c
+++ b/arch/x86/mm/srat_64.c
@@ -130,6 +130,9 @@ void __init
 acpi_numa_processor_affinity_init(struct acpi_srat_cpu_affinity *pa)
 {
 	int pxm, node;
+	int apic_id;
+
+	apic_id = pa->apic_id;
 	if (srat_disabled())
 		return;
 	if (pa->header.length != sizeof(struct acpi_srat_cpu_affinity)) {
@@ -145,10 +148,10 @@ acpi_numa_processor_affinity_init(struct
 		bad_srat();
 		return;
 	}
-	apicid_to_node[pa->apic_id] = node;
+	apicid_to_node[apic_id] = node;
 	acpi_numa = 1;
 	printk(KERN_INFO "SRAT: PXM %u -> APIC %u -> Node %u\n",
-	       pxm, pa->apic_id, node);
+	       pxm, apic_id, node);
 }
 
 int update_end_of_memory(unsigned long end) {return -1;}
@@ -343,7 +346,12 @@ int __init acpi_scan_nodes(unsigned long
 	/* First clean up the node list */
 	for (i = 0; i < MAX_NUMNODES; i++) {
 		cutoff_node(i, start, end);
-		if ((nodes[i].end - nodes[i].start) < NODE_MIN_SIZE) {
+		/*
+		 * don't confuse VM with a node that doesn't have the
+		 * minimum memory.
+		 */
+		if (nodes[i].end &&
+			(nodes[i].end - nodes[i].start) < NODE_MIN_SIZE) {
 			unparse_node(i);
 			node_set_offline(i);
 		}
@@ -384,6 +392,12 @@ int __init acpi_scan_nodes(unsigned long
 }
 
 #ifdef CONFIG_NUMA_EMU
+static int fake_node_to_pxm_map[MAX_NUMNODES] __initdata = {
+	[0 ... MAX_NUMNODES-1] = PXM_INVAL
+};
+static unsigned char fake_apicid_to_node[MAX_LOCAL_APIC] __initdata = {
+	[0 ... MAX_LOCAL_APIC-1] = NUMA_NO_NODE
+};
 static int __init find_node_by_addr(unsigned long addr)
 {
 	int ret = NUMA_NO_NODE;
@@ -414,12 +428,6 @@ static int __init find_node_by_addr(unsi
 void __init acpi_fake_nodes(const struct bootnode *fake_nodes, int num_nodes)
 {
 	int i, j;
-	int fake_node_to_pxm_map[MAX_NUMNODES] = {
-		[0 ... MAX_NUMNODES-1] = PXM_INVAL
-	};
-	unsigned char fake_apicid_to_node[MAX_LOCAL_APIC] = {
-		[0 ... MAX_LOCAL_APIC-1] = NUMA_NO_NODE
-	};
 
 	printk(KERN_INFO "Faking PXM affinity for fake nodes on real "
 			 "topology.\n");
--- a/include/asm-x86/processor.h
+++ b/include/asm-x86/processor.h
@@ -86,14 +86,14 @@ struct cpuinfo_x86 {
 #ifdef CONFIG_SMP
 	cpumask_t llc_shared_map;	/* cpus sharing the last level cache */
 #endif
-	unsigned char x86_max_cores;	/* cpuid returned max cores value */
-	unsigned char apicid;
-	unsigned short x86_clflush_size;
+	u16 x86_max_cores;		/* cpuid returned max cores value */
+	u16 apicid;
+	u16 x86_clflush_size;
 #ifdef CONFIG_SMP
-	unsigned char booted_cores;	/* number of cores as seen by OS */
-	__u8 phys_proc_id; 		/* Physical processor id. */
-	__u8 cpu_core_id;  		/* Core id */
-	__u8 cpu_index;			/* index into per_cpu list */
+	u16 booted_cores;		/* number of cores as seen by OS */
+	u16 phys_proc_id; 		/* Physical processor id. */
+	u16 cpu_core_id;  		/* Core id */
+	u16 cpu_index;			/* index into per_cpu list */
 #endif
 } __attribute__((__aligned__(SMP_CACHE_BYTES)));
 
--- a/include/asm-x86/smp_64.h
+++ b/include/asm-x86/smp_64.h
@@ -26,14 +26,14 @@ extern void unlock_ipi_call_lock(void);
 extern int smp_call_function_mask(cpumask_t mask, void (*func)(void *),
 				  void *info, int wait);
 
-extern u8 __initdata x86_cpu_to_apicid_init[];
+extern u16 __initdata x86_cpu_to_apicid_init[];
 extern void *x86_cpu_to_apicid_ptr;
-extern u8 bios_cpu_apicid[];
+extern u16 bios_cpu_apicid[];
 
 DECLARE_PER_CPU(cpumask_t, cpu_sibling_map);
 DECLARE_PER_CPU(cpumask_t, cpu_core_map);
-DECLARE_PER_CPU(u8, cpu_llc_id);
-DECLARE_PER_CPU(u8, x86_cpu_to_apicid);
+DECLARE_PER_CPU(u16, cpu_llc_id);
+DECLARE_PER_CPU(u16, x86_cpu_to_apicid);
 
 static inline int cpu_present_to_apicid(int mps_cpu)
 {

-- 


* [PATCH 02/10] x86: Change size of node ids from u8 to u16 V3
  2008-01-16 17:09 [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs V3 travis
  2008-01-16 17:09 ` [PATCH 01/10] x86: Change size of APICIDs from u8 to u16 V3 travis
@ 2008-01-16 17:09 ` travis
  2008-01-16 17:53   ` Eric Dumazet
  2008-01-16 17:09 ` [PATCH 03/10] x86: Change NR_CPUS arrays in powernow-k8 V3 travis
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 17+ messages in thread
From: travis @ 2008-01-16 17:09 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen, mingo
  Cc: Eric Dumazet, Christoph Lameter, Jack Steiner, linux-mm, linux-kernel

[-- Attachment #1: big_nodeids --]
[-- Type: text/plain, Size: 4106 bytes --]

Change the size of node ids from 8 bits to 16 bits to
accommodate more than 256 nodes.
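
The 128-byte figure in the V2->V3 note below is simple size arithmetic
(assuming 8-byte pointers on x86_64):

        struct memnode {
                int shift;              /*   4 bytes                          */
                unsigned int mapsize;   /*   4 bytes                          */
                u16 *map;               /*   8 bytes                          */
                u16 embedded_map[64-8]; /*  56 entries * 2 bytes = 112 bytes  */
        } ____cacheline_aligned;        /* 16 + 112 = 128 bytes total         */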

Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
---
V1->V2:
    - changed pxm_to_node_map to u16
    - changed memnode map entries to u16
V2->V3:
    - changed memnode.embedded_map from [64-16] to [64-8]
      (and size comment to 128 bytes)
---
 arch/x86/mm/numa_64.c       |    9 ++++++---
 arch/x86/mm/srat_64.c       |    2 +-
 drivers/acpi/numa.c         |    2 +-
 include/asm-x86/mmzone_64.h |    6 +++---
 include/asm-x86/numa_64.h   |    4 ++--
 include/asm-x86/topology.h  |    2 +-
 6 files changed, 14 insertions(+), 11 deletions(-)

--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -11,6 +11,7 @@
 #include <linux/ctype.h>
 #include <linux/module.h>
 #include <linux/nodemask.h>
+#include <linux/sched.h>
 
 #include <asm/e820.h>
 #include <asm/proto.h>
@@ -30,12 +31,12 @@ bootmem_data_t plat_node_bdata[MAX_NUMNO
 
 struct memnode memnode;
 
-int cpu_to_node_map[NR_CPUS] __read_mostly = {
+u16 cpu_to_node_map[NR_CPUS] __read_mostly = {
 	[0 ... NR_CPUS-1] = NUMA_NO_NODE
 };
 EXPORT_SYMBOL(cpu_to_node_map);
 
-unsigned char apicid_to_node[MAX_LOCAL_APIC] __cpuinitdata = {
+u16 apicid_to_node[MAX_LOCAL_APIC] __cpuinitdata = {
 	[0 ... MAX_LOCAL_APIC-1] = NUMA_NO_NODE
 };
 
@@ -544,7 +545,9 @@ void __init numa_initmem_init(unsigned l
 	node_set(0, node_possible_map);
 	for (i = 0; i < NR_CPUS; i++)
 		numa_set_node(i, 0);
-	node_to_cpumask_map[0] = cpumask_of_cpu(0);
+	/* we can't use cpumask_of_cpu() yet */
+	memset(&node_to_cpumask_map[0], 0, sizeof(node_to_cpumask_map[0]));
+	cpu_set(0, node_to_cpumask_map[0]);
 	e820_register_active_regions(0, start_pfn, end_pfn);
 	setup_node_bootmem(0, start_pfn << PAGE_SHIFT, end_pfn << PAGE_SHIFT);
 }
--- a/arch/x86/mm/srat_64.c
+++ b/arch/x86/mm/srat_64.c
@@ -395,7 +395,7 @@ int __init acpi_scan_nodes(unsigned long
 static int fake_node_to_pxm_map[MAX_NUMNODES] __initdata = {
 	[0 ... MAX_NUMNODES-1] = PXM_INVAL
 };
-static unsigned char fake_apicid_to_node[MAX_LOCAL_APIC] __initdata = {
+static u16 fake_apicid_to_node[MAX_LOCAL_APIC] __initdata = {
 	[0 ... MAX_LOCAL_APIC-1] = NUMA_NO_NODE
 };
 static int __init find_node_by_addr(unsigned long addr)
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -38,7 +38,7 @@ ACPI_MODULE_NAME("numa");
 static nodemask_t nodes_found_map = NODE_MASK_NONE;
 
 /* maps to convert between proximity domain and logical node ID */
-static int pxm_to_node_map[MAX_PXM_DOMAINS]
+static u16 pxm_to_node_map[MAX_PXM_DOMAINS]
 				= { [0 ... MAX_PXM_DOMAINS - 1] = NID_INVAL };
 static int node_to_pxm_map[MAX_NUMNODES]
 				= { [0 ... MAX_NUMNODES - 1] = PXM_INVAL };
--- a/include/asm-x86/mmzone_64.h
+++ b/include/asm-x86/mmzone_64.h
@@ -15,9 +15,9 @@
 struct memnode {
 	int shift;
 	unsigned int mapsize;
-	u8 *map;
-	u8 embedded_map[64-16];
-} ____cacheline_aligned; /* total size = 64 bytes */
+	u16 *map;
+	u16 embedded_map[64-8];
+} ____cacheline_aligned; /* total size = 128 bytes */
 extern struct memnode memnode;
 #define memnode_shift memnode.shift
 #define memnodemap memnode.map
--- a/include/asm-x86/numa_64.h
+++ b/include/asm-x86/numa_64.h
@@ -20,7 +20,7 @@ extern void numa_set_node(int cpu, int n
 extern void srat_reserve_add_area(int nodeid);
 extern int hotadd_percent;
 
-extern unsigned char apicid_to_node[MAX_LOCAL_APIC];
+extern u16 apicid_to_node[MAX_LOCAL_APIC];
 
 extern void numa_initmem_init(unsigned long start_pfn, unsigned long end_pfn);
 extern unsigned long numa_free_all_bootmem(void);
@@ -40,6 +40,6 @@ static inline void clear_node_cpumask(in
 #define clear_node_cpumask(cpu) do {} while (0)
 #endif
 
-#define NUMA_NO_NODE 0xff
+#define NUMA_NO_NODE 0xffff
 
 #endif
--- a/include/asm-x86/topology.h
+++ b/include/asm-x86/topology.h
@@ -30,7 +30,7 @@
 #include <asm/mpspec.h>
 
 /* Mappings between logical cpu number and node number */
-extern int cpu_to_node_map[];
+extern u16 cpu_to_node_map[];
 extern cpumask_t node_to_cpumask_map[];
 
 /* Returns the number of the node containing CPU 'cpu' */

-- 


* [PATCH 03/10] x86: Change NR_CPUS arrays in powernow-k8 V3
  2008-01-16 17:09 [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs V3 travis
  2008-01-16 17:09 ` [PATCH 01/10] x86: Change size of APICIDs from u8 to u16 V3 travis
  2008-01-16 17:09 ` [PATCH 02/10] x86: Change size of node ids " travis
@ 2008-01-16 17:09 ` travis
  2008-01-16 17:09 ` [PATCH 04/10] x86: Change NR_CPUS arrays in intel_cacheinfo V3 travis
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: travis @ 2008-01-16 17:09 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen, mingo
  Cc: Eric Dumazet, Christoph Lameter, Jack Steiner, linux-mm, linux-kernel

[-- Attachment #1: NR_CPUS-arrays-in-powernow-k8 --]
[-- Type: text/plain, Size: 2243 bytes --]

Change the following static arrays sized by NR_CPUS to
per_cpu data variables:

	powernow_k8_data *powernow_data[NR_CPUS];
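
A condensed view of what the conversion looks like at the access sites
(fragments trimmed from the hunks below):

        static DEFINE_PER_CPU(struct powernow_k8_data *, powernow_data);

        /* lookups change from powernow_data[pol->cpu] to: */
        struct powernow_k8_data *data = per_cpu(powernow_data, pol->cpu);

        /* and stores change from powernow_data[pol->cpu] = data to: */
        per_cpu(powernow_data, pol->cpu) = data;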


Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
---
V1->V2:
    - (none)
---
 arch/x86/kernel/cpu/cpufreq/powernow-k8.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/arch/x86/kernel/cpu/cpufreq/powernow-k8.c
+++ b/arch/x86/kernel/cpu/cpufreq/powernow-k8.c
@@ -53,7 +53,7 @@
 /* serialize freq changes  */
 static DEFINE_MUTEX(fidvid_mutex);
 
-static struct powernow_k8_data *powernow_data[NR_CPUS];
+static DEFINE_PER_CPU(struct powernow_k8_data *, powernow_data);
 
 static int cpu_family = CPU_OPTERON;
 
@@ -1052,7 +1052,7 @@ static int transition_frequency_pstate(s
 static int powernowk8_target(struct cpufreq_policy *pol, unsigned targfreq, unsigned relation)
 {
 	cpumask_t oldmask = CPU_MASK_ALL;
-	struct powernow_k8_data *data = powernow_data[pol->cpu];
+	struct powernow_k8_data *data = per_cpu(powernow_data, pol->cpu);
 	u32 checkfid;
 	u32 checkvid;
 	unsigned int newstate;
@@ -1128,7 +1128,7 @@ err_out:
 /* Driver entry point to verify the policy and range of frequencies */
 static int powernowk8_verify(struct cpufreq_policy *pol)
 {
-	struct powernow_k8_data *data = powernow_data[pol->cpu];
+	struct powernow_k8_data *data = per_cpu(powernow_data, pol->cpu);
 
 	if (!data)
 		return -EINVAL;
@@ -1233,7 +1233,7 @@ static int __cpuinit powernowk8_cpu_init
 		dprintk("cpu_init done, current fid 0x%x, vid 0x%x\n",
 			data->currfid, data->currvid);
 
-	powernow_data[pol->cpu] = data;
+	per_cpu(powernow_data, pol->cpu) = data;
 
 	return 0;
 
@@ -1247,7 +1247,7 @@ err_out:
 
 static int __devexit powernowk8_cpu_exit (struct cpufreq_policy *pol)
 {
-	struct powernow_k8_data *data = powernow_data[pol->cpu];
+	struct powernow_k8_data *data = per_cpu(powernow_data, pol->cpu);
 
 	if (!data)
 		return -EINVAL;
@@ -1268,7 +1268,7 @@ static unsigned int powernowk8_get (unsi
 	cpumask_t oldmask = current->cpus_allowed;
 	unsigned int khz = 0;
 
-	data = powernow_data[first_cpu(per_cpu(cpu_core_map, cpu))];
+	data = per_cpu(powernow_data, first_cpu(per_cpu(cpu_core_map, cpu)));
 
 	if (!data)
 		return -EINVAL;

-- 


* [PATCH 04/10] x86: Change NR_CPUS arrays in intel_cacheinfo V3
  2008-01-16 17:09 [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs V3 travis
                   ` (2 preceding siblings ...)
  2008-01-16 17:09 ` [PATCH 03/10] x86: Change NR_CPUS arrays in powernow-k8 V3 travis
@ 2008-01-16 17:09 ` travis
  2008-01-16 17:09 ` [PATCH 05/10] x86: Change NR_CPUS arrays in smpboot_64 V3 travis
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: travis @ 2008-01-16 17:09 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen, mingo
  Cc: Eric Dumazet, Christoph Lameter, Jack Steiner, linux-mm, linux-kernel

[-- Attachment #1: NR_CPUS-arrays-in-intel_cacheinfo --]
[-- Type: text/plain, Size: 5866 bytes --]

Change the following static arrays sized by NR_CPUS to
per_cpu data variables:

	_cpuid4_info *cpuid4_info[NR_CPUS];
	_index_kobject *index_kobject[NR_CPUS];
	kobject * cache_kobject[NR_CPUS];
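
Two of these are pointers to per-leaf arrays, so the lookup macros are
simply reworked on top of per_cpu(); for instance (mirroring the hunks
below):

        static DEFINE_PER_CPU(struct _cpuid4_info *, cpuid4_info);
        #define CPUID4_INFO_IDX(x, y)   (&((per_cpu(cpuid4_info, x))[y]))

        /* per_cpu(cpuid4_info, cpu) points at a kzalloc'ed array with one
         * entry per cache leaf; CPUID4_INFO_IDX(cpu, index) selects a leaf. */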

Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
---
V1->V2:
    - (none)
---
 arch/x86/kernel/cpu/intel_cacheinfo.c |   55 +++++++++++++++++-----------------
 1 file changed, 29 insertions(+), 26 deletions(-)

--- a/arch/x86/kernel/cpu/intel_cacheinfo.c
+++ b/arch/x86/kernel/cpu/intel_cacheinfo.c
@@ -451,8 +451,8 @@ unsigned int __cpuinit init_intel_cachei
 }
 
 /* pointer to _cpuid4_info array (for each cache leaf) */
-static struct _cpuid4_info *cpuid4_info[NR_CPUS];
-#define CPUID4_INFO_IDX(x,y)    (&((cpuid4_info[x])[y]))
+static DEFINE_PER_CPU(struct _cpuid4_info *, cpuid4_info);
+#define CPUID4_INFO_IDX(x,y)    (&((per_cpu(cpuid4_info, x))[y]))
 
 #ifdef CONFIG_SMP
 static void __cpuinit cache_shared_cpu_map_setup(unsigned int cpu, int index)
@@ -474,7 +474,7 @@ static void __cpuinit cache_shared_cpu_m
 			if (cpu_data(i).apicid >> index_msb ==
 			    c->apicid >> index_msb) {
 				cpu_set(i, this_leaf->shared_cpu_map);
-				if (i != cpu && cpuid4_info[i])  {
+				if (i != cpu && per_cpu(cpuid4_info, i))  {
 					sibling_leaf = CPUID4_INFO_IDX(i, index);
 					cpu_set(cpu, sibling_leaf->shared_cpu_map);
 				}
@@ -505,8 +505,8 @@ static void __cpuinit free_cache_attribu
 	for (i = 0; i < num_cache_leaves; i++)
 		cache_remove_shared_cpu_map(cpu, i);
 
-	kfree(cpuid4_info[cpu]);
-	cpuid4_info[cpu] = NULL;
+	kfree(per_cpu(cpuid4_info, cpu));
+	per_cpu(cpuid4_info, cpu) = NULL;
 }
 
 static int __cpuinit detect_cache_attributes(unsigned int cpu)
@@ -519,9 +519,9 @@ static int __cpuinit detect_cache_attrib
 	if (num_cache_leaves == 0)
 		return -ENOENT;
 
-	cpuid4_info[cpu] = kzalloc(
+	per_cpu(cpuid4_info, cpu) = kzalloc(
 	    sizeof(struct _cpuid4_info) * num_cache_leaves, GFP_KERNEL);
-	if (cpuid4_info[cpu] == NULL)
+	if (per_cpu(cpuid4_info, cpu) == NULL)
 		return -ENOMEM;
 
 	oldmask = current->cpus_allowed;
@@ -546,8 +546,8 @@ static int __cpuinit detect_cache_attrib
 
 out:
 	if (retval) {
-		kfree(cpuid4_info[cpu]);
-		cpuid4_info[cpu] = NULL;
+		kfree(per_cpu(cpuid4_info, cpu));
+		per_cpu(cpuid4_info, cpu) = NULL;
 	}
 
 	return retval;
@@ -561,7 +561,7 @@ out:
 extern struct sysdev_class cpu_sysdev_class; /* from drivers/base/cpu.c */
 
 /* pointer to kobject for cpuX/cache */
-static struct kobject * cache_kobject[NR_CPUS];
+static DEFINE_PER_CPU(struct kobject *, cache_kobject);
 
 struct _index_kobject {
 	struct kobject kobj;
@@ -570,8 +570,8 @@ struct _index_kobject {
 };
 
 /* pointer to array of kobjects for cpuX/cache/indexY */
-static struct _index_kobject *index_kobject[NR_CPUS];
-#define INDEX_KOBJECT_PTR(x,y)    (&((index_kobject[x])[y]))
+static DEFINE_PER_CPU(struct _index_kobject *, index_kobject);
+#define INDEX_KOBJECT_PTR(x,y)    (&((per_cpu(index_kobject, x))[y]))
 
 #define show_one_plus(file_name, object, val)				\
 static ssize_t show_##file_name						\
@@ -684,10 +684,10 @@ static struct kobj_type ktype_percpu_ent
 
 static void __cpuinit cpuid4_cache_sysfs_exit(unsigned int cpu)
 {
-	kfree(cache_kobject[cpu]);
-	kfree(index_kobject[cpu]);
-	cache_kobject[cpu] = NULL;
-	index_kobject[cpu] = NULL;
+	kfree(per_cpu(cache_kobject, cpu));
+	kfree(per_cpu(index_kobject, cpu));
+	per_cpu(cache_kobject, cpu) = NULL;
+	per_cpu(index_kobject, cpu) = NULL;
 	free_cache_attributes(cpu);
 }
 
@@ -703,13 +703,14 @@ static int __cpuinit cpuid4_cache_sysfs_
 		return err;
 
 	/* Allocate all required memory */
-	cache_kobject[cpu] = kzalloc(sizeof(struct kobject), GFP_KERNEL);
-	if (unlikely(cache_kobject[cpu] == NULL))
+	per_cpu(cache_kobject, cpu) =
+		kzalloc(sizeof(struct kobject), GFP_KERNEL);
+	if (unlikely(per_cpu(cache_kobject, cpu) == NULL))
 		goto err_out;
 
-	index_kobject[cpu] = kzalloc(
+	per_cpu(index_kobject, cpu) = kzalloc(
 	    sizeof(struct _index_kobject ) * num_cache_leaves, GFP_KERNEL);
-	if (unlikely(index_kobject[cpu] == NULL))
+	if (unlikely(per_cpu(index_kobject, cpu) == NULL))
 		goto err_out;
 
 	return 0;
@@ -733,7 +734,8 @@ static int __cpuinit cache_add_dev(struc
 	if (unlikely(retval < 0))
 		return retval;
 
-	retval = kobject_init_and_add(cache_kobject[cpu], &ktype_percpu_entry,
+	retval = kobject_init_and_add(per_cpu(cache_kobject, cpu),
+				      &ktype_percpu_entry,
 				      &sys_dev->kobj, "%s", "cache");
 	if (retval < 0) {
 		cpuid4_cache_sysfs_exit(cpu);
@@ -745,13 +747,14 @@ static int __cpuinit cache_add_dev(struc
 		this_object->cpu = cpu;
 		this_object->index = i;
 		retval = kobject_init_and_add(&(this_object->kobj),
-					      &ktype_cache, cache_kobject[cpu],
+					      &ktype_cache,
+					      per_cpu(cache_kobject, cpu),
 					      "index%1lu", i);
 		if (unlikely(retval)) {
 			for (j = 0; j < i; j++) {
 				kobject_put(&(INDEX_KOBJECT_PTR(cpu,j)->kobj));
 			}
-			kobject_put(cache_kobject[cpu]);
+			kobject_put(per_cpu(cache_kobject, cpu));
 			cpuid4_cache_sysfs_exit(cpu);
 			break;
 		}
@@ -760,7 +763,7 @@ static int __cpuinit cache_add_dev(struc
 	if (!retval)
 		cpu_set(cpu, cache_dev_map);
 
-	kobject_uevent(cache_kobject[cpu], KOBJ_ADD);
+	kobject_uevent(per_cpu(cache_kobject, cpu), KOBJ_ADD);
 	return retval;
 }
 
@@ -769,7 +772,7 @@ static void __cpuinit cache_remove_dev(s
 	unsigned int cpu = sys_dev->id;
 	unsigned long i;
 
-	if (cpuid4_info[cpu] == NULL)
+	if (per_cpu(cpuid4_info, cpu) == NULL)
 		return;
 	if (!cpu_isset(cpu, cache_dev_map))
 		return;
@@ -777,7 +780,7 @@ static void __cpuinit cache_remove_dev(s
 
 	for (i = 0; i < num_cache_leaves; i++)
 		kobject_put(&(INDEX_KOBJECT_PTR(cpu,i)->kobj));
-	kobject_put(cache_kobject[cpu]);
+	kobject_put(per_cpu(cache_kobject, cpu));
 	cpuid4_cache_sysfs_exit(cpu);
 }
 

-- 


* [PATCH 05/10] x86: Change NR_CPUS arrays in smpboot_64 V3
  2008-01-16 17:09 [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs V3 travis
                   ` (3 preceding siblings ...)
  2008-01-16 17:09 ` [PATCH 04/10] x86: Change NR_CPUS arrays in intel_cacheinfo V3 travis
@ 2008-01-16 17:09 ` travis
  2008-01-16 17:09 ` [PATCH 06/10] x86: Change NR_CPUS arrays in topology V3 travis
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: travis @ 2008-01-16 17:09 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen, mingo
  Cc: Eric Dumazet, Christoph Lameter, Jack Steiner, linux-mm, linux-kernel

[-- Attachment #1: NR_CPUS-arrays-in-smpboot_64 --]
[-- Type: text/plain, Size: 1312 bytes --]

Change the following static arrays sized by NR_CPUS to
per_cpu data variables:

	task_struct *idle_thread_array[NR_CPUS];

This is only done if CONFIG_HOTPLUG_CPU is defined;
otherwise the array is discarded after initialization
anyway.
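
(For scale: with NR_CPUS=1024 the static array is 1024 pointers * 8 bytes
= 8 KB in the image, versus a single 8-byte slot in the per-cpu template.)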

Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
---
V1->V2:
    - (none)
---
 arch/x86/kernel/smpboot_64.c |   12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

--- a/arch/x86/kernel/smpboot_64.c
+++ b/arch/x86/kernel/smpboot_64.c
@@ -111,10 +111,20 @@ DEFINE_PER_CPU(int, cpu_state) = { 0 };
  * a new thread. Also avoids complicated thread destroy functionality
  * for idle threads.
  */
+#ifdef CONFIG_HOTPLUG_CPU
+/*
+ * Needed only for CONFIG_HOTPLUG_CPU because __cpuinitdata is
+ * removed after init for !CONFIG_HOTPLUG_CPU.
+ */
+static DEFINE_PER_CPU(struct task_struct *, idle_thread_array);
+#define get_idle_for_cpu(x)     (per_cpu(idle_thread_array, x))
+#define set_idle_for_cpu(x,p)   (per_cpu(idle_thread_array, x) = (p))
+#else
 struct task_struct *idle_thread_array[NR_CPUS] __cpuinitdata ;
-
 #define get_idle_for_cpu(x)     (idle_thread_array[(x)])
 #define set_idle_for_cpu(x,p)   (idle_thread_array[(x)] = (p))
+#endif
+
 
 /*
  * Currently trivial. Write the real->protected mode

-- 


* [PATCH 06/10] x86: Change NR_CPUS arrays in topology V3
  2008-01-16 17:09 [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs V3 travis
                   ` (4 preceding siblings ...)
  2008-01-16 17:09 ` [PATCH 05/10] x86: Change NR_CPUS arrays in smpboot_64 V3 travis
@ 2008-01-16 17:09 ` travis
  2008-01-16 17:09 ` [PATCH 07/10] x86: Cleanup x86_cpu_to_apicid references V3 travis
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: travis @ 2008-01-16 17:09 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen, mingo
  Cc: Eric Dumazet, Christoph Lameter, Jack Steiner, linux-mm, linux-kernel

[-- Attachment #1: NR_CPUS-arrays-in-topology --]
[-- Type: text/plain, Size: 1495 bytes --]

Change the following static arrays sized by NR_CPUS to
per_cpu data variables:

	i386_cpu cpu_devices[NR_CPUS];

(And change the struct name to x86_cpu.)

Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
---
V1->V2:
    - (none)
---
 arch/x86/kernel/topology.c |    8 ++++----
 include/asm-x86/cpu.h      |    2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

--- a/arch/x86/kernel/topology.c
+++ b/arch/x86/kernel/topology.c
@@ -31,7 +31,7 @@
 #include <linux/mmzone.h>
 #include <asm/cpu.h>
 
-static struct i386_cpu cpu_devices[NR_CPUS];
+static DEFINE_PER_CPU(struct x86_cpu, cpu_devices);
 
 int __cpuinit arch_register_cpu(int num)
 {
@@ -46,16 +46,16 @@ int __cpuinit arch_register_cpu(int num)
 	 */
 #ifdef CONFIG_HOTPLUG_CPU
 	if (num)
-		cpu_devices[num].cpu.hotpluggable = 1;
+		per_cpu(cpu_devices, num).cpu.hotpluggable = 1;
 #endif
 
-	return register_cpu(&cpu_devices[num].cpu, num);
+	return register_cpu(&per_cpu(cpu_devices, num).cpu, num);
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
 void arch_unregister_cpu(int num)
 {
-	return unregister_cpu(&cpu_devices[num].cpu);
+	return unregister_cpu(&per_cpu(cpu_devices, num).cpu);
 }
 EXPORT_SYMBOL(arch_register_cpu);
 EXPORT_SYMBOL(arch_unregister_cpu);
--- a/include/asm-x86/cpu.h
+++ b/include/asm-x86/cpu.h
@@ -7,7 +7,7 @@
 #include <linux/nodemask.h>
 #include <linux/percpu.h>
 
-struct i386_cpu {
+struct x86_cpu {
 	struct cpu cpu;
 };
 extern int arch_register_cpu(int num);

-- 


* [PATCH 07/10] x86: Cleanup x86_cpu_to_apicid references V3
  2008-01-16 17:09 [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs V3 travis
                   ` (5 preceding siblings ...)
  2008-01-16 17:09 ` [PATCH 06/10] x86: Change NR_CPUS arrays in topology V3 travis
@ 2008-01-16 17:09 ` travis
  2008-01-16 17:09 ` [PATCH 08/10] x86: Change NR_CPUS arrays in numa_64 V3 travis
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: travis @ 2008-01-16 17:09 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen, mingo
  Cc: Eric Dumazet, Christoph Lameter, Jack Steiner, linux-mm, linux-kernel

[-- Attachment #1: cleanup-x86_cpu_to_apicid --]
[-- Type: text/plain, Size: 5222 bytes --]

Clean up references to x86_cpu_to_apicid.  Remove the extraneous
comments and standardize on "x86_*_early_ptr" for the early
kernel init references.
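
All of the *_early_ptr variables in this series follow the same pattern;
a minimal sketch with made-up names (not the exact kernel code):

        static u16 map_init[NR_CPUS] __initdata;  /* filled before per-cpu areas exist */
        void *map_early_ptr;                      /* = &map_init during early boot     */
        static DEFINE_PER_CPU(u16, map);          /* final home of the data            */

        static void record(int cpu, u16 val)
        {
                if (map_early_ptr) {              /* early: write the static table */
                        u16 *early = map_early_ptr;

                        early[cpu] = val;
                } else
                        per_cpu(map, cpu) = val;  /* late: per-cpu area is live */
        }

setup_arch() points the early pointer at the __initdata array; once the
per-cpu areas are populated, the array is copied over and the pointer is
set back to NULL so the init data can be freed.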

Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
---
V1->V2:
    - Removed extraneous casts
---
 arch/x86/kernel/genapic_64.c |   11 ++---------
 arch/x86/kernel/mpparse_64.c |   14 +++++---------
 arch/x86/kernel/setup_64.c   |    2 +-
 arch/x86/kernel/smpboot_32.c |    9 ++-------
 arch/x86/kernel/smpboot_64.c |   16 +++++++++-------
 include/asm-x86/smp_32.h     |    2 +-
 include/asm-x86/smp_64.h     |    2 +-
 7 files changed, 21 insertions(+), 35 deletions(-)

--- a/arch/x86/kernel/genapic_64.c
+++ b/arch/x86/kernel/genapic_64.c
@@ -24,17 +24,10 @@
 #include <acpi/acpi_bus.h>
 #endif
 
-/*
- * which logical CPU number maps to which CPU (physical APIC ID)
- *
- * The following static array is used during kernel startup
- * and the x86_cpu_to_apicid_ptr contains the address of the
- * array during this time.  Is it zeroed when the per_cpu
- * data area is removed.
- */
+/* which logical CPU number maps to which CPU (physical APIC ID) */
 u16 x86_cpu_to_apicid_init[NR_CPUS] __initdata
 					= { [0 ... NR_CPUS-1] = BAD_APICID };
-void *x86_cpu_to_apicid_ptr;
+void *x86_cpu_to_apicid_early_ptr;
 DEFINE_PER_CPU(u16, x86_cpu_to_apicid) = BAD_APICID;
 EXPORT_PER_CPU_SYMBOL(x86_cpu_to_apicid);
 
--- a/arch/x86/kernel/mpparse_64.c
+++ b/arch/x86/kernel/mpparse_64.c
@@ -125,15 +125,11 @@ static void __cpuinit MP_processor_info(
 		cpu = 0;
  	}
 	bios_cpu_apicid[cpu] = m->mpc_apicid;
-	/*
-	 * We get called early in the the start_kernel initialization
-	 * process when the per_cpu data area is not yet setup, so we
-	 * use a static array that is removed after the per_cpu data
-	 * area is created.
-	 */
-	if (x86_cpu_to_apicid_ptr) {
-		u16 *x86_cpu_to_apicid = x86_cpu_to_apicid_ptr;
-		x86_cpu_to_apicid[cpu] = m->mpc_apicid;
+	/* are we being called early in kernel startup? */
+	if (x86_cpu_to_apicid_early_ptr) {
+		u16 *cpu_to_apicid = x86_cpu_to_apicid_early_ptr;
+
+		cpu_to_apicid[cpu] = m->mpc_apicid;
 	} else {
 		per_cpu(x86_cpu_to_apicid, cpu) = m->mpc_apicid;
 	}
--- a/arch/x86/kernel/setup_64.c
+++ b/arch/x86/kernel/setup_64.c
@@ -373,7 +373,7 @@ void __init setup_arch(char **cmdline_p)
 
 #ifdef CONFIG_SMP
 	/* setup to use the static apicid table during kernel startup */
-	x86_cpu_to_apicid_ptr = (void *)&x86_cpu_to_apicid_init;
+	x86_cpu_to_apicid_early_ptr = (void *)&x86_cpu_to_apicid_init;
 #endif
 
 #ifdef CONFIG_ACPI
--- a/arch/x86/kernel/smpboot_32.c
+++ b/arch/x86/kernel/smpboot_32.c
@@ -91,15 +91,10 @@ static cpumask_t smp_commenced_mask;
 DEFINE_PER_CPU_SHARED_ALIGNED(struct cpuinfo_x86, cpu_info);
 EXPORT_PER_CPU_SYMBOL(cpu_info);
 
-/*
- * The following static array is used during kernel startup
- * and the x86_cpu_to_apicid_ptr contains the address of the
- * array during this time.  Is it zeroed when the per_cpu
- * data area is removed.
- */
+/* which logical CPU number maps to which CPU (physical APIC ID) */
 u8 x86_cpu_to_apicid_init[NR_CPUS] __initdata =
 			{ [0 ... NR_CPUS-1] = BAD_APICID };
-void *x86_cpu_to_apicid_ptr;
+void *x86_cpu_to_apicid_early_ptr;
 DEFINE_PER_CPU(u8, x86_cpu_to_apicid) = BAD_APICID;
 EXPORT_PER_CPU_SYMBOL(x86_cpu_to_apicid);
 
--- a/arch/x86/kernel/smpboot_64.c
+++ b/arch/x86/kernel/smpboot_64.c
@@ -852,23 +852,25 @@ static int __init smp_sanity_check(unsig
 }
 
 /*
- * Copy apicid's found by MP_processor_info from initial array to the per cpu
- * data area.  The x86_cpu_to_apicid_init array is then expendable and the
- * x86_cpu_to_apicid_ptr is zeroed indicating that the static array is no
- * longer available.
+ * Copy data used in early init routines from the initial arrays to the
+ * per cpu data areas.  These arrays then become expendable and the
+ * *_ptrs are zeroed indicating that the static arrays are gone.
  */
 void __init smp_set_apicids(void)
 {
 	int cpu;
 
-	for_each_cpu_mask(cpu, cpu_possible_map) {
+	for_each_possible_cpu(cpu) {
 		if (per_cpu_offset(cpu))
 			per_cpu(x86_cpu_to_apicid, cpu) =
 						x86_cpu_to_apicid_init[cpu];
+		else
+			printk(KERN_NOTICE "per_cpu_offset zero for cpu %d\n",
+									cpu);
 	}
 
-	/* indicate the static array will be going away soon */
-	x86_cpu_to_apicid_ptr = NULL;
+	/* indicate the early static arrays are gone */
+	x86_cpu_to_apicid_early_ptr = NULL;
 }
 
 static void __init smp_cpu_index_default(void)
--- a/include/asm-x86/smp_32.h
+++ b/include/asm-x86/smp_32.h
@@ -30,7 +30,7 @@ extern void (*mtrr_hook) (void);
 extern void zap_low_mappings (void);
 
 extern u8 __initdata x86_cpu_to_apicid_init[];
-extern void *x86_cpu_to_apicid_ptr;
+extern void *x86_cpu_to_apicid_early_ptr;
 
 DECLARE_PER_CPU(cpumask_t, cpu_sibling_map);
 DECLARE_PER_CPU(cpumask_t, cpu_core_map);
--- a/include/asm-x86/smp_64.h
+++ b/include/asm-x86/smp_64.h
@@ -27,7 +27,7 @@ extern int smp_call_function_mask(cpumas
 				  void *info, int wait);
 
 extern u16 __initdata x86_cpu_to_apicid_init[];
-extern void *x86_cpu_to_apicid_ptr;
+extern void *x86_cpu_to_apicid_early_ptr;
 extern u16 bios_cpu_apicid[];
 
 DECLARE_PER_CPU(cpumask_t, cpu_sibling_map);

-- 


* [PATCH 08/10] x86: Change NR_CPUS arrays in numa_64 V3
  2008-01-16 17:09 [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs V3 travis
                   ` (6 preceding siblings ...)
  2008-01-16 17:09 ` [PATCH 07/10] x86: Cleanup x86_cpu_to_apicid references V3 travis
@ 2008-01-16 17:09 ` travis
  2008-01-16 17:09 ` [PATCH 09/10] x86: Change NR_CPUS arrays in acpi-cpufreq V3 travis
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: travis @ 2008-01-16 17:09 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen, mingo
  Cc: Eric Dumazet, Christoph Lameter, Jack Steiner, linux-mm, linux-kernel

[-- Attachment #1: NR_CPUS-arrays-in-numa_64 --]
[-- Type: text/plain, Size: 9494 bytes --]

Change the following static arrays sized by NR_CPUS to
per_cpu data variables:

	char cpu_to_node_map[NR_CPUS];
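
With this change there are two lookup paths; a short usage sketch
(fragments, based on the topology.h and setup64.c hunks below):

        /* before the per-cpu areas are set up (e.g. in setup_per_cpu_areas()): */
        int node = early_cpu_to_node(cpu);   /* consults the __initdata table     */

        /* once per-cpu data is live, the normal accessor is fine: */
        int node2 = cpu_to_node(cpu);        /* per_cpu(x86_cpu_to_node_map, cpu) */

(early_cpu_to_node() falls back to the per-cpu map, or to NUMA_NO_NODE,
once the early pointer has been cleared.)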


Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
---
V1->V2:
    - Removed extraneous casts
    - Fix !NUMA builds with '#ifdef CONFIG_NUMA'

V2->V3:
    - add early_cpu_to_node function to keep cpu_to_node efficient
    - move and rename smp_set_apicids() to setup_percpu_maps()
    - call setup_percpu_maps() as early as possible
---
 arch/x86/kernel/setup64.c    |   37 +++++++++++++++++++++++++++++++++++--
 arch/x86/kernel/setup_64.c   |    6 +++++-
 arch/x86/kernel/smpboot_64.c |   27 ++-------------------------
 arch/x86/mm/numa_64.c        |   20 ++++++++++++++++----
 arch/x86/mm/srat_64.c        |    5 +++--
 include/asm-x86/numa_64.h    |    9 ---------
 include/asm-x86/topology.h   |   23 +++++++++++++++++++++--
 mm/page_alloc.c              |    2 +-
 net/sunrpc/svc.c             |    1 +
 9 files changed, 84 insertions(+), 46 deletions(-)

--- a/arch/x86/kernel/setup64.c
+++ b/arch/x86/kernel/setup64.c
@@ -80,6 +80,36 @@ static int __init nonx32_setup(char *str
 __setup("noexec32=", nonx32_setup);
 
 /*
+ * Copy data used in early init routines from the initial arrays to the
+ * per cpu data areas.  These arrays then become expendable and the *_ptrs
+ * are zeroed indicating that the static arrays are gone.
+ */
+void __init setup_percpu_maps(void)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		if (per_cpu_offset(cpu)) {
+			per_cpu(x86_cpu_to_apicid, cpu) =
+						x86_cpu_to_apicid_init[cpu];
+#ifdef CONFIG_NUMA
+			per_cpu(x86_cpu_to_node_map, cpu) =
+						x86_cpu_to_node_map_init[cpu];
+#endif
+		}
+		else
+			printk(KERN_NOTICE "per_cpu_offset zero for cpu %d\n",
+									cpu);
+	}
+
+	/* indicate the early static arrays are gone */
+	x86_cpu_to_apicid_early_ptr = NULL;
+#ifdef CONFIG_NUMA
+	x86_cpu_to_node_map_early_ptr = NULL;
+#endif
+}
+
+/*
  * Great future plan:
  * Declare PDA itself and support (irqstack,tss,pgd) as per cpu data.
  * Always point %gs to its beginning
@@ -100,18 +130,21 @@ void __init setup_per_cpu_areas(void)
 	for_each_cpu_mask (i, cpu_possible_map) {
 		char *ptr;
 
-		if (!NODE_DATA(cpu_to_node(i))) {
+		if (!NODE_DATA(early_cpu_to_node(i))) {
 			printk("cpu with no node %d, num_online_nodes %d\n",
 			       i, num_online_nodes());
 			ptr = alloc_bootmem_pages(size);
 		} else { 
-			ptr = alloc_bootmem_pages_node(NODE_DATA(cpu_to_node(i)), size);
+			ptr = alloc_bootmem_pages_node(NODE_DATA(early_cpu_to_node(i)), size);
 		}
 		if (!ptr)
 			panic("Cannot allocate cpu data for CPU %d\n", i);
 		cpu_pda(i)->data_offset = ptr - __per_cpu_start;
 		memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
 	}
+
+	/* setup percpu data maps early */
+	setup_percpu_maps();
 } 
 
 void pda_init(int cpu)
--- a/arch/x86/kernel/setup_64.c
+++ b/arch/x86/kernel/setup_64.c
@@ -63,6 +63,7 @@
 #include <asm/cacheflush.h>
 #include <asm/mce.h>
 #include <asm/ds.h>
+#include <asm/topology.h>
 
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
@@ -372,8 +373,11 @@ void __init setup_arch(char **cmdline_p)
 	io_delay_init();
 
 #ifdef CONFIG_SMP
-	/* setup to use the static apicid table during kernel startup */
+	/* setup to use the early static init tables during kernel startup */
 	x86_cpu_to_apicid_early_ptr = (void *)&x86_cpu_to_apicid_init;
+#ifdef CONFIG_NUMA
+	x86_cpu_to_node_map_early_ptr = (void *)&x86_cpu_to_node_map_init;
+#endif
 #endif
 
 #ifdef CONFIG_ACPI
--- a/arch/x86/kernel/smpboot_64.c
+++ b/arch/x86/kernel/smpboot_64.c
@@ -702,7 +702,7 @@ do_rest:
 	if (boot_error) {
 		cpu_clear(cpu, cpu_callout_map); /* was set here (do_boot_cpu()) */
 		clear_bit(cpu, (unsigned long *)&cpu_initialized); /* was set by cpu_init() */
-		clear_node_cpumask(cpu); /* was set by numa_add_cpu */
+		clear_bit(cpu, (unsigned long *)&node_to_cpumask_map[cpu_to_node(cpu)]);
 		cpu_clear(cpu, cpu_present_map);
 		cpu_clear(cpu, cpu_possible_map);
 		per_cpu(x86_cpu_to_apicid, cpu) = BAD_APICID;
@@ -851,28 +851,6 @@ static int __init smp_sanity_check(unsig
 	return 0;
 }
 
-/*
- * Copy data used in early init routines from the initial arrays to the
- * per cpu data areas.  These arrays then become expendable and the
- * *_ptrs are zeroed indicating that the static arrays are gone.
- */
-void __init smp_set_apicids(void)
-{
-	int cpu;
-
-	for_each_possible_cpu(cpu) {
-		if (per_cpu_offset(cpu))
-			per_cpu(x86_cpu_to_apicid, cpu) =
-						x86_cpu_to_apicid_init[cpu];
-		else
-			printk(KERN_NOTICE "per_cpu_offset zero for cpu %d\n",
-									cpu);
-	}
-
-	/* indicate the early static arrays are gone */
-	x86_cpu_to_apicid_early_ptr = NULL;
-}
-
 static void __init smp_cpu_index_default(void)
 {
 	int i;
@@ -895,7 +873,6 @@ void __init smp_prepare_cpus(unsigned in
 	smp_cpu_index_default();
 	current_cpu_data = boot_cpu_data;
 	current_thread_info()->cpu = 0;  /* needed? */
-	smp_set_apicids();
 	set_cpu_sibling_map(0);
 
 	if (smp_sanity_check(max_cpus) < 0) {
@@ -1049,7 +1026,7 @@ void remove_cpu_from_maps(void)
 	cpu_clear(cpu, cpu_callout_map);
 	cpu_clear(cpu, cpu_callin_map);
 	clear_bit(cpu, (unsigned long *)&cpu_initialized); /* was set by cpu_init() */
-	clear_node_cpumask(cpu);
+	clear_bit(cpu, (unsigned long *)&node_to_cpumask_map[cpu_to_node(cpu)]);
 }
 
 int __cpu_disable(void)
--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -31,10 +31,14 @@ bootmem_data_t plat_node_bdata[MAX_NUMNO
 
 struct memnode memnode;
 
-u16 cpu_to_node_map[NR_CPUS] __read_mostly = {
+u16 x86_cpu_to_node_map_init[NR_CPUS] __initdata = {
 	[0 ... NR_CPUS-1] = NUMA_NO_NODE
 };
-EXPORT_SYMBOL(cpu_to_node_map);
+void *x86_cpu_to_node_map_early_ptr;
+EXPORT_SYMBOL(x86_cpu_to_node_map_init);
+EXPORT_SYMBOL(x86_cpu_to_node_map_early_ptr);
+DEFINE_PER_CPU(u16, x86_cpu_to_node_map) = NUMA_NO_NODE;
+EXPORT_PER_CPU_SYMBOL(x86_cpu_to_node_map);
 
 u16 apicid_to_node[MAX_LOCAL_APIC] __cpuinitdata = {
 	[0 ... MAX_LOCAL_APIC-1] = NUMA_NO_NODE
@@ -545,7 +549,7 @@ void __init numa_initmem_init(unsigned l
 	node_set(0, node_possible_map);
 	for (i = 0; i < NR_CPUS; i++)
 		numa_set_node(i, 0);
-	/* we can't use cpumask_of_cpu() yet */
+	/* cpumask_of_cpu() may not be available during early startup */
 	memset(&node_to_cpumask_map[0], 0, sizeof(node_to_cpumask_map[0]));
 	cpu_set(0, node_to_cpumask_map[0]);
 	e820_register_active_regions(0, start_pfn, end_pfn);
@@ -559,8 +563,16 @@ __cpuinit void numa_add_cpu(int cpu)
 
 void __cpuinit numa_set_node(int cpu, int node)
 {
+	u16 *cpu_to_node_map = x86_cpu_to_node_map_early_ptr;
+
 	cpu_pda(cpu)->nodenumber = node;
-	cpu_to_node_map[cpu] = node;
+
+	if(cpu_to_node_map)
+		cpu_to_node_map[cpu] = node;
+	else if(per_cpu_offset(cpu))
+		per_cpu(x86_cpu_to_node_map, cpu) = node;
+	else
+		Dprintk(KERN_INFO "Setting node for non-present cpu %d\n", cpu);
 }
 
 unsigned long __init numa_free_all_bootmem(void)
--- a/arch/x86/mm/srat_64.c
+++ b/arch/x86/mm/srat_64.c
@@ -382,9 +382,10 @@ int __init acpi_scan_nodes(unsigned long
 			setup_node_bootmem(i, nodes[i].start, nodes[i].end);
 
 	for (i = 0; i < NR_CPUS; i++) {
-		if (cpu_to_node(i) == NUMA_NO_NODE)
+		int node = cpu_to_node(i);
+		if (node == NUMA_NO_NODE)
 			continue;
-		if (!node_isset(cpu_to_node(i), node_possible_map))
+		if (!node_isset(node, node_possible_map))
 			numa_set_node(i, NUMA_NO_NODE);
 	}
 	numa_init_array();
--- a/include/asm-x86/numa_64.h
+++ b/include/asm-x86/numa_64.h
@@ -29,17 +29,8 @@ extern void setup_node_bootmem(int nodei
 
 #ifdef CONFIG_NUMA
 extern void __init init_cpu_to_node(void);
-
-static inline void clear_node_cpumask(int cpu)
-{
-	clear_bit(cpu, (unsigned long *)&node_to_cpumask_map[cpu_to_node(cpu)]);
-}
-
 #else
 #define init_cpu_to_node() do {} while (0)
-#define clear_node_cpumask(cpu) do {} while (0)
 #endif
 
-#define NUMA_NO_NODE 0xffff
-
 #endif
--- a/include/asm-x86/topology.h
+++ b/include/asm-x86/topology.h
@@ -30,13 +30,32 @@
 #include <asm/mpspec.h>
 
 /* Mappings between logical cpu number and node number */
-extern u16 cpu_to_node_map[];
+DECLARE_PER_CPU(u16, x86_cpu_to_node_map);
+extern u16 __initdata x86_cpu_to_node_map_init[];
+extern void *x86_cpu_to_node_map_early_ptr;
 extern cpumask_t node_to_cpumask_map[];
 
+#define NUMA_NO_NODE	((u16)(~0))
+
 /* Returns the number of the node containing CPU 'cpu' */
+static inline int early_cpu_to_node(int cpu)
+{
+	u16 *cpu_to_node_map = x86_cpu_to_node_map_early_ptr;
+
+	if (cpu_to_node_map)
+		return cpu_to_node_map[cpu];
+	else if(per_cpu_offset(cpu))
+		return per_cpu(x86_cpu_to_node_map, cpu);
+	else
+		return NUMA_NO_NODE;
+}
+
 static inline int cpu_to_node(int cpu)
 {
-	return cpu_to_node_map[cpu];
+	if(per_cpu_offset(cpu))
+		return per_cpu(x86_cpu_to_node_map, cpu);
+	else
+		return NUMA_NO_NODE;
 }
 
 /*
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1783,7 +1783,7 @@ EXPORT_SYMBOL(free_pages);
 static unsigned int nr_free_zone_pages(int offset)
 {
 	/* Just pick one node, since fallback list is circular */
-	pg_data_t *pgdat = NODE_DATA(numa_node_id());
+	pg_data_t *pgdat = NODE_DATA(cpu_to_node(raw_smp_processor_id()));
 	unsigned int sum = 0;
 
 	struct zonelist *zonelist = pgdat->node_zonelists + offset;
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -18,6 +18,7 @@
 #include <linux/mm.h>
 #include <linux/interrupt.h>
 #include <linux/module.h>
+#include <linux/sched.h>
 
 #include <linux/sunrpc/types.h>
 #include <linux/sunrpc/xdr.h>

-- 


* [PATCH 09/10] x86: Change NR_CPUS arrays in acpi-cpufreq V3
  2008-01-16 17:09 [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs V3 travis
                   ` (7 preceding siblings ...)
  2008-01-16 17:09 ` [PATCH 08/10] x86: Change NR_CPUS arrays in numa_64 V3 travis
@ 2008-01-16 17:09 ` travis
  2008-01-16 17:09 ` [PATCH 10/10] x86: Change bios_cpu_apicid to percpu data variable V3 travis
  2008-01-16 18:01 ` [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs V3 Frans Pop
  10 siblings, 0 replies; 17+ messages in thread
From: travis @ 2008-01-16 17:09 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen, mingo
  Cc: Eric Dumazet, Christoph Lameter, Jack Steiner, linux-mm, linux-kernel

[-- Attachment #1: NR_CPUS-arrays-in-acpi-cpufreq --]
[-- Type: text/plain, Size: 3939 bytes --]

Change the following static arrays sized by NR_CPUS to
per_cpu data variables:

	acpi_cpufreq_data *drv_data[NR_CPUS]

Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
---
V1->V2:
    - (none)
---
 arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c |   25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

--- a/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
+++ b/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
@@ -67,7 +67,8 @@ struct acpi_cpufreq_data {
 	unsigned int cpu_feature;
 };
 
-static struct acpi_cpufreq_data *drv_data[NR_CPUS];
+static DEFINE_PER_CPU(struct acpi_cpufreq_data *, drv_data);
+
 /* acpi_perf_data is a pointer to percpu data. */
 static struct acpi_processor_performance *acpi_perf_data;
 
@@ -218,14 +219,14 @@ static u32 get_cur_val(cpumask_t mask)
 	if (unlikely(cpus_empty(mask)))
 		return 0;
 
-	switch (drv_data[first_cpu(mask)]->cpu_feature) {
+	switch (per_cpu(drv_data, first_cpu(mask))->cpu_feature) {
 	case SYSTEM_INTEL_MSR_CAPABLE:
 		cmd.type = SYSTEM_INTEL_MSR_CAPABLE;
 		cmd.addr.msr.reg = MSR_IA32_PERF_STATUS;
 		break;
 	case SYSTEM_IO_CAPABLE:
 		cmd.type = SYSTEM_IO_CAPABLE;
-		perf = drv_data[first_cpu(mask)]->acpi_data;
+		perf = per_cpu(drv_data, first_cpu(mask))->acpi_data;
 		cmd.addr.io.port = perf->control_register.address;
 		cmd.addr.io.bit_width = perf->control_register.bit_width;
 		break;
@@ -325,7 +326,7 @@ static unsigned int get_measured_perf(un
 
 #endif
 
-	retval = drv_data[cpu]->max_freq * perf_percent / 100;
+	retval = per_cpu(drv_data, cpu)->max_freq * perf_percent / 100;
 
 	put_cpu();
 	set_cpus_allowed(current, saved_mask);
@@ -336,7 +337,7 @@ static unsigned int get_measured_perf(un
 
 static unsigned int get_cur_freq_on_cpu(unsigned int cpu)
 {
-	struct acpi_cpufreq_data *data = drv_data[cpu];
+	struct acpi_cpufreq_data *data = per_cpu(drv_data, cpu);
 	unsigned int freq;
 
 	dprintk("get_cur_freq_on_cpu (%d)\n", cpu);
@@ -370,7 +371,7 @@ static unsigned int check_freqs(cpumask_
 static int acpi_cpufreq_target(struct cpufreq_policy *policy,
 			       unsigned int target_freq, unsigned int relation)
 {
-	struct acpi_cpufreq_data *data = drv_data[policy->cpu];
+	struct acpi_cpufreq_data *data = per_cpu(drv_data, policy->cpu);
 	struct acpi_processor_performance *perf;
 	struct cpufreq_freqs freqs;
 	cpumask_t online_policy_cpus;
@@ -466,7 +467,7 @@ static int acpi_cpufreq_target(struct cp
 
 static int acpi_cpufreq_verify(struct cpufreq_policy *policy)
 {
-	struct acpi_cpufreq_data *data = drv_data[policy->cpu];
+	struct acpi_cpufreq_data *data = per_cpu(drv_data, policy->cpu);
 
 	dprintk("acpi_cpufreq_verify\n");
 
@@ -570,7 +571,7 @@ static int acpi_cpufreq_cpu_init(struct 
 		return -ENOMEM;
 
 	data->acpi_data = percpu_ptr(acpi_perf_data, cpu);
-	drv_data[cpu] = data;
+	per_cpu(drv_data, cpu) = data;
 
 	if (cpu_has(c, X86_FEATURE_CONSTANT_TSC))
 		acpi_cpufreq_driver.flags |= CPUFREQ_CONST_LOOPS;
@@ -714,20 +715,20 @@ err_unreg:
 	acpi_processor_unregister_performance(perf, cpu);
 err_free:
 	kfree(data);
-	drv_data[cpu] = NULL;
+	per_cpu(drv_data, cpu) = NULL;
 
 	return result;
 }
 
 static int acpi_cpufreq_cpu_exit(struct cpufreq_policy *policy)
 {
-	struct acpi_cpufreq_data *data = drv_data[policy->cpu];
+	struct acpi_cpufreq_data *data = per_cpu(drv_data, policy->cpu);
 
 	dprintk("acpi_cpufreq_cpu_exit\n");
 
 	if (data) {
 		cpufreq_frequency_table_put_attr(policy->cpu);
-		drv_data[policy->cpu] = NULL;
+		per_cpu(drv_data, policy->cpu) = NULL;
 		acpi_processor_unregister_performance(data->acpi_data,
 						      policy->cpu);
 		kfree(data);
@@ -738,7 +739,7 @@ static int acpi_cpufreq_cpu_exit(struct 
 
 static int acpi_cpufreq_resume(struct cpufreq_policy *policy)
 {
-	struct acpi_cpufreq_data *data = drv_data[policy->cpu];
+	struct acpi_cpufreq_data *data = per_cpu(drv_data, policy->cpu);
 
 	dprintk("acpi_cpufreq_resume\n");
 

-- 


* [PATCH 10/10] x86: Change bios_cpu_apicid to percpu data variable V3
  2008-01-16 17:09 [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs V3 travis
                   ` (8 preceding siblings ...)
  2008-01-16 17:09 ` [PATCH 09/10] x86: Change NR_CPUS arrays in acpi-cpufreq V3 travis
@ 2008-01-16 17:09 ` travis
  2008-01-16 18:01 ` [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs V3 Frans Pop
  10 siblings, 0 replies; 17+ messages in thread
From: travis @ 2008-01-16 17:09 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen, mingo
  Cc: Eric Dumazet, Christoph Lameter, Jack Steiner, linux-mm, linux-kernel

[-- Attachment #1: change-bios_cpu_apicid-to-percpu --]
[-- Type: text/plain, Size: 5402 bytes --]

Change the static bios_cpu_apicid array to a per_cpu data variable.
This includes using a static array during early initialization,
similar to the way x86_cpu_to_apicid[] is handled.

There is one early use of bios_cpu_apicid, in apic_is_clustered_box().
The other reference, in cpu_present_to_apicid(), is called after
smp_set_apicids() has set up the percpu version of bios_cpu_apicid.


Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
---
V1->V2:
    - Removed extraneous casts
    - Add slight optimization to apic_is_clustered_box()
      [don't reference x86_bios_cpu_apicid_early_ptr each pass.]
---
 arch/x86/kernel/apic_64.c    |   17 +++++++++++++++--
 arch/x86/kernel/mpparse_64.c |   12 +++++++++---
 arch/x86/kernel/setup64.c    |    3 +++
 arch/x86/kernel/setup_64.c   |    1 +
 include/asm-x86/smp_64.h     |    8 +++++---
 5 files changed, 33 insertions(+), 8 deletions(-)

--- a/arch/x86/kernel/apic_64.c
+++ b/arch/x86/kernel/apic_64.c
@@ -1150,19 +1150,32 @@ __cpuinit int apic_is_clustered_box(void
 {
 	int i, clusters, zeros;
 	unsigned id;
+	u16 *bios_cpu_apicid = x86_bios_cpu_apicid_early_ptr;
 	DECLARE_BITMAP(clustermap, NUM_APIC_CLUSTERS);
 
 	bitmap_zero(clustermap, NUM_APIC_CLUSTERS);
 
 	for (i = 0; i < NR_CPUS; i++) {
-		id = bios_cpu_apicid[i];
+		/* are we being called early in kernel startup? */
+		if (bios_cpu_apicid) {
+			id = bios_cpu_apicid[i];
+		}
+		else if (i < nr_cpu_ids) {
+			if (cpu_present(i))
+				id = per_cpu(x86_bios_cpu_apicid, i);
+			else
+				continue;
+		}
+		else
+			break;
+
 		if (id != BAD_APICID)
 			__set_bit(APIC_CLUSTERID(id), clustermap);
 	}
 
 	/* Problem:  Partially populated chassis may not have CPUs in some of
 	 * the APIC clusters they have been allocated.  Only present CPUs have
-	 * bios_cpu_apicid entries, thus causing zeroes in the bitmap.  Since
+	 * x86_bios_cpu_apicid entries, thus causing zeroes in the bitmap.  Since
 	 * clusters are allocated sequentially, count zeros only if they are
 	 * bounded by ones.
 	 */
--- a/arch/x86/kernel/mpparse_64.c
+++ b/arch/x86/kernel/mpparse_64.c
@@ -67,7 +67,11 @@ unsigned disabled_cpus __cpuinitdata;
 /* Bitmask of physically existing CPUs */
 physid_mask_t phys_cpu_present_map = PHYSID_MASK_NONE;
 
-u16 bios_cpu_apicid[NR_CPUS] = { [0 ... NR_CPUS-1] = BAD_APICID };
+u16 x86_bios_cpu_apicid_init[NR_CPUS] __initdata
+				= { [0 ... NR_CPUS-1] = BAD_APICID };
+void *x86_bios_cpu_apicid_early_ptr;
+DEFINE_PER_CPU(u16, x86_bios_cpu_apicid) = BAD_APICID;
+EXPORT_PER_CPU_SYMBOL(x86_bios_cpu_apicid);
 
 
 /*
@@ -118,20 +122,22 @@ static void __cpuinit MP_processor_info(
 	physid_set(m->mpc_apicid, phys_cpu_present_map);
  	if (m->mpc_cpuflag & CPU_BOOTPROCESSOR) {
  		/*
- 		 * bios_cpu_apicid is required to have processors listed
+		 * x86_bios_cpu_apicid is required to have processors listed
  		 * in same order as logical cpu numbers. Hence the first
  		 * entry is BSP, and so on.
  		 */
 		cpu = 0;
  	}
-	bios_cpu_apicid[cpu] = m->mpc_apicid;
 	/* are we being called early in kernel startup? */
 	if (x86_cpu_to_apicid_early_ptr) {
 		u16 *cpu_to_apicid = x86_cpu_to_apicid_early_ptr;
+		u16 *bios_cpu_apicid = x86_bios_cpu_apicid_early_ptr;
 
 		cpu_to_apicid[cpu] = m->mpc_apicid;
+		bios_cpu_apicid[cpu] = m->mpc_apicid;
 	} else {
 		per_cpu(x86_cpu_to_apicid, cpu) = m->mpc_apicid;
+		per_cpu(x86_bios_cpu_apicid, cpu) = m->mpc_apicid;
 	}
 
 	cpu_set(cpu, cpu_possible_map);
--- a/arch/x86/kernel/setup64.c
+++ b/arch/x86/kernel/setup64.c
@@ -92,6 +92,8 @@ void __init setup_percpu_maps(void)
 		if (per_cpu_offset(cpu)) {
 			per_cpu(x86_cpu_to_apicid, cpu) =
 						x86_cpu_to_apicid_init[cpu];
+			per_cpu(x86_bios_cpu_apicid, cpu) =
+						x86_bios_cpu_apicid_init[cpu];
 #ifdef CONFIG_NUMA
 			per_cpu(x86_cpu_to_node_map, cpu) =
 						x86_cpu_to_node_map_init[cpu];
@@ -104,6 +106,7 @@ void __init setup_percpu_maps(void)
 
 	/* indicate the early static arrays are gone */
 	x86_cpu_to_apicid_early_ptr = NULL;
+	x86_bios_cpu_apicid_early_ptr = NULL;
 #ifdef CONFIG_NUMA
 	x86_cpu_to_node_map_early_ptr = NULL;
 #endif
--- a/arch/x86/kernel/setup_64.c
+++ b/arch/x86/kernel/setup_64.c
@@ -375,6 +375,7 @@ void __init setup_arch(char **cmdline_p)
 #ifdef CONFIG_SMP
 	/* setup to use the early static init tables during kernel startup */
 	x86_cpu_to_apicid_early_ptr = (void *)&x86_cpu_to_apicid_init;
+	x86_bios_cpu_apicid_early_ptr = (void *)&x86_bios_cpu_apicid_init;
 #ifdef CONFIG_NUMA
 	x86_cpu_to_node_map_early_ptr = (void *)&x86_cpu_to_node_map_init;
 #endif
--- a/include/asm-x86/smp_64.h
+++ b/include/asm-x86/smp_64.h
@@ -27,18 +27,20 @@ extern int smp_call_function_mask(cpumas
 				  void *info, int wait);
 
 extern u16 __initdata x86_cpu_to_apicid_init[];
+extern u16 __initdata x86_bios_cpu_apicid_init[];
 extern void *x86_cpu_to_apicid_early_ptr;
-extern u16 bios_cpu_apicid[];
+extern void *x86_bios_cpu_apicid_early_ptr;
 
 DECLARE_PER_CPU(cpumask_t, cpu_sibling_map);
 DECLARE_PER_CPU(cpumask_t, cpu_core_map);
 DECLARE_PER_CPU(u16, cpu_llc_id);
 DECLARE_PER_CPU(u16, x86_cpu_to_apicid);
+DECLARE_PER_CPU(u16, x86_bios_cpu_apicid);
 
 static inline int cpu_present_to_apicid(int mps_cpu)
 {
-	if (mps_cpu < NR_CPUS)
-		return (int)bios_cpu_apicid[mps_cpu];
+	if (cpu_present(mps_cpu))
+		return (int)per_cpu(x86_bios_cpu_apicid, mps_cpu);
 	else
 		return BAD_APICID;
 }
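
For anyone who wants to poke at the control flow outside the kernel,
here is a minimal userspace sketch of the early-pointer / per-cpu
pattern the patch relies on.  The names, plain arrays and NR_CPUS
value are illustrative stand-ins for the real __initdata table and
DEFINE_PER_CPU variable, so treat it as a sketch rather than the
kernel code; it builds with gcc, which provides the range-initializer
extension used below.

#include <stdio.h>

#define NR_CPUS    8
#define BAD_APICID 0xFFFFu

/* Early table; in the kernel this is __initdata and discarded after boot. */
static unsigned short bios_apicid_init[NR_CPUS] = {
        [0 ... NR_CPUS - 1] = BAD_APICID  /* GCC range initializer, as in the patch */
};

/* Stand-in for DEFINE_PER_CPU(u16, x86_bios_cpu_apicid). */
static unsigned short bios_apicid_percpu[NR_CPUS];

/* Non-NULL only while the early table is still the live copy. */
static unsigned short *bios_apicid_early_ptr = bios_apicid_init;

/* Writer used by both early and late callers (cf. MP_processor_info). */
static void record_apicid(int cpu, unsigned short apicid)
{
        if (bios_apicid_early_ptr)              /* early in kernel startup? */
                bios_apicid_early_ptr[cpu] = apicid;
        else
                bios_apicid_percpu[cpu] = apicid;
}

/* Late setup: copy the early values into "per-cpu" storage, drop the pointer. */
static void setup_percpu_maps(void)
{
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
                bios_apicid_percpu[cpu] = bios_apicid_init[cpu];
        bios_apicid_early_ptr = NULL;           /* early static array is now gone */
}

int main(void)
{
        record_apicid(0, 0x10);                 /* early write lands in the init table */
        setup_percpu_maps();
        record_apicid(1, 0x11);                 /* late write lands in per-cpu storage */

        printf("cpu0 apicid 0x%x, cpu1 apicid 0x%x\n",
               (unsigned int)bios_apicid_percpu[0],
               (unsigned int)bios_apicid_percpu[1]);
        return 0;
}

Readers such as apic_is_clustered_box() follow the same shape: use the
early table while the pointer is still set, otherwise fall back to the
per-cpu copy.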

-- 


* Re: [PATCH 02/10] x86: Change size of node ids from u8 to u16 V3
  2008-01-16 17:09 ` [PATCH 02/10] x86: Change size of node ids " travis
@ 2008-01-16 17:53   ` Eric Dumazet
  2008-01-16 18:10     ` Mike Travis
  2008-01-16 19:16     ` Mike Travis
  0 siblings, 2 replies; 17+ messages in thread
From: Eric Dumazet @ 2008-01-16 17:53 UTC (permalink / raw)
  To: travis
  Cc: Andrew Morton, Andi Kleen, mingo, Christoph Lameter,
	Jack Steiner, linux-mm, linux-kernel

On Wed, 16 Jan 2008 09:09:04 -0800
travis@sgi.com wrote:

> Change the size of node ids from 8 bits to 16 bits to
> accommodate more than 256 nodes.
> 
> Signed-off-by: Mike Travis <travis@sgi.com>
> Reviewed-by: Christoph Lameter <clameter@sgi.com>
> ---
> V1->V2:
>     - changed pxm_to_node_map to u16
>     - changed memnode map entries to u16
> V2->V3:
>     - changed memnode.embedded_map from [64-16] to [64-8]
>       (and size comment to 128 bytes)
> ---
>  arch/x86/mm/numa_64.c       |    9 ++++++---
>  arch/x86/mm/srat_64.c       |    2 +-
>  drivers/acpi/numa.c         |    2 +-
>  include/asm-x86/mmzone_64.h |    6 +++---
>  include/asm-x86/numa_64.h   |    4 ++--
>  include/asm-x86/topology.h  |    2 +-
>  6 files changed, 14 insertions(+), 11 deletions(-)

I know new typedefs are not welcome, but in this case, it could be nice
to define a fundamental type node_t (like pte_t, pmd_t, pgd_t, ...).

Clean NUMA code deserves it. 

#if MAX_NUMNODES > 256
typedef u16 node_t;
#else
typedef u8 node_t;
#endif

In 2016, we can add u32 for MAX_NUMNODES > 65536
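
As a quick standalone illustration of the width selection (the
MAX_NUMNODES value and the NUMA_NO_NODE definition below are only for
the demo, with uint8_t/uint16_t standing in for the kernel's u8/u16):
both the storage width and the "no node" sentinel track the typedef,
so call sites never hard-code the size.

#include <stdio.h>
#include <stdint.h>

#define MAX_NUMNODES 1024                       /* illustrative large-node config */

#if MAX_NUMNODES > 256
typedef uint16_t node_t;
#else
typedef uint8_t node_t;
#endif

#define NUMA_NO_NODE ((node_t)~0)               /* widens automatically with node_t */

int main(void)
{
        printf("node_t is %zu byte(s), NUMA_NO_NODE = %u\n",
               sizeof(node_t), (unsigned int)NUMA_NO_NODE);
        return 0;
}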

Another point: you'll want this change; sorry if my previous mail was not detailed enough:

--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -78,7 +78,7 @@ static int __init allocate_cachealigned_memnodemap(void)
        unsigned long pad, pad_addr;
 
        memnodemap = memnode.embedded_map;
-       if (memnodemapsize <= 48)
+       if (memnodemapsize <= ARRAY_SIZE(memnode.embedded_map))
                return 0;
 
        pad = L1_CACHE_BYTES - 1;
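
To make the ARRAY_SIZE point concrete, a small standalone sketch; the
struct layout is illustrative and only the embedded_map dimension
follows the V3 changelog.  The hard-coded 48 silently went stale when
the element type changed, whereas ARRAY_SIZE stays tied to the
declaration.

#include <stdio.h>
#include <stdint.h>

#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))  /* same shape as the kernel macro */

/* Rough stand-in for struct memnode; only embedded_map matters here. */
struct memnode_sketch {
        int shift;
        uint16_t embedded_map[64 - 8];
};

int main(void)
{
        struct memnode_sketch m;

        /* 56 u16 entries now, not the 48 u8 entries the old check assumed. */
        printf("embedded_map holds %zu entries (%zu bytes)\n",
               ARRAY_SIZE(m.embedded_map), sizeof(m.embedded_map));
        return 0;
}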


Thanks


* Re: [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs V3
  2008-01-16 17:09 [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs V3 travis
                   ` (9 preceding siblings ...)
  2008-01-16 17:09 ` [PATCH 10/10] x86: Change bios_cpu_apicid to percpu data variable V3 travis
@ 2008-01-16 18:01 ` Frans Pop
  2008-01-16 18:14   ` Mike Travis
  10 siblings, 1 reply; 17+ messages in thread
From: Frans Pop @ 2008-01-16 18:01 UTC (permalink / raw)
  To: travis; +Cc: ak, akpm, clameter, dada1, linux-kernel, linux-mm, mingo, steiner

travis@sgi.com wrote:
>    8472457 Total          30486950 +259%          30342823 +258%

Hmmm. The table for previous versions looked a lot more impressive.

now:    8472457 Total	 +22014493 +259%	 +21870366 +258%
V2 :    7172678 Total    +23314404 +325%           -147590   -2%
(recalculated for comparison)

Did something go wrong with the "after" data?


* Re: [PATCH 02/10] x86: Change size of node ids from u8 to u16 V3
  2008-01-16 17:53   ` Eric Dumazet
@ 2008-01-16 18:10     ` Mike Travis
  2008-01-16 19:16     ` Mike Travis
  1 sibling, 0 replies; 17+ messages in thread
From: Mike Travis @ 2008-01-16 18:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Andi Kleen, mingo, Christoph Lameter,
	Jack Steiner, linux-mm, linux-kernel

Eric Dumazet wrote:
> On Wed, 16 Jan 2008 09:09:04 -0800
> travis@sgi.com wrote:
> 
>> Change the size of node ids from 8 bits to 16 bits to
>> accommodate more than 256 nodes.
>>
>> Signed-off-by: Mike Travis <travis@sgi.com>
>> Reviewed-by: Christoph Lameter <clameter@sgi.com>
>> ---
>> V1->V2:
>>     - changed pxm_to_node_map to u16
>>     - changed memnode map entries to u16
>> V2->V3:
>>     - changed memnode.embedded_map from [64-16] to [64-8]
>>       (and size comment to 128 bytes)
>> ---
>>  arch/x86/mm/numa_64.c       |    9 ++++++---
>>  arch/x86/mm/srat_64.c       |    2 +-
>>  drivers/acpi/numa.c         |    2 +-
>>  include/asm-x86/mmzone_64.h |    6 +++---
>>  include/asm-x86/numa_64.h   |    4 ++--
>>  include/asm-x86/topology.h  |    2 +-
>>  6 files changed, 14 insertions(+), 11 deletions(-)
> 
> I know new typedefs are not welcome, but in this case, it could be nice
> to define a fundamental type node_t (like pte_t, pmd_t, pgd_t, ...).
> 
> Clean NUMA code deserves it. 
> 
> #if MAX_NUMNODES > 256
> typedef u16 node_t;
> #else
> typedef u8 node_t;
> #endif
> 

Funny, I had this in originally and someone suggested that it was
superfluous. ;-)  But I agree, though I had called it numanode_t.
Even a cpu_t for the size of the cpu index could be useful.

I'll wait for other opinions.

> In 2016, we can add u32 for MAX_NUMNODES > 65536

Probably 2010 is closer... ;-)

> 
> Another point: you'll want this change; sorry if my previous mail was not detailed enough:
> 
> --- a/arch/x86/mm/numa_64.c
> +++ b/arch/x86/mm/numa_64.c
> @@ -78,7 +78,7 @@ static int __init allocate_cachealigned_memnodemap(void)
>         unsigned long pad, pad_addr;
>  
>         memnodemap = memnode.embedded_map;
> -       if (memnodemapsize <= 48)
> +       if (memnodemapsize <= ARRAY_SIZE(memnode.embedded_map))
>                 return 0;
>  
>         pad = L1_CACHE_BYTES - 1;
> 
> 
> Thanks

Thanks!  This hash lookup is still a bit of a mystery to me.

I'll submit a 'fixup' patch momentarily, also removing:

--- linux.orig/arch/x86/mm/numa_64.c    2008-01-16 08:21:00.000000000 -0800
+++ linux/arch/x86/mm/numa_64.c 2008-01-16 09:57:27.168691249 -0800
@@ -35,8 +35,6 @@ u16 x86_cpu_to_node_map_init[NR_CPUS] __
        [0 ... NR_CPUS-1] = NUMA_NO_NODE
 };
 void *x86_cpu_to_node_map_early_ptr;
-EXPORT_SYMBOL(x86_cpu_to_node_map_init);
-EXPORT_SYMBOL(x86_cpu_to_node_map_early_ptr);
 DEFINE_PER_CPU(u16, x86_cpu_to_node_map) = NUMA_NO_NODE;
 EXPORT_PER_CPU_SYMBOL(x86_cpu_to_node_map);

... to avoid section mismatches.
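
(For context on the warning: EXPORT_SYMBOL emits a reference from
__ksymtab, which is not an init section, to the __initdata
x86_cpu_to_node_map_init array, and that cross-section reference is
what modpost reports as a section mismatch; dropping the exports
removes it.)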

Thanks,
Mike




* Re: [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs V3
  2008-01-16 18:01 ` [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs V3 Frans Pop
@ 2008-01-16 18:14   ` Mike Travis
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Travis @ 2008-01-16 18:14 UTC (permalink / raw)
  To: Frans Pop
  Cc: ak, akpm, clameter, dada1, linux-kernel, linux-mm, mingo, steiner

Frans Pop wrote:
> travis@sgi.com wrote:
>>    8472457 Total          30486950 +259%          30342823 +258%
> 
> Hmmm. The table for previous versions looked a lot more impressive.
> 
> now:    8472457 Total	 +22014493 +259%	 +21870366 +258%
> V2 :    7172678 Total    +23314404 +325%           -147590   -2%
> (recalculated for comparison)
> 
> Did something go wrong with the "after" data?

The previous version showed each column's difference from the
previous column.  The new one shows each column's difference from
the first column.  (Also, there are differences in what "after"
means. ;-)

Here's the "each-relative" version.  It sort of depends on whether
you want to see each incremental change or the net overall change.
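
As a worked check against the .bss row below: relative to the
866632-byte base, the 1kcpus-before column is +1393664 (+160%), and
the 1kcpus-after column gives back 147328 of that, which is only -6%
of the intermediate 2260296 bytes even though the net change from the
original base is still well above +140%.  The earlier table reported
the net figures; this one reports the incremental ones.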

Thanks,
Mike
--- 

128cpus                 1kcpus-before           1kcpus-after
       228 .altinstr_re         +0 +0%                  +0 +0%
      1219 .altinstruct         +0 +0%                  +0 +0%
    866632 .bss           +1393664 +160%           -147328 -6%
     61374 .comment             +0 +0%                  +0 +0%
        16 .con_initcal         +0 +0%                  +0 +0%
    427560 .data            +17920 +4%               -1280 +0%
   1165824 .data.cachel  +11911168 +1021%               +0 +0%
      8192 .data.init_t         +0 +0%                  +0 +0%
      4096 .data.page_a         +0 +0%                  +0 +0%
     39296 .data.percpu    +116480 +296%              +128 +0%
    188992 .data.read_m   +8562784 +4530%            -4096 +0%
         4 .data_nosave         +0 +0%                  +0 +0%
      5141 .exit.text           +9 +0%                  -1 +0%
    138576 .init.data         +896 +0%               +5952 +4%
       134 .init.ramfs          +0 +0%                  +0 +0%
      3192 .init.setup          +0 +0%                  +0 +0%
    160143 .init.text         +500 +0%                +271 +0%
      2288 .initcall.in         +0 +0%                  +0 +0%
         8 .jiffies             +0 +0%                  +0 +0%
      4512 .pci_fixup           +0 +0%                  +0 +0%
   1314318 .rodata           +1312 +0%                -325 +0%
     36856 .smp_locks          -48 +0%                  -8 +0%
   3975021 .text             +9808 +0%               +2560 +0%
      3368 .vdso                +0 +0%                  +0 +0%
         4 .vgetcpu_mod         +0 +0%                  +0 +0%
       218 .vsyscall_0          +0 +0%                  +0 +0%
        52 .vsyscall_1          +0 +0%                  +0 +0%
        91 .vsyscall_2          +0 +0%                  +0 +0%
         8 .vsyscall_3          +0 +0%                  +0 +0%
        54 .vsyscall_fn         +0 +0%                  +0 +0%
        80 .vsyscall_gt         +0 +0%                  +0 +0%
     39480 __bug_table          +0 +0%                  +0 +0%
     16320 __ex_table           +0 +0%                  +0 +0%
      9160 __param              +0 +0%                  +0 +0%

   1818299 Text             +16304 +0%                +256 +0%
   3975021 Data              +9808 +0%               +2560 +0%
    866632 Bss            +1393664 +160%           -147328 -6%
    360448 InitData        +122880 +34%              +4096 +0%
   1415640 OtherData     +20470272 +1446%            -2816 +0%
     39296 PerCpu          +116480 +296%            +128 +0%
   8472457 Total         +22014493 +259%           -144127 +0%


* Re: [PATCH 02/10] x86: Change size of node ids from u8 to u16 V3
  2008-01-16 17:53   ` Eric Dumazet
  2008-01-16 18:10     ` Mike Travis
@ 2008-01-16 19:16     ` Mike Travis
  2008-01-16 20:37       ` Eric Dumazet
  1 sibling, 1 reply; 17+ messages in thread
From: Mike Travis @ 2008-01-16 19:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Andi Kleen, mingo, Christoph Lameter,
	Jack Steiner, linux-mm, linux-kernel

>
> Another point: you'll want this change; sorry if my previous mail was not detailed enough:
>
> --- a/arch/x86/mm/numa_64.c
> +++ b/arch/x86/mm/numa_64.c
> @@ -78,7 +78,7 @@ static int __init allocate_cachealigned_memnodemap(void)
>         unsigned long pad, pad_addr;
>
>         memnodemap = memnode.embedded_map;
> -       if (memnodemapsize <= 48)
> +       if (memnodemapsize <= ARRAY_SIZE(memnode.embedded_map))
>                 return 0;
>
>         pad = L1_CACHE_BYTES - 1;
>

Hi Eric,

I'm still getting this message with the numa=fake=4 start option:

Faking node 0 at 0000000000000000-0000000028000000 (640MB)
Faking node 1 at 0000000028000000-0000000050000000 (640MB)
Faking node 2 at 0000000050000000-0000000078000000 (640MB)
Faking node 3 at 0000000078000000-000000009ff00000 (639MB)

NUMA: Using 27 for the hash shift.
Your memory is not aligned you need to rebuild your kernel
with a bigger NODEMAPSIZE shift=27 No NUMA hash function found.
NUMA emulation disabled.

Is there something else I need to change?  (This is on an AMD box.)

Thanks!
Mike




* Re: [PATCH 02/10] x86: Change size of node ids from u8 to u16 V3
  2008-01-16 19:16     ` Mike Travis
@ 2008-01-16 20:37       ` Eric Dumazet
  0 siblings, 0 replies; 17+ messages in thread
From: Eric Dumazet @ 2008-01-16 20:37 UTC (permalink / raw)
  To: Mike Travis
  Cc: Andrew Morton, Andi Kleen, mingo, Christoph Lameter,
	Jack Steiner, linux-mm, linux-kernel

Mike Travis wrote:
>> Another point: you'll want this change; sorry if my previous mail was not detailed enough:
>>
>> --- a/arch/x86/mm/numa_64.c
>> +++ b/arch/x86/mm/numa_64.c
>> @@ -78,7 +78,7 @@ static int __init allocate_cachealigned_memnodemap(void)
>>         unsigned long pad, pad_addr;
>>
>>         memnodemap = memnode.embedded_map;
>> -       if (memnodemapsize <= 48)
>> +       if (memnodemapsize <= ARRAY_SIZE(memnode.embedded_map))
>>                 return 0;
>>
>>         pad = L1_CACHE_BYTES - 1;
>>
> 
> Hi Eric,
> 
> I'm still getting this message with the numa=fake=4 start option:
> 
> Faking node 0 at 0000000000000000-0000000028000000 (640MB)
> Faking node 1 at 0000000028000000-0000000050000000 (640MB)
> Faking node 2 at 0000000050000000-0000000078000000 (640MB)
> Faking node 3 at 0000000078000000-000000009ff00000 (639MB)
> 
> NUMA: Using 27 for the hash shift.
> Your memory is not aligned you need to rebuild your kernel
> with a bigger NODEMAPSIZE shift=27 No NUMA hash function found.
> NUMA emulation disabled.
> 
> Is there something else I need to change?  (This is on an AMD box.)
> 

Sure, check populate_memnodemap() in arch/x86/mm/numa_64.c


--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -54,7 +54,7 @@ populate_memnodemap(const struct bootnode *nodes, int numnodes, int shift)
         int res = -1;
         unsigned long addr, end;

-       memset(memnodemap, 0xff, memnodemapsize);
+       memset(memnodemap, 0xff, sizeof(u16)*memnodemapsize);
         for (i = 0; i < numnodes; i++) {
                 addr = nodes[i].start;
                 end = nodes[i].end;
@@ -63,7 +63,7 @@ populate_memnodemap(const struct bootnode *nodes, int numnodes, int shift)
                 if ((end >> shift) >= memnodemapsize)
                         return 0;
                 do {
-                       if (memnodemap[addr >> shift] != 0xff)
+                       if (memnodemap[addr >> shift] != NUMA_NO_NODE)
                                 return -1;
                         memnodemap[addr >> shift] = i;
                         addr += (1UL << shift);
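
A standalone illustration of why the size argument matters once the
entries are 16-bit (names and sizes here are illustrative, not the
kernel's): memset() counts bytes, so passing the element count only
initializes the first half of a u16 array, and the untouched half
never reads back as NUMA_NO_NODE.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define MAPSIZE      8
#define NUMA_NO_NODE 0xFFFFu

int main(void)
{
        uint16_t map[MAPSIZE] = { 0 };

        memset(map, 0xff, MAPSIZE);             /* old call: MAPSIZE bytes, half the entries */
        printf("element-count memset: map[%d] = 0x%04x\n",
               MAPSIZE - 1, (unsigned int)map[MAPSIZE - 1]);

        memset(map, 0xff, sizeof(map));         /* i.e. sizeof(u16) * MAPSIZE bytes */
        printf("byte-count memset:    map[%d] = 0x%04x (NUMA_NO_NODE? %d)\n",
               MAPSIZE - 1, (unsigned int)map[MAPSIZE - 1],
               map[MAPSIZE - 1] == NUMA_NO_NODE);
        return 0;
}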



end of thread, other threads:[~2008-01-16 20:39 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-01-16 17:09 [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs V3 travis
2008-01-16 17:09 ` [PATCH 01/10] x86: Change size of APICIDs from u8 to u16 V3 travis
2008-01-16 17:09 ` [PATCH 02/10] x86: Change size of node ids " travis
2008-01-16 17:53   ` Eric Dumazet
2008-01-16 18:10     ` Mike Travis
2008-01-16 19:16     ` Mike Travis
2008-01-16 20:37       ` Eric Dumazet
2008-01-16 17:09 ` [PATCH 03/10] x86: Change NR_CPUS arrays in powernow-k8 V3 travis
2008-01-16 17:09 ` [PATCH 04/10] x86: Change NR_CPUS arrays in intel_cacheinfo V3 travis
2008-01-16 17:09 ` [PATCH 05/10] x86: Change NR_CPUS arrays in smpboot_64 V3 travis
2008-01-16 17:09 ` [PATCH 06/10] x86: Change NR_CPUS arrays in topology V3 travis
2008-01-16 17:09 ` [PATCH 07/10] x86: Cleanup x86_cpu_to_apicid references V3 travis
2008-01-16 17:09 ` [PATCH 08/10] x86: Change NR_CPUS arrays in numa_64 V3 travis
2008-01-16 17:09 ` [PATCH 09/10] x86: Change NR_CPUS arrays in acpi-cpufreq V3 travis
2008-01-16 17:09 ` [PATCH 10/10] x86: Change bios_cpu_apicid to percpu data variable V3 travis
2008-01-16 18:01 ` [PATCH 00/10] x86: Reduce memory and intra-node effects with large count NR_CPUs V3 Frans Pop
2008-01-16 18:14   ` Mike Travis
