* [PATCH 0/6] Sizing zones and holes in an architecture independent manner V6
@ 2006-05-08 14:10 Mel Gorman
  2006-05-08 14:10 ` [PATCH 1/6] Introduce mechanism for registering active regions of memory Mel Gorman
                   ` (5 more replies)
  0 siblings, 6 replies; 47+ messages in thread
From: Mel Gorman @ 2006-05-08 14:10 UTC (permalink / raw)
  To: akpm
  Cc: davej, tony.luck, Mel Gorman, ak, bob.picco, linux-kernel,
	linuxppc-dev, linux-mm

This is V6 of the patchset to size zones and memory holes in an
architecture-independent manner. It is based against 2.6.17-rc3-mm1 and
there have been no objections to these patches since V4. Please merge.

The reasons why I'd like to see this merged include:

 o Less architecture-specific code - particularly for x86 and ppc64
 o More maintainable. Changes to zone layout need only be made in one place
 o Zone-sizing and memory hole calculation is one less job that needs to be
   done for new architecture ports
 o With the architecture-independent representation, zone-based
   anti-fragmentation needs a lot less architecture-specific code making it
   more portable between architectures. This will be important for future
   hugepage-availability work
 o Nigel Cunningham has stated that software suspend could potentially
   use the architecture-independent representation to discover what pages
   need to be saved during suspend

When testing on powerpc, I found compile errors that looked like this:

arch/powerpc/kernel/built-in.o(.init.text+0x7778): In function `vrsqrtefp':
: undefined reference to `__udivdi3'
arch/powerpc/kernel/built-in.o(.init.text+0x77c4): In function `vrsqrtefp':
: undefined reference to `__udivdi3'
make: *** [.tmp_vmlinux1] Error 1

A workaround that is *very likely wrong* and blatantly stolen from arch/sh
is at http://www.csn.ul.ie/~mel/udivdi3-powerpc-workaround.diff .

Changelog since V5
o Add a missing #include to mm/mem_init.c
o Drop the verbose debugging part of the set
o Report active range registration when loglevel is set to KERN_DEBUG

Changelog since V4
o Rebase to 2.6.17-rc3-mm1
o Calculate holes on x86 with SRAT correctly

Changelog since V3
o Rebase to 2.6.17-rc2
o Allow the active regions to be cleared. Needed by x86_64 when it decides
  the SRAT table is bad half way through the registering of active regions
o Fix for flatmem x86_64 machines booting

Changelog since V2
o Fix a bug where holes in lower zones get double counted
o Catch the case where a new range is registered that is within an existing range
o Catch the case where a zone boundary is within a hole
o Use the EFI map for registering ranges on x86_64+numa
o On IA64+NUMA, add the active ranges before rounding for granules
o On x86_64, remove e820_hole_size and e820_bootmem_free and use
  arch-independent equivalents
o On x86_64, remove the map walk in e820_end_of_ram()
o Rename memory_present_with_active_regions, name ambiguous
o Add absent_pages_in_range() for arches to call

Changelog since V1
o Correctly convert virtual and physical addresses to PFNs on ia64
o Correctly convert physical addresses to PFNs on older ppc
o When add_active_range() is called with overlapping pfn ranges, merge them
o When a zone boundary occurs within a memory hole, account correctly
o Minor whitespace damage cleanup
o Debugging patch temporarily included

At a basic level, architectures define structures to record where active
ranges of page frames are located. Once located, the code to calculate
zone sizes and holes in each architecture is very similar.  Some of this
zone and hole sizing code is difficult to read for no good reason. This
set of patches eliminates the similar-looking architecture-specific code.

The patches introduce a mechanism where architectures register the active
ranges of page frames with add_active_range(). When all areas have been
discovered, free_area_init_nodes() is called to initialise the pgdat and
zones. The zone sizes and holes are then calculated in an
architecture-independent manner.
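
As a rough sketch of the intended usage (hypothetical names throughout;
the "mem_region" walk stands in for whatever the architecture uses to
describe physical memory, such as lmb on powerpc or the e820 map on
x86_64):

	struct mem_region {		/* hypothetical representation */
		unsigned long start_pfn;
		unsigned long end_pfn;
		int nid;
	};

	void __init arch_zone_init(void)
	{
		int i;

		/* Register every active range of physical page frames */
		for (i = 0; i < nr_mem_regions; i++)
			add_active_range(mem_region[i].nid,
					mem_region[i].start_pfn,
					mem_region[i].end_pfn);

		/* End PFN of each zone: DMA, DMA32, NORMAL, HIGHMEM */
		free_area_init_nodes(max_dma_pfn, max_dma32_pfn,
					max_low_pfn, max_high_pfn);
	}

Ranges that overlap an existing range on the same node are merged by
add_active_range(), so architectures do not need to coalesce their memory
maps themselves.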

Patch 1 introduces the mechanism for registering and initialising PFN ranges
Patch 2 changes ppc to use the mechanism - 128 arch-specific LOC removed
Patch 3 changes x86 to use the mechanism - 142 arch-specific LOC removed
Patch 4 changes x86_64 to use the mechanism - 94 arch-specific LOC removed
Patch 5 changes ia64 to use the mechanism - 57 arch-specific LOC removed

At this point, there is a reduction of 421 architecture-specific lines of code
and a net reduction of 25 lines. The arch-independent code is a lot easier
to read in comparison to some of the arch-specific stuff, particularly in
arch/i386/ .

For Patch 6, it was also noted that page_alloc.c has a *lot* of
initialisation code which makes the file harder to read than it needs to
be. Patch 6 creates a new file mem_init.c and moves a lot of initialisation
code from page_alloc.c to it. After the patch is applied, there is still a net
loss of 7 lines.

I have successfully boot tested the patches and verified that the zones
are the correct size on:

o x86, flatmem with 1.5GiB of RAM
o x86, NUMAQ
o x86, NUMA, with SRAT
o x86 with SRAT CONFIG_NUMA=n
o PPC64, NUMA
o PPC64, CONFIG_NUMA=n
o PPC64, CONFIG_64BIT=N
o Power, RS6000 (Had difficulty here with missing __udivdi3 symbol)
o x86_64, NUMA with SRAT
o x86_64, NUMA with broken SRAT that falls back to k8topology discovery
o x86_64, ACPI_NUMA, ACPI_MEMORY_HOTPLUG && !SPARSEMEM to trigger the
  hotadd path without sparsemem fun in srat.c (SRAT broken on test machine and
  I'm pretty sure the machine does not support physical memory hotadd anyway
  so test may not have been effective other than being a compile test.)
o x86_64, CONFIG_NUMA=n
o x86_64, CONFIG_64=n
o x86_64, CONFIG_64=n, CONFIG_NUMA=n
o x86_64, AMD64 desktop machine with flatmem
o ia64 (Itanium 2)
o ia64 (Itanium 2), CONFIG_64=N

Tony Luck has successfully tested for ia64 on Itanium with tiger_defconfig,
gensparse_defconfig and defconfig. Bob Picco has also tested and debugged
on IA64. Jack Steiner successfully boot tested on a mammoth SGI IA64-based
machine. These tests were against release 3 of these patches on top of
2.6.17-rc1, but there have been no ia64 changes since release 3.

There are differences in the zone sizes for x86_64 because the
arch-specific code accounts the kernel image and the starting mem_maps as
memory holes, whereas the architecture-independent code accounts that
memory as present.

The net reduction seems small but the big benefit of this set of patches
is the reduction of 421 lines of architecture-specific code, some of
which is very hairy. There should be a greater net reduction when other
architectures use the same mechanisms for zone and hole sizing but I lack
the hardware to test on.

Comments?

Additional credit:
	Dave Hansen for the initial suggestion and comments on early patches
	Andy Whitcroft for reviewing early versions and catching numerous errors
	Tony Luck for testing and debugging on IA64
	Bob Picco for testing and fixing bugs related to pfn registration
	Jack Steiner and Yasunori for testing on IA64
	Andi Kleen for reviewing and feeding back about x86_64

 arch/i386/Kconfig           |    8 
 arch/i386/kernel/setup.c    |   19 
 arch/i386/kernel/srat.c     |  101 ---
 arch/i386/mm/discontig.c    |   65 --
 arch/ia64/Kconfig           |    3 
 arch/ia64/mm/contig.c       |   60 --
 arch/ia64/mm/discontig.c    |   41 -
 arch/ia64/mm/init.c         |   12 
 arch/powerpc/Kconfig        |   13 
 arch/powerpc/mm/mem.c       |   53 --
 arch/powerpc/mm/numa.c      |  157 ------
 arch/ppc/Kconfig            |    3 
 arch/ppc/mm/init.c          |   26 -
 arch/x86_64/Kconfig         |    3 
 arch/x86_64/kernel/e820.c   |  109 +---
 arch/x86_64/kernel/setup.c  |    7 
 arch/x86_64/mm/init.c       |   62 --
 arch/x86_64/mm/k8topology.c |    3 
 arch/x86_64/mm/numa.c       |   18 
 arch/x86_64/mm/srat.c       |   11 
 include/asm-ia64/meminit.h  |    1 
 include/asm-x86_64/e820.h   |    5 
 include/asm-x86_64/proto.h  |    2 
 include/linux/mm.h          |   34 +
 include/linux/mmzone.h      |   10 
 mm/Makefile                 |    2 
 mm/mem_init.c               | 1121 ++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c             |  750 -----------------------------
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* [PATCH 1/6] Introduce mechanism for registering active regions of memory
  2006-05-08 14:10 [PATCH 0/6] Sizing zones and holes in an architecture independent manner V6 Mel Gorman
@ 2006-05-08 14:10 ` Mel Gorman
  2006-05-08 14:11 ` [PATCH 2/6] Have Power use add_active_range() and free_area_init_nodes() Mel Gorman
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2006-05-08 14:10 UTC (permalink / raw)
  To: akpm
  Cc: davej, tony.luck, Mel Gorman, linux-kernel, bob.picco, ak,
	linux-mm, linuxppc-dev


This patch defines the structure to represent an active range of page
frames within a node in an architecture independent manner. Architectures
are expected to register active ranges of PFNs using add_active_range(nid,
start_pfn, end_pfn) and call free_area_init_nodes() passing the PFNs of
the end of each zone.
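
As a concrete example of the zone-boundary arguments (this mirrors what
the powerpc patch later in the series does for the non-NUMA case), an
architecture where all memory is DMA-able and there is no highmem can
pass the same end PFN for every zone, which leaves the higher zones with
zero size:

	/* All pages in ZONE_DMA; DMA32/NORMAL/HIGHMEM span no pages */
	unsigned long end_pfn = lmb_end_of_DRAM() >> PAGE_SHIFT;

	free_area_init_nodes(end_pfn, end_pfn, end_pfn, end_pfn);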


 include/linux/mm.h     |   34 +++
 include/linux/mmzone.h |   10 -
 mm/page_alloc.c        |  403 +++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 422 insertions(+), 25 deletions(-)

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-clean/include/linux/mm.h linux-2.6.17-rc3-mm1-101-add_free_area_init_nodes/include/linux/mm.h
--- linux-2.6.17-rc3-mm1-clean/include/linux/mm.h	2006-05-01 11:37:01.000000000 +0100
+++ linux-2.6.17-rc3-mm1-101-add_free_area_init_nodes/include/linux/mm.h	2006-05-08 09:16:57.000000000 +0100
@@ -916,6 +916,40 @@ extern void free_area_init(unsigned long
 extern void free_area_init_node(int nid, pg_data_t *pgdat,
 	unsigned long * zones_size, unsigned long zone_start_pfn, 
 	unsigned long *zholes_size);
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+/*
+ * Any architecture that supports CONFIG_ARCH_POPULATES_NODE_MAP can
+ * initialise zone and hole information by
+ *
+ * for_all_memory_regions()
+ * 	add_active_range(nid, start, end)
+ * free_area_init_nodes(max_dma, max_dma32, max_low_pfn, max_pfn);
+ *
+ * Optionally, free_bootmem_with_active_regions() can be used to call
+ * free_bootmem_node() after active regions have been registered with
+ * add_active_range(). Similarly, sparse_memory_present_with_active_regions()
+ * calls memory_present() for active regions when SPARSEMEM is enabled
+ */
+extern void free_area_init_nodes(unsigned long max_dma_pfn,
+					unsigned long max_dma32_pfn,
+					unsigned long max_low_pfn,
+					unsigned long max_high_pfn);
+extern void add_active_range(unsigned int nid, unsigned long start_pfn,
+					unsigned long end_pfn);
+extern void shrink_active_range(unsigned int nid, unsigned long old_end_pfn,
+						unsigned long new_end_pfn);
+extern void remove_all_active_ranges(void);
+extern unsigned long absent_pages_in_range(unsigned long start_pfn,
+						unsigned long end_pfn);
+extern void get_pfn_range_for_nid(unsigned int nid,
+			unsigned long *start_pfn, unsigned long *end_pfn);
+extern unsigned long find_min_pfn_with_active_regions(void);
+extern unsigned long find_max_pfn_with_active_regions(void);
+extern int early_pfn_to_nid(unsigned long pfn);
+extern void free_bootmem_with_active_regions(int nid,
+						unsigned long max_low_pfn);
+extern void sparse_memory_present_with_active_regions(int nid);
+#endif
 extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long);
 extern void setup_per_zone_pages_min(void);
 extern void mem_init(void);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-clean/include/linux/mmzone.h linux-2.6.17-rc3-mm1-101-add_free_area_init_nodes/include/linux/mmzone.h
--- linux-2.6.17-rc3-mm1-clean/include/linux/mmzone.h	2006-05-01 11:37:01.000000000 +0100
+++ linux-2.6.17-rc3-mm1-101-add_free_area_init_nodes/include/linux/mmzone.h	2006-05-08 09:16:57.000000000 +0100
@@ -271,6 +271,13 @@ struct zonelist {
 	struct zone *zones[MAX_NUMNODES * MAX_NR_ZONES + 1]; // NULL delimited
 };
 
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+struct node_active_region {
+	unsigned long start_pfn;
+	unsigned long end_pfn;
+	int nid;
+};
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
 
 /*
  * The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
@@ -468,7 +475,8 @@ extern struct zone *next_zone(struct zon
 
 #endif
 
-#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
+#if !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) && \
+	!defined(CONFIG_ARCH_POPULATES_NODE_MAP)
 #define early_pfn_to_nid(nid)  (0UL)
 #endif
 
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-clean/mm/page_alloc.c linux-2.6.17-rc3-mm1-101-add_free_area_init_nodes/mm/page_alloc.c
--- linux-2.6.17-rc3-mm1-clean/mm/page_alloc.c	2006-05-01 11:37:01.000000000 +0100
+++ linux-2.6.17-rc3-mm1-101-add_free_area_init_nodes/mm/page_alloc.c	2006-05-08 10:56:43.000000000 +0100
@@ -38,6 +38,8 @@
 #include <linux/vmalloc.h>
 #include <linux/mempolicy.h>
 #include <linux/stop_machine.h>
+#include <linux/sort.h>
+#include <linux/pfn.h>
 
 #include <asm/tlbflush.h>
 #include "internal.h"
@@ -86,6 +88,18 @@ int min_free_kbytes = 1024;
 unsigned long __meminitdata nr_kernel_pages;
 unsigned long __meminitdata nr_all_pages;
 
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+  #ifdef CONFIG_MAX_ACTIVE_REGIONS
+    #define MAX_ACTIVE_REGIONS CONFIG_MAX_ACTIVE_REGIONS
+  #else
+    #define MAX_ACTIVE_REGIONS (MAX_NR_ZONES * MAX_NUMNODES + 1)
+  #endif
+
+  struct node_active_region __initdata early_node_map[MAX_ACTIVE_REGIONS];
+  unsigned long __initdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
+  unsigned long __initdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
+
 #ifdef CONFIG_DEBUG_VM
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
 {
@@ -1864,25 +1878,6 @@ static inline unsigned long wait_table_b
 
 #define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
 
-static void __init calculate_zone_totalpages(struct pglist_data *pgdat,
-		unsigned long *zones_size, unsigned long *zholes_size)
-{
-	unsigned long realtotalpages, totalpages = 0;
-	int i;
-
-	for (i = 0; i < MAX_NR_ZONES; i++)
-		totalpages += zones_size[i];
-	pgdat->node_spanned_pages = totalpages;
-
-	realtotalpages = totalpages;
-	if (zholes_size)
-		for (i = 0; i < MAX_NR_ZONES; i++)
-			realtotalpages -= zholes_size[i];
-	pgdat->node_present_pages = realtotalpages;
-	printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id, realtotalpages);
-}
-
-
 /*
  * Initially all pages are reserved - free ones are freed
  * up by free_all_bootmem() once the early boot process is
@@ -2200,6 +2195,215 @@ __meminit int init_currently_empty_zone(
 	return 0;
 }
 
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+/* Note: nid == MAX_NUMNODES returns first region */
+static int __init first_active_region_index_in_nid(int nid)
+{
+	int i;
+	for (i = 0; early_node_map[i].end_pfn; i++) {
+		if (nid == MAX_NUMNODES || early_node_map[i].nid == nid)
+			return i;
+	}
+
+	return MAX_ACTIVE_REGIONS;
+}
+
+/* Note: nid == MAX_NUMNODES returns next region */
+static int __init next_active_region_index_in_nid(unsigned int index, int nid)
+{
+	for (index = index + 1; early_node_map[index].end_pfn; index++) {
+		if (nid == MAX_NUMNODES || early_node_map[index].nid == nid)
+			return index;
+	}
+
+	return MAX_ACTIVE_REGIONS;
+}
+
+#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
+int __init early_pfn_to_nid(unsigned long pfn)
+{
+	int i;
+
+	for (i = 0; early_node_map[i].end_pfn; i++) {
+		unsigned long start_pfn = early_node_map[i].start_pfn;
+		unsigned long end_pfn = early_node_map[i].end_pfn;
+
+		if ((start_pfn <= pfn) && (pfn < end_pfn))
+			return early_node_map[i].nid;
+	}
+
+	return -1;
+}
+#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
+
+#define for_each_active_range_index_in_nid(i, nid) \
+	for (i = first_active_region_index_in_nid(nid); \
+				i != MAX_ACTIVE_REGIONS; \
+				i = next_active_region_index_in_nid(i, nid))
+
+void __init free_bootmem_with_active_regions(int nid,
+						unsigned long max_low_pfn)
+{
+	unsigned int i;
+	for_each_active_range_index_in_nid(i, nid) {
+		unsigned long size_pages = 0;
+		unsigned long end_pfn = early_node_map[i].end_pfn;
+		if (early_node_map[i].start_pfn >= max_low_pfn)
+			continue;
+
+		if (end_pfn > max_low_pfn)
+			end_pfn = max_low_pfn;
+
+		size_pages = end_pfn - early_node_map[i].start_pfn;
+		free_bootmem_node(NODE_DATA(early_node_map[i].nid),
+				PFN_PHYS(early_node_map[i].start_pfn),
+				size_pages << PAGE_SHIFT);
+	}
+}
+
+void __init sparse_memory_present_with_active_regions(int nid)
+{
+	unsigned int i;
+	for_each_active_range_index_in_nid(i, nid)
+		memory_present(early_node_map[i].nid,
+				early_node_map[i].start_pfn,
+				early_node_map[i].end_pfn);
+}
+
+void __init get_pfn_range_for_nid(unsigned int nid,
+			unsigned long *start_pfn, unsigned long *end_pfn)
+{
+	unsigned int i;
+	*start_pfn = -1UL;
+	*end_pfn = 0;
+
+	for_each_active_range_index_in_nid(i, nid) {
+		*start_pfn = min(*start_pfn, early_node_map[i].start_pfn);
+		*end_pfn = max(*end_pfn, early_node_map[i].end_pfn);
+	}
+
+	if (*start_pfn == -1UL) {
+		printk(KERN_WARNING "Node %u active with no memory\n", nid);
+		*start_pfn = 0;
+	}
+}
+
+unsigned long __init zone_present_pages_in_node(int nid,
+					unsigned long zone_type,
+					unsigned long *ignored)
+{
+	unsigned long node_start_pfn, node_end_pfn;
+	unsigned long zone_start_pfn, zone_end_pfn;
+
+	/* Get the start and end of the node and zone */
+	get_pfn_range_for_nid(nid, &node_start_pfn, &node_end_pfn);
+	zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
+	zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
+
+	/* Check that this node has pages within the zone's required range */
+	if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn)
+		return 0;
+
+	/* Move the zone boundaries inside the node if necessary */
+	zone_end_pfn = min(zone_end_pfn, node_end_pfn);
+	zone_start_pfn = max(zone_start_pfn, node_start_pfn);
+
+	/* Return the spanned pages */
+	return zone_end_pfn - zone_start_pfn;
+}
+
+unsigned long __init __absent_pages_in_range(int nid,
+				unsigned long range_start_pfn,
+				unsigned long range_end_pfn)
+{
+	int i = 0;
+	unsigned long prev_end_pfn = 0, hole_pages = 0;
+	unsigned long start_pfn;
+
+	/* Find the end_pfn of the first active range of pfns in the node */
+	i = first_active_region_index_in_nid(nid);
+	if (i == MAX_ACTIVE_REGIONS)
+		return 0;
+	prev_end_pfn = early_node_map[i].start_pfn;
+
+	/* Find all holes for the zone within the node */
+	for (; i != MAX_ACTIVE_REGIONS;
+			i = next_active_region_index_in_nid(i, nid)) {
+
+		/* No need to continue if prev_end_pfn is outside the zone */
+		if (prev_end_pfn >= range_end_pfn)
+			break;
+
+		/* Make sure the end of the zone is not within the hole */
+		start_pfn = min(early_node_map[i].start_pfn, range_end_pfn);
+		prev_end_pfn = max(prev_end_pfn, range_start_pfn);
+
+		/* Update the hole size count and move on */
+		if (start_pfn > range_start_pfn) {
+			BUG_ON(prev_end_pfn > start_pfn);
+			hole_pages += start_pfn - prev_end_pfn;
+		}
+		prev_end_pfn = early_node_map[i].end_pfn;
+	}
+
+	return hole_pages;
+}
+
+unsigned long __init absent_pages_in_range(unsigned long start_pfn,
+							unsigned long end_pfn)
+{
+	return __absent_pages_in_range(MAX_NUMNODES, start_pfn, end_pfn);
+}
+
+unsigned long __init zone_absent_pages_in_node(int nid,
+					unsigned long zone_type,
+					unsigned long *ignored)
+{
+	return __absent_pages_in_range(nid,
+				arch_zone_lowest_possible_pfn[zone_type],
+				arch_zone_highest_possible_pfn[zone_type]);
+}
+#else
+static inline unsigned long zone_present_pages_in_node(int nid,
+					unsigned long zone_type,
+					unsigned long *zones_size)
+{
+	return zones_size[zone_type];
+}
+
+static inline unsigned long zone_absent_pages_in_node(int nid,
+						unsigned long zone_type,
+						unsigned long *zholes_size)
+{
+	if (!zholes_size)
+		return 0;
+
+	return zholes_size[zone_type];
+}
+#endif
+
+static void __init calculate_node_totalpages(struct pglist_data *pgdat,
+		unsigned long *zones_size, unsigned long *zholes_size)
+{
+	unsigned long realtotalpages, totalpages = 0;
+	int i;
+
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		totalpages += zone_present_pages_in_node(pgdat->node_id, i,
+								zones_size);
+	}
+	pgdat->node_spanned_pages = totalpages;
+
+	realtotalpages = totalpages;
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		realtotalpages -=
+			zone_absent_pages_in_node(pgdat->node_id, i, zholes_size);
+	}
+	pgdat->node_present_pages = realtotalpages;
+	printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id,
+							realtotalpages);
+}
+
 /*
  * Set up the zone data structures:
  *   - mark all pages reserved
@@ -2223,10 +2427,9 @@ static void __meminit free_area_init_cor
 		struct zone *zone = pgdat->node_zones + j;
 		unsigned long size, realsize;
 
-		realsize = size = zones_size[j];
-		if (zholes_size)
-			realsize -= zholes_size[j];
-
+		size = zone_present_pages_in_node(nid, j, zones_size);
+		realsize = size - zone_absent_pages_in_node(nid, j,
+								zholes_size);
 		if (j < ZONE_HIGHMEM)
 			nr_kernel_pages += realsize;
 		nr_all_pages += realsize;
@@ -2294,13 +2497,165 @@ void __meminit free_area_init_node(int n
 {
 	pgdat->node_id = nid;
 	pgdat->node_start_pfn = node_start_pfn;
-	calculate_zone_totalpages(pgdat, zones_size, zholes_size);
+	calculate_node_totalpages(pgdat, zones_size, zholes_size);
 
 	alloc_node_mem_map(pgdat);
 
 	free_area_init_core(pgdat, zones_size, zholes_size);
 }
 
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+void __init add_active_range(unsigned int nid, unsigned long start_pfn,
+						unsigned long end_pfn)
+{
+	unsigned int i;
+	printk(KERN_DEBUG "Range (%d) %lu -> %lu\n", nid, start_pfn, end_pfn);
+
+	/* Merge with existing active regions if possible */
+	for (i = 0; early_node_map[i].end_pfn; i++) {
+		if (early_node_map[i].nid != nid)
+			continue;
+
+		/* Skip if an existing region covers this new one */
+		if (start_pfn >= early_node_map[i].start_pfn &&
+				end_pfn <= early_node_map[i].end_pfn)
+			return;
+
+		/* Merge forward if suitable */
+		if (start_pfn <= early_node_map[i].end_pfn &&
+				end_pfn > early_node_map[i].end_pfn) {
+			early_node_map[i].end_pfn = end_pfn;
+			return;
+		}
+
+		/* Merge backward if suitable */
+		if (start_pfn < early_node_map[i].end_pfn &&
+				end_pfn >= early_node_map[i].start_pfn) {
+			early_node_map[i].start_pfn = start_pfn;
+			return;
+		}
+	}
+
+	/* Leave last entry NULL, we use range.end_pfn to terminate the walk */
+	if (i >= MAX_ACTIVE_REGIONS - 1) {
+		printk(KERN_ERR "Too many memory regions, truncating\n");
+		return;
+	}
+
+	early_node_map[i].nid = nid;
+	early_node_map[i].start_pfn = start_pfn;
+	early_node_map[i].end_pfn = end_pfn;
+}
+
+void __init shrink_active_range(unsigned int nid, unsigned long old_end_pfn,
+						unsigned long new_end_pfn)
+{
+	unsigned int i;
+
+	/* Find the old active region end and shrink */
+	for_each_active_range_index_in_nid(i, nid) {
+		if (early_node_map[i].end_pfn == old_end_pfn) {
+			early_node_map[i].end_pfn = new_end_pfn;
+			break;
+		}
+	}
+}
+
+void __init remove_all_active_ranges()
+{
+	memset(early_node_map, 0, sizeof(early_node_map));
+}
+
+/* Compare two active node_active_regions */
+static int __init cmp_node_active_region(const void *a, const void *b)
+{
+	struct node_active_region *arange = (struct node_active_region *)a;
+	struct node_active_region *brange = (struct node_active_region *)b;
+
+	/* Done this way to avoid overflows */
+	if (arange->start_pfn > brange->start_pfn)
+		return 1;
+	if (arange->start_pfn < brange->start_pfn)
+		return -1;
+
+	return 0;
+}
+
+/* sort the node_map by start_pfn */
+static void __init sort_node_map(void)
+{
+	size_t num = 0;
+	while (early_node_map[num].end_pfn)
+		num++;
+
+	sort(early_node_map, num, sizeof(struct node_active_region),
+						cmp_node_active_region, NULL);
+}
+
+/* Find the lowest pfn for a node. This depends on a sorted early_node_map */
+unsigned long __init find_min_pfn_for_node(unsigned long nid)
+{
+	int i;
+
+	/* Assuming a sorted map, the first range found has the starting pfn */
+	for_each_active_range_index_in_nid(i, nid)
+		return early_node_map[i].start_pfn;
+
+	printk(KERN_WARNING "Could not find start_pfn for node %lu\n", nid);
+	return 0;
+}
+
+unsigned long __init find_min_pfn_with_active_regions(void)
+{
+	return find_min_pfn_for_node(MAX_NUMNODES);
+}
+
+unsigned long __init find_max_pfn_with_active_regions(void)
+{
+	int i;
+	unsigned long max_pfn = 0;
+
+	for (i = 0; early_node_map[i].end_pfn; i++)
+		max_pfn = max(max_pfn, early_node_map[i].end_pfn);
+
+	return max_pfn;
+}
+
+void __init free_area_init_nodes(unsigned long arch_max_dma_pfn,
+				unsigned long arch_max_dma32_pfn,
+				unsigned long arch_max_low_pfn,
+				unsigned long arch_max_high_pfn)
+{
+	unsigned long nid;
+	int zone_index;
+
+	/* Record where the zone boundaries are */
+	memset(arch_zone_lowest_possible_pfn, 0,
+				sizeof(arch_zone_lowest_possible_pfn));
+	memset(arch_zone_highest_possible_pfn, 0,
+				sizeof(arch_zone_highest_possible_pfn));
+	arch_zone_lowest_possible_pfn[ZONE_DMA] =
+					find_min_pfn_with_active_regions();
+	arch_zone_highest_possible_pfn[ZONE_DMA] = arch_max_dma_pfn;
+	arch_zone_highest_possible_pfn[ZONE_DMA32] = arch_max_dma32_pfn;
+	arch_zone_highest_possible_pfn[ZONE_NORMAL] = arch_max_low_pfn;
+	arch_zone_highest_possible_pfn[ZONE_HIGHMEM] = arch_max_high_pfn;
+	for (zone_index = 1; zone_index < MAX_NR_ZONES; zone_index++) {
+		arch_zone_lowest_possible_pfn[zone_index] =
+			arch_zone_highest_possible_pfn[zone_index-1];
+	}
+
+	/* Regions in the early_node_map can be in any order */
+	sort_node_map();
+
+	for_each_online_node(nid) {
+		pg_data_t *pgdat = NODE_DATA(nid);
+		free_area_init_node(nid, pgdat, NULL,
+				find_min_pfn_for_node(nid), NULL);
+	}
+}
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
+
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 static bootmem_data_t contig_bootmem_data;
 struct pglist_data contig_page_data = { .bdata = &contig_bootmem_data };


* [PATCH 2/6] Have Power use add_active_range() and free_area_init_nodes()
  2006-05-08 14:10 [PATCH 0/6] Sizing zones and holes in an architecture independent manner V6 Mel Gorman
  2006-05-08 14:10 ` [PATCH 1/6] Introduce mechanism for registering active regions of memory Mel Gorman
@ 2006-05-08 14:11 ` Mel Gorman
  2006-05-08 14:11 ` [PATCH 3/6] Have x86 use add_active_range() and free_area_init_nodes Mel Gorman
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2006-05-08 14:11 UTC (permalink / raw)
  To: akpm
  Cc: davej, tony.luck, Mel Gorman, ak, bob.picco, linux-kernel,
	linuxppc-dev, linux-mm


Size zones and holes in an architecture independent manner for Power.


 powerpc/Kconfig   |   13 ++--
 powerpc/mm/mem.c  |   53 ++++++----------
 powerpc/mm/numa.c |  157 ++++---------------------------------------------
 ppc/Kconfig       |    3 
 ppc/mm/init.c     |   26 ++++----
 5 files changed, 62 insertions(+), 190 deletions(-)

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-101-add_free_area_init_nodes/arch/powerpc/Kconfig linux-2.6.17-rc3-mm1-102-powerpc_use_init_nodes/arch/powerpc/Kconfig
--- linux-2.6.17-rc3-mm1-101-add_free_area_init_nodes/arch/powerpc/Kconfig	2006-05-01 11:36:58.000000000 +0100
+++ linux-2.6.17-rc3-mm1-102-powerpc_use_init_nodes/arch/powerpc/Kconfig	2006-05-08 09:17:57.000000000 +0100
@@ -676,11 +676,16 @@ config ARCH_SPARSEMEM_DEFAULT
 	def_bool y
 	depends on SMP && PPC_PSERIES
 
-source "mm/Kconfig"
-
-config HAVE_ARCH_EARLY_PFN_TO_NID
+config ARCH_POPULATES_NODE_MAP
 	def_bool y
-	depends on NEED_MULTIPLE_NODES
+
+# Value of 256 is MAX_LMB_REGIONS * 2
+config MAX_ACTIVE_REGIONS
+	int
+	default 256
+	depends on ARCH_POPULATES_NODE_MAP
+
+source "mm/Kconfig"
 
 config ARCH_MEMORY_PROBE
 	def_bool y
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-101-add_free_area_init_nodes/arch/powerpc/mm/mem.c linux-2.6.17-rc3-mm1-102-powerpc_use_init_nodes/arch/powerpc/mm/mem.c
--- linux-2.6.17-rc3-mm1-101-add_free_area_init_nodes/arch/powerpc/mm/mem.c	2006-05-01 11:36:58.000000000 +0100
+++ linux-2.6.17-rc3-mm1-102-powerpc_use_init_nodes/arch/powerpc/mm/mem.c	2006-05-08 09:17:57.000000000 +0100
@@ -257,20 +257,22 @@ void __init do_init_bootmem(void)
 
 	boot_mapsize = init_bootmem(start >> PAGE_SHIFT, total_pages);
 
+	/* Add active regions with valid PFNs */
+	for (i = 0; i < lmb.memory.cnt; i++) {
+		unsigned long start_pfn, end_pfn;
+		start_pfn = lmb.memory.region[i].base >> PAGE_SHIFT;
+		end_pfn = start_pfn + lmb_size_pages(&lmb.memory, i);
+		add_active_range(0, start_pfn, end_pfn);
+	}
+
 	/* Add all physical memory to the bootmem map, mark each area
 	 * present.
 	 */
-	for (i = 0; i < lmb.memory.cnt; i++) {
-		unsigned long base = lmb.memory.region[i].base;
-		unsigned long size = lmb_size_bytes(&lmb.memory, i);
 #ifdef CONFIG_HIGHMEM
-		if (base >= total_lowmem)
-			continue;
-		if (base + size > total_lowmem)
-			size = total_lowmem - base;
+	free_bootmem_with_active_regions(0, total_lowmem >> PAGE_SHIFT);
+#else
+	free_bootmem_with_active_regions(0, max_pfn);
 #endif
-		free_bootmem(base, size);
-	}
 
 	/* reserve the sections we're already using */
 	for (i = 0; i < lmb.reserved.cnt; i++)
@@ -278,9 +280,8 @@ void __init do_init_bootmem(void)
 				lmb_size_bytes(&lmb.reserved, i));
 
 	/* XXX need to clip this if using highmem? */
-	for (i = 0; i < lmb.memory.cnt; i++)
-		memory_present(0, lmb_start_pfn(&lmb.memory, i),
-			       lmb_end_pfn(&lmb.memory, i));
+	sparse_memory_present_with_active_regions(0);
+
 	init_bootmem_done = 1;
 }
 
@@ -289,8 +290,6 @@ void __init do_init_bootmem(void)
  */
 void __init paging_init(void)
 {
-	unsigned long zones_size[MAX_NR_ZONES];
-	unsigned long zholes_size[MAX_NR_ZONES];
 	unsigned long total_ram = lmb_phys_mem_size();
 	unsigned long top_of_ram = lmb_end_of_DRAM();
 
@@ -308,26 +307,18 @@ void __init paging_init(void)
 	       top_of_ram, total_ram);
 	printk(KERN_DEBUG "Memory hole size: %ldMB\n",
 	       (top_of_ram - total_ram) >> 20);
-	/*
-	 * All pages are DMA-able so we put them all in the DMA zone.
-	 */
-	memset(zones_size, 0, sizeof(zones_size));
-	memset(zholes_size, 0, sizeof(zholes_size));
-
-	zones_size[ZONE_DMA] = top_of_ram >> PAGE_SHIFT;
-	zholes_size[ZONE_DMA] = (top_of_ram - total_ram) >> PAGE_SHIFT;
-
 #ifdef CONFIG_HIGHMEM
-	zones_size[ZONE_DMA] = total_lowmem >> PAGE_SHIFT;
-	zones_size[ZONE_HIGHMEM] = (total_memory - total_lowmem) >> PAGE_SHIFT;
-	zholes_size[ZONE_HIGHMEM] = (top_of_ram - total_ram) >> PAGE_SHIFT;
+	free_area_init_nodes(total_lowmem >> PAGE_SHIFT,
+				total_lowmem >> PAGE_SHIFT,
+				total_lowmem >> PAGE_SHIFT,
+				top_of_ram >> PAGE_SHIFT);
 #else
-	zones_size[ZONE_DMA] = top_of_ram >> PAGE_SHIFT;
-	zholes_size[ZONE_DMA] = (top_of_ram - total_ram) >> PAGE_SHIFT;
-#endif /* CONFIG_HIGHMEM */
+	free_area_init_nodes(top_of_ram >> PAGE_SHIFT,
+				top_of_ram >> PAGE_SHIFT,
+				top_of_ram >> PAGE_SHIFT,
+				top_of_ram >> PAGE_SHIFT);
+#endif
 
-	free_area_init_node(0, NODE_DATA(0), zones_size,
-			    __pa(PAGE_OFFSET) >> PAGE_SHIFT, zholes_size);
 }
 #endif /* ! CONFIG_NEED_MULTIPLE_NODES */
 
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-101-add_free_area_init_nodes/arch/powerpc/mm/numa.c linux-2.6.17-rc3-mm1-102-powerpc_use_init_nodes/arch/powerpc/mm/numa.c
--- linux-2.6.17-rc3-mm1-101-add_free_area_init_nodes/arch/powerpc/mm/numa.c	2006-05-01 11:36:58.000000000 +0100
+++ linux-2.6.17-rc3-mm1-102-powerpc_use_init_nodes/arch/powerpc/mm/numa.c	2006-05-08 09:17:57.000000000 +0100
@@ -39,96 +39,6 @@ static bootmem_data_t __initdata plat_no
 static int min_common_depth;
 static int n_mem_addr_cells, n_mem_size_cells;
 
-/*
- * We need somewhere to store start/end/node for each region until we have
- * allocated the real node_data structures.
- */
-#define MAX_REGIONS	(MAX_LMB_REGIONS*2)
-static struct {
-	unsigned long start_pfn;
-	unsigned long end_pfn;
-	int nid;
-} init_node_data[MAX_REGIONS] __initdata;
-
-int __init early_pfn_to_nid(unsigned long pfn)
-{
-	unsigned int i;
-
-	for (i = 0; init_node_data[i].end_pfn; i++) {
-		unsigned long start_pfn = init_node_data[i].start_pfn;
-		unsigned long end_pfn = init_node_data[i].end_pfn;
-
-		if ((start_pfn <= pfn) && (pfn < end_pfn))
-			return init_node_data[i].nid;
-	}
-
-	return -1;
-}
-
-void __init add_region(unsigned int nid, unsigned long start_pfn,
-		       unsigned long pages)
-{
-	unsigned int i;
-
-	dbg("add_region nid %d start_pfn 0x%lx pages 0x%lx\n",
-		nid, start_pfn, pages);
-
-	for (i = 0; init_node_data[i].end_pfn; i++) {
-		if (init_node_data[i].nid != nid)
-			continue;
-		if (init_node_data[i].end_pfn == start_pfn) {
-			init_node_data[i].end_pfn += pages;
-			return;
-		}
-		if (init_node_data[i].start_pfn == (start_pfn + pages)) {
-			init_node_data[i].start_pfn -= pages;
-			return;
-		}
-	}
-
-	/*
-	 * Leave last entry NULL so we dont iterate off the end (we use
-	 * entry.end_pfn to terminate the walk).
-	 */
-	if (i >= (MAX_REGIONS - 1)) {
-		printk(KERN_ERR "WARNING: too many memory regions in "
-				"numa code, truncating\n");
-		return;
-	}
-
-	init_node_data[i].start_pfn = start_pfn;
-	init_node_data[i].end_pfn = start_pfn + pages;
-	init_node_data[i].nid = nid;
-}
-
-/* We assume init_node_data has no overlapping regions */
-void __init get_region(unsigned int nid, unsigned long *start_pfn,
-		       unsigned long *end_pfn, unsigned long *pages_present)
-{
-	unsigned int i;
-
-	*start_pfn = -1UL;
-	*end_pfn = *pages_present = 0;
-
-	for (i = 0; init_node_data[i].end_pfn; i++) {
-		if (init_node_data[i].nid != nid)
-			continue;
-
-		*pages_present += init_node_data[i].end_pfn -
-			init_node_data[i].start_pfn;
-
-		if (init_node_data[i].start_pfn < *start_pfn)
-			*start_pfn = init_node_data[i].start_pfn;
-
-		if (init_node_data[i].end_pfn > *end_pfn)
-			*end_pfn = init_node_data[i].end_pfn;
-	}
-
-	/* We didnt find a matching region, return start/end as 0 */
-	if (*start_pfn == -1UL)
-		*start_pfn = 0;
-}
-
 static void __cpuinit map_cpu_to_node(int cpu, int node)
 {
 	numa_cpu_lookup_table[cpu] = node;
@@ -471,8 +381,8 @@ new_range:
 				continue;
 		}
 
-		add_region(nid, start >> PAGE_SHIFT,
-			   size >> PAGE_SHIFT);
+		add_active_range(nid, start >> PAGE_SHIFT,
+				(start >> PAGE_SHIFT) + (size >> PAGE_SHIFT));
 
 		if (--ranges)
 			goto new_range;
@@ -485,6 +395,7 @@ static void __init setup_nonnuma(void)
 {
 	unsigned long top_of_ram = lmb_end_of_DRAM();
 	unsigned long total_ram = lmb_phys_mem_size();
+	unsigned long start_pfn, end_pfn;
 	unsigned int i;
 
 	printk(KERN_DEBUG "Top of RAM: 0x%lx, Total RAM: 0x%lx\n",
@@ -492,9 +403,11 @@ static void __init setup_nonnuma(void)
 	printk(KERN_DEBUG "Memory hole size: %ldMB\n",
 	       (top_of_ram - total_ram) >> 20);
 
-	for (i = 0; i < lmb.memory.cnt; ++i)
-		add_region(0, lmb.memory.region[i].base >> PAGE_SHIFT,
-			   lmb_size_pages(&lmb.memory, i));
+	for (i = 0; i < lmb.memory.cnt; ++i) {
+		start_pfn = lmb.memory.region[i].base >> PAGE_SHIFT;
+		end_pfn = start_pfn + lmb_size_pages(&lmb.memory, i);
+		add_active_range(0, start_pfn, end_pfn);
+	}
 	node_set_online(0);
 }
 
@@ -632,11 +545,11 @@ void __init do_init_bootmem(void)
 			  (void *)(unsigned long)boot_cpuid);
 
 	for_each_online_node(nid) {
-		unsigned long start_pfn, end_pfn, pages_present;
+		unsigned long start_pfn, end_pfn;
 		unsigned long bootmem_paddr;
 		unsigned long bootmap_pages;
 
-		get_region(nid, &start_pfn, &end_pfn, &pages_present);
+		get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
 
 		/* Allocate the node structure node local if possible */
 		NODE_DATA(nid) = careful_allocation(nid,
@@ -669,19 +582,7 @@ void __init do_init_bootmem(void)
 		init_bootmem_node(NODE_DATA(nid), bootmem_paddr >> PAGE_SHIFT,
 				  start_pfn, end_pfn);
 
-		/* Add free regions on this node */
-		for (i = 0; init_node_data[i].end_pfn; i++) {
-			unsigned long start, end;
-
-			if (init_node_data[i].nid != nid)
-				continue;
-
-			start = init_node_data[i].start_pfn << PAGE_SHIFT;
-			end = init_node_data[i].end_pfn << PAGE_SHIFT;
-
-			dbg("free_bootmem %lx %lx\n", start, end - start);
-  			free_bootmem_node(NODE_DATA(nid), start, end - start);
-		}
+		free_bootmem_with_active_regions(nid, end_pfn);
 
 		/* Mark reserved regions on this node */
 		for (i = 0; i < lmb.reserved.cnt; i++) {
@@ -712,44 +613,14 @@ void __init do_init_bootmem(void)
 			}
 		}
 
-		/* Add regions into sparsemem */
-		for (i = 0; init_node_data[i].end_pfn; i++) {
-			unsigned long start, end;
-
-			if (init_node_data[i].nid != nid)
-				continue;
-
-			start = init_node_data[i].start_pfn;
-			end = init_node_data[i].end_pfn;
-
-			memory_present(nid, start, end);
-		}
+		sparse_memory_present_with_active_regions(nid);
 	}
 }
 
 void __init paging_init(void)
 {
-	unsigned long zones_size[MAX_NR_ZONES];
-	unsigned long zholes_size[MAX_NR_ZONES];
-	int nid;
-
-	memset(zones_size, 0, sizeof(zones_size));
-	memset(zholes_size, 0, sizeof(zholes_size));
-
-	for_each_online_node(nid) {
-		unsigned long start_pfn, end_pfn, pages_present;
-
-		get_region(nid, &start_pfn, &end_pfn, &pages_present);
-
-		zones_size[ZONE_DMA] = end_pfn - start_pfn;
-		zholes_size[ZONE_DMA] = zones_size[ZONE_DMA] - pages_present;
-
-		dbg("free_area_init node %d %lx %lx (hole: %lx)\n", nid,
-		    zones_size[ZONE_DMA], start_pfn, zholes_size[ZONE_DMA]);
-
-		free_area_init_node(nid, NODE_DATA(nid), zones_size, start_pfn,
-				    zholes_size);
-	}
+	unsigned long end_pfn = lmb_end_of_DRAM() >> PAGE_SHIFT;
+	free_area_init_nodes(end_pfn, end_pfn, end_pfn, end_pfn);
 }
 
 static int __init early_numa(char *p)
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-101-add_free_area_init_nodes/arch/ppc/Kconfig linux-2.6.17-rc3-mm1-102-powerpc_use_init_nodes/arch/ppc/Kconfig
--- linux-2.6.17-rc3-mm1-101-add_free_area_init_nodes/arch/ppc/Kconfig	2006-04-27 03:19:25.000000000 +0100
+++ linux-2.6.17-rc3-mm1-102-powerpc_use_init_nodes/arch/ppc/Kconfig	2006-05-08 09:17:57.000000000 +0100
@@ -949,6 +949,9 @@ config NR_CPUS
 config HIGHMEM
 	bool "High memory support"
 
+config ARCH_POPULATES_NODE_MAP
+	def_bool y
+
 source kernel/Kconfig.hz
 source kernel/Kconfig.preempt
 source "mm/Kconfig"
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-101-add_free_area_init_nodes/arch/ppc/mm/init.c linux-2.6.17-rc3-mm1-102-powerpc_use_init_nodes/arch/ppc/mm/init.c
--- linux-2.6.17-rc3-mm1-101-add_free_area_init_nodes/arch/ppc/mm/init.c	2006-04-27 03:19:25.000000000 +0100
+++ linux-2.6.17-rc3-mm1-102-powerpc_use_init_nodes/arch/ppc/mm/init.c	2006-05-08 09:17:57.000000000 +0100
@@ -359,8 +359,7 @@ void __init do_init_bootmem(void)
  */
 void __init paging_init(void)
 {
-	unsigned long zones_size[MAX_NR_ZONES], i;
-
+	unsigned long start_pfn, end_pfn;
 #ifdef CONFIG_HIGHMEM
 	map_page(PKMAP_BASE, 0, 0);	/* XXX gross */
 	pkmap_page_table = pte_offset_kernel(pmd_offset(pgd_offset_k
@@ -370,19 +369,22 @@ void __init paging_init(void)
 			(KMAP_FIX_BEGIN), KMAP_FIX_BEGIN), KMAP_FIX_BEGIN);
 	kmap_prot = PAGE_KERNEL;
 #endif /* CONFIG_HIGHMEM */
-
-	/*
-	 * All pages are DMA-able so we put them all in the DMA zone.
-	 */
-	zones_size[ZONE_DMA] = total_lowmem >> PAGE_SHIFT;
-	for (i = 1; i < MAX_NR_ZONES; i++)
-		zones_size[i] = 0;
+	/* All pages are DMA-able so we put them all in the DMA zone. */
+	start_pfn = __pa(PAGE_OFFSET) >> PAGE_SHIFT;
+	end_pfn = start_pfn + (total_memory >> PAGE_SHIFT);
+	add_active_range(0, start_pfn, end_pfn);
 
 #ifdef CONFIG_HIGHMEM
-	zones_size[ZONE_HIGHMEM] = (total_memory - total_lowmem) >> PAGE_SHIFT;
+	free_area_init_nodes(total_lowmem >> PAGE_SHIFT,
+				total_lowmem >> PAGE_SHIFT,
+				total_lowmem >> PAGE_SHIFT,
+				total_memory >> PAGE_SHIFT);
+#else
+	free_area_init_nodes(total_memory >> PAGE_SHIFT,
+				total_memory >> PAGE_SHIFT,
+				total_memory >> PAGE_SHIFT,
+				total_memory >> PAGE_SHIFT);
 #endif /* CONFIG_HIGHMEM */
-
-	free_area_init(zones_size);
 }
 
 void __init mem_init(void)


* [PATCH 3/6] Have x86 use add_active_range() and free_area_init_nodes
  2006-05-08 14:10 [PATCH 0/6] Sizing zones and holes in an architecture independent manner V6 Mel Gorman
  2006-05-08 14:10 ` [PATCH 1/6] Introduce mechanism for registering active regions of memory Mel Gorman
  2006-05-08 14:11 ` [PATCH 2/6] Have Power use add_active_range() and free_area_init_nodes() Mel Gorman
@ 2006-05-08 14:11 ` Mel Gorman
  2006-05-08 14:11 ` [PATCH 4/6] Have x86_64 " Mel Gorman
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2006-05-08 14:11 UTC (permalink / raw)
  To: akpm
  Cc: davej, tony.luck, Mel Gorman, linux-kernel, bob.picco, ak,
	linux-mm, linuxppc-dev


Size zones and holes in an architecture independent manner for x86.


 Kconfig        |    8 +---
 kernel/setup.c |   19 +++------
 kernel/srat.c  |  100 +---------------------------------------------------
 mm/discontig.c |   65 +++++++--------------------------
 4 files changed, 25 insertions(+), 167 deletions(-)

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-102-powerpc_use_init_nodes/arch/i386/Kconfig linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/arch/i386/Kconfig
--- linux-2.6.17-rc3-mm1-102-powerpc_use_init_nodes/arch/i386/Kconfig	2006-05-01 11:36:54.000000000 +0100
+++ linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/arch/i386/Kconfig	2006-05-08 09:18:57.000000000 +0100
@@ -577,12 +577,10 @@ config ARCH_SELECT_MEMORY_MODEL
 	def_bool y
 	depends on ARCH_SPARSEMEM_ENABLE
 
-source "mm/Kconfig"
+config ARCH_POPULATES_NODE_MAP
+	def_bool y
 
-config HAVE_ARCH_EARLY_PFN_TO_NID
-	bool
-	default y
-	depends on NUMA
+source "mm/Kconfig"
 
 config HIGHPTE
 	bool "Allocate 3rd-level pagetables from highmem"
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-102-powerpc_use_init_nodes/arch/i386/kernel/setup.c linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/arch/i386/kernel/setup.c
--- linux-2.6.17-rc3-mm1-102-powerpc_use_init_nodes/arch/i386/kernel/setup.c	2006-05-01 11:36:54.000000000 +0100
+++ linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/arch/i386/kernel/setup.c	2006-05-08 09:18:57.000000000 +0100
@@ -1207,22 +1207,15 @@ static unsigned long __init setup_memory
 
 void __init zone_sizes_init(void)
 {
-	unsigned long zones_size[MAX_NR_ZONES] = {0, 0, 0};
-	unsigned int max_dma, low;
+	unsigned int max_dma;
+#ifndef CONFIG_HIGHMEM
+	unsigned long highend_pfn = max_low_pfn;
+#endif
 
 	max_dma = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
-	low = max_low_pfn;
 
-	if (low < max_dma)
-		zones_size[ZONE_DMA] = low;
-	else {
-		zones_size[ZONE_DMA] = max_dma;
-		zones_size[ZONE_NORMAL] = low - max_dma;
-#ifdef CONFIG_HIGHMEM
-		zones_size[ZONE_HIGHMEM] = highend_pfn - low;
-#endif
-	}
-	free_area_init(zones_size);
+	add_active_range(0, 0, highend_pfn);
+	free_area_init_nodes(max_dma, max_dma, max_low_pfn, highend_pfn);
 }
 #else
 extern unsigned long __init setup_memory(void);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-102-powerpc_use_init_nodes/arch/i386/kernel/srat.c linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/arch/i386/kernel/srat.c
--- linux-2.6.17-rc3-mm1-102-powerpc_use_init_nodes/arch/i386/kernel/srat.c	2006-05-01 11:36:54.000000000 +0100
+++ linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/arch/i386/kernel/srat.c	2006-05-08 09:18:57.000000000 +0100
@@ -55,8 +55,6 @@ struct node_memory_chunk_s {
 static struct node_memory_chunk_s node_memory_chunk[MAXCHUNKS];
 
 static int num_memory_chunks;		/* total number of memory chunks */
-static int zholes_size_init;
-static unsigned long zholes_size[MAX_NUMNODES * MAX_NR_ZONES];
 
 extern void * boot_ioremap(unsigned long, unsigned long);
 
@@ -136,50 +134,6 @@ static void __init parse_memory_affinity
 		 "enabled and removable" : "enabled" ) );
 }
 
-#if MAX_NR_ZONES != 4
-#error "MAX_NR_ZONES != 4, chunk_to_zone requires review"
-#endif
-/* Take a chunk of pages from page frame cstart to cend and count the number
- * of pages in each zone, returned via zones[].
- */
-static __init void chunk_to_zones(unsigned long cstart, unsigned long cend, 
-		unsigned long *zones)
-{
-	unsigned long max_dma;
-	extern unsigned long max_low_pfn;
-
-	int z;
-	unsigned long rend;
-
-	/* FIXME: MAX_DMA_ADDRESS and max_low_pfn are trying to provide
-	 * similarly scoped information and should be handled in a consistant
-	 * manner.
-	 */
-	max_dma = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
-
-	/* Split the hole into the zones in which it falls.  Repeatedly
-	 * take the segment in which the remaining hole starts, round it
-	 * to the end of that zone.
-	 */
-	memset(zones, 0, MAX_NR_ZONES * sizeof(long));
-	while (cstart < cend) {
-		if (cstart < max_dma) {
-			z = ZONE_DMA;
-			rend = (cend < max_dma)? cend : max_dma;
-
-		} else if (cstart < max_low_pfn) {
-			z = ZONE_NORMAL;
-			rend = (cend < max_low_pfn)? cend : max_low_pfn;
-
-		} else {
-			z = ZONE_HIGHMEM;
-			rend = cend;
-		}
-		zones[z] += rend - cstart;
-		cstart = rend;
-	}
-}
-
 /*
  * The SRAT table always lists ascending addresses, so can always
  * assume that the first "start" address that you see is the real
@@ -224,7 +178,6 @@ static int __init acpi20_parse_srat(stru
 
 	memset(pxm_bitmap, 0, sizeof(pxm_bitmap));	/* init proximity domain bitmap */
 	memset(node_memory_chunk, 0, sizeof(node_memory_chunk));
-	memset(zholes_size, 0, sizeof(zholes_size));
 
 	num_memory_chunks = 0;
 	while (p < end) {
@@ -288,6 +241,7 @@ static int __init acpi20_parse_srat(stru
 		printk("chunk %d nid %d start_pfn %08lx end_pfn %08lx\n",
 		       j, chunk->nid, chunk->start_pfn, chunk->end_pfn);
 		node_read_chunk(chunk->nid, chunk);
+		add_active_range(chunk->nid, chunk->start_pfn, chunk->end_pfn);
 	}
  
 	for_each_online_node(nid) {
@@ -396,57 +350,7 @@ int __init get_memcfg_from_srat(void)
 		return acpi20_parse_srat((struct acpi_table_srat *)header);
 	}
 out_err:
+	remove_all_active_ranges();
 	printk("failed to get NUMA memory information from SRAT table\n");
 	return 0;
 }
-
-/* For each node run the memory list to determine whether there are
- * any memory holes.  For each hole determine which ZONE they fall
- * into.
- *
- * NOTE#1: this requires knowledge of the zone boundries and so
- * _cannot_ be performed before those are calculated in setup_memory.
- * 
- * NOTE#2: we rely on the fact that the memory chunks are ordered by
- * start pfn number during setup.
- */
-static void __init get_zholes_init(void)
-{
-	int nid;
-	int c;
-	int first;
-	unsigned long end = 0;
-
-	for_each_online_node(nid) {
-		first = 1;
-		for (c = 0; c < num_memory_chunks; c++){
-			if (node_memory_chunk[c].nid == nid) {
-				if (first) {
-					end = node_memory_chunk[c].end_pfn;
-					first = 0;
-
-				} else {
-					/* Record any gap between this chunk
-					 * and the previous chunk on this node
-					 * against the zones it spans.
-					 */
-					chunk_to_zones(end,
-						node_memory_chunk[c].start_pfn,
-						&zholes_size[nid * MAX_NR_ZONES]);
-				}
-			}
-		}
-	}
-}
-
-unsigned long * __init get_zholes_size(int nid)
-{
-	if (!zholes_size_init) {
-		zholes_size_init++;
-		get_zholes_init();
-	}
-	if (nid >= MAX_NUMNODES || !node_online(nid))
-		printk("%s: nid = %d is invalid/offline. num_online_nodes = %d",
-		       __FUNCTION__, nid, num_online_nodes());
-	return &zholes_size[nid * MAX_NR_ZONES];
-}
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-102-powerpc_use_init_nodes/arch/i386/mm/discontig.c linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/arch/i386/mm/discontig.c
--- linux-2.6.17-rc3-mm1-102-powerpc_use_init_nodes/arch/i386/mm/discontig.c	2006-04-27 03:19:25.000000000 +0100
+++ linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/arch/i386/mm/discontig.c	2006-05-08 09:18:57.000000000 +0100
@@ -157,21 +157,6 @@ static void __init find_max_pfn_node(int
 		BUG();
 }
 
-/* Find the owning node for a pfn. */
-int early_pfn_to_nid(unsigned long pfn)
-{
-	int nid;
-
-	for_each_node(nid) {
-		if (node_end_pfn[nid] == 0)
-			break;
-		if (node_start_pfn[nid] <= pfn && node_end_pfn[nid] >= pfn)
-			return nid;
-	}
-
-	return 0;
-}
-
 /* 
  * Allocate memory for the pg_data_t for this node via a crude pre-bootmem
  * method.  For node zero take this from the bottom of memory, for
@@ -227,6 +212,8 @@ static unsigned long calculate_numa_rema
 	unsigned long pfn;
 
 	for_each_online_node(nid) {
+		unsigned old_end_pfn = node_end_pfn[nid];
+
 		/*
 		 * The acpi/srat node info can show hot-add memroy zones
 		 * where memory could be added but not currently present.
@@ -276,6 +263,7 @@ static unsigned long calculate_numa_rema
 
 		node_end_pfn[nid] -= size;
 		node_remap_start_pfn[nid] = node_end_pfn[nid];
+		shrink_active_range(nid, old_end_pfn, node_end_pfn[nid]);
 	}
 	printk("Reserving total of %ld pages for numa KVA remap\n",
 			reserve_pages);
@@ -352,45 +340,20 @@ unsigned long __init setup_memory(void)
 void __init zone_sizes_init(void)
 {
 	int nid;
+	unsigned long max_dma_pfn;
 
-
-	for_each_online_node(nid) {
-		unsigned long zones_size[MAX_NR_ZONES] = {0, 0, 0};
-		unsigned long *zholes_size;
-		unsigned int max_dma;
-
-		unsigned long low = max_low_pfn;
-		unsigned long start = node_start_pfn[nid];
-		unsigned long high = node_end_pfn[nid];
-
-		max_dma = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
-
-		if (node_has_online_mem(nid)){
-			if (start > low) {
-#ifdef CONFIG_HIGHMEM
-				BUG_ON(start > high);
-				zones_size[ZONE_HIGHMEM] = high - start;
-#endif
-			} else {
-				if (low < max_dma)
-					zones_size[ZONE_DMA] = low;
-				else {
-					BUG_ON(max_dma > low);
-					BUG_ON(low > high);
-					zones_size[ZONE_DMA] = max_dma;
-					zones_size[ZONE_NORMAL] = low - max_dma;
-#ifdef CONFIG_HIGHMEM
-					zones_size[ZONE_HIGHMEM] = high - low;
-#endif
-				}
-			}
+	/* If SRAT has not registered memory, register it now */
+	if (find_max_pfn_with_active_regions() == 0) {
+		for_each_online_node(nid) {
+			if (node_has_online_mem(nid))
+				add_active_range(nid, node_start_pfn[nid],
+							node_end_pfn[nid]);
 		}
-
-		zholes_size = get_zholes_size(nid);
-
-		free_area_init_node(nid, NODE_DATA(nid), zones_size, start,
-				zholes_size);
 	}
+
+	max_dma_pfn = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
+	free_area_init_nodes(max_dma_pfn, max_dma_pfn,
+						max_low_pfn, highend_pfn);
 	return;
 }
 


* [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
  2006-05-08 14:10 [PATCH 0/6] Sizing zones and holes in an architecture independent manner V6 Mel Gorman
                   ` (2 preceding siblings ...)
  2006-05-08 14:11 ` [PATCH 3/6] Have x86 use add_active_range() and free_area_init_nodes Mel Gorman
@ 2006-05-08 14:11 ` Mel Gorman
  2006-05-20 20:59   ` Andrew Morton
  2006-05-08 14:12 ` [PATCH 5/6] Have ia64 " Mel Gorman
  2006-05-08 14:12 ` [PATCH 6/6] Break out memory initialisation code from page_alloc.c to mem_init.c Mel Gorman
  5 siblings, 1 reply; 47+ messages in thread
From: Mel Gorman @ 2006-05-08 14:11 UTC (permalink / raw)
  To: akpm
  Cc: davej, tony.luck, Mel Gorman, ak, bob.picco, linux-kernel,
	linuxppc-dev, linux-mm


Size zones and holes in an architecture independent manner for x86_64.


 arch/x86_64/Kconfig         |    3 +
 arch/x86_64/kernel/e820.c   |  109 ++++++++++-----------------------------
 arch/x86_64/kernel/setup.c  |    7 ++
 arch/x86_64/mm/init.c       |   62 +---------------------
 arch/x86_64/mm/k8topology.c |    3 +
 arch/x86_64/mm/numa.c       |   18 +++---
 arch/x86_64/mm/srat.c       |   11 ++-
 include/asm-x86_64/e820.h   |    5 -
 include/asm-x86_64/proto.h  |    2 
 9 files changed, 63 insertions(+), 157 deletions(-)

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/arch/x86_64/Kconfig linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/arch/x86_64/Kconfig
--- linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/arch/x86_64/Kconfig	2006-05-01 11:36:58.000000000 +0100
+++ linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/arch/x86_64/Kconfig	2006-05-08 09:20:01.000000000 +0100
@@ -73,6 +73,9 @@ config ARCH_MAY_HAVE_PC_FDC
 	bool
 	default y
 
+config ARCH_POPULATES_NODE_MAP
+	def_bool y
+
 config DMI
 	bool
 	default y
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/arch/x86_64/kernel/e820.c linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/arch/x86_64/kernel/e820.c
--- linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/arch/x86_64/kernel/e820.c	2006-04-27 03:19:25.000000000 +0100
+++ linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/arch/x86_64/kernel/e820.c	2006-05-08 09:20:01.000000000 +0100
@@ -18,6 +18,7 @@
 #include <linux/string.h>
 #include <linux/kexec.h>
 #include <linux/module.h>
+#include <linux/mm.h>
 
 #include <asm/page.h>
 #include <asm/e820.h>
@@ -155,58 +156,14 @@ unsigned long __init find_e820_area(unsi
 	return -1UL;		
 } 
 
-/* 
- * Free bootmem based on the e820 table for a node.
- */
-void __init e820_bootmem_free(pg_data_t *pgdat, unsigned long start,unsigned long end)
-{
-	int i;
-	for (i = 0; i < e820.nr_map; i++) {
-		struct e820entry *ei = &e820.map[i]; 
-		unsigned long last, addr;
-
-		if (ei->type != E820_RAM || 
-		    ei->addr+ei->size <= start || 
-		    ei->addr >= end)
-			continue;
-
-		addr = round_up(ei->addr, PAGE_SIZE);
-		if (addr < start) 
-			addr = start;
-
-		last = round_down(ei->addr + ei->size, PAGE_SIZE); 
-		if (last >= end)
-			last = end; 
-
-		if (last > addr && last-addr >= PAGE_SIZE)
-			free_bootmem_node(pgdat, addr, last-addr);
-	}
-}
-
 /*
  * Find the highest page frame number we have available
  */
 unsigned long __init e820_end_of_ram(void)
 {
-	int i;
 	unsigned long end_pfn = 0;
 	
-	for (i = 0; i < e820.nr_map; i++) {
-		struct e820entry *ei = &e820.map[i]; 
-		unsigned long start, end;
-
-		start = round_up(ei->addr, PAGE_SIZE); 
-		end = round_down(ei->addr + ei->size, PAGE_SIZE); 
-		if (start >= end)
-			continue;
-		if (ei->type == E820_RAM) { 
-		if (end > end_pfn<<PAGE_SHIFT)
-			end_pfn = end>>PAGE_SHIFT;
-		} else { 
-			if (end > end_pfn_map<<PAGE_SHIFT) 
-				end_pfn_map = end>>PAGE_SHIFT;
-		} 
-	}
+	end_pfn = find_max_pfn_with_active_regions();
 
 	if (end_pfn > end_pfn_map) 
 		end_pfn_map = end_pfn;
@@ -220,40 +177,6 @@ unsigned long __init e820_end_of_ram(voi
 	return end_pfn;	
 }
 
-/* 
- * Compute how much memory is missing in a range.
- * Unlike the other functions in this file the arguments are in page numbers.
- */
-unsigned long __init
-e820_hole_size(unsigned long start_pfn, unsigned long end_pfn)
-{
-	unsigned long ram = 0;
-	unsigned long start = start_pfn << PAGE_SHIFT;
-	unsigned long end = end_pfn << PAGE_SHIFT;
-	int i;
-	for (i = 0; i < e820.nr_map; i++) {
-		struct e820entry *ei = &e820.map[i];
-		unsigned long last, addr;
-
-		if (ei->type != E820_RAM ||
-		    ei->addr+ei->size <= start ||
-		    ei->addr >= end)
-			continue;
-
-		addr = round_up(ei->addr, PAGE_SIZE);
-		if (addr < start)
-			addr = start;
-
-		last = round_down(ei->addr + ei->size, PAGE_SIZE);
-		if (last >= end)
-			last = end;
-
-		if (last > addr)
-			ram += last - addr;
-	}
-	return ((end - start) - ram) >> PAGE_SHIFT;
-}
-
 /*
  * Mark e820 reserved areas as busy for the resource manager.
  */
@@ -288,6 +211,34 @@ void __init e820_reserve_resources(void)
 	}
 }
 
+/* Walk the e820 map and register active regions within a node */
+void __init
+e820_register_active_regions(int nid, unsigned long start_pfn,
+							unsigned long end_pfn)
+{
+	int i;
+	unsigned long ei_startpfn, ei_endpfn;
+	for (i = 0; i < e820.nr_map; i++) {
+		struct e820entry *ei = &e820.map[i];
+		ei_startpfn = round_up(ei->addr, PAGE_SIZE) >> PAGE_SHIFT;
+		ei_endpfn = round_down(ei->addr + ei->size, PAGE_SIZE)
+								>> PAGE_SHIFT;
+		/* Skip if map is outside the node */
+		if (ei->type != E820_RAM ||
+				ei_endpfn <= start_pfn ||
+				ei_startpfn >= end_pfn)
+			continue;
+
+		/* Check for overlaps */
+		if (ei_startpfn < start_pfn)
+			ei_startpfn = start_pfn;
+		if (ei_endpfn > end_pfn)
+			ei_endpfn = end_pfn;
+
+		add_active_range(nid, ei_startpfn, ei_endpfn);
+	}
+}
+
 /* 
  * Add a memory region to the kernel e820 map.
  */ 
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/arch/x86_64/kernel/setup.c linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/arch/x86_64/kernel/setup.c
--- linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/arch/x86_64/kernel/setup.c	2006-05-01 11:36:58.000000000 +0100
+++ linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/arch/x86_64/kernel/setup.c	2006-05-08 09:20:01.000000000 +0100
@@ -475,7 +475,8 @@ contig_initmem_init(unsigned long start_
 	if (bootmap == -1L)
 		panic("Cannot find bootmem map of size %ld\n",bootmap_size);
 	bootmap_size = init_bootmem(bootmap >> PAGE_SHIFT, end_pfn);
-	e820_bootmem_free(NODE_DATA(0), 0, end_pfn << PAGE_SHIFT);
+	e820_register_active_regions(0, start_pfn, end_pfn);
+	free_bootmem_with_active_regions(0, end_pfn);
 	reserve_bootmem(bootmap, bootmap_size);
 } 
 #endif
@@ -645,6 +646,7 @@ void __init setup_arch(char **cmdline_p)
 
 	early_identify_cpu(&boot_cpu_data);
 
+	e820_register_active_regions(0, 0, -1UL);
 	/*
 	 * partially used pages are not usable - thus
 	 * we are rounding upwards:
@@ -668,6 +670,9 @@ void __init setup_arch(char **cmdline_p)
 	acpi_boot_table_init();
 #endif
 
+	/* Remove active ranges so rediscovery with NUMA-awareness happens */
+	remove_all_active_ranges();
+
 #ifdef CONFIG_ACPI_NUMA
 	/*
 	 * Parse SRAT to discover nodes.
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/arch/x86_64/mm/init.c linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/arch/x86_64/mm/init.c
--- linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/arch/x86_64/mm/init.c	2006-05-01 11:36:58.000000000 +0100
+++ linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/arch/x86_64/mm/init.c	2006-05-08 09:20:01.000000000 +0100
@@ -406,69 +406,12 @@ void __cpuinit zap_low_mappings(int cpu)
 	__flush_tlb_all();
 }
 
-/* Compute zone sizes for the DMA and DMA32 zones in a node. */
-__init void
-size_zones(unsigned long *z, unsigned long *h,
-	   unsigned long start_pfn, unsigned long end_pfn)
-{
- 	int i;
- 	unsigned long w;
-
- 	for (i = 0; i < MAX_NR_ZONES; i++)
- 		z[i] = 0;
-
- 	if (start_pfn < MAX_DMA_PFN)
- 		z[ZONE_DMA] = MAX_DMA_PFN - start_pfn;
- 	if (start_pfn < MAX_DMA32_PFN) {
- 		unsigned long dma32_pfn = MAX_DMA32_PFN;
- 		if (dma32_pfn > end_pfn)
- 			dma32_pfn = end_pfn;
- 		z[ZONE_DMA32] = dma32_pfn - start_pfn;
- 	}
- 	z[ZONE_NORMAL] = end_pfn - start_pfn;
-
- 	/* Remove lower zones from higher ones. */
- 	w = 0;
- 	for (i = 0; i < MAX_NR_ZONES; i++) {
- 		if (z[i])
- 			z[i] -= w;
- 	        w += z[i];
-	}
-
-	/* Compute holes */
-	w = start_pfn;
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		unsigned long s = w;
-		w += z[i];
-		h[i] = e820_hole_size(s, w);
-	}
-
-	/* Add the space pace needed for mem_map to the holes too. */
-	for (i = 0; i < MAX_NR_ZONES; i++)
-		h[i] += (z[i] * sizeof(struct page)) / PAGE_SIZE;
-
-	/* The 16MB DMA zone has the kernel and other misc mappings.
- 	   Account them too */
-	if (h[ZONE_DMA]) {
-		h[ZONE_DMA] += dma_reserve;
-		if (h[ZONE_DMA] >= z[ZONE_DMA]) {
-			printk(KERN_WARNING
-				"Kernel too large and filling up ZONE_DMA?\n");
-			h[ZONE_DMA] = z[ZONE_DMA];
-		}
-	}
-}
-
 #ifndef CONFIG_NUMA
 void __init paging_init(void)
 {
-	unsigned long zones[MAX_NR_ZONES], holes[MAX_NR_ZONES];
-
 	memory_present(0, 0, end_pfn);
 	sparse_init();
-	size_zones(zones, holes, 0, end_pfn);
-	free_area_init_node(0, NODE_DATA(0), zones,
-			    __pa(PAGE_OFFSET) >> PAGE_SHIFT, holes);
+	free_area_init_nodes(MAX_DMA_PFN, MAX_DMA32_PFN, end_pfn, end_pfn);
 }
 #endif
 
@@ -620,7 +563,8 @@ void __init mem_init(void)
 #else
 	totalram_pages = free_all_bootmem();
 #endif
-	reservedpages = end_pfn - totalram_pages - e820_hole_size(0, end_pfn);
+	reservedpages = end_pfn - totalram_pages -
+					absent_pages_in_range(0, end_pfn);
 
 	after_bootmem = 1;
 
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/arch/x86_64/mm/k8topology.c linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/arch/x86_64/mm/k8topology.c
--- linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/arch/x86_64/mm/k8topology.c	2006-04-27 03:19:25.000000000 +0100
+++ linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/arch/x86_64/mm/k8topology.c	2006-05-08 09:20:01.000000000 +0100
@@ -146,6 +146,9 @@ int __init k8_scan_nodes(unsigned long s
 		
 		nodes[nodeid].start = base; 
 		nodes[nodeid].end = limit;
+		e820_register_active_regions(nodeid,
+				nodes[nodeid].start >> PAGE_SHIFT,
+				nodes[nodeid].end >> PAGE_SHIFT);
 
 		prevbase = base;
 
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/arch/x86_64/mm/numa.c linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/arch/x86_64/mm/numa.c
--- linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/arch/x86_64/mm/numa.c	2006-04-27 03:19:25.000000000 +0100
+++ linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/arch/x86_64/mm/numa.c	2006-05-08 09:20:01.000000000 +0100
@@ -161,7 +161,7 @@ void __init setup_node_bootmem(int nodei
 					 bootmap_start >> PAGE_SHIFT, 
 					 start_pfn, end_pfn); 
 
-	e820_bootmem_free(NODE_DATA(nodeid), start, end);
+	free_bootmem_with_active_regions(nodeid, end);
 
 	reserve_bootmem_node(NODE_DATA(nodeid), nodedata_phys, pgdat_size); 
 	reserve_bootmem_node(NODE_DATA(nodeid), bootmap_start, bootmap_pages<<PAGE_SHIFT);
@@ -175,13 +175,11 @@ void __init setup_node_bootmem(int nodei
 void __init setup_node_zones(int nodeid)
 { 
 	unsigned long start_pfn, end_pfn, memmapsize, limit;
-	unsigned long zones[MAX_NR_ZONES];
-	unsigned long holes[MAX_NR_ZONES];
 
  	start_pfn = node_start_pfn(nodeid);
  	end_pfn = node_end_pfn(nodeid);
 
-	Dprintk(KERN_INFO "Setting up node %d %lx-%lx\n",
+	Dprintk(KERN_INFO "Setting up memmap for node %d %lx-%lx\n",
 		nodeid, start_pfn, end_pfn);
 
 	/* Try to allocate mem_map at end to not fill up precious <4GB
@@ -195,10 +193,6 @@ void __init setup_node_zones(int nodeid)
 				round_down(limit - memmapsize, PAGE_SIZE), 
 				limit);
 #endif
-
-	size_zones(zones, holes, start_pfn, end_pfn);
-	free_area_init_node(nodeid, NODE_DATA(nodeid), zones,
-			    start_pfn, holes);
 } 
 
 void __init numa_init_array(void)
@@ -259,8 +253,11 @@ static int numa_emulation(unsigned long 
  		printk(KERN_ERR "No NUMA hash function found. Emulation disabled.\n");
  		return -1;
  	}
- 	for_each_online_node(i)
+ 	for_each_online_node(i) {
+		e820_register_active_regions(i, nodes[i].start >> PAGE_SHIFT,
+						nodes[i].end >> PAGE_SHIFT);
  		setup_node_bootmem(i, nodes[i].start, nodes[i].end);
+	}
  	numa_init_array();
  	return 0;
 }
@@ -299,6 +296,7 @@ void __init numa_initmem_init(unsigned l
 	for (i = 0; i < NR_CPUS; i++)
 		numa_set_node(i, 0);
 	node_to_cpumask[0] = cpumask_of_cpu(0);
+	e820_register_active_regions(0, start_pfn, end_pfn);
 	setup_node_bootmem(0, start_pfn << PAGE_SHIFT, end_pfn << PAGE_SHIFT);
 }
 
@@ -346,6 +344,8 @@ void __init paging_init(void)
 	for_each_online_node(i) {
 		setup_node_zones(i); 
 	}
+
+	free_area_init_nodes(MAX_DMA_PFN, MAX_DMA32_PFN, end_pfn, end_pfn);
 } 
 
 /* [numa=off] */
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/arch/x86_64/mm/srat.c linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c
--- linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/arch/x86_64/mm/srat.c	2006-05-01 11:36:58.000000000 +0100
+++ linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c	2006-05-08 09:20:01.000000000 +0100
@@ -87,6 +87,7 @@ static __init void bad_srat(void)
 		apicid_to_node[i] = NUMA_NO_NODE;
 	for (i = 0; i < MAX_NUMNODES; i++)
 		nodes_add[i].start = nodes[i].end = 0;
+	remove_all_active_ranges();
 }
 
 static __init inline int srat_disabled(void)
@@ -168,7 +169,7 @@ static int hotadd_enough_memory(struct b
 
 	if (mem < 0)
 		return 0;
-	allowed = (end_pfn - e820_hole_size(0, end_pfn)) * PAGE_SIZE;
+	allowed = (end_pfn - absent_pages_in_range(0, end_pfn)) * PAGE_SIZE;
 	allowed = (allowed / 100) * hotadd_percent;
 	if (allocated + mem > allowed) {
 		/* Give them at least part of their hotadd memory upto hotadd_percent
@@ -216,7 +217,7 @@ static int reserve_hotadd(int node, unsi
 	}
 
 	/* This check might be a bit too strict, but I'm keeping it for now. */
-	if (e820_hole_size(s_pfn, e_pfn) != e_pfn - s_pfn) {
+	if (absent_pages_in_range(s_pfn, e_pfn) != e_pfn - s_pfn) {
 		printk(KERN_ERR "SRAT: Hotplug area has existing memory\n");
 		return -1;
 	}
@@ -310,6 +311,8 @@ acpi_numa_memory_affinity_init(struct ac
 
 	printk(KERN_INFO "SRAT: Node %u PXM %u %Lx-%Lx\n", node, pxm,
 	       nd->start, nd->end);
+	e820_register_active_regions(node, nd->start >> PAGE_SHIFT,
+						nd->end >> PAGE_SHIFT);
 
 #ifdef RESERVE_HOTADD
  	if (ma->flags.hot_pluggable && reserve_hotadd(node, start, end) < 0) {
@@ -334,13 +337,13 @@ static int nodes_cover_memory(void)
 		unsigned long s = nodes[i].start >> PAGE_SHIFT;
 		unsigned long e = nodes[i].end >> PAGE_SHIFT;
 		pxmram += e - s;
-		pxmram -= e820_hole_size(s, e);
+		pxmram -= absent_pages_in_range(s, e);
 		pxmram -= nodes_add[i].end - nodes_add[i].start;
 		if ((long)pxmram < 0)
 			pxmram = 0;
 	}
 
-	e820ram = end_pfn - e820_hole_size(0, end_pfn);
+	e820ram = end_pfn - absent_pages_in_range(0, end_pfn);
 	/* We seem to lose 3 pages somewhere. Allow a bit of slack. */
 	if ((long)(e820ram - pxmram) >= 1*1024*1024) {
 		printk(KERN_ERR
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/include/asm-x86_64/e820.h linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/include/asm-x86_64/e820.h
--- linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/include/asm-x86_64/e820.h	2006-04-27 03:19:25.000000000 +0100
+++ linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/include/asm-x86_64/e820.h	2006-05-08 09:20:01.000000000 +0100
@@ -50,10 +50,9 @@ extern void e820_print_map(char *who);
 extern int e820_any_mapped(unsigned long start, unsigned long end, unsigned type);
 extern int e820_all_mapped(unsigned long start, unsigned long end, unsigned type);
 
-extern void e820_bootmem_free(pg_data_t *pgdat, unsigned long start,unsigned long end);
 extern void e820_setup_gap(void);
-extern unsigned long e820_hole_size(unsigned long start_pfn,
-				    unsigned long end_pfn);
+extern void e820_register_active_regions(int nid,
+				unsigned long start_pfn, unsigned long end_pfn);
 
 extern void __init parse_memopt(char *p, char **end);
 extern void __init parse_memmapopt(char *p, char **end);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/include/asm-x86_64/proto.h linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/include/asm-x86_64/proto.h
--- linux-2.6.17-rc3-mm1-103-x86_use_init_nodes/include/asm-x86_64/proto.h	2006-05-01 11:37:01.000000000 +0100
+++ linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/include/asm-x86_64/proto.h	2006-05-08 09:20:01.000000000 +0100
@@ -24,8 +24,6 @@ extern void mtrr_bp_init(void);
 #define mtrr_bp_init() do {} while (0)
 #endif
 extern void init_memory_mapping(unsigned long start, unsigned long end);
-extern void size_zones(unsigned long *z, unsigned long *h,
-			unsigned long start_pfn, unsigned long end_pfn);
 
 extern void system_call(void); 
 extern int kernel_syscall(void);

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 5/6] Have ia64 use add_active_range() and free_area_init_nodes
  2006-05-08 14:10 [PATCH 0/6] Sizing zones and holes in an architecture independent manner V6 Mel Gorman
                   ` (3 preceding siblings ...)
  2006-05-08 14:11 ` [PATCH 4/6] Have x86_64 " Mel Gorman
@ 2006-05-08 14:12 ` Mel Gorman
  2006-05-15  3:31   ` Andrew Morton
  2006-05-08 14:12 ` [PATCH 6/6] Break out memory initialisation code from page_alloc.c to mem_init.c Mel Gorman
  5 siblings, 1 reply; 47+ messages in thread
From: Mel Gorman @ 2006-05-08 14:12 UTC (permalink / raw)
  To: akpm
  Cc: davej, tony.luck, Mel Gorman, linux-kernel, bob.picco, ak,
	linux-mm, linuxppc-dev


Size zones and memory holes in an architecture-independent manner for ia64.
The architecture registers its active PFN ranges with add_active_range() and
then calls free_area_init_nodes() instead of open-coding the zone and hole
calculations in contig.c and discontig.c.
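
In condensed form, the pattern the converted architectures follow looks like
this (an illustrative sketch only; the real contig.c and discontig.c changes
are in the diff below, and ia64 passes max_dma for both the DMA and DMA32
boundaries and max_low_pfn for both the Normal and HighMem boundaries because
it has neither a DMA32 nor a HighMem zone):

/* Callback for efi_memmap_walk(): record each usable range of memory */
static int __init
register_active_ranges(u64 start, u64 end, void *nid)
{
	add_active_range(*(unsigned long *)nid,
			 __pa(start) >> PAGE_SHIFT,
			 __pa(end) >> PAGE_SHIFT);
	return 0;
}

void __init paging_init(void)
{
	unsigned long nid = 0;
	unsigned long max_dma =
		virt_to_phys((void *) MAX_DMA_ADDRESS) >> PAGE_SHIFT;

	/* 1. Tell the core which page frames actually exist */
	efi_memmap_walk(register_active_ranges, &nid);

	/* 2. Hand over the zone boundary PFNs and let the core do the rest */
	free_area_init_nodes(max_dma, max_dma, max_low_pfn, max_low_pfn);
}

The zone and hole arithmetic that used to be duplicated per architecture now
happens inside free_area_init_nodes().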


 arch/ia64/Kconfig          |    3 ++
 arch/ia64/mm/contig.c      |   60 +++++-----------------------------------
 arch/ia64/mm/discontig.c   |   41 ++++-----------------------
 arch/ia64/mm/init.c        |   12 ++++++++
 include/asm-ia64/meminit.h |    1 
 5 files changed, 30 insertions(+), 87 deletions(-)

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/arch/ia64/Kconfig linux-2.6.17-rc3-mm1-105-ia64_use_init_nodes/arch/ia64/Kconfig
--- linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/arch/ia64/Kconfig	2006-05-01 11:36:54.000000000 +0100
+++ linux-2.6.17-rc3-mm1-105-ia64_use_init_nodes/arch/ia64/Kconfig	2006-05-08 09:21:02.000000000 +0100
@@ -353,6 +353,9 @@ config NODES_SHIFT
 	  MAX_NUMNODES will be 2^(This value).
 	  If in doubt, use the default.
 
+config ARCH_POPULATES_NODE_MAP
+	def_bool y
+
 # VIRTUAL_MEM_MAP and FLAT_NODE_MEM_MAP are functionally equivalent.
 # VIRTUAL_MEM_MAP has been retained for historical reasons.
 config VIRTUAL_MEM_MAP
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/arch/ia64/mm/contig.c linux-2.6.17-rc3-mm1-105-ia64_use_init_nodes/arch/ia64/mm/contig.c
--- linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/arch/ia64/mm/contig.c	2006-04-27 03:19:25.000000000 +0100
+++ linux-2.6.17-rc3-mm1-105-ia64_use_init_nodes/arch/ia64/mm/contig.c	2006-05-08 09:21:02.000000000 +0100
@@ -26,10 +26,6 @@
 #include <asm/sections.h>
 #include <asm/mca.h>
 
-#ifdef CONFIG_VIRTUAL_MEM_MAP
-static unsigned long num_dma_physpages;
-#endif
-
 /**
  * show_mem - display a memory statistics summary
  *
@@ -212,18 +208,6 @@ count_pages (u64 start, u64 end, void *a
 	return 0;
 }
 
-#ifdef CONFIG_VIRTUAL_MEM_MAP
-static int
-count_dma_pages (u64 start, u64 end, void *arg)
-{
-	unsigned long *count = arg;
-
-	if (start < MAX_DMA_ADDRESS)
-		*count += (min(end, MAX_DMA_ADDRESS) - start) >> PAGE_SHIFT;
-	return 0;
-}
-#endif
-
 /*
  * Set up the page tables.
  */
@@ -232,47 +216,24 @@ void __init
 paging_init (void)
 {
 	unsigned long max_dma;
-	unsigned long zones_size[MAX_NR_ZONES];
 #ifdef CONFIG_VIRTUAL_MEM_MAP
-	unsigned long zholes_size[MAX_NR_ZONES];
+	unsigned long nid = 0;
 	unsigned long max_gap;
 #endif
 
-	/* initialize mem_map[] */
-
-	memset(zones_size, 0, sizeof(zones_size));
-
 	num_physpages = 0;
 	efi_memmap_walk(count_pages, &num_physpages);
 
 	max_dma = virt_to_phys((void *) MAX_DMA_ADDRESS) >> PAGE_SHIFT;
 
 #ifdef CONFIG_VIRTUAL_MEM_MAP
-	memset(zholes_size, 0, sizeof(zholes_size));
-
-	num_dma_physpages = 0;
-	efi_memmap_walk(count_dma_pages, &num_dma_physpages);
-
-	if (max_low_pfn < max_dma) {
-		zones_size[ZONE_DMA] = max_low_pfn;
-		zholes_size[ZONE_DMA] = max_low_pfn - num_dma_physpages;
-	} else {
-		zones_size[ZONE_DMA] = max_dma;
-		zholes_size[ZONE_DMA] = max_dma - num_dma_physpages;
-		if (num_physpages > num_dma_physpages) {
-			zones_size[ZONE_NORMAL] = max_low_pfn - max_dma;
-			zholes_size[ZONE_NORMAL] =
-				((max_low_pfn - max_dma) -
-				 (num_physpages - num_dma_physpages));
-		}
-	}
-
 	max_gap = 0;
+	efi_memmap_walk(register_active_ranges, &nid);
 	efi_memmap_walk(find_largest_hole, (u64 *)&max_gap);
 	if (max_gap < LARGE_GAP) {
 		vmem_map = (struct page *) 0;
-		free_area_init_node(0, NODE_DATA(0), zones_size, 0,
-				    zholes_size);
+		free_area_init_nodes(max_dma, max_dma,
+				max_low_pfn, max_low_pfn);
 	} else {
 		unsigned long map_size;
 
@@ -284,19 +245,14 @@ paging_init (void)
 		efi_memmap_walk(create_mem_map_page_table, NULL);
 
 		NODE_DATA(0)->node_mem_map = vmem_map;
-		free_area_init_node(0, NODE_DATA(0), zones_size,
-				    0, zholes_size);
+		free_area_init_nodes(max_dma, max_dma,
+				max_low_pfn, max_low_pfn);
 
 		printk("Virtual mem_map starts at 0x%p\n", mem_map);
 	}
 #else /* !CONFIG_VIRTUAL_MEM_MAP */
-	if (max_low_pfn < max_dma)
-		zones_size[ZONE_DMA] = max_low_pfn;
-	else {
-		zones_size[ZONE_DMA] = max_dma;
-		zones_size[ZONE_NORMAL] = max_low_pfn - max_dma;
-	}
-	free_area_init(zones_size);
+	add_active_range(0, 0, max_low_pfn);
+	free_area_init_nodes(max_dma, max_dma, max_low_pfn, max_low_pfn);
 #endif /* !CONFIG_VIRTUAL_MEM_MAP */
 	zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
 }
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/arch/ia64/mm/discontig.c linux-2.6.17-rc3-mm1-105-ia64_use_init_nodes/arch/ia64/mm/discontig.c
--- linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/arch/ia64/mm/discontig.c	2006-04-27 03:19:25.000000000 +0100
+++ linux-2.6.17-rc3-mm1-105-ia64_use_init_nodes/arch/ia64/mm/discontig.c	2006-05-08 09:21:02.000000000 +0100
@@ -700,6 +700,7 @@ static __init int count_node_pages(unsig
 {
 	unsigned long end = start + len;
 
+	add_active_range(node, start >> PAGE_SHIFT, end >> PAGE_SHIFT);
 	mem_data[node].num_physpages += len >> PAGE_SHIFT;
 	if (start <= __pa(MAX_DMA_ADDRESS))
 		mem_data[node].num_dma_physpages +=
@@ -724,9 +725,8 @@ static __init int count_node_pages(unsig
 void __init paging_init(void)
 {
 	unsigned long max_dma;
-	unsigned long zones_size[MAX_NR_ZONES];
-	unsigned long zholes_size[MAX_NR_ZONES];
 	unsigned long pfn_offset = 0;
+	unsigned long max_pfn = 0;
 	int node;
 
 	max_dma = virt_to_phys((void *) MAX_DMA_ADDRESS) >> PAGE_SHIFT;
@@ -743,46 +743,17 @@ void __init paging_init(void)
 #endif
 
 	for_each_online_node(node) {
-		memset(zones_size, 0, sizeof(zones_size));
-		memset(zholes_size, 0, sizeof(zholes_size));
-
 		num_physpages += mem_data[node].num_physpages;
-
-		if (mem_data[node].min_pfn >= max_dma) {
-			/* All of this node's memory is above ZONE_DMA */
-			zones_size[ZONE_NORMAL] = mem_data[node].max_pfn -
-				mem_data[node].min_pfn;
-			zholes_size[ZONE_NORMAL] = mem_data[node].max_pfn -
-				mem_data[node].min_pfn -
-				mem_data[node].num_physpages;
-		} else if (mem_data[node].max_pfn < max_dma) {
-			/* All of this node's memory is in ZONE_DMA */
-			zones_size[ZONE_DMA] = mem_data[node].max_pfn -
-				mem_data[node].min_pfn;
-			zholes_size[ZONE_DMA] = mem_data[node].max_pfn -
-				mem_data[node].min_pfn -
-				mem_data[node].num_dma_physpages;
-		} else {
-			/* This node has memory in both zones */
-			zones_size[ZONE_DMA] = max_dma -
-				mem_data[node].min_pfn;
-			zholes_size[ZONE_DMA] = zones_size[ZONE_DMA] -
-				mem_data[node].num_dma_physpages;
-			zones_size[ZONE_NORMAL] = mem_data[node].max_pfn -
-				max_dma;
-			zholes_size[ZONE_NORMAL] = zones_size[ZONE_NORMAL] -
-				(mem_data[node].num_physpages -
-				 mem_data[node].num_dma_physpages);
-		}
-
 		pfn_offset = mem_data[node].min_pfn;
 
 #ifdef CONFIG_VIRTUAL_MEM_MAP
 		NODE_DATA(node)->node_mem_map = vmem_map + pfn_offset;
 #endif
-		free_area_init_node(node, NODE_DATA(node), zones_size,
-				    pfn_offset, zholes_size);
+		if (mem_data[node].max_pfn > max_pfn)
+			max_pfn = mem_data[node].max_pfn;
 	}
 
+	free_area_init_nodes(max_dma, max_dma, max_pfn, max_pfn);
+
 	zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
 }
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/arch/ia64/mm/init.c linux-2.6.17-rc3-mm1-105-ia64_use_init_nodes/arch/ia64/mm/init.c
--- linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/arch/ia64/mm/init.c	2006-05-01 11:36:54.000000000 +0100
+++ linux-2.6.17-rc3-mm1-105-ia64_use_init_nodes/arch/ia64/mm/init.c	2006-05-08 09:21:02.000000000 +0100
@@ -539,6 +539,18 @@ find_largest_hole (u64 start, u64 end, v
 	last_end = end;
 	return 0;
 }
+
+int __init
+register_active_ranges(u64 start, u64 end, void *nid)
+{
+	BUG_ON(nid == NULL);
+	BUG_ON(*(unsigned long *)nid >= MAX_NUMNODES);
+
+	add_active_range(*(unsigned long *)nid,
+				__pa(start) >> PAGE_SHIFT,
+				__pa(end) >> PAGE_SHIFT);
+	return 0;
+}
 #endif /* CONFIG_VIRTUAL_MEM_MAP */
 
 static int __init
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/include/asm-ia64/meminit.h linux-2.6.17-rc3-mm1-105-ia64_use_init_nodes/include/asm-ia64/meminit.h
--- linux-2.6.17-rc3-mm1-104-x86_64_use_init_nodes/include/asm-ia64/meminit.h	2006-04-27 03:19:25.000000000 +0100
+++ linux-2.6.17-rc3-mm1-105-ia64_use_init_nodes/include/asm-ia64/meminit.h	2006-05-08 09:21:02.000000000 +0100
@@ -56,6 +56,7 @@ extern void efi_memmap_init(unsigned lon
   extern unsigned long vmalloc_end;
   extern struct page *vmem_map;
   extern int find_largest_hole (u64 start, u64 end, void *arg);
+  extern int register_active_ranges (u64 start, u64 end, void *arg);
   extern int create_mem_map_page_table (u64 start, u64 end, void *arg);
 #endif
 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 6/6] Break out memory initialisation code from page_alloc.c to mem_init.c
  2006-05-08 14:10 [PATCH 0/6] Sizing zones and holes in an architecture independent manner V6 Mel Gorman
                   ` (4 preceding siblings ...)
  2006-05-08 14:12 ` [PATCH 5/6] Have ia64 " Mel Gorman
@ 2006-05-08 14:12 ` Mel Gorman
  2006-05-09  1:47   ` Nick Piggin
  5 siblings, 1 reply; 47+ messages in thread
From: Mel Gorman @ 2006-05-08 14:12 UTC (permalink / raw)
  To: akpm
  Cc: davej, tony.luck, Mel Gorman, ak, bob.picco, linux-kernel,
	linuxppc-dev, linux-mm


page_alloc.c contains a large amount of memory initialisation code, including
the architecture-independent zone-sizing code added earlier in this series.
This patch breaks the initialisation code out into a separate file,
mm/mem_init.c, so that page_alloc.c is a bit easier to read.
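
To make the hole accounting in the moved code easier to follow, here is a
small userspace toy (not kernel code, and much simplified: no merging and no
per-node handling) that mimics how registered active PFN ranges turn into
per-zone hole counts; compare absent_pages_in_range() and
__absent_pages_in_range() in the new file below.

#include <stdio.h>

#define MAX_REGIONS 16

struct region {
	unsigned long start_pfn;	/* inclusive */
	unsigned long end_pfn;		/* exclusive */
};

static struct region map[MAX_REGIONS];
static int nr_regions;

/* crude stand-in for add_active_range(): just append, no merging */
static void add_range(unsigned long start_pfn, unsigned long end_pfn)
{
	map[nr_regions].start_pfn = start_pfn;
	map[nr_regions].end_pfn = end_pfn;
	nr_regions++;
}

/* pages missing from [start_pfn, end_pfn), like absent_pages_in_range() */
static unsigned long absent_pages(unsigned long start_pfn,
				  unsigned long end_pfn)
{
	unsigned long present = 0;
	int i;

	for (i = 0; i < nr_regions; i++) {
		unsigned long s = map[i].start_pfn;
		unsigned long e = map[i].end_pfn;

		if (e <= start_pfn || s >= end_pfn)
			continue;	/* entirely outside the window */
		if (s < start_pfn)
			s = start_pfn;	/* clip to the window */
		if (e > end_pfn)
			e = end_pfn;
		present += e - s;
	}
	return (end_pfn - start_pfn) - present;
}

int main(void)
{
	/* two banks of RAM with a hole from pfn 0x400 to pfn 0x800 */
	add_range(0x000, 0x400);
	add_range(0x800, 0x1000);

	/* zone boundary pfns an architecture would hand to the core */
	printf("DMA    holes: %lu pages\n", absent_pages(0x000, 0x600));
	printf("Normal holes: %lu pages\n", absent_pages(0x600, 0x1000));
	return 0;
}

This prints 512 pages of holes in each zone: the 0x400-page gap straddles the
boundary at pfn 0x600, so half of it is charged to each zone, mirroring how
__absent_pages_in_range() clips holes to the zone boundaries.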


 Makefile     |    2 
 mem_init.c   | 1123 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 page_alloc.c | 1105 -----------------------------------------------------
 3 files changed, 1124 insertions(+), 1106 deletions(-)

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-105-ia64_use_init_nodes/mm/Makefile linux-2.6.17-rc3-mm1-106-breakout_mem_init/mm/Makefile
--- linux-2.6.17-rc3-mm1-105-ia64_use_init_nodes/mm/Makefile	2006-05-01 11:37:01.000000000 +0100
+++ linux-2.6.17-rc3-mm1-106-breakout_mem_init/mm/Makefile	2006-05-08 09:22:02.000000000 +0100
@@ -8,7 +8,7 @@ mmu-$(CONFIG_MMU)	:= fremap.o highmem.o 
 			   vmalloc.o
 
 obj-y			:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
-			   page_alloc.o page-writeback.o pdflush.o \
+			   page_alloc.o mem_init.o page-writeback.o pdflush.o \
 			   readahead.o swap.o truncate.o vmscan.o \
 			   prio_tree.o util.o mmzone.o $(mmu-y)
 
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-105-ia64_use_init_nodes/mm/mem_init.c linux-2.6.17-rc3-mm1-106-breakout_mem_init/mm/mem_init.c
--- linux-2.6.17-rc3-mm1-105-ia64_use_init_nodes/mm/mem_init.c	2006-05-01 11:51:50.000000000 +0100
+++ linux-2.6.17-rc3-mm1-106-breakout_mem_init/mm/mem_init.c	2006-05-08 11:00:23.000000000 +0100
@@ -0,0 +1,1123 @@
+/*
+ * mm/mem_init.c
+ * Initialises the architecture independent view of memory: pgdats, zones, etc.
+ *
+ *  Copyright (C) 1991, 1992, 1993, 1994  Linus Torvalds
+ *  Copyright (C) 1995, Stephen Tweedie
+ *  Copyright (C) July 1999, Gerhard Wichert, Siemens AG
+ *  Copyright (C) 1999, Ingo Molnar, Red Hat
+ *  Copyright (C) 1999, 2000, Kanoj Sarcar, SGI
+ *  Copyright (C) Sept 2000, Martin J. Bligh
+ *	(lots of bits borrowed from Ingo Molnar & Andrew Morton)
+ *  Copyright (C) Apr 2006, Mel Gorman, IBM
+ *	(lots of bits taken from architecture-specific code)
+ */
+#include <linux/config.h>
+#include <linux/sort.h>
+#include <linux/pfn.h>
+#include <linux/mm.h>
+#include <linux/bootmem.h>
+#include <linux/cpuset.h>
+#include <linux/mempolicy.h>
+#include <linux/sysctl.h>
+#include <linux/swap.h>
+#include <linux/cpu.h>
+#include <linux/stop_machine.h>
+#include <linux/vmalloc.h>
+
+static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
+int percpu_pagelist_fraction;
+
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+  #ifdef CONFIG_MAX_ACTIVE_REGIONS
+    #define MAX_ACTIVE_REGIONS CONFIG_MAX_ACTIVE_REGIONS
+  #else
+    #define MAX_ACTIVE_REGIONS (MAX_NR_ZONES * MAX_NUMNODES + 1)
+  #endif
+
+  struct node_active_region __initdata early_node_map[MAX_ACTIVE_REGIONS];
+  unsigned long __initdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
+  unsigned long __initdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
+
+/*
+ * Builds allocation fallback zone lists.
+ *
+ * Add all populated zones of a node to the zonelist.
+ */
+static int __meminit build_zonelists_node(pg_data_t *pgdat,
+			struct zonelist *zonelist, int nr_zones, int zone_type)
+{
+	struct zone *zone;
+
+	BUG_ON(zone_type > ZONE_HIGHMEM);
+
+	do {
+		zone = pgdat->node_zones + zone_type;
+		if (populated_zone(zone)) {
+#ifndef CONFIG_HIGHMEM
+			BUG_ON(zone_type > ZONE_NORMAL);
+#endif
+			zonelist->zones[nr_zones++] = zone;
+			check_highest_zone(zone_type);
+		}
+		zone_type--;
+
+	} while (zone_type >= 0);
+	return nr_zones;
+}
+
+static inline int highest_zone(int zone_bits)
+{
+	int res = ZONE_NORMAL;
+	if (zone_bits & (__force int)__GFP_HIGHMEM)
+		res = ZONE_HIGHMEM;
+	if (zone_bits & (__force int)__GFP_DMA32)
+		res = ZONE_DMA32;
+	if (zone_bits & (__force int)__GFP_DMA)
+		res = ZONE_DMA;
+	return res;
+}
+
+#ifdef CONFIG_NUMA
+#define MAX_NODE_LOAD (num_online_nodes())
+static int __meminitdata node_load[MAX_NUMNODES];
+/**
+ * find_next_best_node - find the next node that should appear in a given node's fallback list
+ * @node: node whose fallback list we're appending
+ * @used_node_mask: nodemask_t of already used nodes
+ *
+ * We use a number of factors to determine which is the next node that should
+ * appear on a given node's fallback list.  The node should not have appeared
+ * already in @node's fallback list, and it should be the next closest node
+ * according to the distance array (which contains arbitrary distance values
+ * from each node to each node in the system), and should also prefer nodes
+ * with no CPUs, since presumably they'll have very little allocation pressure
+ * on them otherwise.
+ * It returns -1 if no node is found.
+ */
+static int __meminit find_next_best_node(int node, nodemask_t *used_node_mask)
+{
+	int n, val;
+	int min_val = INT_MAX;
+	int best_node = -1;
+
+	/* Use the local node if we haven't already */
+	if (!node_isset(node, *used_node_mask)) {
+		node_set(node, *used_node_mask);
+		return node;
+	}
+
+	for_each_online_node(n) {
+		cpumask_t tmp;
+
+		/* Don't want a node to appear more than once */
+		if (node_isset(n, *used_node_mask))
+			continue;
+
+		/* Use the distance array to find the distance */
+		val = node_distance(node, n);
+
+		/* Penalize nodes under us ("prefer the next node") */
+		val += (n < node);
+
+		/* Give preference to headless and unused nodes */
+		tmp = node_to_cpumask(n);
+		if (!cpus_empty(tmp))
+			val += PENALTY_FOR_NODE_WITH_CPUS;
+
+		/* Slight preference for less loaded node */
+		val *= (MAX_NODE_LOAD*MAX_NUMNODES);
+		val += node_load[n];
+
+		if (val < min_val) {
+			min_val = val;
+			best_node = n;
+		}
+	}
+
+	if (best_node >= 0)
+		node_set(best_node, *used_node_mask);
+
+	return best_node;
+}
+
+static void __meminit build_zonelists(pg_data_t *pgdat)
+{
+	int i, j, k, node, local_node;
+	int prev_node, load;
+	struct zonelist *zonelist;
+	nodemask_t used_mask;
+
+	/* initialize zonelists */
+	for (i = 0; i < GFP_ZONETYPES; i++) {
+		zonelist = pgdat->node_zonelists + i;
+		zonelist->zones[0] = NULL;
+	}
+
+	/* NUMA-aware ordering of nodes */
+	local_node = pgdat->node_id;
+	load = num_online_nodes();
+	prev_node = local_node;
+	nodes_clear(used_mask);
+	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
+		int distance = node_distance(local_node, node);
+
+		/*
+		 * If another node is sufficiently far away then it is better
+		 * to reclaim pages in a zone before going off node.
+		 */
+		if (distance > RECLAIM_DISTANCE)
+			zone_reclaim_mode = 1;
+
+		/*
+		 * We don't want to pressure a particular node.
+		 * So adding penalty to the first node in same
+		 * distance group to make it round-robin.
+		 */
+
+		if (distance != node_distance(local_node, prev_node))
+			node_load[node] += load;
+		prev_node = node;
+		load--;
+		for (i = 0; i < GFP_ZONETYPES; i++) {
+			zonelist = pgdat->node_zonelists + i;
+			for (j = 0; zonelist->zones[j] != NULL; j++);
+
+			k = highest_zone(i);
+
+	 		j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
+			zonelist->zones[j] = NULL;
+		}
+	}
+}
+
+#else	/* CONFIG_NUMA */
+
+static void __meminit build_zonelists(pg_data_t *pgdat)
+{
+	int i, j, k, node, local_node;
+
+	local_node = pgdat->node_id;
+	for (i = 0; i < GFP_ZONETYPES; i++) {
+		struct zonelist *zonelist;
+
+		zonelist = pgdat->node_zonelists + i;
+
+		j = 0;
+		k = highest_zone(i);
+ 		j = build_zonelists_node(pgdat, zonelist, j, k);
+ 		/*
+ 		 * Now we build the zonelist so that it contains the zones
+ 		 * of all the other nodes.
+ 		 * We don't want to pressure a particular node, so when
+ 		 * building the zones for node N, we make sure that the
+ 		 * zones coming right after the local ones are those from
+ 		 * node N+1 (modulo N)
+ 		 */
+		for (node = local_node + 1; node < MAX_NUMNODES; node++) {
+			if (!node_online(node))
+				continue;
+			j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
+		}
+		for (node = 0; node < local_node; node++) {
+			if (!node_online(node))
+				continue;
+			j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
+		}
+
+		zonelist->zones[j] = NULL;
+	}
+}
+
+#endif	/* CONFIG_NUMA */
+
+/* return values int ....just for stop_machine_run() */
+static int __meminit __build_all_zonelists(void *dummy)
+{
+	int nid;
+	for_each_online_node(nid)
+		build_zonelists(NODE_DATA(nid));
+	return 0;
+}
+
+void __meminit build_all_zonelists(void)
+{
+	if (system_state == SYSTEM_BOOTING) {
+		__build_all_zonelists(0);
+		cpuset_init_current_mems_allowed();
+	} else {
+		/* we have to stop all cpus to guarantee there is no user
+		   of zonelist */
+		stop_machine_run(__build_all_zonelists, NULL, NR_CPUS);
+		/* cpuset refresh routine should be here */
+	}
+
+	printk("Built %i zonelists\n", num_online_nodes());
+
+}
+
+/*
+ * Helper functions to size the waitqueue hash table.
+ * Essentially these want to choose hash table sizes sufficiently
+ * large so that collisions trying to wait on pages are rare.
+ * But in fact, the number of active page waitqueues on typical
+ * systems is ridiculously low, less than 200. So this is even
+ * conservative, even though it seems large.
+ *
+ * The constant PAGES_PER_WAITQUEUE specifies the ratio of pages to
+ * waitqueues, i.e. the size of the waitq table given the number of pages.
+ */
+#define PAGES_PER_WAITQUEUE	256
+
+#ifndef CONFIG_MEMORY_HOTPLUG
+static inline unsigned long wait_table_hash_nr_entries(unsigned long pages)
+{
+	unsigned long size = 1;
+
+	pages /= PAGES_PER_WAITQUEUE;
+
+	while (size < pages)
+		size <<= 1;
+
+	/*
+	 * Once we have dozens or even hundreds of threads sleeping
+	 * on IO we've got bigger problems than wait queue collision.
+	 * Limit the size of the wait table to a reasonable size.
+	 */
+	size = min(size, 4096UL);
+
+	return max(size, 4UL);
+}
+#else
+/*
+ * A zone's size might be changed by hot-add, so it is not possible to determine
+ * a suitable size for its wait_table.  So we use the maximum size now.
+ *
+ * The max wait table size = 4096 x sizeof(wait_queue_head_t).   ie:
+ *
+ *    i386 (preemption config)    : 4096 x 16 = 64Kbyte.
+ *    ia64, x86-64 (no preemption): 4096 x 20 = 80Kbyte.
+ *    ia64, x86-64 (preemption)   : 4096 x 24 = 96Kbyte.
+ *
+ * The maximum entries are prepared when a zone's memory is (512K + 256) pages
+ * or more by the traditional way. (See above).  It equals:
+ *
+ *    i386, x86-64, powerpc(4K page size) : =  ( 2G + 1M)byte.
+ *    ia64(16K page size)                 : =  ( 8G + 4M)byte.
+ *    powerpc (64K page size)             : =  (32G +16M)byte.
+ */
+static inline unsigned long wait_table_hash_nr_entries(unsigned long pages)
+{
+	return 4096UL;
+}
+#endif
+
+/*
+ * This is an integer logarithm so that shifts can be used later
+ * to extract the more random high bits from the multiplicative
+ * hash function before the remainder is taken.
+ */
+static inline unsigned long wait_table_bits(unsigned long size)
+{
+	return ffz(~size);
+}
+
+#define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
+
+#ifndef __HAVE_ARCH_MEMMAP_INIT
+#define memmap_init(size, nid, zone, start_pfn) \
+	memmap_init_zone((size), (nid), (zone), (start_pfn))
+#endif
+
+/*
+ * Initially all pages are reserved - free ones are freed
+ * up by free_all_bootmem() once the early boot process is
+ * done. Non-atomic initialization, single-pass.
+ */
+void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
+		unsigned long start_pfn)
+{
+	struct page *page;
+	unsigned long end_pfn = start_pfn + size;
+	unsigned long pfn;
+
+	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+		if (!early_pfn_valid(pfn))
+			continue;
+		page = pfn_to_page(pfn);
+		set_page_links(page, zone, nid, pfn);
+		init_page_count(page);
+		reset_page_mapcount(page);
+		SetPageReserved(page);
+		INIT_LIST_HEAD(&page->lru);
+#ifdef WANT_PAGE_VIRTUAL
+		/* The shift won't overflow because ZONE_NORMAL is below 4G. */
+		if (!is_highmem_idx(zone))
+			set_page_address(page, __va(pfn << PAGE_SHIFT));
+#endif
+	}
+}
+
+void zone_init_free_lists(struct pglist_data *pgdat, struct zone *zone,
+				unsigned long size)
+{
+	int order;
+	for (order = 0; order < MAX_ORDER ; order++) {
+		INIT_LIST_HEAD(&zone->free_area[order].free_list);
+		zone->free_area[order].nr_free = 0;
+	}
+}
+
+#define ZONETABLE_INDEX(x, zone_nr)	((x << ZONES_SHIFT) | zone_nr)
+void zonetable_add(struct zone *zone, int nid, int zid, unsigned long pfn,
+		unsigned long size)
+{
+	unsigned long snum = pfn_to_section_nr(pfn);
+	unsigned long end = pfn_to_section_nr(pfn + size);
+
+	if (FLAGS_HAS_NODE)
+		zone_table[ZONETABLE_INDEX(nid, zid)] = zone;
+	else
+		for (; snum <= end; snum++)
+			zone_table[ZONETABLE_INDEX(snum, zid)] = zone;
+}
+
+static __meminit
+int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
+{
+	int i;
+	struct pglist_data *pgdat = zone->zone_pgdat;
+	size_t alloc_size;
+
+	/*
+	 * The per-page waitqueue mechanism uses hashed waitqueues
+	 * per zone.
+	 */
+	zone->wait_table_hash_nr_entries =
+		 wait_table_hash_nr_entries(zone_size_pages);
+	zone->wait_table_bits =
+		wait_table_bits(zone->wait_table_hash_nr_entries);
+	alloc_size = zone->wait_table_hash_nr_entries
+					* sizeof(wait_queue_head_t);
+
+ 	if (system_state == SYSTEM_BOOTING) {
+		zone->wait_table = (wait_queue_head_t *)
+			alloc_bootmem_node(pgdat, alloc_size);
+	} else {
+		/*
+		 * This case means that a zone whose size was 0 gets new memory
+		 * via memory hot-add.
+		 * But it may be the case that a new node was hot-added.  In
+		 * this case vmalloc() will not be able to use this new node's
+		 * memory - this wait_table must be initialized to use this new
+		 * node itself as well.
+		 * To use this new node's memory, further consideration will be
+		 * necessary.
+		 */
+		zone->wait_table = (wait_queue_head_t *)vmalloc(alloc_size);
+	}
+	if (!zone->wait_table)
+		return -ENOMEM;
+
+	for(i = 0; i < zone->wait_table_hash_nr_entries; ++i)
+		init_waitqueue_head(zone->wait_table + i);
+
+	return 0;
+}
+
+/*
+ * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist
+ * to the value high for the pageset p.
+ */
+static void setup_pagelist_highmark(struct per_cpu_pageset *p,
+				unsigned long high)
+{
+	struct per_cpu_pages *pcp;
+
+	pcp = &p->pcp[0]; /* hot list */
+	pcp->high = high;
+	pcp->batch = max(1UL, high/4);
+	if ((high/4) > (PAGE_SHIFT * 8))
+		pcp->batch = PAGE_SHIFT * 8;
+}
+
+/*
+ * percpu_pagelist_fraction - changes the pcp->high for each zone on each
+ * cpu.  It is the fraction of total pages in each zone that a hot per cpu pagelist
+ * can have before it gets flushed back to buddy allocator.
+ */
+int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
+	struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+	struct zone *zone;
+	unsigned int cpu;
+	int ret;
+
+	ret = proc_dointvec_minmax(table, write, file, buffer, length, ppos);
+	if (!write || (ret == -EINVAL))
+		return ret;
+	for_each_zone(zone) {
+		for_each_online_cpu(cpu) {
+			unsigned long  high;
+			high = zone->present_pages / percpu_pagelist_fraction;
+			setup_pagelist_highmark(zone_pcp(zone, cpu), high);
+		}
+	}
+	return 0;
+}
+
+static int __cpuinit zone_batchsize(struct zone *zone)
+{
+	int batch;
+
+	/*
+	 * The per-cpu-pages pools are set to around 1000th of the
+	 * size of the zone.  But no more than 1/2 of a meg.
+	 *
+	 * OK, so we don't know how big the cache is.  So guess.
+	 */
+	batch = zone->present_pages / 1024;
+	if (batch * PAGE_SIZE > 512 * 1024)
+		batch = (512 * 1024) / PAGE_SIZE;
+	batch /= 4;		/* We effectively *= 4 below */
+	if (batch < 1)
+		batch = 1;
+
+	/*
+	 * Clamp the batch to a 2^n - 1 value. Having a power
+	 * of 2 value was found to be more likely to have
+	 * suboptimal cache aliasing properties in some cases.
+	 *
+	 * For example if 2 tasks are alternately allocating
+	 * batches of pages, one task can end up with a lot
+	 * of pages of one half of the possible page colors
+	 * and the other with pages of the other colors.
+	 */
+	batch = (1 << (fls(batch + batch/2)-1)) - 1;
+
+	return batch;
+}
+
+inline void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
+{
+	struct per_cpu_pages *pcp;
+
+	memset(p, 0, sizeof(*p));
+
+	pcp = &p->pcp[0];		/* hot */
+	pcp->count = 0;
+	pcp->high = 6 * batch;
+	pcp->batch = max(1UL, 1 * batch);
+	INIT_LIST_HEAD(&pcp->list);
+
+	pcp = &p->pcp[1];		/* cold*/
+	pcp->count = 0;
+	pcp->high = 2 * batch;
+	pcp->batch = max(1UL, batch/2);
+	INIT_LIST_HEAD(&pcp->list);
+}
+
+#ifdef CONFIG_NUMA
+/*
+ * Boot pageset table. One per cpu which is going to be used for all
+ * zones and all nodes. The parameters will be set in such a way
+ * that an item put on a list will immediately be handed over to
+ * the buddy list. This is safe since pageset manipulation is done
+ * with interrupts disabled.
+ *
+ * Some NUMA counter updates may also be caught by the boot pagesets.
+ *
+ * The boot_pagesets must be kept even after bootup is complete for
+ * unused processors and/or zones. They do play a role for bootstrapping
+ * hotplugged processors.
+ *
+ * zoneinfo_show() and maybe other functions do
+ * not check if the processor is online before following the pageset pointer.
+ * Other parts of the kernel may not check if the zone is available.
+ */
+static struct per_cpu_pageset boot_pageset[NR_CPUS];
+
+/*
+ * Dynamically allocate memory for the
+ * per cpu pageset array in struct zone.
+ */
+static int __cpuinit process_zones(int cpu)
+{
+	struct zone *zone, *dzone;
+
+	for_each_zone(zone) {
+
+		zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
+					 GFP_KERNEL, cpu_to_node(cpu));
+		if (!zone_pcp(zone, cpu))
+			goto bad;
+
+		setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
+
+		if (percpu_pagelist_fraction)
+			setup_pagelist_highmark(zone_pcp(zone, cpu),
+			 	(zone->present_pages / percpu_pagelist_fraction));
+	}
+
+	return 0;
+bad:
+	for_each_zone(dzone) {
+		if (dzone == zone)
+			break;
+		kfree(zone_pcp(dzone, cpu));
+		zone_pcp(dzone, cpu) = NULL;
+	}
+	return -ENOMEM;
+}
+
+static inline void free_zone_pagesets(int cpu)
+{
+	struct zone *zone;
+
+	for_each_zone(zone) {
+		struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
+
+		zone_pcp(zone, cpu) = NULL;
+		kfree(pset);
+	}
+}
+
+static int pageset_cpuup_callback(struct notifier_block *nfb,
+		unsigned long action,
+		void *hcpu)
+{
+	int cpu = (long)hcpu;
+	int ret = NOTIFY_OK;
+
+	switch (action) {
+		case CPU_UP_PREPARE:
+			if (process_zones(cpu))
+				ret = NOTIFY_BAD;
+			break;
+		case CPU_UP_CANCELED:
+		case CPU_DEAD:
+			free_zone_pagesets(cpu);
+			break;
+		default:
+			break;
+	}
+	return ret;
+}
+
+static struct notifier_block pageset_notifier =
+	{ &pageset_cpuup_callback, NULL, 0 };
+
+void __init setup_per_cpu_pageset(void)
+{
+	int err;
+
+	/* Initialize per_cpu_pageset for cpu 0.
+	 * A cpuup callback will do this for every cpu
+	 * as it comes online
+	 */
+	err = process_zones(smp_processor_id());
+	BUG_ON(err);
+	register_cpu_notifier(&pageset_notifier);
+}
+#endif
+
+static __meminit void zone_pcp_init(struct zone *zone)
+{
+	int cpu;
+	unsigned long batch = zone_batchsize(zone);
+
+	for (cpu = 0; cpu < NR_CPUS; cpu++) {
+#ifdef CONFIG_NUMA
+		/* Early boot. Slab allocator not functional yet */
+		zone_pcp(zone, cpu) = &boot_pageset[cpu];
+		setup_pageset(&boot_pageset[cpu],0);
+#else
+		setup_pageset(zone_pcp(zone,cpu), batch);
+#endif
+	}
+	if (zone->present_pages)
+		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
+			zone->name, zone->present_pages, batch);
+}
+
+__meminit int init_currently_empty_zone(struct zone *zone,
+					unsigned long zone_start_pfn,
+					unsigned long size)
+{
+	struct pglist_data *pgdat = zone->zone_pgdat;
+	int ret;
+	ret = zone_wait_table_init(zone, size);
+	if (ret)
+		return ret;
+	pgdat->nr_zones = zone_idx(zone) + 1;
+
+	zone->zone_start_pfn = zone_start_pfn;
+
+	memmap_init(size, pgdat->node_id, zone_idx(zone), zone_start_pfn);
+
+	zone_init_free_lists(pgdat, zone, zone->spanned_pages);
+
+	return 0;
+}
+
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+/* Note: nid == MAX_NUMNODES returns first region */
+static int __init first_active_region_index_in_nid(int nid)
+{
+	int i;
+	for (i = 0; early_node_map[i].end_pfn; i++) {
+		if (nid == MAX_NUMNODES || early_node_map[i].nid == nid)
+			return i;
+	}
+
+	return MAX_ACTIVE_REGIONS;
+}
+
+/* Note: nid == MAX_NUMNODES returns next region */
+static int __init next_active_region_index_in_nid(unsigned int index, int nid)
+{
+	for (index = index + 1; early_node_map[index].end_pfn; index++) {
+		if (nid == MAX_NUMNODES || early_node_map[index].nid == nid)
+			return index;
+	}
+
+	return MAX_ACTIVE_REGIONS;
+}
+
+#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
+int __init early_pfn_to_nid(unsigned long pfn)
+{
+	int i;
+
+	for (i = 0; early_node_map[i].end_pfn; i++) {
+		unsigned long start_pfn = early_node_map[i].start_pfn;
+		unsigned long end_pfn = early_node_map[i].end_pfn;
+
+		if ((start_pfn <= pfn) && (pfn < end_pfn))
+			return early_node_map[i].nid;
+	}
+
+	return -1;
+}
+#endif
+
+#define for_each_active_range_index_in_nid(i, nid) \
+	for (i = first_active_region_index_in_nid(nid); \
+				i != MAX_ACTIVE_REGIONS; \
+				i = next_active_region_index_in_nid(i, nid))
+
+void __init free_bootmem_with_active_regions(int nid,
+						unsigned long max_low_pfn)
+{
+	unsigned int i;
+	for_each_active_range_index_in_nid(i, nid) {
+		unsigned long size_pages = 0;
+		unsigned long end_pfn = early_node_map[i].end_pfn;
+		if (early_node_map[i].start_pfn >= max_low_pfn)
+			continue;
+
+		if (end_pfn > max_low_pfn)
+			end_pfn = max_low_pfn;
+
+		size_pages = end_pfn - early_node_map[i].start_pfn;
+		free_bootmem_node(NODE_DATA(early_node_map[i].nid),
+				PFN_PHYS(early_node_map[i].start_pfn),
+				size_pages << PAGE_SHIFT);
+	}
+}
+
+void __init sparse_memory_present_with_active_regions(int nid)
+{
+	unsigned int i;
+	for_each_active_range_index_in_nid(i, nid)
+		memory_present(early_node_map[i].nid,
+				early_node_map[i].start_pfn,
+				early_node_map[i].end_pfn);
+}
+
+void __init get_pfn_range_for_nid(unsigned int nid,
+			unsigned long *start_pfn, unsigned long *end_pfn)
+{
+	unsigned int i;
+	*start_pfn = -1UL;
+	*end_pfn = 0;
+
+	for_each_active_range_index_in_nid(i, nid) {
+		*start_pfn = min(*start_pfn, early_node_map[i].start_pfn);
+		*end_pfn = max(*end_pfn, early_node_map[i].end_pfn);
+	}
+
+	if (*start_pfn == -1UL) {
+		printk(KERN_WARNING "Node %u active with no memory\n", nid);
+		*start_pfn = 0;
+	}
+}
+
+unsigned long __init zone_present_pages_in_node(int nid,
+					unsigned long zone_type,
+					unsigned long *ignored)
+{
+	unsigned long node_start_pfn, node_end_pfn;
+	unsigned long zone_start_pfn, zone_end_pfn;
+
+	/* Get the start and end of the node and zone */
+	get_pfn_range_for_nid(nid, &node_start_pfn, &node_end_pfn);
+	zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
+	zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
+
+	/* Check that this node has pages within the zone's required range */
+	if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn)
+		return 0;
+
+	/* Move the zone boundaries inside the node if necessary */
+	zone_end_pfn = min(zone_end_pfn, node_end_pfn);
+	zone_start_pfn = max(zone_start_pfn, node_start_pfn);
+
+	/* Return the spanned pages */
+	return zone_end_pfn - zone_start_pfn;
+}
+
+unsigned long __init __absent_pages_in_range(int nid,
+				unsigned long range_start_pfn,
+				unsigned long range_end_pfn)
+{
+	int i = 0;
+	unsigned long prev_end_pfn = 0, hole_pages = 0;
+	unsigned long start_pfn;
+
+	/* Find the end_pfn of the first active range of pfns in the node */
+	i = first_active_region_index_in_nid(nid);
+	if (i == MAX_ACTIVE_REGIONS)
+		return 0;
+	prev_end_pfn = early_node_map[i].start_pfn;
+
+	/* Find all holes for the zone within the node */
+	for (; i != MAX_ACTIVE_REGIONS;
+			i = next_active_region_index_in_nid(i, nid)) {
+
+		/* No need to continue if prev_end_pfn is outside the zone */
+		if (prev_end_pfn >= range_end_pfn)
+			break;
+
+		/* Make sure the end of the zone is not within the hole */
+		start_pfn = min(early_node_map[i].start_pfn, range_end_pfn);
+		prev_end_pfn = max(prev_end_pfn, range_start_pfn);
+
+		/* Update the hole size count and move on */
+		if (start_pfn > range_start_pfn) {
+			BUG_ON(prev_end_pfn > start_pfn);
+			hole_pages += start_pfn - prev_end_pfn;
+		}
+		prev_end_pfn = early_node_map[i].end_pfn;
+	}
+
+	return hole_pages;
+}
+
+unsigned long __init absent_pages_in_range(unsigned long start_pfn,
+							unsigned long end_pfn)
+{
+	return __absent_pages_in_range(MAX_NUMNODES, start_pfn, end_pfn);
+}
+
+unsigned long __init zone_absent_pages_in_node(int nid,
+					unsigned long zone_type,
+					unsigned long *ignored)
+{
+	return __absent_pages_in_range(nid,
+				arch_zone_lowest_possible_pfn[zone_type],
+				arch_zone_highest_possible_pfn[zone_type]);
+}
+#else
+static inline unsigned long zone_present_pages_in_node(int nid,
+					unsigned long zone_type,
+					unsigned long *zones_size)
+{
+	return zones_size[zone_type];
+}
+
+static inline unsigned long zone_absent_pages_in_node(int nid,
+						unsigned long zone_type,
+						unsigned long *zholes_size)
+{
+	if (!zholes_size)
+		return 0;
+
+	return zholes_size[zone_type];
+}
+#endif
+
+static void __init calculate_node_totalpages(struct pglist_data *pgdat,
+		unsigned long *zones_size, unsigned long *zholes_size)
+{
+	unsigned long realtotalpages, totalpages = 0;
+	int i;
+
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		totalpages += zone_present_pages_in_node(pgdat->node_id, i,
+								zones_size);
+	}
+	pgdat->node_spanned_pages = totalpages;
+
+	realtotalpages = totalpages;
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		realtotalpages -=
+			zone_absent_pages_in_node(pgdat->node_id, i, zholes_size);
+	}
+	pgdat->node_present_pages = realtotalpages;
+	printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id,
+							realtotalpages);
+}
+
+/*
+ * Set up the zone data structures:
+ *   - mark all pages reserved
+ *   - mark all memory queues empty
+ *   - clear the memory bitmaps
+ */
+static void __meminit free_area_init_core(struct pglist_data *pgdat,
+		unsigned long *zones_size, unsigned long *zholes_size)
+{
+	unsigned long j;
+	int nid = pgdat->node_id;
+	unsigned long zone_start_pfn = pgdat->node_start_pfn;
+	int ret;
+
+	pgdat_resize_init(pgdat);
+	pgdat->nr_zones = 0;
+	init_waitqueue_head(&pgdat->kswapd_wait);
+	pgdat->kswapd_max_order = 0;
+
+	for (j = 0; j < MAX_NR_ZONES; j++) {
+		struct zone *zone = pgdat->node_zones + j;
+		unsigned long size, realsize;
+
+		size = zone_present_pages_in_node(nid, j, zones_size);
+		realsize = size - zone_absent_pages_in_node(nid, j,
+								zholes_size);
+		if (j < ZONE_HIGHMEM)
+			nr_kernel_pages += realsize;
+		nr_all_pages += realsize;
+
+		zone->spanned_pages = size;
+		zone->present_pages = realsize;
+		zone->name = zone_names[j];
+		spin_lock_init(&zone->lock);
+		spin_lock_init(&zone->lru_lock);
+		zone_seqlock_init(zone);
+		zone->zone_pgdat = pgdat;
+		zone->free_pages = 0;
+
+		zone->temp_priority = zone->prev_priority = DEF_PRIORITY;
+
+		zone_pcp_init(zone);
+		INIT_LIST_HEAD(&zone->active_list);
+		INIT_LIST_HEAD(&zone->inactive_list);
+		zone->nr_scan_active = 0;
+		zone->nr_scan_inactive = 0;
+		zone->nr_active = 0;
+		zone->nr_inactive = 0;
+		atomic_set(&zone->reclaim_in_progress, 0);
+		if (!size)
+			continue;
+
+		zonetable_add(zone, nid, j, zone_start_pfn, size);
+		ret = init_currently_empty_zone(zone, zone_start_pfn, size);
+		BUG_ON(ret);
+		zone_start_pfn += size;
+	}
+}
+
+static void __init alloc_node_mem_map(struct pglist_data *pgdat)
+{
+	/* Skip empty nodes */
+	if (!pgdat->node_spanned_pages)
+		return;
+
+#ifdef CONFIG_FLAT_NODE_MEM_MAP
+	/* ia64 gets its own node_mem_map, before this, without bootmem */
+	if (!pgdat->node_mem_map) {
+		unsigned long size;
+		struct page *map;
+
+		size = (pgdat->node_spanned_pages + 1) * sizeof(struct page);
+		map = alloc_remap(pgdat->node_id, size);
+		if (!map)
+			map = alloc_bootmem_node(pgdat, size);
+		pgdat->node_mem_map = map;
+	}
+#ifdef CONFIG_FLATMEM
+	/*
+	 * With no DISCONTIG, the global mem_map is just set as node 0's
+	 */
+	if (pgdat == NODE_DATA(0))
+		mem_map = NODE_DATA(0)->node_mem_map;
+#endif
+#endif /* CONFIG_FLAT_NODE_MEM_MAP */
+}
+
+void __meminit free_area_init_node(int nid, struct pglist_data *pgdat,
+		unsigned long *zones_size, unsigned long node_start_pfn,
+		unsigned long *zholes_size)
+{
+	pgdat->node_id = nid;
+	pgdat->node_start_pfn = node_start_pfn;
+	calculate_node_totalpages(pgdat, zones_size, zholes_size);
+
+	alloc_node_mem_map(pgdat);
+
+	free_area_init_core(pgdat, zones_size, zholes_size);
+}
+
+#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
+void __init add_active_range(unsigned int nid, unsigned long start_pfn,
+						unsigned long end_pfn)
+{
+	unsigned int i;
+	printk(KERN_DEBUG "Range (%d) %lu -> %lu\n", nid, start_pfn, end_pfn);
+
+	/* Merge with existing active regions if possible */
+	for (i = 0; early_node_map[i].end_pfn; i++) {
+		if (early_node_map[i].nid != nid)
+			continue;
+
+		/* Skip if an existing region covers this new one */
+		if (start_pfn >= early_node_map[i].start_pfn &&
+				end_pfn <= early_node_map[i].end_pfn)
+			return;
+
+		/* Merge forward if suitable */
+		if (start_pfn <= early_node_map[i].end_pfn &&
+				end_pfn > early_node_map[i].end_pfn) {
+			early_node_map[i].end_pfn = end_pfn;
+			return;
+		}
+
+		/* Merge backward if suitable */
+		if (start_pfn < early_node_map[i].end_pfn &&
+				end_pfn >= early_node_map[i].start_pfn) {
+			early_node_map[i].start_pfn = start_pfn;
+			return;
+		}
+	}
+
+	/* Leave last entry NULL, we use range.end_pfn to terminate the walk */
+	if (i >= MAX_ACTIVE_REGIONS - 1) {
+		printk(KERN_ERR "Too many memory regions, truncating\n");
+		return;
+	}
+
+	early_node_map[i].nid = nid;
+	early_node_map[i].start_pfn = start_pfn;
+	early_node_map[i].end_pfn = end_pfn;
+}
+
+void __init shrink_active_range(unsigned int nid, unsigned long old_end_pfn,
+						unsigned long new_end_pfn)
+{
+	unsigned int i;
+
+	/* Find the old active region end and shrink */
+	for_each_active_range_index_in_nid(i, nid) {
+		if (early_node_map[i].end_pfn == old_end_pfn) {
+			early_node_map[i].end_pfn = new_end_pfn;
+			break;
+		}
+	}
+}
+
+void __init remove_all_active_ranges()
+{
+	memset(early_node_map, 0, sizeof(early_node_map));
+}
+
+/* Compare two active node_active_regions */
+static int __init cmp_node_active_region(const void *a, const void *b)
+{
+	struct node_active_region *arange = (struct node_active_region *)a;
+	struct node_active_region *brange = (struct node_active_region *)b;
+
+	/* Done this way to avoid overflows */
+	if (arange->start_pfn > brange->start_pfn)
+		return 1;
+	if (arange->start_pfn < brange->start_pfn)
+		return -1;
+
+	return 0;
+}
+
+/* sort the node_map by start_pfn */
+static void __init sort_node_map(void)
+{
+	size_t num = 0;
+	while (early_node_map[num].end_pfn)
+		num++;
+
+	sort(early_node_map, num, sizeof(struct node_active_region),
+						cmp_node_active_region, NULL);
+}
+
+/* Find the lowest pfn for a node. This depends on a sorted early_node_map */
+unsigned long __init find_min_pfn_for_node(unsigned long nid)
+{
+	int i;
+
+	/* Assuming a sorted map, the first range found has the starting pfn */
+	for_each_active_range_index_in_nid(i, nid)
+		return early_node_map[i].start_pfn;
+
+	printk(KERN_WARNING "Could not find start_pfn for node %lu\n", nid);
+	return 0;
+}
+
+unsigned long __init find_min_pfn_with_active_regions(void)
+{
+	return find_min_pfn_for_node(MAX_NUMNODES);
+}
+
+unsigned long __init find_max_pfn_with_active_regions(void)
+{
+	int i;
+	unsigned long max_pfn = 0;
+
+	for (i = 0; early_node_map[i].end_pfn; i++)
+		max_pfn = max(max_pfn, early_node_map[i].end_pfn);
+
+	return max_pfn;
+}
+
+void __init free_area_init_nodes(unsigned long arch_max_dma_pfn,
+				unsigned long arch_max_dma32_pfn,
+				unsigned long arch_max_low_pfn,
+				unsigned long arch_max_high_pfn)
+{
+	unsigned long nid;
+	int zone_index;
+
+	/* Record where the zone boundaries are */
+	memset(arch_zone_lowest_possible_pfn, 0,
+				sizeof(arch_zone_lowest_possible_pfn));
+	memset(arch_zone_highest_possible_pfn, 0,
+				sizeof(arch_zone_highest_possible_pfn));
+	arch_zone_lowest_possible_pfn[ZONE_DMA] =
+					find_min_pfn_with_active_regions();
+	arch_zone_highest_possible_pfn[ZONE_DMA] = arch_max_dma_pfn;
+	arch_zone_highest_possible_pfn[ZONE_DMA32] = arch_max_dma32_pfn;
+	arch_zone_highest_possible_pfn[ZONE_NORMAL] = arch_max_low_pfn;
+	arch_zone_highest_possible_pfn[ZONE_HIGHMEM] = arch_max_high_pfn;
+	for (zone_index = 1; zone_index < MAX_NR_ZONES; zone_index++) {
+		arch_zone_lowest_possible_pfn[zone_index] =
+			arch_zone_highest_possible_pfn[zone_index-1];
+	}
+
+	/* Regions in the early_node_map can be in any order */
+	sort_node_map();
+
+	for_each_online_node(nid) {
+		pg_data_t *pgdat = NODE_DATA(nid);
+		free_area_init_node(nid, pgdat, NULL,
+				find_min_pfn_for_node(nid), NULL);
+	}
+}
+#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc3-mm1-105-ia64_use_init_nodes/mm/page_alloc.c linux-2.6.17-rc3-mm1-106-breakout_mem_init/mm/page_alloc.c
--- linux-2.6.17-rc3-mm1-105-ia64_use_init_nodes/mm/page_alloc.c	2006-05-08 10:56:43.000000000 +0100
+++ linux-2.6.17-rc3-mm1-106-breakout_mem_init/mm/page_alloc.c	2006-05-08 09:22:02.000000000 +0100
@@ -38,8 +38,6 @@
 #include <linux/vmalloc.h>
 #include <linux/mempolicy.h>
 #include <linux/stop_machine.h>
-#include <linux/sort.h>
-#include <linux/pfn.h>
 
 #include <asm/tlbflush.h>
 #include "internal.h"
@@ -56,7 +54,6 @@ unsigned long totalram_pages __read_most
 unsigned long totalhigh_pages __read_mostly;
 unsigned long totalreserve_pages __read_mostly;
 long nr_swap_pages;
-int percpu_pagelist_fraction;
 
 static void __free_pages_ok(struct page *page, unsigned int order);
 
@@ -82,24 +79,11 @@ EXPORT_SYMBOL(totalram_pages);
 struct zone *zone_table[1 << ZONETABLE_SHIFT] __read_mostly;
 EXPORT_SYMBOL(zone_table);
 
-static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
 int min_free_kbytes = 1024;
 
 unsigned long __meminitdata nr_kernel_pages;
 unsigned long __meminitdata nr_all_pages;
 
-#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
-  #ifdef CONFIG_MAX_ACTIVE_REGIONS
-    #define MAX_ACTIVE_REGIONS CONFIG_MAX_ACTIVE_REGIONS
-  #else
-    #define MAX_ACTIVE_REGIONS (MAX_NR_ZONES * MAX_NUMNODES + 1)
-  #endif
-
-  struct node_active_region __initdata early_node_map[MAX_ACTIVE_REGIONS];
-  unsigned long __initdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
-  unsigned long __initdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
-#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
-
 #ifdef CONFIG_DEBUG_VM
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
 {
@@ -1593,1069 +1577,6 @@ void show_free_areas(void)
 	show_swap_cache_info();
 }
 
-/*
- * Builds allocation fallback zone lists.
- *
- * Add all populated zones of a node to the zonelist.
- */
-static int __meminit build_zonelists_node(pg_data_t *pgdat,
-			struct zonelist *zonelist, int nr_zones, int zone_type)
-{
-	struct zone *zone;
-
-	BUG_ON(zone_type > ZONE_HIGHMEM);
-
-	do {
-		zone = pgdat->node_zones + zone_type;
-		if (populated_zone(zone)) {
-#ifndef CONFIG_HIGHMEM
-			BUG_ON(zone_type > ZONE_NORMAL);
-#endif
-			zonelist->zones[nr_zones++] = zone;
-			check_highest_zone(zone_type);
-		}
-		zone_type--;
-
-	} while (zone_type >= 0);
-	return nr_zones;
-}
-
-static inline int highest_zone(int zone_bits)
-{
-	int res = ZONE_NORMAL;
-	if (zone_bits & (__force int)__GFP_HIGHMEM)
-		res = ZONE_HIGHMEM;
-	if (zone_bits & (__force int)__GFP_DMA32)
-		res = ZONE_DMA32;
-	if (zone_bits & (__force int)__GFP_DMA)
-		res = ZONE_DMA;
-	return res;
-}
-
-#ifdef CONFIG_NUMA
-#define MAX_NODE_LOAD (num_online_nodes())
-static int __meminitdata node_load[MAX_NUMNODES];
-/**
- * find_next_best_node - find the next node that should appear in a given node's fallback list
- * @node: node whose fallback list we're appending
- * @used_node_mask: nodemask_t of already used nodes
- *
- * We use a number of factors to determine which is the next node that should
- * appear on a given node's fallback list.  The node should not have appeared
- * already in @node's fallback list, and it should be the next closest node
- * according to the distance array (which contains arbitrary distance values
- * from each node to each node in the system), and should also prefer nodes
- * with no CPUs, since presumably they'll have very little allocation pressure
- * on them otherwise.
- * It returns -1 if no node is found.
- */
-static int __meminit find_next_best_node(int node, nodemask_t *used_node_mask)
-{
-	int n, val;
-	int min_val = INT_MAX;
-	int best_node = -1;
-
-	/* Use the local node if we haven't already */
-	if (!node_isset(node, *used_node_mask)) {
-		node_set(node, *used_node_mask);
-		return node;
-	}
-
-	for_each_online_node(n) {
-		cpumask_t tmp;
-
-		/* Don't want a node to appear more than once */
-		if (node_isset(n, *used_node_mask))
-			continue;
-
-		/* Use the distance array to find the distance */
-		val = node_distance(node, n);
-
-		/* Penalize nodes under us ("prefer the next node") */
-		val += (n < node);
-
-		/* Give preference to headless and unused nodes */
-		tmp = node_to_cpumask(n);
-		if (!cpus_empty(tmp))
-			val += PENALTY_FOR_NODE_WITH_CPUS;
-
-		/* Slight preference for less loaded node */
-		val *= (MAX_NODE_LOAD*MAX_NUMNODES);
-		val += node_load[n];
-
-		if (val < min_val) {
-			min_val = val;
-			best_node = n;
-		}
-	}
-
-	if (best_node >= 0)
-		node_set(best_node, *used_node_mask);
-
-	return best_node;
-}
-
-static void __meminit build_zonelists(pg_data_t *pgdat)
-{
-	int i, j, k, node, local_node;
-	int prev_node, load;
-	struct zonelist *zonelist;
-	nodemask_t used_mask;
-
-	/* initialize zonelists */
-	for (i = 0; i < GFP_ZONETYPES; i++) {
-		zonelist = pgdat->node_zonelists + i;
-		zonelist->zones[0] = NULL;
-	}
-
-	/* NUMA-aware ordering of nodes */
-	local_node = pgdat->node_id;
-	load = num_online_nodes();
-	prev_node = local_node;
-	nodes_clear(used_mask);
-	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
-		int distance = node_distance(local_node, node);
-
-		/*
-		 * If another node is sufficiently far away then it is better
-		 * to reclaim pages in a zone before going off node.
-		 */
-		if (distance > RECLAIM_DISTANCE)
-			zone_reclaim_mode = 1;
-
-		/*
-		 * We don't want to pressure a particular node.
-		 * So adding penalty to the first node in same
-		 * distance group to make it round-robin.
-		 */
-
-		if (distance != node_distance(local_node, prev_node))
-			node_load[node] += load;
-		prev_node = node;
-		load--;
-		for (i = 0; i < GFP_ZONETYPES; i++) {
-			zonelist = pgdat->node_zonelists + i;
-			for (j = 0; zonelist->zones[j] != NULL; j++);
-
-			k = highest_zone(i);
-
-	 		j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
-			zonelist->zones[j] = NULL;
-		}
-	}
-}
-
-#else	/* CONFIG_NUMA */
-
-static void __meminit build_zonelists(pg_data_t *pgdat)
-{
-	int i, j, k, node, local_node;
-
-	local_node = pgdat->node_id;
-	for (i = 0; i < GFP_ZONETYPES; i++) {
-		struct zonelist *zonelist;
-
-		zonelist = pgdat->node_zonelists + i;
-
-		j = 0;
-		k = highest_zone(i);
- 		j = build_zonelists_node(pgdat, zonelist, j, k);
- 		/*
- 		 * Now we build the zonelist so that it contains the zones
- 		 * of all the other nodes.
- 		 * We don't want to pressure a particular node, so when
- 		 * building the zones for node N, we make sure that the
- 		 * zones coming right after the local ones are those from
- 		 * node N+1 (modulo N)
- 		 */
-		for (node = local_node + 1; node < MAX_NUMNODES; node++) {
-			if (!node_online(node))
-				continue;
-			j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
-		}
-		for (node = 0; node < local_node; node++) {
-			if (!node_online(node))
-				continue;
-			j = build_zonelists_node(NODE_DATA(node), zonelist, j, k);
-		}
-
-		zonelist->zones[j] = NULL;
-	}
-}
-
-#endif	/* CONFIG_NUMA */
-
-/* return values int ....just for stop_machine_run() */
-static int __meminit __build_all_zonelists(void *dummy)
-{
-	int nid;
-	for_each_online_node(nid)
-		build_zonelists(NODE_DATA(nid));
-	return 0;
-}
-
-void __meminit build_all_zonelists(void)
-{
-	if (system_state == SYSTEM_BOOTING) {
-		__build_all_zonelists(0);
-		cpuset_init_current_mems_allowed();
-	} else {
-		/* we have to stop all cpus to guaranntee there is no user
-		   of zonelist */
-		stop_machine_run(__build_all_zonelists, NULL, NR_CPUS);
-		/* cpuset refresh routine should be here */
-	}
-
-	printk("Built %i zonelists\n", num_online_nodes());
-
-}
-
-/*
- * Helper functions to size the waitqueue hash table.
- * Essentially these want to choose hash table sizes sufficiently
- * large so that collisions trying to wait on pages are rare.
- * But in fact, the number of active page waitqueues on typical
- * systems is ridiculously low, less than 200. So this is even
- * conservative, even though it seems large.
- *
- * The constant PAGES_PER_WAITQUEUE specifies the ratio of pages to
- * waitqueues, i.e. the size of the waitq table given the number of pages.
- */
-#define PAGES_PER_WAITQUEUE	256
-
-#ifndef CONFIG_MEMORY_HOTPLUG
-static inline unsigned long wait_table_hash_nr_entries(unsigned long pages)
-{
-	unsigned long size = 1;
-
-	pages /= PAGES_PER_WAITQUEUE;
-
-	while (size < pages)
-		size <<= 1;
-
-	/*
-	 * Once we have dozens or even hundreds of threads sleeping
-	 * on IO we've got bigger problems than wait queue collision.
-	 * Limit the size of the wait table to a reasonable size.
-	 */
-	size = min(size, 4096UL);
-
-	return max(size, 4UL);
-}
-#else
-/*
- * A zone's size might be changed by hot-add, so it is not possible to determine
- * a suitable size for its wait_table.  So we use the maximum size now.
- *
- * The max wait table size = 4096 x sizeof(wait_queue_head_t).   ie:
- *
- *    i386 (preemption config)    : 4096 x 16 = 64Kbyte.
- *    ia64, x86-64 (no preemption): 4096 x 20 = 80Kbyte.
- *    ia64, x86-64 (preemption)   : 4096 x 24 = 96Kbyte.
- *
- * The maximum entries are prepared when a zone's memory is (512K + 256) pages
- * or more by the traditional way. (See above).  It equals:
- *
- *    i386, x86-64, powerpc(4K page size) : =  ( 2G + 1M)byte.
- *    ia64(16K page size)                 : =  ( 8G + 4M)byte.
- *    powerpc (64K page size)             : =  (32G +16M)byte.
- */
-static inline unsigned long wait_table_hash_nr_entries(unsigned long pages)
-{
-	return 4096UL;
-}
-#endif
-
-/*
- * This is an integer logarithm so that shifts can be used later
- * to extract the more random high bits from the multiplicative
- * hash function before the remainder is taken.
- */
-static inline unsigned long wait_table_bits(unsigned long size)
-{
-	return ffz(~size);
-}
-
-#define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
-
-/*
- * Initially all pages are reserved - free ones are freed
- * up by free_all_bootmem() once the early boot process is
- * done. Non-atomic initialization, single-pass.
- */
-void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
-		unsigned long start_pfn)
-{
-	struct page *page;
-	unsigned long end_pfn = start_pfn + size;
-	unsigned long pfn;
-
-	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
-		if (!early_pfn_valid(pfn))
-			continue;
-		page = pfn_to_page(pfn);
-		set_page_links(page, zone, nid, pfn);
-		init_page_count(page);
-		reset_page_mapcount(page);
-		SetPageReserved(page);
-		INIT_LIST_HEAD(&page->lru);
-#ifdef WANT_PAGE_VIRTUAL
-		/* The shift won't overflow because ZONE_NORMAL is below 4G. */
-		if (!is_highmem_idx(zone))
-			set_page_address(page, __va(pfn << PAGE_SHIFT));
-#endif
-#ifdef CONFIG_PAGE_OWNER
-		page->order = -1;
-#endif
-	}
-}
-
-void zone_init_free_lists(struct pglist_data *pgdat, struct zone *zone,
-				unsigned long size)
-{
-	int order;
-	for (order = 0; order < MAX_ORDER ; order++) {
-		INIT_LIST_HEAD(&zone->free_area[order].free_list);
-		zone->free_area[order].nr_free = 0;
-	}
-}
-
-#define ZONETABLE_INDEX(x, zone_nr)	((x << ZONES_SHIFT) | zone_nr)
-void zonetable_add(struct zone *zone, int nid, int zid, unsigned long pfn,
-		unsigned long size)
-{
-	unsigned long snum = pfn_to_section_nr(pfn);
-	unsigned long end = pfn_to_section_nr(pfn + size);
-
-	if (FLAGS_HAS_NODE)
-		zone_table[ZONETABLE_INDEX(nid, zid)] = zone;
-	else
-		for (; snum <= end; snum++)
-			zone_table[ZONETABLE_INDEX(snum, zid)] = zone;
-}
-
-#ifndef __HAVE_ARCH_MEMMAP_INIT
-#define memmap_init(size, nid, zone, start_pfn) \
-	memmap_init_zone((size), (nid), (zone), (start_pfn))
-#endif
-
-static int __cpuinit zone_batchsize(struct zone *zone)
-{
-	int batch;
-
-	/*
-	 * The per-cpu-pages pools are set to around 1000th of the
-	 * size of the zone.  But no more than 1/2 of a meg.
-	 *
-	 * OK, so we don't know how big the cache is.  So guess.
-	 */
-	batch = zone->present_pages / 1024;
-	if (batch * PAGE_SIZE > 512 * 1024)
-		batch = (512 * 1024) / PAGE_SIZE;
-	batch /= 4;		/* We effectively *= 4 below */
-	if (batch < 1)
-		batch = 1;
-
-	/*
-	 * Clamp the batch to a 2^n - 1 value. Having a power
-	 * of 2 value was found to be more likely to have
-	 * suboptimal cache aliasing properties in some cases.
-	 *
-	 * For example if 2 tasks are alternately allocating
-	 * batches of pages, one task can end up with a lot
-	 * of pages of one half of the possible page colors
-	 * and the other with pages of the other colors.
-	 */
-	batch = (1 << (fls(batch + batch/2)-1)) - 1;
-
-	return batch;
-}
-
-inline void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
-{
-	struct per_cpu_pages *pcp;
-
-	memset(p, 0, sizeof(*p));
-
-	pcp = &p->pcp[0];		/* hot */
-	pcp->count = 0;
-	pcp->high = 6 * batch;
-	pcp->batch = max(1UL, 1 * batch);
-	INIT_LIST_HEAD(&pcp->list);
-
-	pcp = &p->pcp[1];		/* cold*/
-	pcp->count = 0;
-	pcp->high = 2 * batch;
-	pcp->batch = max(1UL, batch/2);
-	INIT_LIST_HEAD(&pcp->list);
-}
-
-/*
- * setup_pagelist_highmark() sets the high water mark for hot per_cpu_pagelist
- * to the value high for the pageset p.
- */
-
-static void setup_pagelist_highmark(struct per_cpu_pageset *p,
-				unsigned long high)
-{
-	struct per_cpu_pages *pcp;
-
-	pcp = &p->pcp[0]; /* hot list */
-	pcp->high = high;
-	pcp->batch = max(1UL, high/4);
-	if ((high/4) > (PAGE_SHIFT * 8))
-		pcp->batch = PAGE_SHIFT * 8;
-}
-
-
-#ifdef CONFIG_NUMA
-/*
- * Boot pageset table. One per cpu which is going to be used for all
- * zones and all nodes. The parameters will be set in such a way
- * that an item put on a list will immediately be handed over to
- * the buddy list. This is safe since pageset manipulation is done
- * with interrupts disabled.
- *
- * Some NUMA counter updates may also be caught by the boot pagesets.
- *
- * The boot_pagesets must be kept even after bootup is complete for
- * unused processors and/or zones. They do play a role for bootstrapping
- * hotplugged processors.
- *
- * zoneinfo_show() and maybe other functions do
- * not check if the processor is online before following the pageset pointer.
- * Other parts of the kernel may not check if the zone is available.
- */
-static struct per_cpu_pageset boot_pageset[NR_CPUS];
-
-/*
- * Dynamically allocate memory for the
- * per cpu pageset array in struct zone.
- */
-static int __cpuinit process_zones(int cpu)
-{
-	struct zone *zone, *dzone;
-
-	for_each_zone(zone) {
-
-		zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
-					 GFP_KERNEL, cpu_to_node(cpu));
-		if (!zone_pcp(zone, cpu))
-			goto bad;
-
-		setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
-
-		if (percpu_pagelist_fraction)
-			setup_pagelist_highmark(zone_pcp(zone, cpu),
-			 	(zone->present_pages / percpu_pagelist_fraction));
-	}
-
-	return 0;
-bad:
-	for_each_zone(dzone) {
-		if (dzone == zone)
-			break;
-		kfree(zone_pcp(dzone, cpu));
-		zone_pcp(dzone, cpu) = NULL;
-	}
-	return -ENOMEM;
-}
-
-static inline void free_zone_pagesets(int cpu)
-{
-	struct zone *zone;
-
-	for_each_zone(zone) {
-		struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
-
-		zone_pcp(zone, cpu) = NULL;
-		kfree(pset);
-	}
-}
-
-static int pageset_cpuup_callback(struct notifier_block *nfb,
-		unsigned long action,
-		void *hcpu)
-{
-	int cpu = (long)hcpu;
-	int ret = NOTIFY_OK;
-
-	switch (action) {
-		case CPU_UP_PREPARE:
-			if (process_zones(cpu))
-				ret = NOTIFY_BAD;
-			break;
-		case CPU_UP_CANCELED:
-		case CPU_DEAD:
-			free_zone_pagesets(cpu);
-			break;
-		default:
-			break;
-	}
-	return ret;
-}
-
-static struct notifier_block pageset_notifier =
-	{ &pageset_cpuup_callback, NULL, 0 };
-
-void __init setup_per_cpu_pageset(void)
-{
-	int err;
-
-	/* Initialize per_cpu_pageset for cpu 0.
-	 * A cpuup callback will do this for every cpu
-	 * as it comes online
-	 */
-	err = process_zones(smp_processor_id());
-	BUG_ON(err);
-	register_cpu_notifier(&pageset_notifier);
-}
-
-#endif
-
-static __meminit
-int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
-{
-	int i;
-	struct pglist_data *pgdat = zone->zone_pgdat;
-	size_t alloc_size;
-
-	/*
-	 * The per-page waitqueue mechanism uses hashed waitqueues
-	 * per zone.
-	 */
-	zone->wait_table_hash_nr_entries =
-		 wait_table_hash_nr_entries(zone_size_pages);
-	zone->wait_table_bits =
-		wait_table_bits(zone->wait_table_hash_nr_entries);
-	alloc_size = zone->wait_table_hash_nr_entries
-					* sizeof(wait_queue_head_t);
-
- 	if (system_state == SYSTEM_BOOTING) {
-		zone->wait_table = (wait_queue_head_t *)
-			alloc_bootmem_node(pgdat, alloc_size);
-	} else {
-		/*
-		 * This case means that a zone whose size was 0 gets new memory
-		 * via memory hot-add.
-		 * But it may be the case that a new node was hot-added.  In
-		 * this case vmalloc() will not be able to use this new node's
-		 * memory - this wait_table must be initialized to use this new
-		 * node itself as well.
-		 * To use this new node's memory, further consideration will be
-		 * necessary.
-		 */
-		zone->wait_table = (wait_queue_head_t *)vmalloc(alloc_size);
-	}
-	if (!zone->wait_table)
-		return -ENOMEM;
-
-	for(i = 0; i < zone->wait_table_hash_nr_entries; ++i)
-		init_waitqueue_head(zone->wait_table + i);
-
-	return 0;
-}
-
-static __meminit void zone_pcp_init(struct zone *zone)
-{
-	int cpu;
-	unsigned long batch = zone_batchsize(zone);
-
-	for (cpu = 0; cpu < NR_CPUS; cpu++) {
-#ifdef CONFIG_NUMA
-		/* Early boot. Slab allocator not functional yet */
-		zone_pcp(zone, cpu) = &boot_pageset[cpu];
-		setup_pageset(&boot_pageset[cpu],0);
-#else
-		setup_pageset(zone_pcp(zone,cpu), batch);
-#endif
-	}
-	if (zone->present_pages)
-		printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%lu\n",
-			zone->name, zone->present_pages, batch);
-}
-
-__meminit int init_currently_empty_zone(struct zone *zone,
-					unsigned long zone_start_pfn,
-					unsigned long size)
-{
-	struct pglist_data *pgdat = zone->zone_pgdat;
-	int ret;
-	ret = zone_wait_table_init(zone, size);
-	if (ret)
-		return ret;
-	pgdat->nr_zones = zone_idx(zone) + 1;
-
-	zone->zone_start_pfn = zone_start_pfn;
-
-	memmap_init(size, pgdat->node_id, zone_idx(zone), zone_start_pfn);
-
-	zone_init_free_lists(pgdat, zone, zone->spanned_pages);
-
-	return 0;
-}
-
-#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
-/* Note: nid == MAX_NUMNODES returns first region */
-static int __init first_active_region_index_in_nid(int nid)
-{
-	int i;
-	for (i = 0; early_node_map[i].end_pfn; i++) {
-		if (nid == MAX_NUMNODES || early_node_map[i].nid == nid)
-			return i;
-	}
-
-	return MAX_ACTIVE_REGIONS;
-}
-
-/* Note: nid == MAX_NUMNODES returns next region */
-static int __init next_active_region_index_in_nid(unsigned int index, int nid)
-{
-	for (index = index + 1; early_node_map[index].end_pfn; index++) {
-		if (nid == MAX_NUMNODES || early_node_map[index].nid == nid)
-			return index;
-	}
-
-	return MAX_ACTIVE_REGIONS;
-}
-
-#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
-int __init early_pfn_to_nid(unsigned long pfn)
-{
-	int i;
-
-	for (i = 0; early_node_map[i].end_pfn; i++) {
-		unsigned long start_pfn = early_node_map[i].start_pfn;
-		unsigned long end_pfn = early_node_map[i].end_pfn;
-
-		if ((start_pfn <= pfn) && (pfn < end_pfn))
-			return early_node_map[i].nid;
-	}
-
-	return -1;
-}
-#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
-
-#define for_each_active_range_index_in_nid(i, nid) \
-	for (i = first_active_region_index_in_nid(nid); \
-				i != MAX_ACTIVE_REGIONS; \
-				i = next_active_region_index_in_nid(i, nid))
-
-void __init free_bootmem_with_active_regions(int nid,
-						unsigned long max_low_pfn)
-{
-	unsigned int i;
-	for_each_active_range_index_in_nid(i, nid) {
-		unsigned long size_pages = 0;
-		unsigned long end_pfn = early_node_map[i].end_pfn;
-		if (early_node_map[i].start_pfn >= max_low_pfn)
-			continue;
-
-		if (end_pfn > max_low_pfn)
-			end_pfn = max_low_pfn;
-
-		size_pages = end_pfn - early_node_map[i].start_pfn;
-		free_bootmem_node(NODE_DATA(early_node_map[i].nid),
-				PFN_PHYS(early_node_map[i].start_pfn),
-				size_pages << PAGE_SHIFT);
-	}
-}
-
-void __init sparse_memory_present_with_active_regions(int nid)
-{
-	unsigned int i;
-	for_each_active_range_index_in_nid(i, nid)
-		memory_present(early_node_map[i].nid,
-				early_node_map[i].start_pfn,
-				early_node_map[i].end_pfn);
-}
-
-void __init get_pfn_range_for_nid(unsigned int nid,
-			unsigned long *start_pfn, unsigned long *end_pfn)
-{
-	unsigned int i;
-	*start_pfn = -1UL;
-	*end_pfn = 0;
-
-	for_each_active_range_index_in_nid(i, nid) {
-		*start_pfn = min(*start_pfn, early_node_map[i].start_pfn);
-		*end_pfn = max(*end_pfn, early_node_map[i].end_pfn);
-	}
-
-	if (*start_pfn == -1UL) {
-		printk(KERN_WARNING "Node %u active with no memory\n", nid);
-		*start_pfn = 0;
-	}
-}
-
-unsigned long __init zone_present_pages_in_node(int nid,
-					unsigned long zone_type,
-					unsigned long *ignored)
-{
-	unsigned long node_start_pfn, node_end_pfn;
-	unsigned long zone_start_pfn, zone_end_pfn;
-
-	/* Get the start and end of the node and zone */
-	get_pfn_range_for_nid(nid, &node_start_pfn, &node_end_pfn);
-	zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
-	zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
-
-	/* Check that this node has pages within the zone's required range */
-	if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn)
-		return 0;
-
-	/* Move the zone boundaries inside the node if necessary */
-	zone_end_pfn = min(zone_end_pfn, node_end_pfn);
-	zone_start_pfn = max(zone_start_pfn, node_start_pfn);
-
-	/* Return the spanned pages */
-	return zone_end_pfn - zone_start_pfn;
-}
-
-unsigned long __init __absent_pages_in_range(int nid,
-				unsigned long range_start_pfn,
-				unsigned long range_end_pfn)
-{
-	int i = 0;
-	unsigned long prev_end_pfn = 0, hole_pages = 0;
-	unsigned long start_pfn;
-
-	/* Find the end_pfn of the first active range of pfns in the node */
-	i = first_active_region_index_in_nid(nid);
-	if (i == MAX_ACTIVE_REGIONS)
-		return 0;
-	prev_end_pfn = early_node_map[i].start_pfn;
-
-	/* Find all holes for the zone within the node */
-	for (; i != MAX_ACTIVE_REGIONS;
-			i = next_active_region_index_in_nid(i, nid)) {
-
-		/* No need to continue if prev_end_pfn is outside the zone */
-		if (prev_end_pfn >= range_end_pfn)
-			break;
-
-		/* Make sure the end of the zone is not within the hole */
-		start_pfn = min(early_node_map[i].start_pfn, range_end_pfn);
-		prev_end_pfn = max(prev_end_pfn, range_start_pfn);
-
-		/* Update the hole size cound and move on */
-		if (start_pfn > range_start_pfn) {
-			BUG_ON(prev_end_pfn > start_pfn);
-			hole_pages += start_pfn - prev_end_pfn;
-		}
-		prev_end_pfn = early_node_map[i].end_pfn;
-	}
-
-	return hole_pages;
-}
-
-unsigned long __init absent_pages_in_range(unsigned long start_pfn,
-							unsigned long end_pfn)
-{
-	return __absent_pages_in_range(MAX_NUMNODES, start_pfn, end_pfn);
-}
-
-unsigned long __init zone_absent_pages_in_node(int nid,
-					unsigned long zone_type,
-					unsigned long *ignored)
-{
-	return __absent_pages_in_range(nid,
-				arch_zone_lowest_possible_pfn[zone_type],
-				arch_zone_highest_possible_pfn[zone_type]);
-}
-#else
-static inline unsigned long zone_present_pages_in_node(int nid,
-					unsigned long zone_type,
-					unsigned long *zones_size)
-{
-	return zones_size[zone_type];
-}
-
-static inline unsigned long zone_absent_pages_in_node(int nid,
-						unsigned long zone_type,
-						unsigned long *zholes_size)
-{
-	if (!zholes_size)
-		return 0;
-
-	return zholes_size[zone_type];
-}
-#endif
-
-static void __init calculate_node_totalpages(struct pglist_data *pgdat,
-		unsigned long *zones_size, unsigned long *zholes_size)
-{
-	unsigned long realtotalpages, totalpages = 0;
-	int i;
-
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		totalpages += zone_present_pages_in_node(pgdat->node_id, i,
-								zones_size);
-	}
-	pgdat->node_spanned_pages = totalpages;
-
-	realtotalpages = totalpages;
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		realtotalpages -=
-			zone_absent_pages_in_node(pgdat->node_id, i, zholes_size);
-	}
-	pgdat->node_present_pages = realtotalpages;
-	printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id,
-							realtotalpages);
-}
-
-/*
- * Set up the zone data structures:
- *   - mark all pages reserved
- *   - mark all memory queues empty
- *   - clear the memory bitmaps
- */
-static void __meminit free_area_init_core(struct pglist_data *pgdat,
-		unsigned long *zones_size, unsigned long *zholes_size)
-{
-	unsigned long j;
-	int nid = pgdat->node_id;
-	unsigned long zone_start_pfn = pgdat->node_start_pfn;
-	int ret;
-
-	pgdat_resize_init(pgdat);
-	pgdat->nr_zones = 0;
-	init_waitqueue_head(&pgdat->kswapd_wait);
-	pgdat->kswapd_max_order = 0;
-	
-	for (j = 0; j < MAX_NR_ZONES; j++) {
-		struct zone *zone = pgdat->node_zones + j;
-		unsigned long size, realsize;
-
-		size = zone_present_pages_in_node(nid, j, zones_size);
-		realsize = size - zone_absent_pages_in_node(nid, j,
-								zholes_size);
-		if (j < ZONE_HIGHMEM)
-			nr_kernel_pages += realsize;
-		nr_all_pages += realsize;
-
-		zone->spanned_pages = size;
-		zone->present_pages = realsize;
-		zone->name = zone_names[j];
-		spin_lock_init(&zone->lock);
-		spin_lock_init(&zone->lru_lock);
-		zone_seqlock_init(zone);
-		zone->zone_pgdat = pgdat;
-		zone->free_pages = 0;
-
-		zone->temp_priority = zone->prev_priority = DEF_PRIORITY;
-
-		zone_pcp_init(zone);
-		INIT_LIST_HEAD(&zone->active_list);
-		INIT_LIST_HEAD(&zone->inactive_list);
-		zone->nr_scan_active = 0;
-		zone->nr_scan_inactive = 0;
-		zone->nr_active = 0;
-		zone->nr_inactive = 0;
-		atomic_set(&zone->reclaim_in_progress, 0);
-		if (!size)
-			continue;
-
-		zonetable_add(zone, nid, j, zone_start_pfn, size);
-		ret = init_currently_empty_zone(zone, zone_start_pfn, size);
-		BUG_ON(ret);
-		zone_start_pfn += size;
-	}
-}
-
-static void __init alloc_node_mem_map(struct pglist_data *pgdat)
-{
-	/* Skip empty nodes */
-	if (!pgdat->node_spanned_pages)
-		return;
-
-#ifdef CONFIG_FLAT_NODE_MEM_MAP
-	/* ia64 gets its own node_mem_map, before this, without bootmem */
-	if (!pgdat->node_mem_map) {
-		unsigned long size;
-		struct page *map;
-
-		size = (pgdat->node_spanned_pages + 1) * sizeof(struct page);
-		map = alloc_remap(pgdat->node_id, size);
-		if (!map)
-			map = alloc_bootmem_node(pgdat, size);
-		pgdat->node_mem_map = map;
-	}
-#ifdef CONFIG_FLATMEM
-	/*
-	 * With no DISCONTIG, the global mem_map is just set as node 0's
-	 */
-	if (pgdat == NODE_DATA(0))
-		mem_map = NODE_DATA(0)->node_mem_map;
-#endif
-#endif /* CONFIG_FLAT_NODE_MEM_MAP */
-}
-
-void __meminit free_area_init_node(int nid, struct pglist_data *pgdat,
-		unsigned long *zones_size, unsigned long node_start_pfn,
-		unsigned long *zholes_size)
-{
-	pgdat->node_id = nid;
-	pgdat->node_start_pfn = node_start_pfn;
-	calculate_node_totalpages(pgdat, zones_size, zholes_size);
-
-	alloc_node_mem_map(pgdat);
-
-	free_area_init_core(pgdat, zones_size, zholes_size);
-}
-
-#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
-void __init add_active_range(unsigned int nid, unsigned long start_pfn,
-						unsigned long end_pfn)
-{
-	unsigned int i;
-	printk(KERN_DEBUG "Range (%d) %lu -> %lu\n", nid, start_pfn, end_pfn);
-
-	/* Merge with existing active regions if possible */
-	for (i = 0; early_node_map[i].end_pfn; i++) {
-		if (early_node_map[i].nid != nid)
-			continue;
-
-		/* Skip if an existing region covers this new one */
-		if (start_pfn >= early_node_map[i].start_pfn &&
-				end_pfn <= early_node_map[i].end_pfn)
-			return;
-
-		/* Merge forward if suitable */
-		if (start_pfn <= early_node_map[i].end_pfn &&
-				end_pfn > early_node_map[i].end_pfn) {
-			early_node_map[i].end_pfn = end_pfn;
-			return;
-		}
-
-		/* Merge backward if suitable */
-		if (start_pfn < early_node_map[i].end_pfn &&
-				end_pfn >= early_node_map[i].start_pfn) {
-			early_node_map[i].start_pfn = start_pfn;
-			return;
-		}
-	}
-
-	/* Leave last entry NULL, we use range.end_pfn to terminate the walk */
-	if (i >= MAX_ACTIVE_REGIONS - 1) {
-		printk(KERN_ERR "Too many memory regions, truncating\n");
-		return;
-	}
-
-	early_node_map[i].nid = nid;
-	early_node_map[i].start_pfn = start_pfn;
-	early_node_map[i].end_pfn = end_pfn;
-}
-
-void __init shrink_active_range(unsigned int nid, unsigned long old_end_pfn,
-						unsigned long new_end_pfn)
-{
-	unsigned int i;
-
-	/* Find the old active region end and shrink */
-	for_each_active_range_index_in_nid(i, nid) {
-		if (early_node_map[i].end_pfn == old_end_pfn) {
-			early_node_map[i].end_pfn = new_end_pfn;
-			break;
-		}
-	}
-}
-
-void __init remove_all_active_ranges()
-{
-	memset(early_node_map, 0, sizeof(early_node_map));
-}
-
-/* Compare two active node_active_regions */
-static int __init cmp_node_active_region(const void *a, const void *b)
-{
-	struct node_active_region *arange = (struct node_active_region *)a;
-	struct node_active_region *brange = (struct node_active_region *)b;
-
-	/* Done this way to avoid overflows */
-	if (arange->start_pfn > brange->start_pfn)
-		return 1;
-	if (arange->start_pfn < brange->start_pfn)
-		return -1;
-
-	return 0;
-}
-
-/* sort the node_map by start_pfn */
-static void __init sort_node_map(void)
-{
-	size_t num = 0;
-	while (early_node_map[num].end_pfn)
-		num++;
-
-	sort(early_node_map, num, sizeof(struct node_active_region),
-						cmp_node_active_region, NULL);
-}
-
-/* Find the lowest pfn for a node. This depends on a sorted early_node_map */
-unsigned long __init find_min_pfn_for_node(unsigned long nid)
-{
-	int i;
-
-	/* Assuming a sorted map, the first range found has the starting pfn */
-	for_each_active_range_index_in_nid(i, nid)
-		return early_node_map[i].start_pfn;
-
-	printk(KERN_WARNING "Could not find start_pfn for node %lu\n", nid);
-	return 0;
-}
-
-unsigned long __init find_min_pfn_with_active_regions(void)
-{
-	return find_min_pfn_for_node(MAX_NUMNODES);
-}
-
-unsigned long __init find_max_pfn_with_active_regions(void)
-{
-	int i;
-	unsigned long max_pfn = 0;
-
-	for (i = 0; early_node_map[i].end_pfn; i++)
-		max_pfn = max(max_pfn, early_node_map[i].end_pfn);
-
-	return max_pfn;
-}
-
-void __init free_area_init_nodes(unsigned long arch_max_dma_pfn,
-				unsigned long arch_max_dma32_pfn,
-				unsigned long arch_max_low_pfn,
-				unsigned long arch_max_high_pfn)
-{
-	unsigned long nid;
-	int zone_index;
-
-	/* Record where the zone boundaries are */
-	memset(arch_zone_lowest_possible_pfn, 0,
-				sizeof(arch_zone_lowest_possible_pfn));
-	memset(arch_zone_highest_possible_pfn, 0,
-				sizeof(arch_zone_highest_possible_pfn));
-	arch_zone_lowest_possible_pfn[ZONE_DMA] =
-					find_min_pfn_with_active_regions();
-	arch_zone_highest_possible_pfn[ZONE_DMA] = arch_max_dma_pfn;
-	arch_zone_highest_possible_pfn[ZONE_DMA32] = arch_max_dma32_pfn;
-	arch_zone_highest_possible_pfn[ZONE_NORMAL] = arch_max_low_pfn;
-	arch_zone_highest_possible_pfn[ZONE_HIGHMEM] = arch_max_high_pfn;
-	for (zone_index = 1; zone_index < MAX_NR_ZONES; zone_index++) {
-		arch_zone_lowest_possible_pfn[zone_index] =
-			arch_zone_highest_possible_pfn[zone_index-1];
-	}
-
-	/* Regions in the early_node_map can be in any order */
-	sort_node_map();
-
-	for_each_online_node(nid) {
-		pg_data_t *pgdat = NODE_DATA(nid);
-		free_area_init_node(nid, pgdat, NULL,
-				find_min_pfn_for_node(nid), NULL);
-	}
-}
-#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
-
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 static bootmem_data_t contig_bootmem_data;
 struct pglist_data contig_page_data = { .bdata = &contig_bootmem_data };
@@ -3176,32 +2097,6 @@ int lowmem_reserve_ratio_sysctl_handler(
 	return 0;
 }
 
-/*
- * percpu_pagelist_fraction - changes the pcp->high for each zone on each
- * cpu.  It is the fraction of total pages in each zone that a hot per cpu pagelist
- * can have before it gets flushed back to buddy allocator.
- */
-
-int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
-	struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
-{
-	struct zone *zone;
-	unsigned int cpu;
-	int ret;
-
-	ret = proc_dointvec_minmax(table, write, file, buffer, length, ppos);
-	if (!write || (ret == -EINVAL))
-		return ret;
-	for_each_zone(zone) {
-		for_each_online_cpu(cpu) {
-			unsigned long  high;
-			high = zone->present_pages / percpu_pagelist_fraction;
-			setup_pagelist_highmark(zone_pcp(zone, cpu), high);
-		}
-	}
-	return 0;
-}
-
 __initdata int hashdist = HASHDIST_DEFAULT;
 
 #ifdef CONFIG_NUMA

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 6/6] Break out memory initialisation code from page_alloc.c to mem_init.c
  2006-05-08 14:12 ` [PATCH 6/6] Break out memory initialisation code from page_alloc.c to mem_init.c Mel Gorman
@ 2006-05-09  1:47   ` Nick Piggin
  2006-05-09  8:24     ` Mel Gorman
  0 siblings, 1 reply; 47+ messages in thread
From: Nick Piggin @ 2006-05-09  1:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: akpm, davej, tony.luck, ak, bob.picco, linux-kernel,
	linuxppc-dev, linux-mm

Mel Gorman wrote:

>page_alloc.c contains a large amount of memory initialisation code. This patch
>breaks out the initialisation code to a separate file to make page_alloc.c
>a bit easier to read.
>

I realise this is at the wrong end of your queue, but if you _can_ easily
break it out and submit it first, it would be a nice cleanup and would help
shrink your main patchset.

Also, we're recently having some problems with architectures not aligning
zones correctly. Would it make sense to add these sorts of sanity checks,
and possibly forcing alignment corrections into your generic code?

Nick
--

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 6/6] Break out memory initialisation code from page_alloc.c to mem_init.c
  2006-05-09  1:47   ` Nick Piggin
@ 2006-05-09  8:24     ` Mel Gorman
  0 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2006-05-09  8:24 UTC (permalink / raw)
  To: Nick Piggin
  Cc: akpm, davej, tony.luck, ak, bob.picco, linux-kernel,
	linuxppc-dev, linux-mm

On Tue, 9 May 2006, Nick Piggin wrote:

> Mel Gorman wrote:
>
>> page_alloc.c contains a large amount of memory initialisation code. This 
>> patch
>> breaks out the initialisation code to a separate file to make page_alloc.c
>> a bit easier to read.
>> 
>
> I realise this is at the wrong end of your queue, but if you _can_ easily
> break it out and submit it first, it would be a nice cleanup and would help
> shrink your main patchset.
>

The split-out potentially affects 10 other patches currently in -mm and is 
a merge headache for Andrew. My current understanding is that he wants to 
drop patch 6/6 until a later time. I guess this would still be true if the 
patch were at the other end of the queue.

> Also, we're recently having some problems with architectures not aligning
> zones correctly. Would it make sense to add these sorts of sanity checks,
> and possibly forcing alignment corrections into your generic code?
>

Yes, it is easy to force alignment corrections into the generic code. From 
that thread, there was this comment from Andy Whitcroft and your response:

> >1) check the alignment of the zones matches the implied alignment
> > constraints and correct it as we go.
> Yes. And preferably have checks in the generic page allocator setup
> code, so we can do something sane if the arch code gets it wrong.

With this patchset, it is trivial to move the start of highmem during 
setup. free_area_init_nodes() is passed the zone boundary PFNs by the 
architecture. If one wanted to force HIGHMEM to be aligned, the 
arch_max_high_pfn value could be rounded down to MAX_ORDER alignment in 
free_area_init_nodes() before it calls free_area_init_node(). It doesn't 
matter if the PFN is in a hole. From there, an aligned mem_map should be 
allocated and memmap_init() will set the correct zone flags.
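
As an untested illustration (not part of the posted patches), the rounding
would amount to a one-liner near the top of free_area_init_nodes():

	/* Untested sketch: round the HIGHMEM boundary down to a
	 * MAX_ORDER-aligned pfn before the zone boundaries are recorded */
	arch_max_high_pfn &= ~((1UL << (MAX_ORDER - 1)) - 1);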

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/6] Have ia64 use add_active_range() and free_area_init_nodes
  2006-05-08 14:12 ` [PATCH 5/6] Have ia64 " Mel Gorman
@ 2006-05-15  3:31   ` Andrew Morton
  2006-05-15  8:21     ` Andy Whitcroft
                       ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Andrew Morton @ 2006-05-15  3:31 UTC (permalink / raw)
  To: Mel Gorman, Andy Whitcroft
  Cc: davej, tony.luck, mel, linux-kernel, bob.picco, ak, linux-mm,
	linuxppc-dev

Mel Gorman <mel@csn.ul.ie> wrote:
>
> Size zones and holes in an architecture independent manner for ia64.
> 

This one makes my ia64 die very early in boot.   The trace is pretty useless.

config at http://www.zip.com.au/~akpm/linux/patches/stuff/config-ia64

EFI v1.10 by INTEL: SALsystab=0x3fe4c8c0 ACPI=0x3ff84000 ACPI 2.0=0x3ff83000 MP0
Early serial console at I/O port 0x2f8 (options '9600n8')
SAL 3.1: Intel Corp                       SR870BN4                         vers0
SAL Platform features: BusLock IRQ_Redirection
SAL: AP wakeup using external interrupt vector 0xf0
No logical to physical processor mapping available
iosapic_system_init: Disabling PC-AT compatible 8259 interrupts
ACPI: Local APIC address c0000000fee00000
PLATFORM int CPEI (0x3): GSI 22 (level, low) -> CPU 0 (0xc618) vector 30
register_intr: changing vector 39 from IO-SAPIC-edge to IO-SAPIC-level
4 CPUs available, 4 CPUs total
MCA related initialization done
node 0 zone DMA missaligned start pfn, enable UNALIGNED_ZONE_BOUNDRIES
node 0 zone DMA32 missaligned start pfn, enable UNALIGNED_ZONE_BOUNDRIES
node 0 zone Normal missaligned start pfn, enable UNALIGNED_ZONE_BOUNDRIES
node 0 zone HighMem missaligned start pfn, enable UNALIGNED_ZONE_BOUNDRIES
SMP: Allowing 4 CPUs, 0 hotplug CPUs
Built 1 zonelists
Kernel command line: BOOT_IMAGE=scsi0:\EFI\redhat\vmlinuz-2.6.17-rc4-mm1 root=/o
PID hash table entries: 4096 (order: 12, 32768 bytes)
Console: colour VGA+ 80x25
Dentry cache hash table entries: 131072 (order: 6, 1048576 bytes)
Inode-cache hash table entries: 65536 (order: 5, 524288 bytes)
Placing software IO TLB between 0x4a30000 - 0x8a30000
Unable to handle kernel NULL pointer dereference (address 0000000000000008)
swapper[0]: Oops 8813272891392 [1]
Modules linked in:

Pid: 0, CPU 0, comm:              swapper
psr : 00001010084a6010 ifs : 800000000000060f ip  : [<a0000001000e6750>]    Notd
ip is at __free_pages_ok+0x190/0x3c0
unat: 0000000000000000 pfs : 000000000000060f rsc : 0000000000000003
rnat: 0000000000ffffff bsps: 00000000000002f9 pr  : 80000000afb5956b
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
csd : 0930ffff00090000 ssd : 0930ffff00090000
b0  : a0000001000e6660 b6  : e00000003fe52940 b7  : a000000100790120
f6  : 1003e6db6db6db6db6db7 f7  : 1003e000000000006dec0
f8  : 1003e000000000000fb80 f9  : 1003e000000000006e080
f10 : 1003e000000000000fb40 f11 : 1003e000000000006dec0
r1  : a000000100af2db0 r2  : 0000000000000001 r3  : 0000000000000000
r8  : a0000001008f3d38 r9  : 0000000000004000 r10 : 0000000000370400
r11 : 0000000000004000 r12 : a0000001007b7e10 r13 : a0000001007b0000
r14 : 0000000000000001 r15 : 0000000100000001 r16 : 0000000100000001
r17 : 0000000100000001 r18 : 0000000000001041 r19 : 0000000000000000
r20 : e00000000149df00 r21 : 0000000100000000 r22 : 0000000055555155
r23 : 00000000ffffffff r24 : e00000000149df08 r25 : 1555555555555155
r26 : 0000000000000032 r27 : 0000000000000000 r28 : 0000000000000008
r29 : 0000000000001041 r30 : 0000000000001041 r31 : 0000000000000001
Unable to handle kernel NULL pointer dereference (address 0000000000000000)
swapper[0]: Oops 8813272891392 [2]
Modules linked in:

Pid: 0, CPU 0, comm:              swapper
psr : 0000101008022018 ifs : 8000000000000287 ip  : [<a0000001001236c0>]    Notd
ip is at kmem_cache_alloc+0x40/0x100
unat: 0000000000000000 pfs : 0000000000000712 rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr  : 80000000afb59967
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
csd : 0930ffff00090000 ssd : 0930ffff00090000
b0  : a00000010003e450 b6  : a000000100001b50 b7  : a00000010003f320
f6  : 1003e9e3779b97f4a7c16 f7  : 0ffdb8000000000000000
f8  : 1003e000000000000007f f9  : 1003e0000000000000379
f10 : 1003e6db6db6db6db6db7 f11 : 1003e000000000000007f
r1  : a000000100af2db0 r2  : 0000000000000000 r3  : 0000000000000000
r8  : 0000000000000000 r9  : 0000000000000000 r10 : a0000001007b0f24
r11 : 0000000000000000 r12 : a0000001007b7280 r13 : a0000001007b0000
r14 : 0000000000000000 r15 : 0000000000000000 r16 : a0000001007b7310
r17 : 0000000000000000 r18 : a0000001007b7478 r19 : 0000000000000000
r20 : 0000000000000000 r21 : 0000000000000018 r22 : 0000000000000000
r23 : 0000000000000000 r24 : 0000000000000000 r25 : a0000001007b7308
r26 : 000000007fffffff r27 : a000000100825520 r28 : a0000001008f3c40
r29 : a000000100816ca8 r30 : 0000000000000018 r31 : 0000000000000018



(gdb) l *0xa0000001000e6750
0xa0000001000e6750 is in __free_pages_ok (mm.h:324).
319     extern void FASTCALL(__page_cache_release(struct page *));
320     
321     static inline int page_count(struct page *page)
322     {
323             if (unlikely(PageCompound(page)))
324                     page = (struct page *)page_private(page);
325             return atomic_read(&page->_count);
326     }
327     
328     static inline void get_page(struct page *page)


Note the misaligned pfns.

Andy's (misspelled) CONFIG_UNALIGNED_ZONE_BOUNDRIES patch didn't actually
include an update to any Kconfig files.  But hacking that in by hand didn't
help.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/6] Have ia64 use add_active_range() and free_area_init_nodes
  2006-05-15  3:31   ` Andrew Morton
@ 2006-05-15  8:21     ` Andy Whitcroft
  2006-05-15 10:00       ` Nick Piggin
  2006-05-15 12:27     ` Mel Gorman
  2006-05-19 14:03     ` Mel Gorman
  2 siblings, 1 reply; 47+ messages in thread
From: Andy Whitcroft @ 2006-05-15  8:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, davej, tony.luck, linux-kernel, bob.picco, ak,
	linux-mm, linuxppc-dev

Andrew Morton wrote:
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
>>Size zones and holes in an architecture independent manner for ia64.
>>
> 
> 
> This one makes my ia64 die very early in boot.   The trace is pretty useless.
> 
> config at http://www.zip.com.au/~akpm/linux/patches/stuff/config-ia64
> 
> EFI v1.10 by INTEL: SALsystab=0x3fe4c8c0 ACPI=0x3ff84000 ACPI 2.0=0x3ff83000 MP0
> Early serial console at I/O port 0x2f8 (options '9600n8')
> SAL 3.1: Intel Corp                       SR870BN4                         vers0
> SAL Platform features: BusLock IRQ_Redirection
> SAL: AP wakeup using external interrupt vector 0xf0
> No logical to physical processor mapping available
> iosapic_system_init: Disabling PC-AT compatible 8259 interrupts
> ACPI: Local APIC address c0000000fee00000
> PLATFORM int CPEI (0x3): GSI 22 (level, low) -> CPU 0 (0xc618) vector 30
> register_intr: changing vector 39 from IO-SAPIC-edge to IO-SAPIC-level
> 4 CPUs available, 4 CPUs total
> MCA related initialization done
> node 0 zone DMA missaligned start pfn, enable UNALIGNED_ZONE_BOUNDRIES
> node 0 zone DMA32 missaligned start pfn, enable UNALIGNED_ZONE_BOUNDRIES
> node 0 zone Normal missaligned start pfn, enable UNALIGNED_ZONE_BOUNDRIES
> node 0 zone HighMem missaligned start pfn, enable UNALIGNED_ZONE_BOUNDRIES
> SMP: Allowing 4 CPUs, 0 hotplug CPUs
> Built 1 zonelists
> Kernel command line: BOOT_IMAGE=scsi0:\EFI\redhat\vmlinuz-2.6.17-rc4-mm1 root=/o
> PID hash table entries: 4096 (order: 12, 32768 bytes)
> Console: colour VGA+ 80x25
> Dentry cache hash table entries: 131072 (order: 6, 1048576 bytes)
> Inode-cache hash table entries: 65536 (order: 5, 524288 bytes)
> Placing software IO TLB between 0x4a30000 - 0x8a30000
> Unable to handle kernel NULL pointer dereference (address 0000000000000008)
> swapper[0]: Oops 8813272891392 [1]
> Modules linked in:
> 
> Pid: 0, CPU 0, comm:              swapper
> psr : 00001010084a6010 ifs : 800000000000060f ip  : [<a0000001000e6750>]    Notd
> ip is at __free_pages_ok+0x190/0x3c0
> unat: 0000000000000000 pfs : 000000000000060f rsc : 0000000000000003
> rnat: 0000000000ffffff bsps: 00000000000002f9 pr  : 80000000afb5956b
> ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
> csd : 0930ffff00090000 ssd : 0930ffff00090000
> b0  : a0000001000e6660 b6  : e00000003fe52940 b7  : a000000100790120
> f6  : 1003e6db6db6db6db6db7 f7  : 1003e000000000006dec0
> f8  : 1003e000000000000fb80 f9  : 1003e000000000006e080
> f10 : 1003e000000000000fb40 f11 : 1003e000000000006dec0
> r1  : a000000100af2db0 r2  : 0000000000000001 r3  : 0000000000000000
> r8  : a0000001008f3d38 r9  : 0000000000004000 r10 : 0000000000370400
> r11 : 0000000000004000 r12 : a0000001007b7e10 r13 : a0000001007b0000
> r14 : 0000000000000001 r15 : 0000000100000001 r16 : 0000000100000001
> r17 : 0000000100000001 r18 : 0000000000001041 r19 : 0000000000000000
> r20 : e00000000149df00 r21 : 0000000100000000 r22 : 0000000055555155
> r23 : 00000000ffffffff r24 : e00000000149df08 r25 : 1555555555555155
> r26 : 0000000000000032 r27 : 0000000000000000 r28 : 0000000000000008
> r29 : 0000000000001041 r30 : 0000000000001041 r31 : 0000000000000001
> Unable to handle kernel NULL pointer dereference (address 0000000000000000)
> swapper[0]: Oops 8813272891392 [2]
> Modules linked in:
> 
> Pid: 0, CPU 0, comm:              swapper
> psr : 0000101008022018 ifs : 8000000000000287 ip  : [<a0000001001236c0>]    Notd
> ip is at kmem_cache_alloc+0x40/0x100
> unat: 0000000000000000 pfs : 0000000000000712 rsc : 0000000000000003
> rnat: 0000000000000000 bsps: 0000000000000000 pr  : 80000000afb59967
> ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
> csd : 0930ffff00090000 ssd : 0930ffff00090000
> b0  : a00000010003e450 b6  : a000000100001b50 b7  : a00000010003f320
> f6  : 1003e9e3779b97f4a7c16 f7  : 0ffdb8000000000000000
> f8  : 1003e000000000000007f f9  : 1003e0000000000000379
> f10 : 1003e6db6db6db6db6db7 f11 : 1003e000000000000007f
> r1  : a000000100af2db0 r2  : 0000000000000000 r3  : 0000000000000000
> r8  : 0000000000000000 r9  : 0000000000000000 r10 : a0000001007b0f24
> r11 : 0000000000000000 r12 : a0000001007b7280 r13 : a0000001007b0000
> r14 : 0000000000000000 r15 : 0000000000000000 r16 : a0000001007b7310
> r17 : 0000000000000000 r18 : a0000001007b7478 r19 : 0000000000000000
> r20 : 0000000000000000 r21 : 0000000000000018 r22 : 0000000000000000
> r23 : 0000000000000000 r24 : 0000000000000000 r25 : a0000001007b7308
> r26 : 000000007fffffff r27 : a000000100825520 r28 : a0000001008f3c40
> r29 : a000000100816ca8 r30 : 0000000000000018 r31 : 0000000000000018
> 
> 
> 
> (gdb) l *0xa0000001000e6750
> 0xa0000001000e6750 is in __free_pages_ok (mm.h:324).
> 319     extern void FASTCALL(__page_cache_release(struct page *));
> 320     
> 321     static inline int page_count(struct page *page)
> 322     {
> 323             if (unlikely(PageCompound(page)))
> 324                     page = (struct page *)page_private(page);
> 325             return atomic_read(&page->_count);
> 326     }
> 327     
> 328     static inline void get_page(struct page *page)
> 
> 
> Note the misaligned pfns.
> 
> Andy's (misspelled) CONFIG_UNALIGNED_ZONE_BOUNDRIES patch didn't actually
> include an update to any Kconfig files.  But hacking that in by hand didn't
> help.

Interesting.  You are correct there was no config component; at the time
I didn't have direct evidence that any architecture needed it, only that
we had an unchecked requirement on zones, a requirement that had only
recently arrived with the changes to free buddy detection.  I note that
MAX_ORDER is 17 for ia64, so that probably accounts for the
misalignment.  It is clear that the reporting is slightly over-zealous,
as I am reporting zero-sized zones.  I'll get that fixed and send a patch
to you.  I'll also have a look at the patch as added to -mm and try to get
the rest of the spelling sorted :-/.
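
As a reference point, with MAX_ORDER at 17 a zone has to start on a
1 << (MAX_ORDER - 1) = 65536-pfn boundary for buddy coalescing to stay
inside the zone, so the check being reported boils down to something like
this (illustrative only, local variable names assumed):

	if (zone_start_pfn & ((1UL << (MAX_ORDER - 1)) - 1))
		printk(KERN_WARNING "node %d zone %s misaligned start pfn\n",
			nid, zone->name);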

I'll go see if we currently have a machine to test this config on.

-apw

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/6] Have ia64 use add_active_range() and free_area_init_nodes
  2006-05-15  8:21     ` Andy Whitcroft
@ 2006-05-15 10:00       ` Nick Piggin
  2006-05-15 10:19         ` Andy Whitcroft
  0 siblings, 1 reply; 47+ messages in thread
From: Nick Piggin @ 2006-05-15 10:00 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: Andrew Morton, Mel Gorman, davej, tony.luck, linux-kernel,
	bob.picco, ak, linux-mm, linuxppc-dev

Andy Whitcroft wrote:

> Interesting.  You are correct there was no config component, at the time
> I didn't have direct evidence that any architecture needed it, only that
> we had an unchecked requirement on zones, a requirement that had only
> recently arrived with the changes to free buddy detection.  I note that

Recently arrived? Over a year ago with the no-buddy-bitmap patches,
right? Just checking because that's what I'm assuming broke it...

> MAX_ORDER is 17 for ia64 so that probabally accounts for the
> missalignment.  It is clear that the reporting is slightly over-zelous
> as I am reporting zero-sized zones.  I'll get that fixed and patch to
> you.  I'll also have a look at the patch as added to -mm and try and get
> the rest of the spelling sorted :-/.
> 
> I'll go see if we currently have a machine to test this config on.


-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/6] Have ia64 use add_active_range() and free_area_init_nodes
  2006-05-15 10:00       ` Nick Piggin
@ 2006-05-15 10:19         ` Andy Whitcroft
  2006-05-15 10:29           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 47+ messages in thread
From: Andy Whitcroft @ 2006-05-15 10:19 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Mel Gorman, davej, tony.luck, linux-kernel,
	bob.picco, ak, linux-mm, linuxppc-dev

Nick Piggin wrote:
> Andy Whitcroft wrote:
> 
>> Interesting.  You are correct there was no config component, at the time
>> I didn't have direct evidence that any architecture needed it, only that
>> we had an unchecked requirement on zones, a requirement that had only
>> recently arrived with the changes to free buddy detection.  I note that
> 
> 
> Recently arrived? Over a year ago with the no-buddy-bitmap patches,
> right? Just checking because I that's what I'm assuming broke it...

Yep, sorry I forget I was out of the game for 6 months!  And yes that
was when the requirements were altered.

-apw

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/6] Have ia64 use add_active_range() and free_area_init_nodes
  2006-05-15 10:19         ` Andy Whitcroft
@ 2006-05-15 10:29           ` KAMEZAWA Hiroyuki
  2006-05-15 10:47             ` KAMEZAWA Hiroyuki
                               ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-05-15 10:29 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: nickpiggin, akpm, mel, davej, tony.luck, linux-kernel, bob.picco,
	ak, linux-mm, linuxppc-dev

On Mon, 15 May 2006 11:19:27 +0100
Andy Whitcroft <apw@shadowen.org> wrote:

> Nick Piggin wrote:
> > Andy Whitcroft wrote:
> > 
> >> Interesting.  You are correct there was no config component, at the time
> >> I didn't have direct evidence that any architecture needed it, only that
> >> we had an unchecked requirement on zones, a requirement that had only
> >> recently arrived with the changes to free buddy detection.  I note that
> > 
> > 
> > Recently arrived? Over a year ago with the no-buddy-bitmap patches,
> > right? Just checking because I that's what I'm assuming broke it...
> 
> Yep, sorry I forget I was out of the game for 6 months!  And yes that
> was when the requirements were altered.
> 
When the no-bitmap-buddy patches were included,

1. bad_range() was not covered by CONFIG_DEBUG_VM; it always ran.
==
static int bad_range(struct zone *zone, struct page *page)
{
        if (page_to_pfn(page) >= zone->zone_start_pfn + zone->spanned_pages)
                return 1;
        if (page_to_pfn(page) < zone->zone_start_pfn)
                return 1;
==
And this code
==
                buddy = __page_find_buddy(page, page_idx, order);

                if (bad_range(zone, buddy))
                        break;
==

checked whether the buddy was in the zone and guaranteed that it had a page struct.


But later clean-up/speed-up patches removed these checks (I don't know exactly when this happened).
Sorry for missing these things.

-Kame


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/6] Have ia64 use add_active_range() and free_area_init_nodes
  2006-05-15 10:29           ` KAMEZAWA Hiroyuki
@ 2006-05-15 10:47             ` KAMEZAWA Hiroyuki
  2006-05-15 11:02             ` Andy Whitcroft
  2006-05-16  0:31             ` Nick Piggin
  2 siblings, 0 replies; 47+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-05-15 10:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: apw, nickpiggin, akpm, mel, davej, tony.luck, linux-kernel,
	bob.picco, ak, linux-mm, linuxppc-dev

On Mon, 15 May 2006 19:29:18 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Mon, 15 May 2006 11:19:27 +0100
> Andy Whitcroft <apw@shadowen.org> wrote:
> 
> > Nick Piggin wrote:
> > > Andy Whitcroft wrote:
> > > 
> > >> Interesting.  You are correct there was no config component, at the time
> > >> I didn't have direct evidence that any architecture needed it, only that
> > >> we had an unchecked requirement on zones, a requirement that had only
> > >> recently arrived with the changes to free buddy detection.  I note that
> > > 
> > > 
> > > Recently arrived? Over a year ago with the no-buddy-bitmap patches,
> > > right? Just checking because I that's what I'm assuming broke it...
> > 
> > Yep, sorry I forget I was out of the game for 6 months!  And yes that
> > was when the requirements were altered.
> > 
> When no-bitmap-buddy patches was included,
> 
> 1. bad_range() is not covered by CONFIG_VM_DEBUG. It always worked.
> ==
> static int bad_range(struct zone *zone, struct page *page)
> {
>         if (page_to_pfn(page) >= zone->zone_start_pfn + zone->spanned_pages)
>                 return 1;
>         if (page_to_pfn(page) < zone->zone_start_pfn)
>                 return 1;
> ==
> And , this code
> ==
>                 buddy = __page_find_buddy(page, page_idx, order);
> 
>                 if (bad_range(zone, buddy))
>                         break;
> ==
> 
> checked whether buddy is in zone and guarantees it to have page struct.
> 
> 
> But clean-up/speed-up codes vanished these checks. (I don't know when this occurs)
> Sorry for misses these things.
> 

One more point:
When the above no-bitmap patches were included, the only user of non-aligned
zones was ia64, I think. Because ia64 uses a virtual mem_map, page_to_pfn(page)
under CONFIG_DISCONTIGMEM does not access the page struct itself:

#define page_to_pfn(page)	(page - vmemmap)

So it didn't panic; ia64/vmemmap was safe.
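
By contrast, a classic DISCONTIGMEM page_to_pfn() has to look at the page
itself to find its node, roughly like this (paraphrasing
asm-generic/memory_model.h, details may differ per arch):
==
#define __page_to_pfn(pg)						\
({	struct page *__pg = (pg);					\
	struct pglist_data *__pgdat = NODE_DATA(page_to_nid(__pg));	\
	(unsigned long)(__pg - __pgdat->node_mem_map) +			\
	 __pgdat->node_start_pfn;					\
})
==
page_to_nid() reads __pg->flags, so a buddy page that does not really exist
gets dereferenced and the free path oopses.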

If other archs had used non-aligned zones with CONFIG_DISCONTIGMEM, the
non-aligned-zones problem would have come out earlier.

-Kame


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/6] Have ia64 use add_active_range() and free_area_init_nodes
  2006-05-15 10:29           ` KAMEZAWA Hiroyuki
  2006-05-15 10:47             ` KAMEZAWA Hiroyuki
@ 2006-05-15 11:02             ` Andy Whitcroft
  2006-05-16  0:31             ` Nick Piggin
  2 siblings, 0 replies; 47+ messages in thread
From: Andy Whitcroft @ 2006-05-15 11:02 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nickpiggin, akpm, mel, davej, tony.luck, linux-kernel, bob.picco,
	ak, linux-mm, linuxppc-dev

KAMEZAWA Hiroyuki wrote:
> On Mon, 15 May 2006 11:19:27 +0100
> Andy Whitcroft <apw@shadowen.org> wrote:
> 
> 
>>Nick Piggin wrote:
>>
>>>Andy Whitcroft wrote:
>>>
>>>
>>>>Interesting.  You are correct there was no config component, at the time
>>>>I didn't have direct evidence that any architecture needed it, only that
>>>>we had an unchecked requirement on zones, a requirement that had only
>>>>recently arrived with the changes to free buddy detection.  I note that
>>>
>>>
>>>Recently arrived? Over a year ago with the no-buddy-bitmap patches,
>>>right? Just checking because I that's what I'm assuming broke it...
>>
>>Yep, sorry I forget I was out of the game for 6 months!  And yes that
>>was when the requirements were altered.
>>
> 
> When no-bitmap-buddy patches was included,
> 
> 1. bad_range() is not covered by CONFIG_VM_DEBUG. It always worked.
> ==
> static int bad_range(struct zone *zone, struct page *page)
> {
>         if (page_to_pfn(page) >= zone->zone_start_pfn + zone->spanned_pages)
>                 return 1;
>         if (page_to_pfn(page) < zone->zone_start_pfn)
>                 return 1;
> ==
> And , this code
> ==
>                 buddy = __page_find_buddy(page, page_idx, order);
> 
>                 if (bad_range(zone, buddy))
>                         break;
> ==
> 
> checked whether buddy is in zone and guarantees it to have page struct.
> 
> 
> But clean-up/speed-up codes vanished these checks. (I don't know when this occurs)
> Sorry for misses these things.

Heh, sorry to make it sound like it was you who was responsible.

-apw

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/6] Have ia64 use add_active_range() and free_area_init_nodes
  2006-05-15  3:31   ` Andrew Morton
  2006-05-15  8:21     ` Andy Whitcroft
@ 2006-05-15 12:27     ` Mel Gorman
  2006-05-15 22:44       ` Mel Gorman
  2006-05-19 14:03     ` Mel Gorman
  2 siblings, 1 reply; 47+ messages in thread
From: Mel Gorman @ 2006-05-15 12:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andy Whitcroft, davej, tony.luck, linux-kernel, bob.picco, ak,
	linux-mm, linuxppc-dev

On (14/05/06 20:31), Andrew Morton didst pronounce:
> Mel Gorman <mel@csn.ul.ie> wrote:
> >
> > Size zones and holes in an architecture independent manner for ia64.
> > 
> 
> This one makes my ia64 die very early in boot.   The trace is pretty useless.
> 
> config at http://www.zip.com.au/~akpm/linux/patches/stuff/config-ia64
> 
> <log snipped>

Curses. When I tried to reproduce this, the machine booted with my default
config but died before initialising the console with your config. The machine
is far away so I can't see the screen or restart the machine remotely, so
I can only assume it is dying for the same reason yours did.

> Note the misaligned pfns.
> 
> Andy's (misspelled) CONFIG_UNALIGNED_ZONE_BOUNDRIES patch didn't actually
> include an update to any Kconfig files.  But hacking that in by hand didn't
> help.

It would not have helped in this case because the zone boundaries would still
be in the wrong place for ia64. Below is a patch that aligns the zones on
all architectures that use CONFIG_ARCH_POPULATES_NODE_MAP . That is currently
i386, x86_64, powerpc, ppc and ia64. It does *not* align pgdat->node_start_pfn
but I don't believe that it is necessary.

I can't test it on ia64 until I get someone to restart the machine. The patch
compiles and is currently boot-testing on a range of other machines. I hope
to know within 5-6 hours if everything is ok.

diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc4-mm4-clean/mm/page_alloc.c linux-2.6.17-rc4-mm4-ia64_force_alignment/mm/page_alloc.c
--- linux-2.6.17-rc4-mm4-clean/mm/page_alloc.c	2006-05-15 10:37:55.000000000 +0100
+++ linux-2.6.17-rc4-mm4-ia64_force_alignment/mm/page_alloc.c	2006-05-15 13:10:42.000000000 +0100
@@ -2640,14 +2640,20 @@ void __init free_area_init_nodes(unsigne
 {
 	unsigned long nid;
 	int zone_index;
+	unsigned long lowest_pfn = find_min_pfn_with_active_regions();
+
+	lowest_pfn = zone_boundary_align_pfn(lowest_pfn);
+	arch_max_dma_pfn = zone_boundary_align_pfn(arch_max_dma_pfn);
+	arch_max_dma32_pfn = zone_boundary_align_pfn(arch_max_dma32_pfn);
+	arch_max_low_pfn = zone_boundary_align_pfn(arch_max_low_pfn);
+	arch_max_high_pfn = zone_boundary_align_pfn(arch_max_high_pfn);
 
 	/* Record where the zone boundaries are */
 	memset(arch_zone_lowest_possible_pfn, 0,
 				sizeof(arch_zone_lowest_possible_pfn));
 	memset(arch_zone_highest_possible_pfn, 0,
 				sizeof(arch_zone_highest_possible_pfn));
-	arch_zone_lowest_possible_pfn[ZONE_DMA] =
-					find_min_pfn_with_active_regions();
+	arch_zone_lowest_possible_pfn[ZONE_DMA] = lowest_pfn;
 	arch_zone_highest_possible_pfn[ZONE_DMA] = arch_max_dma_pfn;
 	arch_zone_highest_possible_pfn[ZONE_DMA32] = arch_max_dma32_pfn;
 	arch_zone_highest_possible_pfn[ZONE_NORMAL] = arch_max_low_pfn;

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/6] Have ia64 use add_active_range() and free_area_init_nodes
  2006-05-15 12:27     ` Mel Gorman
@ 2006-05-15 22:44       ` Mel Gorman
  0 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2006-05-15 22:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andy Whitcroft, davej, tony.luck, linux-kernel, bob.picco, ak,
	linux-mm, linuxppc-dev

> diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc4-mm4-clean/mm/page_alloc.c linux-2.6.17-rc4-mm4-ia64_force_alignment/mm/page_alloc.c
> --- linux-2.6.17-rc4-mm4-clean/mm/page_alloc.c	2006-05-15 10:37:55.000000000 +0100
> +++ linux-2.6.17-rc4-mm4-ia64_force_alignment/mm/page_alloc.c	2006-05-15 13:10:42.000000000 +0100
> @@ -2640,14 +2640,20 @@ void __init free_area_init_nodes(unsigne
> {
> 	unsigned long nid;
> 	int zone_index;
> +	unsigned long lowest_pfn = find_min_pfn_with_active_regions();
> +
> +	lowest_pfn = zone_boundary_align_pfn(lowest_pfn);
> +	arch_max_dma_pfn = zone_boundary_align_pfn(arch_max_dma_pfn);
> +	arch_max_dma32_pfn = zone_boundary_align_pfn(arch_max_dma32_pfn);
> +	arch_max_low_pfn = zone_boundary_align_pfn(arch_max_low_pfn);
> +	arch_max_high_pfn = zone_boundary_align_pfn(arch_max_high_pfn);
>
> 	/* Record where the zone boundaries are */
> 	memset(arch_zone_lowest_possible_pfn, 0,
> 				sizeof(arch_zone_lowest_possible_pfn));
> 	memset(arch_zone_highest_possible_pfn, 0,
> 				sizeof(arch_zone_highest_possible_pfn));
> -	arch_zone_lowest_possible_pfn[ZONE_DMA] =
> -					find_min_pfn_with_active_regions();
> +	arch_zone_lowest_possible_pfn[ZONE_DMA] = lowest_pfn;
> 	arch_zone_highest_possible_pfn[ZONE_DMA] = arch_max_dma_pfn;
> 	arch_zone_highest_possible_pfn[ZONE_DMA32] = arch_max_dma32_pfn;
> 	arch_zone_highest_possible_pfn[ZONE_NORMAL] = arch_max_low_pfn;
>

Ok, this patch is broken in a number of ways. It doesn't help the IA64 
problem at all, and two other machine configurations failed with the patch 
applied during regression testing. Please drop it and I'll figure out the 
correct solution for your IA64 machine not booting.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/6] Have ia64 use add_active_range() and free_area_init_nodes
  2006-05-15 10:29           ` KAMEZAWA Hiroyuki
  2006-05-15 10:47             ` KAMEZAWA Hiroyuki
  2006-05-15 11:02             ` Andy Whitcroft
@ 2006-05-16  0:31             ` Nick Piggin
  2006-05-16  1:34               ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 47+ messages in thread
From: Nick Piggin @ 2006-05-16  0:31 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andy Whitcroft, akpm, mel, davej, tony.luck, linux-kernel,
	bob.picco, ak, linux-mm, linuxppc-dev

KAMEZAWA Hiroyuki wrote:

>On Mon, 15 May 2006 11:19:27 +0100
>Andy Whitcroft <apw@shadowen.org> wrote:
>
>>>
>>>Recently arrived? Over a year ago with the no-buddy-bitmap patches,
>>>right? Just checking because I that's what I'm assuming broke it...
>>>
>>Yep, sorry I forget I was out of the game for 6 months!  And yes that
>>was when the requirements were altered.
>>
>>
>When no-bitmap-buddy patches was included,
>
>1. bad_range() is not covered by CONFIG_VM_DEBUG. It always worked.
>==
>static int bad_range(struct zone *zone, struct page *page)
>{
>        if (page_to_pfn(page) >= zone->zone_start_pfn + zone->spanned_pages)
>                return 1;
>        if (page_to_pfn(page) < zone->zone_start_pfn)
>                return 1;
>==
>And , this code
>==
>                buddy = __page_find_buddy(page, page_idx, order);
>
>                if (bad_range(zone, buddy))
>                        break;
>==
>
>checked whether buddy is in zone and guarantees it to have page struct.
>

Ah, my mistake indeed. Sorry.

>But clean-up/speed-up codes vanished these checks. (I don't know when this occurs)
>Sorry for misses these things.
>

I think if anything they should be moved into page_is_buddy, however
page_to_pfn is expensive on some architectures, so it is something we
want to be able to opt out of if we do the correct alignment.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/6] Have ia64 use add_active_range() and free_area_init_nodes
  2006-05-16  0:31             ` Nick Piggin
@ 2006-05-16  1:34               ` KAMEZAWA Hiroyuki
  2006-05-16  2:11                 ` Nick Piggin
  0 siblings, 1 reply; 47+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-05-16  1:34 UTC (permalink / raw)
  To: Nick Piggin
  Cc: apw, akpm, mel, davej, tony.luck, linux-kernel, bob.picco, ak,
	linux-mm, linuxppc-dev

On Tue, 16 May 2006 10:31:56 +1000
Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> >But clean-up/speed-up codes vanished these checks. (I don't know when this occurs)
> >Sorry for misses these things.
> >
> 
> I think if anything they should be moved into page_is_buddy, however 
> page_to_pfn
> is expensive on some architectures, so it is something we want to be 
> able to opt
> out of if we do the correct alignment.
> 

Yes, it's expensive.
I received the following e-mail from Dave Hansen in the past:
==
There are some ppc64 machines which have memory laid out like this:

  0-100 MB Node0
100-200 MB Node1
200-300 MB Node0
==
I didn't imagine the above (strange) configuration.
So, a simple range check will not work for all configurations anyway.
(The above example is aligned and has no problem.  Moreover, ppc uses SPARSEMEM.)

Andy's page_zone(page) == page_zone(buddy) check is good, I think.

Making alignment is a difficult problem, I think. It complicates many things.
We can avoid the above check only when the memory layout is ideal.

BTW, how about the following patch?
I don't want to say "Oh, you have to re-compile your kernel with
CONFIG_UNALIGNED_ZONE on your new machine. You are unlucky." to users.

==

Index: linux-2.6.17-rc4-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.17-rc4-mm1.orig/mm/page_alloc.c
+++ linux-2.6.17-rc4-mm1/mm/page_alloc.c
@@ -58,6 +58,7 @@ unsigned long totalhigh_pages __read_mos
 unsigned long totalreserve_pages __read_mostly;
 long nr_swap_pages;
 int percpu_pagelist_fraction;
+int unaligned_zone;
 
 static void __free_pages_ok(struct page *page, unsigned int order);
 
@@ -296,12 +297,12 @@ static inline void rmv_page_order(struct
  *
  * Assumption: *_mem_map is contigious at least up to MAX_ORDER
  */
-static inline struct page *
+static inline long
 __page_find_buddy(struct page *page, unsigned long page_idx, unsigned int order)
 {
 	unsigned long buddy_idx = page_idx ^ (1 << order);
 
-	return page + (buddy_idx - page_idx);
+	return (buddy_idx - page_idx);
 }
 
 static inline unsigned long
@@ -310,6 +311,23 @@ __find_combined_index(unsigned long page
 	return (page_idx & ~(1 << order));
 }
 
+static inline struct page *
+buddy_in_zone(struct zone *zone, struct page *page, long buddy_idx)
+{
+	unsigned long pfn;
+	struct page *buddy;
+	if (!unaligned_zone)
+		return page + buddy_idx;
+	pfn = page_to_pfn(page);
+	if (!pfn_valid(pfn + buddy_idx))
+		return NULL;
+	buddy = page + buddy_idx;
+	if (page_zone_id(page) != page_zone_id(buddy))
+		return NULL;
+	return buddy;
+}
+
+
 /*
  * This function checks whether a page is free && is the buddy
  * we can do coalesce a page and its buddy if
@@ -326,15 +344,10 @@ __find_combined_index(unsigned long page
 static inline int page_is_buddy(struct page *page, struct page *buddy,
 								int order)
 {
-#ifdef CONFIG_HOLES_IN_ZONE
+#if defined(CONFIG_HOLES_IN_ZONE)
 	if (!pfn_valid(page_to_pfn(buddy)))
 		return 0;
 #endif
-#ifdef CONFIG_UNALIGNED_ZONE_BOUNDRIES
-	if (page_zone_id(page) != page_zone_id(buddy))
-		return 0;
-#endif
-
 	if (PageBuddy(buddy) && page_order(buddy) == order) {
 		BUG_ON(page_count(buddy) != 0);
 		return 1;
@@ -383,10 +396,14 @@ static inline void __free_one_page(struc
 	zone->free_pages += order_size;
 	while (order < MAX_ORDER-1) {
 		unsigned long combined_idx;
+		long buddy_idx;
 		struct free_area *area;
 		struct page *buddy;
 
-		buddy = __page_find_buddy(page, page_idx, order);
+		buddy_idx = __page_find_buddy(page, page_idx, order);
+		buddy = buddy_in_zone(zone, page, buddy_idx);
+		if (unlikely(!buddy))
+			break;
 		if (!page_is_buddy(page, buddy, order))
 			break;		/* Move the buddy up one level. */
 
@@ -2434,10 +2451,9 @@ static void __meminit free_area_init_cor
 		struct zone *zone = pgdat->node_zones + j;
 		unsigned long size, realsize;
 
-		if (zone_boundary_align_pfn(zone_start_pfn) != zone_start_pfn)
-			printk(KERN_CRIT "node %d zone %s missaligned "
-				"start pfn, enable UNALIGNED_ZONE_BOUNDRIES\n",
-							nid, zone_names[j]);
+		if (zone_boundary_align_pfn(zone_start_pfn) != zone_start_pfn) {
+			unaligned_zone = 1;
+		}
 
 		size = zone_present_pages_in_node(nid, j, zones_size);
 		realsize = size - zone_absent_pages_in_node(nid, j,
Index: linux-2.6.17-rc4-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.17-rc4-mm1.orig/include/linux/mmzone.h
+++ linux-2.6.17-rc4-mm1/include/linux/mmzone.h
@@ -399,11 +399,7 @@ static inline int is_dma(struct zone *zo
 
 static inline unsigned long zone_boundary_align_pfn(unsigned long pfn)
 {
-#ifdef CONFIG_UNALIGNED_ZONE_BOUNDRIES
-	return pfn;
-#else
 	return pfn & ~((1 << MAX_ORDER) - 1);
-#endif
 }
 
 /* These two functions are used to setup the per zone pages min values */




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/6] Have ia64 use add_active_range() and free_area_init_nodes
  2006-05-16  1:34               ` KAMEZAWA Hiroyuki
@ 2006-05-16  2:11                 ` Nick Piggin
  0 siblings, 0 replies; 47+ messages in thread
From: Nick Piggin @ 2006-05-16  2:11 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: apw, akpm, mel, davej, tony.luck, linux-kernel, bob.picco, ak,
	linux-mm, linuxppc-dev

KAMEZAWA Hiroyuki wrote:

>
>Andy's page_zone(page) == page_zone(buddy) check is good, I think.
>
>Making alignment is a difficult problem, I think. It complecates many things.
>We can avoid above check only when memory layout is ideal.
>
>BTW, How about following patch ?
>I don't want to  say "Oh, you have to re-compile your kernel with 
>CONFIG_UNALIGNED_ZONE on your new machine. you are unlucky." to users.
>

No, this is a function of the architecture code, not the specific
machine it is running on.

So if the architecture ensures alignment and no holes, then they don't
need the overhead of CONFIG_UNALIGNED_ZONE or CONFIG_HOLES_IN_ZONE.

If they do not ensure correct alignment, then they must enable
CONFIG_UNALIGNED_ZONE, even if there may be actual systems which do
result in aligned zones.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/6] Have ia64 use add_active_range() and free_area_init_nodes
  2006-05-15  3:31   ` Andrew Morton
  2006-05-15  8:21     ` Andy Whitcroft
  2006-05-15 12:27     ` Mel Gorman
@ 2006-05-19 14:03     ` Mel Gorman
  2006-05-19 14:23       ` Andy Whitcroft
  2 siblings, 1 reply; 47+ messages in thread
From: Mel Gorman @ 2006-05-19 14:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andy Whitcroft, davej, tony.luck, linux-kernel, bob.picco, ak,
	linux-mm, linuxppc-dev

On Sun, 14 May 2006, Andrew Morton wrote:

> Mel Gorman <mel@csn.ul.ie> wrote:
>>
>> Size zones and holes in an architecture independent manner for ia64.
>>
>
> This one makes my ia64 die very early in boot.   The trace is pretty useless.
>
> config at http://www.zip.com.au/~akpm/linux/patches/stuff/config-ia64
>

An indirect fix for this has been sent out as a patchset with the subject 
"[PATCH 0/2] Fixes for node alignment and flatmem assumptions". For 
arch-independent-zone-sizing, the issue was that FLATMEM assumes that 
NODE_DATA(0)->node_start_pfn == 0. This is not the case with 
arch-independent-zone-sizing and IA64. With arch-independent-zone-sizing, 
a node's node_start_pfn will be at the first valid PFN.
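
For illustration, the flatmem assumption is roughly this (a sketch with
illustrative names, not the exact mm code):

/* Sketch only: with FLATMEM there is one global mem_map[] and the
 * conversion macros index it as if the first struct page were PFN 0.
 * If NODE_DATA(0)->node_start_pfn is non-zero, as it now is on IA64 with
 * arch-independent-zone-sizing, every conversion is off by that offset. */
#define flatmem_pfn_to_page(pfn)	(mem_map + (pfn))
#define flatmem_page_to_pfn(page)	((unsigned long)((page) - mem_map))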

> <log snipped>
>
> Note the misaligned pfns.
>

You will still get the message about misaligned PFNs on IA64. This is 
because the lowest zone starts at the lowest available PFN which may not 
be 0 or any other aligned number. It shouldn't make a difference - or at 
least I couldn't cause any problems.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/6] Have ia64 use add_active_range() and free_area_init_nodes
  2006-05-19 14:03     ` Mel Gorman
@ 2006-05-19 14:23       ` Andy Whitcroft
  0 siblings, 0 replies; 47+ messages in thread
From: Andy Whitcroft @ 2006-05-19 14:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, davej, tony.luck, linux-kernel, bob.picco, ak,
	linux-mm, linuxppc-dev

Mel Gorman wrote:
> On Sun, 14 May 2006, Andrew Morton wrote:
> 
>> Mel Gorman <mel@csn.ul.ie> wrote:
>>
>>>
>>> Size zones and holes in an architecture independent manner for ia64.
>>>
>>
>> This one makes my ia64 die very early in boot.   The trace is pretty
>> useless.
>>
>> config at http://www.zip.com.au/~akpm/linux/patches/stuff/config-ia64
>>
> 
> An indirect fix for this has been set out with a patchset with the
> subject "[PATCH 0/2] Fixes for node alignment and flatmem assumptions" .
> For arch-independent-zone-sizing, the issue was that FLATMEM assumes
> that NODE_DATA(0)->node_start_pfn == 0. This is not the case with
> arch-independent-zone-sizing and IA64. With
> arch-independent-zone-sizing, a nodes node_start_pfn will be at the
> first valid PFN.
> 
>> <log snipped>
>>
>> Note the misaligned pfns.
>>
> 
> You will still get the message about misaligned PFNs on IA64. This is
> because the lowest zone starts at the lowest available PFN which may not
> be 0 or any other aligned number. It shouldn't make a different - or at
> least I couldn't cause any problems.

With the updates I sent out to the zone alignment checks yesterday, this
case should now be correctly ignored without printing the warning.  The
first zone is allowed to be misaligned because we expect alignment of the
mem_map.  With Bob Picco's patch from your set we ensure it is so.

-apw

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
  2006-05-08 14:11 ` [PATCH 4/6] Have x86_64 " Mel Gorman
@ 2006-05-20 20:59   ` Andrew Morton
  2006-05-20 21:27     ` Andi Kleen
                       ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Andrew Morton @ 2006-05-20 20:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: davej, tony.luck, mel, ak, bob.picco, linux-kernel, linuxppc-dev,
	linux-mm

Mel Gorman <mel@csn.ul.ie> wrote:
>
> 
> Size zones and holes in an architecture independent manner for x86_64.
> 
> 

I found a .config which triggers the cant-map-acpitables problem.


With that .config, and without this patch:

Linux version 2.6.17-rc4-mm2 (akpm@box) (gcc version 4.1.0 20060304 (Red Hat 4.6
BIOS-provided physical RAM map:                                                
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000ca605000 (usable)  
 BIOS-e820: 00000000ca605000 - 00000000ca680000 (ACPI NVS)
 BIOS-e820: 00000000ca680000 - 00000000cb5ef000 (usable)  
 BIOS-e820: 00000000cb5ef000 - 00000000cb5fc000 (reserved)
 BIOS-e820: 00000000cb5fc000 - 00000000cb6a2000 (usable)  
 BIOS-e820: 00000000cb6a2000 - 00000000cb6eb000 (ACPI NVS)
 BIOS-e820: 00000000cb6eb000 - 00000000cb6ef000 (usable)  
 BIOS-e820: 00000000cb6ef000 - 00000000cb6ff000 (ACPI data)
 BIOS-e820: 00000000cb6ff000 - 00000000cb700000 (usable)   
 BIOS-e820: 00000000cb700000 - 00000000cc000000 (reserved)
 BIOS-e820: 00000000ffe00000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000130000000 (usable)  
DMI 2.4 present.                                        
ACPI: PM-Timer IO Port: 0x408
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 6:15 APIC version 20                 
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
Processor #1 6:15 APIC version 20                 
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x82] disabled)
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x83] disabled)
ACPI: LAPIC_NMI (acpi_id[0x01] dfl dfl lint[0x1])  
ACPI: LAPIC_NMI (acpi_id[0x02] dfl dfl lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)      


With that .config, and with this patch:

Bootdata ok (command line is ro root=LABEL=/ earlyprintk=serial,ttyS0,9600,keep netconsole=4444@192.168.2.4/eth0,5147@192.168.2.33/00:0D:56:C6:C6:CC)
Linux version 2.6.17-rc4-mm2 (akpm@box) (gcc version 4.1.0 20060304 (Red Hat 4.1.0-3)) #33 SMP Sat May 20 12:08:03 PDT 2006
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000ca605000 (usable)
 BIOS-e820: 00000000ca605000 - 00000000ca680000 (ACPI NVS)
 BIOS-e820: 00000000ca680000 - 00000000cb5ef000 (usable)
 BIOS-e820: 00000000cb5ef000 - 00000000cb5fc000 (reserved)
 BIOS-e820: 00000000cb5fc000 - 00000000cb6a2000 (usable)
 BIOS-e820: 00000000cb6a2000 - 00000000cb6eb000 (ACPI NVS)
 BIOS-e820: 00000000cb6eb000 - 00000000cb6ef000 (usable)
 BIOS-e820: 00000000cb6ef000 - 00000000cb6ff000 (ACPI data)
 BIOS-e820: 00000000cb6ff000 - 00000000cb700000 (usable)
 BIOS-e820: 00000000cb700000 - 00000000cc000000 (reserved)
 BIOS-e820: 00000000ffe00000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000130000000 (usable)
Too many memory regions, truncating
Too many memory regions, truncating
Too many memory regions, truncating
DMI 2.4 present.
ACPI: Unable to map RSDT header
Intel MultiProcessor Specification v1.4
    Virtual Wire compatibility mode.
OEM ID:  Product ID:  APIC at: 0xFEE00000


ACPI disables itself.

Good .config: http://www.zip.com.au/~akpm/linux/patches/stuff/config-good
Bad .config: http://www.zip.com.au/~akpm/linux/patches/stuff/config-bad


The handling of MAX_ACTIVE_REGIONS is unpleasing, sorry.  In my setup it is
5.  But we _really_ only support 4 regions.  So for a start it is misnamed.
The maximum number of regions we support is actually MAX_ACTIVE_REGIONS-1.
And this is a config option too!  So the user must specify
CONFIG_MAX_ACTIVE_REGIONS as the number of active regions plus one, for the
terminating region which has end_pfn=0.  It's weird.
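
To illustrate (nr_registered_regions() below is a made-up helper, not
something in the patch; early_node_map[] is the array in mm/page_alloc.c):

/* The walk stops at the sentinel entry whose end_pfn is 0, so that slot
 * always consumes one of the MAX_ACTIVE_REGIONS entries and only
 * MAX_ACTIVE_REGIONS - 1 real regions ever fit. */
static int nr_registered_regions(void)
{
	int i;

	for (i = 0; early_node_map[i].end_pfn; i++)
		;
	return i;
}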

I would not consider this code to be adequately commented.  Please raise a
patch which comments the major functions - what they do, why they do it,
any caveats or implementations details.  A few lines each - don't overdo
it.  Details such as whether the various end_pfn's are inclusive or
exclusive are important, as is a description of the return value.

Anyway, I just don't get how this code can work.  We have an e820 map with
up to 128 entries (this machine has ten) and we're trying to scrunch that
all into the four-entry early_node_map[].

With config-good we're set up for NUMA, CONFIG_NODES_SHIFT=6.  So
MAX_ACTIVE_REGIONS is enormous.  But it's quite wrong that we're using
number-of-zones*number-of-nodes to size a data structure which has to
accommodate all the entries in the e820 map.  These things aren't related.


On my little x86 PC:

BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009bc00 (usable)
 BIOS-e820: 000000000009bc00 - 000000000009c000 (reserved)
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000000ffc0000 (usable)
 BIOS-e820: 000000000ffc0000 - 000000000fff8000 (ACPI data)
 BIOS-e820: 000000000fff8000 - 0000000010000000 (ACPI NVS)
 BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000ffb80000 - 00000000ffc00000 (reserved)
 BIOS-e820: 00000000fff00000 - 0000000100000000 (reserved)
0MB HIGHMEM available.
255MB LOWMEM available.
found SMP MP-table at 000ff780
Range (nid 0) 0 -> 65472, max 4
On node 0 totalpages: 65472
  DMA zone: 4096 pages, LIFO batch:0
  Normal zone: 61376 pages, LIFO batch:15

So here, the architecture code only called add_active_range() the once, for
the entire memory map.  But on the x86_64 add_active_range() was called
once per e820 entry.  I'm dimly starting to realise that this is perhaps
the problem - the weird-looking definition of MAX_ACTIVE_REGIONS _expects_
the architecture to call add_active_range() with a start_pfn/end_pfn which
describes the entire range of pfns for each zone in each node.  Even if
that span includes not-present pfns.  Would that be correct?  I didn't see
a comment in there describing this design (I do go on).

If so, perhaps the bug is that the x86_64 code isn't doing that.  And that
x86 isn't doing it for some people either.

Anyway.  From the implementation I can see what the code is doing.  But I
see no description of what it is _supposed_ to be doing.  (The process of
finding differences between these two things is known as "debugging").  I
could kludge things by setting MAX_ACTIVE_REGIONS to 1000000, but enough. 
I look forward to the next version ;)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
  2006-05-20 20:59   ` Andrew Morton
@ 2006-05-20 21:27     ` Andi Kleen
  2006-05-20 21:40       ` Andrew Morton
  2006-05-21 16:20       ` Mel Gorman
  2006-05-21 15:50     ` Mel Gorman
  2006-05-23 18:01     ` Mel Gorman
  2 siblings, 2 replies; 47+ messages in thread
From: Andi Kleen @ 2006-05-20 21:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, davej, tony.luck, bob.picco, linux-kernel,
	linuxppc-dev, linux-mm


> Anyway.  From the implementation I can see what the code is doing.  But I
> see no description of what it is _supposed_ to be doing.  (The process of
> finding differences between these two things is known as "debugging").  I
> could kludge things by setting MAX_ACTIVE_REGIONS to 1000000, but enough. 
> I look forward to the next version ;)

Or we could just keep the working old code.

Can somebody remind me what this patch kit was supposed to fix or improve again? 

-Andi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
  2006-05-20 21:27     ` Andi Kleen
@ 2006-05-20 21:40       ` Andrew Morton
  2006-05-20 22:17         ` Andi Kleen
  2006-05-21 16:20       ` Mel Gorman
  1 sibling, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2006-05-20 21:40 UTC (permalink / raw)
  To: Andi Kleen
  Cc: mel, davej, tony.luck, bob.picco, linux-kernel, linuxppc-dev, linux-mm

Andi Kleen <ak@suse.de> wrote:
>
> 
> > Anyway.  From the implementation I can see what the code is doing.  But I
> > see no description of what it is _supposed_ to be doing.  (The process of
> > finding differences between these two things is known as "debugging").  I
> > could kludge things by setting MAX_ACTIVE_REGIONS to 1000000, but enough. 
> > I look forward to the next version ;)
> 
> Or we could just keep the working old code.
> 
> Can somebody remind me what this patch kit was supposed to fix or improve again? 
> 

Well, it creates arch-neutral common code, teaches various architectures
use it.  It's the sort of thing we do all the time.

These things are opportunities to eliminate crufty arch code which few
people understand and replace them with new, clean common code which lots
of people understand.  That's not a bad thing to be doing.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
  2006-05-20 21:40       ` Andrew Morton
@ 2006-05-20 22:17         ` Andi Kleen
  2006-05-20 22:54           ` Andrew Morton
  0 siblings, 1 reply; 47+ messages in thread
From: Andi Kleen @ 2006-05-20 22:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mel, davej, tony.luck, bob.picco, linux-kernel, linuxppc-dev, linux-mm

 
> Well, it creates arch-neutral common code, teaches various architectures
> use it.  It's the sort of thing we do all the time.
> 
> These things are opportunities to eliminate crufty arch code which few
> people understand and replace them with new, clean common code which lots
> of people understand.  That's not a bad thing to be doing.

I'm not fundamentally against that, but so far it seems to just generate lots of 
new bugs?  I'm not sure it's really worth the pain.

-Andi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
  2006-05-20 22:17         ` Andi Kleen
@ 2006-05-20 22:54           ` Andrew Morton
  0 siblings, 0 replies; 47+ messages in thread
From: Andrew Morton @ 2006-05-20 22:54 UTC (permalink / raw)
  To: Andi Kleen
  Cc: mel, davej, tony.luck, bob.picco, linux-kernel, linuxppc-dev, linux-mm

Andi Kleen <ak@suse.de> wrote:
>
>  
> > Well, it creates arch-neutral common code, teaches various architectures
> > use it.  It's the sort of thing we do all the time.
> > 
> > These things are opportunities to eliminate crufty arch code which few
> > people understand and replace them with new, clean common code which lots
> > of people understand.  That's not a bad thing to be doing.
> 
> I'm not fundamentally against that, but so far it seems to just generate lots of 
> new bugs?  I'm not sure it's really worth the pain.
> 

It is a bit disproportionate.  But in some ways that's a commentary on the
current code.   All this numa/sparse/flat/discontig/holes-in-zones/
virt-memmap/ stuff is pretty hairy, especially in its initialisation.

I'm willing to go through the pain if it ends up with something cleaner
which more people understand a little bit.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
  2006-05-20 20:59   ` Andrew Morton
  2006-05-20 21:27     ` Andi Kleen
@ 2006-05-21 15:50     ` Mel Gorman
  2006-05-21 19:08       ` Andrew Morton
  2006-05-23 18:01     ` Mel Gorman
  2 siblings, 1 reply; 47+ messages in thread
From: Mel Gorman @ 2006-05-21 15:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: davej, tony.luck, ak, bob.picco, linux-kernel, linuxppc-dev, linux-mm

On Sat, 20 May 2006, Andrew Morton wrote:

> Mel Gorman <mel@csn.ul.ie> wrote:
>>
>>
>> Size zones and holes in an architecture independent manner for x86_64.
>>
>>
>
> I found a .config which triggers the cant-map-acpitables problem.
>
>
> With that .config, and without this patch:
>
> Linux version 2.6.17-rc4-mm2 (akpm@box) (gcc version 4.1.0 20060304 (Red Hat 4.6
> BIOS-provided physical RAM map:
> BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
> BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
> BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
> BIOS-e820: 0000000000100000 - 00000000ca605000 (usable)
> BIOS-e820: 00000000ca605000 - 00000000ca680000 (ACPI NVS)
> BIOS-e820: 00000000ca680000 - 00000000cb5ef000 (usable)
> BIOS-e820: 00000000cb5ef000 - 00000000cb5fc000 (reserved)
> BIOS-e820: 00000000cb5fc000 - 00000000cb6a2000 (usable)
> BIOS-e820: 00000000cb6a2000 - 00000000cb6eb000 (ACPI NVS)
> BIOS-e820: 00000000cb6eb000 - 00000000cb6ef000 (usable)
> BIOS-e820: 00000000cb6ef000 - 00000000cb6ff000 (ACPI data)
> BIOS-e820: 00000000cb6ff000 - 00000000cb700000 (usable)
> BIOS-e820: 00000000cb700000 - 00000000cc000000 (reserved)
> BIOS-e820: 00000000ffe00000 - 0000000100000000 (reserved)
> BIOS-e820: 0000000100000000 - 0000000130000000 (usable)
> DMI 2.4 present.
> ACPI: PM-Timer IO Port: 0x408
> ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
> Processor #0 6:15 APIC version 20
> ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
> Processor #1 6:15 APIC version 20
> ACPI: LAPIC (acpi_id[0x03] lapic_id[0x82] disabled)
> ACPI: LAPIC (acpi_id[0x04] lapic_id[0x83] disabled)
> ACPI: LAPIC_NMI (acpi_id[0x01] dfl dfl lint[0x1])
> ACPI: LAPIC_NMI (acpi_id[0x02] dfl dfl lint[0x1])
> ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
> IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-23
> ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
>
>
> With that .config, and with this patch:
>
> Bootdata ok (command line is ro root=LABEL=/ earlyprintk=serial,ttyS0,9600,keep netconsole=4444@192.168.2.4/eth0,5147@192.168.2.33/00:0D:56:C6:C6:CC)
> Linux version 2.6.17-rc4-mm2 (akpm@box) (gcc version 4.1.0 20060304 (Red Hat 4.1.0-3)) #33 SMP Sat May 20 12:08:03 PDT 2006
> BIOS-provided physical RAM map:
> BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
> BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
> BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
> BIOS-e820: 0000000000100000 - 00000000ca605000 (usable)
> BIOS-e820: 00000000ca605000 - 00000000ca680000 (ACPI NVS)
> BIOS-e820: 00000000ca680000 - 00000000cb5ef000 (usable)
> BIOS-e820: 00000000cb5ef000 - 00000000cb5fc000 (reserved)
> BIOS-e820: 00000000cb5fc000 - 00000000cb6a2000 (usable)
> BIOS-e820: 00000000cb6a2000 - 00000000cb6eb000 (ACPI NVS)
> BIOS-e820: 00000000cb6eb000 - 00000000cb6ef000 (usable)
> BIOS-e820: 00000000cb6ef000 - 00000000cb6ff000 (ACPI data)
> BIOS-e820: 00000000cb6ff000 - 00000000cb700000 (usable)
> BIOS-e820: 00000000cb700000 - 00000000cc000000 (reserved)
> BIOS-e820: 00000000ffe00000 - 0000000100000000 (reserved)
> BIOS-e820: 0000000100000000 - 0000000130000000 (usable)
> Too many memory regions, truncating
> Too many memory regions, truncating
> Too many memory regions, truncating
> DMI 2.4 present.
> ACPI: Unable to map RSDT header
> Intel MultiProcessor Specification v1.4
>    Virtual Wire compatibility mode.
> OEM ID:  Product ID:  APIC at: 0xFEE00000
>
> ACPI disables itself.
>

ok, not good but at least it's certain. I had been lulled into a false 
sense of security when Christian's machine was still not able to boot with 
the patch backed out. I still have not reproduced the problem locally, but 
I don't have an x86_64 with so many holes either.

> Good .config: http://www.zip.com.au/~akpm/linux/patches/stuff/config-good
> Bad .config: http://www.zip.com.au/~akpm/linux/patches/stuff/config-bad
>
>
> The handling of MAX_ACTIVE_REGIONS is unpleasing, sorry.  In my setup it is
> 5.  But we _really_ only support 4 regions.  So for a start it is misnamed.
> The maximum number of regions we support is actually MAX_ACTIVE_REGIONS-1.
> And this is a config option too!  So the user must specify
> CONFIG_MAX_ACTIVE_REGIONS as the number of active regions plus one, for the
> terminating region which has end_pfn=0.  It's weird.
>

That's a fair comment. I'll put more thought into the definition of 
MAX_ACTIVE_REGIONS and how it is used.

> I would not consider this code to be adequately commented.  Please raise a
> patch which comments the major functions - what they do, why they do it,
> any caveats or implementations details.  A few lines each - don't overdo
> it.  Details such as whether the various end_pfn's are inclusive or
> exclusive are important, as is a description of the return value.
>

Understood. Right now, the closest thing to an explanation is a comment in 
include/linux/mm.h but it is not very detailed.

> Anyway, I just don't get how this code can work.  We have an e820 map with
> up to 128 entries (this machine has ten) and we're trying to scrunch that
> all into the four-entry early_node_map[].
>

Missing E820MAX was a mistake. On x86_64, CONFIG_MAX_ACTIVE_REGIONS should 
have been used. I didn't expect x86_64 to have so many memory holes.

> With config-good we're set up for NUMA, CONFIG_NODES_SHIFT=6.  So
> MAX_ACTIVE_REGIONS is enormous.  But it's quite wrong that we're using
> number-of-zones*number-of-nodes to size a data structure which has to
> accommodate all the entries in the e820 map.  These things aren't related.
>

Very true.

>
> On my little x86 PC:
>
> BIOS-provided physical RAM map:
> BIOS-e820: 0000000000000000 - 000000000009bc00 (usable)
> BIOS-e820: 000000000009bc00 - 000000000009c000 (reserved)
> BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
> BIOS-e820: 0000000000100000 - 000000000ffc0000 (usable)
> BIOS-e820: 000000000ffc0000 - 000000000fff8000 (ACPI data)
> BIOS-e820: 000000000fff8000 - 0000000010000000 (ACPI NVS)
> BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
> BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
> BIOS-e820: 00000000ffb80000 - 00000000ffc00000 (reserved)
> BIOS-e820: 00000000fff00000 - 0000000100000000 (reserved)
> 0MB HIGHMEM available.
> 255MB LOWMEM available.
> found SMP MP-table at 000ff780
> Range (nid 0) 0 -> 65472, max 4
> On node 0 totalpages: 65472
>  DMA zone: 4096 pages, LIFO batch:0
>  Normal zone: 61376 pages, LIFO batch:15
>
> So here, the architecture code only called add_active_range() the once, for
> the entire memory map.

Because in this case, the architecture reported that there was just one 
range of available pages with no holes.

> But on the x86_64 add_active_range() was called once per e820 entry. 
> I'm dimly starting to realise that this is perhaps the problem - the 
> weird-looking definition of MAX_ACTIVE_REGIONS _expects_ the 
> architecture to call add_active_range() with a start_pfn/end_pfn which 
> describes the entire range of pfns for each zone in each node. Even if 
> that span includes not-present pfns.  Would that be correct?

Not quite, and the confusion is because of how I defined 
MAX_ACTIVE_REGIONS. The calculation for it is similar to how i386 with 
SRAT calculates MAXCHUNKS;

#define MAX_CHUNKS_PER_NODE     4
#define MAXCHUNKS               (MAX_CHUNKS_PER_NODE * MAX_NUMNODES)

which has little bearing on any other arch.

What is meant to happen is that add_active_range() is called for every 
range of physical page frames where real memory is, regardless of what 
zone they are in. On x86_64, the ranges are discovered by walking the e820 
map in e820_register_active_ranges(). Once it is known where physical 
memory is, the holes can be figured out. If I have;

Range (nid 0) 0 -> 100
Range (nid 0) 200 -> 400

I know there is a memory hole of 100 pages. The end PFN of each zone is 
passed to free_area_init_nodes(). By walking the early_node_map[], the 
size of each zone and the holes within it can be discovered.
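
As a rough sketch of that hole calculation (an illustrative helper, not the
exact code in the patch):

/*
 * Count how many pages in [range_start, range_end) are not covered by any
 * registered active region. With the two example ranges above,
 * absent_pages(map, 0, 400) reports the 100-page hole.
 */
struct node_active_region {
	unsigned long start_pfn;
	unsigned long end_pfn;		/* exclusive; end_pfn == 0 terminates */
	int nid;
};

static unsigned long absent_pages(struct node_active_region *map,
				  unsigned long range_start,
				  unsigned long range_end)
{
	unsigned long present = 0;
	int i;

	for (i = 0; map[i].end_pfn; i++) {
		unsigned long start = map[i].start_pfn;
		unsigned long end = map[i].end_pfn;

		if (start < range_start)
			start = range_start;
		if (end > range_end)
			end = range_end;
		if (start < end)
			present += end - start;
	}

	return (range_end - range_start) - present;
}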

> I didn't see
> a comment in there describing this design (I do go on).
>

I'll address that issue.

> If so, perhaps the bug is that the x86_64 code isn't doing that.  And that
> x86 isn't doing it for some people either.
>

I'm hoping in this case that having MAX_ACTIVE_REGIONS match E820MAX will 
fix the issue on your machine. I'm still confused why Christian's failed 
to boot with the patch backed out though.

> Anyway.  From the implementation I can see what the code is doing.  But I
> see no description of what it is _supposed_ to be doing.  (The process of
> finding differences between these two things is known as "debugging").  I
> could kludge things by setting MAX_ACTIVE_REGIONS to 1000000, but enough.
> I look forward to the next version ;)
>

I'll start working on it. Thanks a lot.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
  2006-05-20 21:27     ` Andi Kleen
  2006-05-20 21:40       ` Andrew Morton
@ 2006-05-21 16:20       ` Mel Gorman
  1 sibling, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2006-05-21 16:20 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, davej, tony.luck, bob.picco, linux-kernel,
	linuxppc-dev, linux-mm

On Sat, 20 May 2006, Andi Kleen wrote:

>
>> Anyway.  From the implementation I can see what the code is doing.  But I
>> see no description of what it is _supposed_ to be doing.  (The process of
>> finding differences between these two things is known as "debugging").  I
>> could kludge things by setting MAX_ACTIVE_REGIONS to 1000000, but enough.
>> I look forward to the next version ;)
>
> Or we could just keep the working old code.
>
> Can somebody remind me what this patch kit was supposed to fix or 
> improve again?
>

The current code for discovering the zone sizes and holes is sometimes 
very hairy despite there being some similarities between the arches. This 
patch kit will eliminate some of the uglier code and have one place where 
zones and holes can be sized. To me, that is a good idea once the bugs are 
rattled out.

On a related note, parts of the current zone-based anti-fragmentation 
implementation are an architecture-specific mess because changing how 
zones are sized is tricky with the current code. With this patch kit, 
sizing zones for easily reclaimable pages is relatively straightforward.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
  2006-05-21 15:50     ` Mel Gorman
@ 2006-05-21 19:08       ` Andrew Morton
  2006-05-21 22:23         ` Mel Gorman
  0 siblings, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2006-05-21 19:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: davej, tony.luck, ak, bob.picco, linux-kernel, linuxppc-dev, linux-mm

Mel Gorman <mel@csn.ul.ie> wrote:
>

> > Anyway, I just don't get how this code can work.  We have an e820 map with
> > up to 128 entries (this machine has ten) and we're trying to scrunch that
> > all into the four-entry early_node_map[].
> >
> 
> Missing E820MAX was a mistake. On x86_64, CONFIG_MAX_ACTIVE_REGIONS should 
> have been used. I didn't expect x86_64 to have so many memory holes.

x86 uses 128 e820 slots too.

>
> > On my little x86 PC:
> >
> > BIOS-provided physical RAM map:
> > BIOS-e820: 0000000000000000 - 000000000009bc00 (usable)
> > BIOS-e820: 000000000009bc00 - 000000000009c000 (reserved)
> > BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
> > BIOS-e820: 0000000000100000 - 000000000ffc0000 (usable)
> > BIOS-e820: 000000000ffc0000 - 000000000fff8000 (ACPI data)
> > BIOS-e820: 000000000fff8000 - 0000000010000000 (ACPI NVS)
> > BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
> > BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
> > BIOS-e820: 00000000ffb80000 - 00000000ffc00000 (reserved)
> > BIOS-e820: 00000000fff00000 - 0000000100000000 (reserved)
> > 0MB HIGHMEM available.
> > 255MB LOWMEM available.
> > found SMP MP-table at 000ff780
> > Range (nid 0) 0 -> 65472, max 4
> > On node 0 totalpages: 65472
> >  DMA zone: 4096 pages, LIFO batch:0
> >  Normal zone: 61376 pages, LIFO batch:15
> >
> > So here, the architecture code only called add_active_range() the once, for
> > the entire memory map.
>
> Because in this case, the architecture reported that there was just one 
> range of available pages with no holes.

So..  we're registering a single blob of pfns which includes the "reserved"
memory as well as the "ACPI data" and the "ACPI NVS" (with an apparent
off-by-one here).

How come the machine still works?  I guess the architecture went and marked
those pfns reserved.

> > If so, perhaps the bug is that the x86_64 code isn't doing that.  And that
>  > x86 isn't doing it for some people either.
>  >
> 
>  I'm hoping in this case that having MAX_ACTIVE_REGIONS match E820MAX will 
>  fix the issue on your machine.

I expect it will.

One does wonder whether it's worth all this fuss though.  It's only a
24-byte structure and it's all thrown away in free_initmem().  One _could_
just go and do

	#define MAX_ACTIVE_REGIONS 10000

and be happy.

> I'm still confused why Christian's failed 
>  to boot with the patch backed out though.

He didn't get any "Too many memory regions" messages, so it's something
different.

Maybe he hit my off-by-one on his "ACPI data"?

hm, I didn't mention this in the earlier email.   On my x86 I have

  BIOS-provided physical RAM map:
  BIOS-e820: 0000000000000000 - 000000000009bc00 (usable)
  BIOS-e820: 000000000009bc00 - 000000000009c000 (reserved)
  BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
  BIOS-e820: 0000000000100000 - 000000000ffc0000 (usable)
  BIOS-e820: 000000000ffc0000 - 000000000fff8000 (ACPI data)
  BIOS-e820: 000000000fff8000 - 0000000010000000 (ACPI NVS)
  BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
  BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
  BIOS-e820: 00000000ffb80000 - 00000000ffc00000 (reserved)
  BIOS-e820: 00000000fff00000 - 0000000100000000 (reserved)

I added some debug and saw that add_active_range() was getting a
start_pfn=0 and an end_pfn which corresponds with 0x0fffc000.  So my "ACPI
NVS" is getting chopped off.

If Christian is seeing a similar thing then his "ACPI data" will be getting
only part-registered.

I'd suggest that the next rev be liberal in its printking.  This is the
debug patch I used:

 mm/page_alloc.c |   25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff -puN mm/page_alloc.c~a mm/page_alloc.c
--- devel/mm/page_alloc.c~a	2006-05-20 13:19:58.000000000 -0700
+++ devel-akpm/mm/page_alloc.c	2006-05-20 13:20:42.000000000 -0700
@@ -2463,22 +2463,36 @@ void __init add_active_range(unsigned in
 						unsigned long end_pfn)
 {
 	unsigned int i;
-	printk(KERN_DEBUG "Range (%d) %lu -> %lu\n", nid, start_pfn, end_pfn);
+
+	printk("Range (nid %d) %lu -> %lu, max %d\n",
+			nid, start_pfn, end_pfn, MAX_ACTIVE_REGIONS - 1);
 
 	/* Merge with existing active regions if possible */
 	for (i = 0; early_node_map[i].end_pfn; i++) {
-		if (early_node_map[i].nid != nid)
+		printk("i=%d early_node_map[i].nid=%d "
+				"early_node_map[i].start_pfn=%lu "
+				"early_node_map[i].end_pfn=%lu",
+			i, early_node_map[i].nid,
+			early_node_map[i].start_pfn,
+			early_node_map[i].end_pfn);
+
+		if (early_node_map[i].nid != nid) {
+			printk(" continue 1\n");
 			continue;
+		}
 
 		/* Skip if an existing region covers this new one */
 		if (start_pfn >= early_node_map[i].start_pfn &&
-				end_pfn <= early_node_map[i].end_pfn)
+				end_pfn <= early_node_map[i].end_pfn) {
+			printk(" return 1\n");
 			return;
+		}
 
 		/* Merge forward if suitable */
 		if (start_pfn <= early_node_map[i].end_pfn &&
 				end_pfn > early_node_map[i].end_pfn) {
 			early_node_map[i].end_pfn = end_pfn;
+			printk(" return 2\n");
 			return;
 		}
 
@@ -2486,13 +2500,16 @@ void __init add_active_range(unsigned in
 		if (start_pfn < early_node_map[i].end_pfn &&
 				end_pfn >= early_node_map[i].start_pfn) {
 			early_node_map[i].start_pfn = start_pfn;
+			printk(" return 3\n");
 			return;
 		}
+		printk("\n");
 	}
 
 	/* Leave last entry NULL, we use range.end_pfn to terminate the walk */
 	if (i >= MAX_ACTIVE_REGIONS - 1) {
-		printk(KERN_ERR "Too many memory regions, truncating\n");
+		printk(KERN_ERR "More than %d memory regions, truncating\n",
+				MAX_ACTIVE_REGIONS - 1);
 		return;
 	}
 
_


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
  2006-05-21 19:08       ` Andrew Morton
@ 2006-05-21 22:23         ` Mel Gorman
  0 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2006-05-21 22:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: davej, tony.luck, ak, bob.picco, linux-kernel, linuxppc-dev, linux-mm

On Sun, 21 May 2006, Andrew Morton wrote:

> Mel Gorman <mel@csn.ul.ie> wrote:
>>
>
>>> Anyway, I just don't get how this code can work.  We have an e820 map with
>>> up to 128 entries (this machine has ten) and we're trying to scrunch that
>>> all into the four-entry early_node_map[].
>>>
>>
>> Missing E820MAX was a mistake. On x86_64, CONFIG_MAX_ACTIVE_REGIONS should
>> have been used. I didn't expect x86_64 to have so many memory holes.
>
> x86 uses 128 e820 slots too.
>

That is true, but with x86, I am not expecting many regions. For flatmem, 
only one region will be registered. For NUMA, I would expect one 
registration per node *unless* SRAT is being used. With SRAT, there are at 
most MAXCHUNKS regions, which is 4 * MAX_NUMNODES.

>>
>>> On my little x86 PC:
>>>
>>> BIOS-provided physical RAM map:
>>> BIOS-e820: 0000000000000000 - 000000000009bc00 (usable)
>>> BIOS-e820: 000000000009bc00 - 000000000009c000 (reserved)
>>> BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
>>> BIOS-e820: 0000000000100000 - 000000000ffc0000 (usable)
>>> BIOS-e820: 000000000ffc0000 - 000000000fff8000 (ACPI data)
>>> BIOS-e820: 000000000fff8000 - 0000000010000000 (ACPI NVS)
>>> BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
>>> BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
>>> BIOS-e820: 00000000ffb80000 - 00000000ffc00000 (reserved)
>>> BIOS-e820: 00000000fff00000 - 0000000100000000 (reserved)
>>> 0MB HIGHMEM available.
>>> 255MB LOWMEM available.
>>> found SMP MP-table at 000ff780
>>> Range (nid 0) 0 -> 65472, max 4
>>> On node 0 totalpages: 65472
>>>  DMA zone: 4096 pages, LIFO batch:0
>>>  Normal zone: 61376 pages, LIFO batch:15
>>>
>>> So here, the architecture code only called add_active_range() the once, for
>>> the entire memory map.
>>
>> Because in this case, the architecture reported that there was just one
>> range of available pages with no holes.
>
> So..  we're registering a simgle blob of pfns which includes the "reserved"
> memory as well as the "ACPI data" and the "ACPI NVS" (with an apparent
> off-by-one here).
>

The off-by-one is a surprise. On this machine, it must be because the 
arch-specific code calculated highend_pfn wrong. I don't use the e820 on 
i386 because it didn't seem necessary.

> How come the machine still works?  I guess the architecture went and marked
> those pfns reserved.
>

Yes, that is what I'd expect to happen. The ranges are registered and a 
memmap allocated but the freeing of memory from bootmem is still the same 
on i386. For i386, my patchset reports the same size of zones and 
start_pfn on each node so there should be no difference in the end result 
between my code and the arch-specific initialisation.

>>> If so, perhaps the bug is that the x86_64 code isn't doing that.  And that
>> > x86 isn't doing it for some people either.
>> >
>>
>>  I'm hoping in this case that having MAX_ACTIVE_REGIONS match E820MAX will
>>  fix the issue on your machine.
>
> I expect it will.
>
> One does wonder whether it's worth all this fuss though.  It's only a
> 24-byte structure and it's all thrown away in free_initmem().  One _could_
> just go and do
>
> 	#define MAX_ACTIVE_REGIONS 10000
>
> and be happy.
>

I could, but I thought I'd be shot for trying something like that. A fixed 
value of 128 would cover the largest tables I'm aware of on all 
architectures. Should I just set that fixed value?

>> I'm still confused why Christian's failed
>>  to boot with the patch backed out though.
>
> He didn't get any "Too many memory regions" messages, so it's something
> different.
>
> Maybe he hit my off-by-one on his "ACPI data"?
>

Possibly, but the off-by-one error for you was on x86, not x86_64, and I 
suspect that highend_pfn was wrong in this case. I'll be checking tomorrow 
where an off-by-one error could occur.

> hm, I didn't mention this in the earlier email.   On my x86 I have
>
>  BIOS-provided physical RAM map:
>  BIOS-e820: 0000000000000000 - 000000000009bc00 (usable)
>  BIOS-e820: 000000000009bc00 - 000000000009c000 (reserved)
>  BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
>  BIOS-e820: 0000000000100000 - 000000000ffc0000 (usable)
>  BIOS-e820: 000000000ffc0000 - 000000000fff8000 (ACPI data)
>  BIOS-e820: 000000000fff8000 - 0000000010000000 (ACPI NVS)
>  BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
>  BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
>  BIOS-e820: 00000000ffb80000 - 00000000ffc00000 (reserved)
>  BIOS-e820: 00000000fff00000 - 0000000100000000 (reserved)
>
> I added some debug and saw that add_active_range() was getting a
> start_pfn=0 and an end_pfn which corresponds with 0x0fffc000.  So my "ACPI
> NVS" is getting chopped off.
>

Yes. However, this just means that the memory in that PFN range will 
not be backed by memmap. This would only be a problem if free_bootmem() is 
called on that range of pages. If that were happening, I would be 
expecting early oopses or bad_page reports during the boot process.

> If Christian is seeing a similar thing then his "ACPI data" will be getting
> only part-registered.
>
> I'd suggest that the next rev be liberal in its printking.  This is the
> debug patch I used:
>

I also have an old debug patch that was very printk happy. I will dust it 
off and add it with the additional information from your patch.

> mm/page_alloc.c |   25 +++++++++++++++++++++----
> 1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff -puN mm/page_alloc.c~a mm/page_alloc.c
> --- devel/mm/page_alloc.c~a	2006-05-20 13:19:58.000000000 -0700
> +++ devel-akpm/mm/page_alloc.c	2006-05-20 13:20:42.000000000 -0700
> @@ -2463,22 +2463,36 @@ void __init add_active_range(unsigned in
> 						unsigned long end_pfn)
> {
> 	unsigned int i;
> -	printk(KERN_DEBUG "Range (%d) %lu -> %lu\n", nid, start_pfn, end_pfn);
> +
> +	printk("Range (nid %d) %lu -> %lu, max %d\n",
> +			nid, start_pfn, end_pfn, MAX_ACTIVE_REGIONS - 1);
>
> 	/* Merge with existing active regions if possible */
> 	for (i = 0; early_node_map[i].end_pfn; i++) {
> -		if (early_node_map[i].nid != nid)
> +		printk("i=%d early_node_map[i].nid=%d "
> +				"early_node_map[i].start_pfn=%lu "
> +				"early_node_map[i].end_pfn=%lu",
> +			i, early_node_map[i].nid,
> +			early_node_map[i].start_pfn,
> +			early_node_map[i].end_pfn);
> +
> +		if (early_node_map[i].nid != nid) {
> +			printk(" continue 1\n");
> 			continue;
> +		}
>
> 		/* Skip if an existing region covers this new one */
> 		if (start_pfn >= early_node_map[i].start_pfn &&
> -				end_pfn <= early_node_map[i].end_pfn)
> +				end_pfn <= early_node_map[i].end_pfn) {
> +			printk(" return 1\n");
> 			return;
> +		}
>
> 		/* Merge forward if suitable */
> 		if (start_pfn <= early_node_map[i].end_pfn &&
> 				end_pfn > early_node_map[i].end_pfn) {
> 			early_node_map[i].end_pfn = end_pfn;
> +			printk(" return 2\n");
> 			return;
> 		}
>
> @@ -2486,13 +2500,16 @@ void __init add_active_range(unsigned in
> 		if (start_pfn < early_node_map[i].end_pfn &&
> 				end_pfn >= early_node_map[i].start_pfn) {
> 			early_node_map[i].start_pfn = start_pfn;
> +			printk(" return 3\n");
> 			return;
> 		}
> +		printk("\n");
> 	}
>
> 	/* Leave last entry NULL, we use range.end_pfn to terminate the walk */
> 	if (i >= MAX_ACTIVE_REGIONS - 1) {
> -		printk(KERN_ERR "Too many memory regions, truncating\n");
> +		printk(KERN_ERR "More than %d memory regions, truncating\n",
> +				MAX_ACTIVE_REGIONS - 1);
> 		return;
> 	}
>
> _
>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
  2006-05-20 20:59   ` Andrew Morton
  2006-05-20 21:27     ` Andi Kleen
  2006-05-21 15:50     ` Mel Gorman
@ 2006-05-23 18:01     ` Mel Gorman
  2 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2006-05-23 18:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: davej, tony.luck, ak, bob.picco, linux-kernel, linuxppc-dev, linux-mm

On Sat, 20 May 2006, Andrew Morton wrote:

> Mel Gorman <mel@csn.ul.ie> wrote:
>>
>>
>> Size zones and holes in an architecture independent manner for x86_64.
>>
>>
>
> I found a .config which triggers the cant-map-acpitables problem.
>
>
> With that .config, and without this patch:
>
> Linux version 2.6.17-rc4-mm2 (akpm@box) (gcc version 4.1.0 20060304 (Red Hat 4.6
> BIOS-provided physical RAM map:
> BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
> BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
> BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
> BIOS-e820: 0000000000100000 - 00000000ca605000 (usable)
> BIOS-e820: 00000000ca605000 - 00000000ca680000 (ACPI NVS)
> BIOS-e820: 00000000ca680000 - 00000000cb5ef000 (usable)
> BIOS-e820: 00000000cb5ef000 - 00000000cb5fc000 (reserved)
> BIOS-e820: 00000000cb5fc000 - 00000000cb6a2000 (usable)
> BIOS-e820: 00000000cb6a2000 - 00000000cb6eb000 (ACPI NVS)
> BIOS-e820: 00000000cb6eb000 - 00000000cb6ef000 (usable)
> BIOS-e820: 00000000cb6ef000 - 00000000cb6ff000 (ACPI data)
> BIOS-e820: 00000000cb6ff000 - 00000000cb700000 (usable)
> BIOS-e820: 00000000cb700000 - 00000000cc000000 (reserved)
> BIOS-e820: 00000000ffe00000 - 0000000100000000 (reserved)
> BIOS-e820: 0000000100000000 - 0000000130000000 (usable)
> DMI 2.4 present.
> ACPI: PM-Timer IO Port: 0x408
> ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
> Processor #0 6:15 APIC version 20
> ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
> Processor #1 6:15 APIC version 20
> ACPI: LAPIC (acpi_id[0x03] lapic_id[0x82] disabled)
> ACPI: LAPIC (acpi_id[0x04] lapic_id[0x83] disabled)
> ACPI: LAPIC_NMI (acpi_id[0x01] dfl dfl lint[0x1])
> ACPI: LAPIC_NMI (acpi_id[0x02] dfl dfl lint[0x1])
> ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
> IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-23
> ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
>
>
> With that .config, and with this patch:
>
> Bootdata ok (command line is ro root=LABEL=/ earlyprintk=serial,ttyS0,9600,keep netconsole=4444@192.168.2.4/eth0,5147@192.168.2.33/00:0D:56:C6:C6:CC)
> Linux version 2.6.17-rc4-mm2 (akpm@box) (gcc version 4.1.0 20060304 (Red Hat 4.1.0-3)) #33 SMP Sat May 20 12:08:03 PDT 2006
> BIOS-provided physical RAM map:
> BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
> BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
> BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
> BIOS-e820: 0000000000100000 - 00000000ca605000 (usable)
> BIOS-e820: 00000000ca605000 - 00000000ca680000 (ACPI NVS)
> BIOS-e820: 00000000ca680000 - 00000000cb5ef000 (usable)
> BIOS-e820: 00000000cb5ef000 - 00000000cb5fc000 (reserved)
> BIOS-e820: 00000000cb5fc000 - 00000000cb6a2000 (usable)
> BIOS-e820: 00000000cb6a2000 - 00000000cb6eb000 (ACPI NVS)
> BIOS-e820: 00000000cb6eb000 - 00000000cb6ef000 (usable)
> BIOS-e820: 00000000cb6ef000 - 00000000cb6ff000 (ACPI data)
> BIOS-e820: 00000000cb6ff000 - 00000000cb700000 (usable)
> BIOS-e820: 00000000cb700000 - 00000000cc000000 (reserved)
> BIOS-e820: 00000000ffe00000 - 0000000100000000 (reserved)
> BIOS-e820: 0000000100000000 - 0000000130000000 (usable)
> Too many memory regions, truncating
> Too many memory regions, truncating
> Too many memory regions, truncating
> DMI 2.4 present.
> ACPI: Unable to map RSDT header
> Intel MultiProcessor Specification v1.4
>    Virtual Wire compatibility mode.
> OEM ID:  Product ID:  APIC at: 0xFEE00000
>

I think I have figured out what went wrong here.

arch/i386/kernel/acpi/boot.c has a __acpi_map_table() function which uses 
a variable, end_pfn_map, defined in arch/x86_64/kernel/e820.c.
Part of the arch-independent-zone-sizing patch calculates end_pfn_map from 
early_node_map[], which only contains information on real RAM regions.

On Christian's machine, there is no usable region after the ACPI table 
data so early_node_map[] finishes just before the ACPI tables. This 
results in the wrong value for end_pfn_map and the table fails to be 
mapped. In the case of Andrew's machine, the regions got truncated and nothing 
after ACPI NVS was recorded, including the ACPI data, which is why it fails to 
boot.
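
As a minimal sketch of the problem (illustrative only; the structure and
helper below are assumptions, not the actual e820.c code), deriving
end_pfn_map from the registered active ranges can only ever return the end
of the last usable RAM range:

struct node_active_region {
	unsigned long start_pfn;
	unsigned long end_pfn;
	int nid;
};

/* highest pfn covered by any registered E820_RAM range; ACPI data/NVS
 * ranges are never registered, so tables above the last usable range
 * fall beyond this value and __acpi_map_table() refuses to map them */
static unsigned long sketch_end_pfn_map(const struct node_active_region *map,
					int nr_entries)
{
	unsigned long end = 0;
	int i;

	for (i = 0; i < nr_entries; i++)
		if (map[i].end_pfn > end)
			end = map[i].end_pfn;
	return end;
}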

I am not ready to release another set of patches, but I think this was the 
cause of the magic failures on x86_64.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
  2006-09-01  3:08           ` Keith Mannthey
  2006-09-01  8:33             ` Mel Gorman
@ 2006-09-04 15:36             ` Mel Gorman
  1 sibling, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2006-09-04 15:36 UTC (permalink / raw)
  To: Keith Mannthey
  Cc: akpm, tony.luck, Linux Memory Management List, ak, bob.picco,
	Linux Kernel Mailing List, linuxppc-dev

On (31/08/06 20:08), Keith Mannthey didst pronounce:
> >So, do you actally expect a lot of unused mem_map to be allocated with
> >struct pages that are inactive until memory is hot-added in an
> >x86_64-specific manner? The arch-independent stuff currently will not do
> >that. It sets up memmap for where memory really exists. If that is not
> >what you expect, it will hit issues at hotadd time which is not the
> >current issue but one that can be fixed.
> 
> Yes. RESERVED based is a big waste of mem_map space.  The add areas
> are marked as RESERVED during boot and then later onlined during add.
> It might be ok.  I will play with tomorrow.  I might just need to
> call add_active_range in the right spot :)
> 

Following this mail should be two patches that may address the problem
with reserved memory hot-add. One assumption made by arch-independent
zone-sizing was that the only memory holes of interest were those before the
end of physical memory.  Another assumption was that mem_map should only be
allocated for memory that was physically present in the machine.

With MEMORY_HOTPLUG_RESERVE on x86_64, these assumptions do not hold. This
feature expects that mem_map is allocated at boot time and later activated
on a memory hot-add event. To determine if the region is usable for hot-add
in the future, holes are calculated beyond the end of physical memory.

The following two patches fix these two assumptions. They have been boot-tested
on a range of hardware (x86, ppc64, ia64 and x86_64) so there should be no
new regressions.

I don't have access to hardware that can use MEMORY_HOTPLUG_RESERVE so I'd
appreciate hearing if the patches work. I wrote a test program that simulated
the input from the machine the problem was reported on.  It registers active
memory and simulates the check made by reserve_hotadd(). push_node_boundaries()
is called to push the end of the node out by 100 pages, as the SRAT would for
reserve hot-add, and it appears to do the right thing. Output is below.

mel@arnold:~/patches/brokenout/zonesizing/driver_test$ gcc driver_test.c -o
driver_test && ./driver_test | grep -v "active with" | grep -v
account_node_boundary
Stage 1: Registering active ranges
Entering add_active_range(0, 0, 152) 0 entries of 96 used
Entering add_active_range(0, 256, 524165) 1 entries of 96 used
Entering add_active_range(0, 1048576, 4653056) 2 entries of 96 used
Entering add_active_range(1, 17235968, 18219008) 3 entries of 96 used

Dumping active map
0: 0  0 -> 152
1: 0  256 -> 524165
2: 0  1048576 -> 4653056
3: 1  17235968 -> 18219008
Entering push_node_boundaries(0, 0, 4653156)

Checking reserve-hotadd
  absent_pages_in_range(4653056, 17235968) == 17235968 - 4653056 == 12582912
  absent_pages_in_range(18219008, 52428800) == 52428800 - 18219008 == 34209792

Stage 2: Calculating zone sizes and holes

Stage 3: Dumping zone sizes and holes
zone_size[0][0] =     4096 zone_holes[0][0] =      104
zone_size[0][1] =  1044480 zone_holes[0][1] =   524411
zone_size[0][2] =  3604580 zone_holes[0][2] =      100
zone_size[1][2] =   983040 zone_holes[1][2] =        0

Stage 4: Printing present pages
On node 0, 4128541 pages
 zone 0 present_pages = 3992
 zone 1 present_pages = 520069
 zone 2 present_pages = 3604480
On node 1, 983040 pages
 zone 2 present_pages = 983040
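
For reference, the check being simulated in the "Checking reserve-hotadd"
stage above boils down to requiring the hotadd candidate range to be
entirely absent; a minimal sketch (the wrapper name is made up here,
absent_pages_in_range() is the real interface):

static int sketch_hotadd_range_is_free(unsigned long s_pfn, unsigned long e_pfn)
{
	/* the range may only be reserved for hot-add if no registered
	 * active range overlaps it, i.e. every pfn in it is absent */
	return absent_pages_in_range(s_pfn, e_pfn) == e_pfn - s_pfn;
}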

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
  2006-09-01  8:33             ` Mel Gorman
@ 2006-09-01  8:46               ` Mika Penttilä
  0 siblings, 0 replies; 47+ messages in thread
From: Mika Penttilä @ 2006-09-01  8:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Keith Mannthey, akpm, tony.luck, Linux Memory Management List,
	ak, bob.picco, Linux Kernel Mailing List, linuxppc-dev


>
> Right, it's all very clear now. At some point in the future, I'd like 
> to visit why SPARSEMEM-based hot-add is not always used but it's a 
> separate issue.
>
>> The add areas
>> are marked as RESERVED during boot and then later onlined during add.
>
> That explains the reserve_bootmem_node()
>
But pages are marked reserved by default. You still have to allocate the 
bootmem map for the whole node range, including reserve hot-add areas and 
areas beyond e820-end-of-ram. So all the areas are already reserved until 
freed.

--Mika


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
  2006-09-01  3:08           ` Keith Mannthey
@ 2006-09-01  8:33             ` Mel Gorman
  2006-09-01  8:46               ` Mika Penttilä
  2006-09-04 15:36             ` Mel Gorman
  1 sibling, 1 reply; 47+ messages in thread
From: Mel Gorman @ 2006-09-01  8:33 UTC (permalink / raw)
  To: Keith Mannthey
  Cc: akpm, tony.luck, Linux Memory Management List, ak, bob.picco,
	Linux Kernel Mailing List, linuxppc-dev

On Thu, 31 Aug 2006, Keith Mannthey wrote:

> On 8/31/06, Mel Gorman <mel@csn.ul.ie> wrote:
>> On Thu, 31 Aug 2006, Keith Mannthey wrote:
>> > On 8/31/06, Mel Gorman <mel@skynet.ie> wrote:
>> >> On (30/08/06 13:57), Keith Mannthey didst pronounce:
>> >> > On 8/21/06, Mel Gorman <mel@csn.ul.ie> wrote:
>> >> > >
>
>> Can you confirm that happens by applying the patch I sent to you and
>> checking the output? When the reserve fails, it should print out what
>> range it actually checked. I want to be sure it's not checking the
>> addresses 0->0x1070000000
>
> See below
>

Perfect, thanks a lot. I should have enough to reproduce what is going on 
without a test machine and develop the required patches.

>> >> > >@@ -329,6 +330,8 @@ acpi_numa_memory_affinity_init(struct ac
>> >> > >
>> >> > >        printk(KERN_INFO "SRAT: Node %u PXM %u %Lx-%Lx\n", node, pxm,
>> >> > >               nd->start, nd->end);
>> >> > >+       e820_register_active_regions(node, nd->start >> PAGE_SHIFT,
>> >> > >+                                               nd->end >> 
>> PAGE_SHIFT);
>> >> >
>> >> > A node chunk in this section of code may be a hot-pluggable zone. With
>> >> > MEMORY_HOTPLUG_SPARSE we don't want to register these regions.
>> >> >
>> >>
>> >> The ranges should not get registered as active memory by
>> >> e820_register_active_regions() unless they are marked E820_RAM. My
>> >> understanding is that the regions for hotadd would be marked "reserved"
>> >> in the e820 map. Is that wrong?
>> >
>> > This is wrong.  In a mult-node system that last node add area will not
>> > be marked reserved by the e820.  The e820 only defines memory <
>> > end_pfn. the last node add area is > end_pfn.
>> >
>> 
>> ok, that should still be fine. As long as the ranges are not marked
>> "usable", add_active_range() will not be called and the holes should be
>> counted correctly with the patch I sent you.
>> 
>> > With RESERVE based add-memory you want the add-areas repored by the
>> > srat to be setup during boot like all the other pages.
>> >
>> 
>> So, do you actally expect a lot of unused mem_map to be allocated with
>> struct pages that are inactive until memory is hot-added in an
>> x86_64-specific manner? The arch-independent stuff currently will not do
>> that. It sets up memmap for where memory really exists. If that is not
>> what you expect, it will hit issues at hotadd time which is not the
>> current issue but one that can be fixed.
>
> Yes. RESERVED based is a big waste of mem_map space.

Right, it's all very clear now. At some point in the future, I'd like to 
look at why SPARSEMEM-based hot-add is not always used, but it's a separate 
issue.

> The add areas
> are marked as RESERVED during boot and then later onlined during add.

That explains the reserve_bootmem_node()

> It might be ok.  I will play with tomorrow.  I might just need to
> call add_active_range in the right spot :)
>

I'll play with this from the opposite perspective - what is required for 
any arch using this API to have spare memmap for reserve-based hot-add.

>> >> > >        if (ma->flags.hot_pluggable && !reserve_hotadd(node, start, 
>> end)
>> >> <
>> >> > >        0) {
>> >> > >                /* Ignore hotadd region. Undo damage */
>> >> >
>> >> >  I have but the e820_register_active_regions as a else to this
>> >> > statment the absent pages check fails.
>> >> >
>> >>
>> >> The patch below omits this change because I think
>> >> e820_register_active_regions() will still have got called by the time
>> >> you encounter a hotplug area.
>> >
>> > called but then removed in setup arch.
>> 
>> By "removed", I assume you mean the active regions removed by the call
>> to remove_all_active_regions() in setup_arch(). Before reserve_hotadd() is
>> called, e820_register_active_regions() will have reregistered the active
>> regions with the NUMA node id.
>
> I see e820_register_active_regions is acting as a filter against the e820
>

Yes. A range of pfns is given to e820_register_active_regions() and it 
reads the e820 for E820_RAM sections within that range.
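
Roughly, the filtering works like the sketch below (illustrative only; the
clamping details and the name are assumptions, not the literal
arch/x86_64/kernel/e820.c code):

static void sketch_e820_register_active_regions(int nid,
				unsigned long start_pfn, unsigned long end_pfn)
{
	int i;

	for (i = 0; i < e820.nr_map; i++) {
		struct e820entry *ei = &e820.map[i];
		unsigned long ei_start = ei->addr >> PAGE_SHIFT;
		unsigned long ei_end = (ei->addr + ei->size) >> PAGE_SHIFT;

		/* reserved, ACPI data/NVS etc. are never registered */
		if (ei->type != E820_RAM)
			continue;

		/* clamp the e820 entry to the pfn range asked about */
		if (ei_start < start_pfn)
			ei_start = start_pfn;
		if (ei_end > end_pfn)
			ei_end = end_pfn;

		if (ei_start < ei_end)
			add_active_range(nid, ei_start, ei_end);
	}
}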

>> >> > Also nodes_cover_memory and alot of these check were based against
>> >> > comparing the srat data against the e820.  Now all this code is
>> >> > comparing SRAT against SRAT....
>> >> >
>> >>
>> >> I don't see why. The SRAT table passes a range to
>> >> e820_register_active_regions() so should be comparing SRAT to e820
>> >
>> > let me go off and look at e820_register_active_regions() some more.
> Things get clear :)
>
> Should be ok.
>

nice.

>> > Sure thing.  It is just the hot-add area I am guessing it is an off by
>> > one error of some sort.
>> >
> See below. I do my e820_register_active_area as an else to to if
> (hotplug.....!reserve) and the prink is easy to sort out.
>

yep, it should be easy to put this into a test program.

> I see your pfn are in base 10.  Looks like it considers the last
> addres to be a present page. (off by one thing).
>

Probably. I'll start poking now. Thanks a lot.

> Thanks,
> Keith
>
> Output below
> disabling early console
> Linux version 2.6.18-rc4-mm3-smp (root@elm3a153) (gcc version 4.1.0
> (SUSE Linux)) #6 SMP Thu Aug 31 22:06:00 EDT 2006
> Command line: root=/dev/sda3
> ip=9.47.66.153:9.47.66.169:9.47.66.1:255.255.255.0 resume=/dev/sda2
> showopts earlyprintk=ttyS0,115200 console=ttyS0,115200 console=tty0
> debug numa=hotadd=100
> BIOS-provided physical RAM map:
> BIOS-e820: 0000000000000000 - 0000000000098400 (usable)
> BIOS-e820: 0000000000098400 - 00000000000a0000 (reserved)
> BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
> BIOS-e820: 0000000000100000 - 000000007ff85e00 (usable)
> BIOS-e820: 000000007ff85e00 - 000000007ff98880 (ACPI data)
> BIOS-e820: 000000007ff98880 - 0000000080000000 (reserved)
> BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
> BIOS-e820: 0000000100000000 - 0000000470000000 (usable)
> BIOS-e820: 0000001070000000 - 0000001160000000 (usable)
> Entering add_active_range(0, 0, 152) 0 entries of 3200 used
> Entering add_active_range(0, 256, 524165) 1 entries of 3200 used
> Entering add_active_range(0, 1048576, 4653056) 2 entries of 3200 used
> Entering add_active_range(0, 17235968, 18219008) 3 entries of 3200 used
> end_pfn_map = 18219008
> DMI 2.3 present.
> ACPI: RSDP (v000 IBM                                   ) @ 0x00000000000fdcf0
> ACPI: RSDT (v001 IBM    EXA01ZEU 0x00001000 IBM  0x45444f43) @
> 0x000000007ff98800
> ACPI: FADT (v001 IBM    EXA01ZEU 0x00001000 IBM  0x45444f43) @
> 0x000000007ff98780
> ACPI: MADT (v001 IBM    EXA01ZEU 0x00001000 IBM  0x45444f43) @
> 0x000000007ff98600
> ACPI: SRAT (v001 IBM    EXA01ZEU 0x00001000 IBM  0x45444f43) @
> 0x000000007ff983c0
> ACPI: HPET (v001 IBM    EXA01ZEU 0x00001000 IBM  0x45444f43) @
> 0x000000007ff98380
> ACPI: SSDT (v001 IBM    VIGSSDT0 0x00001000 INTL 0x20030122) @
> 0x000000007ff90780
> ACPI: SSDT (v001 IBM    VIGSSDT1 0x00001000 INTL 0x20030122) @
> 0x000000007ff88bc0
> ACPI: DSDT (v001 IBM    EXA01ZEU 0x00001000 INTL 0x20030122) @
> 0x0000000000000000
> SRAT: PXM 0 -> APIC 0 -> Node 0
> SRAT: PXM 0 -> APIC 1 -> Node 0
> SRAT: PXM 0 -> APIC 2 -> Node 0
> SRAT: PXM 0 -> APIC 3 -> Node 0
> SRAT: PXM 0 -> APIC 38 -> Node 0
> SRAT: PXM 0 -> APIC 39 -> Node 0
> SRAT: PXM 0 -> APIC 36 -> Node 0
> SRAT: PXM 0 -> APIC 37 -> Node 0
> SRAT: PXM 1 -> APIC 64 -> Node 1
> SRAT: PXM 1 -> APIC 65 -> Node 1
> SRAT: PXM 1 -> APIC 66 -> Node 1
> SRAT: PXM 1 -> APIC 67 -> Node 1
> SRAT: PXM 1 -> APIC 102 -> Node 1
> SRAT: PXM 1 -> APIC 103 -> Node 1
> SRAT: PXM 1 -> APIC 100 -> Node 1
> SRAT: PXM 1 -> APIC 101 -> Node 1
> SRAT: Node 0 PXM 0 0-80000000
> Entering add_active_range(0, 0, 152) 0 entries of 3200 used
> Entering add_active_range(0, 256, 524165) 1 entries of 3200 used
> SRAT: Node 0 PXM 0 0-470000000
> Entering add_active_range(0, 0, 152) 2 entries of 3200 used
> Entering add_active_range(0, 256, 524165) 2 entries of 3200 used
> Entering add_active_range(0, 1048576, 4653056) 2 entries of 3200 used
> SRAT: Node 0 PXM 0 0-1070000000
> reserve_hotadd called with node 0 sart 470000000 end 1070000000
> SRAT: Hotplug area has existing memory
> Entering add_active_range(0, 0, 152) 3 entries of 3200 used
> Entering add_active_range(0, 256, 524165) 3 entries of 3200 used
> Entering add_active_range(0, 1048576, 4653056) 3 entries of 3200 used
> SRAT: Node 1 PXM 1 1070000000-1160000000
> Entering add_active_range(1, 17235968, 18219008) 3 entries of 3200 used
> SRAT: Node 1 PXM 1 1070000000-3200000000
> reserve_hotadd called with node 1 sart 1160000000 end 3200000000
> SRAT: Hotplug area has existing memory
> Entering add_active_range(1, 17235968, 18219008) 4 entries of 3200 used
> NUMA: Using 28 for the hash shift.
> Bootmem setup node 0 0000000000000000-0000001070000000
> Bootmem setup node 1 0000001070000000-0000001160000000
> Zone PFN ranges:
> DMA             0 ->     4096
> DMA32        4096 ->  1048576
> Normal    1048576 -> 18219008
> early_node_map[4] active PFN ranges
>   0:        0 ->      152
>   0:      256 ->   524165
>   0:  1048576 ->  4653056
>   1: 17235968 -> 18219008
> On node 0 totalpages: 4128541
> 0 pages used for SPARSE memmap
> 1149 pages DMA reserved
> DMA zone: 2843 pages, LIFO batch:0
> 0 pages used for SPARSE memmap
> DMA32 zone: 520069 pages, LIFO batch:31
> 0 pages used for SPARSE memmap
> Normal zone: 3604480 pages, LIFO batch:31
> On node 1 totalpages: 983040
> 0 pages used for SPARSE memmap
> 0 pages used for SPARSE memmap
> 0 pages used for SPARSE memmap
> Normal zone: 983040 pages, LIFO batch:31
>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
  2006-08-31 18:40         ` Mel Gorman
@ 2006-09-01  3:08           ` Keith Mannthey
  2006-09-01  8:33             ` Mel Gorman
  2006-09-04 15:36             ` Mel Gorman
  0 siblings, 2 replies; 47+ messages in thread
From: Keith Mannthey @ 2006-09-01  3:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: akpm, tony.luck, Linux Memory Management List, ak, bob.picco,
	Linux Kernel Mailing List, linuxppc-dev

On 8/31/06, Mel Gorman <mel@csn.ul.ie> wrote:
> On Thu, 31 Aug 2006, Keith Mannthey wrote:
> > On 8/31/06, Mel Gorman <mel@skynet.ie> wrote:
> >> On (30/08/06 13:57), Keith Mannthey didst pronounce:
> >> > On 8/21/06, Mel Gorman <mel@csn.ul.ie> wrote:
> >> > >

> Can you confirm that happens by applying the patch I sent to you and
> checking the output? When the reserve fails, it should print out what
> range it actually checked. I want to be sure it's not checking the
> addresses 0->0x1070000000

See below

> >> > >@@ -329,6 +330,8 @@ acpi_numa_memory_affinity_init(struct ac
> >> > >
> >> > >        printk(KERN_INFO "SRAT: Node %u PXM %u %Lx-%Lx\n", node, pxm,
> >> > >               nd->start, nd->end);
> >> > >+       e820_register_active_regions(node, nd->start >> PAGE_SHIFT,
> >> > >+                                               nd->end >> PAGE_SHIFT);
> >> >
> >> > A node chunk in this section of code may be a hot-pluggable zone. With
> >> > MEMORY_HOTPLUG_SPARSE we don't want to register these regions.
> >> >
> >>
> >> The ranges should not get registered as active memory by
> >> e820_register_active_regions() unless they are marked E820_RAM. My
> >> understanding is that the regions for hotadd would be marked "reserved"
> >> in the e820 map. Is that wrong?
> >
> > This is wrong.  In a mult-node system that last node add area will not
> > be marked reserved by the e820.  The e820 only defines memory <
> > end_pfn. the last node add area is > end_pfn.
> >
>
> ok, that should still be fine. As long as the ranges are not marked
> "usable", add_active_range() will not be called and the holes should be
> counted correctly with the patch I sent you.
>
> > With RESERVE based add-memory you want the add-areas repored by the
> > srat to be setup during boot like all the other pages.
> >
>
> So, do you actally expect a lot of unused mem_map to be allocated with
> struct pages that are inactive until memory is hot-added in an
> x86_64-specific manner? The arch-independent stuff currently will not do
> that. It sets up memmap for where memory really exists. If that is not
> what you expect, it will hit issues at hotadd time which is not the
> current issue but one that can be fixed.

Yes. RESERVED-based is a big waste of mem_map space.  The add areas
are marked as RESERVED during boot and then later onlined during add.
It might be ok.  I will play with it tomorrow.  I might just need to
call add_active_range in the right spot :)

> >> > >        if (ma->flags.hot_pluggable && !reserve_hotadd(node, start, end)
> >> <
> >> > >        0) {
> >> > >                /* Ignore hotadd region. Undo damage */
> >> >
> >> >  I have but the e820_register_active_regions as a else to this
> >> > statment the absent pages check fails.
> >> >
> >>
> >> The patch below omits this change because I think
> >> e820_register_active_regions() will still have got called by the time
> >> you encounter a hotplug area.
> >
> > called but then removed in setup arch.
>
> By "removed", I assume you mean the active regions removed by the call
> to remove_all_active_regions() in setup_arch(). Before reserve_hotadd() is
> called, e820_register_active_regions() will have reregistered the active
> regions with the NUMA node id.

I see e820_register_active_regions is acting as a filter against the e820

> >> > Also nodes_cover_memory and alot of these check were based against
> >> > comparing the srat data against the e820.  Now all this code is
> >> > comparing SRAT against SRAT....
> >> >
> >>
> >> I don't see why. The SRAT table passes a range to
> >> e820_register_active_regions() so should be comparing SRAT to e820
> >
> > let me go off and look at e820_register_active_regions() some more.
Things get clear :)

Should be ok.

> > Sure thing.  It is just the hot-add area I am guessing it is an off by
> > one error of some sort.
> >
See below. I do my e820_register_active_area as an else to the if
(hotplug.....!reserve) and the printk is easy to sort out.

I see your pfns are in base 10.  Looks like it considers the last
address to be a present page (off-by-one thing).

Thanks,
  Keith

Output below
disabling early console
Linux version 2.6.18-rc4-mm3-smp (root@elm3a153) (gcc version 4.1.0
(SUSE Linux)) #6 SMP Thu Aug 31 22:06:00 EDT 2006
Command line: root=/dev/sda3
ip=9.47.66.153:9.47.66.169:9.47.66.1:255.255.255.0 resume=/dev/sda2
showopts earlyprintk=ttyS0,115200 console=ttyS0,115200 console=tty0
debug numa=hotadd=100
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 0000000000098400 (usable)
 BIOS-e820: 0000000000098400 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000007ff85e00 (usable)
 BIOS-e820: 000000007ff85e00 - 000000007ff98880 (ACPI data)
 BIOS-e820: 000000007ff98880 - 0000000080000000 (reserved)
 BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000470000000 (usable)
 BIOS-e820: 0000001070000000 - 0000001160000000 (usable)
Entering add_active_range(0, 0, 152) 0 entries of 3200 used
Entering add_active_range(0, 256, 524165) 1 entries of 3200 used
Entering add_active_range(0, 1048576, 4653056) 2 entries of 3200 used
Entering add_active_range(0, 17235968, 18219008) 3 entries of 3200 used
end_pfn_map = 18219008
DMI 2.3 present.
ACPI: RSDP (v000 IBM                                   ) @ 0x00000000000fdcf0
ACPI: RSDT (v001 IBM    EXA01ZEU 0x00001000 IBM  0x45444f43) @
0x000000007ff98800
ACPI: FADT (v001 IBM    EXA01ZEU 0x00001000 IBM  0x45444f43) @
0x000000007ff98780
ACPI: MADT (v001 IBM    EXA01ZEU 0x00001000 IBM  0x45444f43) @
0x000000007ff98600
ACPI: SRAT (v001 IBM    EXA01ZEU 0x00001000 IBM  0x45444f43) @
0x000000007ff983c0
ACPI: HPET (v001 IBM    EXA01ZEU 0x00001000 IBM  0x45444f43) @
0x000000007ff98380
ACPI: SSDT (v001 IBM    VIGSSDT0 0x00001000 INTL 0x20030122) @
0x000000007ff90780
ACPI: SSDT (v001 IBM    VIGSSDT1 0x00001000 INTL 0x20030122) @
0x000000007ff88bc0
ACPI: DSDT (v001 IBM    EXA01ZEU 0x00001000 INTL 0x20030122) @
0x0000000000000000
SRAT: PXM 0 -> APIC 0 -> Node 0
SRAT: PXM 0 -> APIC 1 -> Node 0
SRAT: PXM 0 -> APIC 2 -> Node 0
SRAT: PXM 0 -> APIC 3 -> Node 0
SRAT: PXM 0 -> APIC 38 -> Node 0
SRAT: PXM 0 -> APIC 39 -> Node 0
SRAT: PXM 0 -> APIC 36 -> Node 0
SRAT: PXM 0 -> APIC 37 -> Node 0
SRAT: PXM 1 -> APIC 64 -> Node 1
SRAT: PXM 1 -> APIC 65 -> Node 1
SRAT: PXM 1 -> APIC 66 -> Node 1
SRAT: PXM 1 -> APIC 67 -> Node 1
SRAT: PXM 1 -> APIC 102 -> Node 1
SRAT: PXM 1 -> APIC 103 -> Node 1
SRAT: PXM 1 -> APIC 100 -> Node 1
SRAT: PXM 1 -> APIC 101 -> Node 1
SRAT: Node 0 PXM 0 0-80000000
Entering add_active_range(0, 0, 152) 0 entries of 3200 used
Entering add_active_range(0, 256, 524165) 1 entries of 3200 used
SRAT: Node 0 PXM 0 0-470000000
Entering add_active_range(0, 0, 152) 2 entries of 3200 used
Entering add_active_range(0, 256, 524165) 2 entries of 3200 used
Entering add_active_range(0, 1048576, 4653056) 2 entries of 3200 used
SRAT: Node 0 PXM 0 0-1070000000
reserve_hotadd called with node 0 sart 470000000 end 1070000000
SRAT: Hotplug area has existing memory
Entering add_active_range(0, 0, 152) 3 entries of 3200 used
Entering add_active_range(0, 256, 524165) 3 entries of 3200 used
Entering add_active_range(0, 1048576, 4653056) 3 entries of 3200 used
SRAT: Node 1 PXM 1 1070000000-1160000000
Entering add_active_range(1, 17235968, 18219008) 3 entries of 3200 used
SRAT: Node 1 PXM 1 1070000000-3200000000
reserve_hotadd called with node 1 sart 1160000000 end 3200000000
SRAT: Hotplug area has existing memory
Entering add_active_range(1, 17235968, 18219008) 4 entries of 3200 used
NUMA: Using 28 for the hash shift.
Bootmem setup node 0 0000000000000000-0000001070000000
Bootmem setup node 1 0000001070000000-0000001160000000
Zone PFN ranges:
  DMA             0 ->     4096
  DMA32        4096 ->  1048576
  Normal    1048576 -> 18219008
early_node_map[4] active PFN ranges
    0:        0 ->      152
    0:      256 ->   524165
    0:  1048576 ->  4653056
    1: 17235968 -> 18219008
On node 0 totalpages: 4128541
0 pages used for SPARSE memmap
1149 pages DMA reserved
  DMA zone: 2843 pages, LIFO batch:0
0 pages used for SPARSE memmap
  DMA32 zone: 520069 pages, LIFO batch:31
0 pages used for SPARSE memmap
  Normal zone: 3604480 pages, LIFO batch:31
On node 1 totalpages: 983040
0 pages used for SPARSE memmap
0 pages used for SPARSE memmap
0 pages used for SPARSE memmap
  Normal zone: 983040 pages, LIFO batch:31

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
  2006-08-31 17:52       ` Keith Mannthey
@ 2006-08-31 18:40         ` Mel Gorman
  2006-09-01  3:08           ` Keith Mannthey
  0 siblings, 1 reply; 47+ messages in thread
From: Mel Gorman @ 2006-08-31 18:40 UTC (permalink / raw)
  To: Keith Mannthey
  Cc: akpm, tony.luck, Linux Memory Management List, ak, bob.picco,
	Linux Kernel Mailing List, linuxppc-dev

On Thu, 31 Aug 2006, Keith Mannthey wrote:

> On 8/31/06, Mel Gorman <mel@skynet.ie> wrote:
>> On (30/08/06 13:57), Keith Mannthey didst pronounce:
>> > On 8/21/06, Mel Gorman <mel@csn.ul.ie> wrote:
>> > >
>
>> ok, great. How much physical memory is installed on the machine? I want to
>> determine if the "usable" entries in the e820 map contain physical memory
>> or not.
>
> Usable entries in the e820 contian memory.  I have about 20-24gb
> depending on config.
>

ok, that seems to match the (usable) regions in the e820 map.

>
>> When the SRAT is bad, the information is discarded and discovered by an
>> alternative method later in the boot process.
>> 
>> In this case, numa_initmem_init() is called after acpi_numa_init(). It
>> calls acpi_scan_nodes() which returns -1 because the SRAT is bad. Once
>> that happens, either k8_scan_nodes() will be called and the regions
>> discovered there or if that is not possible, it'll fall through and
>> e820_register_active_regions will be called without any node awareness.
>
> sorry I have missed some of the logic in this patch.
>
> I see now in numa_initmem_init that if no numa setup is found it calls
> e820_register_active_regions(0, start_pfn, end_pfn) again.
>

right.

> So if the srat is discard it runs the e820 code again.
>

yes.

>
>> > >diff -rup -X /usr/src/patchset-0.6/bin//dontdiff
>> > >linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/srat.c
>> > >linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c
>> > >--- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/srat.c
>> > >2006-08-21 09:23:50.000000000 +0100
>> > >+++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c
>> > >2006-08-21 10:15:58.000000000 +0100
>> > >@@ -84,6 +84,7 @@ static __init void bad_srat(void)
>> > >                apicid_to_node[i] = NUMA_NO_NODE;
>> > >        for (i = 0; i < MAX_NUMNODES; i++)
>> > >                nodes_add[i].start = nodes[i].end = 0;
>> > >+       remove_all_active_ranges();
>> > > }
>> >
>> > We go back to setup_arch with no active areas?
>> >
>> 
>> Yes, and it'll be discovered using an alternative method later. There is
>> no point returning to setup_arch with known bad information about active
>> areas.
>
> Totally agreeded! I just didn't the the fallback path.

grand.

>> 
>> > > static __init inline int srat_disabled(void)
>> > >@@ -166,7 +167,7 @@ static int hotadd_enough_memory(struct b
>> > >
>> > >        if (mem < 0)
>> > >                return 0;
>> > >-       allowed = (end_pfn - e820_hole_size(0, end_pfn)) * PAGE_SIZE;
>> > >+       allowed = (end_pfn - absent_pages_in_range(0, end_pfn)) *
>> > >PAGE_SIZE;
>> > >        allowed = (allowed / 100) * hotadd_percent;
>> > >        if (allocated + mem > allowed) {
>> > >                unsigned long range;
>> > >@@ -238,7 +239,7 @@ static int reserve_hotadd(int node, unsi
>> > >        }
>> > >
>> > >        /* This check might be a bit too strict, but I'm keeping it for
>> > >        now. */
>> > >-       if (e820_hole_size(s_pfn, e_pfn) != e_pfn - s_pfn) {
>> > >+       if (absent_pages_in_range(s_pfn, e_pfn) != e_pfn - s_pfn) {
>> > >                printk(KERN_ERR "SRAT: Hotplug area has existing
>> > >                memory\n");
>> > >                return -1;
>> > >        }
>> > We really do want to to compare against the e820 map at it contains
>> > the memory that is really present (this info was blown away before
>> > acpi_numa)
>> 
>> The information used by absent_pages_in_range() should match what was
>> available to e820_hole_size().
>
> Is absent_pages_in_range a check against the e820 or the
> add_pages_to_range calls?
>

absent_pages_in_range() uses information provided via add_active_range() 
and on x86_64, add_active_range() is called based on information in the 
e820.
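
As a rough sketch of what that means (the early_node_map[] and
nr_nodemap_entries names come from the thread; the arithmetic here is a
simplification, not the literal mm code):

static unsigned long sketch_absent_pages_in_range(unsigned long start_pfn,
						  unsigned long end_pfn)
{
	unsigned long present = 0;
	int i;

	/* count pfns in [start_pfn, end_pfn) covered by ranges that were
	 * registered via add_active_range() */
	for (i = 0; i < nr_nodemap_entries; i++) {
		unsigned long s = max(early_node_map[i].start_pfn, start_pfn);
		unsigned long e = min(early_node_map[i].end_pfn, end_pfn);

		if (s < e)
			present += e - s;
	}

	/* everything not covered by an active range is a hole */
	return (end_pfn - start_pfn) - present;
}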

>> > Anyway I fixed up to have the current chunk added
>> > (e820_register_active_regions) after calling this code so it logicaly
>> > makes sense but it still trip over the check.
>> > I am not sure what you
>> > are printing out in you debug code but dosen't look like pfns or
>> > phys_addresses but maybe it can tell us why the check fails.
>> >
>> 
>> My debug code for add_active_range() printing out pfns but I spotted one
>> case where absent_pages_in_range(I) does not do what one would expect.
>> Lets say the ranges with physical memory was 0->1000 and 2000-3000 (in
>> pfns).  absent_pages_in_range(0, 3000) would return 1000 as you'd expect 
>> but
>> absent_pages_in_range(5000-6000) would return 0! I have a patch that might
>> fix this at the end of the mail but I'm not sure it's the problem you are
>> hitting. In the bootlog, I see;
>> 
>> SRAT: Node 0 PXM 0 0-80000000
>> Entering add_active_range(0, 0, 152) 0 entries of 3200 used
>> Entering add_active_range(0, 256, 524165) 1 entries of 3200 used
>> SRAT: Node 0 PXM 0 0-470000000
>> Entering add_active_range(0, 0, 152) 2 entries of 3200 used
>> Entering add_active_range(0, 256, 524165) 2 entries of 3200 used
>> Entering add_active_range(0, 1048576, 4653056) 2 entries of 3200 used
>> SRAT: Node 0 PXM 0 0-1070000000
>> SRAT: Hotplug area has existing memory
>> 
>
>> The last part (0-1070000000) is checked as a hotplug area but it's clear
>> that memory exists in that range.  As reserve_hotadd() requires that the
>> whole range be a hole, I'm having trouble seeing how it ever successfully
>> reserved unless the ranges going into reserve_hotadd() are something other
>> than the pfn range for 0-1070000000). The patch later will print out the
>> range used by reserve_hotadd() so we can see.
>
> No the whole node is 0-1070000000 the hot add range is 470000000-1070000000
> reserve_hotadd is called with start and end not nd->start nd->end.
> 470000000-1070000000 sould be empty.
>

Can you confirm that happens by applying the patch I sent to you and 
checking the output? When the reserve fails, it should print out what 
range it actually checked. I want to be sure it's not checking the 
addresses 0->0x1070000000

>
>> > >@@ -329,6 +330,8 @@ acpi_numa_memory_affinity_init(struct ac
>> > >
>> > >        printk(KERN_INFO "SRAT: Node %u PXM %u %Lx-%Lx\n", node, pxm,
>> > >               nd->start, nd->end);
>> > >+       e820_register_active_regions(node, nd->start >> PAGE_SHIFT,
>> > >+                                               nd->end >> PAGE_SHIFT);
>> >
>> > A node chunk in this section of code may be a hot-pluggable zone. With
>> > MEMORY_HOTPLUG_SPARSE we don't want to register these regions.
>> >
>> 
>> The ranges should not get registered as active memory by
>> e820_register_active_regions() unless they are marked E820_RAM. My
>> understanding is that the regions for hotadd would be marked "reserved"
>> in the e820 map. Is that wrong?
>
> This is wrong.  In a mult-node system that last node add area will not
> be marked reserved by the e820.  The e820 only defines memory <
> end_pfn. the last node add area is > end_pfn.
>

ok, that should still be fine. As long as the ranges are not marked 
"usable", add_active_range() will not be called and the holes should be 
counted correctly with the patch I sent you.

> With RESERVE based add-memory you want the add-areas repored by the
> srat to be setup during boot like all the other pages.
>

So, do you actually expect a lot of unused mem_map to be allocated with 
struct pages that are inactive until memory is hot-added in an 
x86_64-specific manner? The arch-independent stuff currently will not do 
that. It sets up memmap for where memory really exists. If that is not 
what you expect, it will hit issues at hotadd time which is not the 
current issue but one that can be fixed.

>> > >        if (ma->flags.hot_pluggable && !reserve_hotadd(node, start, end) 
>> <
>> > >        0) {
>> > >                /* Ignore hotadd region. Undo damage */
>> >
>> >  I have but the e820_register_active_regions as a else to this
>> > statment the absent pages check fails.
>> >
>> 
>> The patch below omits this change because I think
>> e820_register_active_regions() will still have got called by the time
>> you encounter a hotplug area.
>
> called but then removed in setup arch.

By "removed", I assume you mean the active regions removed by the call 
to remove_all_active_regions() in setup_arch(). Before reserve_hotadd() is 
called, e820_register_active_regions() will have reregistered the active 
regions with the NUMA node id.

>> > Also nodes_cover_memory and alot of these check were based against
>> > comparing the srat data against the e820.  Now all this code is
>> > comparing SRAT against SRAT....
>> >
>> 
>> I don't see why. The SRAT table passes a range to
>> e820_register_active_regions() so should be comparing SRAT to e820
>
> let me go off and look at e820_register_active_regions() some more.
>

Cool

>> > I am willing to help here but we should compare the SRAT against to
>> > e820. Table v. Table.
>> >
>> > What to you think should be done?
>> >
>> 
>> Can you read through this patch and see does it address the problem in any
>> way? If it doesn't, can you send a complete bootlog so I can see what is
>> being sent to reserve_hotadd()? Thanks
>
> Sure thing.  It is just the hot-add area I am guessing it is an off by
> one error of some sort.
>

Possible, the change to reserve_hotadd() should tell me.

> What is all this code buying us?

Less architecture-specific code across a number of architectures is the 
main one.

> Since this code dosen't appear to do
> anything to help the arch out (just increases it's vm boot code
> complexity a little) maybe insead of weaving
> e820_register_active_regions() calls throught out the boot process you
> should just waint untill things are sorted out and do a quick scan of
> node data that has been setup at the end?
>

That would defeat the purpose of sizing zones and holes in an architecture 
independent manner.

> What are the future plans for this api?
>

In the future, I will be releasing patches that set aside a zone (similar 
to the Solaris Kernel Cage) used for easily-reclaimed pages that can be 
used for growing the huge page pool at runtime (it comes under the heading 
of anti-fragmentation work). The same zone could also be used to give 
memory hot-remove a better success rate than 0%. These patches make the 
creation of the zone relatively trivial. Without them, the 
architecture-specific code is really hairy.

Other possibilities include handling the mem= boot parameter 
in an architecture-independent manner. My understanding is that some NUMA 
architectures get the handling of that argument wrong.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
  2006-08-31 15:49     ` Mel Gorman
  2006-08-31 16:25       ` Mika Penttilä
@ 2006-08-31 17:52       ` Keith Mannthey
  2006-08-31 18:40         ` Mel Gorman
  1 sibling, 1 reply; 47+ messages in thread
From: Keith Mannthey @ 2006-08-31 17:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: akpm, tony.luck, linux-mm, ak, bob.picco, linux-kernel, linuxppc-dev

On 8/31/06, Mel Gorman <mel@skynet.ie> wrote:
> On (30/08/06 13:57), Keith Mannthey didst pronounce:
> > On 8/21/06, Mel Gorman <mel@csn.ul.ie> wrote:
> > >

> ok, great. How much physical memory is installed on the machine? I want to
> determine if the "usable" entries in the e820 map contain physical memory
> or not.
Usable entries in the e820 contain memory.  I have about 20-24GB
depending on config.


> When the SRAT is bad, the information is discarded and discovered by an
> alternative method later in the boot process.
>
> In this case, numa_initmem_init() is called after acpi_numa_init(). It
> calls acpi_scan_nodes() which returns -1 because the SRAT is bad. Once
> that happens, either k8_scan_nodes() will be called and the regions
> discovered there or if that is not possible, it'll fall through and
> e820_register_active_regions will be called without any node awareness.

sorry I have missed some of the logic in this patch.

I see now in numa_initmem_init that if no numa setup is found it calls
e820_register_active_regions(0, start_pfn, end_pfn) again.

So if the SRAT is discarded it runs the e820 code again.


> > >diff -rup -X /usr/src/patchset-0.6/bin//dontdiff
> > >linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/srat.c
> > >linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c
> > >--- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/srat.c
> > >2006-08-21 09:23:50.000000000 +0100
> > >+++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c
> > >2006-08-21 10:15:58.000000000 +0100
> > >@@ -84,6 +84,7 @@ static __init void bad_srat(void)
> > >                apicid_to_node[i] = NUMA_NO_NODE;
> > >        for (i = 0; i < MAX_NUMNODES; i++)
> > >                nodes_add[i].start = nodes[i].end = 0;
> > >+       remove_all_active_ranges();
> > > }
> >
> > We go back to setup_arch with no active areas?
> >
>
> Yes, and it'll be discovered using an alternative method later. There is
> no point returning to setup_arch with known bad information about active
> areas.

Totally agreed! I just didn't see the fallback path.
>
> > > static __init inline int srat_disabled(void)
> > >@@ -166,7 +167,7 @@ static int hotadd_enough_memory(struct b
> > >
> > >        if (mem < 0)
> > >                return 0;
> > >-       allowed = (end_pfn - e820_hole_size(0, end_pfn)) * PAGE_SIZE;
> > >+       allowed = (end_pfn - absent_pages_in_range(0, end_pfn)) *
> > >PAGE_SIZE;
> > >        allowed = (allowed / 100) * hotadd_percent;
> > >        if (allocated + mem > allowed) {
> > >                unsigned long range;
> > >@@ -238,7 +239,7 @@ static int reserve_hotadd(int node, unsi
> > >        }
> > >
> > >        /* This check might be a bit too strict, but I'm keeping it for
> > >        now. */
> > >-       if (e820_hole_size(s_pfn, e_pfn) != e_pfn - s_pfn) {
> > >+       if (absent_pages_in_range(s_pfn, e_pfn) != e_pfn - s_pfn) {
> > >                printk(KERN_ERR "SRAT: Hotplug area has existing
> > >                memory\n");
> > >                return -1;
> > >        }
> > We really do want to to compare against the e820 map at it contains
> > the memory that is really present (this info was blown away before
> > acpi_numa)
>
> The information used by absent_pages_in_range() should match what was
> available to e820_hole_size().

Is absent_pages_in_range a check against the e820 or the
add_pages_to_range calls?

> > Anyway I fixed up to have the current chunk added
> > (e820_register_active_regions) after calling this code so it logicaly
> > makes sense but it still trip over the check.
> > I am not sure what you
> > are printing out in you debug code but dosen't look like pfns or
> > phys_addresses but maybe it can tell us why the check fails.
> >
>
> My debug code for add_active_range() printing out pfns but I spotted one
> case where absent_pages_in_range(I) does not do what one would expect.
> Lets say the ranges with physical memory was 0->1000 and 2000-3000 (in
> pfns).  absent_pages_in_range(0, 3000) would return 1000 as you'd expect but
> absent_pages_in_range(5000-6000) would return 0! I have a patch that might
> fix this at the end of the mail but I'm not sure it's the problem you are
> hitting. In the bootlog, I see;
>
> SRAT: Node 0 PXM 0 0-80000000
> Entering add_active_range(0, 0, 152) 0 entries of 3200 used
> Entering add_active_range(0, 256, 524165) 1 entries of 3200 used
> SRAT: Node 0 PXM 0 0-470000000
> Entering add_active_range(0, 0, 152) 2 entries of 3200 used
> Entering add_active_range(0, 256, 524165) 2 entries of 3200 used
> Entering add_active_range(0, 1048576, 4653056) 2 entries of 3200 used
> SRAT: Node 0 PXM 0 0-1070000000
> SRAT: Hotplug area has existing memory
>

> The last part (0-1070000000) is checked as a hotplug area but it's clear
> that memory exists in that range.  As reserve_hotadd() requires that the
> whole range be a hole, I'm having trouble seeing how it ever successfully
> reserved unless the ranges going into reserve_hotadd() are something other
> than the pfn range for 0-1070000000). The patch later will print out the
> range used by reserve_hotadd() so we can see.

No, the whole node is 0-1070000000; the hot-add range is 470000000-1070000000.
reserve_hotadd is called with start and end, not nd->start and nd->end.
470000000-1070000000 should be empty.


> > >@@ -329,6 +330,8 @@ acpi_numa_memory_affinity_init(struct ac
> > >
> > >        printk(KERN_INFO "SRAT: Node %u PXM %u %Lx-%Lx\n", node, pxm,
> > >               nd->start, nd->end);
> > >+       e820_register_active_regions(node, nd->start >> PAGE_SHIFT,
> > >+                                               nd->end >> PAGE_SHIFT);
> >
> > A node chunk in this section of code may be a hot-pluggable zone. With
> > MEMORY_HOTPLUG_SPARSE we don't want to register these regions.
> >
>
> The ranges should not get registered as active memory by
> e820_register_active_regions() unless they are marked E820_RAM. My
> understanding is that the regions for hotadd would be marked "reserved"
> in the e820 map. Is that wrong?

This is wrong.  In a multi-node system the last node's add area will not
be marked reserved by the e820.  The e820 only defines memory <
end_pfn; the last node's add area is > end_pfn.

With RESERVE-based add-memory you want the add areas reported by the
SRAT to be set up during boot like all the other pages.

> > >        if (ma->flags.hot_pluggable && !reserve_hotadd(node, start, end) <
> > >        0) {
> > >                /* Ignore hotadd region. Undo damage */
> >
> >  I have but the e820_register_active_regions as a else to this
> > statment the absent pages check fails.
> >
>
> The patch below omits this change because I think
> e820_register_active_regions() will still have got called by the time
> you encounter a hotplug area.

Called, but then removed in setup_arch.
> > Also nodes_cover_memory and alot of these check were based against
> > comparing the srat data against the e820.  Now all this code is
> > comparing SRAT against SRAT....
> >
>
> I don't see why. The SRAT table passes a range to
> e820_register_active_regions() so should be comparing SRAT to e820

let me go off and look at e820_register_active_regions() some more.

> > I am willing to help here but we should compare the SRAT against to
> > e820. Table v. Table.
> >
> > What to you think should be done?
> >
>
> Can you read through this patch and see does it address the problem in any
> way? If it doesn't, can you send a complete bootlog so I can see what is
> being sent to reserve_hotadd()? Thanks

Sure thing.  It is just the hot-add area; I am guessing it is an off-by-one
error of some sort.

What is all this code buying us?  Since this code doesn't appear to do
anything to help the arch out (it just increases its vm boot code
complexity a little), maybe instead of weaving
e820_register_active_regions() calls throughout the boot process you
should just wait until things are sorted out and do a quick scan of
node data that has been set up at the end?

What are the future plans for this api?

Thanks,
  Keith

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
  2006-08-31 17:01         ` Mel Gorman
@ 2006-08-31 17:40           ` Mika Penttilä
  0 siblings, 0 replies; 47+ messages in thread
From: Mika Penttilä @ 2006-08-31 17:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Keith Mannthey, akpm, tony.luck, Linux Memory Management List,
	ak, bob.picco, Linux Kernel Mailing List

Mel Gorman wrote:
> On Thu, 31 Aug 2006, Mika Penttilä wrote:
>
>>
>>>>> static __init inline int srat_disabled(void)
>>>>> @@ -166,7 +167,7 @@ static int hotadd_enough_memory(struct b
>>>>>
>>>>>        if (mem < 0)
>>>>>                return 0;
>>>>> -       allowed = (end_pfn - e820_hole_size(0, end_pfn)) * PAGE_SIZE;
>>>>> +       allowed = (end_pfn - absent_pages_in_range(0, end_pfn)) * 
>>>>> PAGE_SIZE;
>>>>>        allowed = (allowed / 100) * hotadd_percent;
>>>>>        if (allocated + mem > allowed) {
>>>>>                unsigned long range;
>>>>> @@ -238,7 +239,7 @@ static int reserve_hotadd(int node, unsi
>>>>>        }
>>>>>
>>>>>        /* This check might be a bit too strict, but I'm keeping it 
>>>>> for now. */
>>>>> -       if (e820_hole_size(s_pfn, e_pfn) != e_pfn - s_pfn) {
>>>>> +       if (absent_pages_in_range(s_pfn, e_pfn) != e_pfn - s_pfn) {
>>>>>                printk(KERN_ERR "SRAT: Hotplug area has existing 
>>>>> memory\n");
>>>>>                return -1;
>>>>>        }
>>>>>
>>>> We really do want to to compare against the e820 map at it contains
>>>> the memory that is really present (this info was blown away before
>>>> acpi_numa) 
>>>
>>> The information used by absent_pages_in_range() should match what was
>>> available to e820_hole_size().
>>>
>>>
>> But it doesn't : all active ranges are removed before parsing srat. I 
>> think we really need to check against e820 here.
>>
>
> What I see happening is this;
>
> 1. setup_arch calls e820_register_active_regions(0, 0, -1UL) so that all
>    regions are registered as if they were on node 0 so e820_end_of_ram()
>    gets the right value
> 2. remove_all_active_regions() is called to clear what was registered so
>    that rediscovery with NUMA awareness happens
> 3. acpi_numa_init() is called. It parses the table and a little later
>    calls acpi_numa_memory_affinity_init() for each range in the table so
>    now we're into x86_64 code
> 4. acpi_numa_memory_affinity_init() basically deals an address range.
>    Assuming the SRAT table is not broken, it calls
>    e820_register_active_ranges() for that range. At this point, for the
>    range of addresses, the active ranges are now registered
> 5. reserve_hotadd is called if the range is hotpluggable. It will fail if
>    it finds that memory already exists there
>
> So, when absent_pages_in_range() is being called by reserve_hotadd(), 
> it should be using the same information that was available in e820. 
> What am I missing?
>
Ok, right, I missed the e820_register_active_ranges() in 
acpi_numa_memory_affinity_init() before the reserve_hotadd stuff. So 
logically it should be working, modulo bugs.

Argh, I just looked through the reserve hotadd code and 
hotadd_enough_memory() still looks broken. And why are we doing 
reserve_bootmem_node()? The regions aren't present RAM anyway.

--Mika


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
  2006-08-31 16:25       ` Mika Penttilä
@ 2006-08-31 17:01         ` Mel Gorman
  2006-08-31 17:40           ` Mika Penttilä
  0 siblings, 1 reply; 47+ messages in thread
From: Mel Gorman @ 2006-08-31 17:01 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: Keith Mannthey, akpm, tony.luck, Linux Memory Management List,
	ak, bob.picco, Linux Kernel Mailing List

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2637 bytes --]

On Thu, 31 Aug 2006, Mika Penttilä wrote:

>
>>>> static __init inline int srat_disabled(void)
>>>> @@ -166,7 +167,7 @@ static int hotadd_enough_memory(struct b
>>>>
>>>>        if (mem < 0)
>>>>                return 0;
>>>> -       allowed = (end_pfn - e820_hole_size(0, end_pfn)) * PAGE_SIZE;
>>>> +       allowed = (end_pfn - absent_pages_in_range(0, end_pfn)) * 
>>>> PAGE_SIZE;
>>>>        allowed = (allowed / 100) * hotadd_percent;
>>>>        if (allocated + mem > allowed) {
>>>>                unsigned long range;
>>>> @@ -238,7 +239,7 @@ static int reserve_hotadd(int node, unsi
>>>>        }
>>>>
>>>>        /* This check might be a bit too strict, but I'm keeping it for 
>>>> now. */
>>>> -       if (e820_hole_size(s_pfn, e_pfn) != e_pfn - s_pfn) {
>>>> +       if (absent_pages_in_range(s_pfn, e_pfn) != e_pfn - s_pfn) {
>>>>                printk(KERN_ERR "SRAT: Hotplug area has existing 
>>>> memory\n");
>>>>                return -1;
>>>>        }
>>>> 
>>> We really do want to to compare against the e820 map at it contains
>>> the memory that is really present (this info was blown away before
>>> acpi_numa) 
>> 
>> The information used by absent_pages_in_range() should match what was
>> available to e820_hole_size().
>>
>> 
> But it doesn't : all active ranges are removed before parsing srat. I think 
> we really need to check against e820 here.
>

What I see happening is this;

1. setup_arch calls e820_register_active_regions(0, 0, -1UL) so that all
    regions are registered as if they were on node 0 so e820_end_of_ram()
    gets the right value
2. remove_all_active_regions() is called to clear what was registered so
    that rediscovery with NUMA awareness happens
3. acpi_numa_init() is called. It parses the table and a little later
    calls acpi_numa_memory_affinity_init() for each range in the table so
    now we're into x86_64 code
4. acpi_numa_memory_affinity_init() basically deals with an address range.
    Assuming the SRAT table is not broken, it calls
    e820_register_active_ranges() for that range. At this point, for the
    range of addresses, the active ranges are now registered
5. reserve_hotadd is called if the range is hotpluggable. It will fail if
    it finds that memory already exists there

So, when absent_pages_in_range() is being called by reserve_hotadd(), it 
should be using the same information that was available in e820. What am I 
missing?
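
In code form, the sequence above is roughly the following (a condensed
sketch with simplified prototypes and no error handling, not the literal
setup_arch()/srat.c code):

/* steps 1-3: setup_arch() on x86_64 */
static void sketch_numa_boot(void)
{
	e820_register_active_regions(0, 0, -1UL);   /* placeholder: all on node 0 */
	end_pfn = e820_end_of_ram();
	remove_all_active_ranges();                 /* drop the placeholder */
	acpi_numa_init();                           /* walks the SRAT */
}

/* steps 4-5: per SRAT memory affinity entry, x86_64 srat.c */
static void sketch_affinity_entry(int node, u64 start, u64 end, int hotplug)
{
	e820_register_active_regions(node, start >> PAGE_SHIFT,
				     end >> PAGE_SHIFT);
	if (hotplug && reserve_hotadd(node, start, end) < 0) {
		/* hotadd range is not a hole: ignore it and undo */
	}
}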

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
  2006-08-31 15:49     ` Mel Gorman
@ 2006-08-31 16:25       ` Mika Penttilä
  2006-08-31 17:01         ` Mel Gorman
  2006-08-31 17:52       ` Keith Mannthey
  1 sibling, 1 reply; 47+ messages in thread
From: Mika Penttilä @ 2006-08-31 16:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Keith Mannthey, akpm, tony.luck, linux-mm, ak, bob.picco, linux-kernel


>>> static __init inline int srat_disabled(void)
>>> @@ -166,7 +167,7 @@ static int hotadd_enough_memory(struct b
>>>
>>>        if (mem < 0)
>>>                return 0;
>>> -       allowed = (end_pfn - e820_hole_size(0, end_pfn)) * PAGE_SIZE;
>>> +       allowed = (end_pfn - absent_pages_in_range(0, end_pfn)) * 
>>> PAGE_SIZE;
>>>        allowed = (allowed / 100) * hotadd_percent;
>>>        if (allocated + mem > allowed) {
>>>                unsigned long range;
>>> @@ -238,7 +239,7 @@ static int reserve_hotadd(int node, unsi
>>>        }
>>>
>>>        /* This check might be a bit too strict, but I'm keeping it for 
>>>        now. */
>>> -       if (e820_hole_size(s_pfn, e_pfn) != e_pfn - s_pfn) {
>>> +       if (absent_pages_in_range(s_pfn, e_pfn) != e_pfn - s_pfn) {
>>>                printk(KERN_ERR "SRAT: Hotplug area has existing 
>>>                memory\n");
>>>                return -1;
>>>        }
>>>       
>> We really do want to to compare against the e820 map at it contains
>> the memory that is really present (this info was blown away before
>> acpi_numa) 
>>     
>
> The information used by absent_pages_in_range() should match what was
> available to e820_hole_size().
>
>   
But it doesn't: all active ranges are removed before parsing the SRAT. I 
think we really need to check against the e820 here.

--Mika


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
  2006-08-30 20:57   ` Keith Mannthey
@ 2006-08-31 15:49     ` Mel Gorman
  2006-08-31 16:25       ` Mika Penttilä
  2006-08-31 17:52       ` Keith Mannthey
  0 siblings, 2 replies; 47+ messages in thread
From: Mel Gorman @ 2006-08-31 15:49 UTC (permalink / raw)
  To: Keith Mannthey
  Cc: akpm, tony.luck, linux-mm, ak, bob.picco, linux-kernel, linuxppc-dev

On (30/08/06 13:57), Keith Mannthey didst pronounce:
> On 8/21/06, Mel Gorman <mel@csn.ul.ie> wrote:
> >
> >Size zones and holes in an architecture independent manner for x86_64.
> 
> 
> Hey Mel,

Hi Keith.

>  I am having some trouble with the srat.c changes.  I keep running into
> "SRAT: Hotplug area has existing memory" so am am taking more throught
> look at this patch.
>  I am working on 2.6.18-rc4-mm3 x86_64.
> 

ok, great. How much physical memory is installed on the machine? I want to
determine if the "usable" entries in the e820 map contain physical memory
or not.

>   srat.c is doing some sanity checking against the e820 and hot-add
> memory ranges.  BIOS folk aren't to be trusted with the SRAT.  Calling
> remove_all_active_ranges before acpi_numa_init leaves nothing to fall
> back onto if the SRAT is bad.  (see bad_srat()). What should happen
> when we discard the srat info?
> 

When the SRAT is bad, the information is discarded and discovered by an
alternative method later in the boot process.

In this case, numa_initmem_init() is called after acpi_numa_init(). It
calls acpi_scan_nodes() which returns -1 because the SRAT is bad. Once
that happens, either k8_scan_nodes() will be called and the regions
discovered there or if that is not possible, it'll fall through and
e820_register_active_regions will be called without any node awareness.
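
A rough sketch of that fallback order (assumed, simplified prototypes; the
real numa_initmem_init() differs in detail):

static void sketch_numa_initmem_init(unsigned long start_pfn,
				     unsigned long end_pfn)
{
	/* SRAT was usable: acpi_scan_nodes() registered the nodes */
	if (!acpi_scan_nodes(start_pfn << PAGE_SHIFT, end_pfn << PAGE_SHIFT))
		return;

	/* otherwise try the K8 northbridge information */
	if (!k8_scan_nodes(start_pfn << PAGE_SHIFT, end_pfn << PAGE_SHIFT))
		return;

	/* no NUMA information at all: register everything on node 0 */
	e820_register_active_regions(0, start_pfn, end_pfn);
}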

>  i386 code may have similar fallback logic (haven't been there in a while)
> 

There is fallback logic in the i386 code as well.

> also
> 
> >diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
> >linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/srat.c 
> >linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c
> >--- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/srat.c   
> >2006-08-21 09:23:50.000000000 +0100
> >+++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c   
> >2006-08-21 10:15:58.000000000 +0100
> >@@ -84,6 +84,7 @@ static __init void bad_srat(void)
> >                apicid_to_node[i] = NUMA_NO_NODE;
> >        for (i = 0; i < MAX_NUMNODES; i++)
> >                nodes_add[i].start = nodes[i].end = 0;
> >+       remove_all_active_ranges();
> > }
> 
> We go back to setup_arch with no active areas?
> 

Yes, and it'll be discovered using an alternative method later. There is
no point returning to setup_arch with known bad information about active
areas.

> > static __init inline int srat_disabled(void)
> >@@ -166,7 +167,7 @@ static int hotadd_enough_memory(struct b
> >
> >        if (mem < 0)
> >                return 0;
> >-       allowed = (end_pfn - e820_hole_size(0, end_pfn)) * PAGE_SIZE;
> >+       allowed = (end_pfn - absent_pages_in_range(0, end_pfn)) * 
> >PAGE_SIZE;
> >        allowed = (allowed / 100) * hotadd_percent;
> >        if (allocated + mem > allowed) {
> >                unsigned long range;
> >@@ -238,7 +239,7 @@ static int reserve_hotadd(int node, unsi
> >        }
> >
> >        /* This check might be a bit too strict, but I'm keeping it for 
> >        now. */
> >-       if (e820_hole_size(s_pfn, e_pfn) != e_pfn - s_pfn) {
> >+       if (absent_pages_in_range(s_pfn, e_pfn) != e_pfn - s_pfn) {
> >                printk(KERN_ERR "SRAT: Hotplug area has existing 
> >                memory\n");
> >                return -1;
> >        }
> We really do want to compare against the e820 map as it contains
> the memory that is really present (this info was blown away before
> acpi_numa)

The information used by absent_pages_in_range() should match what was
available to e820_hole_size().

> Anyway I fixed it up to have the current chunk added
> (e820_register_active_regions) after calling this code so it logically
> makes sense, but it still trips over the check.
> I am not sure what you
> are printing out in your debug code but it doesn't look like pfns or
> phys_addresses; maybe it can tell us why the check fails.
> 

My debug code for add_active_range() prints out pfns, but I spotted one
case where absent_pages_in_range() does not do what one would expect.
Let's say the ranges with physical memory were 0->1000 and 2000->3000 (in
pfns). absent_pages_in_range(0, 3000) would return 1000 as you'd expect, but
absent_pages_in_range(5000, 6000) would return 0! I have a patch that might
fix this at the end of the mail but I'm not sure it's the problem you are
hitting. In the bootlog, I see:

SRAT: Node 0 PXM 0 0-80000000
Entering add_active_range(0, 0, 152) 0 entries of 3200 used
Entering add_active_range(0, 256, 524165) 1 entries of 3200 used
SRAT: Node 0 PXM 0 0-470000000
Entering add_active_range(0, 0, 152) 2 entries of 3200 used
Entering add_active_range(0, 256, 524165) 2 entries of 3200 used
Entering add_active_range(0, 1048576, 4653056) 2 entries of 3200 used
SRAT: Node 0 PXM 0 0-1070000000
SRAT: Hotplug area has existing memory

The last part (0-1070000000) is checked as a hotplug area but it's clear
that memory exists in that range.  As reserve_hotadd() requires that the
whole range be a hole, I'm having trouble seeing how it ever successfully
reserves anything unless the ranges going into reserve_hotadd() are something
other than the pfn range for 0-1070000000. The patch below will print out the
range used by reserve_hotadd() so we can see.
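
To make the absent_pages_in_range() corner case above concrete, here is a
stand-alone toy model of the range walk (not the kernel function, just the
same idea with made-up names), using the example map of 0->1000 and 2000->3000:

#include <stdio.h>

struct active_range { unsigned long start_pfn, end_pfn; };

/* Example from above: physical memory at pfns 0->1000 and 2000->3000 */
static struct active_range map[] = { { 0, 1000 }, { 2000, 3000 } };
#define NR_RANGES (sizeof(map) / sizeof(map[0]))

static unsigned long absent_pages(unsigned long start, unsigned long end, int fixed)
{
        unsigned long prev_end = map[0].start_pfn;
        unsigned long holes = 0;
        unsigned int i;

        /* Hole before the first active range, if any */
        if (map[0].start_pfn > start)
                holes = map[0].start_pfn - start;

        for (i = 0; i < NR_RANGES; i++) {
                unsigned long range_start;

                if (prev_end >= end)
                        break;

                /* Clamp the gap between the previous range and this one
                 * to the query window before counting it */
                range_start = map[i].start_pfn < end ? map[i].start_pfn : end;
                if (prev_end < start)
                        prev_end = start;
                if (range_start > start)
                        holes += range_start - prev_end;

                prev_end = map[i].end_pfn;
        }

        /* The check the patch at the end of the mail adds: a query that
         * starts beyond the last active range is all hole */
        if (fixed && start > prev_end)
                holes = end - start;

        return holes;
}

int main(void)
{
        printf("absent(0, 3000)    = %lu\n", absent_pages(0, 3000, 0));
        printf("absent(5000, 6000) = %lu without the fix\n", absent_pages(5000, 6000, 0));
        printf("absent(5000, 6000) = %lu with the fix\n", absent_pages(5000, 6000, 1));
        return 0;
}

This prints 1000, 0 and 1000 respectively: a query entirely beyond the last
registered range is reported as having no hole unless the extra check is added.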

> >@@ -329,6 +330,8 @@ acpi_numa_memory_affinity_init(struct ac
> >
> >        printk(KERN_INFO "SRAT: Node %u PXM %u %Lx-%Lx\n", node, pxm,
> >               nd->start, nd->end);
> >+       e820_register_active_regions(node, nd->start >> PAGE_SHIFT,
> >+                                               nd->end >> PAGE_SHIFT);
> 
> A node chunk in this section of code may be a hot-pluggable zone. With
> MEMORY_HOTPLUG_SPARSE we don't want to register these regions.
> 

The ranges should not get registered as active memory by
e820_register_active_regions() unless they are marked E820_RAM. My
understanding is that the regions for hotadd would be marked "reserved"
in the e820 map. Is that wrong?
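
As a toy illustration of what I mean (a stand-alone model with a made-up
e820 map, not the kernel function in the patch), only E820_RAM entries that
overlap the requested pfn window would ever reach add_active_range(), so a
"reserved" hotadd window is skipped:

#include <stdio.h>

#define PAGE_SHIFT 12
enum e820_type { E820_RAM = 1, E820_RESERVED = 2 };

struct e820entry { unsigned long long addr, size; enum e820_type type; };

/* Hypothetical map: ~2GB of usable RAM followed by a reserved hotadd window */
static struct e820entry map[] = {
        { 0x0000000000100000ULL, 0x000000007ff00000ULL, E820_RAM },
        { 0x0000000080000000ULL, 0x0000000ff0000000ULL, E820_RESERVED },
};

static void register_active(int nid, unsigned long start_pfn, unsigned long end_pfn)
{
        unsigned int i;

        for (i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
                unsigned long s = map[i].addr >> PAGE_SHIFT;
                unsigned long e = (map[i].addr + map[i].size) >> PAGE_SHIFT;

                /* Skip anything that is not usable RAM or is outside the node */
                if (map[i].type != E820_RAM || e <= start_pfn || s >= end_pfn)
                        continue;

                if (s < start_pfn)
                        s = start_pfn;
                if (e > end_pfn)
                        e = end_pfn;
                printf("node %d: add_active_range(%lu, %lu)\n", nid, s, e);
        }
}

int main(void)
{
        /* Only the E820_RAM entry (pfns 256->524288) gets registered */
        register_active(0, 0, ~0UL);
        return 0;
}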

> >        if (ma->flags.hot_pluggable && !reserve_hotadd(node, start, end) < 
> >        0) {
> >                /* Ignore hotadd region. Undo damage */
> 
>  I have put the e820_register_active_regions as an else to this
> statement; the absent pages check fails.
> 

The patch below omits this change because I think
e820_register_active_regions() will still have been called by the time
you encounter a hotplug area.

> Also, nodes_cover_memory and a lot of these checks were based on
> comparing the SRAT data against the e820.  Now all this code is
> comparing SRAT against SRAT...
> 

I don't see why. The SRAT code passes a range to
e820_register_active_regions(), so it should be comparing SRAT to e820.

> I am willing to help here but we should compare the SRAT against the
> e820. Table v. Table.
> 
> What do you think should be done?
> 

Can you read through this patch and see whether it addresses the problem in
any way? If it doesn't, can you send a complete bootlog so I can see what is
being sent to reserve_hotadd()? Thanks.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm3-clean/arch/x86_64/mm/srat.c linux-2.6.18-rc4-mm3-fix_x8664_hotadd/arch/x86_64/mm/srat.c
--- linux-2.6.18-rc4-mm3-clean/arch/x86_64/mm/srat.c	2006-08-29 16:25:10.000000000 +0100
+++ linux-2.6.18-rc4-mm3-fix_x8664_hotadd/arch/x86_64/mm/srat.c	2006-08-31 16:17:26.000000000 +0100
@@ -240,7 +240,8 @@ static int reserve_hotadd(int node, unsi
 
 	/* This check might be a bit too strict, but I'm keeping it for now. */
 	if (absent_pages_in_range(s_pfn, e_pfn) != e_pfn - s_pfn) {
-		printk(KERN_ERR "SRAT: Hotplug area has existing memory\n");
+		printk(KERN_ERR "SRAT: Hotplug area %lu -> %lu has existing memory\n",
+				s_pfn, e_pfn);
 		return -1;
 	}
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm3-clean/mm/page_alloc.c linux-2.6.18-rc4-mm3-fix_x8664_hotadd/mm/page_alloc.c
--- linux-2.6.18-rc4-mm3-clean/mm/page_alloc.c	2006-08-29 16:25:31.000000000 +0100
+++ linux-2.6.18-rc4-mm3-fix_x8664_hotadd/mm/page_alloc.c	2006-08-31 14:52:38.000000000 +0100
@@ -2280,6 +2280,10 @@ unsigned long __init __absent_pages_in_r
 		prev_end_pfn = early_node_map[i].end_pfn;
 	}
 
+	/* If the range is outside of physical memory, return the range */
+	if (range_start_pfn > prev_end_pfn)
+		hole_pages = range_end_pfn - range_start_pfn;
+
 	return hole_pages;
 }
 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
  2006-08-21 13:46 ` [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes Mel Gorman
@ 2006-08-30 20:57   ` Keith Mannthey
  2006-08-31 15:49     ` Mel Gorman
  0 siblings, 1 reply; 47+ messages in thread
From: Keith Mannthey @ 2006-08-30 20:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: akpm, tony.luck, linux-mm, ak, bob.picco, linux-kernel, linuxppc-dev

On 8/21/06, Mel Gorman <mel@csn.ul.ie> wrote:
>
> Size zones and holes in an architecture independent manner for x86_64.


Hey Mel,
  I am having some trouble with the srat.c changes.  I keep running into
"SRAT: Hotplug area has existing memory" so I am taking a more thorough
look at this patch.
  I am working on 2.6.18-rc4-mm3 x86_64.

   srat.c is doing some sanity checking against the e820 and hot-add
memory ranges.  BIOS folk aren't to be trusted with the SRAT.  Calling
 remove_all_active_ranges before acpi_numa_init leaves nothing to fall
back onto if the SRAT is bad.  (see bad_srat()). What should happen
when we discard the srat info?

  i386 code may have similar fallback logic (haven't been there in a while)

also

> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/srat.c linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c
> --- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/srat.c   2006-08-21 09:23:50.000000000 +0100
> +++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c        2006-08-21 10:15:58.000000000 +0100
> @@ -84,6 +84,7 @@ static __init void bad_srat(void)
>                 apicid_to_node[i] = NUMA_NO_NODE;
>         for (i = 0; i < MAX_NUMNODES; i++)
>                 nodes_add[i].start = nodes[i].end = 0;
> +       remove_all_active_ranges();
>  }

We go back to setup_arch with no active areas?

>  static __init inline int srat_disabled(void)
> @@ -166,7 +167,7 @@ static int hotadd_enough_memory(struct b
>
>         if (mem < 0)
>                 return 0;
> -       allowed = (end_pfn - e820_hole_size(0, end_pfn)) * PAGE_SIZE;
> +       allowed = (end_pfn - absent_pages_in_range(0, end_pfn)) * PAGE_SIZE;
>         allowed = (allowed / 100) * hotadd_percent;
>         if (allocated + mem > allowed) {
>                 unsigned long range;
> @@ -238,7 +239,7 @@ static int reserve_hotadd(int node, unsi
>         }
>
>         /* This check might be a bit too strict, but I'm keeping it for now. */
> -       if (e820_hole_size(s_pfn, e_pfn) != e_pfn - s_pfn) {
> +       if (absent_pages_in_range(s_pfn, e_pfn) != e_pfn - s_pfn) {
>                 printk(KERN_ERR "SRAT: Hotplug area has existing memory\n");
>                 return -1;
>         }
We really do want to compare against the e820 map as it contains
the memory that is really present (this info was blown away before
acpi_numa).  Anyway I fixed it up to have the current chunk added
(e820_register_active_regions) after calling this code so it logically
makes sense, but it still trips over the check.   I am not sure what you
are printing out in your debug code but it doesn't look like pfns or
phys_addresses; maybe it can tell us why the check fails.

> @@ -329,6 +330,8 @@ acpi_numa_memory_affinity_init(struct ac
>
>         printk(KERN_INFO "SRAT: Node %u PXM %u %Lx-%Lx\n", node, pxm,
>                nd->start, nd->end);
> +       e820_register_active_regions(node, nd->start >> PAGE_SHIFT,
> +                                               nd->end >> PAGE_SHIFT);

A node chunk in this section of code may be a hot-pluggable zone. With
MEMORY_HOTPLUG_SPARSE we don't want to register these regions.

>         if (ma->flags.hot_pluggable && !reserve_hotadd(node, start, end) < 0) {
>                 /* Ignore hotadd region. Undo damage */

  I have put the e820_register_active_regions as an else to this
statement; the absent pages check fails.

Also, nodes_cover_memory and a lot of these checks were based on
comparing the SRAT data against the e820.  Now all this code is
comparing SRAT against SRAT...

I am willing to help here but we should compare the SRAT against the
e820. Table v. Table.

What do you think should be done?

Thanks,
  Keith

Linux version 2.6.18-rc4-mm3-smp (root@elm3a153) (gcc version 4.1.0
(SUSE Linux)) #3 SMP Wed Aug 30 15:17:13 EDT 2006
Command line: root=/dev/sda3
ip=9.47.66.153:9.47.66.169:9.47.66.1:255.255.255.0 resume=/dev/sda2
showopts earlyprintk=ttyS0,115200 console=ttyS0,115200 console=tty0
debug numa=hotadd=100
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 0000000000098400 (usable)
 BIOS-e820: 0000000000098400 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000007ff85e00 (usable)
 BIOS-e820: 000000007ff85e00 - 000000007ff98880 (ACPI data)
 BIOS-e820: 000000007ff98880 - 0000000080000000 (reserved)
 BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000470000000 (usable)
 BIOS-e820: 0000001070000000 - 0000001160000000 (usable)
Entering add_active_range(0, 0, 152) 0 entries of 3200 used
Entering add_active_range(0, 256, 524165) 1 entries of 3200 used
Entering add_active_range(0, 1048576, 4653056) 2 entries of 3200 used
Entering add_active_range(0, 17235968, 18219008) 3 entries of 3200 used
end_pfn_map = 18219008
DMI 2.3 present.
ACPI: RSDP (v000 IBM                                   ) @ 0x00000000000fdcf0
ACPI: RSDT (v001 IBM    EXA01ZEU 0x00001000 IBM  0x45444f43) @
0x000000007ff98800
ACPI: FADT (v001 IBM    EXA01ZEU 0x00001000 IBM  0x45444f43) @
0x000000007ff98780
ACPI: MADT (v001 IBM    EXA01ZEU 0x00001000 IBM  0x45444f43) @
0x000000007ff98600
ACPI: SRAT (v001 IBM    EXA01ZEU 0x00001000 IBM  0x45444f43) @
0x000000007ff983c0
ACPI: HPET (v001 IBM    EXA01ZEU 0x00001000 IBM  0x45444f43) @
0x000000007ff98380
ACPI: SSDT (v001 IBM    VIGSSDT0 0x00001000 INTL 0x20030122) @
0x000000007ff90780
ACPI: SSDT (v001 IBM    VIGSSDT1 0x00001000 INTL 0x20030122) @
0x000000007ff88bc0
ACPI: DSDT (v001 IBM    EXA01ZEU 0x00001000 INTL 0x20030122) @
0x0000000000000000
SRAT: PXM 0 -> APIC 0 -> Node 0
SRAT: PXM 0 -> APIC 1 -> Node 0
SRAT: PXM 0 -> APIC 2 -> Node 0
SRAT: PXM 0 -> APIC 3 -> Node 0
SRAT: PXM 0 -> APIC 38 -> Node 0
SRAT: PXM 0 -> APIC 39 -> Node 0
SRAT: PXM 0 -> APIC 36 -> Node 0
SRAT: PXM 0 -> APIC 37 -> Node 0
SRAT: PXM 1 -> APIC 64 -> Node 1
SRAT: PXM 1 -> APIC 65 -> Node 1
SRAT: PXM 1 -> APIC 66 -> Node 1
SRAT: PXM 1 -> APIC 67 -> Node 1
SRAT: PXM 1 -> APIC 102 -> Node 1
SRAT: PXM 1 -> APIC 103 -> Node 1
SRAT: PXM 1 -> APIC 100 -> Node 1
SRAT: PXM 1 -> APIC 101 -> Node 1
SRAT: Node 0 PXM 0 0-80000000
Entering add_active_range(0, 0, 152) 0 entries of 3200 used
Entering add_active_range(0, 256, 524165) 1 entries of 3200 used
SRAT: Node 0 PXM 0 0-470000000
Entering add_active_range(0, 0, 152) 2 entries of 3200 used
Entering add_active_range(0, 256, 524165) 2 entries of 3200 used
Entering add_active_range(0, 1048576, 4653056) 2 entries of 3200 used
SRAT: Node 0 PXM 0 0-1070000000
SRAT: Hotplug area has existing memory
Entering add_active_range(0, 0, 152) 3 entries of 3200 used
Entering add_active_range(0, 256, 524165) 3 entries of 3200 used
Entering add_active_range(0, 1048576, 4653056) 3 entries of 3200 used
SRAT: Node 1 PXM 1 1070000000-1160000000
Entering add_active_range(1, 17235968, 18219008) 3 entries of 3200 used
SRAT: Node 1 PXM 1 1070000000-3200000000
SRAT: Hotplug area has existing memory
Entering add_active_range(1, 17235968, 18219008) 4 entries of 3200 used
NUMA: Using 28 for the hash shift.
Bootmem setup node 0 0000000000000000-0000001070000000
Bootmem setup node 1 0000001070000000-0000001160000000

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
  2006-08-21 13:45 [PATCH 0/6] Sizing zones and holes in an architecture independent manner V9 Mel Gorman
@ 2006-08-21 13:46 ` Mel Gorman
  2006-08-30 20:57   ` Keith Mannthey
  0 siblings, 1 reply; 47+ messages in thread
From: Mel Gorman @ 2006-08-21 13:46 UTC (permalink / raw)
  To: akpm
  Cc: Mel Gorman, tony.luck, linux-mm, ak, bob.picco, linux-kernel,
	linuxppc-dev


Size zones and holes in an architecture independent manner for x86_64.


 arch/x86_64/Kconfig         |    3 
 arch/x86_64/kernel/e820.c   |  125 ++++++++++++++-------------------------
 arch/x86_64/kernel/setup.c  |    7 +-
 arch/x86_64/mm/init.c       |   65 +-------------------
 arch/x86_64/mm/k8topology.c |    3 
 arch/x86_64/mm/numa.c       |   21 +++---
 arch/x86_64/mm/srat.c       |   11 ++-
 include/asm-x86_64/e820.h   |    5 -
 include/asm-x86_64/proto.h  |    2 
 9 files changed, 84 insertions(+), 158 deletions(-)

Signed-off-by: Mel Gorman <mel@csn.ul.ie>

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/Kconfig linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/Kconfig
--- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/Kconfig	2006-08-21 09:23:50.000000000 +0100
+++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/Kconfig	2006-08-21 10:15:58.000000000 +0100
@@ -85,6 +85,9 @@ config ARCH_MAY_HAVE_PC_FDC
 	bool
 	default y
 
+config ARCH_POPULATES_NODE_MAP
+	def_bool y
+
 config DMI
 	bool
 	default y
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/kernel/e820.c linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/kernel/e820.c
--- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/kernel/e820.c	2006-08-21 09:23:50.000000000 +0100
+++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/kernel/e820.c	2006-08-21 10:15:58.000000000 +0100
@@ -162,59 +162,14 @@ unsigned long __init find_e820_area(unsi
 	return -1UL;		
 } 
 
-/* 
- * Free bootmem based on the e820 table for a node.
- */
-void __init e820_bootmem_free(pg_data_t *pgdat, unsigned long start,unsigned long end)
-{
-	int i;
-	for (i = 0; i < e820.nr_map; i++) {
-		struct e820entry *ei = &e820.map[i]; 
-		unsigned long last, addr;
-
-		if (ei->type != E820_RAM || 
-		    ei->addr+ei->size <= start || 
-		    ei->addr >= end)
-			continue;
-
-		addr = round_up(ei->addr, PAGE_SIZE);
-		if (addr < start) 
-			addr = start;
-
-		last = round_down(ei->addr + ei->size, PAGE_SIZE); 
-		if (last >= end)
-			last = end; 
-
-		if (last > addr && last-addr >= PAGE_SIZE)
-			free_bootmem_node(pgdat, addr, last-addr);
-	}
-}
-
 /*
  * Find the highest page frame number we have available
  */
 unsigned long __init e820_end_of_ram(void)
 {
-	int i;
 	unsigned long end_pfn = 0;
+	end_pfn = find_max_pfn_with_active_regions();
 	
-	for (i = 0; i < e820.nr_map; i++) {
-		struct e820entry *ei = &e820.map[i]; 
-		unsigned long start, end;
-
-		start = round_up(ei->addr, PAGE_SIZE); 
-		end = round_down(ei->addr + ei->size, PAGE_SIZE); 
-		if (start >= end)
-			continue;
-		if (ei->type == E820_RAM) { 
-		if (end > end_pfn<<PAGE_SHIFT)
-			end_pfn = end>>PAGE_SHIFT;
-		} else { 
-			if (end > end_pfn_map<<PAGE_SHIFT) 
-				end_pfn_map = end>>PAGE_SHIFT;
-		} 
-	}
-
 	if (end_pfn > end_pfn_map) 
 		end_pfn_map = end_pfn;
 	if (end_pfn_map > MAXMEM>>PAGE_SHIFT)
@@ -224,43 +179,10 @@ unsigned long __init e820_end_of_ram(voi
 	if (end_pfn > end_pfn_map) 
 		end_pfn = end_pfn_map; 
 
+	printk("end_pfn_map = %lu\n", end_pfn_map);
 	return end_pfn;	
 }
 
-/* 
- * Compute how much memory is missing in a range.
- * Unlike the other functions in this file the arguments are in page numbers.
- */
-unsigned long __init
-e820_hole_size(unsigned long start_pfn, unsigned long end_pfn)
-{
-	unsigned long ram = 0;
-	unsigned long start = start_pfn << PAGE_SHIFT;
-	unsigned long end = end_pfn << PAGE_SHIFT;
-	int i;
-	for (i = 0; i < e820.nr_map; i++) {
-		struct e820entry *ei = &e820.map[i];
-		unsigned long last, addr;
-
-		if (ei->type != E820_RAM ||
-		    ei->addr+ei->size <= start ||
-		    ei->addr >= end)
-			continue;
-
-		addr = round_up(ei->addr, PAGE_SIZE);
-		if (addr < start)
-			addr = start;
-
-		last = round_down(ei->addr + ei->size, PAGE_SIZE);
-		if (last >= end)
-			last = end;
-
-		if (last > addr)
-			ram += last - addr;
-	}
-	return ((end - start) - ram) >> PAGE_SHIFT;
-}
-
 /*
  * Mark e820 reserved areas as busy for the resource manager.
  */
@@ -342,6 +264,49 @@ void __init e820_mark_nosave_regions(voi
 	}
 }
 
+/* Walk the e820 map and register active regions within a node */
+void __init
+e820_register_active_regions(int nid, unsigned long start_pfn,
+							unsigned long end_pfn)
+{
+	int i;
+	unsigned long ei_startpfn, ei_endpfn;
+	for (i = 0; i < e820.nr_map; i++) {
+		struct e820entry *ei = &e820.map[i];
+		ei_startpfn = round_up(ei->addr, PAGE_SIZE) >> PAGE_SHIFT;
+		ei_endpfn = round_down(ei->addr + ei->size, PAGE_SIZE)
+								>> PAGE_SHIFT;
+
+		/* Skip map entries smaller than a page */
+		if (ei_startpfn > ei_endpfn)
+			continue;
+
+		/* Check if end_pfn_map should be updated */
+		if (ei->type != E820_RAM && ei_endpfn > end_pfn_map)
+			end_pfn_map = ei_endpfn;
+
+		/* Skip if map is outside the node */
+		if (ei->type != E820_RAM ||
+				ei_endpfn <= start_pfn ||
+				ei_startpfn >= end_pfn)
+			continue;
+
+		/* Check for overlaps */
+		if (ei_startpfn < start_pfn)
+			ei_startpfn = start_pfn;
+		if (ei_endpfn > end_pfn)
+			ei_endpfn = end_pfn;
+
+		/* Obey end_user_pfn to save on memmap */
+		if (ei_startpfn >= end_user_pfn)
+			continue;
+		if (ei_endpfn > end_user_pfn)
+			ei_endpfn = end_user_pfn;
+
+		add_active_range(nid, ei_startpfn, ei_endpfn);
+	}
+}
+
 /* 
  * Add a memory region to the kernel e820 map.
  */ 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/kernel/setup.c linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/kernel/setup.c
--- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/kernel/setup.c	2006-08-21 09:23:50.000000000 +0100
+++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/kernel/setup.c	2006-08-21 10:15:58.000000000 +0100
@@ -292,7 +292,8 @@ contig_initmem_init(unsigned long start_
 	if (bootmap == -1L)
 		panic("Cannot find bootmem map of size %ld\n",bootmap_size);
 	bootmap_size = init_bootmem(bootmap >> PAGE_SHIFT, end_pfn);
-	e820_bootmem_free(NODE_DATA(0), 0, end_pfn << PAGE_SHIFT);
+	e820_register_active_regions(0, start_pfn, end_pfn);
+	free_bootmem_with_active_regions(0, end_pfn);
 	reserve_bootmem(bootmap, bootmap_size);
 } 
 #endif
@@ -386,6 +387,7 @@ void __init setup_arch(char **cmdline_p)
 
 	finish_e820_parsing();
 
+	e820_register_active_regions(0, 0, -1UL);
 	/*
 	 * partially used pages are not usable - thus
 	 * we are rounding upwards:
@@ -416,6 +418,9 @@ void __init setup_arch(char **cmdline_p)
 	max_pfn = end_pfn;
 	high_memory = (void *)__va(end_pfn * PAGE_SIZE - 1) + 1;
 
+	/* Remove active ranges so rediscovery with NUMA-awareness happens */
+	remove_all_active_ranges();
+
 #ifdef CONFIG_ACPI_NUMA
 	/*
 	 * Parse SRAT to discover nodes.
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/init.c linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/init.c
--- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/init.c	2006-08-21 09:23:50.000000000 +0100
+++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/init.c	2006-08-21 10:15:58.000000000 +0100
@@ -404,69 +404,15 @@ void __cpuinit zap_low_mappings(int cpu)
 	__flush_tlb_all();
 }
 
-/* Compute zone sizes for the DMA and DMA32 zones in a node. */
-__init void
-size_zones(unsigned long *z, unsigned long *h,
-	   unsigned long start_pfn, unsigned long end_pfn)
-{
- 	int i;
- 	unsigned long w;
-
- 	for (i = 0; i < MAX_NR_ZONES; i++)
- 		z[i] = 0;
-
- 	if (start_pfn < MAX_DMA_PFN)
- 		z[ZONE_DMA] = MAX_DMA_PFN - start_pfn;
- 	if (start_pfn < MAX_DMA32_PFN) {
- 		unsigned long dma32_pfn = MAX_DMA32_PFN;
- 		if (dma32_pfn > end_pfn)
- 			dma32_pfn = end_pfn;
- 		z[ZONE_DMA32] = dma32_pfn - start_pfn;
- 	}
- 	z[ZONE_NORMAL] = end_pfn - start_pfn;
-
- 	/* Remove lower zones from higher ones. */
- 	w = 0;
- 	for (i = 0; i < MAX_NR_ZONES; i++) {
- 		if (z[i])
- 			z[i] -= w;
- 	        w += z[i];
-	}
-
-	/* Compute holes */
-	w = start_pfn;
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		unsigned long s = w;
-		w += z[i];
-		h[i] = e820_hole_size(s, w);
-	}
-
-	/* Add the space pace needed for mem_map to the holes too. */
-	for (i = 0; i < MAX_NR_ZONES; i++)
-		h[i] += (z[i] * sizeof(struct page)) / PAGE_SIZE;
-
-	/* The 16MB DMA zone has the kernel and other misc mappings.
- 	   Account them too */
-	if (h[ZONE_DMA]) {
-		h[ZONE_DMA] += dma_reserve;
-		if (h[ZONE_DMA] >= z[ZONE_DMA]) {
-			printk(KERN_WARNING
-				"Kernel too large and filling up ZONE_DMA?\n");
-			h[ZONE_DMA] = z[ZONE_DMA];
-		}
-	}
-}
-
 #ifndef CONFIG_NUMA
 void __init paging_init(void)
 {
-	unsigned long zones[MAX_NR_ZONES], holes[MAX_NR_ZONES];
-
+	unsigned long max_zone_pfns[MAX_NR_ZONES] = {MAX_DMA_PFN,
+							MAX_DMA32_PFN,
+							end_pfn};
 	memory_present(0, 0, end_pfn);
 	sparse_init();
-	size_zones(zones, holes, 0, end_pfn);
-	free_area_init_node(0, NODE_DATA(0), zones,
-			    __pa(PAGE_OFFSET) >> PAGE_SHIFT, holes);
+	free_area_init_nodes(max_zone_pfns);
 }
 #endif
 
@@ -613,7 +559,8 @@ void __init mem_init(void)
 #else
 	totalram_pages = free_all_bootmem();
 #endif
-	reservedpages = end_pfn - totalram_pages - e820_hole_size(0, end_pfn);
+	reservedpages = end_pfn - totalram_pages -
+					absent_pages_in_range(0, end_pfn);
 
 	after_bootmem = 1;
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/k8topology.c linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/k8topology.c
--- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/k8topology.c	2006-08-06 19:20:11.000000000 +0100
+++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/k8topology.c	2006-08-21 10:15:58.000000000 +0100
@@ -146,6 +146,9 @@ int __init k8_scan_nodes(unsigned long s
 		
 		nodes[nodeid].start = base; 
 		nodes[nodeid].end = limit;
+		e820_register_active_regions(nodeid,
+				nodes[nodeid].start >> PAGE_SHIFT,
+				nodes[nodeid].end >> PAGE_SHIFT);
 
 		prevbase = base;
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/numa.c linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/numa.c
--- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/numa.c	2006-08-21 09:23:50.000000000 +0100
+++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/numa.c	2006-08-21 10:15:58.000000000 +0100
@@ -161,7 +161,7 @@ void __init setup_node_bootmem(int nodei
 					 bootmap_start >> PAGE_SHIFT, 
 					 start_pfn, end_pfn); 
 
-	e820_bootmem_free(NODE_DATA(nodeid), start, end);
+	free_bootmem_with_active_regions(nodeid, end);
 
 	reserve_bootmem_node(NODE_DATA(nodeid), nodedata_phys, pgdat_size); 
 	reserve_bootmem_node(NODE_DATA(nodeid), bootmap_start, bootmap_pages<<PAGE_SHIFT);
@@ -175,13 +175,11 @@ void __init setup_node_bootmem(int nodei
 void __init setup_node_zones(int nodeid)
 { 
 	unsigned long start_pfn, end_pfn, memmapsize, limit;
-	unsigned long zones[MAX_NR_ZONES];
-	unsigned long holes[MAX_NR_ZONES];
 
  	start_pfn = node_start_pfn(nodeid);
  	end_pfn = node_end_pfn(nodeid);
 
-	Dprintk(KERN_INFO "Setting up node %d %lx-%lx\n",
+	Dprintk(KERN_INFO "Setting up memmap for node %d %lx-%lx\n",
 		nodeid, start_pfn, end_pfn);
 
 	/* Try to allocate mem_map at end to not fill up precious <4GB
@@ -195,10 +193,6 @@ void __init setup_node_zones(int nodeid)
 				round_down(limit - memmapsize, PAGE_SIZE), 
 				limit);
 #endif
-
-	size_zones(zones, holes, start_pfn, end_pfn);
-	free_area_init_node(nodeid, NODE_DATA(nodeid), zones,
-			    start_pfn, holes);
 } 
 
 void __init numa_init_array(void)
@@ -259,8 +253,11 @@ static int __init numa_emulation(unsigne
  		printk(KERN_ERR "No NUMA hash function found. Emulation disabled.\n");
  		return -1;
  	}
- 	for_each_online_node(i)
+ 	for_each_online_node(i) {
+		e820_register_active_regions(i, nodes[i].start >> PAGE_SHIFT,
+						nodes[i].end >> PAGE_SHIFT);
  		setup_node_bootmem(i, nodes[i].start, nodes[i].end);
+	}
  	numa_init_array();
  	return 0;
 }
@@ -299,6 +296,7 @@ void __init numa_initmem_init(unsigned l
 	for (i = 0; i < NR_CPUS; i++)
 		numa_set_node(i, 0);
 	node_to_cpumask[0] = cpumask_of_cpu(0);
+	e820_register_active_regions(0, start_pfn, end_pfn);
 	setup_node_bootmem(0, start_pfn << PAGE_SHIFT, end_pfn << PAGE_SHIFT);
 }
 
@@ -340,12 +338,17 @@ static void __init arch_sparse_init(void
 void __init paging_init(void)
 { 
 	int i;
+	unsigned long max_zone_pfns[MAX_NR_ZONES] = { MAX_DMA_PFN,
+		MAX_DMA32_PFN,
+		end_pfn};
 
 	arch_sparse_init();
 
 	for_each_online_node(i) {
 		setup_node_zones(i); 
 	}
+
+	free_area_init_nodes(max_zone_pfns);
 } 
 
 static __init int numa_setup(char *opt)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/srat.c linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c
--- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/arch/x86_64/mm/srat.c	2006-08-21 09:23:50.000000000 +0100
+++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c	2006-08-21 10:15:58.000000000 +0100
@@ -84,6 +84,7 @@ static __init void bad_srat(void)
 		apicid_to_node[i] = NUMA_NO_NODE;
 	for (i = 0; i < MAX_NUMNODES; i++)
 		nodes_add[i].start = nodes[i].end = 0;
+	remove_all_active_ranges();
 }
 
 static __init inline int srat_disabled(void)
@@ -166,7 +167,7 @@ static int hotadd_enough_memory(struct b
 
 	if (mem < 0)
 		return 0;
-	allowed = (end_pfn - e820_hole_size(0, end_pfn)) * PAGE_SIZE;
+	allowed = (end_pfn - absent_pages_in_range(0, end_pfn)) * PAGE_SIZE;
 	allowed = (allowed / 100) * hotadd_percent;
 	if (allocated + mem > allowed) {
 		unsigned long range;
@@ -238,7 +239,7 @@ static int reserve_hotadd(int node, unsi
 	}
 
 	/* This check might be a bit too strict, but I'm keeping it for now. */
-	if (e820_hole_size(s_pfn, e_pfn) != e_pfn - s_pfn) {
+	if (absent_pages_in_range(s_pfn, e_pfn) != e_pfn - s_pfn) {
 		printk(KERN_ERR "SRAT: Hotplug area has existing memory\n");
 		return -1;
 	}
@@ -329,6 +330,8 @@ acpi_numa_memory_affinity_init(struct ac
 
 	printk(KERN_INFO "SRAT: Node %u PXM %u %Lx-%Lx\n", node, pxm,
 	       nd->start, nd->end);
+	e820_register_active_regions(node, nd->start >> PAGE_SHIFT,
+						nd->end >> PAGE_SHIFT);
 
  	if (ma->flags.hot_pluggable && !reserve_hotadd(node, start, end) < 0) {
 		/* Ignore hotadd region. Undo damage */
@@ -351,12 +354,12 @@ static int nodes_cover_memory(void)
 		unsigned long s = nodes[i].start >> PAGE_SHIFT;
 		unsigned long e = nodes[i].end >> PAGE_SHIFT;
 		pxmram += e - s;
-		pxmram -= e820_hole_size(s, e);
+		pxmram -= absent_pages_in_range(s, e);
 		if ((long)pxmram < 0)
 			pxmram = 0;
 	}
 
-	e820ram = end_pfn - e820_hole_size(0, end_pfn);
+	e820ram = end_pfn - absent_pages_in_range(0, end_pfn);
 	/* We seem to lose 3 pages somewhere. Allow a bit of slack. */
 	if ((long)(e820ram - pxmram) >= 1*1024*1024) {
 		printk(KERN_ERR
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/include/asm-x86_64/e820.h linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/include/asm-x86_64/e820.h
--- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/include/asm-x86_64/e820.h	2006-08-21 09:23:52.000000000 +0100
+++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/include/asm-x86_64/e820.h	2006-08-21 10:15:58.000000000 +0100
@@ -51,10 +51,9 @@ extern void e820_print_map(char *who);
 extern int e820_any_mapped(unsigned long start, unsigned long end, unsigned type);
 extern int e820_all_mapped(unsigned long start, unsigned long end, unsigned type);
 
-extern void e820_bootmem_free(pg_data_t *pgdat, unsigned long start,unsigned long end);
 extern void e820_setup_gap(void);
-extern unsigned long e820_hole_size(unsigned long start_pfn,
-				    unsigned long end_pfn);
+extern void e820_register_active_regions(int nid,
+				unsigned long start_pfn, unsigned long end_pfn);
 
 extern void finish_e820_parsing(void);
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/include/asm-x86_64/proto.h linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/include/asm-x86_64/proto.h
--- linux-2.6.18-rc4-mm2-103-x86_use_init_nodes/include/asm-x86_64/proto.h	2006-08-21 09:23:52.000000000 +0100
+++ linux-2.6.18-rc4-mm2-104-x86_64_use_init_nodes/include/asm-x86_64/proto.h	2006-08-21 10:15:58.000000000 +0100
@@ -24,8 +24,6 @@ extern void mtrr_bp_init(void);
 #define mtrr_bp_init() do {} while (0)
 #endif
 extern void init_memory_mapping(unsigned long start, unsigned long end);
-extern void size_zones(unsigned long *z, unsigned long *h,
-			unsigned long start_pfn, unsigned long end_pfn);
 
 extern void system_call(void); 
 extern int kernel_syscall(void);

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
  2006-07-08 11:10 [PATCH 0/6] Sizing zones and holes in an architecture independent manner V8 Mel Gorman
@ 2006-07-08 11:12 ` Mel Gorman
  0 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2006-07-08 11:12 UTC (permalink / raw)
  To: akpm
  Cc: davej, tony.luck, Mel Gorman, ak, bob.picco, linux-kernel,
	linuxppc-dev, linux-mm


Size zones and holes in an architecture independent manner for x86_64.


 arch/x86_64/Kconfig         |    3 
 arch/x86_64/kernel/e820.c   |  125 ++++++++++++++-------------------------
 arch/x86_64/kernel/setup.c  |    7 +-
 arch/x86_64/mm/init.c       |   62 -------------------
 arch/x86_64/mm/k8topology.c |    3 
 arch/x86_64/mm/numa.c       |   18 ++---
 arch/x86_64/mm/srat.c       |   11 ++-
 include/asm-x86_64/e820.h   |    5 -
 include/asm-x86_64/proto.h  |    2 
 9 files changed, 79 insertions(+), 157 deletions(-)

Signed-off-by: Mel Gorman <mel@csn.ul.ie>

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-mm6-103-x86_use_init_nodes/arch/x86_64/Kconfig linux-2.6.17-mm6-104-x86_64_use_init_nodes/arch/x86_64/Kconfig
--- linux-2.6.17-mm6-103-x86_use_init_nodes/arch/x86_64/Kconfig	2006-07-05 14:31:12.000000000 +0100
+++ linux-2.6.17-mm6-104-x86_64_use_init_nodes/arch/x86_64/Kconfig	2006-07-06 11:09:46.000000000 +0100
@@ -81,6 +81,9 @@ config ARCH_MAY_HAVE_PC_FDC
 	bool
 	default y
 
+config ARCH_POPULATES_NODE_MAP
+	def_bool y
+
 config DMI
 	bool
 	default y
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-mm6-103-x86_use_init_nodes/arch/x86_64/kernel/e820.c linux-2.6.17-mm6-104-x86_64_use_init_nodes/arch/x86_64/kernel/e820.c
--- linux-2.6.17-mm6-103-x86_use_init_nodes/arch/x86_64/kernel/e820.c	2006-07-05 14:31:12.000000000 +0100
+++ linux-2.6.17-mm6-104-x86_64_use_init_nodes/arch/x86_64/kernel/e820.c	2006-07-06 11:09:46.000000000 +0100
@@ -16,6 +16,7 @@
 #include <linux/string.h>
 #include <linux/kexec.h>
 #include <linux/module.h>
+#include <linux/mm.h>
 
 #include <asm/page.h>
 #include <asm/e820.h>
@@ -159,58 +160,14 @@ unsigned long __init find_e820_area(unsi
 	return -1UL;		
 } 
 
-/* 
- * Free bootmem based on the e820 table for a node.
- */
-void __init e820_bootmem_free(pg_data_t *pgdat, unsigned long start,unsigned long end)
-{
-	int i;
-	for (i = 0; i < e820.nr_map; i++) {
-		struct e820entry *ei = &e820.map[i]; 
-		unsigned long last, addr;
-
-		if (ei->type != E820_RAM || 
-		    ei->addr+ei->size <= start || 
-		    ei->addr >= end)
-			continue;
-
-		addr = round_up(ei->addr, PAGE_SIZE);
-		if (addr < start) 
-			addr = start;
-
-		last = round_down(ei->addr + ei->size, PAGE_SIZE); 
-		if (last >= end)
-			last = end; 
-
-		if (last > addr && last-addr >= PAGE_SIZE)
-			free_bootmem_node(pgdat, addr, last-addr);
-	}
-}
-
 /*
  * Find the highest page frame number we have available
  */
 unsigned long __init e820_end_of_ram(void)
 {
-	int i;
 	unsigned long end_pfn = 0;
 	
-	for (i = 0; i < e820.nr_map; i++) {
-		struct e820entry *ei = &e820.map[i]; 
-		unsigned long start, end;
-
-		start = round_up(ei->addr, PAGE_SIZE); 
-		end = round_down(ei->addr + ei->size, PAGE_SIZE); 
-		if (start >= end)
-			continue;
-		if (ei->type == E820_RAM) { 
-		if (end > end_pfn<<PAGE_SHIFT)
-			end_pfn = end>>PAGE_SHIFT;
-		} else { 
-			if (end > end_pfn_map<<PAGE_SHIFT) 
-				end_pfn_map = end>>PAGE_SHIFT;
-		} 
-	}
+	end_pfn = find_max_pfn_with_active_regions();
 
 	if (end_pfn > end_pfn_map) 
 		end_pfn_map = end_pfn;
@@ -221,43 +178,10 @@ unsigned long __init e820_end_of_ram(voi
 	if (end_pfn > end_pfn_map) 
 		end_pfn = end_pfn_map; 
 
+	printk("end_pfn_map = %lu\n", end_pfn_map);
 	return end_pfn;	
 }
 
-/* 
- * Compute how much memory is missing in a range.
- * Unlike the other functions in this file the arguments are in page numbers.
- */
-unsigned long __init
-e820_hole_size(unsigned long start_pfn, unsigned long end_pfn)
-{
-	unsigned long ram = 0;
-	unsigned long start = start_pfn << PAGE_SHIFT;
-	unsigned long end = end_pfn << PAGE_SHIFT;
-	int i;
-	for (i = 0; i < e820.nr_map; i++) {
-		struct e820entry *ei = &e820.map[i];
-		unsigned long last, addr;
-
-		if (ei->type != E820_RAM ||
-		    ei->addr+ei->size <= start ||
-		    ei->addr >= end)
-			continue;
-
-		addr = round_up(ei->addr, PAGE_SIZE);
-		if (addr < start)
-			addr = start;
-
-		last = round_down(ei->addr + ei->size, PAGE_SIZE);
-		if (last >= end)
-			last = end;
-
-		if (last > addr)
-			ram += last - addr;
-	}
-	return ((end - start) - ram) >> PAGE_SHIFT;
-}
-
 /*
  * Mark e820 reserved areas as busy for the resource manager.
  */
@@ -292,6 +216,49 @@ void __init e820_reserve_resources(void)
 	}
 }
 
+/* Walk the e820 map and register active regions within a node */
+void __init
+e820_register_active_regions(int nid, unsigned long start_pfn,
+							unsigned long end_pfn)
+{
+	int i;
+	unsigned long ei_startpfn, ei_endpfn;
+	for (i = 0; i < e820.nr_map; i++) {
+		struct e820entry *ei = &e820.map[i];
+		ei_startpfn = round_up(ei->addr, PAGE_SIZE) >> PAGE_SHIFT;
+		ei_endpfn = round_down(ei->addr + ei->size, PAGE_SIZE)
+								>> PAGE_SHIFT;
+
+		/* Skip map entries smaller than a page */
+		if (ei_startpfn > ei_endpfn)
+			continue;
+
+		/* Check if end_pfn_map should be updated */
+		if (ei->type != E820_RAM && ei_endpfn > end_pfn_map)
+			end_pfn_map = ei_endpfn;
+
+		/* Skip if map is outside the node */
+		if (ei->type != E820_RAM ||
+				ei_endpfn <= start_pfn ||
+				ei_startpfn >= end_pfn)
+			continue;
+
+		/* Check for overlaps */
+		if (ei_startpfn < start_pfn)
+			ei_startpfn = start_pfn;
+		if (ei_endpfn > end_pfn)
+			ei_endpfn = end_pfn;
+
+		/* Obey end_user_pfn to save on memmap */
+		if (ei_startpfn >= end_user_pfn)
+			continue;
+		if (ei_endpfn > end_user_pfn)
+			ei_endpfn = end_user_pfn;
+
+		add_active_range(nid, ei_startpfn, ei_endpfn);
+	}
+}
+
 /* 
  * Add a memory region to the kernel e820 map.
  */ 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-mm6-103-x86_use_init_nodes/arch/x86_64/kernel/setup.c linux-2.6.17-mm6-104-x86_64_use_init_nodes/arch/x86_64/kernel/setup.c
--- linux-2.6.17-mm6-103-x86_use_init_nodes/arch/x86_64/kernel/setup.c	2006-07-05 14:31:12.000000000 +0100
+++ linux-2.6.17-mm6-104-x86_64_use_init_nodes/arch/x86_64/kernel/setup.c	2006-07-06 11:09:46.000000000 +0100
@@ -466,7 +466,8 @@ contig_initmem_init(unsigned long start_
 	if (bootmap == -1L)
 		panic("Cannot find bootmem map of size %ld\n",bootmap_size);
 	bootmap_size = init_bootmem(bootmap >> PAGE_SHIFT, end_pfn);
-	e820_bootmem_free(NODE_DATA(0), 0, end_pfn << PAGE_SHIFT);
+	e820_register_active_regions(0, start_pfn, end_pfn);
+	free_bootmem_with_active_regions(0, end_pfn);
 	reserve_bootmem(bootmap, bootmap_size);
 } 
 #endif
@@ -546,6 +547,7 @@ void __init setup_arch(char **cmdline_p)
 
 	early_identify_cpu(&boot_cpu_data);
 
+	e820_register_active_regions(0, 0, -1UL);
 	/*
 	 * partially used pages are not usable - thus
 	 * we are rounding upwards:
@@ -571,6 +573,9 @@ void __init setup_arch(char **cmdline_p)
 	acpi_boot_table_init();
 #endif
 
+	/* Remove active ranges so rediscovery with NUMA-awareness happens */
+	remove_all_active_ranges();
+
 #ifdef CONFIG_ACPI_NUMA
 	/*
 	 * Parse SRAT to discover nodes.
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-mm6-103-x86_use_init_nodes/arch/x86_64/mm/init.c linux-2.6.17-mm6-104-x86_64_use_init_nodes/arch/x86_64/mm/init.c
--- linux-2.6.17-mm6-103-x86_use_init_nodes/arch/x86_64/mm/init.c	2006-07-05 14:31:12.000000000 +0100
+++ linux-2.6.17-mm6-104-x86_64_use_init_nodes/arch/x86_64/mm/init.c	2006-07-06 11:09:46.000000000 +0100
@@ -403,69 +403,12 @@ void __cpuinit zap_low_mappings(int cpu)
 	__flush_tlb_all();
 }
 
-/* Compute zone sizes for the DMA and DMA32 zones in a node. */
-__init void
-size_zones(unsigned long *z, unsigned long *h,
-	   unsigned long start_pfn, unsigned long end_pfn)
-{
- 	int i;
- 	unsigned long w;
-
- 	for (i = 0; i < MAX_NR_ZONES; i++)
- 		z[i] = 0;
-
- 	if (start_pfn < MAX_DMA_PFN)
- 		z[ZONE_DMA] = MAX_DMA_PFN - start_pfn;
- 	if (start_pfn < MAX_DMA32_PFN) {
- 		unsigned long dma32_pfn = MAX_DMA32_PFN;
- 		if (dma32_pfn > end_pfn)
- 			dma32_pfn = end_pfn;
- 		z[ZONE_DMA32] = dma32_pfn - start_pfn;
- 	}
- 	z[ZONE_NORMAL] = end_pfn - start_pfn;
-
- 	/* Remove lower zones from higher ones. */
- 	w = 0;
- 	for (i = 0; i < MAX_NR_ZONES; i++) {
- 		if (z[i])
- 			z[i] -= w;
- 	        w += z[i];
-	}
-
-	/* Compute holes */
-	w = start_pfn;
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		unsigned long s = w;
-		w += z[i];
-		h[i] = e820_hole_size(s, w);
-	}
-
-	/* Add the space pace needed for mem_map to the holes too. */
-	for (i = 0; i < MAX_NR_ZONES; i++)
-		h[i] += (z[i] * sizeof(struct page)) / PAGE_SIZE;
-
-	/* The 16MB DMA zone has the kernel and other misc mappings.
- 	   Account them too */
-	if (h[ZONE_DMA]) {
-		h[ZONE_DMA] += dma_reserve;
-		if (h[ZONE_DMA] >= z[ZONE_DMA]) {
-			printk(KERN_WARNING
-				"Kernel too large and filling up ZONE_DMA?\n");
-			h[ZONE_DMA] = z[ZONE_DMA];
-		}
-	}
-}
-
 #ifndef CONFIG_NUMA
 void __init paging_init(void)
 {
-	unsigned long zones[MAX_NR_ZONES], holes[MAX_NR_ZONES];
-
 	memory_present(0, 0, end_pfn);
 	sparse_init();
-	size_zones(zones, holes, 0, end_pfn);
-	free_area_init_node(0, NODE_DATA(0), zones,
-			    __pa(PAGE_OFFSET) >> PAGE_SHIFT, holes);
+	free_area_init_nodes(MAX_DMA_PFN, MAX_DMA32_PFN, end_pfn, end_pfn);
 }
 #endif
 
@@ -614,7 +557,8 @@ void __init mem_init(void)
 #else
 	totalram_pages = free_all_bootmem();
 #endif
-	reservedpages = end_pfn - totalram_pages - e820_hole_size(0, end_pfn);
+	reservedpages = end_pfn - totalram_pages -
+					absent_pages_in_range(0, end_pfn);
 
 	after_bootmem = 1;
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-mm6-103-x86_use_init_nodes/arch/x86_64/mm/k8topology.c linux-2.6.17-mm6-104-x86_64_use_init_nodes/arch/x86_64/mm/k8topology.c
--- linux-2.6.17-mm6-103-x86_use_init_nodes/arch/x86_64/mm/k8topology.c	2006-06-18 02:49:35.000000000 +0100
+++ linux-2.6.17-mm6-104-x86_64_use_init_nodes/arch/x86_64/mm/k8topology.c	2006-07-06 11:09:46.000000000 +0100
@@ -146,6 +146,9 @@ int __init k8_scan_nodes(unsigned long s
 		
 		nodes[nodeid].start = base; 
 		nodes[nodeid].end = limit;
+		e820_register_active_regions(nodeid,
+				nodes[nodeid].start >> PAGE_SHIFT,
+				nodes[nodeid].end >> PAGE_SHIFT);
 
 		prevbase = base;
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-mm6-103-x86_use_init_nodes/arch/x86_64/mm/numa.c linux-2.6.17-mm6-104-x86_64_use_init_nodes/arch/x86_64/mm/numa.c
--- linux-2.6.17-mm6-103-x86_use_init_nodes/arch/x86_64/mm/numa.c	2006-06-18 02:49:35.000000000 +0100
+++ linux-2.6.17-mm6-104-x86_64_use_init_nodes/arch/x86_64/mm/numa.c	2006-07-06 11:09:46.000000000 +0100
@@ -161,7 +161,7 @@ void __init setup_node_bootmem(int nodei
 					 bootmap_start >> PAGE_SHIFT, 
 					 start_pfn, end_pfn); 
 
-	e820_bootmem_free(NODE_DATA(nodeid), start, end);
+	free_bootmem_with_active_regions(nodeid, end);
 
 	reserve_bootmem_node(NODE_DATA(nodeid), nodedata_phys, pgdat_size); 
 	reserve_bootmem_node(NODE_DATA(nodeid), bootmap_start, bootmap_pages<<PAGE_SHIFT);
@@ -175,13 +175,11 @@ void __init setup_node_bootmem(int nodei
 void __init setup_node_zones(int nodeid)
 { 
 	unsigned long start_pfn, end_pfn, memmapsize, limit;
-	unsigned long zones[MAX_NR_ZONES];
-	unsigned long holes[MAX_NR_ZONES];
 
  	start_pfn = node_start_pfn(nodeid);
  	end_pfn = node_end_pfn(nodeid);
 
-	Dprintk(KERN_INFO "Setting up node %d %lx-%lx\n",
+	Dprintk(KERN_INFO "Setting up memmap for node %d %lx-%lx\n",
 		nodeid, start_pfn, end_pfn);
 
 	/* Try to allocate mem_map at end to not fill up precious <4GB
@@ -195,10 +193,6 @@ void __init setup_node_zones(int nodeid)
 				round_down(limit - memmapsize, PAGE_SIZE), 
 				limit);
 #endif
-
-	size_zones(zones, holes, start_pfn, end_pfn);
-	free_area_init_node(nodeid, NODE_DATA(nodeid), zones,
-			    start_pfn, holes);
 } 
 
 void __init numa_init_array(void)
@@ -259,8 +253,11 @@ static int numa_emulation(unsigned long 
  		printk(KERN_ERR "No NUMA hash function found. Emulation disabled.\n");
  		return -1;
  	}
- 	for_each_online_node(i)
+ 	for_each_online_node(i) {
+		e820_register_active_regions(i, nodes[i].start >> PAGE_SHIFT,
+						nodes[i].end >> PAGE_SHIFT);
  		setup_node_bootmem(i, nodes[i].start, nodes[i].end);
+	}
  	numa_init_array();
  	return 0;
 }
@@ -299,6 +296,7 @@ void __init numa_initmem_init(unsigned l
 	for (i = 0; i < NR_CPUS; i++)
 		numa_set_node(i, 0);
 	node_to_cpumask[0] = cpumask_of_cpu(0);
+	e820_register_active_regions(0, start_pfn, end_pfn);
 	setup_node_bootmem(0, start_pfn << PAGE_SHIFT, end_pfn << PAGE_SHIFT);
 }
 
@@ -346,6 +344,8 @@ void __init paging_init(void)
 	for_each_online_node(i) {
 		setup_node_zones(i); 
 	}
+
+	free_area_init_nodes(MAX_DMA_PFN, MAX_DMA32_PFN, end_pfn, end_pfn);
 } 
 
 /* [numa=off] */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-mm6-103-x86_use_init_nodes/arch/x86_64/mm/srat.c linux-2.6.17-mm6-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c
--- linux-2.6.17-mm6-103-x86_use_init_nodes/arch/x86_64/mm/srat.c	2006-07-05 14:31:12.000000000 +0100
+++ linux-2.6.17-mm6-104-x86_64_use_init_nodes/arch/x86_64/mm/srat.c	2006-07-06 11:09:46.000000000 +0100
@@ -91,6 +91,7 @@ static __init void bad_srat(void)
 		apicid_to_node[i] = NUMA_NO_NODE;
 	for (i = 0; i < MAX_NUMNODES; i++)
 		nodes_add[i].start = nodes[i].end = 0;
+	remove_all_active_ranges();
 }
 
 static __init inline int srat_disabled(void)
@@ -173,7 +174,7 @@ static int hotadd_enough_memory(struct b
 
 	if (mem < 0)
 		return 0;
-	allowed = (end_pfn - e820_hole_size(0, end_pfn)) * PAGE_SIZE;
+	allowed = (end_pfn - absent_pages_in_range(0, end_pfn)) * PAGE_SIZE;
 	allowed = (allowed / 100) * hotadd_percent;
 	if (allocated + mem > allowed) {
 		unsigned long range;
@@ -223,7 +224,7 @@ static int reserve_hotadd(int node, unsi
 	}
 
 	/* This check might be a bit too strict, but I'm keeping it for now. */
-	if (e820_hole_size(s_pfn, e_pfn) != e_pfn - s_pfn) {
+	if (absent_pages_in_range(s_pfn, e_pfn) != e_pfn - s_pfn) {
 		printk(KERN_ERR "SRAT: Hotplug area has existing memory\n");
 		return -1;
 	}
@@ -317,6 +318,8 @@ acpi_numa_memory_affinity_init(struct ac
 
 	printk(KERN_INFO "SRAT: Node %u PXM %u %Lx-%Lx\n", node, pxm,
 	       nd->start, nd->end);
+	e820_register_active_regions(node, nd->start >> PAGE_SHIFT,
+						nd->end >> PAGE_SHIFT);
 
 #ifdef RESERVE_HOTADD
  	if (ma->flags.hot_pluggable && reserve_hotadd(node, start, end) < 0) {
@@ -341,13 +344,13 @@ static int nodes_cover_memory(void)
 		unsigned long s = nodes[i].start >> PAGE_SHIFT;
 		unsigned long e = nodes[i].end >> PAGE_SHIFT;
 		pxmram += e - s;
-		pxmram -= e820_hole_size(s, e);
+		pxmram -= absent_pages_in_range(s, e);
 		pxmram -= nodes_add[i].end - nodes_add[i].start;
 		if ((long)pxmram < 0)
 			pxmram = 0;
 	}
 
-	e820ram = end_pfn - e820_hole_size(0, end_pfn);
+	e820ram = end_pfn - absent_pages_in_range(0, end_pfn);
 	/* We seem to lose 3 pages somewhere. Allow a bit of slack. */
 	if ((long)(e820ram - pxmram) >= 1*1024*1024) {
 		printk(KERN_ERR
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-mm6-103-x86_use_init_nodes/include/asm-x86_64/e820.h linux-2.6.17-mm6-104-x86_64_use_init_nodes/include/asm-x86_64/e820.h
--- linux-2.6.17-mm6-103-x86_use_init_nodes/include/asm-x86_64/e820.h	2006-06-18 02:49:35.000000000 +0100
+++ linux-2.6.17-mm6-104-x86_64_use_init_nodes/include/asm-x86_64/e820.h	2006-07-06 11:09:46.000000000 +0100
@@ -50,10 +50,9 @@ extern void e820_print_map(char *who);
 extern int e820_any_mapped(unsigned long start, unsigned long end, unsigned type);
 extern int e820_all_mapped(unsigned long start, unsigned long end, unsigned type);
 
-extern void e820_bootmem_free(pg_data_t *pgdat, unsigned long start,unsigned long end);
 extern void e820_setup_gap(void);
-extern unsigned long e820_hole_size(unsigned long start_pfn,
-				    unsigned long end_pfn);
+extern void e820_register_active_regions(int nid,
+				unsigned long start_pfn, unsigned long end_pfn);
 
 extern void __init parse_memopt(char *p, char **end);
 extern void __init parse_memmapopt(char *p, char **end);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.17-mm6-103-x86_use_init_nodes/include/asm-x86_64/proto.h linux-2.6.17-mm6-104-x86_64_use_init_nodes/include/asm-x86_64/proto.h
--- linux-2.6.17-mm6-103-x86_use_init_nodes/include/asm-x86_64/proto.h	2006-07-05 14:31:17.000000000 +0100
+++ linux-2.6.17-mm6-104-x86_64_use_init_nodes/include/asm-x86_64/proto.h	2006-07-06 11:09:46.000000000 +0100
@@ -24,8 +24,6 @@ extern void mtrr_bp_init(void);
 #define mtrr_bp_init() do {} while (0)
 #endif
 extern void init_memory_mapping(unsigned long start, unsigned long end);
-extern void size_zones(unsigned long *z, unsigned long *h,
-			unsigned long start_pfn, unsigned long end_pfn);
 
 extern void system_call(void); 
 extern int kernel_syscall(void);

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes
  2006-04-11 10:39 [PATCH 0/6] [RFC] Sizing zones and holes in an architecture independent manner Mel Gorman
@ 2006-04-11 10:41 ` Mel Gorman
  0 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2006-04-11 10:41 UTC (permalink / raw)
  To: linuxppc-dev, davej, tony.luck, ak, linux-kernel; +Cc: Mel Gorman


Size zones and holes in an architecture independent manner for x86_64.

This has only been boot tested on an x86_64 with NUMA and SRAT.


 arch/x86_64/Kconfig        |    3 ++
 arch/x86_64/kernel/e820.c  |   18 ++++++++++++
 arch/x86_64/mm/init.c      |   60 +---------------------------------------
 arch/x86_64/mm/numa.c      |   15 +++++-----
 include/asm-x86_64/e820.h  |    1 
 include/asm-x86_64/proto.h |    2 -
 6 files changed, 32 insertions(+), 67 deletions(-)

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-103-x86_use_init_nodes/arch/x86_64/Kconfig linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/x86_64/Kconfig
--- linux-2.6.17-rc1-103-x86_use_init_nodes/arch/x86_64/Kconfig	2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/x86_64/Kconfig	2006-04-10 10:54:38.000000000 +0100
@@ -73,6 +73,9 @@ config ARCH_MAY_HAVE_PC_FDC
 	bool
 	default y
 
+config ARCH_POPULATES_NODE_MAP
+	def_bool y
+
 config DMI
 	bool
 	default y
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-103-x86_use_init_nodes/arch/x86_64/kernel/e820.c linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/x86_64/kernel/e820.c
--- linux-2.6.17-rc1-103-x86_use_init_nodes/arch/x86_64/kernel/e820.c	2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/x86_64/kernel/e820.c	2006-04-10 10:54:38.000000000 +0100
@@ -220,6 +220,24 @@ e820_hole_size(unsigned long start_pfn, 
 	return ((end - start) - ram) >> PAGE_SHIFT;
 }
 
+/* Walk the e820 map and register active regions */
+unsigned long __init
+e820_register_active_regions(void)
+{
+	int i;
+	unsigned long start_pfn, end_pfn;
+	for (i = 0; i < e820.nr_map; i++) {
+		struct e820entry *ei = &e820.map[i];
+		if (ei->type != E820_RAM)
+			continue;
+
+		start_pfn = round_up(ei->addr, PAGE_SIZE);
+		end_pfn = round_down(ei->addr + ei->size, PAGE_SIZE);
+
+		add_active_range(0, start_pfn, end_pfn);
+	}
+}
+
 /*
  * Mark e820 reserved areas as busy for the resource manager.
  */
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-103-x86_use_init_nodes/arch/x86_64/mm/init.c linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/x86_64/mm/init.c
--- linux-2.6.17-rc1-103-x86_use_init_nodes/arch/x86_64/mm/init.c	2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/x86_64/mm/init.c	2006-04-10 10:54:38.000000000 +0100
@@ -405,69 +405,13 @@ void __cpuinit zap_low_mappings(int cpu)
 	__flush_tlb_all();
 }
 
-/* Compute zone sizes for the DMA and DMA32 zones in a node. */
-__init void
-size_zones(unsigned long *z, unsigned long *h,
-	   unsigned long start_pfn, unsigned long end_pfn)
-{
- 	int i;
- 	unsigned long w;
-
- 	for (i = 0; i < MAX_NR_ZONES; i++)
- 		z[i] = 0;
-
- 	if (start_pfn < MAX_DMA_PFN)
- 		z[ZONE_DMA] = MAX_DMA_PFN - start_pfn;
- 	if (start_pfn < MAX_DMA32_PFN) {
- 		unsigned long dma32_pfn = MAX_DMA32_PFN;
- 		if (dma32_pfn > end_pfn)
- 			dma32_pfn = end_pfn;
- 		z[ZONE_DMA32] = dma32_pfn - start_pfn;
- 	}
- 	z[ZONE_NORMAL] = end_pfn - start_pfn;
-
- 	/* Remove lower zones from higher ones. */
- 	w = 0;
- 	for (i = 0; i < MAX_NR_ZONES; i++) {
- 		if (z[i])
- 			z[i] -= w;
- 	        w += z[i];
-	}
-
-	/* Compute holes */
-	w = start_pfn;
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		unsigned long s = w;
-		w += z[i];
-		h[i] = e820_hole_size(s, w);
-	}
-
-	/* Add the space pace needed for mem_map to the holes too. */
-	for (i = 0; i < MAX_NR_ZONES; i++)
-		h[i] += (z[i] * sizeof(struct page)) / PAGE_SIZE;
-
-	/* The 16MB DMA zone has the kernel and other misc mappings.
- 	   Account them too */
-	if (h[ZONE_DMA]) {
-		h[ZONE_DMA] += dma_reserve;
-		if (h[ZONE_DMA] >= z[ZONE_DMA]) {
-			printk(KERN_WARNING
-				"Kernel too large and filling up ZONE_DMA?\n");
-			h[ZONE_DMA] = z[ZONE_DMA];
-		}
-	}
-}
-
 #ifndef CONFIG_NUMA
 void __init paging_init(void)
 {
-	unsigned long zones[MAX_NR_ZONES], holes[MAX_NR_ZONES];
-
 	memory_present(0, 0, end_pfn);
 	sparse_init();
-	size_zones(zones, holes, 0, end_pfn);
-	free_area_init_node(0, NODE_DATA(0), zones,
-			    __pa(PAGE_OFFSET) >> PAGE_SHIFT, holes);
+	e820_register_active_regions();
+	free_area_init_nodes(MAX_DMA_PFN, MAX_DMA32_PFN, end_pfn, end_pfn);
 }
 #endif
 
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-103-x86_use_init_nodes/arch/x86_64/mm/numa.c linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/x86_64/mm/numa.c
--- linux-2.6.17-rc1-103-x86_use_init_nodes/arch/x86_64/mm/numa.c	2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-104-x86_64_use_init_nodes/arch/x86_64/mm/numa.c	2006-04-10 10:54:38.000000000 +0100
@@ -149,13 +149,12 @@ void __init setup_node_bootmem(int nodei
 void __init setup_node_zones(int nodeid)
 { 
 	unsigned long start_pfn, end_pfn, memmapsize, limit;
-	unsigned long zones[MAX_NR_ZONES];
-	unsigned long holes[MAX_NR_ZONES];
 
  	start_pfn = node_start_pfn(nodeid);
  	end_pfn = node_end_pfn(nodeid);
+	add_active_range(nodeid, start_pfn, end_pfn);
 
-	Dprintk(KERN_INFO "Setting up node %d %lx-%lx\n",
+	Dprintk(KERN_INFO "Setting up memmap for node %d %lx-%lx\n",
 		nodeid, start_pfn, end_pfn);
 
 	/* Try to allocate mem_map at end to not fill up precious <4GB
@@ -167,10 +166,6 @@ void __init setup_node_zones(int nodeid)
 				memmapsize, SMP_CACHE_BYTES, 
 				round_down(limit - memmapsize, PAGE_SIZE), 
 				limit);
-
-	size_zones(zones, holes, start_pfn, end_pfn);
-	free_area_init_node(nodeid, NODE_DATA(nodeid), zones,
-			    start_pfn, holes);
 } 
 
 void __init numa_init_array(void)
@@ -312,12 +307,18 @@ static void __init arch_sparse_init(void
 void __init paging_init(void)
 { 
 	int i;
+	unsigned long max_normal_pfn = 0;
 
 	arch_sparse_init();
 
 	for_each_online_node(i) {
 		setup_node_zones(i); 
+		if (max_normal_pfn < node_end_pfn(i))
+			max_normal_pfn = node_end_pfn(i);
 	}
+
+	free_area_init_nodes(MAX_DMA_PFN, MAX_DMA32_PFN, max_normal_pfn,
+								max_normal_pfn);
 } 
 
 /* [numa=off] */
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-103-x86_use_init_nodes/include/asm-x86_64/e820.h linux-2.6.17-rc1-104-x86_64_use_init_nodes/include/asm-x86_64/e820.h
--- linux-2.6.17-rc1-103-x86_use_init_nodes/include/asm-x86_64/e820.h	2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-104-x86_64_use_init_nodes/include/asm-x86_64/e820.h	2006-04-10 10:54:38.000000000 +0100
@@ -53,6 +53,7 @@ extern void e820_bootmem_free(pg_data_t 
 extern void e820_setup_gap(void);
 extern unsigned long e820_hole_size(unsigned long start_pfn,
 				    unsigned long end_pfn);
+extern unsigned long e820_register_active_regions(void);
 
 extern void __init parse_memopt(char *p, char **end);
 extern void __init parse_memmapopt(char *p, char **end);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.17-rc1-103-x86_use_init_nodes/include/asm-x86_64/proto.h linux-2.6.17-rc1-104-x86_64_use_init_nodes/include/asm-x86_64/proto.h
--- linux-2.6.17-rc1-103-x86_use_init_nodes/include/asm-x86_64/proto.h	2006-04-03 04:22:10.000000000 +0100
+++ linux-2.6.17-rc1-104-x86_64_use_init_nodes/include/asm-x86_64/proto.h	2006-04-10 10:54:38.000000000 +0100
@@ -24,8 +24,6 @@ extern void mtrr_bp_init(void);
 #define mtrr_bp_init() do {} while (0)
 #endif
 extern void init_memory_mapping(unsigned long start, unsigned long end);
-extern void size_zones(unsigned long *z, unsigned long *h,
-			unsigned long start_pfn, unsigned long end_pfn);
 
 extern void system_call(void); 
 extern int kernel_syscall(void);

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2006-09-04 15:36 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-05-08 14:10 [PATCH 0/6] Sizing zones and holes in an architecture independent manner V6 Mel Gorman
2006-05-08 14:10 ` [PATCH 1/6] Introduce mechanism for registering active regions of memory Mel Gorman
2006-05-08 14:11 ` [PATCH 2/6] Have Power use add_active_range() and free_area_init_nodes() Mel Gorman
2006-05-08 14:11 ` [PATCH 3/6] Have x86 use add_active_range() and free_area_init_nodes Mel Gorman
2006-05-08 14:11 ` [PATCH 4/6] Have x86_64 " Mel Gorman
2006-05-20 20:59   ` Andrew Morton
2006-05-20 21:27     ` Andi Kleen
2006-05-20 21:40       ` Andrew Morton
2006-05-20 22:17         ` Andi Kleen
2006-05-20 22:54           ` Andrew Morton
2006-05-21 16:20       ` Mel Gorman
2006-05-21 15:50     ` Mel Gorman
2006-05-21 19:08       ` Andrew Morton
2006-05-21 22:23         ` Mel Gorman
2006-05-23 18:01     ` Mel Gorman
2006-05-08 14:12 ` [PATCH 5/6] Have ia64 " Mel Gorman
2006-05-15  3:31   ` Andrew Morton
2006-05-15  8:21     ` Andy Whitcroft
2006-05-15 10:00       ` Nick Piggin
2006-05-15 10:19         ` Andy Whitcroft
2006-05-15 10:29           ` KAMEZAWA Hiroyuki
2006-05-15 10:47             ` KAMEZAWA Hiroyuki
2006-05-15 11:02             ` Andy Whitcroft
2006-05-16  0:31             ` Nick Piggin
2006-05-16  1:34               ` KAMEZAWA Hiroyuki
2006-05-16  2:11                 ` Nick Piggin
2006-05-15 12:27     ` Mel Gorman
2006-05-15 22:44       ` Mel Gorman
2006-05-19 14:03     ` Mel Gorman
2006-05-19 14:23       ` Andy Whitcroft
2006-05-08 14:12 ` [PATCH 6/6] Break out memory initialisation code from page_alloc.c to mem_init.c Mel Gorman
2006-05-09  1:47   ` Nick Piggin
2006-05-09  8:24     ` Mel Gorman
  -- strict thread matches above, loose matches on Subject: below --
2006-08-21 13:45 [PATCH 0/6] Sizing zones and holes in an architecture independent manner V9 Mel Gorman
2006-08-21 13:46 ` [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes Mel Gorman
2006-08-30 20:57   ` Keith Mannthey
2006-08-31 15:49     ` Mel Gorman
2006-08-31 16:25       ` Mika Penttilä
2006-08-31 17:01         ` Mel Gorman
2006-08-31 17:40           ` Mika Penttilä
2006-08-31 17:52       ` Keith Mannthey
2006-08-31 18:40         ` Mel Gorman
2006-09-01  3:08           ` Keith Mannthey
2006-09-01  8:33             ` Mel Gorman
2006-09-01  8:46               ` Mika Penttilä
2006-09-04 15:36             ` Mel Gorman
2006-07-08 11:10 [PATCH 0/6] Sizing zones and holes in an architecture independent manner V8 Mel Gorman
2006-07-08 11:12 ` [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes Mel Gorman
2006-04-11 10:39 [PATCH 0/6] [RFC] Sizing zones and holes in an architecture independent manner Mel Gorman
2006-04-11 10:41 ` [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes Mel Gorman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).