* Re: [Discontig-devel] [PATCH] another discontig patch
From: Christoph Hellwig @ 2003-06-21  9:06 UTC
  To: linux-ia64

On Thu, Jun 19, 2003 at 03:28:29PM -0700, Jesse Barnes wrote:
> @@ -703,6 +702,8 @@
>  	 * get_free_pages() cannot be used before cpu_init() done.  BSP allocates
>  	 * "NR_CPUS" pages for all CPUs to avoid that AP calls get_zeroed_page().
>  	 */
> +#ifndef CONFIG_DISCONTIGMEM
> +	/* for discontig machines, we do this in discontig.c */
>  	if (smp_processor_id() == 0) {
>  		cpu_data = __alloc_bootmem(PERCPU_PAGE_SIZE * NR_CPUS, PERCPU_PAGE_SIZE,
>  					   __pa(MAX_DMA_ADDRESS));
> @@ -712,15 +713,13 @@
>  			cpu_data += PERCPU_PAGE_SIZE;
>  		}
>  	}
> +#endif
>  	cpu_data = __per_cpu_start + __per_cpu_offset[smp_processor_id()];
>  #else /* !CONFIG_SMP */
>  	cpu_data = __phys_per_cpu_start;
>  #endif /* !CONFIG_SMP */

Maybe this whole code should be abstracted out then?  Also, the use
of smp_processor_id() here is not preempt-safe, but I'm not sure
whether preemption is already enabled at that point.
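
If it is, the usual get_cpu()/put_cpu() pair would at least be safe.
Untested sketch, not a drop-in replacement:

	int cpu = get_cpu();	/* disables preemption */

	if (cpu == 0) {
		cpu_data = __alloc_bootmem(PERCPU_PAGE_SIZE * NR_CPUS,
					   PERCPU_PAGE_SIZE,
					   __pa(MAX_DMA_ADDRESS));
		/* ... per-cpu copies as in the patch ... */
	}
	cpu_data = __per_cpu_start + __per_cpu_offset[cpu];
	put_cpu();		/* re-enables preemption */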

> +static struct ia64_node_data	*boot_node_data[NR_NODES] __initdata;
> +static pg_data_t		*pg_data_ptr[NR_NODES] __initdata;
> +static bootmem_data_t		bdata[NR_NODES] __initdata;
> +static unsigned long		boot_pernode[NR_NODES] __initdata;
> +static unsigned long		boot_pernodesize[NR_NODES] __initdata;

Maybe it's time for special pernode data like the percpu data?
Okay, okay, not relevant for this patch yet...

>  struct page *zero_page_memmap_ptr;		/* map entry for zero page */
>  
> +extern int filter_rsvd_memory(unsigned long start, unsigned long end, void *arg);

Move this to a header?

> +#if defined(CONFIG_VIRTUAL_MEM_MAP) && !defined(CONFIG_DISCONTIGMEM)
> +  extern int ia64_pfn_valid (unsigned long pfn);
> +# define pfn_valid(pfn)	(((pfn) < max_mapnr) && ia64_pfn_valid(pfn))
> +#elif defined(CONFIG_VIRTUAL_MEM_MAP) && defined(CONFIG_DISCONTIGMEM)
> +  extern int ia64_pfn_valid (unsigned long pfn);
> +  extern unsigned long max_low_pfn;
> +# define pfn_valid(pfn)	(((pfn) < max_low_pfn) && ia64_pfn_valid(pfn))
> +#else
> +# define pfn_valid(pfn)	((pfn) < max_mapnr)
> +#endif

Hmm, this ifdef mess looks ugly.  Can't we use max_low_pfn for the
!CONFIG_DISCONTIGMEM case too, and simplify it to something like:

#if defined(CONFIG_VIRTUAL_MEM_MAP)
extern int ia64_pfn_valid(unsigned long pfn);
#else
# define ia64_pfn_valid(pfn)	(1)
#endif

#define pfn_valid(pfn)	(((pfn) < max_low_pfn) && ia64_pfn_valid(pfn))
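
ia64_pfn_valid() itself just probes whether the mem_map entry for the
pfn is actually backed -- roughly the following, from memory, so treat
it as a sketch:

int
ia64_pfn_valid (unsigned long pfn)
{
	char byte;

	/* succeeds only if the struct page for this pfn is mapped */
	return __get_user(byte, (char *) pfn_to_page(pfn)) == 0;
}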

> +
> +#if defined(CONFIG_DISCONTIGMEM) && defined(CONFIG_VIRTUAL_MEM_MAP)
> +  extern struct page *vmem_map;
> +# define pfn_to_page(pfn)	(vmem_map + (pfn))
> +# define page_to_pfn(page)	((unsigned long) (page - vmem_map))
> +#else
> +# define pfn_to_page(pfn)	(mem_map + (pfn))
> +# define page_to_pfn(page)	((unsigned long) (page - mem_map))
> +#endif /* CONFIG_DISCONTIGMEM && CONFIG_VIRTUAL_MEM_MAP */
> +

Hmm, doesn't CONFIG_VIRTUAL_MEM_MAP && !CONFIG_DISCONTIGMEM need
vmem_map too?  Again, the ifdef mess is horrible :)

What about:

#ifndef CONFIG_VIRTUAL_MEM_MAP
#define	vmem_map		mem_map
#endif

#define pfn_to_page(pfn)	(vmem_map + (pfn))
#define page_to_pfn(page)	((unsigned long) (page - vmem_map))

In fact I wonder what's so special about mem_map that the symbol
can't be used for the vmalloc'ed version..

BTW, what about per-node memmaps for SN2 like it's done for NUMAQ?


>  #ifdef CONFIG_NUMA
>  	struct ia64_node_data *node_data;
> +	int nodeid;
> +	struct cpuinfo_ia64 *cpu_data[NR_CPUS];
>  #endif

Should the big array be before the nodeid field?
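
I.e. a sketch of the suggested ordering (assuming the idea is to keep
the small scalar from sitting in the middle):

#ifdef CONFIG_NUMA
	struct ia64_node_data *node_data;
	struct cpuinfo_ia64 *cpu_data[NR_CPUS];
	int nodeid;
#endif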



* Re: [Discontig-devel] [PATCH] another discontig patch
From: Martin J. Bligh @ 2003-06-21 14:48 UTC
  To: linux-ia64

>> +static struct ia64_node_data	*boot_node_data[NR_NODES] __initdata;
>> +static pg_data_t		*pg_data_ptr[NR_NODES] __initdata;
>> +static bootmem_data_t		bdata[NR_NODES] __initdata;
>> +static unsigned long		boot_pernode[NR_NODES] __initdata;
>> +static unsigned long		boot_pernodesize[NR_NODES] __initdata;

The only use that I can *see* (without looking very hard) for 
pg_data_ptr is as pg_data_ptr[node]->bdata, for which you already have
bdata[node], don't you? The fact that you have both pg_data_ptr and
pg_data_ptrs, which seem to do the same thing, but one as initdata,
and one not, is also rather confusing ...

memcpy(boot_node_data[0]->pg_data_ptrs, pg_data_ptr, sizeof(pg_data_ptr));

hmmm.

And aren't boot_pernode and boot_pernodesize really part of bdata?
Seems like there's a lot of complexity in order to save a couple of
pointer dereferences ... during boot ;-) But maybe I'm just misreading
it ;-)

> Maybe it's time for special pernode data like the percpu data?
> Okay, okay, not relevant for this patch yet...

># ifndef CONFIG_VIRTUAL_MEM_MAP
># define	vmem_map		mem_map
># endif
> 
># define pfn_to_page(pfn)	(vmem_map + (pfn))
># define page_to_pfn(page)	((unsigned long) (page - vmem_map))
> 
> In fact I wonder what's so special about mem_map that the symbol
> can't be used for the vmalloc'ed version..
> 
> BTW, what about per-node memmaps for SN2 like it's done for NUMAQ?

I really don't see how you *can't* do that, and have it still work.
If the mem_map is placed in node local memory, and not a size that 
happens to be a complete number of pages, even if you pack it down 
into a vmem_map, it's still not contiguous. So adding a pfn (physical 
page frame number) to the base of vmem_map can NOT give you the correct 
address.

Presumably this *is* working for you ... but I'm buggered if I know how ;-)
Maybe I just need more coffee ;-)

Thanks,

M.



* Re: [Discontig-devel] [PATCH] another discontig patch
From: Jesse Barnes @ 2003-06-22  5:53 UTC
  To: linux-ia64

Thanks for looking at this, Christoph.  Comments below.

On Sat, Jun 21, 2003 at 10:06:46AM +0100, Christoph Hellwig wrote:
> On Thu, Jun 19, 2003 at 03:28:29PM -0700, Jesse Barnes wrote:
> >  #else /* !CONFIG_SMP */
> >  	cpu_data = __phys_per_cpu_start;
> >  #endif /* !CONFIG_SMP */
> 
> Maybe this whole code should be abstracted out then?  Also, the use
> of smp_processor_id() here is not preempt-safe, but I'm not sure
> whether preemption is already enabled at that point.

I think we're OK preempt-wise here, but you're right that it could
probably be abstracted a bit.  The issue is that we need to create
per-node data structures containing each node's info before the per-cpu
stuff is initialized, so we do it all before the bootmem allocator is
even set up, while we discover memory (see
discontig.c:find_pernode_space() I think).
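
You can see the ordering in discontig_mem_init() in the patch --
everything per-node (including the per-cpu pages) is carved out during
the efi memmap walks, before bootmem is usable:

	efi_memmap_walk(filter_rsvd_memory, build_maps);
	efi_memmap_walk(filter_rsvd_memory, find_pernode_space);
	efi_memmap_walk(filter_rsvd_memory, discontig_free_bootmem_node);
	discontig_reserve_bootmem();
	initialize_pernode_data();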

> > +static struct ia64_node_data	*boot_node_data[NR_NODES] __initdata;
> > +static pg_data_t		*pg_data_ptr[NR_NODES] __initdata;
> > +static bootmem_data_t		bdata[NR_NODES] __initdata;
> > +static unsigned long		boot_pernode[NR_NODES] __initdata;
> > +static unsigned long		boot_pernodesize[NR_NODES] __initdata;
> 
> Maybe it's time for special pernode data like the percpu data?
> > Okay, okay, not relevant for this patch yet...

It's already there.  It's really necessary to get decent performance,
and you have to watch out for cache aliasing effects.
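
The striding comes from the NODEDATA_ALIGN() macro in the patch: bump
the allocation to a 1MB boundary, then offset it by the node number so
per-node structures on different nodes don't alias in the cache:

#define NODEDATA_ALIGN(addr, node)	((((addr) + 1024*1024-1) & ~(1024*1024-1)) + (node)*PERCPU_PAGE_SIZE)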

> > +extern int filter_rsvd_memory(unsigned long start, unsigned long end, void *arg);
> 
> Move this to a header?

Yeah, maybe it should be in asm/page.h?  Or maybe we need an
asm/discontig.h?

> > +# define pfn_valid(pfn)	((pfn) < max_mapnr)
> > +#endif
> 
> Hmm, this ifdef mess looks ugly.  Can't we use max_low_pfn for the
> !CONFIG_DISCONTIGMEM case too, and simplify it to something like:

Agreed, no question.  I think this could use some additional cleanup.

> #if defined(CONFIG_VIRTUAL_MEM_MAP)
> extern int ia64_pfn_valid(unsigned long pfn);
> #else
> # define ia64_pfn_valid(pfn)	(1)
> #endif
> 
> #define pfn_valid(pfn)	(((pfn) < max_low_pfn) && ia64_pfn_valid(pfn))
> 
> > +
> > +#if defined(CONFIG_DISCONTIGMEM) && defined(CONFIG_VIRTUAL_MEM_MAP)
> > +  extern struct page *vmem_map;
> > +# define pfn_to_page(pfn)	(vmem_map + (pfn))
> > +# define page_to_pfn(page)	((unsigned long) (page - vmem_map))
> > +#else
> > +# define pfn_to_page(pfn)	(mem_map + (pfn))
> > +# define page_to_pfn(page)	((unsigned long) (page - mem_map))
> > +#endif /* CONFIG_DISCONTIGMEM && CONFIG_VIRTUAL_MEM_MAP */
> > +

Though I'm not sure if this is the right way to do it...  If discontig
on ia64 is going to depend on virtual mem map, then making this one
#ifdef somehow is probably best.

> Hmm, doesn't CONFIG_VIRTUAL_MEM_MAP && !CONFIG_DISCONTIGMEM need
> vmem_map too?  Again, the ifdef mess is horrible :)
> 
> What about:
> 
> #ifndef CONFIG_VIRTUAL_MEM_MAP
> #define	vmem_map		mem_map
> #endif
> 
> #define pfn_to_page(pfn)	(vmem_map + (pfn))
> #define page_to_pfn(page)	((unsigned long) (page - vmem_map))
> 
> In fact I wonder what's so special about mem_map that the symbol
> can't be used for the vmalloc'ed version..

Don't know, I'll have to look some more.

> BTW, what about per-node memmaps for SN2 like it's done for NUMAQ?

AFAIK, the per-node memmaps index into the big virtual memmap, but I'd
have to look at it some more.  If we made per-node memory maps, then we
probably wouldn't need to use the virtual mem map, except that it makes
dealing with holes a bit easier (since on sn2 even memory within a node
isn't contiguous, and there are big holes between nodes).

> >  #ifdef CONFIG_NUMA
> >  	struct ia64_node_data *node_data;
> > +	int nodeid;
> > +	struct cpuinfo_ia64 *cpu_data[NR_CPUS];
> >  #endif
> 
> Should the big array be before the nodeid field?

Good catch, probably not. :)

Thanks,
Jesse


* Re: [Discontig-devel] [PATCH] another discontig patch
From: Jesse Barnes @ 2003-06-22  5:57 UTC
  To: linux-ia64

On Sat, Jun 21, 2003 at 07:48:08AM -0700, Martin J. Bligh wrote:
> >> +static struct ia64_node_data	*boot_node_data[NR_NODES] __initdata;
> >> +static pg_data_t		*pg_data_ptr[NR_NODES] __initdata;
> >> +static bootmem_data_t		bdata[NR_NODES] __initdata;
> >> +static unsigned long		boot_pernode[NR_NODES] __initdata;
> >> +static unsigned long		boot_pernodesize[NR_NODES] __initdata;
> 
> The only use that I can *see* (without looking very hard) for 
> pg_data_ptr is as pg_data_ptr[node]->bdata, for which you already have
> bdata[node], don't you? The fact that you have both pg_data_ptr and
> pg_data_ptrs, which seem to do the same thing, but one as initdata,
> and one not, is also rather confusing ...

Yeah, that's one thing that I keep meaning to remove, but didn't in the
initial versions to keep the 2.4 and 2.5 patches as close as possible.

> memcpy(boot_node_data[0]->pg_data_ptrs, pg_data_ptr, sizeof(pg_data_ptr));
> 
> hmmm.
> 
> And aren't boot_pernode and boot_pernodesize really part of bdata?
> Seems like there's a lot of complexity in order to save a couple of
> pointer dereferences ... during boot ;-) But maybe I'm just misreading
> it ;-)

Well, there's definitely some pointer magic going on here, but I think
we use that stuff once the system is up as well, though from a different
pointer.

> 
> > Maybe it's time for special pernode data like the percpu data?
> > Okay, okay, not relevant for this patch yet...
> 
> ># ifndef CONFIG_VIRTUAL_MEM_MAP
> ># define	vmem_map		mem_map
> ># endif
> > 
> ># define pfn_to_page(pfn)	(vmem_map + (pfn))
> ># define page_to_pfn(page)	((unsigned long) (page - vmem_map))
> > 
> > In fact I wonder what's so special about mem_map that the symbol
> > can't be used for the vmalloc'ed version..
> > 
> > BTW, what about per-node memmaps for SN2 like it's done for NUMAQ?
> 
> I really don't see how you *can't* do that, and have it still work.
> If the mem_map is placed in node local memory, and not a size that 
> happens to be a complete number of pages, even if you pack it down 
> into a vmem_map, it's still not contiguous. So adding a pfn (physical 
> page frame number) to the base of vmem_map can NOT give you the correct 
> address.
> 
> Presumably this *is* working for you ... but I'm buggered if I know how ;-)
> Maybe I just need more coffee ;-)

No, there are some mixed metaphors here for sure (at least, as far as I
understand it!) :).  You can find stuff either using the global vmem_map
or the per-node maps, I think, but honestly I'm still coming up to speed
on this code myself; Jack and Kimi are the real experts.

Thanks,
Jesse


* Re: [Discontig-devel] [PATCH] another discontig patch
From: Martin J. Bligh @ 2003-06-22 15:25 UTC
  To: linux-ia64

>> >> +static struct ia64_node_data	*boot_node_data[NR_NODES] __initdata;
>> >> +static pg_data_t		*pg_data_ptr[NR_NODES] __initdata;
>> >> +static bootmem_data_t		bdata[NR_NODES] __initdata;
>> >> +static unsigned long		boot_pernode[NR_NODES] __initdata;
>> >> +static unsigned long		boot_pernodesize[NR_NODES] __initdata;
>> 
>> The only use that I can *see* (without looking very hard) for 
>> pg_data_ptr is as pg_data_ptr[node]->bdata, for which you already have
>> bdata[node], don't you? The fact that you have both pg_data_ptr and
>> pg_data_ptrs, which seem to do the same thing, but one as initdata,
>> and one not, is also rather confusing ...
> 
> Yeah, that's one thing that I keep meaning to remove, but didn't in the
> initial versions to keep the 2.4 and 2.5 patches as close as possible.

OK, I don't think it's something you're introducing with your patch,
so it's probably best done separately anyway - you seem to be making
the current situation better in general ;-) 

>> > BTW, what about per-node memmaps for SN2 like it's done for NUMAQ?
>> 
>> I really don't see how you *can't* do that, and have it still work.
>> If the mem_map is placed in node local memory, and not a size that 
>> happens to be a complete number of pages, even if you pack it down 
>> into a vmem_map, it's still not contiguous. So adding a pfn (physical 
>> page frame number) to the base of vmem_map can NOT give you the correct 
>> address.
>> 
>> Presumably this *is* working for you ... but I'm buggered if I know how ;-)
>> Maybe I just need more coffee ;-)
> 
> No, there are some mixed metaphors here for sure (at least, as far as I
> understand it!) :).  You can find stuff either using the global vmemmap
> or the per-node maps I think, but honestly, I'm still coming up to speed
> on this code myself, Jack and Kimi are the real experts.

OK. Am I mistaken in thinking that the point of vmem_map is to skip over
physical holes in RAM? If not, then disregard the following ;-) Perhaps
it's just about node locality of the struct pages, but that seems odd?

Maybe if the start and end of memory areas are always hitting boundaries
aligned on a PAGE_SIZE number of pages (i.e. 16MB for 4K pages), then your
mem_map is always page-aligned, because it comes in
(sizeof (struct page) * PAGE_SIZE) chunks.
If you (plural) are relying on that, then I think it needs a huge stinking
comment - it isn't terribly obvious to innocent bystanders ;-)

Whilst that would enable you to make the array node-local, you still can't
just add a pfn to the base as an offset anyway, which is what it looks like
you were doing - that assumes it's all contiguous, e.g. this:

pfn_to_page(pfn)      (vmem_map + (pfn))

Unless you somehow warped the definition of pfn into a contiguous version
of the space, in which case I think people's heads may explode ... we've
always had pfn = phys_addr / PAGE_SIZE as an identity.
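
(To put numbers on the chunk claim: a 16MB bank is 4096 4K pages, so
its mem_map slice is sizeof(struct page) * 4096 bytes, i.e. exactly
sizeof(struct page) whole pages -- page-aligned no matter what size the
struct happens to be.  That's the property being relied on, if I follow
it correctly.)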

M.



* Re: [Discontig-devel] [PATCH] another discontig patch
From: William Lee Irwin III @ 2003-06-23 17:20 UTC
  To: linux-ia64

At some point in the past, someone wrote:
>># define pfn_to_page(pfn)	(vmem_map + (pfn))
>># define page_to_pfn(page)	((unsigned long) (page - vmem_map))

On Sat, Jun 21, 2003 at 07:48:08AM -0700, Martin J. Bligh wrote:
> I really don't see how you *can't* do that, and have it still work.
> If the mem_map is placed in node local memory, and not a size that 
> happens to be a complete number of pages, even if you pack it down 
> into a vmem_map, it's still not contiguous. So adding a pfn (physical 
> page frame number) to the base of vmem_map can NOT give you the correct 
> address.
> Presumably this *is* working for you ... but I'm buggered if I know how ;-)
> Maybe I just need more coffee ;-)

It's a sparse array. The interstices between nodes are padded with
empty virtualspace in order to cheapen the indexing operation, i.e. the
nodes' local mem_maps are virtually positioned according to the range
of pfns they span. If 32-bit weren't as pressed for virtualspace, it
could do likewise.
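
Concretely (addresses invented for illustration, 16KB pages assumed):

/*
 * node 0: phys 0x00000000000-0x0003fffffff -> pfns 0x0000000-0x000ffff
 * node 1: phys 0x10000000000-0x1003fffffff -> pfns 0x4000000-0x400ffff
 *
 * The mem_map pieces get created at vmem_map + 0x0000000 and
 * vmem_map + 0x4000000; vmem_map + 0x0010000 .. 0x3ffffff is left as
 * an unmapped virtual hole, so pfn_to_page(pfn) == vmem_map + pfn
 * holds for every pfn that exists, and pfn_valid() probes for a
 * backing page instead of comparing against a limit.
 */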


-- wli


* Re: [Discontig-devel] [PATCH] another discontig patch
From: Jesse Barnes @ 2003-07-16 19:29 UTC
  To: linux-ia64

Sorry it took so long for me to reply to this, but I've been on vacation
for the past 3 weeks.  Here's a new discontig patch that incorporates
some of the feedback from this thread.  Comments and/or integration into
2.6 appreciated.

Thanks,
Jesse


# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	           ChangeSet	1.1396  -> 1.1397 
#	include/asm-ia64/page.h	1.19    -> 1.20   
#	arch/ia64/kernel/setup.c	1.53    -> 1.54   
#	include/asm-ia64/pgtable.h	1.28    -> 1.29   
#	        mm/bootmem.c	1.18    -> 1.19   
#	include/asm-ia64/numa.h	1.5     -> 1.6    
#	include/asm-ia64/processor.h	1.48    -> 1.49   
#	 arch/ia64/mm/init.c	1.46    -> 1.47   
#	include/asm-ia64/nodedata.h	1.3     -> 1.4    
#	arch/ia64/mm/discontig.c	1.4     -> 1.5    
#	   arch/ia64/Kconfig	1.38    -> 1.39   
#	drivers/acpi/Kconfig	1.12    -> 1.13   
#	include/asm-ia64/mmzone.h	1.4     -> 1.5    
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 03/07/16	jbarnes@tomahawk.engr.sgi.com	1.1397
# discontig update
# --------------------------------------------
#
diff -Nru a/arch/ia64/Kconfig b/arch/ia64/Kconfig
--- a/arch/ia64/Kconfig	Wed Jul 16 12:28:28 2003
+++ b/arch/ia64/Kconfig	Wed Jul 16 12:28:28 2003
@@ -210,8 +210,8 @@
 	  system with an A0 or A1 stepping CPU.
 
 config NUMA
-	bool "Enable NUMA support" if IA64_GENERIC || IA64_DIG || IA64_HP_ZX1
-	default y if IA64_SGI_SN2
+	bool
+	default y if IA64_SGI_SN2 || IA64_GENERIC
 	help
 	  Say Y to compile the kernel to support NUMA (Non-Uniform Memory
 	  Access).  This option is for configuring high-end multiprocessor
@@ -235,8 +235,7 @@
 
 config DISCONTIGMEM
 	bool
-	depends on IA64_SGI_SN2 || (IA64_GENERIC || IA64_DIG || IA64_HP_ZX1) && NUMA
-	default y
+	default y if IA64_SGI_SN2 || IA64_GENERIC
 	help
 	  Say Y to support efficient handling of discontiguous physical memory,
 	  for architectures which are either NUMA (Non-Uniform Memory Access)
@@ -245,8 +244,7 @@
 
 config VIRTUAL_MEM_MAP
 	bool "Enable Virtual Mem Map"
-	depends on !NUMA
-	default y if IA64_GENERIC || IA64_DIG || IA64_HP_ZX1
+	default y if !IA64_HP_SIM
 	help
 	  Say Y to compile the kernel with support for a virtual mem map.
 	  This is an alternate method of supporting large holes in the
@@ -259,8 +257,8 @@
 	  are unsure, say Y.
 
 config IA64_MCA
-	bool "Enable IA-64 Machine Check Abort" if IA64_GENERIC || IA64_DIG || IA64_HP_ZX1
-	default y if IA64_SGI_SN2
+	bool "Enable IA-64 Machine Check Abort"
+	default y if !IA64_HP_SIM
 	help
 	  Say Y here to enable machine check support for IA-64.  If you're
 	  unsure, answer Y.
@@ -292,43 +290,12 @@
 	depends on IA64_GENERIC || IA64_DIG || IA64_HP_ZX1 || IA64_SGI_SN2
 	default y
 
-config IA64_SGI_SN_DEBUG
-	bool "Enable extra debugging code"
-	depends on IA64_SGI_SN2
-	help
-	  Turns on extra debugging code in the SGI SN (Scalable NUMA) platform
-	  for IA-64.  Unless you are debugging problems on an SGI SN IA-64 box,
-	  say N.
-
 config IA64_SGI_SN_SIM
 	bool "Enable SGI Medusa Simulator Support"
 	depends on IA64_SGI_SN2
 	help
 	  If you are compiling a kernel that will run under SGI's IA-64
 	  simulator (Medusa) then say Y, otherwise say N.
-
-config IA64_SGI_AUTOTEST
-	bool "Enable autotest (llsc). Option to run cache test instead of booting"
-	depends on IA64_SGI_SN2
-	help
-	  Build a kernel used for hardware validation. If you include the
-	  keyword "autotest" on the boot command line, the kernel does NOT boot.
-	  Instead, it starts all cpus and runs cache coherency tests instead.
-
-	  If unsure, say N.
-
-config SERIAL_SGI_L1_PROTOCOL
-	bool "Enable protocol mode for the L1 console"
-	depends on IA64_SGI_SN2
-	help
-	  Uses protocol mode instead of raw mode for the level 1 console on the
-	  SGI SN (Scalable NUMA) platform for IA-64.  If you are compiling for
-	  an SGI SN box then Y is the recommended value, otherwise say N.
-
-config PERCPU_IRQ
-	bool
-	depends on IA64_SGI_SN2
-	default y
 
 # On IA-64, we always want an ELF /proc/kcore.
 config KCORE_ELF
diff -Nru a/arch/ia64/kernel/setup.c b/arch/ia64/kernel/setup.c
--- a/arch/ia64/kernel/setup.c	Wed Jul 16 12:28:28 2003
+++ b/arch/ia64/kernel/setup.c	Wed Jul 16 12:28:28 2003
@@ -138,7 +138,7 @@
 call_pernode_memory (unsigned long start, unsigned long end, void *arg)
 {
 	unsigned long rs, re;
-	void (*func)(unsigned long, unsigned long, int, int);
+	void (*func)(unsigned long, unsigned long, int);
 	int i;
 
 	start = PAGE_ALIGN(start);
@@ -149,22 +149,21 @@
 	func = arg;
 
 	if (!num_memblks) {
-		/*
-		 * This machine doesn't have SRAT, so call func with
-		 * nid=0, bank=0.
-		 */
+		/* No SRAT table, so assume one node (node 0) */
 		if (start < end)
-			(*func)(start, end - start, 0, 0);
+			(*func)(start, end, 0);
 		return;
 	}
 
 	for (i = 0; i < num_memblks; i++) {
-		rs = max(start, node_memblk[i].start_paddr);
-		re = min(end, node_memblk[i].start_paddr+node_memblk[i].size);
+		rs = max(__pa(start), node_memblk[i].start_paddr);
+		re = min(__pa(end), node_memblk[i].start_paddr+node_memblk[i].size);
 
 		if (rs < re)
-			(*func)(rs, re-rs, node_memblk[i].nid,
-				node_memblk[i].bank);
+			(*func)((unsigned long)__va(rs), (unsigned long)__va(re), node_memblk[i].nid);
+
+		if ((unsigned long)__va(re) == end)
+			break;
 	}
 }
 
@@ -180,7 +179,7 @@
 filter_rsvd_memory (unsigned long start, unsigned long end, void *arg)
 {
 	unsigned long range_start, range_end, prev_start;
-	void (*func)(unsigned long, unsigned long);
+	void (*func)(unsigned long, unsigned long, int);
 	int i;
 
 #if IGNORE_PFN0
@@ -202,9 +201,9 @@
 
 		if (range_start < range_end)
 #ifdef CONFIG_DISCONTIGMEM
-			call_pernode_memory(__pa(range_start), __pa(range_end), func);
+			call_pernode_memory(range_start, range_end, func);
 #else
-			(*func)(__pa(range_start), range_end - range_start);
+			(*func)(range_start, range_end, 0);
 #endif
 
 		/* nothing more available in this segment */
@@ -703,6 +702,8 @@
 	 * get_free_pages() cannot be used before cpu_init() done.  BSP allocates
 	 * "NR_CPUS" pages for all CPUs to avoid that AP calls get_zeroed_page().
 	 */
+#ifndef CONFIG_DISCONTIGMEM
+	/* for discontig machines, we do this in discontig.c */
 	if (smp_processor_id() == 0) {
 		cpu_data = __alloc_bootmem(PERCPU_PAGE_SIZE * NR_CPUS, PERCPU_PAGE_SIZE,
 					   __pa(MAX_DMA_ADDRESS));
@@ -714,6 +715,7 @@
 			per_cpu(local_per_cpu_offset, cpu) = __per_cpu_offset[cpu];
 		}
 	}
+#endif
 	cpu_data = __per_cpu_start + __per_cpu_offset[smp_processor_id()];
 #else /* !CONFIG_SMP */
 	cpu_data = __phys_per_cpu_start;
diff -Nru a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
--- a/arch/ia64/mm/discontig.c	Wed Jul 16 12:28:28 2003
+++ b/arch/ia64/mm/discontig.c	Wed Jul 16 12:28:28 2003
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2000 Silicon Graphics, Inc.  All rights reserved.
+ * Copyright (c) 2000, 2003 Silicon Graphics, Inc.  All rights reserved.
  * Copyright (c) 2001 Intel Corp.
  * Copyright (c) 2001 Tony Luck <tony.luck@intel.com>
  * Copyright (c) 2002 NEC Corp.
@@ -16,74 +16,60 @@
 #include <linux/mmzone.h>
 #include <linux/acpi.h>
 #include <linux/efi.h>
-
+#include <asm/pgalloc.h>
+#include <asm/tlb.h>
 
 /*
- * Round an address upward to the next multiple of GRANULE size.
+ * Round an address upward or downward to the next multiple of IA64_GRANULE_SIZE.
  */
+#define GRANULEROUNDDOWN(n) ((n) & ~(IA64_GRANULE_SIZE-1))
 #define GRANULEROUNDUP(n) (((n)+IA64_GRANULE_SIZE-1) & ~(IA64_GRANULE_SIZE-1))
 
-static struct ia64_node_data	*node_data[NR_NODES];
-static long			boot_pg_data[8*NR_NODES+sizeof(pg_data_t)]  __initdata;
-static pg_data_t		*pg_data_ptr[NR_NODES] __initdata;
-static bootmem_data_t		bdata[NR_NODES][NR_BANKS_PER_NODE+1] __initdata;
-
-extern int  filter_rsvd_memory (unsigned long start, unsigned long end, void *arg);
+/*
+ * Used to locate BOOT_DATA prior to initializing the node data area.
+ */
+#define BOOT_NODE_DATA(node)	pg_data_ptr[node]
 
 /*
- * Return the compact node number of this cpu. Used prior to
- * setting up the cpu_data area.
- *	Note - not fast, intended for boot use only!!
+ * To prevent cache aliasing effects, align per-node structures so that they 
+ * start at addresses that are strided by node number.
  */
-int
-boot_get_local_nodeid(void)
-{
-	int	i;
+#define NODEDATA_ALIGN(addr, node)	((((addr) + 1024*1024-1) & ~(1024*1024-1)) + (node)*PERCPU_PAGE_SIZE)
 
-	for (i = 0; i < NR_CPUS; i++)
-		if (node_cpuid[i].phys_id == hard_smp_processor_id())
-			return node_cpuid[i].nid;
 
-	/* node info missing, so nid should be 0.. */
-	return 0;
-}
+static struct ia64_node_data	*boot_node_data[NR_NODES] __initdata;
+static pg_data_t		*pg_data_ptr[NR_NODES] __initdata;
+static bootmem_data_t		bdata[NR_NODES] __initdata;
+static unsigned long		boot_pernode[NR_NODES] __initdata;
+static unsigned long		boot_pernodesize[NR_NODES] __initdata;
 
-/*
- * Return a pointer to the pg_data structure for a node.
- * This function is used ONLY in early boot before the cpu_data
- * structure is available.
- */
-pg_data_t* __init
-boot_get_pg_data_ptr(long node)
-{
-	return pg_data_ptr[node];
-}
+extern char __per_cpu_start[], __per_cpu_end[];
 
 
-/*
- * Return a pointer to the node data for the current node.
- *	(boottime initialization only)
- */
-struct ia64_node_data *
+struct ia64_node_data*
 get_node_data_ptr(void)
 {
-	return node_data[boot_get_local_nodeid()];
+	return boot_node_data[cpu_to_node_map[smp_processor_id()]];	/* ZZZ */
 }
 
 /*
  * We allocate one of the bootmem_data_t structs for each piece of memory
  * that we wish to treat as a contiguous block.  Each such block must start
- * on a BANKSIZE boundary.  Multiple banks per node is not supported.
+ * on a GRANULE boundary.  Multiple banks per node is not supported.
+ *   (Note: on SN2, all memory on a node is treated as a single bank.
+ *   Holes within the bank are supported. This works because memory
+ *   from different banks is not interleaved. The bootmap bitmap
+ *   for the node is somewhat large but not too large).
  */
 static int __init
-build_maps(unsigned long pstart, unsigned long length, int node)
+build_maps(unsigned long start, unsigned long end, int node)
 {
 	bootmem_data_t	*bdp;
 	unsigned long cstart, epfn;
 
-	bdp = pg_data_ptr[node]->bdata;
-	epfn = GRANULEROUNDUP(pstart + length) >> PAGE_SHIFT;
-	cstart = pstart & ~(BANKSIZE - 1);
+	bdp = &bdata[node];
+	epfn = GRANULEROUNDUP(__pa(end)) >> PAGE_SHIFT;
+	cstart = GRANULEROUNDDOWN(__pa(start));
 
 	if (!bdp->node_low_pfn) {
 		bdp->node_boot_start = cstart;
@@ -99,34 +85,96 @@
 	return 0;
 }
 
+
+/*
+ * Count the number of cpus on the node
+ */
+static __inline__ int
+count_cpus(int node)
+{
+	int cpu, n=0;
+
+	for (cpu=0; cpu < NR_CPUS; cpu++)
+		if (node == node_cpuid[cpu].nid)
+			n++;
+	return n;
+}
+
+
 /*
- * Find space on each node for the bootmem map.
+ * Find space on each node for the bootmem map & other per-node data structures.
  *
  * Called by efi_memmap_walk to find boot memory on each node. Note that
  * only blocks that are free are passed to this routine (currently filtered by
  * free_available_memory).
  */
 static int __init
-find_bootmap_space(unsigned long pstart, unsigned long length, int node)
+find_pernode_space(unsigned long start, unsigned long end, int node)
 {
-	unsigned long	mapsize, pages, epfn;
+	unsigned long	mapsize, pages, epfn, map=0, cpu, cpus;
+	unsigned long	pernodesize=0, pernode;
+	unsigned long	cpu_data;
+	unsigned long	pstart, length;
 	bootmem_data_t	*bdp;
 
+	pstart = __pa(start);
+	length = end - start;
 	epfn = (pstart + length) >> PAGE_SHIFT;
-	bdp = &pg_data_ptr[node]->bdata[0];
+	bdp = &bdata[node];
 
 	if (pstart < bdp->node_boot_start || epfn > bdp->node_low_pfn)
 		return 0;
 
-	if (!bdp->node_bootmem_map) {
+	if (!boot_pernode[node]) {
+		cpus = count_cpus(node);
+		pernodesize += PERCPU_PAGE_SIZE * cpus;
+		pernodesize += L1_CACHE_ALIGN(sizeof(pg_data_t));
+		pernodesize += L1_CACHE_ALIGN(sizeof(struct ia64_node_data));
+		pernodesize = PAGE_ALIGN(pernodesize);
+		pernode = NODEDATA_ALIGN(pstart, node);
+	
+		if (pstart + length > (pernode + pernodesize)) {
+			boot_pernode[node] = pernode;
+			boot_pernodesize[node] = pernodesize;
+			memset(__va(pernode), 0, pernodesize);
+
+			cpu_data = pernode;
+			pernode += PERCPU_PAGE_SIZE * cpus;
+
+			pg_data_ptr[node] = __va(pernode);
+			pernode += L1_CACHE_ALIGN(sizeof(pg_data_t));
+
+			boot_node_data[node] = __va(pernode);
+			pernode += L1_CACHE_ALIGN(sizeof(struct ia64_node_data));
+
+			pg_data_ptr[node]->bdata = &bdata[node];
+			pernode += L1_CACHE_ALIGN(sizeof(pg_data_t));
+
+			for (cpu=0; cpu < NR_CPUS; cpu++) {
+				if (node == node_cpuid[cpu].nid) {
+					extern char __per_cpu_start[], __phys_per_cpu_start[];
+					memcpy((void*)cpu_data, __phys_per_cpu_start, __per_cpu_end - __per_cpu_start);
+					__per_cpu_offset[cpu] = (char*)__va(cpu_data) - __per_cpu_start;
+					cpu_data +=  PERCPU_PAGE_SIZE;
+				}
+			}
+		}
+	}
+
+	pernode = boot_pernode[node];
+	pernodesize = boot_pernodesize[node];
+	if (pernode && !bdp->node_bootmem_map) {
 		pages = bdp->node_low_pfn - (bdp->node_boot_start>>PAGE_SHIFT);
 		mapsize = bootmem_bootmap_pages(pages) << PAGE_SHIFT;
-		if (length > mapsize) {
-			init_bootmem_node(
-				BOOT_NODE_DATA(node),
-				pstart>>PAGE_SHIFT, 
-				bdp->node_boot_start>>PAGE_SHIFT,
-				bdp->node_low_pfn);
+
+		if (pernode - pstart > mapsize)
+			map = pstart;
+		else if (pstart + length - pernode - pernodesize > mapsize)
+			map = pernode + pernodesize;
+
+		if (map) {
+			init_bootmem_node(BOOT_NODE_DATA(node),	map>>PAGE_SHIFT, 
+				bdp->node_boot_start>>PAGE_SHIFT, bdp->node_low_pfn);
 		}
 
 	}
@@ -143,9 +191,9 @@
  *
  */
 static int __init
-discontig_free_bootmem_node(unsigned long pstart, unsigned long length, int node)
+discontig_free_bootmem_node(unsigned long start, unsigned long end, int node)
 {
-	free_bootmem_node(BOOT_NODE_DATA(node), pstart, length);
+	free_bootmem_node(BOOT_NODE_DATA(node), __pa(start), end - start);
 
 	return 0;
 }
@@ -158,53 +206,50 @@
 discontig_reserve_bootmem(void)
 {
 	int		node;
-	unsigned long	mapbase, mapsize, pages;
+	unsigned long	base, size, pages;
 	bootmem_data_t	*bdp;
 
 	for (node = 0; node < numnodes; node++) {
 		bdp = BOOT_NODE_DATA(node)->bdata;
 
 		pages = bdp->node_low_pfn - (bdp->node_boot_start>>PAGE_SHIFT);
-		mapsize = bootmem_bootmap_pages(pages) << PAGE_SHIFT;
-		mapbase = __pa(bdp->node_bootmem_map);
-		reserve_bootmem_node(BOOT_NODE_DATA(node), mapbase, mapsize);
+		size = bootmem_bootmap_pages(pages) << PAGE_SHIFT;
+		base = __pa(bdp->node_bootmem_map);
+		reserve_bootmem_node(BOOT_NODE_DATA(node), base, size);
+
+		size = boot_pernodesize[node];
+		base = __pa(boot_pernode[node]);
+		reserve_bootmem_node(BOOT_NODE_DATA(node), base, size);
 	}
 }
 
 /*
- * Allocate per node tables.
- * 	- the pg_data structure is allocated on each node. This minimizes offnode 
- *	  memory references
- *	- the node data is allocated & initialized. Portions of this structure is read-only (after 
- *	  boot) and contains node-local pointers to usefuls data structures located on
- *	  other nodes.
+ * Initialize per-node data
+ *
+ * Finish setting up the node data for this node, then copy it to the other nodes.
  *
- * We also switch to using the "real" pg_data structures at this point. Earlier in boot, we
- * use a different structure. The only use for pg_data prior to the point in boot is to get 
- * the pointer to the bdata for the node.
  */
 static void __init
-allocate_pernode_structures(void)
+initialize_pernode_data(void)
 {
-	pg_data_t	*pgdat=0, *new_pgdat_list=0;
-	int		node, mynode;
+	int	cpu, node;
+
+	memcpy(boot_node_data[0]->pg_data_ptrs, pg_data_ptr, sizeof(pg_data_ptr));
+	memcpy(boot_node_data[0]->node_data_ptrs, boot_node_data, sizeof(boot_node_data));
 
-	mynode = boot_get_local_nodeid();
-	for (node = numnodes - 1; node >= 0 ; node--) {
-		node_data[node] = alloc_bootmem_node(BOOT_NODE_DATA(node), sizeof (struct ia64_node_data));
-		pgdat = __alloc_bootmem_node(BOOT_NODE_DATA(node), sizeof(pg_data_t), SMP_CACHE_BYTES, 0);
-		pgdat->bdata = &(bdata[node][0]);
-		pg_data_ptr[node] = pgdat;
-		pgdat->pgdat_next = new_pgdat_list;
-		new_pgdat_list = pgdat;
+	for (node=1; node < numnodes; node++) {
+		memcpy(boot_node_data[node], boot_node_data[0], sizeof(struct ia64_node_data));
+		boot_node_data[node]->node = node;
 	}
-	
-	memcpy(node_data[mynode]->pg_data_ptrs, pg_data_ptr, sizeof(pg_data_ptr));
-	memcpy(node_data[mynode]->node_data_ptrs, node_data, sizeof(node_data));
 
-	pgdat_list = new_pgdat_list;
+	for (cpu=0; cpu < NR_CPUS; cpu++) {
+		node = node_cpuid[cpu].nid;
+		per_cpu(cpu_info, cpu).node_data = boot_node_data[node];
+		per_cpu(cpu_info, cpu).nodeid = node;
+	}
 }
 
+
 /*
  * Called early in boot to setup the boot memory allocator, and to
  * allocate the node-local pg_data & node-directory data structures..
@@ -212,96 +257,19 @@
 void __init
 discontig_mem_init(void)
 {
-	int	node;
-
 	if (numnodes == 0) {
 		printk(KERN_ERR "node info missing!\n");
 		numnodes = 1;
 	}
 
-	for (node = 0; node < numnodes; node++) {
-		pg_data_ptr[node] = (pg_data_t*) &boot_pg_data[node];
-		pg_data_ptr[node]->bdata = &bdata[node][0];
-	}
-
 	min_low_pfn = -1;
 	max_low_pfn = 0;
 
         efi_memmap_walk(filter_rsvd_memory, build_maps);
-        efi_memmap_walk(filter_rsvd_memory, find_bootmap_space);
+        efi_memmap_walk(filter_rsvd_memory, find_pernode_space);
         efi_memmap_walk(filter_rsvd_memory, discontig_free_bootmem_node);
-	discontig_reserve_bootmem();
-	allocate_pernode_structures();
-}
-
-/*
- * Initialize the paging system.
- *	- determine sizes of each node
- *	- initialize the paging system for the node
- *	- build the nodedir for the node. This contains pointers to
- *	  the per-bank mem_map entries.
- *	- fix the page struct "virtual" pointers. These are bank specific
- *	  values that the paging system doesn't understand.
- *	- replicate the nodedir structure to other nodes	
- */ 
-
-void __init
-discontig_paging_init(void)
-{
-	int		node, mynode;
-	unsigned long	max_dma, zones_size[MAX_NR_ZONES];
-	unsigned long	kaddr, ekaddr, bid;
-	struct page	*page;
-	bootmem_data_t	*bdp;
-
-	max_dma = virt_to_phys((void *) MAX_DMA_ADDRESS) >> PAGE_SHIFT;
 
-	mynode = boot_get_local_nodeid();
-	for (node = 0; node < numnodes; node++) {
-		long pfn, startpfn;
-
-		memset(zones_size, 0, sizeof(zones_size));
-
-		startpfn = -1;
-		bdp = BOOT_NODE_DATA(node)->bdata;
-		pfn = bdp->node_boot_start >> PAGE_SHIFT;
-		if (startpfn == -1)
-			startpfn = pfn;
-		if (pfn > max_dma)
-			zones_size[ZONE_NORMAL] += (bdp->node_low_pfn - pfn);
-		else if (bdp->node_low_pfn < max_dma)
-			zones_size[ZONE_DMA] += (bdp->node_low_pfn - pfn);
-		else {
-			zones_size[ZONE_DMA] += (max_dma - pfn);
-			zones_size[ZONE_NORMAL] += (bdp->node_low_pfn - max_dma);
-		}
-
-		free_area_init_node(node, NODE_DATA(node), NULL, zones_size, startpfn, 0);
-
-		page = NODE_DATA(node)->node_mem_map;
-
-		bdp = BOOT_NODE_DATA(node)->bdata;
-
-		kaddr = (unsigned long)__va(bdp->node_boot_start);
-		ekaddr = (unsigned long)__va(bdp->node_low_pfn << PAGE_SHIFT);
-		while (kaddr < ekaddr) {
-			if (paddr_to_nid(__pa(kaddr)) == node) {
-				bid = BANK_MEM_MAP_INDEX(kaddr);
-				node_data[mynode]->node_id_map[bid] = node;
-				node_data[mynode]->bank_mem_map_base[bid] = page;
-			}
-			kaddr += BANKSIZE;
-			page += BANKSIZE/PAGE_SIZE;
-		}
-	}
-
-	/*
-	 * Finish setting up the node data for this node, then copy it to the other nodes.
-	 */
-	for (node=0; node < numnodes; node++)
-		if (mynode != node) {
-			memcpy(node_data[node], node_data[mynode], sizeof(struct ia64_node_data));
-			node_data[node]->node = node;
-		}
+	discontig_reserve_bootmem();
+	initialize_pernode_data();
 }
 
diff -Nru a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
--- a/arch/ia64/mm/init.c	Wed Jul 16 12:28:28 2003
+++ b/arch/ia64/mm/init.c	Wed Jul 16 12:28:28 2003
@@ -44,7 +44,7 @@
 #ifdef CONFIG_VIRTUAL_MEM_MAP
 # define LARGE_GAP	0x40000000	/* Use virtual mem map if hole is > than this */
   unsigned long vmalloc_end = VMALLOC_END_INIT;
-  static struct page *vmem_map;
+  struct page *vmem_map;
   static unsigned long num_dma_physpages;
 #endif
 
@@ -240,7 +240,7 @@
 				else if (page_count(pgdat->node_mem_map + i))
 					shared += page_count(pgdat->node_mem_map + i) - 1;
 			}
-			printk("\t%d pages of RAM\n", pgdat->node_spanned_pages);
+			printk("\t%ld pages of RAM\n", pgdat->node_spanned_pages);
 			printk("\t%d reserved pages\n", reserved);
 			printk("\t%d pages shared\n", shared);
 			printk("\t%d pages swap cached\n", cached);
@@ -397,6 +397,7 @@
 {
 	unsigned long address, start_page, end_page;
 	struct page *map_start, *map_end;
+	int node;
 	pgd_t *pgd;
 	pmd_t *pmd;
 	pte_t *pte;
@@ -406,19 +407,20 @@
 
 	start_page = (unsigned long) map_start & PAGE_MASK;
 	end_page = PAGE_ALIGN((unsigned long) map_end);
+	node = paddr_to_nid(__pa(start));
 
 	for (address = start_page; address < end_page; address += PAGE_SIZE) {
 		pgd = pgd_offset_k(address);
 		if (pgd_none(*pgd))
-			pgd_populate(&init_mm, pgd, alloc_bootmem_pages(PAGE_SIZE));
+			pgd_populate(&init_mm, pgd, alloc_bootmem_pages_node(NODE_DATA(node), PAGE_SIZE));
 		pmd = pmd_offset(pgd, address);
 
 		if (pmd_none(*pmd))
-			pmd_populate_kernel(&init_mm, pmd, alloc_bootmem_pages(PAGE_SIZE));
+			pmd_populate_kernel(&init_mm, pmd, alloc_bootmem_pages_node(NODE_DATA(node), PAGE_SIZE));
 		pte = pte_offset_kernel(pmd, address);
 
 		if (pte_none(*pte))
-			set_pte(pte, pfn_pte(__pa(alloc_bootmem_pages(PAGE_SIZE)) >> PAGE_SHIFT,
+			set_pte(pte, pfn_pte(__pa(alloc_bootmem_pages_node(NODE_DATA(node), PAGE_SIZE)) >> PAGE_SHIFT,
 					     PAGE_KERNEL));
 	}
 	return 0;
@@ -431,6 +433,14 @@
 	unsigned long zone;
 };
 
+struct memmap_count_callback_data {
+	int node;
+	unsigned long num_physpages;
+	unsigned long num_dma_physpages;
+	unsigned long min_pfn;
+	unsigned long max_pfn;
+} cdata;
+
 static int
 virtual_memmap_init (u64 start, u64 end, void *arg)
 {
@@ -489,16 +499,6 @@
 }
 
 static int
-count_dma_pages (u64 start, u64 end, void *arg)
-{
-	unsigned long *count = arg;
-
-	if (end <= MAX_DMA_ADDRESS)
-		*count += (end - start) >> PAGE_SHIFT;
-	return 0;
-}
-
-static int
 find_largest_hole (u64 start, u64 end, void *arg)
 {
 	u64 *max_gap = arg;
@@ -514,102 +514,101 @@
 }
 #endif /* CONFIG_VIRTUAL_MEM_MAP */
 
+#define GRANULEROUNDDOWN(n) ((n) & ~(IA64_GRANULE_SIZE-1))
+#define GRANULEROUNDUP(n) (((n)+IA64_GRANULE_SIZE-1) & ~(IA64_GRANULE_SIZE-1))
+#define ORDERROUNDDOWN(n) ((n) & ~((PAGE_SIZE<<MAX_ORDER)-1))
 static int
-count_pages (u64 start, u64 end, void *arg)
+count_pages (unsigned long start, unsigned long end, int node)
 {
-	unsigned long *count = arg;
+	start = __pa(start);
+	end = __pa(end);
 
-	*count += (end - start) >> PAGE_SHIFT;
+	if (node == cdata.node) {
+		cdata.num_physpages += (end - start) >> PAGE_SHIFT;
+		if (start <= __pa(MAX_DMA_ADDRESS))
+			cdata.num_dma_physpages += (min(end, __pa(MAX_DMA_ADDRESS)) - start) >> PAGE_SHIFT;
+		start = GRANULEROUNDDOWN(__pa(start));
+		start = ORDERROUNDDOWN(start);
+		end = GRANULEROUNDUP(__pa(end));
+		cdata.max_pfn = max(cdata.max_pfn, end >> PAGE_SHIFT);
+		cdata.min_pfn = min(cdata.min_pfn, start >> PAGE_SHIFT);
+	}
 	return 0;
 }
 
 /*
  * Set up the page tables.
  */
-
-#ifdef CONFIG_DISCONTIGMEM
 void
 paging_init (void)
 {
-	extern void discontig_paging_init(void);
-
-	discontig_paging_init();
-	efi_memmap_walk(count_pages, &num_physpages);
-	zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
-}
-#else /* !CONFIG_DISCONTIGMEM */
-void
-paging_init (void)
-{
-	unsigned long max_dma;
+	unsigned long max_dma_pfn;
 	unsigned long zones_size[MAX_NR_ZONES];
 #  ifdef CONFIG_VIRTUAL_MEM_MAP
 	unsigned long zholes_size[MAX_NR_ZONES];
 	unsigned long max_gap;
 #  endif
+	int node;
 
-	/* initialize mem_map[] */
-
-	memset(zones_size, 0, sizeof(zones_size));
-
-	num_physpages = 0;
-	efi_memmap_walk(count_pages, &num_physpages);
-
-	max_dma = virt_to_phys((void *) MAX_DMA_ADDRESS) >> PAGE_SHIFT;
-
-#  ifdef CONFIG_VIRTUAL_MEM_MAP
-	memset(zholes_size, 0, sizeof(zholes_size));
-
-	num_dma_physpages = 0;
-	efi_memmap_walk(count_dma_pages, &num_dma_physpages);
-
-	if (max_low_pfn < max_dma) {
-		zones_size[ZONE_DMA] = max_low_pfn;
-		zholes_size[ZONE_DMA] = max_low_pfn - num_dma_physpages;
-	} else {
-		zones_size[ZONE_DMA] = max_dma;
-		zholes_size[ZONE_DMA] = max_dma - num_dma_physpages;
-		if (num_physpages > num_dma_physpages) {
-			zones_size[ZONE_NORMAL] = max_low_pfn - max_dma;
-			zholes_size[ZONE_NORMAL] = ((max_low_pfn - max_dma)
-						    - (num_physpages - num_dma_physpages));
-		}
-	}
-
+	max_dma_pfn = virt_to_phys((void *) MAX_DMA_ADDRESS) >> PAGE_SHIFT;
 	max_gap = 0;
 	efi_memmap_walk(find_largest_hole, (u64 *)&max_gap);
-	if (max_gap < LARGE_GAP) {
-		vmem_map = (struct page *) 0;
-		free_area_init_node(0, &contig_page_data, NULL, zones_size, 0, zholes_size);
-		mem_map = contig_page_data.node_mem_map;
-	}
-	else {
-		unsigned long map_size;
-
-		/* allocate virtual_mem_map */
 
-		map_size = PAGE_ALIGN(max_low_pfn * sizeof(struct page));
-		vmalloc_end -= map_size;
-		vmem_map = (struct page *) vmalloc_end;
-		efi_memmap_walk(create_mem_map_page_table, 0);
-
-		free_area_init_node(0, &contig_page_data, vmem_map, zones_size, 0, zholes_size);
+	for (node = 0; node < numnodes; node++) {
+		memset(zones_size, 0, sizeof(zones_size));
+		memset(zholes_size, 0, sizeof(zholes_size));
+		memset(&cdata, 0, sizeof(cdata));
+
+		cdata.node = node;
+		cdata.min_pfn = ~0;
+
+		efi_memmap_walk(filter_rsvd_memory, count_pages);
+		num_dma_physpages += cdata.num_dma_physpages;
+		num_physpages += cdata.num_physpages;
+
+		if (cdata.min_pfn >= max_dma_pfn) {
+			/* Above the DMA zone */
+			zones_size[ZONE_NORMAL] = cdata.max_pfn - cdata.min_pfn;
+			zholes_size[ZONE_NORMAL] = cdata.max_pfn - cdata.min_pfn - cdata.num_physpages;
+		} else if (cdata.max_pfn < max_dma_pfn) {
+			/* This block is DMAable */
+			zones_size[ZONE_DMA] = cdata.max_pfn - cdata.min_pfn;
+			zholes_size[ZONE_DMA] = cdata.max_pfn - cdata.min_pfn - cdata.num_dma_physpages;
+		} else {
+			zones_size[ZONE_DMA] = max_dma_pfn - cdata.min_pfn;
+			zholes_size[ZONE_DMA] = zones_size[ZONE_DMA] - cdata.num_dma_physpages;
+			zones_size[ZONE_NORMAL] = cdata.max_pfn - max_dma_pfn;
+			zholes_size[ZONE_NORMAL] = zones_size[ZONE_NORMAL] - (cdata.num_physpages - cdata.num_dma_physpages);
+		}
 
-		mem_map = contig_page_data.node_mem_map;
-		printk("Virtual mem_map starts at 0x%p\n", mem_map);
-	}
-#  else /* !CONFIG_VIRTUAL_MEM_MAP */
-	if (max_low_pfn < max_dma)
-		zones_size[ZONE_DMA] = max_low_pfn;
-	else {
-		zones_size[ZONE_DMA] = max_dma;
-		zones_size[ZONE_NORMAL] = max_low_pfn - max_dma;
+		if (numnodes == 1 && max_gap < LARGE_GAP) {
+			/* Just one node with no big holes... */
+			vmem_map = (struct page *)0;
+			zones_size[ZONE_DMA] += cdata.min_pfn;
+			zholes_size[ZONE_DMA] += cdata.min_pfn;
+			free_area_init_node(0, NODE_DATA(node), NODE_DATA(node)->node_mem_map,
+					    zones_size, 0, zholes_size);
+		}
+		else {
+			/* allocate virtual mem_map */
+			if (node == 0) {
+				unsigned long map_size;
+				map_size = PAGE_ALIGN(max_low_pfn*sizeof(struct page));
+				vmalloc_end -= map_size;
+				vmem_map = (struct page *) vmalloc_end;
+				efi_memmap_walk(create_mem_map_page_table, 0);
+				printk("Virtual mem_map starts at 0x%p\n", vmem_map);
+#ifndef CONFIG_DISCONTIGMEM
+				mem_map = vmem_map;
+#endif
+			}
+			free_area_init_node(node, NODE_DATA(node), vmem_map + cdata.min_pfn,
+					    zones_size, cdata.min_pfn, zholes_size);
+		}
 	}
-	free_area_init(zones_size);
-#  endif /* !CONFIG_VIRTUAL_MEM_MAP */
+
 	zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
 }
-#endif /* !CONFIG_DISCONTIGMEM */
 
 static int
 count_reserved_pages (u64 start, u64 end, void *arg)
diff -Nru a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
--- a/drivers/acpi/Kconfig	Wed Jul 16 12:28:28 2003
+++ b/drivers/acpi/Kconfig	Wed Jul 16 12:28:28 2003
@@ -133,7 +133,7 @@
 
 config ACPI_NUMA
 	bool "NUMA support" if NUMA && (IA64 && !IA64_HP_SIM || X86 && ACPI && !ACPI_HT_ONLY && !X86_64)
-	default y if IA64 && IA64_SGI_SN
+	default y if IA64_GENERIC || IA64_SGI_SN2
 
 config ACPI_ASUS
         tristate "ASUS/Medion Laptop Extras"
diff -Nru a/include/asm-ia64/mmzone.h b/include/asm-ia64/mmzone.h
--- a/include/asm-ia64/mmzone.h	Wed Jul 16 12:28:28 2003
+++ b/include/asm-ia64/mmzone.h	Wed Jul 16 12:28:28 2003
@@ -3,7 +3,7 @@
  * License.  See the file "COPYING" in the main directory of this archive
  * for more details.
  *
- * Copyright (c) 2000 Silicon Graphics, Inc.  All rights reserved.
+ * Copyright (c) 2000,2003 Silicon Graphics, Inc.  All rights reserved.
  * Copyright (c) 2002 NEC Corp.
  * Copyright (c) 2002 Erich Focht <efocht@ess.nec.de>
  * Copyright (c) 2002 Kimio Suganuma <k-suganuma@da.jp.nec.com>
@@ -14,150 +14,50 @@
 #include <linux/config.h>
 #include <linux/init.h>
 
-/*
- * Given a kaddr, find the base mem_map address for the start of the mem_map
- * entries for the bank containing the kaddr.
- */
-#define BANK_MEM_MAP_BASE(kaddr) local_node_data->bank_mem_map_base[BANK_MEM_MAP_INDEX(kaddr)]
-
-/*
- * Given a kaddr, this macro return the relative map number 
- * within the bank.
- */
-#define BANK_MAP_NR(kaddr) 	(BANK_OFFSET(kaddr) >> PAGE_SHIFT)
 
-/*
- * Given a pte, this macro returns a pointer to the page struct for the pte.
- */
-#define pte_page(pte)	virt_to_page(PAGE_OFFSET | (pte_val(pte)&_PFN_MASK))
+#ifdef CONFIG_NUMA
 
-/*
- * Determine if a kaddr is a valid memory address of memory that
- * actually exists. 
- *
- * The check consists of 2 parts:
- *	- verify that the address is a region 7 address & does not 
- *	  contain any bits that preclude it from being a valid platform
- *	  memory address
- *	- verify that the chunk actually exists.
- *
- * Note that IO addresses are NOT considered valid addresses.
- *
- * Note, many platforms can simply check if kaddr exceeds a specific size.  
- *	(However, this won't work on SGI platforms since IO space is embedded 
- * 	within the range of valid memory addresses & nodes have holes in the 
- *	address range between banks). 
- */
-#define kern_addr_valid(kaddr)		({long _kav=(long)(kaddr);	\
-					VALID_MEM_KADDR(_kav);})
-
-/*
- * Given a kaddr, return a pointer to the page struct for the page.
- * If the kaddr does not represent RAM memory that potentially exists, return
- * a pointer the page struct for max_mapnr. IO addresses will
- * return the page for max_nr. Addresses in unpopulated RAM banks may
- * return undefined results OR may panic the system.
- *
- */
-#define virt_to_page(kaddr)	({long _kvtp=(long)(kaddr);	\
-				(VALID_MEM_KADDR(_kvtp))	\
-					? BANK_MEM_MAP_BASE(_kvtp) + BANK_MAP_NR(_kvtp)	\
-					: NULL;})
+#ifdef CONFIG_IA64_DIG
 
 /*
- * Given a page struct entry, return the physical address that the page struct represents.
- * Since IA64 has all memory in the DMA zone, the following works:
+ * Platform definitions for DIG platform with contiguous memory.
  */
-#define page_to_phys(page)	__pa(page_address(page))
-
-#define node_mem_map(nid)	(NODE_DATA(nid)->node_mem_map)
+#define MAX_PHYSNODE_ID	8		/* Maximum node number +1 */
+#define NR_NODES	8		/* Maximum number of nodes in SSI */
+#define NR_MEMBLKS	(NR_NODES * 32)
 
-#define node_localnr(pfn, nid)	((pfn) - NODE_DATA(nid)->node_start_pfn)
 
-#define pfn_to_page(pfn)	(struct page *)(node_mem_map(pfn_to_nid(pfn)) + node_localnr(pfn, pfn_to_nid(pfn)))
 
-#define pfn_to_nid(pfn)		 local_node_data->node_id_map[(pfn << PAGE_SHIFT) >> BANKSHIFT]
-
-#define page_to_pfn(page)	(long)((page - page_zone(page)->zone_mem_map) + page_zone(page)->zone_start_pfn)
 
+#elif CONFIG_IA64_SGI_SN2
 
 /*
- * pfn_valid should be made as fast as possible, and the current definition
- * is valid for machines that are NUMA, but still contiguous, which is what
- * is currently supported. A more generalised, but slower definition would
- * be something like this - mbligh:
- * ( pfn_to_pgdat(pfn) && (pfn < node_end_pfn(pfn_to_nid(pfn))) )
+ * Platform definitions for the SGI SN2 platform.
  */
-#define pfn_valid(pfn)          (pfn < max_low_pfn)
-extern unsigned long max_low_pfn;
+#define MAX_PHYSNODE_ID	2048		/* Maximum node number +1 */
+#define NR_NODES	256		/* Maximum number of compute nodes in SSI */
+#define NR_MEMBLKS	(NR_NODES)
 
+#elif CONFIG_IA64_GENERIC
 
-#ifdef CONFIG_IA64_DIG
 
 /*
- * Platform definitions for DIG platform with contiguous memory.
+ * Platform definitions for GENERIC platform with contiguous or discontiguous memory.
  */
-#define MAX_PHYSNODE_ID	8	/* Maximum node number +1 */
-#define NR_NODES	8	/* Maximum number of nodes in SSI */
+#define MAX_PHYSNODE_ID 2048		/* Maximum node number +1 */
+#define NR_NODES        256		/* Maximum number of nodes in SSI */
+#define NR_MEMBLKS      (NR_NODES)
 
-#define MAX_PHYS_MEMORY	(1UL << 40)	/* 1 TB */
 
-/*
- * Bank definitions.
- * Configurable settings for DIG: 512MB/bank:  16GB/node,
- *                               2048MB/bank:  64GB/node,
- *                               8192MB/bank: 256GB/node.
- */
-#define NR_BANKS_PER_NODE	32
-#if defined(CONFIG_IA64_NODESIZE_16GB)
-# define BANKSHIFT		29
-#elif defined(CONFIG_IA64_NODESIZE_64GB)
-# define BANKSHIFT		31
-#elif defined(CONFIG_IA64_NODESIZE_256GB)
-# define BANKSHIFT		33
 #else
-# error Unsupported bank and nodesize!
+#error unknown platform
 #endif
-#define BANKSIZE		(1UL << BANKSHIFT)
-#define BANK_OFFSET(addr)	((unsigned long)(addr) & (BANKSIZE-1))
-#define NR_BANKS		(NR_BANKS_PER_NODE * NR_NODES)
 
-/*
- * VALID_MEM_KADDR returns a boolean to indicate if a kaddr is
- * potentially a valid cacheable identity mapped RAM memory address.
- * Note that the RAM may or may not actually be present!!
- */
-#define VALID_MEM_KADDR(kaddr)	1
+extern void build_cpu_to_node_map(void);
 
-/*
- * Given a nodeid & a bank number, find the address of the mem_map
- * entry for the first page of the bank.
- */
-#define BANK_MEM_MAP_INDEX(kaddr) \
-	(((unsigned long)(kaddr) & (MAX_PHYS_MEMORY-1)) >> BANKSHIFT)
+#else /* CONFIG_NUMA */
 
-#elif defined(CONFIG_IA64_SGI_SN2)
-/*
- * SGI SN2 discontig definitions
- */
-#define MAX_PHYSNODE_ID	2048	/* 2048 node ids (also called nasid) */
-#define NR_NODES	128	/* Maximum number of nodes in SSI */
-#define MAX_PHYS_MEMORY	(1UL << 49)
-
-#define BANKSHIFT		38
-#define NR_BANKS_PER_NODE	4
-#define SN2_NODE_SIZE		(64UL*1024*1024*1024)	/* 64GB per node */
-#define BANKSIZE		(SN2_NODE_SIZE/NR_BANKS_PER_NODE)
-#define BANK_OFFSET(addr)	((unsigned long)(addr) & (BANKSIZE-1))
-#define NR_BANKS		(NR_BANKS_PER_NODE * NR_NODES)
-#define VALID_MEM_KADDR(kaddr)	1
-
-/*
- * Given a nodeid & a bank number, find the address of the mem_map
- * entry for the first page of the bank.
- */
-#define BANK_MEM_MAP_INDEX(kaddr) \
-	(((unsigned long)(kaddr) & (MAX_PHYS_MEMORY-1)) >> BANKSHIFT)
+#define NR_NODES	1
 
-#endif /* CONFIG_IA64_DIG */
+#endif /* CONFIG_NUMA */
 #endif /* _ASM_IA64_MMZONE_H */
diff -Nru a/include/asm-ia64/nodedata.h b/include/asm-ia64/nodedata.h
--- a/include/asm-ia64/nodedata.h	Wed Jul 16 12:28:28 2003
+++ b/include/asm-ia64/nodedata.h	Wed Jul 16 12:28:28 2003
@@ -13,7 +13,7 @@
 #ifndef _ASM_IA64_NODEDATA_H
 #define _ASM_IA64_NODEDATA_H
 
-
+#include <asm/percpu.h>
 #include <asm/mmzone.h>
 
 /*
@@ -22,15 +22,17 @@
 
 struct pglist_data;
 struct ia64_node_data {
-	short			active_cpu_count;
 	short			node;
+	short			active_cpu_count;
+	/*
+	 * The fields are read-only (after boot). They contain pointers to various structures
+	 * located on other nodes. This data is replicated on each node in order to reduce
+	 * off-node references.
+	 */
         struct pglist_data	*pg_data_ptrs[NR_NODES];
-	struct page		*bank_mem_map_base[NR_BANKS];
 	struct ia64_node_data	*node_data_ptrs[NR_NODES];
-	short			node_id_map[NR_BANKS];
 };
 
-
 /*
  * Return a pointer to the node_data structure for the executing cpu.
  */
@@ -40,7 +42,8 @@
 /*
  * Return a pointer to the node_data structure for the specified node.
  */
-#define node_data(node)	(local_node_data->node_data_ptrs[node])
+#define node_data(node) (local_node_data->node_data_ptrs[node])
+#define NODE_DATA(nid) (local_node_data->pg_data_ptrs[nid])
 
 /*
  * Get a pointer to the node_id/node_data for the current cpu.
@@ -48,29 +51,5 @@
  */
 extern int boot_get_local_nodeid(void);
 extern struct ia64_node_data *get_node_data_ptr(void);
-
-/*
- * Given a node id, return a pointer to the pg_data_t for the node.
- * The following 2 macros are similar. 
- *
- * NODE_DATA 	- should be used in all code not related to system
- *		  initialization. It uses pernode data structures to minimize
- *		  offnode memory references. However, these structure are not 
- *		  present during boot. This macro can be used once cpu_init
- *		  completes.
- *
- * BOOT_NODE_DATA
- *		- should be used during system initialization 
- *		  prior to freeing __initdata. It does not depend on the percpu
- *		  area being present.
- *
- * NOTE:   The names of these macros are misleading but are difficult to change
- *	   since they are used in generic linux & on other architecures.
- */
-#define NODE_DATA(nid)		(local_node_data->pg_data_ptrs[nid])
-#define BOOT_NODE_DATA(nid)	boot_get_pg_data_ptr((long)(nid))
-
-struct pglist_data;
-extern struct pglist_data * __init boot_get_pg_data_ptr(long);
 
 #endif /* _ASM_IA64_NODEDATA_H */
diff -Nru a/include/asm-ia64/numa.h b/include/asm-ia64/numa.h
--- a/include/asm-ia64/numa.h	Wed Jul 16 12:28:28 2003
+++ b/include/asm-ia64/numa.h	Wed Jul 16 12:28:28 2003
@@ -15,13 +15,21 @@
 
 #ifdef CONFIG_DISCONTIGMEM
 # include <asm/mmzone.h>
-# define NR_MEMBLKS   (NR_BANKS)
 #else
 # define NR_NODES     (8)
 # define NR_MEMBLKS   (NR_NODES * 8)
 #endif
 
 #include <linux/cache.h>
+#include <linux/threads.h>
+#include <linux/smp.h>
+
+#define NODEMASK_WORDCOUNT       ((NR_NODES+(BITS_PER_LONG-1))/BITS_PER_LONG)
+
+#define NODE_MASK_NONE   { [0 ... ((NR_NODES+BITS_PER_LONG-1)/BITS_PER_LONG)-1] = 0 }
+
+typedef unsigned long   nodemask_t[NODEMASK_WORDCOUNT];
+
 extern volatile char cpu_to_node_map[NR_CPUS] __cacheline_aligned;
 extern volatile unsigned long node_to_cpu_mask[NR_NODES] __cacheline_aligned;
 
@@ -63,6 +71,12 @@
 extern int paddr_to_nid(unsigned long paddr);
 
 #define local_nodeid (cpu_to_node_map[smp_processor_id()])
+
+#else /* !CONFIG_NUMA */
+
+#define node_distance(from,to) 10
+#define paddr_to_nid(x) 0
+#define local_nodeid 0
 
 #endif /* CONFIG_NUMA */
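
A minimal usage sketch for the nodemask_t introduced above (assumed
usage, not from the patch), given NR_NODES and BITS_PER_LONG from the
headers:

	nodemask_t online_nodes = NODE_MASK_NONE;

	/* mark node 3 online: word 3/BITS_PER_LONG, bit 3%BITS_PER_LONG */
	online_nodes[3 / BITS_PER_LONG] |= 1UL << (3 % BITS_PER_LONG);

	if (online_nodes[3 / BITS_PER_LONG] & (1UL << (3 % BITS_PER_LONG)))
		printk("node 3 is online\n");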
 
diff -Nru a/include/asm-ia64/page.h b/include/asm-ia64/page.h
--- a/include/asm-ia64/page.h	Wed Jul 16 12:28:28 2003
+++ b/include/asm-ia64/page.h	Wed Jul 16 12:28:28 2003
@@ -93,18 +93,26 @@
 
 #define virt_addr_valid(kaddr)	pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
 
-#ifndef CONFIG_DISCONTIGMEM
-# ifdef CONFIG_VIRTUAL_MEM_MAP
-   extern int ia64_pfn_valid (unsigned long pfn);
-#  define pfn_valid(pfn)	(((pfn) < max_mapnr) && ia64_pfn_valid(pfn))
-# else
-#  define pfn_valid(pfn)	((pfn) < max_mapnr)
-# endif
-#define virt_to_page(kaddr)	pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
-#define page_to_pfn(page)	((unsigned long) (page - mem_map))
-#define pfn_to_page(pfn)	(mem_map + (pfn))
-#define page_to_phys(page)	(page_to_pfn(page) << PAGE_SHIFT)
+#ifdef CONFIG_VIRTUAL_MEM_MAP
+extern int ia64_pfn_valid(unsigned long pfn);
+#else
+#define ia64_pfn_valid(pfn) (1)
+#endif
+
+extern unsigned long max_low_pfn;
+#define pfn_valid(pfn) (((pfn) < max_low_pfn) && ia64_pfn_valid(pfn))
+
+#if defined(CONFIG_VIRTUAL_MEM_MAP) && !defined(CONFIG_DISCONTIGMEM)
+#define vmem_map mem_map
+#else
+extern struct page *vmem_map;
 #endif
+
+#define pfn_to_page(pfn)	(vmem_map + (pfn))
+#define page_to_pfn(page)	((unsigned long) (page - vmem_map))
+
+#define virt_to_page(kaddr)	(pfn_to_page(__pa(kaddr) >> PAGE_SHIFT))
+#define page_to_phys(page)	(page_to_pfn(page) << PAGE_SHIFT)
 
 typedef union ia64_va {
 	struct {
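
For what it's worth, a quick sanity check on the unified macros above (a
sketch assuming nothing beyond the definitions in this hunk): the pfn/page
conversions compose to the identity no matter which definition of vmem_map
is in effect:

	unsigned long pfn = 0x1234;
	struct page *pg = pfn_to_page(pfn);	/* vmem_map + 0x1234 */
	BUG_ON(page_to_pfn(pg) != pfn);		/* pg - vmem_map == 0x1234 */
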
diff -Nru a/include/asm-ia64/pgtable.h b/include/asm-ia64/pgtable.h
--- a/include/asm-ia64/pgtable.h	Wed Jul 16 12:28:28 2003
+++ b/include/asm-ia64/pgtable.h	Wed Jul 16 12:28:28 2003
@@ -174,7 +174,6 @@
 	return (addr & (local_cpu_data->unimpl_pa_mask)) == 0;
 }
 
-#ifndef CONFIG_DISCONTIGMEM
 /*
  * kern_addr_valid(ADDR) tests if ADDR is pointing to valid kernel
 * memory.  For the return value to be meaningful, ADDR must be >=
@@ -190,7 +189,6 @@
  */
 #define kern_addr_valid(addr)	(1)
 
-#endif
 
 /*
  * Now come the defines and routines to manage and access the three-level
@@ -241,10 +239,8 @@
 #define pte_none(pte) 			(!pte_val(pte))
 #define pte_present(pte)		(pte_val(pte) & (_PAGE_P | _PAGE_PROTNONE))
 #define pte_clear(pte)			(pte_val(*(pte)) = 0UL)
-#ifndef CONFIG_DISCONTIGMEM
 /* pte_page() returns the "struct page *" corresponding to the PTE: */
 #define pte_page(pte)			virt_to_page(((pte_val(pte) & _PFN_MASK) + PAGE_OFFSET))
-#endif
 
 #define pmd_none(pmd)			(!pmd_val(pmd))
 #define pmd_bad(pmd)			(!ia64_phys_addr_valid(pmd_val(pmd)))
@@ -416,6 +412,7 @@
 
 extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
 extern void paging_init (void);
+extern int filter_rsvd_memory(unsigned long start, unsigned long end, void *arg);
 
 /*
  * Note: The macros below rely on the fact that MAX_SWAPFILES_SHIFT <= number of
diff -Nru a/include/asm-ia64/processor.h b/include/asm-ia64/processor.h
--- a/include/asm-ia64/processor.h	Wed Jul 16 12:28:28 2003
+++ b/include/asm-ia64/processor.h	Wed Jul 16 12:28:28 2003
@@ -185,6 +185,8 @@
 #endif
 #ifdef CONFIG_NUMA
 	struct ia64_node_data *node_data;
+	struct cpuinfo_ia64 *cpu_data[NR_CPUS];
+	int nodeid;
 #endif
 };
 
diff -Nru a/mm/bootmem.c b/mm/bootmem.c
--- a/mm/bootmem.c	Wed Jul 16 12:28:28 2003
+++ b/mm/bootmem.c	Wed Jul 16 12:28:28 2003
@@ -48,8 +48,24 @@
 	bootmem_data_t *bdata = pgdat->bdata;
 	unsigned long mapsize = ((end - start)+7)/8;
 
-	pgdat->pgdat_next = pgdat_list;
-	pgdat_list = pgdat;
+
+	/*
+	 * sort pgdat_list so that the lowest one comes first,
+	 * which makes alloc_bootmem_low_pages work as desired.
+	 */
+	if (!pgdat_list || pgdat_list->node_start_pfn > pgdat->node_start_pfn) {
+		pgdat->pgdat_next = pgdat_list;
+		pgdat_list = pgdat;
+	} else {
+		pg_data_t *tmp = pgdat_list;
+		while (tmp->pgdat_next) {
+			if (tmp->pgdat_next->node_start_pfn > pgdat->node_start_pfn)
+				break;
+			tmp = tmp->pgdat_next;
+		}
+		pgdat->pgdat_next = tmp->pgdat_next;
+		tmp->pgdat_next = pgdat;
+	}
 
 	mapsize = (mapsize + (sizeof(long) - 1UL)) & ~(sizeof(long) - 1UL);
 	bdata->node_bootmem_map = phys_to_virt(mapstart << PAGE_SHIFT);
@@ -251,7 +267,7 @@
 
 static unsigned long __init free_all_bootmem_core(pg_data_t *pgdat)
 {
-	struct page *page = pgdat->node_mem_map;
+	struct page *page;
 	bootmem_data_t *bdata = pgdat->bdata;
 	unsigned long i, count, total = 0;
 	unsigned long idx;
@@ -260,23 +276,23 @@
 	if (!bdata->node_bootmem_map) BUG();
 
 	count = 0;
+	page = virt_to_page(phys_to_virt(bdata->node_boot_start));
 	idx = bdata->node_low_pfn - (bdata->node_boot_start >> PAGE_SHIFT);
 	map = bdata->node_bootmem_map;
 	for (i = 0; i < idx; ) {
 		unsigned long v = ~map[i / BITS_PER_LONG];
 		if (v) {
 			unsigned long m;
-			for (m = 1; m && i < idx; m<<=1, page++, i++) {
+			for (m = 1; m && i < idx; m<<=1, i++) {
 				if (v & m) {
 					count++;
-					ClearPageReserved(page);
-					set_page_count(page, 1);
-					__free_page(page);
+					ClearPageReserved(page+i);
+					set_page_count(page+i, 1);
+					__free_page(page+i);
 				}
 			}
 		} else {
 			i+=BITS_PER_LONG;
-			page += BITS_PER_LONG;
 		}
 	}
 	total += count;
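
The pgdat_list manipulation in the first hunk is a plain sorted insert;
the same logic can be written more compactly with a pointer-to-pointer
walk (an equivalent sketch, field names as in the patch):

	/* keep pgdat_list ordered by ascending node_start_pfn */
	pg_data_t **pp = &pgdat_list;

	while (*pp && (*pp)->node_start_pfn <= pgdat->node_start_pfn)
		pp = &(*pp)->pgdat_next;
	pgdat->pgdat_next = *pp;
	*pp = pgdat;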


* Re: [Discontig-devel] [PATCH] another discontig patch
  2003-06-21  9:06 [Discontig-devel] [PATCH] another discontig patch Christoph Hellwig
                   ` (5 preceding siblings ...)
  2003-07-16 19:29 ` Jesse Barnes
@ 2003-07-16 19:40 ` Matthew Wilcox
  2003-07-16 19:51 ` Jesse Barnes
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Matthew Wilcox @ 2003-07-16 19:40 UTC (permalink / raw)
  To: linux-ia64

On Wed, Jul 16, 2003 at 12:29:52PM -0700, Jesse Barnes wrote:
> @@ -210,8 +210,8 @@
>  	  system with an A0 or A1 stepping CPU.
>  
>  config NUMA
> -	bool "Enable NUMA support" if IA64_GENERIC || IA64_DIG || IA64_HP_ZX1
> -	default y if IA64_SGI_SN2
> +	bool
> +	default y if IA64_SGI_SN2 || IA64_GENERIC
>  	help
>  	  Say Y to compile the kernel to support NUMA (Non-Uniform Memory
>  	  Access).  This option is for configuring high-end multiprocessor

If you're removing the question, you can remove the helptext too.

> @@ -235,8 +235,7 @@
>  
>  config DISCONTIGMEM
>  	bool
> -	depends on IA64_SGI_SN2 || (IA64_GENERIC || IA64_DIG || IA64_HP_ZX1) && NUMA
> -	default y
> +	default y if IA64_SGI_SN2 || IA64_GENERIC
>  	help
>  	  Say Y to support efficient handling of discontiguous physical memory,
>  	  for architectures which are either NUMA (Non-Uniform Memory Access)

This one already shouldn't have helptext ;-)

> - * on a BANKSIZE boundary.  Multiple banks per node is not supported.
> + * on a GRANULE boundary.  Multiple banks per node is not supported.

Multiple banks *are* not supported ;-)  The feature of multiple banks
*is* not supported.

> @@ -22,15 +22,17 @@
>  
>  struct pglist_data;
>  struct ia64_node_data {
> -	short			active_cpu_count;
>  	short			node;
> +	short			active_cpu_count;
> +	/*
> +	 * The fields are read-only (after boot). They contain pointers to various structures
> +	 * located on other nodes. This data is replicated on each node in order to reduce
> +	 * off-node references.
> +	 */

Can you wrap comments at 80 columns?  It makes them much easier to read.

-- 
"It's not Hollywood.  War is real, war is primarily not about defeat or
victory, it is about death.  I've seen thousands and thousands of dead bodies.
Do you think I want to have an academic debate on this subject?" -- Robert Fisk


* Re: [Discontig-devel] [PATCH] another discontig patch
  2003-06-21  9:06 [Discontig-devel] [PATCH] another discontig patch Christoph Hellwig
                   ` (6 preceding siblings ...)
  2003-07-16 19:40 ` Matthew Wilcox
@ 2003-07-16 19:51 ` Jesse Barnes
  2003-07-16 19:56 ` Erich Focht
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Jesse Barnes @ 2003-07-16 19:51 UTC (permalink / raw)
  To: linux-ia64

Thanks for looking at it.  Here's another one that fixes that stuff up.

On Wed, Jul 16, 2003 at 08:40:51PM +0100, Matthew Wilcox wrote:
> > +	default y if IA64_SGI_SN2 || IA64_GENERIC
> >  	help
> >  	  Say Y to compile the kernel to support NUMA (Non-Uniform Memory
> >  	  Access).  This option is for configuring high-end multiprocessor
> 
> If you're removing the question, you can remove the helptext too.
> 
> > +	default y if IA64_SGI_SN2 || IA64_GENERIC
> >  	help
> >  	  Say Y to support efficient handling of discontiguous physical memory,
> >  	  for architectures which are either NUMA (Non-Uniform Memory Access)
> 
> This one already shouldn't have helptext ;-)
> 
> > - * on a BANKSIZE boundary.  Multiple banks per node is not supported.
> > + * on a GRANULE boundary.  Multiple banks per node is not supported.
> 
> Multiple banks *are* not supported ;-)  The feature of multiple banks
> *is* not supported.
> 
> > @@ -22,15 +22,17 @@
> >  
> >  struct pglist_data;
> >  struct ia64_node_data {
> > -	short			active_cpu_count;
> >  	short			node;
> > +	short			active_cpu_count;
> > +	/*
> > +	 * The fields are read-only (after boot). They contain pointers to various structures
> > +	 * located on other nodes. Ths data is replicated on each node in order to reduce
> > +	 * off-node references.
> > +	 */
> 
> Can you wrap comments at 80 columns?  It makes them much easier to read.

I prefer it that way too, but David likes 100 columns, AFAICT. :)  Fixed
anyway.

Thanks,
Jesse


# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	           ChangeSet	1.1396  -> 1.1398 
#	include/asm-ia64/page.h	1.19    -> 1.20   
#	arch/ia64/kernel/setup.c	1.53    -> 1.54   
#	include/asm-ia64/pgtable.h	1.28    -> 1.29   
#	        mm/bootmem.c	1.18    -> 1.19   
#	include/asm-ia64/numa.h	1.5     -> 1.6    
#	include/asm-ia64/processor.h	1.48    -> 1.49   
#	 arch/ia64/mm/init.c	1.46    -> 1.47   
#	include/asm-ia64/nodedata.h	1.3     -> 1.5    
#	arch/ia64/mm/discontig.c	1.4     -> 1.6    
#	   arch/ia64/Kconfig	1.38    -> 1.40   
#	drivers/acpi/Kconfig	1.12    -> 1.13   
#	include/asm-ia64/mmzone.h	1.4     -> 1.5    
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 03/07/16	jbarnes@tomahawk.engr.sgi.com	1.1397
# discontig update
# --------------------------------------------
# 03/07/16	jbarnes@tomahawk.engr.sgi.com	1.1398
# more discontig stuff
# --------------------------------------------
#
diff -Nru a/arch/ia64/Kconfig b/arch/ia64/Kconfig
--- a/arch/ia64/Kconfig	Wed Jul 16 12:50:02 2003
+++ b/arch/ia64/Kconfig	Wed Jul 16 12:50:02 2003
@@ -210,12 +210,8 @@
 	  system with an A0 or A1 stepping CPU.
 
 config NUMA
-	bool "Enable NUMA support" if IA64_GENERIC || IA64_DIG || IA64_HP_ZX1
-	default y if IA64_SGI_SN2
-	help
-	  Say Y to compile the kernel to support NUMA (Non-Uniform Memory
-	  Access).  This option is for configuring high-end multiprocessor
-	  server systems.  If in doubt, say N.
+	bool
+	default y if IA64_SGI_SN2 || IA64_GENERIC
 
 choice
 	prompt "Maximum Memory per NUMA Node" if NUMA && IA64_DIG
@@ -235,18 +231,11 @@
 
 config DISCONTIGMEM
 	bool
-	depends on IA64_SGI_SN2 || (IA64_GENERIC || IA64_DIG || IA64_HP_ZX1) && NUMA
-	default y
-	help
-	  Say Y to support efficient handling of discontiguous physical memory,
-	  for architectures which are either NUMA (Non-Uniform Memory Access)
-	  or have huge holes in the physical address space for other reasons.
-	  See <file:Documentation/vm/numa> for more.
+	default y if IA64_SGI_SN2 || IA64_GENERIC
 
 config VIRTUAL_MEM_MAP
 	bool "Enable Virtual Mem Map"
-	depends on !NUMA
-	default y if IA64_GENERIC || IA64_DIG || IA64_HP_ZX1
+	default y if !IA64_HP_SIM
 	help
 	  Say Y to compile the kernel with support for a virtual mem map.
 	  This is an alternate method of supporting large holes in the
@@ -259,8 +248,8 @@
 	  are unsure, say Y.
 
 config IA64_MCA
-	bool "Enable IA-64 Machine Check Abort" if IA64_GENERIC || IA64_DIG || IA64_HP_ZX1
-	default y if IA64_SGI_SN2
+	bool "Enable IA-64 Machine Check Abort"
+	default y if !IA64_HP_SIM
 	help
 	  Say Y here to enable machine check support for IA-64.  If you're
 	  unsure, answer Y.
@@ -292,43 +281,12 @@
 	depends on IA64_GENERIC || IA64_DIG || IA64_HP_ZX1 || IA64_SGI_SN2
 	default y
 
-config IA64_SGI_SN_DEBUG
-	bool "Enable extra debugging code"
-	depends on IA64_SGI_SN2
-	help
-	  Turns on extra debugging code in the SGI SN (Scalable NUMA) platform
-	  for IA-64.  Unless you are debugging problems on an SGI SN IA-64 box,
-	  say N.
-
 config IA64_SGI_SN_SIM
 	bool "Enable SGI Medusa Simulator Support"
 	depends on IA64_SGI_SN2
 	help
 	  If you are compiling a kernel that will run under SGI's IA-64
 	  simulator (Medusa) then say Y, otherwise say N.
-
-config IA64_SGI_AUTOTEST
-	bool "Enable autotest (llsc). Option to run cache test instead of booting"
-	depends on IA64_SGI_SN2
-	help
-	  Build a kernel used for hardware validation. If you include the
-	  keyword "autotest" on the boot command line, the kernel does NOT boot.
-	  Instead, it starts all cpus and runs cache coherency tests instead.
-
-	  If unsure, say N.
-
-config SERIAL_SGI_L1_PROTOCOL
-	bool "Enable protocol mode for the L1 console"
-	depends on IA64_SGI_SN2
-	help
-	  Uses protocol mode instead of raw mode for the level 1 console on the
-	  SGI SN (Scalable NUMA) platform for IA-64.  If you are compiling for
-	  an SGI SN box then Y is the recommended value, otherwise say N.
-
-config PERCPU_IRQ
-	bool
-	depends on IA64_SGI_SN2
-	default y
 
 # On IA-64, we always want an ELF /proc/kcore.
 config KCORE_ELF
diff -Nru a/arch/ia64/kernel/setup.c b/arch/ia64/kernel/setup.c
--- a/arch/ia64/kernel/setup.c	Wed Jul 16 12:50:02 2003
+++ b/arch/ia64/kernel/setup.c	Wed Jul 16 12:50:02 2003
@@ -138,7 +138,7 @@
 call_pernode_memory (unsigned long start, unsigned long end, void *arg)
 {
 	unsigned long rs, re;
-	void (*func)(unsigned long, unsigned long, int, int);
+	void (*func)(unsigned long, unsigned long, int);
 	int i;
 
 	start = PAGE_ALIGN(start);
@@ -149,22 +149,21 @@
 	func = arg;
 
 	if (!num_memblks) {
-		/*
-		 * This machine doesn't have SRAT, so call func with
-		 * nid=0, bank=0.
-		 */
+		/* No SRAT table, so assume one node (node 0) */
 		if (start < end)
-			(*func)(start, end - start, 0, 0);
+			(*func)(start, end, 0);
 		return;
 	}
 
 	for (i = 0; i < num_memblks; i++) {
-		rs = max(start, node_memblk[i].start_paddr);
-		re = min(end, node_memblk[i].start_paddr+node_memblk[i].size);
+		rs = max(__pa(start), node_memblk[i].start_paddr);
+		re = min(__pa(end), node_memblk[i].start_paddr+node_memblk[i].size);
 
 		if (rs < re)
-			(*func)(rs, re-rs, node_memblk[i].nid,
-				node_memblk[i].bank);
+			(*func)((unsigned long)__va(rs), (unsigned long)__va(re), node_memblk[i].nid);
+
+		if ((unsigned long)__va(re) == end)
+			break;
 	}
 }
 
@@ -180,7 +179,7 @@
 filter_rsvd_memory (unsigned long start, unsigned long end, void *arg)
 {
 	unsigned long range_start, range_end, prev_start;
-	void (*func)(unsigned long, unsigned long);
+	void (*func)(unsigned long, unsigned long, int);
 	int i;
 
 #if IGNORE_PFN0
@@ -202,9 +201,9 @@
 
 		if (range_start < range_end)
 #ifdef CONFIG_DISCONTIGMEM
-			call_pernode_memory(__pa(range_start), __pa(range_end), func);
+			call_pernode_memory(range_start, range_end, func);
 #else
-			(*func)(__pa(range_start), range_end - range_start);
+			(*func)(range_start, range_end, 0);
 #endif
 
 		/* nothing more available in this segment */
@@ -703,6 +702,8 @@
 	 * get_free_pages() cannot be used before cpu_init() done.  BSP allocates
 	 * "NR_CPUS" pages for all CPUs to avoid that AP calls get_zeroed_page().
 	 */
+#ifndef CONFIG_DISCONTIGMEM
+	/* for discontig machines, we do this in discontig.c */
 	if (smp_processor_id() == 0) {
 		cpu_data = __alloc_bootmem(PERCPU_PAGE_SIZE * NR_CPUS, PERCPU_PAGE_SIZE,
 					   __pa(MAX_DMA_ADDRESS));
@@ -714,6 +715,7 @@
 			per_cpu(local_per_cpu_offset, cpu) = __per_cpu_offset[cpu];
 		}
 	}
+#endif
 	cpu_data = __per_cpu_start + __per_cpu_offset[smp_processor_id()];
 #else /* !CONFIG_SMP */
 	cpu_data = __phys_per_cpu_start;
diff -Nru a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
--- a/arch/ia64/mm/discontig.c	Wed Jul 16 12:50:02 2003
+++ b/arch/ia64/mm/discontig.c	Wed Jul 16 12:50:02 2003
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2000 Silicon Graphics, Inc.  All rights reserved.
+ * Copyright (c) 2000, 2003 Silicon Graphics, Inc.  All rights reserved.
  * Copyright (c) 2001 Intel Corp.
  * Copyright (c) 2001 Tony Luck <tony.luck@intel.com>
  * Copyright (c) 2002 NEC Corp.
@@ -16,74 +16,60 @@
 #include <linux/mmzone.h>
 #include <linux/acpi.h>
 #include <linux/efi.h>
-
+#include <asm/pgalloc.h>
+#include <asm/tlb.h>
 
 /*
- * Round an address upward to the next multiple of GRANULE size.
+ * Round an address upward or downward to the next multiple of IA64_GRANULE_SIZE.
  */
+#define GRANULEROUNDDOWN(n) ((n) & ~(IA64_GRANULE_SIZE-1))
 #define GRANULEROUNDUP(n) (((n)+IA64_GRANULE_SIZE-1) & ~(IA64_GRANULE_SIZE-1))
 
-static struct ia64_node_data	*node_data[NR_NODES];
-static long			boot_pg_data[8*NR_NODES+sizeof(pg_data_t)]  __initdata;
-static pg_data_t		*pg_data_ptr[NR_NODES] __initdata;
-static bootmem_data_t		bdata[NR_NODES][NR_BANKS_PER_NODE+1] __initdata;
-
-extern int  filter_rsvd_memory (unsigned long start, unsigned long end, void *arg);
+/*
+ * Used to locate a node's pg_data_t prior to initializing the node data area.
+ */
+#define BOOT_NODE_DATA(node)	pg_data_ptr[node]
 
 /*
- * Return the compact node number of this cpu. Used prior to
- * setting up the cpu_data area.
- *	Note - not fast, intended for boot use only!!
+ * To prevent cache aliasing effects, align per-node structures so that they 
+ * start at addresses that are strided by node number.
  */
-int
-boot_get_local_nodeid(void)
-{
-	int	i;
+#define NODEDATA_ALIGN(addr, node)	((((addr) + 1024*1024-1) & ~(1024*1024-1)) + (node)*PERCPU_PAGE_SIZE)
 
-	for (i = 0; i < NR_CPUS; i++)
-		if (node_cpuid[i].phys_id == hard_smp_processor_id())
-			return node_cpuid[i].nid;
 
-	/* node info missing, so nid should be 0.. */
-	return 0;
-}
+static struct ia64_node_data	*boot_node_data[NR_NODES] __initdata;
+static pg_data_t		*pg_data_ptr[NR_NODES] __initdata;
+static bootmem_data_t		bdata[NR_NODES] __initdata;
+static unsigned long		boot_pernode[NR_NODES] __initdata;
+static unsigned long		boot_pernodesize[NR_NODES] __initdata;
 
-/*
- * Return a pointer to the pg_data structure for a node.
- * This function is used ONLY in early boot before the cpu_data
- * structure is available.
- */
-pg_data_t* __init
-boot_get_pg_data_ptr(long node)
-{
-	return pg_data_ptr[node];
-}
+extern char __per_cpu_start[], __per_cpu_end[];
 
 
-/*
- * Return a pointer to the node data for the current node.
- *	(boottime initialization only)
- */
-struct ia64_node_data *
+struct ia64_node_data*
 get_node_data_ptr(void)
 {
-	return node_data[boot_get_local_nodeid()];
+	return boot_node_data[cpu_to_node_map[smp_processor_id()]];	/* ZZZ */
 }
 
 /*
  * We allocate one of the bootmem_data_t structs for each piece of memory
  * that we wish to treat as a contiguous block.  Each such block must start
- * on a BANKSIZE boundary.  Multiple banks per node is not supported.
+ * on a GRANULE boundary.  Multiple banks per node are not supported.
+ *   (Note: on SN2, all memory on a node is treated as a single bank.
+ *   Holes within the bank are supported. This works because memory
+ *   from different banks is not interleaved. The bootmap bitmap
+ *   for the node is somewhat large but not too large).
  */
 static int __init
-build_maps(unsigned long pstart, unsigned long length, int node)
+build_maps(unsigned long start, unsigned long end, int node)
 {
 	bootmem_data_t	*bdp;
 	unsigned long cstart, epfn;
 
-	bdp = pg_data_ptr[node]->bdata;
-	epfn = GRANULEROUNDUP(pstart + length) >> PAGE_SHIFT;
-	cstart = pstart & ~(BANKSIZE - 1);
+	bdp = &bdata[node];
+	epfn = GRANULEROUNDUP(__pa(end)) >> PAGE_SHIFT;
+	cstart = GRANULEROUNDDOWN(__pa(start));
 
 	if (!bdp->node_low_pfn) {
 		bdp->node_boot_start = cstart;
@@ -99,34 +85,96 @@
 	return 0;
 }
 
+
+/*
+ * Count the number of cpus on the node
+ */
+static __inline__ int
+count_cpus(int node)
+{
+	int cpu, n=0;
+
+	for (cpu=0; cpu < NR_CPUS; cpu++)
+		if (node == node_cpuid[cpu].nid)
+			n++;
+	return n;
+}
+
+
 /*
- * Find space on each node for the bootmem map.
+ * Find space on each node for the bootmem map & other per-node data structures.
  *
  * Called by efi_memmap_walk to find boot memory on each node. Note that
  * only blocks that are free are passed to this routine (currently filtered by
  * free_available_memory).
  */
 static int __init
-find_bootmap_space(unsigned long pstart, unsigned long length, int node)
+find_pernode_space(unsigned long start, unsigned long end, int node)
 {
-	unsigned long	mapsize, pages, epfn;
+	unsigned long	mapsize, pages, epfn, map=0, cpu, cpus;
+	unsigned long	pernodesize=0, pernode;
+	unsigned long	cpu_data;
+	unsigned long	pstart, length;
 	bootmem_data_t	*bdp;
 
+	pstart = __pa(start);
+	length = end - start;
 	epfn = (pstart + length) >> PAGE_SHIFT;
-	bdp = &pg_data_ptr[node]->bdata[0];
+	bdp = &bdata[node];
 
 	if (pstart < bdp->node_boot_start || epfn > bdp->node_low_pfn)
 		return 0;
 
-	if (!bdp->node_bootmem_map) {
+	if (!boot_pernode[node]) {
+		cpus = count_cpus(node);
+		pernodesize += PERCPU_PAGE_SIZE * cpus;
+		pernodesize += L1_CACHE_ALIGN(sizeof(pg_data_t));
+		pernodesize += L1_CACHE_ALIGN(sizeof(struct ia64_node_data));
+		pernodesize = PAGE_ALIGN(pernodesize);
+		pernode = NODEDATA_ALIGN(pstart, node);
+	
+		if (pstart + length > (pernode + pernodesize)) {
+			boot_pernode[node] = pernode;
+			boot_pernodesize[node] = pernodesize;
+			memset(__va(pernode), 0, pernodesize);
+
+			cpu_data = pernode;
+			pernode += PERCPU_PAGE_SIZE * cpus;
+
+			pg_data_ptr[node] = __va(pernode);
+			pernode += L1_CACHE_ALIGN(sizeof(pg_data_t));
+
+			boot_node_data[node] = __va(pernode);
+			pernode += L1_CACHE_ALIGN(sizeof(struct ia64_node_data));
+
+			pg_data_ptr[node]->bdata = &bdata[node];
+			pernode += L1_CACHE_ALIGN(sizeof(pg_data_t));
+
+			for (cpu=0; cpu < NR_CPUS; cpu++) {
+				if (node == node_cpuid[cpu].nid) {
+					extern char __per_cpu_start[], __phys_per_cpu_start[];
+					memcpy((void*)cpu_data, __phys_per_cpu_start, __per_cpu_end - __per_cpu_start);
+					__per_cpu_offset[cpu] = (char*)__va(cpu_data) - __per_cpu_start;
+					cpu_data +=  PERCPU_PAGE_SIZE;
+				}
+			}
+		}
+	}
+
+	pernode = boot_pernode[node];
+	pernodesize = boot_pernodesize[node];
+	if (pernode && !bdp->node_bootmem_map) {
 		pages = bdp->node_low_pfn - (bdp->node_boot_start>>PAGE_SHIFT);
 		mapsize = bootmem_bootmap_pages(pages) << PAGE_SHIFT;
-		if (length > mapsize) {
-			init_bootmem_node(
-				BOOT_NODE_DATA(node),
-				pstart>>PAGE_SHIFT, 
-				bdp->node_boot_start>>PAGE_SHIFT,
-				bdp->node_low_pfn);
+
+		if (pernode - pstart > mapsize)
+			map = pstart;
+		else if (pstart + length - pernode - pernodesize > mapsize)
+			map = pernode + pernodesize;
+
+		if (map) {
+			init_bootmem_node(BOOT_NODE_DATA(node),	map>>PAGE_SHIFT, 
+				bdp->node_boot_start>>PAGE_SHIFT, bdp->node_low_pfn);
 		}
 
 	}
@@ -143,9 +191,9 @@
  *
  */
 static int __init
-discontig_free_bootmem_node(unsigned long pstart, unsigned long length, int node)
+discontig_free_bootmem_node(unsigned long start, unsigned long end, int node)
 {
-	free_bootmem_node(BOOT_NODE_DATA(node), pstart, length);
+	free_bootmem_node(BOOT_NODE_DATA(node), __pa(start), end - start);
 
 	return 0;
 }
@@ -158,53 +206,50 @@
 discontig_reserve_bootmem(void)
 {
 	int		node;
-	unsigned long	mapbase, mapsize, pages;
+	unsigned long	base, size, pages;
 	bootmem_data_t	*bdp;
 
 	for (node = 0; node < numnodes; node++) {
 		bdp = BOOT_NODE_DATA(node)->bdata;
 
 		pages = bdp->node_low_pfn - (bdp->node_boot_start>>PAGE_SHIFT);
-		mapsize = bootmem_bootmap_pages(pages) << PAGE_SHIFT;
-		mapbase = __pa(bdp->node_bootmem_map);
-		reserve_bootmem_node(BOOT_NODE_DATA(node), mapbase, mapsize);
+		size = bootmem_bootmap_pages(pages) << PAGE_SHIFT;
+		base = __pa(bdp->node_bootmem_map);
+		reserve_bootmem_node(BOOT_NODE_DATA(node), base, size);
+
+		size = boot_pernodesize[node];
+		base = __pa(boot_pernode[node]);
+		reserve_bootmem_node(BOOT_NODE_DATA(node), base, size);
 	}
 }
 
 /*
- * Allocate per node tables.
- * 	- the pg_data structure is allocated on each node. This minimizes offnode 
- *	  memory references
- *	- the node data is allocated & initialized. Portions of this structure is read-only (after 
- *	  boot) and contains node-local pointers to usefuls data structures located on
- *	  other nodes.
+ * Initialize per-node data
+ *
+ * Finish setting up the node data for this node, then copy it to the other nodes.
  *
- * We also switch to using the "real" pg_data structures at this point. Earlier in boot, we
- * use a different structure. The only use for pg_data prior to the point in boot is to get 
- * the pointer to the bdata for the node.
  */
 static void __init
-allocate_pernode_structures(void)
+initialize_pernode_data(void)
 {
-	pg_data_t	*pgdat=0, *new_pgdat_list=0;
-	int		node, mynode;
+	int	cpu, node;
+
+	memcpy(boot_node_data[0]->pg_data_ptrs, pg_data_ptr, sizeof(pg_data_ptr));
+	memcpy(boot_node_data[0]->node_data_ptrs, boot_node_data, sizeof(boot_node_data));
 
-	mynode = boot_get_local_nodeid();
-	for (node = numnodes - 1; node >= 0 ; node--) {
-		node_data[node] = alloc_bootmem_node(BOOT_NODE_DATA(node), sizeof (struct ia64_node_data));
-		pgdat = __alloc_bootmem_node(BOOT_NODE_DATA(node), sizeof(pg_data_t), SMP_CACHE_BYTES, 0);
-		pgdat->bdata = &(bdata[node][0]);
-		pg_data_ptr[node] = pgdat;
-		pgdat->pgdat_next = new_pgdat_list;
-		new_pgdat_list = pgdat;
+	for (node=1; node < numnodes; node++) {
+		memcpy(boot_node_data[node], boot_node_data[0], sizeof(struct ia64_node_data));
+		boot_node_data[node]->node = node;
 	}
-	
-	memcpy(node_data[mynode]->pg_data_ptrs, pg_data_ptr, sizeof(pg_data_ptr));
-	memcpy(node_data[mynode]->node_data_ptrs, node_data, sizeof(node_data));
 
-	pgdat_list = new_pgdat_list;
+	for (cpu=0; cpu < NR_CPUS; cpu++) {
+		node = node_cpuid[cpu].nid;
+		per_cpu(cpu_info, cpu).node_data = boot_node_data[node];
+		per_cpu(cpu_info, cpu).nodeid = node;
+	}
 }
 
+
 /*
  * Called early in boot to setup the boot memory allocator, and to
  * allocate the node-local pg_data & node-directory data structures..
@@ -212,96 +257,19 @@
 void __init
 discontig_mem_init(void)
 {
-	int	node;
-
 	if (numnodes == 0) {
 		printk(KERN_ERR "node info missing!\n");
 		numnodes = 1;
 	}
 
-	for (node = 0; node < numnodes; node++) {
-		pg_data_ptr[node] = (pg_data_t*) &boot_pg_data[node];
-		pg_data_ptr[node]->bdata = &bdata[node][0];
-	}
-
 	min_low_pfn = -1;
 	max_low_pfn = 0;
 
         efi_memmap_walk(filter_rsvd_memory, build_maps);
-        efi_memmap_walk(filter_rsvd_memory, find_bootmap_space);
+        efi_memmap_walk(filter_rsvd_memory, find_pernode_space);
         efi_memmap_walk(filter_rsvd_memory, discontig_free_bootmem_node);
-	discontig_reserve_bootmem();
-	allocate_pernode_structures();
-}
-
-/*
- * Initialize the paging system.
- *	- determine sizes of each node
- *	- initialize the paging system for the node
- *	- build the nodedir for the node. This contains pointers to
- *	  the per-bank mem_map entries.
- *	- fix the page struct "virtual" pointers. These are bank specific
- *	  values that the paging system doesn't understand.
- *	- replicate the nodedir structure to other nodes	
- */ 
-
-void __init
-discontig_paging_init(void)
-{
-	int		node, mynode;
-	unsigned long	max_dma, zones_size[MAX_NR_ZONES];
-	unsigned long	kaddr, ekaddr, bid;
-	struct page	*page;
-	bootmem_data_t	*bdp;
-
-	max_dma = virt_to_phys((void *) MAX_DMA_ADDRESS) >> PAGE_SHIFT;
 
-	mynode = boot_get_local_nodeid();
-	for (node = 0; node < numnodes; node++) {
-		long pfn, startpfn;
-
-		memset(zones_size, 0, sizeof(zones_size));
-
-		startpfn = -1;
-		bdp = BOOT_NODE_DATA(node)->bdata;
-		pfn = bdp->node_boot_start >> PAGE_SHIFT;
-		if (startpfn == -1)
-			startpfn = pfn;
-		if (pfn > max_dma)
-			zones_size[ZONE_NORMAL] += (bdp->node_low_pfn - pfn);
-		else if (bdp->node_low_pfn < max_dma)
-			zones_size[ZONE_DMA] += (bdp->node_low_pfn - pfn);
-		else {
-			zones_size[ZONE_DMA] += (max_dma - pfn);
-			zones_size[ZONE_NORMAL] += (bdp->node_low_pfn - max_dma);
-		}
-
-		free_area_init_node(node, NODE_DATA(node), NULL, zones_size, startpfn, 0);
-
-		page = NODE_DATA(node)->node_mem_map;
-
-		bdp = BOOT_NODE_DATA(node)->bdata;
-
-		kaddr = (unsigned long)__va(bdp->node_boot_start);
-		ekaddr = (unsigned long)__va(bdp->node_low_pfn << PAGE_SHIFT);
-		while (kaddr < ekaddr) {
-			if (paddr_to_nid(__pa(kaddr)) == node) {
-				bid = BANK_MEM_MAP_INDEX(kaddr);
-				node_data[mynode]->node_id_map[bid] = node;
-				node_data[mynode]->bank_mem_map_base[bid] = page;
-			}
-			kaddr += BANKSIZE;
-			page += BANKSIZE/PAGE_SIZE;
-		}
-	}
-
-	/*
-	 * Finish setting up the node data for this node, then copy it to the other nodes.
-	 */
-	for (node=0; node < numnodes; node++)
-		if (mynode != node) {
-			memcpy(node_data[node], node_data[mynode], sizeof(struct ia64_node_data));
-			node_data[node]->node = node;
-		}
+	discontig_reserve_bootmem();
+	initialize_pernode_data();
 }
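
To make the NODEDATA_ALIGN() stride concrete, a worked example with
assumed numbers (PERCPU_PAGE_SIZE is 64KB here):

	/*
	 * NODEDATA_ALIGN(addr, node) rounds addr up to a 1MB boundary and
	 * then offsets it by node * PERCPU_PAGE_SIZE, so the per-node
	 * structures of different nodes land in different cache sets:
	 *
	 *   NODEDATA_ALIGN(0x4321000, 2)
	 *     = ((0x4321000 + 0xfffff) & ~0xfffff) + 2*0x10000
	 *     = 0x4400000 + 0x20000
	 *     = 0x4420000
	 */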
 
diff -Nru a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
--- a/arch/ia64/mm/init.c	Wed Jul 16 12:50:02 2003
+++ b/arch/ia64/mm/init.c	Wed Jul 16 12:50:02 2003
@@ -44,7 +44,7 @@
 #ifdef CONFIG_VIRTUAL_MEM_MAP
 # define LARGE_GAP	0x40000000	/* Use virtual mem map if hole is > than this */
   unsigned long vmalloc_end = VMALLOC_END_INIT;
-  static struct page *vmem_map;
+  struct page *vmem_map;
   static unsigned long num_dma_physpages;
 #endif
 
@@ -240,7 +240,7 @@
 				else if (page_count(pgdat->node_mem_map + i))
 					shared += page_count(pgdat->node_mem_map + i) - 1;
 			}
-			printk("\t%d pages of RAM\n", pgdat->node_spanned_pages);
+			printk("\t%ld pages of RAM\n", pgdat->node_spanned_pages);
 			printk("\t%d reserved pages\n", reserved);
 			printk("\t%d pages shared\n", shared);
 			printk("\t%d pages swap cached\n", cached);
@@ -397,6 +397,7 @@
 {
 	unsigned long address, start_page, end_page;
 	struct page *map_start, *map_end;
+	int node;
 	pgd_t *pgd;
 	pmd_t *pmd;
 	pte_t *pte;
@@ -406,19 +407,20 @@
 
 	start_page = (unsigned long) map_start & PAGE_MASK;
 	end_page = PAGE_ALIGN((unsigned long) map_end);
+	node = paddr_to_nid(__pa(start));
 
 	for (address = start_page; address < end_page; address += PAGE_SIZE) {
 		pgd = pgd_offset_k(address);
 		if (pgd_none(*pgd))
-			pgd_populate(&init_mm, pgd, alloc_bootmem_pages(PAGE_SIZE));
+			pgd_populate(&init_mm, pgd, alloc_bootmem_pages_node(NODE_DATA(node), PAGE_SIZE));
 		pmd = pmd_offset(pgd, address);
 
 		if (pmd_none(*pmd))
-			pmd_populate_kernel(&init_mm, pmd, alloc_bootmem_pages(PAGE_SIZE));
+			pmd_populate_kernel(&init_mm, pmd, alloc_bootmem_pages_node(NODE_DATA(node), PAGE_SIZE));
 		pte = pte_offset_kernel(pmd, address);
 
 		if (pte_none(*pte))
-			set_pte(pte, pfn_pte(__pa(alloc_bootmem_pages(PAGE_SIZE)) >> PAGE_SHIFT,
+			set_pte(pte, pfn_pte(__pa(alloc_bootmem_pages_node(NODE_DATA(node), PAGE_SIZE)) >> PAGE_SHIFT,
 					     PAGE_KERNEL));
 	}
 	return 0;
@@ -431,6 +433,14 @@
 	unsigned long zone;
 };
 
+struct memmap_count_callback_data {
+	int node;
+	unsigned long num_physpages;
+	unsigned long num_dma_physpages;
+	unsigned long min_pfn;
+	unsigned long max_pfn;
+} cdata;
+
 static int
 virtual_memmap_init (u64 start, u64 end, void *arg)
 {
@@ -489,16 +499,6 @@
 }
 
 static int
-count_dma_pages (u64 start, u64 end, void *arg)
-{
-	unsigned long *count = arg;
-
-	if (end <= MAX_DMA_ADDRESS)
-		*count += (end - start) >> PAGE_SHIFT;
-	return 0;
-}
-
-static int
 find_largest_hole (u64 start, u64 end, void *arg)
 {
 	u64 *max_gap = arg;
@@ -514,102 +514,101 @@
 }
 #endif /* CONFIG_VIRTUAL_MEM_MAP */
 
+#define GRANULEROUNDDOWN(n) ((n) & ~(IA64_GRANULE_SIZE-1))
+#define GRANULEROUNDUP(n) (((n)+IA64_GRANULE_SIZE-1) & ~(IA64_GRANULE_SIZE-1))
+#define ORDERROUNDDOWN(n) ((n) & ~((PAGE_SIZE<<MAX_ORDER)-1))
 static int
-count_pages (u64 start, u64 end, void *arg)
+count_pages (unsigned long start, unsigned long end, int node)
 {
-	unsigned long *count = arg;
+	start = __pa(start);
+	end = __pa(end);
 
-	*count += (end - start) >> PAGE_SHIFT;
+	if (node == cdata.node) {
+		cdata.num_physpages += (end - start) >> PAGE_SHIFT;
+		if (start <= __pa(MAX_DMA_ADDRESS))
+			cdata.num_dma_physpages += (min(end, __pa(MAX_DMA_ADDRESS)) - start) >> PAGE_SHIFT;
+		start = GRANULEROUNDDOWN(start);
+		start = ORDERROUNDDOWN(start);
+		end = GRANULEROUNDUP(end);
+		cdata.max_pfn = max(cdata.max_pfn, end >> PAGE_SHIFT);
+		cdata.min_pfn = min(cdata.min_pfn, start >> PAGE_SHIFT);
+	}
 	return 0;
 }
 
 /*
  * Set up the page tables.
  */
-
-#ifdef CONFIG_DISCONTIGMEM
 void
 paging_init (void)
 {
-	extern void discontig_paging_init(void);
-
-	discontig_paging_init();
-	efi_memmap_walk(count_pages, &num_physpages);
-	zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
-}
-#else /* !CONFIG_DISCONTIGMEM */
-void
-paging_init (void)
-{
-	unsigned long max_dma;
+	unsigned long max_dma_pfn;
 	unsigned long zones_size[MAX_NR_ZONES];
 #  ifdef CONFIG_VIRTUAL_MEM_MAP
 	unsigned long zholes_size[MAX_NR_ZONES];
 	unsigned long max_gap;
 #  endif
+	int node;
 
-	/* initialize mem_map[] */
-
-	memset(zones_size, 0, sizeof(zones_size));
-
-	num_physpages = 0;
-	efi_memmap_walk(count_pages, &num_physpages);
-
-	max_dma = virt_to_phys((void *) MAX_DMA_ADDRESS) >> PAGE_SHIFT;
-
-#  ifdef CONFIG_VIRTUAL_MEM_MAP
-	memset(zholes_size, 0, sizeof(zholes_size));
-
-	num_dma_physpages = 0;
-	efi_memmap_walk(count_dma_pages, &num_dma_physpages);
-
-	if (max_low_pfn < max_dma) {
-		zones_size[ZONE_DMA] = max_low_pfn;
-		zholes_size[ZONE_DMA] = max_low_pfn - num_dma_physpages;
-	} else {
-		zones_size[ZONE_DMA] = max_dma;
-		zholes_size[ZONE_DMA] = max_dma - num_dma_physpages;
-		if (num_physpages > num_dma_physpages) {
-			zones_size[ZONE_NORMAL] = max_low_pfn - max_dma;
-			zholes_size[ZONE_NORMAL] = ((max_low_pfn - max_dma)
-						    - (num_physpages - num_dma_physpages));
-		}
-	}
-
+	max_dma_pfn = virt_to_phys((void *) MAX_DMA_ADDRESS) >> PAGE_SHIFT;
 	max_gap = 0;
 	efi_memmap_walk(find_largest_hole, (u64 *)&max_gap);
-	if (max_gap < LARGE_GAP) {
-		vmem_map = (struct page *) 0;
-		free_area_init_node(0, &contig_page_data, NULL, zones_size, 0, zholes_size);
-		mem_map = contig_page_data.node_mem_map;
-	}
-	else {
-		unsigned long map_size;
-
-		/* allocate virtual_mem_map */
 
-		map_size = PAGE_ALIGN(max_low_pfn * sizeof(struct page));
-		vmalloc_end -= map_size;
-		vmem_map = (struct page *) vmalloc_end;
-		efi_memmap_walk(create_mem_map_page_table, 0);
-
-		free_area_init_node(0, &contig_page_data, vmem_map, zones_size, 0, zholes_size);
+	for (node = 0; node < numnodes; node++) {
+		memset(zones_size, 0, sizeof(zones_size));
+		memset(zholes_size, 0, sizeof(zholes_size));
+		memset(&cdata, 0, sizeof(cdata));
+
+		cdata.node = node;
+		cdata.min_pfn = ~0;
+
+		efi_memmap_walk(filter_rsvd_memory, count_pages);
+		num_dma_physpages += cdata.num_dma_physpages;
+		num_physpages += cdata.num_physpages;
+
+		if (cdata.min_pfn >= max_dma_pfn) {
+			/* Above the DMA zone */
+			zones_size[ZONE_NORMAL] = cdata.max_pfn - cdata.min_pfn;
+			zholes_size[ZONE_NORMAL] = cdata.max_pfn - cdata.min_pfn - cdata.num_physpages;
+		} else if (cdata.max_pfn < max_dma_pfn) {
+			/* This block is DMAable */
+			zones_size[ZONE_DMA] = cdata.max_pfn - cdata.min_pfn;
+			zholes_size[ZONE_DMA] = cdata.max_pfn - cdata.min_pfn - cdata.num_dma_physpages;
+		} else {
+			zones_size[ZONE_DMA] = max_dma_pfn - cdata.min_pfn;
+			zholes_size[ZONE_DMA] = zones_size[ZONE_DMA] - cdata.num_dma_physpages;
+			zones_size[ZONE_NORMAL] = cdata.max_pfn - max_dma_pfn;
+			zholes_size[ZONE_NORMAL] = zones_size[ZONE_NORMAL] - (cdata.num_physpages - cdata.num_dma_physpages);
+		}
 
-		mem_map = contig_page_data.node_mem_map;
-		printk("Virtual mem_map starts at 0x%p\n", mem_map);
-	}
-#  else /* !CONFIG_VIRTUAL_MEM_MAP */
-	if (max_low_pfn < max_dma)
-		zones_size[ZONE_DMA] = max_low_pfn;
-	else {
-		zones_size[ZONE_DMA] = max_dma;
-		zones_size[ZONE_NORMAL] = max_low_pfn - max_dma;
+		if (numnodes == 1 && max_gap < LARGE_GAP) {
+			/* Just one node with no big holes... */
+			vmem_map = (struct page *)0;
+			zones_size[ZONE_DMA] += cdata.min_pfn;
+			zholes_size[ZONE_DMA] += cdata.min_pfn;
+			free_area_init_node(0, NODE_DATA(node), NODE_DATA(node)->node_mem_map,
+					    zones_size, 0, zholes_size);
+		}
+		else {
+			/* allocate virtual mem_map */
+			if (node == 0) {
+				unsigned long map_size;
+				map_size = PAGE_ALIGN(max_low_pfn*sizeof(struct page));
+				vmalloc_end -= map_size;
+				vmem_map = (struct page *) vmalloc_end;
+				efi_memmap_walk(create_mem_map_page_table, 0);
+				printk("Virtual mem_map starts at 0x%p\n", vmem_map);
+#ifndef CONFIG_DISCONTIGMEM
+				mem_map = vmem_map;
+#endif
+			}
+			free_area_init_node(node, NODE_DATA(node), vmem_map + cdata.min_pfn,
+					    zones_size, cdata.min_pfn, zholes_size);
+		}
 	}
-	free_area_init(zones_size);
-#  endif /* !CONFIG_VIRTUAL_MEM_MAP */
+
 	zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
 }
-#endif /* !CONFIG_DISCONTIGMEM */
 
 static int
 count_reserved_pages (u64 start, u64 end, void *arg)
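
A worked example of the per-node zone split in paging_init() above
(numbers assumed for illustration): with max_dma_pfn = 0x100000 and a
node where cdata.min_pfn = 0x80000, cdata.max_pfn = 0x180000,
cdata.num_physpages = 0xc0000 and cdata.num_dma_physpages = 0x60000,
the straddling branch yields:

	zones_size[ZONE_DMA]     = 0x100000 - 0x80000             = 0x80000
	zholes_size[ZONE_DMA]    = 0x80000  - 0x60000             = 0x20000
	zones_size[ZONE_NORMAL]  = 0x180000 - 0x100000            = 0x80000
	zholes_size[ZONE_NORMAL] = 0x80000  - (0xc0000 - 0x60000) = 0x20000
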
diff -Nru a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
--- a/drivers/acpi/Kconfig	Wed Jul 16 12:50:02 2003
+++ b/drivers/acpi/Kconfig	Wed Jul 16 12:50:02 2003
@@ -133,7 +133,7 @@
 
 config ACPI_NUMA
 	bool "NUMA support" if NUMA && (IA64 && !IA64_HP_SIM || X86 && ACPI && !ACPI_HT_ONLY && !X86_64)
-	default y if IA64 && IA64_SGI_SN
+	default y if IA64_GENERIC || IA64_SGI_SN2
 
 config ACPI_ASUS
         tristate "ASUS/Medion Laptop Extras"
diff -Nru a/include/asm-ia64/mmzone.h b/include/asm-ia64/mmzone.h
--- a/include/asm-ia64/mmzone.h	Wed Jul 16 12:50:02 2003
+++ b/include/asm-ia64/mmzone.h	Wed Jul 16 12:50:02 2003
@@ -3,7 +3,7 @@
  * License.  See the file "COPYING" in the main directory of this archive
  * for more details.
  *
- * Copyright (c) 2000 Silicon Graphics, Inc.  All rights reserved.
+ * Copyright (c) 2000,2003 Silicon Graphics, Inc.  All rights reserved.
  * Copyright (c) 2002 NEC Corp.
  * Copyright (c) 2002 Erich Focht <efocht@ess.nec.de>
  * Copyright (c) 2002 Kimio Suganuma <k-suganuma@da.jp.nec.com>
@@ -14,150 +14,50 @@
 #include <linux/config.h>
 #include <linux/init.h>
 
-/*
- * Given a kaddr, find the base mem_map address for the start of the mem_map
- * entries for the bank containing the kaddr.
- */
-#define BANK_MEM_MAP_BASE(kaddr) local_node_data->bank_mem_map_base[BANK_MEM_MAP_INDEX(kaddr)]
-
-/*
- * Given a kaddr, this macro return the relative map number 
- * within the bank.
- */
-#define BANK_MAP_NR(kaddr) 	(BANK_OFFSET(kaddr) >> PAGE_SHIFT)
 
-/*
- * Given a pte, this macro returns a pointer to the page struct for the pte.
- */
-#define pte_page(pte)	virt_to_page(PAGE_OFFSET | (pte_val(pte)&_PFN_MASK))
+#ifdef CONFIG_NUMA
 
-/*
- * Determine if a kaddr is a valid memory address of memory that
- * actually exists. 
- *
- * The check consists of 2 parts:
- *	- verify that the address is a region 7 address & does not 
- *	  contain any bits that preclude it from being a valid platform
- *	  memory address
- *	- verify that the chunk actually exists.
- *
- * Note that IO addresses are NOT considered valid addresses.
- *
- * Note, many platforms can simply check if kaddr exceeds a specific size.  
- *	(However, this won't work on SGI platforms since IO space is embedded 
- * 	within the range of valid memory addresses & nodes have holes in the 
- *	address range between banks). 
- */
-#define kern_addr_valid(kaddr)		({long _kav=(long)(kaddr);	\
-					VALID_MEM_KADDR(_kav);})
-
-/*
- * Given a kaddr, return a pointer to the page struct for the page.
- * If the kaddr does not represent RAM memory that potentially exists, return
- * a pointer the page struct for max_mapnr. IO addresses will
- * return the page for max_nr. Addresses in unpopulated RAM banks may
- * return undefined results OR may panic the system.
- *
- */
-#define virt_to_page(kaddr)	({long _kvtp=(long)(kaddr);	\
-				(VALID_MEM_KADDR(_kvtp))	\
-					? BANK_MEM_MAP_BASE(_kvtp) + BANK_MAP_NR(_kvtp)	\
-					: NULL;})
+#ifdef CONFIG_IA64_DIG
 
 /*
- * Given a page struct entry, return the physical address that the page struct represents.
- * Since IA64 has all memory in the DMA zone, the following works:
+ * Platform definitions for DIG platform with contiguous memory.
  */
-#define page_to_phys(page)	__pa(page_address(page))
-
-#define node_mem_map(nid)	(NODE_DATA(nid)->node_mem_map)
+#define MAX_PHYSNODE_ID	8		/* Maximum node number +1 */
+#define NR_NODES	8		/* Maximum number of nodes in SSI */
+#define NR_MEMBLKS	(NR_NODES * 32)
 
-#define node_localnr(pfn, nid)	((pfn) - NODE_DATA(nid)->node_start_pfn)
 
-#define pfn_to_page(pfn)	(struct page *)(node_mem_map(pfn_to_nid(pfn)) + node_localnr(pfn, pfn_to_nid(pfn)))
 
-#define pfn_to_nid(pfn)		 local_node_data->node_id_map[(pfn << PAGE_SHIFT) >> BANKSHIFT]
-
-#define page_to_pfn(page)	(long)((page - page_zone(page)->zone_mem_map) + page_zone(page)->zone_start_pfn)
 
+#elif CONFIG_IA64_SGI_SN2
 
 /*
- * pfn_valid should be made as fast as possible, and the current definition
- * is valid for machines that are NUMA, but still contiguous, which is what
- * is currently supported. A more generalised, but slower definition would
- * be something like this - mbligh:
- * ( pfn_to_pgdat(pfn) && (pfn < node_end_pfn(pfn_to_nid(pfn))) )
+ * Platform definitions for SGI SN2 platform.
  */
-#define pfn_valid(pfn)          (pfn < max_low_pfn)
-extern unsigned long max_low_pfn;
+#define MAX_PHYSNODE_ID	2048		/* Maximum node number +1 */
+#define NR_NODES	256		/* Maximum number of compute nodes in SSI */
+#define NR_MEMBLKS	(NR_NODES)
 
+#elif CONFIG_IA64_GENERIC
 
-#ifdef CONFIG_IA64_DIG
 
 /*
- * Platform definitions for DIG platform with contiguous memory.
+ * Platform definitions for GENERIC platform with contiguous or discontiguous memory.
  */
-#define MAX_PHYSNODE_ID	8	/* Maximum node number +1 */
-#define NR_NODES	8	/* Maximum number of nodes in SSI */
+#define MAX_PHYSNODE_ID 2048		/* Maximum node number +1 */
+#define NR_NODES        256		/* Maximum number of nodes in SSI */
+#define NR_MEMBLKS      (NR_NODES)
 
-#define MAX_PHYS_MEMORY	(1UL << 40)	/* 1 TB */
 
-/*
- * Bank definitions.
- * Configurable settings for DIG: 512MB/bank:  16GB/node,
- *                               2048MB/bank:  64GB/node,
- *                               8192MB/bank: 256GB/node.
- */
-#define NR_BANKS_PER_NODE	32
-#if defined(CONFIG_IA64_NODESIZE_16GB)
-# define BANKSHIFT		29
-#elif defined(CONFIG_IA64_NODESIZE_64GB)
-# define BANKSHIFT		31
-#elif defined(CONFIG_IA64_NODESIZE_256GB)
-# define BANKSHIFT		33
 #else
-# error Unsupported bank and nodesize!
+#error unknown platform
 #endif
-#define BANKSIZE		(1UL << BANKSHIFT)
-#define BANK_OFFSET(addr)	((unsigned long)(addr) & (BANKSIZE-1))
-#define NR_BANKS		(NR_BANKS_PER_NODE * NR_NODES)
 
-/*
- * VALID_MEM_KADDR returns a boolean to indicate if a kaddr is
- * potentially a valid cacheable identity mapped RAM memory address.
- * Note that the RAM may or may not actually be present!!
- */
-#define VALID_MEM_KADDR(kaddr)	1
+extern void build_cpu_to_node_map(void);
 
-/*
- * Given a nodeid & a bank number, find the address of the mem_map
- * entry for the first page of the bank.
- */
-#define BANK_MEM_MAP_INDEX(kaddr) \
-	(((unsigned long)(kaddr) & (MAX_PHYS_MEMORY-1)) >> BANKSHIFT)
+#else /* CONFIG_NUMA */
 
-#elif defined(CONFIG_IA64_SGI_SN2)
-/*
- * SGI SN2 discontig definitions
- */
-#define MAX_PHYSNODE_ID	2048	/* 2048 node ids (also called nasid) */
-#define NR_NODES	128	/* Maximum number of nodes in SSI */
-#define MAX_PHYS_MEMORY	(1UL << 49)
-
-#define BANKSHIFT		38
-#define NR_BANKS_PER_NODE	4
-#define SN2_NODE_SIZE		(64UL*1024*1024*1024)	/* 64GB per node */
-#define BANKSIZE		(SN2_NODE_SIZE/NR_BANKS_PER_NODE)
-#define BANK_OFFSET(addr)	((unsigned long)(addr) & (BANKSIZE-1))
-#define NR_BANKS		(NR_BANKS_PER_NODE * NR_NODES)
-#define VALID_MEM_KADDR(kaddr)	1
-
-/*
- * Given a nodeid & a bank number, find the address of the mem_map
- * entry for the first page of the bank.
- */
-#define BANK_MEM_MAP_INDEX(kaddr) \
-	(((unsigned long)(kaddr) & (MAX_PHYS_MEMORY-1)) >> BANKSHIFT)
+#define NR_NODES	1
 
-#endif /* CONFIG_IA64_DIG */
+#endif /* CONFIG_NUMA */
 #endif /* _ASM_IA64_MMZONE_H */
diff -Nru a/include/asm-ia64/nodedata.h b/include/asm-ia64/nodedata.h
--- a/include/asm-ia64/nodedata.h	Wed Jul 16 12:50:02 2003
+++ b/include/asm-ia64/nodedata.h	Wed Jul 16 12:50:02 2003
@@ -13,7 +13,7 @@
 #ifndef _ASM_IA64_NODEDATA_H
 #define _ASM_IA64_NODEDATA_H
 
-
+#include <asm/percpu.h>
 #include <asm/mmzone.h>
 
 /*
@@ -22,15 +22,17 @@
 
 struct pglist_data;
 struct ia64_node_data {
-	short			active_cpu_count;
 	short			node;
+	short			active_cpu_count;
+	/*
+	 * The fields are read-only (after boot). They contain pointers
+	 * to various structures located on other nodes. This data is
+	 * replicated on each node in order to reduce off-node references.
+	 */
         struct pglist_data	*pg_data_ptrs[NR_NODES];
-	struct page		*bank_mem_map_base[NR_BANKS];
 	struct ia64_node_data	*node_data_ptrs[NR_NODES];
-	short			node_id_map[NR_BANKS];
 };
 
-
 /*
  * Return a pointer to the node_data structure for the executing cpu.
  */
@@ -40,7 +42,8 @@
 /*
  * Return a pointer to the node_data structure for the specified node.
  */
-#define node_data(node)	(local_node_data->node_data_ptrs[node])
+#define node_data(node) (local_node_data->node_data_ptrs[node])
+#define NODE_DATA(nid) (local_node_data->pg_data_ptrs[nid])
 
 /*
  * Get a pointer to the node_id/node_data for the current cpu.
@@ -48,29 +51,5 @@
  */
 extern int boot_get_local_nodeid(void);
 extern struct ia64_node_data *get_node_data_ptr(void);
-
-/*
- * Given a node id, return a pointer to the pg_data_t for the node.
- * The following 2 macros are similar. 
- *
- * NODE_DATA 	- should be used in all code not related to system
- *		  initialization. It uses pernode data structures to minimize
- *		  offnode memory references. However, these structure are not 
- *		  present during boot. This macro can be used once cpu_init
- *		  completes.
- *
- * BOOT_NODE_DATA
- *		- should be used during system initialization 
- *		  prior to freeing __initdata. It does not depend on the percpu
- *		  area being present.
- *
- * NOTE:   The names of these macros are misleading but are difficult to change
- *	   since they are used in generic linux & on other architecures.
- */
-#define NODE_DATA(nid)		(local_node_data->pg_data_ptrs[nid])
-#define BOOT_NODE_DATA(nid)	boot_get_pg_data_ptr((long)(nid))
-
-struct pglist_data;
-extern struct pglist_data * __init boot_get_pg_data_ptr(long);
 
 #endif /* _ASM_IA64_NODEDATA_H */
diff -Nru a/include/asm-ia64/numa.h b/include/asm-ia64/numa.h
--- a/include/asm-ia64/numa.h	Wed Jul 16 12:50:02 2003
+++ b/include/asm-ia64/numa.h	Wed Jul 16 12:50:02 2003
@@ -15,13 +15,21 @@
 
 #ifdef CONFIG_DISCONTIGMEM
 # include <asm/mmzone.h>
-# define NR_MEMBLKS   (NR_BANKS)
 #else
 # define NR_NODES     (8)
 # define NR_MEMBLKS   (NR_NODES * 8)
 #endif
 
 #include <linux/cache.h>
+#include <linux/threads.h>
+#include <linux/smp.h>
+
+#define NODEMASK_WORDCOUNT       ((NR_NODES+(BITS_PER_LONG-1))/BITS_PER_LONG)
+
+#define NODE_MASK_NONE   { [0 ... ((NR_NODES+BITS_PER_LONG-1)/BITS_PER_LONG)-1] = 0 }
+
+typedef unsigned long   nodemask_t[NODEMASK_WORDCOUNT];
+
 extern volatile char cpu_to_node_map[NR_CPUS] __cacheline_aligned;
 extern volatile unsigned long node_to_cpu_mask[NR_NODES] __cacheline_aligned;
 
@@ -63,6 +71,12 @@
 extern int paddr_to_nid(unsigned long paddr);
 
 #define local_nodeid (cpu_to_node_map[smp_processor_id()])
+
+#else /* !CONFIG_NUMA */
+
+#define node_distance(from,to) 10
+#define paddr_to_nid(x) 0
+#define local_nodeid 0
 
 #endif /* CONFIG_NUMA */
 
diff -Nru a/include/asm-ia64/page.h b/include/asm-ia64/page.h
--- a/include/asm-ia64/page.h	Wed Jul 16 12:50:02 2003
+++ b/include/asm-ia64/page.h	Wed Jul 16 12:50:02 2003
@@ -93,18 +93,26 @@
 
 #define virt_addr_valid(kaddr)	pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
 
-#ifndef CONFIG_DISCONTIGMEM
-# ifdef CONFIG_VIRTUAL_MEM_MAP
-   extern int ia64_pfn_valid (unsigned long pfn);
-#  define pfn_valid(pfn)	(((pfn) < max_mapnr) && ia64_pfn_valid(pfn))
-# else
-#  define pfn_valid(pfn)	((pfn) < max_mapnr)
-# endif
-#define virt_to_page(kaddr)	pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
-#define page_to_pfn(page)	((unsigned long) (page - mem_map))
-#define pfn_to_page(pfn)	(mem_map + (pfn))
-#define page_to_phys(page)	(page_to_pfn(page) << PAGE_SHIFT)
+#ifdef CONFIG_VIRTUAL_MEM_MAP
+extern int ia64_pfn_valid(unsigned long pfn);
+#else
+#define ia64_pfn_valid(pfn) (1)
+#endif
+
+extern unsigned long max_low_pfn;
+#define pfn_valid(pfn) (((pfn) < max_low_pfn) && ia64_pfn_valid(pfn))
+
+#if defined(CONFIG_VIRTUAL_MEM_MAP) && !defined(CONFIG_DISCONTIGMEM)
+#define vmem_map mem_map
+#else
+extern struct page *vmem_map;
 #endif
+
+#define pfn_to_page(pfn)	(vmem_map + (pfn))
+#define page_to_pfn(page)	((unsigned long) (page - vmem_map))
+
+#define virt_to_page(kaddr)	(pfn_to_page(__pa(kaddr) >> PAGE_SHIFT))
+#define page_to_phys(page)	(page_to_pfn(page) << PAGE_SHIFT)
 
 typedef union ia64_va {
 	struct {
diff -Nru a/include/asm-ia64/pgtable.h b/include/asm-ia64/pgtable.h
--- a/include/asm-ia64/pgtable.h	Wed Jul 16 12:50:02 2003
+++ b/include/asm-ia64/pgtable.h	Wed Jul 16 12:50:02 2003
@@ -174,7 +174,6 @@
 	return (addr & (local_cpu_data->unimpl_pa_mask)) == 0;
 }
 
-#ifndef CONFIG_DISCONTIGMEM
 /*
  * kern_addr_valid(ADDR) tests if ADDR is pointing to valid kernel
 * memory.  For the return value to be meaningful, ADDR must be >=
@@ -190,7 +189,6 @@
  */
 #define kern_addr_valid(addr)	(1)
 
-#endif
 
 /*
  * Now come the defines and routines to manage and access the three-level
@@ -241,10 +239,8 @@
 #define pte_none(pte) 			(!pte_val(pte))
 #define pte_present(pte)		(pte_val(pte) & (_PAGE_P | _PAGE_PROTNONE))
 #define pte_clear(pte)			(pte_val(*(pte)) = 0UL)
-#ifndef CONFIG_DISCONTIGMEM
 /* pte_page() returns the "struct page *" corresponding to the PTE: */
 #define pte_page(pte)			virt_to_page(((pte_val(pte) & _PFN_MASK) + PAGE_OFFSET))
-#endif
 
 #define pmd_none(pmd)			(!pmd_val(pmd))
 #define pmd_bad(pmd)			(!ia64_phys_addr_valid(pmd_val(pmd)))
@@ -416,6 +412,7 @@
 
 extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
 extern void paging_init (void);
+extern int filter_rsvd_memory(unsigned long start, unsigned long end, void *arg);
 
 /*
  * Note: The macros below rely on the fact that MAX_SWAPFILES_SHIFT <= number of
diff -Nru a/include/asm-ia64/processor.h b/include/asm-ia64/processor.h
--- a/include/asm-ia64/processor.h	Wed Jul 16 12:50:02 2003
+++ b/include/asm-ia64/processor.h	Wed Jul 16 12:50:02 2003
@@ -185,6 +185,8 @@
 #endif
 #ifdef CONFIG_NUMA
 	struct ia64_node_data *node_data;
+	struct cpuinfo_ia64 *cpu_data[NR_CPUS];
+	int nodeid;
 #endif
 };
 
diff -Nru a/mm/bootmem.c b/mm/bootmem.c
--- a/mm/bootmem.c	Wed Jul 16 12:50:02 2003
+++ b/mm/bootmem.c	Wed Jul 16 12:50:02 2003
@@ -48,8 +48,24 @@
 	bootmem_data_t *bdata = pgdat->bdata;
 	unsigned long mapsize = ((end - start)+7)/8;
 
-	pgdat->pgdat_next = pgdat_list;
-	pgdat_list = pgdat;
+
+	/*
+	 * sort pgdat_list so that the lowest one comes first,
+	 * which makes alloc_bootmem_low_pages work as desired.
+	 */
+	if (!pgdat_list || pgdat_list->node_start_pfn > pgdat->node_start_pfn) {
+		pgdat->pgdat_next = pgdat_list;
+		pgdat_list = pgdat;
+	} else {
+		pg_data_t *tmp = pgdat_list;
+		while (tmp->pgdat_next) {
+			if (tmp->pgdat_next->node_start_pfn > pgdat->node_start_pfn)
+				break;
+			tmp = tmp->pgdat_next;
+		}
+		pgdat->pgdat_next = tmp->pgdat_next;
+		tmp->pgdat_next = pgdat;
+	}
 
 	mapsize = (mapsize + (sizeof(long) - 1UL)) & ~(sizeof(long) - 1UL);
 	bdata->node_bootmem_map = phys_to_virt(mapstart << PAGE_SHIFT);
@@ -251,7 +267,7 @@
 
 static unsigned long __init free_all_bootmem_core(pg_data_t *pgdat)
 {
-	struct page *page = pgdat->node_mem_map;
+	struct page *page;
 	bootmem_data_t *bdata = pgdat->bdata;
 	unsigned long i, count, total = 0;
 	unsigned long idx;
@@ -260,23 +276,23 @@
 	if (!bdata->node_bootmem_map) BUG();
 
 	count = 0;
+	page = virt_to_page(phys_to_virt(bdata->node_boot_start));
 	idx = bdata->node_low_pfn - (bdata->node_boot_start >> PAGE_SHIFT);
 	map = bdata->node_bootmem_map;
 	for (i = 0; i < idx; ) {
 		unsigned long v = ~map[i / BITS_PER_LONG];
 		if (v) {
 			unsigned long m;
-			for (m = 1; m && i < idx; m<<=1, page++, i++) {
+			for (m = 1; m && i < idx; m<<=1, i++) {
 				if (v & m) {
 					count++;
-					ClearPageReserved(page);
-					set_page_count(page, 1);
-					__free_page(page);
+					ClearPageReserved(page+i);
+					set_page_count(page+i, 1);
+					__free_page(page+i);
 				}
 			}
 		} else {
 			i+=BITS_PER_LONG;
-			page += BITS_PER_LONG;
 		}
 	}
 	total += count;


* Re: [Discontig-devel] [PATCH] another discontig patch
  2003-06-21  9:06 [Discontig-devel] [PATCH] another discontig patch Christoph Hellwig
                   ` (7 preceding siblings ...)
  2003-07-16 19:51 ` Jesse Barnes
@ 2003-07-16 19:56 ` Erich Focht
  2003-07-16 22:37 ` Jesse Barnes
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Erich Focht @ 2003-07-16 19:56 UTC (permalink / raw)
  To: linux-ia64

On Wednesday 16 July 2003 21:40, Matthew Wilcox wrote:
> On Wed, Jul 16, 2003 at 12:29:52PM -0700, Jesse Barnes wrote:
> > @@ -210,8 +210,8 @@
> >  	  system with an A0 or A1 stepping CPU.
> >
> >  config NUMA
> > -	bool "Enable NUMA support" if IA64_GENERIC || IA64_DIG || IA64_HP_ZX1
> > -	default y if IA64_SGI_SN2
> > +	bool
> > +	default y if IA64_SGI_SN2 || IA64_GENERIC
> >  	help
> >  	  Say Y to compile the kernel to support NUMA (Non-Uniform Memory
> >  	  Access).  This option is for configuring high-end multiprocessor

There are other NUMA machines around which are not from SGI and are
IA64_DIG. You excluded them on purpose? I hope not...

> If you're removing the question, you can remove the helptext too.
>
> > @@ -235,8 +235,7 @@
> >
> >  config DISCONTIGMEM
> >  	bool
> > -	depends on IA64_SGI_SN2 || (IA64_GENERIC || IA64_DIG || IA64_HP_ZX1) &&
> > NUMA -	default y
> > +	default y if IA64_SGI_SN2 || IA64_GENERIC
> >  	help
> >  	  Say Y to support efficient handling of discontiguous physical memory,
> >  	  for architectures which are either NUMA (Non-Uniform Memory Access)

Same comment as above. NEC TX7 is an IA64_DIG platform.

Erich




* Re: [Discontig-devel] [PATCH] another discontig patch
  2003-06-21  9:06 [Discontig-devel] [PATCH] another discontig patch Christoph Hellwig
                   ` (8 preceding siblings ...)
  2003-07-16 19:56 ` Erich Focht
@ 2003-07-16 22:37 ` Jesse Barnes
  2003-07-17  8:23 ` Erich Focht
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Jesse Barnes @ 2003-07-16 22:37 UTC (permalink / raw)
  To: linux-ia64

On Wed, Jul 16, 2003 at 09:56:32PM +0200, Erich Focht wrote:
> On Wednesday 16 July 2003 21:40, Matthew Wilcox wrote:
> > On Wed, Jul 16, 2003 at 12:29:52PM -0700, Jesse Barnes wrote:
> > > @@ -210,8 +210,8 @@
> > >  	  system with an A0 or A1 stepping CPU.
> > >
> > >  config NUMA
> > > -	bool "Enable NUMA support" if IA64_GENERIC || IA64_DIG || IA64_HP_ZX1
> > > -	default y if IA64_SGI_SN2
> > > +	bool
> > > +	default y if IA64_SGI_SN2 || IA64_GENERIC
> > >  	help
> > >  	  Say Y to compile the kernel to support NUMA (Non-Uniform Memory
> > >  	  Access).  This option is for configuring high-end multiprocessor
> 
> There are other NUMA machines around that are not from SGI and are
> IA64_DIG. Did you exclude them on purpose? I hope not...
> 
> > If you're removing the question, you can remove the helptext too.
> >
> > > @@ -235,8 +235,7 @@
> > >
> > >  config DISCONTIGMEM
> > >  	bool
> > > -	depends on IA64_SGI_SN2 || (IA64_GENERIC || IA64_DIG || IA64_HP_ZX1) && NUMA
> > > -	default y
> > > +	default y if IA64_SGI_SN2 || IA64_GENERIC
> > >  	help
> > >  	  Say Y to support efficient handling of discontiguous physical memory,
> > >  	  for architectures which are either NUMA (Non-Uniform Memory Access)
> 
> Same comment as above. NEC TX7 is an IA64_DIG platform.

Oops, sorry.  Here's another one.  How does it look?

Thanks,
Jesse


# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	           ChangeSet	1.1397  -> 1.1398 
#	include/asm-ia64/page.h	1.19    -> 1.20   
#	arch/ia64/kernel/setup.c	1.53    -> 1.54   
#	include/asm-ia64/pgtable.h	1.28    -> 1.29   
#	        mm/bootmem.c	1.18    -> 1.19   
#	include/asm-ia64/numa.h	1.5     -> 1.6    
#	include/asm-ia64/processor.h	1.48    -> 1.49   
#	 arch/ia64/mm/init.c	1.46    -> 1.47   
#	include/asm-ia64/nodedata.h	1.3     -> 1.4    
#	arch/ia64/mm/discontig.c	1.4     -> 1.5    
#	   arch/ia64/Kconfig	1.38    -> 1.39   
#	drivers/acpi/Kconfig	1.12    -> 1.13   
#	include/asm-ia64/mmzone.h	1.4     -> 1.5    
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 03/07/16	jbarnes@tomahawk.engr.sgi.com	1.1398
# latest discontig
# --------------------------------------------
#
diff -Nru a/arch/ia64/Kconfig b/arch/ia64/Kconfig
--- a/arch/ia64/Kconfig	Wed Jul 16 15:36:33 2003
+++ b/arch/ia64/Kconfig	Wed Jul 16 15:36:33 2003
@@ -211,7 +211,7 @@
 
 config NUMA
 	bool "Enable NUMA support" if IA64_GENERIC || IA64_DIG || IA64_HP_ZX1
-	default y if IA64_SGI_SN2
+	default y if IA64_SGI_SN2 || IA64_GENERIC
 	help
 	  Say Y to compile the kernel to support NUMA (Non-Uniform Memory
 	  Access).  This option is for configuring high-end multiprocessor
@@ -234,9 +234,8 @@
 endchoice
 
 config DISCONTIGMEM
-	bool
-	depends on IA64_SGI_SN2 || (IA64_GENERIC || IA64_DIG || IA64_HP_ZX1) && NUMA
-	default y
+	bool "Discontiguous memory support" if IA64_DIG
+	default y if IA64_SGI_SN2 || IA64_GENERIC
 	help
 	  Say Y to support efficient handling of discontiguous physical memory,
 	  for architectures which are either NUMA (Non-Uniform Memory Access)
@@ -245,8 +244,7 @@
 
 config VIRTUAL_MEM_MAP
 	bool "Enable Virtual Mem Map"
-	depends on !NUMA
-	default y if IA64_GENERIC || IA64_DIG || IA64_HP_ZX1
+	default y if !IA64_HP_SIM
 	help
 	  Say Y to compile the kernel with support for a virtual mem map.
 	  This is an alternate method of supporting large holes in the
@@ -259,8 +257,8 @@
 	  are unsure, say Y.
 
 config IA64_MCA
-	bool "Enable IA-64 Machine Check Abort" if IA64_GENERIC || IA64_DIG || IA64_HP_ZX1
-	default y if IA64_SGI_SN2
+	bool "Enable IA-64 Machine Check Abort"
+	default y if !IA64_HP_SIM
 	help
 	  Say Y here to enable machine check support for IA-64.  If you're
 	  unsure, answer Y.
@@ -292,43 +290,12 @@
 	depends on IA64_GENERIC || IA64_DIG || IA64_HP_ZX1 || IA64_SGI_SN2
 	default y
 
-config IA64_SGI_SN_DEBUG
-	bool "Enable extra debugging code"
-	depends on IA64_SGI_SN2
-	help
-	  Turns on extra debugging code in the SGI SN (Scalable NUMA) platform
-	  for IA-64.  Unless you are debugging problems on an SGI SN IA-64 box,
-	  say N.
-
 config IA64_SGI_SN_SIM
 	bool "Enable SGI Medusa Simulator Support"
 	depends on IA64_SGI_SN2
 	help
 	  If you are compiling a kernel that will run under SGI's IA-64
 	  simulator (Medusa) then say Y, otherwise say N.
-
-config IA64_SGI_AUTOTEST
-	bool "Enable autotest (llsc). Option to run cache test instead of booting"
-	depends on IA64_SGI_SN2
-	help
-	  Build a kernel used for hardware validation. If you include the
-	  keyword "autotest" on the boot command line, the kernel does NOT boot.
-	  Instead, it starts all cpus and runs cache coherency tests instead.
-
-	  If unsure, say N.
-
-config SERIAL_SGI_L1_PROTOCOL
-	bool "Enable protocol mode for the L1 console"
-	depends on IA64_SGI_SN2
-	help
-	  Uses protocol mode instead of raw mode for the level 1 console on the
-	  SGI SN (Scalable NUMA) platform for IA-64.  If you are compiling for
-	  an SGI SN box then Y is the recommended value, otherwise say N.
-
-config PERCPU_IRQ
-	bool
-	depends on IA64_SGI_SN2
-	default y
 
 # On IA-64, we always want an ELF /proc/kcore.
 config KCORE_ELF
diff -Nru a/arch/ia64/kernel/setup.c b/arch/ia64/kernel/setup.c
--- a/arch/ia64/kernel/setup.c	Wed Jul 16 15:36:33 2003
+++ b/arch/ia64/kernel/setup.c	Wed Jul 16 15:36:33 2003
@@ -138,7 +138,7 @@
 call_pernode_memory (unsigned long start, unsigned long end, void *arg)
 {
 	unsigned long rs, re;
-	void (*func)(unsigned long, unsigned long, int, int);
+	void (*func)(unsigned long, unsigned long, int);
 	int i;
 
 	start = PAGE_ALIGN(start);
@@ -149,22 +149,21 @@
 	func = arg;
 
 	if (!num_memblks) {
-		/*
-		 * This machine doesn't have SRAT, so call func with
-		 * nid=0, bank=0.
-		 */
+		/* No SRAT table, so assume one node (node 0) */
 		if (start < end)
-			(*func)(start, end - start, 0, 0);
+			(*func)(start, end, 0);
 		return;
 	}
 
 	for (i = 0; i < num_memblks; i++) {
-		rs = max(start, node_memblk[i].start_paddr);
-		re = min(end, node_memblk[i].start_paddr+node_memblk[i].size);
+		rs = max(__pa(start), node_memblk[i].start_paddr);
+		re = min(__pa(end), node_memblk[i].start_paddr+node_memblk[i].size);
 
 		if (rs < re)
-			(*func)(rs, re-rs, node_memblk[i].nid,
-				node_memblk[i].bank);
+			(*func)((unsigned long)__va(rs), (unsigned long)__va(re), node_memblk[i].nid);
+
+		if ((unsigned long)__va(re) == end)
+			break;
 	}
 }
 
@@ -180,7 +179,7 @@
 filter_rsvd_memory (unsigned long start, unsigned long end, void *arg)
 {
 	unsigned long range_start, range_end, prev_start;
-	void (*func)(unsigned long, unsigned long);
+	void (*func)(unsigned long, unsigned long, int);
 	int i;
 
 #if IGNORE_PFN0
@@ -202,9 +201,9 @@
 
 		if (range_start < range_end)
 #ifdef CONFIG_DISCONTIGMEM
-			call_pernode_memory(__pa(range_start), __pa(range_end), func);
+			call_pernode_memory(range_start, range_end, func);
 #else
-			(*func)(__pa(range_start), range_end - range_start);
+			(*func)(range_start, range_end, 0);
 #endif
 
 		/* nothing more available in this segment */
@@ -703,6 +702,8 @@
 	 * get_free_pages() cannot be used before cpu_init() done.  BSP allocates
 	 * "NR_CPUS" pages for all CPUs to avoid that AP calls get_zeroed_page().
 	 */
+#ifndef CONFIG_DISCONTIGMEM
+	/* for discontig machines, we do this in discontig.c */
 	if (smp_processor_id() == 0) {
 		cpu_data = __alloc_bootmem(PERCPU_PAGE_SIZE * NR_CPUS, PERCPU_PAGE_SIZE,
 					   __pa(MAX_DMA_ADDRESS));
@@ -714,6 +715,7 @@
 			per_cpu(local_per_cpu_offset, cpu) = __per_cpu_offset[cpu];
 		}
 	}
+#endif
 	cpu_data = __per_cpu_start + __per_cpu_offset[smp_processor_id()];
 #else /* !CONFIG_SMP */
 	cpu_data = __phys_per_cpu_start;
diff -Nru a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
--- a/arch/ia64/mm/discontig.c	Wed Jul 16 15:36:33 2003
+++ b/arch/ia64/mm/discontig.c	Wed Jul 16 15:36:33 2003
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2000 Silicon Graphics, Inc.  All rights reserved.
+ * Copyright (c) 2000, 2003 Silicon Graphics, Inc.  All rights reserved.
  * Copyright (c) 2001 Intel Corp.
  * Copyright (c) 2001 Tony Luck <tony.luck@intel.com>
  * Copyright (c) 2002 NEC Corp.
@@ -16,74 +16,60 @@
 #include <linux/mmzone.h>
 #include <linux/acpi.h>
 #include <linux/efi.h>
-
+#include <asm/pgalloc.h>
+#include <asm/tlb.h>
 
 /*
- * Round an address upward to the next multiple of GRANULE size.
+ * Round an address upward or downward to the next multiple of IA64_GRANULE_SIZE.
  */
+#define GRANULEROUNDDOWN(n) ((n) & ~(IA64_GRANULE_SIZE-1))
 #define GRANULEROUNDUP(n) (((n)+IA64_GRANULE_SIZE-1) & ~(IA64_GRANULE_SIZE-1))
 
-static struct ia64_node_data	*node_data[NR_NODES];
-static long			boot_pg_data[8*NR_NODES+sizeof(pg_data_t)]  __initdata;
-static pg_data_t		*pg_data_ptr[NR_NODES] __initdata;
-static bootmem_data_t		bdata[NR_NODES][NR_BANKS_PER_NODE+1] __initdata;
-
-extern int  filter_rsvd_memory (unsigned long start, unsigned long end, void *arg);
+/*
+ * Used to locate BOOT_DATA prior to initializing the node data area.
+ */
+#define BOOT_NODE_DATA(node)	pg_data_ptr[node]
 
 /*
- * Return the compact node number of this cpu. Used prior to
- * setting up the cpu_data area.
- *	Note - not fast, intended for boot use only!!
+ * To prevent cache aliasing effects, align per-node structures so that they 
+ * start at addresses that are strided by node number.
  */
-int
-boot_get_local_nodeid(void)
-{
-	int	i;
+#define NODEDATA_ALIGN(addr, node)	((((addr) + 1024*1024-1) & ~(1024*1024-1)) + (node)*PERCPU_PAGE_SIZE)
 
-	for (i = 0; i < NR_CPUS; i++)
-		if (node_cpuid[i].phys_id == hard_smp_processor_id())
-			return node_cpuid[i].nid;
 
-	/* node info missing, so nid should be 0.. */
-	return 0;
-}
+static struct ia64_node_data	*boot_node_data[NR_NODES] __initdata;
+static pg_data_t		*pg_data_ptr[NR_NODES] __initdata;
+static bootmem_data_t		bdata[NR_NODES] __initdata;
+static unsigned long		boot_pernode[NR_NODES] __initdata;
+static unsigned long		boot_pernodesize[NR_NODES] __initdata;
 
-/*
- * Return a pointer to the pg_data structure for a node.
- * This function is used ONLY in early boot before the cpu_data
- * structure is available.
- */
-pg_data_t* __init
-boot_get_pg_data_ptr(long node)
-{
-	return pg_data_ptr[node];
-}
+extern char __per_cpu_start[], __per_cpu_end[];
 
 
-/*
- * Return a pointer to the node data for the current node.
- *	(boottime initialization only)
- */
-struct ia64_node_data *
+struct ia64_node_data*
 get_node_data_ptr(void)
 {
-	return node_data[boot_get_local_nodeid()];
+	return boot_node_data[cpu_to_node_map[smp_processor_id()]];	/* ZZZ */
 }
 
 /*
  * We allocate one of the bootmem_data_t structs for each piece of memory
  * that we wish to treat as a contiguous block.  Each such block must start
- * on a BANKSIZE boundary.  Multiple banks per node is not supported.
+ * on a GRANULE boundary.  Multiple banks per node are not supported.
+ *   (Note: on SN2, all memory on a node is treated as a single bank.
+ *   Holes within the bank are supported. This works because memory
+ *   from different banks is not interleaved. The bootmap bitmap
+ *   for the node is somewhat large but not too large).
  */
 static int __init
-build_maps(unsigned long pstart, unsigned long length, int node)
+build_maps(unsigned long start, unsigned long end, int node)
 {
 	bootmem_data_t	*bdp;
 	unsigned long cstart, epfn;
 
-	bdp = pg_data_ptr[node]->bdata;
-	epfn = GRANULEROUNDUP(pstart + length) >> PAGE_SHIFT;
-	cstart = pstart & ~(BANKSIZE - 1);
+	bdp = &bdata[node];
+	epfn = GRANULEROUNDUP(__pa(end)) >> PAGE_SHIFT;
+	cstart = GRANULEROUNDDOWN(__pa(start));
 
 	if (!bdp->node_low_pfn) {
 		bdp->node_boot_start = cstart;
@@ -99,34 +85,96 @@
 	return 0;
 }
 
+
+/*
+ * Count the number of cpus on the node
+ */
+static __inline__ int
+count_cpus(int node)
+{
+	int cpu, n=0;
+
+	for (cpu=0; cpu < NR_CPUS; cpu++)
+		if (node == node_cpuid[cpu].nid)
+			n++;
+	return n;
+}
+
+
 /*
- * Find space on each node for the bootmem map.
+ * Find space on each node for the bootmem map & other per-node data structures.
  *
  * Called by efi_memmap_walk to find boot memory on each node. Note that
  * only blocks that are free are passed to this routine (currently filtered by
  * free_available_memory).
  */
 static int __init
-find_bootmap_space(unsigned long pstart, unsigned long length, int node)
+find_pernode_space(unsigned long start, unsigned long end, int node)
 {
-	unsigned long	mapsize, pages, epfn;
+	unsigned long	mapsize, pages, epfn, map=0, cpu, cpus;
+	unsigned long	pernodesize=0, pernode;
+	unsigned long	cpu_data;
+	unsigned long	pstart, length;
 	bootmem_data_t	*bdp;
 
+	pstart = __pa(start);
+	length = end - start;
 	epfn = (pstart + length) >> PAGE_SHIFT;
-	bdp = &pg_data_ptr[node]->bdata[0];
+	bdp = &bdata[node];
 
 	if (pstart < bdp->node_boot_start || epfn > bdp->node_low_pfn)
 		return 0;
 
-	if (!bdp->node_bootmem_map) {
+	if (!boot_pernode[node]) {
+		cpus = count_cpus(node);
+		pernodesize += PERCPU_PAGE_SIZE * cpus;
+		pernodesize += L1_CACHE_ALIGN(sizeof(pg_data_t));
+		pernodesize += L1_CACHE_ALIGN(sizeof(struct ia64_node_data));
+		pernodesize = PAGE_ALIGN(pernodesize);
+		pernode = NODEDATA_ALIGN(pstart, node);
+	
+		if (pstart + length > (pernode + pernodesize)) {
+			boot_pernode[node] = pernode;
+			boot_pernodesize[node] = pernodesize;
+			memset(__va(pernode), 0, pernodesize);
+
+			cpu_data = pernode;
+			pernode += PERCPU_PAGE_SIZE * cpus;
+
+			pg_data_ptr[node] = __va(pernode);
+			pernode += L1_CACHE_ALIGN(sizeof(pg_data_t));
+
+			boot_node_data[node] = __va(pernode);
+			pernode += L1_CACHE_ALIGN(sizeof(struct ia64_node_data));
+
+			pg_data_ptr[node]->bdata = &bdata[node];
+			pernode += L1_CACHE_ALIGN(sizeof(pg_data_t));
+
+			for (cpu=0; cpu < NR_CPUS; cpu++) {
+				if (node == node_cpuid[cpu].nid) {
+					extern char __per_cpu_start[], __phys_per_cpu_start[];
+					memcpy((void*)cpu_data, __phys_per_cpu_start, __per_cpu_end - __per_cpu_start);
+					__per_cpu_offset[cpu] = (char*)__va(cpu_data) - __per_cpu_start;
+					cpu_data +=  PERCPU_PAGE_SIZE;
+				}
+			}
+		}
+	}
+
+	pernode = boot_pernode[node];
+	pernodesize = boot_pernodesize[node];
+	if (pernode && !bdp->node_bootmem_map) {
 		pages = bdp->node_low_pfn - (bdp->node_boot_start>>PAGE_SHIFT);
 		mapsize = bootmem_bootmap_pages(pages) << PAGE_SHIFT;
-		if (length > mapsize) {
-			init_bootmem_node(
-				BOOT_NODE_DATA(node),
-				pstart>>PAGE_SHIFT, 
-				bdp->node_boot_start>>PAGE_SHIFT,
-				bdp->node_low_pfn);
+
+		if (pernode - pstart > mapsize)
+			map = pstart;
+		else if (pstart + length - pernode - pernodesize > mapsize)
+			map = pernode + pernodesize;
+
+		if (map) {
+			init_bootmem_node(BOOT_NODE_DATA(node),	map>>PAGE_SHIFT, 
+				bdp->node_boot_start>>PAGE_SHIFT, bdp->node_low_pfn);
 		}
 
 	}
@@ -143,9 +191,9 @@
  *
  */
 static int __init
-discontig_free_bootmem_node(unsigned long pstart, unsigned long length, int node)
+discontig_free_bootmem_node(unsigned long start, unsigned long end, int node)
 {
-	free_bootmem_node(BOOT_NODE_DATA(node), pstart, length);
+	free_bootmem_node(BOOT_NODE_DATA(node), __pa(start), end - start);
 
 	return 0;
 }
@@ -158,53 +206,50 @@
 discontig_reserve_bootmem(void)
 {
 	int		node;
-	unsigned long	mapbase, mapsize, pages;
+	unsigned long	base, size, pages;
 	bootmem_data_t	*bdp;
 
 	for (node = 0; node < numnodes; node++) {
 		bdp = BOOT_NODE_DATA(node)->bdata;
 
 		pages = bdp->node_low_pfn - (bdp->node_boot_start>>PAGE_SHIFT);
-		mapsize = bootmem_bootmap_pages(pages) << PAGE_SHIFT;
-		mapbase = __pa(bdp->node_bootmem_map);
-		reserve_bootmem_node(BOOT_NODE_DATA(node), mapbase, mapsize);
+		size = bootmem_bootmap_pages(pages) << PAGE_SHIFT;
+		base = __pa(bdp->node_bootmem_map);
+		reserve_bootmem_node(BOOT_NODE_DATA(node), base, size);
+
+		size = boot_pernodesize[node];
+		base = __pa(boot_pernode[node]);
+		reserve_bootmem_node(BOOT_NODE_DATA(node), base, size);
 	}
 }
 
 /*
- * Allocate per node tables.
- * 	- the pg_data structure is allocated on each node. This minimizes offnode 
- *	  memory references
- *	- the node data is allocated & initialized. Portions of this structure is read-only (after 
- *	  boot) and contains node-local pointers to usefuls data structures located on
- *	  other nodes.
+ * Initialize per-node data
+ *
+ * Finish setting up the node data for this node, then copy it to the other nodes.
  *
- * We also switch to using the "real" pg_data structures at this point. Earlier in boot, we
- * use a different structure. The only use for pg_data prior to the point in boot is to get 
- * the pointer to the bdata for the node.
  */
 static void __init
-allocate_pernode_structures(void)
+initialize_pernode_data(void)
 {
-	pg_data_t	*pgdat=0, *new_pgdat_list=0;
-	int		node, mynode;
+	int	cpu, node;
+
+	memcpy(boot_node_data[0]->pg_data_ptrs, pg_data_ptr, sizeof(pg_data_ptr));
+	memcpy(boot_node_data[0]->node_data_ptrs, boot_node_data, sizeof(boot_node_data));
 
-	mynode = boot_get_local_nodeid();
-	for (node = numnodes - 1; node >= 0 ; node--) {
-		node_data[node] = alloc_bootmem_node(BOOT_NODE_DATA(node), sizeof (struct ia64_node_data));
-		pgdat = __alloc_bootmem_node(BOOT_NODE_DATA(node), sizeof(pg_data_t), SMP_CACHE_BYTES, 0);
-		pgdat->bdata = &(bdata[node][0]);
-		pg_data_ptr[node] = pgdat;
-		pgdat->pgdat_next = new_pgdat_list;
-		new_pgdat_list = pgdat;
+	for (node=1; node < numnodes; node++) {
+		memcpy(boot_node_data[node], boot_node_data[0], sizeof(struct ia64_node_data));
+		boot_node_data[node]->node = node;
 	}
-	
-	memcpy(node_data[mynode]->pg_data_ptrs, pg_data_ptr, sizeof(pg_data_ptr));
-	memcpy(node_data[mynode]->node_data_ptrs, node_data, sizeof(node_data));
 
-	pgdat_list = new_pgdat_list;
+	for (cpu=0; cpu < NR_CPUS; cpu++) {
+		node = node_cpuid[cpu].nid;
+		per_cpu(cpu_info, cpu).node_data = boot_node_data[node];
+		per_cpu(cpu_info, cpu).nodeid = node;
+	}
 }
 
+
 /*
  * Called early in boot to setup the boot memory allocator, and to
  * allocate the node-local pg_data & node-directory data structures..
@@ -212,96 +257,19 @@
 void __init
 discontig_mem_init(void)
 {
-	int	node;
-
 	if (numnodes == 0) {
 		printk(KERN_ERR "node info missing!\n");
 		numnodes = 1;
 	}
 
-	for (node = 0; node < numnodes; node++) {
-		pg_data_ptr[node] = (pg_data_t*) &boot_pg_data[node];
-		pg_data_ptr[node]->bdata = &bdata[node][0];
-	}
-
 	min_low_pfn = -1;
 	max_low_pfn = 0;
 
         efi_memmap_walk(filter_rsvd_memory, build_maps);
-        efi_memmap_walk(filter_rsvd_memory, find_bootmap_space);
+        efi_memmap_walk(filter_rsvd_memory, find_pernode_space);
         efi_memmap_walk(filter_rsvd_memory, discontig_free_bootmem_node);
-	discontig_reserve_bootmem();
-	allocate_pernode_structures();
-}
-
-/*
- * Initialize the paging system.
- *	- determine sizes of each node
- *	- initialize the paging system for the node
- *	- build the nodedir for the node. This contains pointers to
- *	  the per-bank mem_map entries.
- *	- fix the page struct "virtual" pointers. These are bank specific
- *	  values that the paging system doesn't understand.
- *	- replicate the nodedir structure to other nodes	
- */ 
-
-void __init
-discontig_paging_init(void)
-{
-	int		node, mynode;
-	unsigned long	max_dma, zones_size[MAX_NR_ZONES];
-	unsigned long	kaddr, ekaddr, bid;
-	struct page	*page;
-	bootmem_data_t	*bdp;
-
-	max_dma = virt_to_phys((void *) MAX_DMA_ADDRESS) >> PAGE_SHIFT;
 
-	mynode = boot_get_local_nodeid();
-	for (node = 0; node < numnodes; node++) {
-		long pfn, startpfn;
-
-		memset(zones_size, 0, sizeof(zones_size));
-
-		startpfn = -1;
-		bdp = BOOT_NODE_DATA(node)->bdata;
-		pfn = bdp->node_boot_start >> PAGE_SHIFT;
-		if (startpfn == -1)
-			startpfn = pfn;
-		if (pfn > max_dma)
-			zones_size[ZONE_NORMAL] += (bdp->node_low_pfn - pfn);
-		else if (bdp->node_low_pfn < max_dma)
-			zones_size[ZONE_DMA] += (bdp->node_low_pfn - pfn);
-		else {
-			zones_size[ZONE_DMA] += (max_dma - pfn);
-			zones_size[ZONE_NORMAL] += (bdp->node_low_pfn - max_dma);
-		}
-
-		free_area_init_node(node, NODE_DATA(node), NULL, zones_size, startpfn, 0);
-
-		page = NODE_DATA(node)->node_mem_map;
-
-		bdp = BOOT_NODE_DATA(node)->bdata;
-
-		kaddr = (unsigned long)__va(bdp->node_boot_start);
-		ekaddr = (unsigned long)__va(bdp->node_low_pfn << PAGE_SHIFT);
-		while (kaddr < ekaddr) {
-			if (paddr_to_nid(__pa(kaddr)) == node) {
-				bid = BANK_MEM_MAP_INDEX(kaddr);
-				node_data[mynode]->node_id_map[bid] = node;
-				node_data[mynode]->bank_mem_map_base[bid] = page;
-			}
-			kaddr += BANKSIZE;
-			page += BANKSIZE/PAGE_SIZE;
-		}
-	}
-
-	/*
-	 * Finish setting up the node data for this node, then copy it to the other nodes.
-	 */
-	for (node=0; node < numnodes; node++)
-		if (mynode != node) {
-			memcpy(node_data[node], node_data[mynode], sizeof(struct ia64_node_data));
-			node_data[node]->node = node;
-		}
+	discontig_reserve_bootmem();
+	initialize_pernode_data();
 }
 
diff -Nru a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
--- a/arch/ia64/mm/init.c	Wed Jul 16 15:36:33 2003
+++ b/arch/ia64/mm/init.c	Wed Jul 16 15:36:33 2003
@@ -44,7 +44,7 @@
 #ifdef CONFIG_VIRTUAL_MEM_MAP
 # define LARGE_GAP	0x40000000	/* Use virtual mem map if hole is > than this */
   unsigned long vmalloc_end = VMALLOC_END_INIT;
-  static struct page *vmem_map;
+  struct page *vmem_map;
   static unsigned long num_dma_physpages;
 #endif
 
@@ -240,7 +240,7 @@
 				else if (page_count(pgdat->node_mem_map + i))
 					shared += page_count(pgdat->node_mem_map + i) - 1;
 			}
-			printk("\t%d pages of RAM\n", pgdat->node_spanned_pages);
+			printk("\t%ld pages of RAM\n", pgdat->node_spanned_pages);
 			printk("\t%d reserved pages\n", reserved);
 			printk("\t%d pages shared\n", shared);
 			printk("\t%d pages swap cached\n", cached);
@@ -397,6 +397,7 @@
 {
 	unsigned long address, start_page, end_page;
 	struct page *map_start, *map_end;
+	int node;
 	pgd_t *pgd;
 	pmd_t *pmd;
 	pte_t *pte;
@@ -406,19 +407,20 @@
 
 	start_page = (unsigned long) map_start & PAGE_MASK;
 	end_page = PAGE_ALIGN((unsigned long) map_end);
+	node = paddr_to_nid(__pa(start));
 
 	for (address = start_page; address < end_page; address += PAGE_SIZE) {
 		pgd = pgd_offset_k(address);
 		if (pgd_none(*pgd))
-			pgd_populate(&init_mm, pgd, alloc_bootmem_pages(PAGE_SIZE));
+			pgd_populate(&init_mm, pgd, alloc_bootmem_pages_node(NODE_DATA(node), PAGE_SIZE));
 		pmd = pmd_offset(pgd, address);
 
 		if (pmd_none(*pmd))
-			pmd_populate_kernel(&init_mm, pmd, alloc_bootmem_pages(PAGE_SIZE));
+			pmd_populate_kernel(&init_mm, pmd, alloc_bootmem_pages_node(NODE_DATA(node), PAGE_SIZE));
 		pte = pte_offset_kernel(pmd, address);
 
 		if (pte_none(*pte))
-			set_pte(pte, pfn_pte(__pa(alloc_bootmem_pages(PAGE_SIZE)) >> PAGE_SHIFT,
+			set_pte(pte, pfn_pte(__pa(alloc_bootmem_pages_node(NODE_DATA(node), PAGE_SIZE)) >> PAGE_SHIFT,
 					     PAGE_KERNEL));
 	}
 	return 0;
@@ -431,6 +433,14 @@
 	unsigned long zone;
 };
 
+struct memmap_count_callback_data {
+	int node;
+	unsigned long num_physpages;
+	unsigned long num_dma_physpages;
+	unsigned long min_pfn;
+	unsigned long max_pfn;
+} cdata;
+
 static int
 virtual_memmap_init (u64 start, u64 end, void *arg)
 {
@@ -489,16 +499,6 @@
 }
 
 static int
-count_dma_pages (u64 start, u64 end, void *arg)
-{
-	unsigned long *count = arg;
-
-	if (end <= MAX_DMA_ADDRESS)
-		*count += (end - start) >> PAGE_SHIFT;
-	return 0;
-}
-
-static int
 find_largest_hole (u64 start, u64 end, void *arg)
 {
 	u64 *max_gap = arg;
@@ -514,102 +514,101 @@
 }
 #endif /* CONFIG_VIRTUAL_MEM_MAP */
 
+#define GRANULEROUNDDOWN(n) ((n) & ~(IA64_GRANULE_SIZE-1))
+#define GRANULEROUNDUP(n) (((n)+IA64_GRANULE_SIZE-1) & ~(IA64_GRANULE_SIZE-1))
+#define ORDERROUNDDOWN(n) ((n) & ~((PAGE_SIZE<<MAX_ORDER)-1))
 static int
-count_pages (u64 start, u64 end, void *arg)
+count_pages (unsigned long start, unsigned long end, int node)
 {
-	unsigned long *count = arg;
+	start = __pa(start);
+	end = __pa(end);
 
-	*count += (end - start) >> PAGE_SHIFT;
+	if (node == cdata.node) {
+		cdata.num_physpages += (end - start) >> PAGE_SHIFT;
+		if (start <= __pa(MAX_DMA_ADDRESS))
+			cdata.num_dma_physpages += (min(end, __pa(MAX_DMA_ADDRESS)) - start) >> PAGE_SHIFT;
+		start = GRANULEROUNDDOWN(start);
+		start = ORDERROUNDDOWN(start);
+		end = GRANULEROUNDUP(end);
+		cdata.max_pfn = max(cdata.max_pfn, end >> PAGE_SHIFT);
+		cdata.min_pfn = min(cdata.min_pfn, start >> PAGE_SHIFT);
+	}
 	return 0;
 }
 
 /*
  * Set up the page tables.
  */
-
-#ifdef CONFIG_DISCONTIGMEM
 void
 paging_init (void)
 {
-	extern void discontig_paging_init(void);
-
-	discontig_paging_init();
-	efi_memmap_walk(count_pages, &num_physpages);
-	zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
-}
-#else /* !CONFIG_DISCONTIGMEM */
-void
-paging_init (void)
-{
-	unsigned long max_dma;
+	unsigned long max_dma_pfn;
 	unsigned long zones_size[MAX_NR_ZONES];
 #  ifdef CONFIG_VIRTUAL_MEM_MAP
 	unsigned long zholes_size[MAX_NR_ZONES];
 	unsigned long max_gap;
 #  endif
+	int node;
 
-	/* initialize mem_map[] */
-
-	memset(zones_size, 0, sizeof(zones_size));
-
-	num_physpages = 0;
-	efi_memmap_walk(count_pages, &num_physpages);
-
-	max_dma = virt_to_phys((void *) MAX_DMA_ADDRESS) >> PAGE_SHIFT;
-
-#  ifdef CONFIG_VIRTUAL_MEM_MAP
-	memset(zholes_size, 0, sizeof(zholes_size));
-
-	num_dma_physpages = 0;
-	efi_memmap_walk(count_dma_pages, &num_dma_physpages);
-
-	if (max_low_pfn < max_dma) {
-		zones_size[ZONE_DMA] = max_low_pfn;
-		zholes_size[ZONE_DMA] = max_low_pfn - num_dma_physpages;
-	} else {
-		zones_size[ZONE_DMA] = max_dma;
-		zholes_size[ZONE_DMA] = max_dma - num_dma_physpages;
-		if (num_physpages > num_dma_physpages) {
-			zones_size[ZONE_NORMAL] = max_low_pfn - max_dma;
-			zholes_size[ZONE_NORMAL] = ((max_low_pfn - max_dma)
-						    - (num_physpages - num_dma_physpages));
-		}
-	}
-
+	max_dma_pfn = virt_to_phys((void *) MAX_DMA_ADDRESS) >> PAGE_SHIFT;
 	max_gap = 0;
 	efi_memmap_walk(find_largest_hole, (u64 *)&max_gap);
-	if (max_gap < LARGE_GAP) {
-		vmem_map = (struct page *) 0;
-		free_area_init_node(0, &contig_page_data, NULL, zones_size, 0, zholes_size);
-		mem_map = contig_page_data.node_mem_map;
-	}
-	else {
-		unsigned long map_size;
-
-		/* allocate virtual_mem_map */
 
-		map_size = PAGE_ALIGN(max_low_pfn * sizeof(struct page));
-		vmalloc_end -= map_size;
-		vmem_map = (struct page *) vmalloc_end;
-		efi_memmap_walk(create_mem_map_page_table, 0);
-
-		free_area_init_node(0, &contig_page_data, vmem_map, zones_size, 0, zholes_size);
+	for (node = 0; node < numnodes; node++) {
+		memset(zones_size, 0, sizeof(zones_size));
+		memset(zholes_size, 0, sizeof(zholes_size));
+		memset(&cdata, 0, sizeof(cdata));
+
+		cdata.node = node;
+		cdata.min_pfn = ~0;
+
+		efi_memmap_walk(filter_rsvd_memory, count_pages);
+		num_dma_physpages += cdata.num_dma_physpages;
+		num_physpages += cdata.num_physpages;
+
+		if (cdata.min_pfn >= max_dma_pfn) {
+			/* Above the DMA zone */
+			zones_size[ZONE_NORMAL] = cdata.max_pfn - cdata.min_pfn;
+			zholes_size[ZONE_NORMAL] = cdata.max_pfn - cdata.min_pfn - cdata.num_physpages;
+		} else if (cdata.max_pfn < max_dma_pfn) {
+			/* This block is DMAable */
+			zones_size[ZONE_DMA] = cdata.max_pfn - cdata.min_pfn;
+			zholes_size[ZONE_DMA] = cdata.max_pfn - cdata.min_pfn - cdata.num_dma_physpages;
+		} else {
+			zones_size[ZONE_DMA] = max_dma_pfn - cdata.min_pfn;
+			zholes_size[ZONE_DMA] = zones_size[ZONE_DMA] - cdata.num_dma_physpages;
+			zones_size[ZONE_NORMAL] = cdata.max_pfn - max_dma_pfn;
+			zholes_size[ZONE_NORMAL] = zones_size[ZONE_NORMAL] - (cdata.num_physpages - cdata.num_dma_physpages);
+		}
 
-		mem_map = contig_page_data.node_mem_map;
-		printk("Virtual mem_map starts at 0x%p\n", mem_map);
-	}
-#  else /* !CONFIG_VIRTUAL_MEM_MAP */
-	if (max_low_pfn < max_dma)
-		zones_size[ZONE_DMA] = max_low_pfn;
-	else {
-		zones_size[ZONE_DMA] = max_dma;
-		zones_size[ZONE_NORMAL] = max_low_pfn - max_dma;
+		if (numnodes == 1 && max_gap < LARGE_GAP) {
+			/* Just one node with no big holes... */
+			vmem_map = (struct page *)0;
+			zones_size[ZONE_DMA] += cdata.min_pfn;
+			zholes_size[ZONE_DMA] += cdata.min_pfn;
+			free_area_init_node(0, NODE_DATA(node), NODE_DATA(node)->node_mem_map,
+					    zones_size, 0, zholes_size);
+		}
+		else {
+			/* allocate virtual mem_map */
+			if (node == 0) {
+				unsigned long map_size;
+				map_size = PAGE_ALIGN(max_low_pfn*sizeof(struct page));
+				vmalloc_end -= map_size;
+				vmem_map = (struct page *) vmalloc_end;
+				efi_memmap_walk(create_mem_map_page_table, 0);
+				printk("Virtual mem_map starts at 0x%p\n", vmem_map);
+#ifndef CONFIG_DISCONTIGMEM
+				mem_map = vmem_map;
+#endif
+			}
+			free_area_init_node(node, NODE_DATA(node), vmem_map + cdata.min_pfn,
+					    zones_size, cdata.min_pfn, zholes_size);
+		}
 	}
-	free_area_init(zones_size);
-#  endif /* !CONFIG_VIRTUAL_MEM_MAP */
+
 	zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
 }
-#endif /* !CONFIG_DISCONTIGMEM */
 
 static int
 count_reserved_pages (u64 start, u64 end, void *arg)
diff -Nru a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
--- a/drivers/acpi/Kconfig	Wed Jul 16 15:36:33 2003
+++ b/drivers/acpi/Kconfig	Wed Jul 16 15:36:33 2003
@@ -133,7 +133,7 @@
 
 config ACPI_NUMA
 	bool "NUMA support" if NUMA && (IA64 && !IA64_HP_SIM || X86 && ACPI && !ACPI_HT_ONLY && !X86_64)
-	default y if IA64 && IA64_SGI_SN
+	default y if IA64_GENERIC || IA64_SGI_SN2
 
 config ACPI_ASUS
         tristate "ASUS/Medion Laptop Extras"
diff -Nru a/include/asm-ia64/mmzone.h b/include/asm-ia64/mmzone.h
--- a/include/asm-ia64/mmzone.h	Wed Jul 16 15:36:33 2003
+++ b/include/asm-ia64/mmzone.h	Wed Jul 16 15:36:33 2003
@@ -3,7 +3,7 @@
  * License.  See the file "COPYING" in the main directory of this archive
  * for more details.
  *
- * Copyright (c) 2000 Silicon Graphics, Inc.  All rights reserved.
+ * Copyright (c) 2000,2003 Silicon Graphics, Inc.  All rights reserved.
  * Copyright (c) 2002 NEC Corp.
  * Copyright (c) 2002 Erich Focht <efocht@ess.nec.de>
  * Copyright (c) 2002 Kimio Suganuma <k-suganuma@da.jp.nec.com>
@@ -14,150 +14,50 @@
 #include <linux/config.h>
 #include <linux/init.h>
 
-/*
- * Given a kaddr, find the base mem_map address for the start of the mem_map
- * entries for the bank containing the kaddr.
- */
-#define BANK_MEM_MAP_BASE(kaddr) local_node_data->bank_mem_map_base[BANK_MEM_MAP_INDEX(kaddr)]
-
-/*
- * Given a kaddr, this macro return the relative map number 
- * within the bank.
- */
-#define BANK_MAP_NR(kaddr) 	(BANK_OFFSET(kaddr) >> PAGE_SHIFT)
 
-/*
- * Given a pte, this macro returns a pointer to the page struct for the pte.
- */
-#define pte_page(pte)	virt_to_page(PAGE_OFFSET | (pte_val(pte)&_PFN_MASK))
+#ifdef CONFIG_NUMA
 
-/*
- * Determine if a kaddr is a valid memory address of memory that
- * actually exists. 
- *
- * The check consists of 2 parts:
- *	- verify that the address is a region 7 address & does not 
- *	  contain any bits that preclude it from being a valid platform
- *	  memory address
- *	- verify that the chunk actually exists.
- *
- * Note that IO addresses are NOT considered valid addresses.
- *
- * Note, many platforms can simply check if kaddr exceeds a specific size.  
- *	(However, this won't work on SGI platforms since IO space is embedded 
- * 	within the range of valid memory addresses & nodes have holes in the 
- *	address range between banks). 
- */
-#define kern_addr_valid(kaddr)		({long _kav=(long)(kaddr);	\
-					VALID_MEM_KADDR(_kav);})
-
-/*
- * Given a kaddr, return a pointer to the page struct for the page.
- * If the kaddr does not represent RAM memory that potentially exists, return
- * a pointer the page struct for max_mapnr. IO addresses will
- * return the page for max_nr. Addresses in unpopulated RAM banks may
- * return undefined results OR may panic the system.
- *
- */
-#define virt_to_page(kaddr)	({long _kvtp=(long)(kaddr);	\
-				(VALID_MEM_KADDR(_kvtp))	\
-					? BANK_MEM_MAP_BASE(_kvtp) + BANK_MAP_NR(_kvtp)	\
-					: NULL;})
+#ifdef CONFIG_IA64_DIG
 
 /*
- * Given a page struct entry, return the physical address that the page struct represents.
- * Since IA64 has all memory in the DMA zone, the following works:
+ * Platform definitions for DIG platform with contiguous memory.
  */
-#define page_to_phys(page)	__pa(page_address(page))
-
-#define node_mem_map(nid)	(NODE_DATA(nid)->node_mem_map)
+#define MAX_PHYSNODE_ID	8		/* Maximum node number +1 */
+#define NR_NODES	8		/* Maximum number of nodes in SSI */
+#define NR_MEMBLKS	(NR_NODES * 32)
 
-#define node_localnr(pfn, nid)	((pfn) - NODE_DATA(nid)->node_start_pfn)
 
-#define pfn_to_page(pfn)	(struct page *)(node_mem_map(pfn_to_nid(pfn)) + node_localnr(pfn, pfn_to_nid(pfn)))
 
-#define pfn_to_nid(pfn)		 local_node_data->node_id_map[(pfn << PAGE_SHIFT) >> BANKSHIFT]
-
-#define page_to_pfn(page)	(long)((page - page_zone(page)->zone_mem_map) + page_zone(page)->zone_start_pfn)
 
+#elif CONFIG_IA64_SGI_SN2
 
 /*
- * pfn_valid should be made as fast as possible, and the current definition
- * is valid for machines that are NUMA, but still contiguous, which is what
- * is currently supported. A more generalised, but slower definition would
- * be something like this - mbligh:
- * ( pfn_to_pgdat(pfn) && (pfn < node_end_pfn(pfn_to_nid(pfn))) )
+ * Platform definitions for SGI SN2 platform with discontiguous memory.
  */
-#define pfn_valid(pfn)          (pfn < max_low_pfn)
-extern unsigned long max_low_pfn;
+#define MAX_PHYSNODE_ID	2048		/* Maximum node number +1 */
+#define NR_NODES	256		/* Maximum number of compute nodes in SSI */
+#define NR_MEMBLKS	(NR_NODES)
 
+#elif CONFIG_IA64_GENERIC
 
-#ifdef CONFIG_IA64_DIG
 
 /*
- * Platform definitions for DIG platform with contiguous memory.
+ * Platform definitions for GENERIC platform with contiguous or discontiguous memory.
  */
-#define MAX_PHYSNODE_ID	8	/* Maximum node number +1 */
-#define NR_NODES	8	/* Maximum number of nodes in SSI */
+#define MAX_PHYSNODE_ID 2048		/* Maximum node number +1 */
+#define NR_NODES        256		/* Maximum number of nodes in SSI */
+#define NR_MEMBLKS      (NR_NODES)
 
-#define MAX_PHYS_MEMORY	(1UL << 40)	/* 1 TB */
 
-/*
- * Bank definitions.
- * Configurable settings for DIG: 512MB/bank:  16GB/node,
- *                               2048MB/bank:  64GB/node,
- *                               8192MB/bank: 256GB/node.
- */
-#define NR_BANKS_PER_NODE	32
-#if defined(CONFIG_IA64_NODESIZE_16GB)
-# define BANKSHIFT		29
-#elif defined(CONFIG_IA64_NODESIZE_64GB)
-# define BANKSHIFT		31
-#elif defined(CONFIG_IA64_NODESIZE_256GB)
-# define BANKSHIFT		33
 #else
-# error Unsupported bank and nodesize!
+#error unknown platform
 #endif
-#define BANKSIZE		(1UL << BANKSHIFT)
-#define BANK_OFFSET(addr)	((unsigned long)(addr) & (BANKSIZE-1))
-#define NR_BANKS		(NR_BANKS_PER_NODE * NR_NODES)
 
-/*
- * VALID_MEM_KADDR returns a boolean to indicate if a kaddr is
- * potentially a valid cacheable identity mapped RAM memory address.
- * Note that the RAM may or may not actually be present!!
- */
-#define VALID_MEM_KADDR(kaddr)	1
+extern void build_cpu_to_node_map(void);
 
-/*
- * Given a nodeid & a bank number, find the address of the mem_map
- * entry for the first page of the bank.
- */
-#define BANK_MEM_MAP_INDEX(kaddr) \
-	(((unsigned long)(kaddr) & (MAX_PHYS_MEMORY-1)) >> BANKSHIFT)
+#else /* CONFIG_NUMA */
 
-#elif defined(CONFIG_IA64_SGI_SN2)
-/*
- * SGI SN2 discontig definitions
- */
-#define MAX_PHYSNODE_ID	2048	/* 2048 node ids (also called nasid) */
-#define NR_NODES	128	/* Maximum number of nodes in SSI */
-#define MAX_PHYS_MEMORY	(1UL << 49)
-
-#define BANKSHIFT		38
-#define NR_BANKS_PER_NODE	4
-#define SN2_NODE_SIZE		(64UL*1024*1024*1024)	/* 64GB per node */
-#define BANKSIZE		(SN2_NODE_SIZE/NR_BANKS_PER_NODE)
-#define BANK_OFFSET(addr)	((unsigned long)(addr) & (BANKSIZE-1))
-#define NR_BANKS		(NR_BANKS_PER_NODE * NR_NODES)
-#define VALID_MEM_KADDR(kaddr)	1
-
-/*
- * Given a nodeid & a bank number, find the address of the mem_map
- * entry for the first page of the bank.
- */
-#define BANK_MEM_MAP_INDEX(kaddr) \
-	(((unsigned long)(kaddr) & (MAX_PHYS_MEMORY-1)) >> BANKSHIFT)
+#define NR_NODES	1
 
-#endif /* CONFIG_IA64_DIG */
+#endif /* CONFIG_NUMA */
 #endif /* _ASM_IA64_MMZONE_H */
diff -Nru a/include/asm-ia64/nodedata.h b/include/asm-ia64/nodedata.h
--- a/include/asm-ia64/nodedata.h	Wed Jul 16 15:36:33 2003
+++ b/include/asm-ia64/nodedata.h	Wed Jul 16 15:36:33 2003
@@ -13,7 +13,7 @@
 #ifndef _ASM_IA64_NODEDATA_H
 #define _ASM_IA64_NODEDATA_H
 
-
+#include <asm/percpu.h>
 #include <asm/mmzone.h>
 
 /*
@@ -22,15 +22,17 @@
 
 struct pglist_data;
 struct ia64_node_data {
-	short			active_cpu_count;
 	short			node;
+	short			active_cpu_count;
+	/*
+	 * The fields are read-only (after boot). They contain pointers
+	 * to various structures located on other nodes. This data is
+	 * replicated on each node in order to reduce off-node references.
+	 */
         struct pglist_data	*pg_data_ptrs[NR_NODES];
-	struct page		*bank_mem_map_base[NR_BANKS];
 	struct ia64_node_data	*node_data_ptrs[NR_NODES];
-	short			node_id_map[NR_BANKS];
 };
 
-
 /*
  * Return a pointer to the node_data structure for the executing cpu.
  */
@@ -40,7 +42,8 @@
 /*
  * Return a pointer to the node_data structure for the specified node.
  */
-#define node_data(node)	(local_node_data->node_data_ptrs[node])
+#define node_data(node) (local_node_data->node_data_ptrs[node])
+#define NODE_DATA(nid) (local_node_data->pg_data_ptrs[nid])
 
 /*
  * Get a pointer to the node_id/node_data for the current cpu.
@@ -48,29 +51,5 @@
  */
 extern int boot_get_local_nodeid(void);
 extern struct ia64_node_data *get_node_data_ptr(void);
-
-/*
- * Given a node id, return a pointer to the pg_data_t for the node.
- * The following 2 macros are similar. 
- *
- * NODE_DATA 	- should be used in all code not related to system
- *		  initialization. It uses pernode data structures to minimize
- *		  offnode memory references. However, these structure are not 
- *		  present during boot. This macro can be used once cpu_init
- *		  completes.
- *
- * BOOT_NODE_DATA
- *		- should be used during system initialization 
- *		  prior to freeing __initdata. It does not depend on the percpu
- *		  area being present.
- *
- * NOTE:   The names of these macros are misleading but are difficult to change
- *	   since they are used in generic linux & on other architecures.
- */
-#define NODE_DATA(nid)		(local_node_data->pg_data_ptrs[nid])
-#define BOOT_NODE_DATA(nid)	boot_get_pg_data_ptr((long)(nid))
-
-struct pglist_data;
-extern struct pglist_data * __init boot_get_pg_data_ptr(long);
 
 #endif /* _ASM_IA64_NODEDATA_H */
diff -Nru a/include/asm-ia64/numa.h b/include/asm-ia64/numa.h
--- a/include/asm-ia64/numa.h	Wed Jul 16 15:36:33 2003
+++ b/include/asm-ia64/numa.h	Wed Jul 16 15:36:33 2003
@@ -15,13 +15,21 @@
 
 #ifdef CONFIG_DISCONTIGMEM
 # include <asm/mmzone.h>
-# define NR_MEMBLKS   (NR_BANKS)
 #else
 # define NR_NODES     (8)
 # define NR_MEMBLKS   (NR_NODES * 8)
 #endif
 
 #include <linux/cache.h>
+#include <linux/threads.h>
+#include <linux/smp.h>
+
+#define NODEMASK_WORDCOUNT       ((NR_NODES+(BITS_PER_LONG-1))/BITS_PER_LONG)
+
+#define NODE_MASK_NONE   { [0 ... ((NR_NODES+BITS_PER_LONG-1)/BITS_PER_LONG)-1] = 0 }
+
+typedef unsigned long   nodemask_t[NODEMASK_WORDCOUNT];
+
 extern volatile char cpu_to_node_map[NR_CPUS] __cacheline_aligned;
 extern volatile unsigned long node_to_cpu_mask[NR_NODES] __cacheline_aligned;
 
@@ -63,6 +71,12 @@
 extern int paddr_to_nid(unsigned long paddr);
 
 #define local_nodeid (cpu_to_node_map[smp_processor_id()])
+
+#else /* !CONFIG_NUMA */
+
+#define node_distance(from,to) 10
+#define paddr_to_nid(x) 0
+#define local_nodeid 0
 
 #endif /* CONFIG_NUMA */
 
diff -Nru a/include/asm-ia64/page.h b/include/asm-ia64/page.h
--- a/include/asm-ia64/page.h	Wed Jul 16 15:36:33 2003
+++ b/include/asm-ia64/page.h	Wed Jul 16 15:36:33 2003
@@ -93,18 +93,26 @@
 
 #define virt_addr_valid(kaddr)	pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
 
-#ifndef CONFIG_DISCONTIGMEM
-# ifdef CONFIG_VIRTUAL_MEM_MAP
-   extern int ia64_pfn_valid (unsigned long pfn);
-#  define pfn_valid(pfn)	(((pfn) < max_mapnr) && ia64_pfn_valid(pfn))
-# else
-#  define pfn_valid(pfn)	((pfn) < max_mapnr)
-# endif
-#define virt_to_page(kaddr)	pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
-#define page_to_pfn(page)	((unsigned long) (page - mem_map))
-#define pfn_to_page(pfn)	(mem_map + (pfn))
-#define page_to_phys(page)	(page_to_pfn(page) << PAGE_SHIFT)
+#ifdef CONFIG_VIRTUAL_MEM_MAP
+extern int ia64_pfn_valid(unsigned long pfn);
+#else
+#define ia64_pfn_valid(pfn) (1)
+#endif
+
+extern unsigned long max_low_pfn;
+#define pfn_valid(pfn) (((pfn) < max_low_pfn) && ia64_pfn_valid(pfn))
+
+#if defined(CONFIG_VIRTUAL_MEM_MAP) && !defined(CONFIG_DISCONTIGMEM)
+#define vmem_map mem_map
+#else
+extern struct page *vmem_map;
 #endif
+
+#define pfn_to_page(pfn)	(vmem_map + (pfn))
+#define page_to_pfn(page)	((unsigned long) (page - vmem_map))
+
+#define virt_to_page(kaddr)	(pfn_to_page(__pa(kaddr) >> PAGE_SHIFT))
+#define page_to_phys(page)	(page_to_pfn(page) << PAGE_SHIFT)
 
 typedef union ia64_va {
 	struct {
diff -Nru a/include/asm-ia64/pgtable.h b/include/asm-ia64/pgtable.h
--- a/include/asm-ia64/pgtable.h	Wed Jul 16 15:36:33 2003
+++ b/include/asm-ia64/pgtable.h	Wed Jul 16 15:36:33 2003
@@ -174,7 +174,6 @@
 	return (addr & (local_cpu_data->unimpl_pa_mask)) == 0;
 }
 
-#ifndef CONFIG_DISCONTIGMEM
 /*
  * kern_addr_valid(ADDR) tests if ADDR is pointing to valid kernel
 * memory.  For the return value to be meaningful, ADDR must be >=
@@ -190,7 +189,6 @@
  */
 #define kern_addr_valid(addr)	(1)
 
-#endif
 
 /*
  * Now come the defines and routines to manage and access the three-level
@@ -241,10 +239,8 @@
 #define pte_none(pte) 			(!pte_val(pte))
 #define pte_present(pte)		(pte_val(pte) & (_PAGE_P | _PAGE_PROTNONE))
 #define pte_clear(pte)			(pte_val(*(pte)) = 0UL)
-#ifndef CONFIG_DISCONTIGMEM
 /* pte_page() returns the "struct page *" corresponding to the PTE: */
 #define pte_page(pte)			virt_to_page(((pte_val(pte) & _PFN_MASK) + PAGE_OFFSET))
-#endif
 
 #define pmd_none(pmd)			(!pmd_val(pmd))
 #define pmd_bad(pmd)			(!ia64_phys_addr_valid(pmd_val(pmd)))
@@ -416,6 +412,7 @@
 
 extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
 extern void paging_init (void);
+extern int filter_rsvd_memory(unsigned long start, unsigned long end, void *arg);
 
 /*
  * Note: The macros below rely on the fact that MAX_SWAPFILES_SHIFT <= number of
diff -Nru a/include/asm-ia64/processor.h b/include/asm-ia64/processor.h
--- a/include/asm-ia64/processor.h	Wed Jul 16 15:36:33 2003
+++ b/include/asm-ia64/processor.h	Wed Jul 16 15:36:33 2003
@@ -185,6 +185,8 @@
 #endif
 #ifdef CONFIG_NUMA
 	struct ia64_node_data *node_data;
+	struct cpuinfo_ia64 *cpu_data[NR_CPUS];
+	int nodeid;
 #endif
 };
 
diff -Nru a/mm/bootmem.c b/mm/bootmem.c
--- a/mm/bootmem.c	Wed Jul 16 15:36:33 2003
+++ b/mm/bootmem.c	Wed Jul 16 15:36:33 2003
@@ -48,8 +48,24 @@
 	bootmem_data_t *bdata = pgdat->bdata;
 	unsigned long mapsize = ((end - start)+7)/8;
 
-	pgdat->pgdat_next = pgdat_list;
-	pgdat_list = pgdat;
+
+	/*
+	 * sort pgdat_list so that the lowest one comes first,
+	 * which makes alloc_bootmem_low_pages work as desired.
+	 */
+	if (!pgdat_list || pgdat_list->node_start_pfn > pgdat->node_start_pfn) {
+		pgdat->pgdat_next = pgdat_list;
+		pgdat_list = pgdat;
+	} else {
+		pg_data_t *tmp = pgdat_list;
+		while (tmp->pgdat_next) {
+			if (tmp->pgdat_next->node_start_pfn > pgdat->node_start_pfn)
+				break;
+			tmp = tmp->pgdat_next;
+		}
+		pgdat->pgdat_next = tmp->pgdat_next;
+		tmp->pgdat_next = pgdat;
+	}
 
 	mapsize = (mapsize + (sizeof(long) - 1UL)) & ~(sizeof(long) - 1UL);
 	bdata->node_bootmem_map = phys_to_virt(mapstart << PAGE_SHIFT);
@@ -251,7 +267,7 @@
 
 static unsigned long __init free_all_bootmem_core(pg_data_t *pgdat)
 {
-	struct page *page = pgdat->node_mem_map;
+	struct page *page;
 	bootmem_data_t *bdata = pgdat->bdata;
 	unsigned long i, count, total = 0;
 	unsigned long idx;
@@ -260,23 +276,23 @@
 	if (!bdata->node_bootmem_map) BUG();
 
 	count = 0;
+	page = virt_to_page(phys_to_virt(bdata->node_boot_start));
 	idx = bdata->node_low_pfn - (bdata->node_boot_start >> PAGE_SHIFT);
 	map = bdata->node_bootmem_map;
 	for (i = 0; i < idx; ) {
 		unsigned long v = ~map[i / BITS_PER_LONG];
 		if (v) {
 			unsigned long m;
-			for (m = 1; m && i < idx; m<<=1, page++, i++) {
+			for (m = 1; m && i < idx; m<<=1, i++) {
 				if (v & m) {
 					count++;
-					ClearPageReserved(page);
-					set_page_count(page, 1);
-					__free_page(page);
+					ClearPageReserved(page+i);
+					set_page_count(page+i, 1);
+					__free_page(page+i);
 				}
 			}
 		} else {
 			i+=BITS_PER_LONG;
-			page += BITS_PER_LONG;
 		}
 	}
 	total += count;

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Discontig-devel] [PATCH] another discontig patch
  2003-06-21  9:06 [Discontig-devel] [PATCH] another discontig patch Christoph Hellwig
                   ` (9 preceding siblings ...)
  2003-07-16 22:37 ` Jesse Barnes
@ 2003-07-17  8:23 ` Erich Focht
  2003-07-18  0:16 ` Jesse Barnes
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Erich Focht @ 2003-07-17  8:23 UTC (permalink / raw)
  To: linux-ia64

> > Same comment as above. NEC TX7 is an IA64_DIG platform.
>
> Oops, sorry.  Here's another one.  How does it look?
...

> --- a/arch/ia64/Kconfig	Wed Jul 16 15:36:33 2003
> +++ b/arch/ia64/Kconfig	Wed Jul 16 15:36:33 2003
> @@ -211,7 +211,7 @@
>
>  config NUMA
>  	bool "Enable NUMA support" if IA64_GENERIC || IA64_DIG || IA64_HP_ZX1
> -	default y if IA64_SGI_SN2
> +	default y if IA64_SGI_SN2 || IA64_GENERIC
>  	help
>  	  Say Y to compile the kernel to support NUMA (Non-Uniform Memory
>  	  Access).  This option is for configuring high-end multiprocessor
> @@ -234,9 +234,8 @@
>  endchoice

OK. Thanks.

>
>  config DISCONTIGMEM
> -	bool
> -	depends on IA64_SGI_SN2 || (IA64_GENERIC || IA64_DIG || IA64_HP_ZX1) && NUMA
> -	default y
> +	bool "Discontiguous memory support" if IA64_DIG
> +	default y if IA64_SGI_SN2 || IA64_GENERIC
>  	help
>  	  Say Y to support efficient handling of discontiguous physical memory,
>  	  for architectures which are either NUMA (Non-Uniform Memory Access)
> @@ -245,8 +244,7 @@

Maybe that would be better as
     bool "Discontiguous memory support" if (IA64_DIG && NUMA)

Discontigmem alone doesn't make sense yet, I think...
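
Spelled out, the whole option would then read something like this
(just a sketch of the idea, untested):

config DISCONTIGMEM
	# the prompt is only visible on DIG NUMA builds; everybody
	# else silently gets the default below
	bool "Discontiguous memory support" if (IA64_DIG && NUMA)
	default y if IA64_SGI_SN2 || IA64_GENERIC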

Regards,
Erich



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Discontig-devel] [PATCH] another discontig patch
  2003-06-21  9:06 [Discontig-devel] [PATCH] another discontig patch Christoph Hellwig
                   ` (10 preceding siblings ...)
  2003-07-17  8:23 ` Erich Focht
@ 2003-07-18  0:16 ` Jesse Barnes
  2003-07-18 16:15 ` Erich Focht
  2003-07-21 18:46 ` Takayoshi Kochi
  13 siblings, 0 replies; 15+ messages in thread
From: Jesse Barnes @ 2003-07-18  0:16 UTC (permalink / raw)
  To: linux-ia64

On Thu, Jul 17, 2003 at 10:23:37AM +0200, Erich Focht wrote:
> Maybe that would be better as
>      bool "Discontiguous memory support" if (IA64_DIG && NUMA)
> 
> Discontigmem alone doesn't make sense yet, I think...

Not for ia64 I guess, unless we make zx1 use discontig as well.  Patch
appended.
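
If zx1 does grow discontig support later, presumably the prompt
condition just widens, along the lines of (hypothetical, not part of
the patch below):

config DISCONTIGMEM
	bool "Discontiguous memory support" if ((IA64_DIG || IA64_HP_ZX1) && NUMA)
	default y if IA64_SGI_SN2 || IA64_GENERIC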

Thanks,
Jesse


# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	           ChangeSet	1.1397  -> 1.1399 
#	include/asm-ia64/page.h	1.19    -> 1.20   
#	arch/ia64/kernel/setup.c	1.53    -> 1.54   
#	include/asm-ia64/pgtable.h	1.28    -> 1.29   
#	        mm/bootmem.c	1.18    -> 1.19   
#	include/asm-ia64/numa.h	1.5     -> 1.6    
#	include/asm-ia64/processor.h	1.48    -> 1.49   
#	 arch/ia64/mm/init.c	1.46    -> 1.47   
#	include/asm-ia64/nodedata.h	1.3     -> 1.4    
#	arch/ia64/mm/discontig.c	1.4     -> 1.5    
#	   arch/ia64/Kconfig	1.38    -> 1.40   
#	drivers/acpi/Kconfig	1.12    -> 1.13   
#	include/asm-ia64/mmzone.h	1.4     -> 1.5    
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 03/07/16	jbarnes@tomahawk.engr.sgi.com	1.1398
# latest discontig
# --------------------------------------------
# 03/07/17	jbarnes@tomahawk.engr.sgi.com	1.1399
# another discontig patch
# --------------------------------------------
#
diff -Nru a/arch/ia64/Kconfig b/arch/ia64/Kconfig
--- a/arch/ia64/Kconfig	Thu Jul 17 16:59:05 2003
+++ b/arch/ia64/Kconfig	Thu Jul 17 16:59:05 2003
@@ -211,7 +211,7 @@
 
 config NUMA
 	bool "Enable NUMA support" if IA64_GENERIC || IA64_DIG || IA64_HP_ZX1
-	default y if IA64_SGI_SN2
+	default y if IA64_SGI_SN2 || IA64_GENERIC
 	help
 	  Say Y to compile the kernel to support NUMA (Non-Uniform Memory
 	  Access).  This option is for configuring high-end multiprocessor
@@ -234,9 +234,8 @@
 endchoice
 
 config DISCONTIGMEM
-	bool
-	depends on IA64_SGI_SN2 || (IA64_GENERIC || IA64_DIG || IA64_HP_ZX1) && NUMA
-	default y
+	bool "Discontiguous memory support" if (IA64_DIG && NUMA)
+	default y if IA64_SGI_SN2 || IA64_GENERIC
 	help
 	  Say Y to support efficient handling of discontiguous physical memory,
 	  for architectures which are either NUMA (Non-Uniform Memory Access)
@@ -245,8 +244,7 @@
 
 config VIRTUAL_MEM_MAP
 	bool "Enable Virtual Mem Map"
-	depends on !NUMA
-	default y if IA64_GENERIC || IA64_DIG || IA64_HP_ZX1
+	default y if !IA64_HP_SIM
 	help
 	  Say Y to compile the kernel with support for a virtual mem map.
 	  This is an alternate method of supporting large holes in the
@@ -259,8 +257,8 @@
 	  are unsure, say Y.
 
 config IA64_MCA
-	bool "Enable IA-64 Machine Check Abort" if IA64_GENERIC || IA64_DIG || IA64_HP_ZX1
-	default y if IA64_SGI_SN2
+	bool "Enable IA-64 Machine Check Abort"
+	default y if !IA64_HP_SIM
 	help
 	  Say Y here to enable machine check support for IA-64.  If you're
 	  unsure, answer Y.
@@ -292,43 +290,12 @@
 	depends on IA64_GENERIC || IA64_DIG || IA64_HP_ZX1 || IA64_SGI_SN2
 	default y
 
-config IA64_SGI_SN_DEBUG
-	bool "Enable extra debugging code"
-	depends on IA64_SGI_SN2
-	help
-	  Turns on extra debugging code in the SGI SN (Scalable NUMA) platform
-	  for IA-64.  Unless you are debugging problems on an SGI SN IA-64 box,
-	  say N.
-
 config IA64_SGI_SN_SIM
 	bool "Enable SGI Medusa Simulator Support"
 	depends on IA64_SGI_SN2
 	help
 	  If you are compiling a kernel that will run under SGI's IA-64
 	  simulator (Medusa) then say Y, otherwise say N.
-
-config IA64_SGI_AUTOTEST
-	bool "Enable autotest (llsc). Option to run cache test instead of booting"
-	depends on IA64_SGI_SN2
-	help
-	  Build a kernel used for hardware validation. If you include the
-	  keyword "autotest" on the boot command line, the kernel does NOT boot.
-	  Instead, it starts all cpus and runs cache coherency tests instead.
-
-	  If unsure, say N.
-
-config SERIAL_SGI_L1_PROTOCOL
-	bool "Enable protocol mode for the L1 console"
-	depends on IA64_SGI_SN2
-	help
-	  Uses protocol mode instead of raw mode for the level 1 console on the
-	  SGI SN (Scalable NUMA) platform for IA-64.  If you are compiling for
-	  an SGI SN box then Y is the recommended value, otherwise say N.
-
-config PERCPU_IRQ
-	bool
-	depends on IA64_SGI_SN2
-	default y
 
 # On IA-64, we always want an ELF /proc/kcore.
 config KCORE_ELF
diff -Nru a/arch/ia64/kernel/setup.c b/arch/ia64/kernel/setup.c
--- a/arch/ia64/kernel/setup.c	Thu Jul 17 16:59:04 2003
+++ b/arch/ia64/kernel/setup.c	Thu Jul 17 16:59:04 2003
@@ -138,7 +138,7 @@
 call_pernode_memory (unsigned long start, unsigned long end, void *arg)
 {
 	unsigned long rs, re;
-	void (*func)(unsigned long, unsigned long, int, int);
+	void (*func)(unsigned long, unsigned long, int);
 	int i;
 
 	start = PAGE_ALIGN(start);
@@ -149,22 +149,21 @@
 	func = arg;
 
 	if (!num_memblks) {
-		/*
-		 * This machine doesn't have SRAT, so call func with
-		 * nid=0, bank=0.
-		 */
+		/* No SRAT table, so assume one node (node 0) */
 		if (start < end)
-			(*func)(start, end - start, 0, 0);
+			(*func)(start, end, 0);
 		return;
 	}
 
 	for (i = 0; i < num_memblks; i++) {
-		rs = max(start, node_memblk[i].start_paddr);
-		re = min(end, node_memblk[i].start_paddr+node_memblk[i].size);
+		rs = max(__pa(start), node_memblk[i].start_paddr);
+		re = min(__pa(end), node_memblk[i].start_paddr+node_memblk[i].size);
 
 		if (rs < re)
-			(*func)(rs, re-rs, node_memblk[i].nid,
-				node_memblk[i].bank);
+			(*func)((unsigned long)__va(rs), (unsigned long)__va(re), node_memblk[i].nid);
+
+		if ((unsigned long)__va(re) == end)
+			break;
 	}
 }
 
@@ -180,7 +179,7 @@
 filter_rsvd_memory (unsigned long start, unsigned long end, void *arg)
 {
 	unsigned long range_start, range_end, prev_start;
-	void (*func)(unsigned long, unsigned long);
+	void (*func)(unsigned long, unsigned long, int);
 	int i;
 
 #if IGNORE_PFN0
@@ -202,9 +201,9 @@
 
 		if (range_start < range_end)
 #ifdef CONFIG_DISCONTIGMEM
-			call_pernode_memory(__pa(range_start), __pa(range_end), func);
+			call_pernode_memory(range_start, range_end, func);
 #else
-			(*func)(__pa(range_start), range_end - range_start);
+			(*func)(range_start, range_end, 0);
 #endif
 
 		/* nothing more available in this segment */
@@ -703,6 +702,8 @@
 	 * get_free_pages() cannot be used before cpu_init() done.  BSP allocates
 	 * "NR_CPUS" pages for all CPUs to avoid that AP calls get_zeroed_page().
 	 */
+#ifndef CONFIG_DISCONTIGMEM
+	/* for discontig machines, we do this in discontig.c */
 	if (smp_processor_id() == 0) {
 		cpu_data = __alloc_bootmem(PERCPU_PAGE_SIZE * NR_CPUS, PERCPU_PAGE_SIZE,
 					   __pa(MAX_DMA_ADDRESS));
@@ -714,6 +715,7 @@
 			per_cpu(local_per_cpu_offset, cpu) = __per_cpu_offset[cpu];
 		}
 	}
+#endif
 	cpu_data = __per_cpu_start + __per_cpu_offset[smp_processor_id()];
 #else /* !CONFIG_SMP */
 	cpu_data = __phys_per_cpu_start;
diff -Nru a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
--- a/arch/ia64/mm/discontig.c	Thu Jul 17 16:59:05 2003
+++ b/arch/ia64/mm/discontig.c	Thu Jul 17 16:59:05 2003
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2000 Silicon Graphics, Inc.  All rights reserved.
+ * Copyright (c) 2000, 2003 Silicon Graphics, Inc.  All rights reserved.
  * Copyright (c) 2001 Intel Corp.
  * Copyright (c) 2001 Tony Luck <tony.luck@intel.com>
  * Copyright (c) 2002 NEC Corp.
@@ -16,74 +16,60 @@
 #include <linux/mmzone.h>
 #include <linux/acpi.h>
 #include <linux/efi.h>
-
+#include <asm/pgalloc.h>
+#include <asm/tlb.h>
 
 /*
- * Round an address upward to the next multiple of GRANULE size.
+ * Round an address upward or downward to the next multiple of IA64_GRANULE_SIZE.
  */
+#define GRANULEROUNDDOWN(n) ((n) & ~(IA64_GRANULE_SIZE-1))
 #define GRANULEROUNDUP(n) (((n)+IA64_GRANULE_SIZE-1) & ~(IA64_GRANULE_SIZE-1))
 
-static struct ia64_node_data	*node_data[NR_NODES];
-static long			boot_pg_data[8*NR_NODES+sizeof(pg_data_t)]  __initdata;
-static pg_data_t		*pg_data_ptr[NR_NODES] __initdata;
-static bootmem_data_t		bdata[NR_NODES][NR_BANKS_PER_NODE+1] __initdata;
-
-extern int  filter_rsvd_memory (unsigned long start, unsigned long end, void *arg);
+/*
+ * Used to locate BOOT_DATA prior to initializing the node data area.
+ */
+#define BOOT_NODE_DATA(node)	pg_data_ptr[node]
 
 /*
- * Return the compact node number of this cpu. Used prior to
- * setting up the cpu_data area.
- *	Note - not fast, intended for boot use only!!
+ * To prevent cache aliasing effects, align per-node structures so that they 
+ * start at addresses that are strided by node number.
  */
-int
-boot_get_local_nodeid(void)
-{
-	int	i;
+#define NODEDATA_ALIGN(addr, node)	((((addr) + 1024*1024-1) & ~(1024*1024-1)) + (node)*PERCPU_PAGE_SIZE)
 
-	for (i = 0; i < NR_CPUS; i++)
-		if (node_cpuid[i].phys_id == hard_smp_processor_id())
-			return node_cpuid[i].nid;
 
-	/* node info missing, so nid should be 0.. */
-	return 0;
-}
+static struct ia64_node_data	*boot_node_data[NR_NODES] __initdata;
+static pg_data_t		*pg_data_ptr[NR_NODES] __initdata;
+static bootmem_data_t		bdata[NR_NODES] __initdata;
+static unsigned long		boot_pernode[NR_NODES] __initdata;
+static unsigned long		boot_pernodesize[NR_NODES] __initdata;
 
-/*
- * Return a pointer to the pg_data structure for a node.
- * This function is used ONLY in early boot before the cpu_data
- * structure is available.
- */
-pg_data_t* __init
-boot_get_pg_data_ptr(long node)
-{
-	return pg_data_ptr[node];
-}
+extern char __per_cpu_start[], __per_cpu_end[];
 
 
-/*
- * Return a pointer to the node data for the current node.
- *	(boottime initialization only)
- */
-struct ia64_node_data *
+struct ia64_node_data*
 get_node_data_ptr(void)
 {
-	return node_data[boot_get_local_nodeid()];
+	return boot_node_data[cpu_to_node_map[smp_processor_id()]];	/* ZZZ */
 }
 
 /*
  * We allocate one of the bootmem_data_t structs for each piece of memory
  * that we wish to treat as a contiguous block.  Each such block must start
- * on a BANKSIZE boundary.  Multiple banks per node is not supported.
+ * on a GRANULE boundary.  Multiple banks per node are not supported.
+ *   (Note: on SN2, all memory on a node is treated as a single bank.
+ *   Holes within the bank are supported. This works because memory
+ *   from different banks is not interleaved. The bootmap bitmap
+ *   for the node is somewhat large but not too large).
  */
 static int __init
-build_maps(unsigned long pstart, unsigned long length, int node)
+build_maps(unsigned long start, unsigned long end, int node)
 {
 	bootmem_data_t	*bdp;
 	unsigned long cstart, epfn;
 
-	bdp = pg_data_ptr[node]->bdata;
-	epfn = GRANULEROUNDUP(pstart + length) >> PAGE_SHIFT;
-	cstart = pstart & ~(BANKSIZE - 1);
+	bdp = &bdata[node];
+	epfn = GRANULEROUNDUP(__pa(end)) >> PAGE_SHIFT;
+	cstart = GRANULEROUNDDOWN(__pa(start));
 
 	if (!bdp->node_low_pfn) {
 		bdp->node_boot_start = cstart;
@@ -99,34 +85,96 @@
 	return 0;
 }
 
+
+/*
+ * Count the number of cpus on the node
+ */
+static __inline__ int
+count_cpus(int node)
+{
+	int cpu, n=0;
+
+	for (cpu=0; cpu < NR_CPUS; cpu++)
+		if (node == node_cpuid[cpu].nid)
+			n++;
+	return n;
+}
+
+
 /*
- * Find space on each node for the bootmem map.
+ * Find space on each node for the bootmem map & other per-node data structures.
  *
  * Called by efi_memmap_walk to find boot memory on each node. Note that
  * only blocks that are free are passed to this routine (currently filtered by
  * free_available_memory).
  */
 static int __init
-find_bootmap_space(unsigned long pstart, unsigned long length, int node)
+find_pernode_space(unsigned long start, unsigned long end, int node)
 {
-	unsigned long	mapsize, pages, epfn;
+	unsigned long	mapsize, pages, epfn, map=0, cpu, cpus;
+	unsigned long	pernodesize=0, pernode;
+	unsigned long	cpu_data;
+	unsigned long	pstart, length;
 	bootmem_data_t	*bdp;
 
+	pstart = __pa(start);
+	length = end - start;
 	epfn = (pstart + length) >> PAGE_SHIFT;
-	bdp = &pg_data_ptr[node]->bdata[0];
+	bdp = &bdata[node];
 
 	if (pstart < bdp->node_boot_start || epfn > bdp->node_low_pfn)
 		return 0;
 
-	if (!bdp->node_bootmem_map) {
+	if (!boot_pernode[node]) {
+		cpus = count_cpus(node);
+		pernodesize += PERCPU_PAGE_SIZE * cpus;
+		pernodesize += L1_CACHE_ALIGN(sizeof(pg_data_t));
+		pernodesize += L1_CACHE_ALIGN(sizeof(struct ia64_node_data));
+		pernodesize = PAGE_ALIGN(pernodesize);
+		pernode = NODEDATA_ALIGN(pstart, node);
+	
+		if (pstart + length > (pernode + pernodesize)) {
+			boot_pernode[node] = pernode;
+			boot_pernodesize[node] = pernodesize;
+			memset(__va(pernode), 0, pernodesize);
+
+			cpu_data = pernode;
+			pernode += PERCPU_PAGE_SIZE * cpus;
+
+			pg_data_ptr[node] = __va(pernode);
+			pernode += L1_CACHE_ALIGN(sizeof(pg_data_t));
+
+			boot_node_data[node] = __va(pernode);
+			pernode += L1_CACHE_ALIGN(sizeof(struct ia64_node_data));
+
+			pg_data_ptr[node]->bdata = &bdata[node];
+			pernode += L1_CACHE_ALIGN(sizeof(pg_data_t));
+
+			for (cpu=0; cpu < NR_CPUS; cpu++) {
+				if (node == node_cpuid[cpu].nid) {
+					extern char __per_cpu_start[], __phys_per_cpu_start[];
+					memcpy((void*)cpu_data, __phys_per_cpu_start, __per_cpu_end - __per_cpu_start);
+					__per_cpu_offset[cpu] = (char*)__va(cpu_data) - __per_cpu_start;
+					cpu_data +=  PERCPU_PAGE_SIZE;
+				}
+			}
+		}
+	}
+
+	pernode = boot_pernode[node];
+	pernodesize = boot_pernodesize[node];
+	if (pernode && !bdp->node_bootmem_map) {
 		pages = bdp->node_low_pfn - (bdp->node_boot_start>>PAGE_SHIFT);
 		mapsize = bootmem_bootmap_pages(pages) << PAGE_SHIFT;
-		if (length > mapsize) {
-			init_bootmem_node(
-				BOOT_NODE_DATA(node),
-				pstart>>PAGE_SHIFT, 
-				bdp->node_boot_start>>PAGE_SHIFT,
-				bdp->node_low_pfn);
+
+		if (pernode - pstart > mapsize)
+			map = pstart;
+		else if (pstart + length - pernode - pernodesize > mapsize)
+			map = pernode + pernodesize;
+
+		if (map) {
+			init_bootmem_node(BOOT_NODE_DATA(node),	map>>PAGE_SHIFT, 
+				bdp->node_boot_start>>PAGE_SHIFT, bdp->node_low_pfn);
 		}
 
 	}
@@ -143,9 +191,9 @@
  *
  */
 static int __init
-discontig_free_bootmem_node(unsigned long pstart, unsigned long length, int node)
+discontig_free_bootmem_node(unsigned long start, unsigned long end, int node)
 {
-	free_bootmem_node(BOOT_NODE_DATA(node), pstart, length);
+	free_bootmem_node(BOOT_NODE_DATA(node), __pa(start), end - start);
 
 	return 0;
 }
@@ -158,53 +206,50 @@
 discontig_reserve_bootmem(void)
 {
 	int		node;
-	unsigned long	mapbase, mapsize, pages;
+	unsigned long	base, size, pages;
 	bootmem_data_t	*bdp;
 
 	for (node = 0; node < numnodes; node++) {
 		bdp = BOOT_NODE_DATA(node)->bdata;
 
 		pages = bdp->node_low_pfn - (bdp->node_boot_start>>PAGE_SHIFT);
-		mapsize = bootmem_bootmap_pages(pages) << PAGE_SHIFT;
-		mapbase = __pa(bdp->node_bootmem_map);
-		reserve_bootmem_node(BOOT_NODE_DATA(node), mapbase, mapsize);
+		size = bootmem_bootmap_pages(pages) << PAGE_SHIFT;
+		base = __pa(bdp->node_bootmem_map);
+		reserve_bootmem_node(BOOT_NODE_DATA(node), base, size);
+
+		size = boot_pernodesize[node];
+		base = __pa(boot_pernode[node]);
+		reserve_bootmem_node(BOOT_NODE_DATA(node), base, size);
 	}
 }
 
 /*
- * Allocate per node tables.
- * 	- the pg_data structure is allocated on each node. This minimizes offnode 
- *	  memory references
- *	- the node data is allocated & initialized. Portions of this structure is read-only (after 
- *	  boot) and contains node-local pointers to usefuls data structures located on
- *	  other nodes.
+ * Initialize per-node data
+ *
+ * Finish setting up the node data for this node, then copy it to the other nodes.
  *
- * We also switch to using the "real" pg_data structures at this point. Earlier in boot, we
- * use a different structure. The only use for pg_data prior to the point in boot is to get 
- * the pointer to the bdata for the node.
  */
 static void __init
-allocate_pernode_structures(void)
+initialize_pernode_data(void)
 {
-	pg_data_t	*pgdat=0, *new_pgdat_list=0;
-	int		node, mynode;
+	int	cpu, node;
+
+	memcpy(boot_node_data[0]->pg_data_ptrs, pg_data_ptr, sizeof(pg_data_ptr));
+	memcpy(boot_node_data[0]->node_data_ptrs, boot_node_data, sizeof(boot_node_data));
 
-	mynode = boot_get_local_nodeid();
-	for (node = numnodes - 1; node >= 0 ; node--) {
-		node_data[node] = alloc_bootmem_node(BOOT_NODE_DATA(node), sizeof (struct ia64_node_data));
-		pgdat = __alloc_bootmem_node(BOOT_NODE_DATA(node), sizeof(pg_data_t), SMP_CACHE_BYTES, 0);
-		pgdat->bdata = &(bdata[node][0]);
-		pg_data_ptr[node] = pgdat;
-		pgdat->pgdat_next = new_pgdat_list;
-		new_pgdat_list = pgdat;
+	for (node=1; node < numnodes; node++) {
+		memcpy(boot_node_data[node], boot_node_data[0], sizeof(struct ia64_node_data));
+		boot_node_data[node]->node = node;
 	}
-	
-	memcpy(node_data[mynode]->pg_data_ptrs, pg_data_ptr, sizeof(pg_data_ptr));
-	memcpy(node_data[mynode]->node_data_ptrs, node_data, sizeof(node_data));
 
-	pgdat_list = new_pgdat_list;
+	for (cpu=0; cpu < NR_CPUS; cpu++) {
+		node = node_cpuid[cpu].nid;
+		per_cpu(cpu_info, cpu).node_data = boot_node_data[node];
+		per_cpu(cpu_info, cpu).nodeid = node;
+	}
 }
 
+
 /*
  * Called early in boot to setup the boot memory allocator, and to
  * allocate the node-local pg_data & node-directory data structures..
@@ -212,96 +257,19 @@
 void __init
 discontig_mem_init(void)
 {
-	int	node;
-
 	if (numnodes == 0) {
 		printk(KERN_ERR "node info missing!\n");
 		numnodes = 1;
 	}
 
-	for (node = 0; node < numnodes; node++) {
-		pg_data_ptr[node] = (pg_data_t*) &boot_pg_data[node];
-		pg_data_ptr[node]->bdata = &bdata[node][0];
-	}
-
 	min_low_pfn = -1;
 	max_low_pfn = 0;
 
         efi_memmap_walk(filter_rsvd_memory, build_maps);
-        efi_memmap_walk(filter_rsvd_memory, find_bootmap_space);
+        efi_memmap_walk(filter_rsvd_memory, find_pernode_space);
         efi_memmap_walk(filter_rsvd_memory, discontig_free_bootmem_node);
-	discontig_reserve_bootmem();
-	allocate_pernode_structures();
-}
-
-/*
- * Initialize the paging system.
- *	- determine sizes of each node
- *	- initialize the paging system for the node
- *	- build the nodedir for the node. This contains pointers to
- *	  the per-bank mem_map entries.
- *	- fix the page struct "virtual" pointers. These are bank specific
- *	  values that the paging system doesn't understand.
- *	- replicate the nodedir structure to other nodes	
- */ 
-
-void __init
-discontig_paging_init(void)
-{
-	int		node, mynode;
-	unsigned long	max_dma, zones_size[MAX_NR_ZONES];
-	unsigned long	kaddr, ekaddr, bid;
-	struct page	*page;
-	bootmem_data_t	*bdp;
-
-	max_dma = virt_to_phys((void *) MAX_DMA_ADDRESS) >> PAGE_SHIFT;
 
-	mynode = boot_get_local_nodeid();
-	for (node = 0; node < numnodes; node++) {
-		long pfn, startpfn;
-
-		memset(zones_size, 0, sizeof(zones_size));
-
-		startpfn = -1;
-		bdp = BOOT_NODE_DATA(node)->bdata;
-		pfn = bdp->node_boot_start >> PAGE_SHIFT;
-		if (startpfn == -1)
-			startpfn = pfn;
-		if (pfn > max_dma)
-			zones_size[ZONE_NORMAL] += (bdp->node_low_pfn - pfn);
-		else if (bdp->node_low_pfn < max_dma)
-			zones_size[ZONE_DMA] += (bdp->node_low_pfn - pfn);
-		else {
-			zones_size[ZONE_DMA] += (max_dma - pfn);
-			zones_size[ZONE_NORMAL] += (bdp->node_low_pfn - max_dma);
-		}
-
-		free_area_init_node(node, NODE_DATA(node), NULL, zones_size, startpfn, 0);
-
-		page = NODE_DATA(node)->node_mem_map;
-
-		bdp = BOOT_NODE_DATA(node)->bdata;
-
-		kaddr = (unsigned long)__va(bdp->node_boot_start);
-		ekaddr = (unsigned long)__va(bdp->node_low_pfn << PAGE_SHIFT);
-		while (kaddr < ekaddr) {
-			if (paddr_to_nid(__pa(kaddr)) == node) {
-				bid = BANK_MEM_MAP_INDEX(kaddr);
-				node_data[mynode]->node_id_map[bid] = node;
-				node_data[mynode]->bank_mem_map_base[bid] = page;
-			}
-			kaddr += BANKSIZE;
-			page += BANKSIZE/PAGE_SIZE;
-		}
-	}
-
-	/*
-	 * Finish setting up the node data for this node, then copy it to the other nodes.
-	 */
-	for (node=0; node < numnodes; node++)
-		if (mynode != node) {
-			memcpy(node_data[node], node_data[mynode], sizeof(struct ia64_node_data));
-			node_data[node]->node = node;
-		}
+	discontig_reserve_bootmem();
+	initialize_pernode_data();
 }
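
As an aside for readers following the new bootmem layout: here is a minimal
user-space sketch of the arithmetic that find_pernode_space() performs per
node.  The constants and struct sizes are illustrative stand-ins rather than
the kernel's real values; only the ordering of the pieces (per-cpu pages,
then the node's pg_data_t, then its ia64_node_data) follows the patch.

#include <stdio.h>

/* Illustrative stand-ins for the kernel constants (not the real values). */
#define PERCPU_PAGE_SIZE	(64 * 1024)
#define L1_CACHE_BYTES		128
#define PAGE_SIZE		(16 * 1024)

#define L1_CACHE_ALIGN(x)	(((x) + L1_CACHE_BYTES - 1) & ~(L1_CACHE_BYTES - 1UL))
#define PAGE_ALIGN(x)		(((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1UL))

/* Stride the base address by node number, mirroring NODEDATA_ALIGN(). */
#define NODEDATA_ALIGN(addr, node) \
	((((addr) + 1024*1024 - 1) & ~(1024*1024 - 1UL)) + (node) * PERCPU_PAGE_SIZE)

struct pg_data_stub   { long pad[32]; };	/* stands in for pg_data_t */
struct node_data_stub { long pad[16]; };	/* stands in for ia64_node_data */

int main(void)
{
	unsigned long pstart = 0x4000000UL;	/* start of a free block on the node */
	int node = 1, cpus = 4;
	unsigned long pernodesize = 0, pernode, p;

	/* Same accounting as find_pernode_space(): per-cpu areas first,
	 * then the node's pg_data_t, then its ia64_node_data. */
	pernodesize += PERCPU_PAGE_SIZE * cpus;
	pernodesize += L1_CACHE_ALIGN(sizeof(struct pg_data_stub));
	pernodesize += L1_CACHE_ALIGN(sizeof(struct node_data_stub));
	pernodesize  = PAGE_ALIGN(pernodesize);
	pernode      = NODEDATA_ALIGN(pstart, node);

	p = pernode;
	printf("cpu_data  at %#lx\n", p);
	p += PERCPU_PAGE_SIZE * cpus;
	printf("pg_data_t at %#lx\n", p);
	p += L1_CACHE_ALIGN(sizeof(struct pg_data_stub));
	printf("node_data at %#lx, %#lx bytes total\n", p, pernodesize);
	return 0;
}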
 
diff -Nru a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
--- a/arch/ia64/mm/init.c	Thu Jul 17 16:59:05 2003
+++ b/arch/ia64/mm/init.c	Thu Jul 17 16:59:05 2003
@@ -44,7 +44,7 @@
 #ifdef CONFIG_VIRTUAL_MEM_MAP
 # define LARGE_GAP	0x40000000	/* Use virtual mem map if hole is > than this */
   unsigned long vmalloc_end = VMALLOC_END_INIT;
-  static struct page *vmem_map;
+  struct page *vmem_map;
   static unsigned long num_dma_physpages;
 #endif
 
@@ -240,7 +240,7 @@
 				else if (page_count(pgdat->node_mem_map + i))
 					shared += page_count(pgdat->node_mem_map + i) - 1;
 			}
-			printk("\t%d pages of RAM\n", pgdat->node_spanned_pages);
+			printk("\t%ld pages of RAM\n", pgdat->node_spanned_pages);
 			printk("\t%d reserved pages\n", reserved);
 			printk("\t%d pages shared\n", shared);
 			printk("\t%d pages swap cached\n", cached);
@@ -397,6 +397,7 @@
 {
 	unsigned long address, start_page, end_page;
 	struct page *map_start, *map_end;
+	int node;
 	pgd_t *pgd;
 	pmd_t *pmd;
 	pte_t *pte;
@@ -406,19 +407,20 @@
 
 	start_page = (unsigned long) map_start & PAGE_MASK;
 	end_page = PAGE_ALIGN((unsigned long) map_end);
+	node = paddr_to_nid(__pa(start));
 
 	for (address = start_page; address < end_page; address += PAGE_SIZE) {
 		pgd = pgd_offset_k(address);
 		if (pgd_none(*pgd))
-			pgd_populate(&init_mm, pgd, alloc_bootmem_pages(PAGE_SIZE));
+			pgd_populate(&init_mm, pgd, alloc_bootmem_pages_node(NODE_DATA(node), PAGE_SIZE));
 		pmd = pmd_offset(pgd, address);
 
 		if (pmd_none(*pmd))
-			pmd_populate_kernel(&init_mm, pmd, alloc_bootmem_pages(PAGE_SIZE));
+			pmd_populate_kernel(&init_mm, pmd, alloc_bootmem_pages_node(NODE_DATA(node), PAGE_SIZE));
 		pte = pte_offset_kernel(pmd, address);
 
 		if (pte_none(*pte))
-			set_pte(pte, pfn_pte(__pa(alloc_bootmem_pages(PAGE_SIZE)) >> PAGE_SHIFT,
+			set_pte(pte, pfn_pte(__pa(alloc_bootmem_pages_node(NODE_DATA(node), PAGE_SIZE)) >> PAGE_SHIFT,
 					     PAGE_KERNEL));
 	}
 	return 0;
@@ -431,6 +433,14 @@
 	unsigned long zone;
 };
 
+struct memmap_count_callback_data {
+	int node;
+	unsigned long num_physpages;
+	unsigned long num_dma_physpages;
+	unsigned long min_pfn;
+	unsigned long max_pfn;
+} cdata;
+
 static int
 virtual_memmap_init (u64 start, u64 end, void *arg)
 {
@@ -489,16 +499,6 @@
 }
 
 static int
-count_dma_pages (u64 start, u64 end, void *arg)
-{
-	unsigned long *count = arg;
-
-	if (end <= MAX_DMA_ADDRESS)
-		*count += (end - start) >> PAGE_SHIFT;
-	return 0;
-}
-
-static int
 find_largest_hole (u64 start, u64 end, void *arg)
 {
 	u64 *max_gap = arg;
@@ -514,102 +514,101 @@
 }
 #endif /* CONFIG_VIRTUAL_MEM_MAP */
 
+#define GRANULEROUNDDOWN(n) ((n) & ~(IA64_GRANULE_SIZE-1))
+#define GRANULEROUNDUP(n) (((n)+IA64_GRANULE_SIZE-1) & ~(IA64_GRANULE_SIZE-1))
+#define ORDERROUNDDOWN(n) ((n) & ~((PAGE_SIZE<<MAX_ORDER)-1))
 static int
-count_pages (u64 start, u64 end, void *arg)
+count_pages (unsigned long start, unsigned long end, int node)
 {
-	unsigned long *count = arg;
+	start = __pa(start);
+	end = __pa(end);
 
-	*count += (end - start) >> PAGE_SHIFT;
+	if (node == cdata.node) {
+		cdata.num_physpages += (end - start) >> PAGE_SHIFT;
+		if (start <= __pa(MAX_DMA_ADDRESS))
+			cdata.num_dma_physpages += (min(end, __pa(MAX_DMA_ADDRESS)) - start) >> PAGE_SHIFT;
+		start = GRANULEROUNDDOWN(start);
+		start = ORDERROUNDDOWN(start);
+		end = GRANULEROUNDUP(end);
+		cdata.max_pfn = max(cdata.max_pfn, end >> PAGE_SHIFT);
+		cdata.min_pfn = min(cdata.min_pfn, start >> PAGE_SHIFT);
+	}
 	return 0;
 }
 
 /*
  * Set up the page tables.
  */
-
-#ifdef CONFIG_DISCONTIGMEM
 void
 paging_init (void)
 {
-	extern void discontig_paging_init(void);
-
-	discontig_paging_init();
-	efi_memmap_walk(count_pages, &num_physpages);
-	zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
-}
-#else /* !CONFIG_DISCONTIGMEM */
-void
-paging_init (void)
-{
-	unsigned long max_dma;
+	unsigned long max_dma_pfn;
 	unsigned long zones_size[MAX_NR_ZONES];
 #  ifdef CONFIG_VIRTUAL_MEM_MAP
 	unsigned long zholes_size[MAX_NR_ZONES];
 	unsigned long max_gap;
 #  endif
+	int node;
 
-	/* initialize mem_map[] */
-
-	memset(zones_size, 0, sizeof(zones_size));
-
-	num_physpages = 0;
-	efi_memmap_walk(count_pages, &num_physpages);
-
-	max_dma = virt_to_phys((void *) MAX_DMA_ADDRESS) >> PAGE_SHIFT;
-
-#  ifdef CONFIG_VIRTUAL_MEM_MAP
-	memset(zholes_size, 0, sizeof(zholes_size));
-
-	num_dma_physpages = 0;
-	efi_memmap_walk(count_dma_pages, &num_dma_physpages);
-
-	if (max_low_pfn < max_dma) {
-		zones_size[ZONE_DMA] = max_low_pfn;
-		zholes_size[ZONE_DMA] = max_low_pfn - num_dma_physpages;
-	} else {
-		zones_size[ZONE_DMA] = max_dma;
-		zholes_size[ZONE_DMA] = max_dma - num_dma_physpages;
-		if (num_physpages > num_dma_physpages) {
-			zones_size[ZONE_NORMAL] = max_low_pfn - max_dma;
-			zholes_size[ZONE_NORMAL] = ((max_low_pfn - max_dma)
-						    - (num_physpages - num_dma_physpages));
-		}
-	}
-
+	max_dma_pfn = virt_to_phys((void *) MAX_DMA_ADDRESS) >> PAGE_SHIFT;
 	max_gap = 0;
 	efi_memmap_walk(find_largest_hole, (u64 *)&max_gap);
-	if (max_gap < LARGE_GAP) {
-		vmem_map = (struct page *) 0;
-		free_area_init_node(0, &contig_page_data, NULL, zones_size, 0, zholes_size);
-		mem_map = contig_page_data.node_mem_map;
-	}
-	else {
-		unsigned long map_size;
-
-		/* allocate virtual_mem_map */
 
-		map_size = PAGE_ALIGN(max_low_pfn * sizeof(struct page));
-		vmalloc_end -= map_size;
-		vmem_map = (struct page *) vmalloc_end;
-		efi_memmap_walk(create_mem_map_page_table, 0);
-
-		free_area_init_node(0, &contig_page_data, vmem_map, zones_size, 0, zholes_size);
+	for (node = 0; node < numnodes; node++) {
+		memset(zones_size, 0, sizeof(zones_size));
+		memset(zholes_size, 0, sizeof(zholes_size));
+		memset(&cdata, 0, sizeof(cdata));
+
+		cdata.node = node;
+		cdata.min_pfn = ~0;
+
+		efi_memmap_walk(filter_rsvd_memory, count_pages);
+		num_dma_physpages += cdata.num_dma_physpages;
+		num_physpages += cdata.num_physpages;
+
+		if (cdata.min_pfn >= max_dma_pfn) {
+			/* Above the DMA zone */
+			zones_size[ZONE_NORMAL] = cdata.max_pfn - cdata.min_pfn;
+			zholes_size[ZONE_NORMAL] = cdata.max_pfn - cdata.min_pfn - cdata.num_physpages;
+		} else if (cdata.max_pfn < max_dma_pfn) {
+			/* This block is DMAable */
+			zones_size[ZONE_DMA] = cdata.max_pfn - cdata.min_pfn;
+			zholes_size[ZONE_DMA] = cdata.max_pfn - cdata.min_pfn - cdata.num_dma_physpages;
+		} else {
+			zones_size[ZONE_DMA] = max_dma_pfn - cdata.min_pfn;
+			zholes_size[ZONE_DMA] = zones_size[ZONE_DMA] - cdata.num_dma_physpages;
+			zones_size[ZONE_NORMAL] = cdata.max_pfn - max_dma_pfn;
+			zholes_size[ZONE_NORMAL] = zones_size[ZONE_NORMAL] - (cdata.num_physpages - cdata.num_dma_physpages);
+		}
 
-		mem_map = contig_page_data.node_mem_map;
-		printk("Virtual mem_map starts at 0x%p\n", mem_map);
-	}
-#  else /* !CONFIG_VIRTUAL_MEM_MAP */
-	if (max_low_pfn < max_dma)
-		zones_size[ZONE_DMA] = max_low_pfn;
-	else {
-		zones_size[ZONE_DMA] = max_dma;
-		zones_size[ZONE_NORMAL] = max_low_pfn - max_dma;
+		if (numnodes == 1 && max_gap < LARGE_GAP) {
+			/* Just one node with no big holes... */
+			vmem_map = (struct page *)0;
+			zones_size[ZONE_DMA] += cdata.min_pfn;
+			zholes_size[ZONE_DMA] += cdata.min_pfn;
+			free_area_init_node(0, NODE_DATA(node), NODE_DATA(node)->node_mem_map,
+					    zones_size, 0, zholes_size);
+		}
+		else {
+			/* allocate virtual mem_map */
+			if (node == 0) {
+				unsigned long map_size;
+				map_size = PAGE_ALIGN(max_low_pfn*sizeof(struct page));
+				vmalloc_end -= map_size;
+				vmem_map = (struct page *) vmalloc_end;
+				efi_memmap_walk(create_mem_map_page_table, 0);
+				printk("Virtual mem_map starts at 0x%p\n", vmem_map);
+#ifndef CONFIG_DISCONTIGMEM
+				mem_map = vmem_map;
+#endif
+			}
+			free_area_init_node(node, NODE_DATA(node), vmem_map + cdata.min_pfn,
+					    zones_size, cdata.min_pfn, zholes_size);
+		}
 	}
-	free_area_init(zones_size);
-#  endif /* !CONFIG_VIRTUAL_MEM_MAP */
+
 	zero_page_memmap_ptr = virt_to_page(ia64_imva(empty_zero_page));
 }
-#endif /* !CONFIG_DISCONTIGMEM */
 
 static int
 count_reserved_pages (u64 start, u64 end, void *arg)
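
The per-node zone split in the new paging_init() has three cases: a node
entirely above the DMA limit, entirely below it, or straddling it.  Below is
a small self-contained model of that case analysis with made-up pfn values,
assuming the same zones_size/zholes_size bookkeeping as the patch.

#include <stdio.h>

/* Toy version of the per-node zone split in the patched paging_init().
 * "present" is the number of pages that actually exist in the span;
 * holes are the span minus present, exactly as zholes_size is derived. */
static void split_zones(unsigned long min_pfn, unsigned long max_pfn,
			unsigned long present, unsigned long present_dma,
			unsigned long max_dma_pfn)
{
	unsigned long dma = 0, normal = 0, dma_holes = 0, normal_holes = 0;

	if (min_pfn >= max_dma_pfn) {		/* entirely above the DMA zone */
		normal = max_pfn - min_pfn;
		normal_holes = normal - present;
	} else if (max_pfn < max_dma_pfn) {	/* entirely DMAable */
		dma = max_pfn - min_pfn;
		dma_holes = dma - present_dma;
	} else {				/* straddles the DMA limit */
		dma = max_dma_pfn - min_pfn;
		dma_holes = dma - present_dma;
		normal = max_pfn - max_dma_pfn;
		normal_holes = normal - (present - present_dma);
	}
	printf("DMA %lu pages (%lu holes), NORMAL %lu pages (%lu holes)\n",
	       dma, dma_holes, normal, normal_holes);
}

int main(void)
{
	/* e.g. a node spanning a DMA limit at pfn 0x40000 */
	split_zones(0x20000, 0x60000, 0x38000, 0x1e000, 0x40000);
	return 0;
}
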
diff -Nru a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
--- a/drivers/acpi/Kconfig	Thu Jul 17 16:59:05 2003
+++ b/drivers/acpi/Kconfig	Thu Jul 17 16:59:05 2003
@@ -133,7 +133,7 @@
 
 config ACPI_NUMA
 	bool "NUMA support" if NUMA && (IA64 && !IA64_HP_SIM || X86 && ACPI && !ACPI_HT_ONLY && !X86_64)
-	default y if IA64 && IA64_SGI_SN
+	default y if IA64_GENERIC || IA64_SGI_SN2
 
 config ACPI_ASUS
         tristate "ASUS/Medion Laptop Extras"
diff -Nru a/include/asm-ia64/mmzone.h b/include/asm-ia64/mmzone.h
--- a/include/asm-ia64/mmzone.h	Thu Jul 17 16:59:05 2003
+++ b/include/asm-ia64/mmzone.h	Thu Jul 17 16:59:05 2003
@@ -3,7 +3,7 @@
  * License.  See the file "COPYING" in the main directory of this archive
  * for more details.
  *
- * Copyright (c) 2000 Silicon Graphics, Inc.  All rights reserved.
+ * Copyright (c) 2000,2003 Silicon Graphics, Inc.  All rights reserved.
  * Copyright (c) 2002 NEC Corp.
  * Copyright (c) 2002 Erich Focht <efocht@ess.nec.de>
  * Copyright (c) 2002 Kimio Suganuma <k-suganuma@da.jp.nec.com>
@@ -14,150 +14,50 @@
 #include <linux/config.h>
 #include <linux/init.h>
 
-/*
- * Given a kaddr, find the base mem_map address for the start of the mem_map
- * entries for the bank containing the kaddr.
- */
-#define BANK_MEM_MAP_BASE(kaddr) local_node_data->bank_mem_map_base[BANK_MEM_MAP_INDEX(kaddr)]
-
-/*
- * Given a kaddr, this macro return the relative map number 
- * within the bank.
- */
-#define BANK_MAP_NR(kaddr) 	(BANK_OFFSET(kaddr) >> PAGE_SHIFT)
 
-/*
- * Given a pte, this macro returns a pointer to the page struct for the pte.
- */
-#define pte_page(pte)	virt_to_page(PAGE_OFFSET | (pte_val(pte)&_PFN_MASK))
+#ifdef CONFIG_NUMA
 
-/*
- * Determine if a kaddr is a valid memory address of memory that
- * actually exists. 
- *
- * The check consists of 2 parts:
- *	- verify that the address is a region 7 address & does not 
- *	  contain any bits that preclude it from being a valid platform
- *	  memory address
- *	- verify that the chunk actually exists.
- *
- * Note that IO addresses are NOT considered valid addresses.
- *
- * Note, many platforms can simply check if kaddr exceeds a specific size.  
- *	(However, this won't work on SGI platforms since IO space is embedded 
- * 	within the range of valid memory addresses & nodes have holes in the 
- *	address range between banks). 
- */
-#define kern_addr_valid(kaddr)		({long _kav=(long)(kaddr);	\
-					VALID_MEM_KADDR(_kav);})
-
-/*
- * Given a kaddr, return a pointer to the page struct for the page.
- * If the kaddr does not represent RAM memory that potentially exists, return
- * a pointer the page struct for max_mapnr. IO addresses will
- * return the page for max_nr. Addresses in unpopulated RAM banks may
- * return undefined results OR may panic the system.
- *
- */
-#define virt_to_page(kaddr)	({long _kvtp=(long)(kaddr);	\
-				(VALID_MEM_KADDR(_kvtp))	\
-					? BANK_MEM_MAP_BASE(_kvtp) + BANK_MAP_NR(_kvtp)	\
-					: NULL;})
+#ifdef CONFIG_IA64_DIG
 
 /*
- * Given a page struct entry, return the physical address that the page struct represents.
- * Since IA64 has all memory in the DMA zone, the following works:
+ * Platform definitions for DIG platform with contiguous memory.
  */
-#define page_to_phys(page)	__pa(page_address(page))
-
-#define node_mem_map(nid)	(NODE_DATA(nid)->node_mem_map)
+#define MAX_PHYSNODE_ID	8		/* Maximum node number +1 */
+#define NR_NODES	8		/* Maximum number of nodes in SSI */
+#define NR_MEMBLKS	(NR_NODES * 32)
 
-#define node_localnr(pfn, nid)	((pfn) - NODE_DATA(nid)->node_start_pfn)
 
-#define pfn_to_page(pfn)	(struct page *)(node_mem_map(pfn_to_nid(pfn)) + node_localnr(pfn, pfn_to_nid(pfn)))
 
-#define pfn_to_nid(pfn)		 local_node_data->node_id_map[(pfn << PAGE_SHIFT) >> BANKSHIFT]
-
-#define page_to_pfn(page)	(long)((page - page_zone(page)->zone_mem_map) + page_zone(page)->zone_start_pfn)
 
+#elif CONFIG_IA64_SGI_SN2
 
 /*
- * pfn_valid should be made as fast as possible, and the current definition
- * is valid for machines that are NUMA, but still contiguous, which is what
- * is currently supported. A more generalised, but slower definition would
- * be something like this - mbligh:
- * ( pfn_to_pgdat(pfn) && (pfn < node_end_pfn(pfn_to_nid(pfn))) )
+ * Platform definitions for the SGI SN2 platform with discontiguous memory.
  */
-#define pfn_valid(pfn)          (pfn < max_low_pfn)
-extern unsigned long max_low_pfn;
+#define MAX_PHYSNODE_ID	2048		/* Maximum node number +1 */
+#define NR_NODES	256		/* Maximum number of compute nodes in SSI */
+#define NR_MEMBLKS	(NR_NODES)
 
+#elif CONFIG_IA64_GENERIC
 
-#ifdef CONFIG_IA64_DIG
 
 /*
- * Platform definitions for DIG platform with contiguous memory.
+ * Platform definitions for GENERIC platform with contiguous or discontiguous memory.
  */
-#define MAX_PHYSNODE_ID	8	/* Maximum node number +1 */
-#define NR_NODES	8	/* Maximum number of nodes in SSI */
+#define MAX_PHYSNODE_ID 2048		/* Maximum node number +1 */
+#define NR_NODES        256		/* Maximum number of nodes in SSI */
+#define NR_MEMBLKS      (NR_NODES)
 
-#define MAX_PHYS_MEMORY	(1UL << 40)	/* 1 TB */
 
-/*
- * Bank definitions.
- * Configurable settings for DIG: 512MB/bank:  16GB/node,
- *                               2048MB/bank:  64GB/node,
- *                               8192MB/bank: 256GB/node.
- */
-#define NR_BANKS_PER_NODE	32
-#if defined(CONFIG_IA64_NODESIZE_16GB)
-# define BANKSHIFT		29
-#elif defined(CONFIG_IA64_NODESIZE_64GB)
-# define BANKSHIFT		31
-#elif defined(CONFIG_IA64_NODESIZE_256GB)
-# define BANKSHIFT		33
 #else
-# error Unsupported bank and nodesize!
+#error unknown platform
 #endif
-#define BANKSIZE		(1UL << BANKSHIFT)
-#define BANK_OFFSET(addr)	((unsigned long)(addr) & (BANKSIZE-1))
-#define NR_BANKS		(NR_BANKS_PER_NODE * NR_NODES)
 
-/*
- * VALID_MEM_KADDR returns a boolean to indicate if a kaddr is
- * potentially a valid cacheable identity mapped RAM memory address.
- * Note that the RAM may or may not actually be present!!
- */
-#define VALID_MEM_KADDR(kaddr)	1
+extern void build_cpu_to_node_map(void);
 
-/*
- * Given a nodeid & a bank number, find the address of the mem_map
- * entry for the first page of the bank.
- */
-#define BANK_MEM_MAP_INDEX(kaddr) \
-	(((unsigned long)(kaddr) & (MAX_PHYS_MEMORY-1)) >> BANKSHIFT)
+#else /* CONFIG_NUMA */
 
-#elif defined(CONFIG_IA64_SGI_SN2)
-/*
- * SGI SN2 discontig definitions
- */
-#define MAX_PHYSNODE_ID	2048	/* 2048 node ids (also called nasid) */
-#define NR_NODES	128	/* Maximum number of nodes in SSI */
-#define MAX_PHYS_MEMORY	(1UL << 49)
-
-#define BANKSHIFT		38
-#define NR_BANKS_PER_NODE	4
-#define SN2_NODE_SIZE		(64UL*1024*1024*1024)	/* 64GB per node */
-#define BANKSIZE		(SN2_NODE_SIZE/NR_BANKS_PER_NODE)
-#define BANK_OFFSET(addr)	((unsigned long)(addr) & (BANKSIZE-1))
-#define NR_BANKS		(NR_BANKS_PER_NODE * NR_NODES)
-#define VALID_MEM_KADDR(kaddr)	1
-
-/*
- * Given a nodeid & a bank number, find the address of the mem_map
- * entry for the first page of the bank.
- */
-#define BANK_MEM_MAP_INDEX(kaddr) \
-	(((unsigned long)(kaddr) & (MAX_PHYS_MEMORY-1)) >> BANKSHIFT)
+#define NR_NODES	1
 
-#endif /* CONFIG_IA64_DIG */
+#endif /* CONFIG_NUMA */
 #endif /* _ASM_IA64_MMZONE_H */
diff -Nru a/include/asm-ia64/nodedata.h b/include/asm-ia64/nodedata.h
--- a/include/asm-ia64/nodedata.h	Thu Jul 17 16:59:05 2003
+++ b/include/asm-ia64/nodedata.h	Thu Jul 17 16:59:05 2003
@@ -13,7 +13,7 @@
 #ifndef _ASM_IA64_NODEDATA_H
 #define _ASM_IA64_NODEDATA_H
 
-
+#include <asm/percpu.h>
 #include <asm/mmzone.h>
 
 /*
@@ -22,15 +22,17 @@
 
 struct pglist_data;
 struct ia64_node_data {
-	short			active_cpu_count;
 	short			node;
+	short			active_cpu_count;
+	/*
+	 * The fields are read-only (after boot). They contain pointers
+	 * to various structures located on other nodes. This data is
+	 * replicated on each node in order to reduce off-node references.
+	 */
         struct pglist_data	*pg_data_ptrs[NR_NODES];
-	struct page		*bank_mem_map_base[NR_BANKS];
 	struct ia64_node_data	*node_data_ptrs[NR_NODES];
-	short			node_id_map[NR_BANKS];
 };
 
-
 /*
  * Return a pointer to the node_data structure for the executing cpu.
  */
@@ -40,7 +42,8 @@
 /*
  * Return a pointer to the node_data structure for the specified node.
  */
-#define node_data(node)	(local_node_data->node_data_ptrs[node])
+#define node_data(node) (local_node_data->node_data_ptrs[node])
+#define NODE_DATA(nid) (local_node_data->pg_data_ptrs[nid])
 
 /*
  * Get a pointer to the node_id/node_data for the current cpu.
@@ -48,29 +51,5 @@
  */
 extern int boot_get_local_nodeid(void);
 extern struct ia64_node_data *get_node_data_ptr(void);
-
-/*
- * Given a node id, return a pointer to the pg_data_t for the node.
- * The following 2 macros are similar. 
- *
- * NODE_DATA 	- should be used in all code not related to system
- *		  initialization. It uses pernode data structures to minimize
- *		  offnode memory references. However, these structure are not 
- *		  present during boot. This macro can be used once cpu_init
- *		  completes.
- *
- * BOOT_NODE_DATA
- *		- should be used during system initialization 
- *		  prior to freeing __initdata. It does not depend on the percpu
- *		  area being present.
- *
- * NOTE:   The names of these macros are misleading but are difficult to change
- *	   since they are used in generic linux & on other architecures.
- */
-#define NODE_DATA(nid)		(local_node_data->pg_data_ptrs[nid])
-#define BOOT_NODE_DATA(nid)	boot_get_pg_data_ptr((long)(nid))
-
-struct pglist_data;
-extern struct pglist_data * __init boot_get_pg_data_ptr(long);
 
 #endif /* _ASM_IA64_NODEDATA_H */
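
The point of the replicated pg_data_ptrs/node_data_ptrs tables is that
NODE_DATA(nid) resolves through the local node's copy and never chases a
pointer on a remote node.  A toy model of the replication step that
initialize_pernode_data() performs; all names and sizes here are
hypothetical stand-ins.

#include <stdio.h>

#define NR_NODES 4

struct pg_data { int node; };			/* stands in for pg_data_t */
struct node_data {				/* stands in for ia64_node_data */
	struct pg_data *pg_data_ptrs[NR_NODES];
};

static struct pg_data   pgdats[NR_NODES];
static struct node_data replicas[NR_NODES];	/* one copy per node */

int main(void)
{
	int n;

	/* Node 0 fills in the master copy... */
	for (n = 0; n < NR_NODES; n++) {
		pgdats[n].node = n;
		replicas[0].pg_data_ptrs[n] = &pgdats[n];
	}
	/* ...which is then copied to every other node, as
	 * initialize_pernode_data() does in the patch. */
	for (n = 1; n < NR_NODES; n++)
		replicas[n] = replicas[0];

	/* A cpu on node 2 resolves NODE_DATA(3) from its local replica. */
	printf("node %d\n", replicas[2].pg_data_ptrs[3]->node);
	return 0;
}
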
diff -Nru a/include/asm-ia64/numa.h b/include/asm-ia64/numa.h
--- a/include/asm-ia64/numa.h	Thu Jul 17 16:59:05 2003
+++ b/include/asm-ia64/numa.h	Thu Jul 17 16:59:05 2003
@@ -15,13 +15,21 @@
 
 #ifdef CONFIG_DISCONTIGMEM
 # include <asm/mmzone.h>
-# define NR_MEMBLKS   (NR_BANKS)
 #else
 # define NR_NODES     (8)
 # define NR_MEMBLKS   (NR_NODES * 8)
 #endif
 
 #include <linux/cache.h>
+#include <linux/threads.h>
+#include <linux/smp.h>
+
+#define NODEMASK_WORDCOUNT       ((NR_NODES+(BITS_PER_LONG-1))/BITS_PER_LONG)
+
+#define NODE_MASK_NONE   { [0 ... ((NR_NODES+BITS_PER_LONG-1)/BITS_PER_LONG)-1] = 0 }
+
+typedef unsigned long   nodemask_t[NODEMASK_WORDCOUNT];
 extern volatile char cpu_to_node_map[NR_CPUS] __cacheline_aligned;
 extern volatile unsigned long node_to_cpu_mask[NR_NODES] __cacheline_aligned;
 
@@ -63,6 +71,12 @@
 extern int paddr_to_nid(unsigned long paddr);
 
 #define local_nodeid (cpu_to_node_map[smp_processor_id()])
+
+#else /* !CONFIG_NUMA */
+
+#define node_distance(from,to) 10
+#define paddr_to_nid(x) 0
+#define local_nodeid 0
 
 #endif /* CONFIG_NUMA */
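
The NODEMASK_WORDCOUNT arithmetic is just bits-to-words rounding, one bit
per node rounded up to whole longs.  A standalone check, assuming nothing
beyond the expression in the header:

#include <stdio.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* Same rounding as NODEMASK_WORDCOUNT in the patch. */
#define NODEMASK_WORDCOUNT(nr_nodes) \
	(((nr_nodes) + (BITS_PER_LONG - 1)) / BITS_PER_LONG)

int main(void)
{
	printf("256 nodes -> %zu words\n", NODEMASK_WORDCOUNT(256));	/* 4 on 64-bit */
	printf("8 nodes   -> %zu words\n", NODEMASK_WORDCOUNT(8));	/* 1 */
	return 0;
}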
 
diff -Nru a/include/asm-ia64/page.h b/include/asm-ia64/page.h
--- a/include/asm-ia64/page.h	Thu Jul 17 16:59:04 2003
+++ b/include/asm-ia64/page.h	Thu Jul 17 16:59:04 2003
@@ -93,18 +93,26 @@
 
 #define virt_addr_valid(kaddr)	pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
 
-#ifndef CONFIG_DISCONTIGMEM
-# ifdef CONFIG_VIRTUAL_MEM_MAP
-   extern int ia64_pfn_valid (unsigned long pfn);
-#  define pfn_valid(pfn)	(((pfn) < max_mapnr) && ia64_pfn_valid(pfn))
-# else
-#  define pfn_valid(pfn)	((pfn) < max_mapnr)
-# endif
-#define virt_to_page(kaddr)	pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
-#define page_to_pfn(page)	((unsigned long) (page - mem_map))
-#define pfn_to_page(pfn)	(mem_map + (pfn))
-#define page_to_phys(page)	(page_to_pfn(page) << PAGE_SHIFT)
+#ifdef CONFIG_VIRTUAL_MEM_MAP
+extern int ia64_pfn_valid(unsigned long pfn);
+#else
+#define ia64_pfn_valid(pfn) (1)
+#endif
+
+extern unsigned long max_low_pfn;
+#define pfn_valid(pfn) (((pfn) < max_low_pfn) && ia64_pfn_valid(pfn))
+
+#if defined(CONFIG_VIRTUAL_MEM_MAP) && !defined(CONFIG_DISCONTIGMEM)
+#define vmem_map mem_map
+#else
+extern struct page *vmem_map;
 #endif
+
+#define pfn_to_page(pfn)	(vmem_map + (pfn))
+#define page_to_pfn(page)	((unsigned long) (page - vmem_map))
+
+#define virt_to_page(kaddr)	(pfn_to_page(__pa(kaddr) >> PAGE_SHIFT))
+#define page_to_phys(page)	(page_to_pfn(page) << PAGE_SHIFT)
 
 typedef union ia64_va {
 	struct {
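
With the page.h change above, both configurations share one
pfn_to_page()/page_to_pfn() pair; the contiguous case simply aliases
vmem_map to mem_map.  A user-space model of the round trip, where a plain
array stands in for the real map:

#include <stdio.h>

struct page { unsigned long flags; };

/* The array models mem_map/vmem_map; pointer arithmetic does the rest. */
static struct page fake_map[16];
static struct page *vmem_map = fake_map;

#define pfn_to_page(pfn)	(vmem_map + (pfn))
#define page_to_pfn(page)	((unsigned long)((page) - vmem_map))

int main(void)
{
	struct page *p = pfn_to_page(5);
	printf("round trip: %lu\n", page_to_pfn(p));	/* prints 5 */
	return 0;
}
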
diff -Nru a/include/asm-ia64/pgtable.h b/include/asm-ia64/pgtable.h
--- a/include/asm-ia64/pgtable.h	Thu Jul 17 16:59:04 2003
+++ b/include/asm-ia64/pgtable.h	Thu Jul 17 16:59:05 2003
@@ -174,7 +174,6 @@
 	return (addr & (local_cpu_data->unimpl_pa_mask)) == 0;
 }
 
-#ifndef CONFIG_DISCONTIGMEM
 /*
  * kern_addr_valid(ADDR) tests if ADDR is pointing to valid kernel
 * memory.  For the return value to be meaningful, ADDR must be >=
@@ -190,7 +189,6 @@
  */
 #define kern_addr_valid(addr)	(1)
 
-#endif
 
 /*
  * Now come the defines and routines to manage and access the three-level
@@ -241,10 +239,8 @@
 #define pte_none(pte) 			(!pte_val(pte))
 #define pte_present(pte)		(pte_val(pte) & (_PAGE_P | _PAGE_PROTNONE))
 #define pte_clear(pte)			(pte_val(*(pte)) = 0UL)
-#ifndef CONFIG_DISCONTIGMEM
 /* pte_page() returns the "struct page *" corresponding to the PTE: */
 #define pte_page(pte)			virt_to_page(((pte_val(pte) & _PFN_MASK) + PAGE_OFFSET))
-#endif
 
 #define pmd_none(pmd)			(!pmd_val(pmd))
 #define pmd_bad(pmd)			(!ia64_phys_addr_valid(pmd_val(pmd)))
@@ -416,6 +412,7 @@
 
 extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
 extern void paging_init (void);
+extern int filter_rsvd_memory(unsigned long start, unsigned long end, void *arg);
 
 /*
  * Note: The macros below rely on the fact that MAX_SWAPFILES_SHIFT <= number of
diff -Nru a/include/asm-ia64/processor.h b/include/asm-ia64/processor.h
--- a/include/asm-ia64/processor.h	Thu Jul 17 16:59:05 2003
+++ b/include/asm-ia64/processor.h	Thu Jul 17 16:59:05 2003
@@ -185,6 +185,8 @@
 #endif
 #ifdef CONFIG_NUMA
 	struct ia64_node_data *node_data;
+	struct cpuinfo_ia64 *cpu_data[NR_CPUS];
+	int nodeid;
 #endif
 };
 
diff -Nru a/mm/bootmem.c b/mm/bootmem.c
--- a/mm/bootmem.c	Thu Jul 17 16:59:05 2003
+++ b/mm/bootmem.c	Thu Jul 17 16:59:05 2003
@@ -48,8 +48,24 @@
 	bootmem_data_t *bdata = pgdat->bdata;
 	unsigned long mapsize = ((end - start)+7)/8;
 
-	pgdat->pgdat_next = pgdat_list;
-	pgdat_list = pgdat;
+
+	/*
+	 * sort pgdat_list so that the lowest one comes first,
+	 * which makes alloc_bootmem_low_pages work as desired.
+	 */
+	if (!pgdat_list || pgdat_list->node_start_pfn > pgdat->node_start_pfn) {
+		pgdat->pgdat_next = pgdat_list;
+		pgdat_list = pgdat;
+	} else {
+		pg_data_t *tmp = pgdat_list;
+		while (tmp->pgdat_next) {
+			if (tmp->pgdat_next->node_start_pfn > pgdat->node_start_pfn)
+				break;
+			tmp = tmp->pgdat_next;
+		}
+		pgdat->pgdat_next = tmp->pgdat_next;
+		tmp->pgdat_next = pgdat;
+	}
 
 	mapsize = (mapsize + (sizeof(long) - 1UL)) & ~(sizeof(long) - 1UL);
 	bdata->node_bootmem_map = phys_to_virt(mapstart << PAGE_SHIFT);
@@ -251,7 +267,7 @@
 
 static unsigned long __init free_all_bootmem_core(pg_data_t *pgdat)
 {
-	struct page *page = pgdat->node_mem_map;
+	struct page *page;
 	bootmem_data_t *bdata = pgdat->bdata;
 	unsigned long i, count, total = 0;
 	unsigned long idx;
@@ -260,23 +276,23 @@
 	if (!bdata->node_bootmem_map) BUG();
 
 	count = 0;
+	page = virt_to_page(phys_to_virt(bdata->node_boot_start));
 	idx = bdata->node_low_pfn - (bdata->node_boot_start >> PAGE_SHIFT);
 	map = bdata->node_bootmem_map;
 	for (i = 0; i < idx; ) {
 		unsigned long v = ~map[i / BITS_PER_LONG];
 		if (v) {
 			unsigned long m;
-			for (m = 1; m && i < idx; m<<=1, page++, i++) {
+			for (m = 1; m && i < idx; m<<=1, i++) {
 				if (v & m) {
 					count++;
-					ClearPageReserved(page);
-					set_page_count(page, 1);
-					__free_page(page);
+					ClearPageReserved(page+i);
+					set_page_count(page+i, 1);
+					__free_page(page+i);
 				}
 			}
 		} else {
 			i+=BITS_PER_LONG;
-			page += BITS_PER_LONG;
 		}
 	}
 	total += count;
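
The mm/bootmem.c hunk keeps pgdat_list sorted by node_start_pfn so that
alloc_bootmem_low_pages() tries the lowest memory first.  Here is a
standalone model of that sorted insert; the struct is a stripped-down
stand-in for pg_data_t.

#include <stdio.h>
#include <stdlib.h>

struct pgdat {
	unsigned long node_start_pfn;
	struct pgdat *pgdat_next;
};

static struct pgdat *pgdat_list;

/* Sorted insert, same shape as the patched init_bootmem_core(). */
static void insert_sorted(struct pgdat *pgdat)
{
	if (!pgdat_list || pgdat_list->node_start_pfn > pgdat->node_start_pfn) {
		/* new lowest node: put it at the head */
		pgdat->pgdat_next = pgdat_list;
		pgdat_list = pgdat;
	} else {
		/* walk until the next entry starts above us */
		struct pgdat *tmp = pgdat_list;
		while (tmp->pgdat_next) {
			if (tmp->pgdat_next->node_start_pfn > pgdat->node_start_pfn)
				break;
			tmp = tmp->pgdat_next;
		}
		pgdat->pgdat_next = tmp->pgdat_next;
		tmp->pgdat_next = pgdat;
	}
}

int main(void)
{
	unsigned long starts[] = { 0x400, 0x100, 0x300 };
	struct pgdat *p;
	int i;

	for (i = 0; i < 3; i++) {
		p = calloc(1, sizeof(*p));
		p->node_start_pfn = starts[i];
		insert_sorted(p);
	}
	for (p = pgdat_list; p; p = p->pgdat_next)
		printf("%#lx\n", p->node_start_pfn);	/* 0x100 0x300 0x400 */
	return 0;
}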


* Re: [Discontig-devel] [PATCH] another discontig patch
  2003-06-21  9:06 [Discontig-devel] [PATCH] another discontig patch Christoph Hellwig
                   ` (11 preceding siblings ...)
  2003-07-18  0:16 ` Jesse Barnes
@ 2003-07-18 16:15 ` Erich Focht
  2003-07-21 18:46 ` Takayoshi Kochi
  13 siblings, 0 replies; 15+ messages in thread
From: Erich Focht @ 2003-07-18 16:15 UTC (permalink / raw)
  To: linux-ia64

Looks good, applies cleanly on top of 2.6.0-test1 + ia64 but doesn't
boot on TX7 :-( Dies right after the elilo output lines... Maybe Tak
(CC'd) finds some time to have a look?

Regards,
Erich


On Friday 18 July 2003 02:16, Jesse Barnes wrote:
> On Thu, Jul 17, 2003 at 10:23:37AM +0200, Erich Focht wrote:
> > Maybe that should better be
> >      bool "Discontiguous memory support" if (IA64_DIG && NUMA)
> >
> > Discontigmem alone doesn't make sense, yet, I think...
>
> Not for ia64 I guess, unless we make zx1 use discontig as well.  Patch
> appended.
>
> Thanks,
> Jesse




* Re: [Discontig-devel] [PATCH] another discontig patch
  2003-06-21  9:06 [Discontig-devel] [PATCH] another discontig patch Christoph Hellwig
                   ` (12 preceding siblings ...)
  2003-07-18 16:15 ` Erich Focht
@ 2003-07-21 18:46 ` Takayoshi Kochi
  13 siblings, 0 replies; 15+ messages in thread
From: Takayoshi Kochi @ 2003-07-21 18:46 UTC (permalink / raw)
  To: linux-ia64

From: Erich Focht <efocht@hpce.nec.com>
Subject: Re: [Discontig-devel] [PATCH] another discontig patch
Date: Fri, 18 Jul 2003 18:15:44 +0200
Message-ID: <200307181815.44593.efocht@hpce.nec.com>

> Looks good, applies cleanly on top of 2.6.0-test1 + ia64 but doesn't
> boot on TX7 :-( Dies right after the elilo output lines... Maybe Tak
> (CC'd) finds some time to have a look?

I'll be attending OLS this week, so I don't think I'll have time to test it.
I'll look into it next week.

---
1st Computer Software Division, NEC Corporation
Takayoshi Kochi <kochi@hpc.bs1.fc.nec.co.jp/t-kochi@bq.jp.nec.com>

