linux-kernel.vger.kernel.org archive mirror
* [PATCH] Fix sparsemem on Cell
@ 2006-12-15 16:53 Dave Hansen
  2006-12-15 17:11 ` Andy Whitcroft
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Dave Hansen @ 2006-12-15 16:53 UTC (permalink / raw)
  To: cbe-oss-dev
  Cc: linuxppc-dev, linux-mm, apw, mkravetz, hch, jk, linux-kernel,
	akpm, paulus, benh, gone, Dave Hansen


I think the comments added say it pretty well, but I'll repeat it here.

This fix is pretty similar in concept to the one that Arnd posted
as a temporary workaround, but I've added a few comments explaining
what the actual assumptions are, and improved it a wee little bit.

The end goal here is to simply avoid calling the early_*() functions
when it is _not_ early.  Those functions stop working as soon as
free_initmem() is called.  system_state is set to SYSTEM_RUNNING
just after free_initmem() is called, so it seems appropriate to use
here.

I did think twice about actually using SYSTEM_RUNNING because we
moved away from it in other parts of memory hotplug, but those
were actually for _allocations_ in favor of slab_is_available(),
and we really don't care about the slab here.

The only other assumption is that all memory-hotplug-time pages 
given to memmap_init_zone() are valid and able to be onlined into
any zone after the system is running.  The "valid" part is
really just a question of whether or not a 'struct page' is there
for the pfn, and *not* whether there is actual memory.  Since
all sparsemem sections have contiguous mem_map[]s within them,
and we only memory hotplug entire sparsemem sections, we can
be confident that this assumption will hold.

As for the memory being in the right node, we'll assume that
memory hotplug is putting things in the right node.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>

---

 lxc-dave/init/main.c     |    4 ++++
 lxc-dave/mm/page_alloc.c |   28 +++++++++++++++++++++++++---
 2 files changed, 29 insertions(+), 3 deletions(-)

diff -puN init/main.c~sparsemem-fix init/main.c
--- lxc/init/main.c~sparsemem-fix	2006-12-15 08:49:53.000000000 -0800
+++ lxc-dave/init/main.c	2006-12-15 08:49:53.000000000 -0800
@@ -770,6 +770,10 @@ static int init(void * unused)
 	free_initmem();
 	unlock_kernel();
 	mark_rodata_ro();
+	/*
+	 * Memory hotplug requires that this system_state transition
+	 * happer after free_initmem().  (see memmap_init_zone())
+	 */
 	system_state = SYSTEM_RUNNING;
 	numa_default_policy();
 
diff -puN mm/page_alloc.c~sparsemem-fix mm/page_alloc.c
--- lxc/mm/page_alloc.c~sparsemem-fix	2006-12-15 08:49:53.000000000 -0800
+++ lxc-dave/mm/page_alloc.c	2006-12-15 08:49:53.000000000 -0800
@@ -2056,6 +2056,30 @@ static inline unsigned long wait_table_b
 
 #define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
 
+static int can_online_pfn_into_nid(unsigned long pfn, int nid)
+{
+	/*
+	 * There are two things that make this work:
+	 * 1. The early_pfn...() functions are __init and
+	 *    use __initdata.  If the system is < SYSTEM_RUNNING,
+	 *    those functions and their data will still exist.
+	 * 2. We also assume that all actual memory hotplug
+	 *    (as opposed to boot-time) calls to this are only
+	 *    for contiguous memory regions.  With sparsemem,
+	 *    this guarantee is easy because all sections are
+	 *    contiguous and we never online more than one
+	 *    section at a time.  Boot-time memory can have holes
+	 *    anywhere.
+	 */
+	if (system_state >= SYSTEM_RUNNING)
+		return 1;
+	if (!early_pfn_valid(pfn))
+		return 0;
+	if (!early_pfn_in_nid(pfn, nid))
+		return 0;
+	return 1;
+}
+
 /*
  * Initially all pages are reserved - free ones are freed
  * up by free_all_bootmem() once the early boot process is
@@ -2069,9 +2093,7 @@ void __meminit memmap_init_zone(unsigned
 	unsigned long pfn;
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
-		if (!early_pfn_valid(pfn))
-			continue;
-		if (!early_pfn_in_nid(pfn, nid))
+		if (!can_online_pfn_into_nid(pfn))
 			continue;
 		page = pfn_to_page(pfn);
 		set_page_links(page, zone, nid, pfn);
_

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Fix sparsemem on Cell
  2006-12-15 16:53 [PATCH] Fix sparsemem on Cell Dave Hansen
@ 2006-12-15 17:11 ` Andy Whitcroft
  2006-12-15 17:24   ` Dave Hansen
  2006-12-15 17:22 ` [PATCH] Fix sparsemem on Cell Michael Buesch
  2006-12-15 20:21 ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 22+ messages in thread
From: Andy Whitcroft @ 2006-12-15 17:11 UTC (permalink / raw)
  To: Dave Hansen
  Cc: cbe-oss-dev, linuxppc-dev, linux-mm, mkravetz, hch, jk,
	linux-kernel, akpm, paulus, benh, gone

Dave Hansen wrote:
> I think the comments added say it pretty well, but I'll repeat it here.
> 
> This fix is pretty similar in concept to the one that Arnd posted
> as a temporary workaround, but I've added a few comments explaining
> what the actual assumptions are, and improved it a wee little bit.
> 
> The end goal here is to simply avoid calling the early_*() functions
> when it is _not_ early.  Those functions stop working as soon as
> free_initmem() is called.  system_state is set to SYSTEM_RUNNING
> just after free_initmem() is called, so it seems appropriate to use
> here.
> 
> I did think twice about actually using SYSTEM_RUNNING because we
> moved away from it in other parts of memory hotplug, but those
> were actually for _allocations_ in favor of slab_is_available(),
> and we really don't care about the slab here.
> 
> The only other assumption is that all memory-hotplug-time pages 
> given to memmap_init_zone() are valid and able to be onlined into
> any any zone after the system is running.  The "valid" part is
> really just a question of whether or not a 'struct page' is there
> for the pfn, and *not* whether there is actual memory.  Since
> all sparsemem sections have contiguous mem_map[]s within them,
> and we only memory hotplug entire sparsemem sections, we can
> be confident that this assumption will hold.

i.e. that if hotplug is pushing this memory into a zone on a node it
really does know what it's doing, and it's putting it in the right place.
Obviously that code needs to be 'overlap' aware, but that's OK for this
interface.

> 
> As for the memory being in the right node, we'll assume tha
> memory hotplug is putting things in the right node.
> 
> Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
> 
> ---
> 
>  lxc-dave/init/main.c     |    4 ++++
>  lxc-dave/mm/page_alloc.c |   28 +++++++++++++++++++++++++---
>  2 files changed, 29 insertions(+), 3 deletions(-)
> 
> diff -puN init/main.c~sparsemem-fix init/main.c
> --- lxc/init/main.c~sparsemem-fix	2006-12-15 08:49:53.000000000 -0800
> +++ lxc-dave/init/main.c	2006-12-15 08:49:53.000000000 -0800
> @@ -770,6 +770,10 @@ static int init(void * unused)
>  	free_initmem();
>  	unlock_kernel();
>  	mark_rodata_ro();
> +	/*
> +	 * Memory hotplug requires that this system_state transition
> +	 * happer after free_initmem().  (see memmap_init_zone())
> +	 */

typo: happen

>  	system_state = SYSTEM_RUNNING;
>  	numa_default_policy();
>  
> diff -puN mm/page_alloc.c~sparsemem-fix mm/page_alloc.c
> --- lxc/mm/page_alloc.c~sparsemem-fix	2006-12-15 08:49:53.000000000 -0800
> +++ lxc-dave/mm/page_alloc.c	2006-12-15 08:49:53.000000000 -0800
> @@ -2056,6 +2056,30 @@ static inline unsigned long wait_table_b
>  
>  #define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
>  
> +static int can_online_pfn_into_nid(unsigned long pfn, int nid)
> +{

__meminit ?

> +	/*
> +	 * There are two things that make this work:
> +	 * 1. The early_pfn...() functions are __init and
> +	 *    use __initdata.  If the system is < SYSTEM_RUNNING,
> +	 *    those functions and their data will still exist.
> +	 * 2. We also assume that all actual memory hotplug
> +	 *    (as opposed to boot-time) calls to this are only
> +	 *    for contiguous memory regions.  With sparsemem,
> +	 *    this guaranteed is easy because all sections are
> +	 *    contiguous and we never online more than one
> +	 *    section at a time.  Boot-time memory can have holes
> +	 *    anywhere.
> +	 */
> +	if (system_state >= SYSTEM_RUNNING)
> +		return 1;

Is there any value in codifying the assumption here, as a safety net?

if (system_state >= SYSTEM_RUNNING)
#ifdef CONFIG_SPARSEMEM
	return 1;
#else
	return 0;
#endif

> +	if (!early_pfn_valid(pfn))
> +		return 0;
> +	if (!early_pfn_in_nid(pfn, nid))
> +		return 0;
> +	return 1;
> +}
> +
>  /*
>   * Initially all pages are reserved - free ones are freed
>   * up by free_all_bootmem() once the early boot process is
> @@ -2069,9 +2093,7 @@ void __meminit memmap_init_zone(unsigned
>  	unsigned long pfn;
>  
>  	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> -		if (!early_pfn_valid(pfn))
> -			continue;
> -		if (!early_pfn_in_nid(pfn, nid))
> +		if (!can_online_pfn_into_nid(pfn))

We're not passing nid here?

>  			continue;
>  		page = pfn_to_page(pfn);
>  		set_page_links(page, zone, nid, pfn);
> _

Looks like this would do the job to me.

-apw


* Re: [PATCH] Fix sparsemem on Cell
  2006-12-15 16:53 [PATCH] Fix sparsemem on Cell Dave Hansen
  2006-12-15 17:11 ` Andy Whitcroft
@ 2006-12-15 17:22 ` Michael Buesch
  2006-12-15 17:57   ` Dave Hansen
  2006-12-15 20:21 ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 22+ messages in thread
From: Michael Buesch @ 2006-12-15 17:22 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linuxppc-dev, linux-mm, apw, mkravetz, hch, jk, linux-kernel,
	akpm, paulus, benh, gone, Dave Hansen, cbe-oss-dev

On Friday 15 December 2006 17:53, Dave Hansen wrote:
> 
> I think the comments added say it pretty well, but I'll repeat it here.
> 
> This fix is pretty similar in concept to the one that Arnd posted
> as a temporary workaround, but I've added a few comments explaining
> what the actual assumptions are, and improved it a wee little bit.
> 
> The end goal here is to simply avoid calling the early_*() functions
> when it is _not_ early.  Those functions stop working as soon as
> free_initmem() is called.  system_state is set to SYSTEM_RUNNING
> just after free_initmem() is called, so it seems appropriate to use
> here.
> 
> I did think twice about actually using SYSTEM_RUNNING because we
> moved away from it in other parts of memory hotplug, but those
> were actually for _allocations_ in favor of slab_is_available(),
> and we really don't care about the slab here.
> 
> The only other assumption is that all memory-hotplug-time pages 
> given to memmap_init_zone() are valid and able to be onlined into
> any any zone after the system is running.  The "valid" part is
> really just a question of whether or not a 'struct page' is there
> for the pfn, and *not* whether there is actual memory.  Since
> all sparsemem sections have contiguous mem_map[]s within them,
> and we only memory hotplug entire sparsemem sections, we can
> be confident that this assumption will hold.
> 
> As for the memory being in the right node, we'll assume tha
> memory hotplug is putting things in the right node.
> 
> Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
> 
> ---
> 
>  lxc-dave/init/main.c     |    4 ++++
>  lxc-dave/mm/page_alloc.c |   28 +++++++++++++++++++++++++---
>  2 files changed, 29 insertions(+), 3 deletions(-)
> 
> diff -puN init/main.c~sparsemem-fix init/main.c
> --- lxc/init/main.c~sparsemem-fix	2006-12-15 08:49:53.000000000 -0800
> +++ lxc-dave/init/main.c	2006-12-15 08:49:53.000000000 -0800
> @@ -770,6 +770,10 @@ static int init(void * unused)
>  	free_initmem();
>  	unlock_kernel();
>  	mark_rodata_ro();
> +	/*
> +	 * Memory hotplug requires that this system_state transition
> +	 * happer after free_initmem().  (see memmap_init_zone())

s/happer/happens/

Other than that, can't this possibly race and crash here?
I mean, it's not a big race window, but it can happen, no?

> +	 */
>  	system_state = SYSTEM_RUNNING;
>  	numa_default_policy();
>  
> diff -puN mm/page_alloc.c~sparsemem-fix mm/page_alloc.c
> --- lxc/mm/page_alloc.c~sparsemem-fix	2006-12-15 08:49:53.000000000 -0800
> +++ lxc-dave/mm/page_alloc.c	2006-12-15 08:49:53.000000000 -0800
> @@ -2056,6 +2056,30 @@ static inline unsigned long wait_table_b
>  
>  #define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
>  
> +static int can_online_pfn_into_nid(unsigned long pfn, int nid)
> +{
> +	/*
> +	 * There are two things that make this work:
> +	 * 1. The early_pfn...() functions are __init and
> +	 *    use __initdata.  If the system is < SYSTEM_RUNNING,
> +	 *    those functions and their data will still exist.
> +	 * 2. We also assume that all actual memory hotplug
> +	 *    (as opposed to boot-time) calls to this are only
> +	 *    for contiguous memory regions.  With sparsemem,
> +	 *    this guaranteed is easy because all sections are
> +	 *    contiguous and we never online more than one
> +	 *    section at a time.  Boot-time memory can have holes
> +	 *    anywhere.
> +	 */
> +	if (system_state >= SYSTEM_RUNNING)
> +		return 1;
> +	if (!early_pfn_valid(pfn))
> +		return 0;
> +	if (!early_pfn_in_nid(pfn, nid))
> +		return 0;
> +	return 1;
> +}
> +
>  /*
>   * Initially all pages are reserved - free ones are freed
>   * up by free_all_bootmem() once the early boot process is
> @@ -2069,9 +2093,7 @@ void __meminit memmap_init_zone(unsigned
>  	unsigned long pfn;
>  
>  	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> -		if (!early_pfn_valid(pfn))
> -			continue;
> -		if (!early_pfn_in_nid(pfn, nid))
> +		if (!can_online_pfn_into_nid(pfn))
>  			continue;
>  		page = pfn_to_page(pfn);
>  		set_page_links(page, zone, nid, pfn);
> _

-- 
Greetings Michael.


* Re: [PATCH] Fix sparsemem on Cell
  2006-12-15 17:11 ` Andy Whitcroft
@ 2006-12-15 17:24   ` Dave Hansen
  2006-12-15 19:45     ` Andrew Morton
  0 siblings, 1 reply; 22+ messages in thread
From: Dave Hansen @ 2006-12-15 17:24 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: cbe-oss-dev, linuxppc-dev, linux-mm, mkravetz, hch, jk,
	linux-kernel, akpm, paulus, benh, gone, Keith Mannthey

On Fri, 2006-12-15 at 17:11 +0000, Andy Whitcroft wrote:
> ie. that if hotplug is pushing this memory into a zone on a node it
> really does know what its doing, and its putting it in the right place.
>   Obviously that code needs to be 'overlap' aware but thats ok for this
> interface.

I'm not sure about this, especially if we're only doing one section at a
time.

> >  	system_state = SYSTEM_RUNNING;
> >  	numa_default_policy();
> >
> > diff -puN mm/page_alloc.c~sparsemem-fix mm/page_alloc.c
> > --- lxc/mm/page_alloc.c~sparsemem-fix	2006-12-15 08:49:53.000000000 -0800
> > +++ lxc-dave/mm/page_alloc.c	2006-12-15 08:49:53.000000000 -0800
> > @@ -2056,6 +2056,30 @@ static inline unsigned long wait_table_b
> >
> >  #define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
> >
> > +static int can_online_pfn_into_nid(unsigned long pfn, int nid)
> > +{
> 
> __meminit ?

Yup.  I'll get that.

> > +	/*
> > +	 * There are two things that make this work:
> > +	 * 1. The early_pfn...() functions are __init and
> > +	 *    use __initdata.  If the system is < SYSTEM_RUNNING,
> > +	 *    those functions and their data will still exist.
> > +	 * 2. We also assume that all actual memory hotplug
> > +	 *    (as opposed to boot-time) calls to this are only
> > +	 *    for contiguous memory regions.  With sparsemem,
> > +	 *    this guaranteed is easy because all sections are
> > +	 *    contiguous and we never online more than one
> > +	 *    section at a time.  Boot-time memory can have holes
> > +	 *    anywhere.
> > +	 */
> > +	if (system_state >= SYSTEM_RUNNING)
> > +		return 1;
> 
> Is there any value in codifying the assumption here, as a safety net?
> 
> if (system_state >= SYSTEM_RUNNING)
> #ifdef CONFIG_SPARSEMEM
> 	return 1;
> #else
> 	return 0;
> #endif

Dunno.  The normal case is that it isn't even called without memory
hotplug.  The only non-sparsemem memory hotplug is Keith's baby, and
they lay out all of their mem_map at boot-time anyway, so I don't think
this gets used.

Keith, do you know whether you even do memmap_init_zone() at runtime,
and if you ever have holes if/when you do?

> > +	if (!early_pfn_valid(pfn))
> > +		return 0;
> > +	if (!early_pfn_in_nid(pfn, nid))
> > +		return 0;
> > +	return 1;
> > +}
> > +
> >  /*
> >   * Initially all pages are reserved - free ones are freed
> >   * up by free_all_bootmem() once the early boot process is
> > @@ -2069,9 +2093,7 @@ void __meminit memmap_init_zone(unsigned
> >  	unsigned long pfn;
> >
> >  	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> > -		if (!early_pfn_valid(pfn))
> > -			continue;
> > -		if (!early_pfn_in_nid(pfn, nid))
> > +		if (!can_online_pfn_into_nid(pfn))
> 
> We're not passing nid here?

No, because my ppc64 cross compiler has magically broken.  :(

Fixed patch appended.

-- Dave


I think the comments added say it pretty well, but I'll repeat it here.

This fix is pretty similar in concept to the one that Arnd posted
as a temporary workaround, but I've added a few comments explaining
what the actual assumptions are, and improved it a wee little bit.

The end goal here is to simply avoid calling the early_*() functions
when it is _not_ early.  Those functions stop working as soon as
free_initmem() is called.  system_state is set to SYSTEM_RUNNING
just after free_initmem() is called, so it seems appropriate to use
here.

I did think twice about actually using SYSTEM_RUNNING because we
moved away from it in other parts of memory hotplug, but those
were actually for _allocations_ in favor of slab_is_available(),
and we really don't care about the slab here.

The only other assumption is that all memory-hotplug-time pages 
given to memmap_init_zone() are valid and able to be onlined into
any zone after the system is running.  The "valid" part is
really just a question of whether or not a 'struct page' is there
for the pfn, and *not* whether there is actual memory.  Since
all sparsemem sections have contiguous mem_map[]s within them,
and we only memory hotplug entire sparsemem sections, we can
be confident that this assumption will hold.

As for the memory being in the right node, we'll assume that
memory hotplug is putting things in the right node.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>

---

 lxc-dave/init/main.c     |    4 ++++
 lxc-dave/mm/page_alloc.c |   28 +++++++++++++++++++++++++---
 2 files changed, 29 insertions(+), 3 deletions(-)

diff -puN init/main.c~sparsemem-fix init/main.c
--- lxc/init/main.c~sparsemem-fix	2006-12-15 08:49:53.000000000 -0800
+++ lxc-dave/init/main.c	2006-12-15 08:49:53.000000000 -0800
@@ -770,6 +770,10 @@ static int init(void * unused)
 	free_initmem();
 	unlock_kernel();
 	mark_rodata_ro();
+	/*
+	 * Memory hotplug requires that this system_state transition
+	 * happen after free_initmem().  (see memmap_init_zone())
+	 */
 	system_state = SYSTEM_RUNNING;
 	numa_default_policy();
 
diff -puN mm/page_alloc.c~sparsemem-fix mm/page_alloc.c
--- lxc/mm/page_alloc.c~sparsemem-fix	2006-12-15 08:49:53.000000000 -0800
+++ lxc-dave/mm/page_alloc.c	2006-12-15 09:15:00.000000000 -0800
@@ -2056,6 +2056,30 @@ static inline unsigned long wait_table_b
 
 #define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
 
+static int __meminit can_online_pfn_into_nid(unsigned long pfn, int nid)
+{
+	/*
+	 * There are two things that make this work:
+	 * 1. The early_pfn...() functions are __init and
+	 *    use __initdata.  If the system is < SYSTEM_RUNNING,
+	 *    those functions and their data will still exist.
+	 * 2. We also assume that all actual memory hotplug
+	 *    (as opposed to boot-time) calls to this are only
+	 *    for contiguous memory regions.  With sparsemem,
+	 *    this guarantee is easy because all sections are
+	 *    contiguous and we never online more than one
+	 *    section at a time.  Boot-time memory can have holes
+	 *    anywhere.
+	 */
+	if (system_state >= SYSTEM_RUNNING)
+		return 1;
+	if (!early_pfn_valid(pfn))
+		return 0;
+	if (!early_pfn_in_nid(pfn, nid))
+		return 0;
+	return 1;
+}
+
 /*
  * Initially all pages are reserved - free ones are freed
  * up by free_all_bootmem() once the early boot process is
@@ -2069,9 +2093,7 @@ void __meminit memmap_init_zone(unsigned
 	unsigned long pfn;
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
-		if (!early_pfn_valid(pfn))
-			continue;
-		if (!early_pfn_in_nid(pfn, nid))
+		if (!can_online_pfn_into_nid(pfn, nid))
 			continue;
 		page = pfn_to_page(pfn);
 		set_page_links(page, zone, nid, pfn);
_




* Re: [PATCH] Fix sparsemem on Cell
  2006-12-15 17:22 ` [PATCH] Fix sparsemem on Cell Michael Buesch
@ 2006-12-15 17:57   ` Dave Hansen
  0 siblings, 0 replies; 22+ messages in thread
From: Dave Hansen @ 2006-12-15 17:57 UTC (permalink / raw)
  To: Michael Buesch
  Cc: linuxppc-dev, linux-mm, apw, mkravetz, hch, jk, linux-kernel,
	akpm, paulus, benh, gone, cbe-oss-dev, naveen.b.s

On Fri, 2006-12-15 at 18:22 +0100, Michael Buesch wrote:
> On Friday 15 December 2006 17:53, Dave Hansen wrote:
> >  lxc-dave/init/main.c     |    4 ++++
> >  lxc-dave/mm/page_alloc.c |   28 +++++++++++++++++++++++++---
> >  2 files changed, 29 insertions(+), 3 deletions(-)
> >
> > diff -puN init/main.c~sparsemem-fix init/main.c
> > --- lxc/init/main.c~sparsemem-fix	2006-12-15 08:49:53.000000000 -0800
> > +++ lxc-dave/init/main.c	2006-12-15 08:49:53.000000000 -0800
> > @@ -770,6 +770,10 @@ static int init(void * unused)
> >  	free_initmem();
> >  	unlock_kernel();
> >  	mark_rodata_ro();
> > +	/*
> > +	 * Memory hotplug requires that this system_state transition
> > +	 * happer after free_initmem().  (see memmap_init_zone())
> 
> s/happer/happens/
> 
> Other than that, can't this possibly race and crash here?
> I mean, it's not a big race window, but it can happen, no?

That's a good point.  Nice eye.

There are three routes in here: boot-time init, an ACPI call, and a
write to a sysfs file.  Bootmem is taken care of.  The write to a sysfs
file can't happen yet because userspace isn't up.

The only question would be about ACPI.  I _guess_ an ACPI event could
come in at any time, and could hit this race window.  

One other thought I had was to add an argument to memmap_init_zone() to
indicate that the memory being fed to it was contiguous and did not need
the validation checks.

Anybody have thoughts on that?

-- Dave



* Re: [PATCH] Fix sparsemem on Cell
  2006-12-15 17:24   ` Dave Hansen
@ 2006-12-15 19:45     ` Andrew Morton
  2006-12-16  8:03       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2006-12-15 19:45 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Whitcroft, cbe-oss-dev, linuxppc-dev, linux-mm, mkravetz,
	hch, jk, linux-kernel, paulus, benh, gone, Keith Mannthey

On Fri, 15 Dec 2006 09:24:00 -0800
Dave Hansen <haveblue@us.ibm.com> wrote:

> 
> ...
>
> I think the comments added say it pretty well, but I'll repeat it here.
> 
> This fix is pretty similar in concept to the one that Arnd posted
> as a temporary workaround, but I've added a few comments explaining
> what the actual assumptions are, and improved it a wee little bit.
> 
> The end goal here is to simply avoid calling the early_*() functions
> when it is _not_ early.  Those functions stop working as soon as
> free_initmem() is called.  system_state is set to SYSTEM_RUNNING
> just after free_initmem() is called, so it seems appropriate to use
> here.

Would really prefer not to do this.  system_state is evil.  Its semantics
are poorly-defined and if someone changes them a bit, or changes memory
initialisation order, you get whacked.

I think an mm-private flag with /*documented*/ semantics would be better. 
It's only a byte.

> +static int __meminit can_online_pfn_into_nid(unsigned long pfn, int nid)

I spent some time trying to work out what "can_online_pfn_into_nid" can
possibly mean and failed.  "We can bring a pfn online then turn it into a
NID"?  Don't think so.  "We can bring this page online and allocate it to
this node"?  Maybe.

Perhaps if the function's role in the world was commented it would be clearer.



* Re: [PATCH] Fix sparsemem on Cell
  2006-12-15 16:53 [PATCH] Fix sparsemem on Cell Dave Hansen
  2006-12-15 17:11 ` Andy Whitcroft
  2006-12-15 17:22 ` [PATCH] Fix sparsemem on Cell Michael Buesch
@ 2006-12-15 20:21 ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 22+ messages in thread
From: Benjamin Herrenschmidt @ 2006-12-15 20:21 UTC (permalink / raw)
  To: Dave Hansen
  Cc: cbe-oss-dev, linuxppc-dev, linux-mm, apw, mkravetz, hch, jk,
	linux-kernel, akpm, paulus, gone


> The only other assumption is that all memory-hotplug-time pages 
> given to memmap_init_zone() are valid and able to be onlined into
> any any zone after the system is running.  The "valid" part is
> really just a question of whether or not a 'struct page' is there
> for the pfn, and *not* whether there is actual memory.  Since
> all sparsemem sections have contiguous mem_map[]s within them,
> and we only memory hotplug entire sparsemem sections, we can
> be confident that this assumption will hold.
> 
> As for the memory being in the right node, we'll assume tha
> memory hotplug is putting things in the right node.

BTW, just that people know, what we are adding isn't even memory :-) We
are calling __add_pages() to create struct page for the SPE local stores
and register space as we use them later from a nopage() handler (and no,
we can't use no_pfn just yet for various reasons, notably we need to
handle races with unmap_mapping_ranges() and thus have the truncate
logic in).

Those pages, thus, must never be onlined. Ever. It might make sense to
create a way to inform memory hotplug of that fact, but on the other
hand, I wouldn't bother as I have a plan to get rid of those
__add_pages() completely and work without struct page, maybe in a 2.6.21
timeframe.

Ben.




* Re: [PATCH] Fix sparsemem on Cell
  2006-12-15 19:45     ` Andrew Morton
@ 2006-12-16  8:03       ` KAMEZAWA Hiroyuki
  2006-12-18 21:13         ` Dave Hansen
  2006-12-18 22:54         ` Arnd Bergmann
  0 siblings, 2 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-12-16  8:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: haveblue, apw, cbe-oss-dev, linuxppc-dev, linux-mm, mkravetz,
	hch, jk, linux-kernel, paulus, benh, gone, kmannth

On Fri, 15 Dec 2006 11:45:36 -0800
Andrew Morton <akpm@osdl.org> wrote:

> Perhaps if the function's role in the world was commented it would be clearer.
> 

How about a patch like this? (This one is not tested.)
Is an already-existing, more generic flag available?
 
-Kame
==
 include/linux/memory_hotplug.h |    8 ++++++++
 mm/memory_hotplug.c            |   14 ++++++++++++++
 mm/page_alloc.c                |   11 +++++++----
 3 files changed, 29 insertions(+), 4 deletions(-)

Index: linux-2.6.20-rc1-mm1/include/linux/memory_hotplug.h
===================================================================
--- linux-2.6.20-rc1-mm1.orig/include/linux/memory_hotplug.h	2006-11-30 06:57:37.000000000 +0900
+++ linux-2.6.20-rc1-mm1/include/linux/memory_hotplug.h	2006-12-16 16:42:40.000000000 +0900
@@ -133,6 +133,10 @@
 #endif /* CONFIG_NUMA */
 #endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
 
+extern int under_memory_hotadd(void);
+
+
+
 #else /* ! CONFIG_MEMORY_HOTPLUG */
 /*
  * Stub functions for when hotplug is off
@@ -159,6 +163,10 @@
 	dump_stack();
 	return -ENOSYS;
 }
+static inline int under_memory_hotadd(void)
+{
+	return 0;
+}
 
 #endif /* ! CONFIG_MEMORY_HOTPLUG */
 static inline int __remove_pages(struct zone *zone, unsigned long start_pfn,
Index: linux-2.6.20-rc1-mm1/mm/memory_hotplug.c
===================================================================
--- linux-2.6.20-rc1-mm1.orig/mm/memory_hotplug.c	2006-12-16 14:24:10.000000000 +0900
+++ linux-2.6.20-rc1-mm1/mm/memory_hotplug.c	2006-12-16 16:51:08.000000000 +0900
@@ -26,6 +26,17 @@
 
 #include <asm/tlbflush.h>
 
+/*
+ * Because memory hotplug shares some code for initialization with boot,
+ * we sometimes have to check which context we are in.
+ */
+static atomic_t memory_hotadd_count;
+
+int under_memory_hotadd(void)
+{
+	return atomic_read(&memory_hotadd_count);
+}
+
 /* add this memory to iomem resource */
 static struct resource *register_memory_resource(u64 start, u64 size)
 {
@@ -273,10 +284,13 @@
 		if (ret)
 			goto error;
 	}
+	atomic_inc(&memory_hotadd_count);
 
 	/* call arch's memory hotadd */
 	ret = arch_add_memory(nid, start, size);
 
+	atomic_dec(&memory_hotadd_count);
+
 	if (ret < 0)
 		goto error;
 
Index: linux-2.6.20-rc1-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.20-rc1-mm1.orig/mm/page_alloc.c	2006-12-16 14:24:38.000000000 +0900
+++ linux-2.6.20-rc1-mm1/mm/page_alloc.c	2006-12-16 16:53:02.000000000 +0900
@@ -2069,10 +2069,13 @@
 	unsigned long pfn;
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
-		if (!early_pfn_valid(pfn))
-			continue;
-		if (!early_pfn_in_nid(pfn, nid))
-			continue;
+		if (!under_memory_hotadd()) {
+			/* we are in boot */
+			if (!early_pfn_valid(pfn))
+				continue;
+			if (!early_pfn_in_nid(pfn, nid))
+				continue;
+		}
 		page = pfn_to_page(pfn);
 		set_page_links(page, zone, nid, pfn);
 		init_page_count(page);



* Re: [PATCH] Fix sparsemem on Cell
  2006-12-16  8:03       ` KAMEZAWA Hiroyuki
@ 2006-12-18 21:13         ` Dave Hansen
  2006-12-18 22:15           ` Christoph Hellwig
  2006-12-18 22:54         ` Arnd Bergmann
  1 sibling, 1 reply; 22+ messages in thread
From: Dave Hansen @ 2006-12-18 21:13 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, apw, cbe-oss-dev, linuxppc-dev, linux-mm,
	mkravetz, hch, jk, linux-kernel, paulus, benh, gone, kmannth

On Sat, 2006-12-16 at 17:03 +0900, KAMEZAWA Hiroyuki wrote:
>  /* add this memory to iomem resource */
>  static struct resource *register_memory_resource(u64 start, u64 size)
>  {
> @@ -273,10 +284,13 @@
>  		if (ret)
>  			goto error;
>  	}
> +	atomic_inc(&memory_hotadd_count);
>  
>  	/* call arch's memory hotadd */
>  	ret = arch_add_memory(nid, start, size);
>  
> +	atomic_dec(&memory_hotadd_count);

I'd be willing to bet that this will work just fine.  But I think we can
do it without any static state at all, if we just pass a runtime-or-not
flag down into the arch_add_memory() call chain.

I'll code that up so we can compare to yours.  

-- Dave



* Re: [PATCH] Fix sparsemem on Cell
  2006-12-18 21:13         ` Dave Hansen
@ 2006-12-18 22:15           ` Christoph Hellwig
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2006-12-18 22:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, apw, cbe-oss-dev, linuxppc-dev,
	linux-mm, mkravetz, hch, jk, linux-kernel, paulus, benh, gone,
	kmannth

On Mon, Dec 18, 2006 at 01:13:57PM -0800, Dave Hansen wrote:
> On Sat, 2006-12-16 at 17:03 +0900, KAMEZAWA Hiroyuki wrote:
> >  /* add this memory to iomem resource */
> >  static struct resource *register_memory_resource(u64 start, u64 size)
> >  {
> > @@ -273,10 +284,13 @@
> >  		if (ret)
> >  			goto error;
> >  	}
> > +	atomic_inc(&memory_hotadd_count);
> >  
> >  	/* call arch's memory hotadd */
> >  	ret = arch_add_memory(nid, start, size);
> >  
> > +	atomic_dec(&memory_hotadd_count);
> 
> I'd be willing to bet that this will work just fine.  But, I think we can
> do it without any static state at all, if we just pass a runtime-or-not
> flag down into the arch_add_memory() call chain.
> 
> I'll code that up so we can compare to yours.  

Yes, I strongly concur that passing an explicit flag is much, much better
than any hack involving global state.

* Re: [PATCH] Fix sparsemem on Cell
  2006-12-16  8:03       ` KAMEZAWA Hiroyuki
  2006-12-18 21:13         ` Dave Hansen
@ 2006-12-18 22:54         ` Arnd Bergmann
  2006-12-18 23:16           ` Dave Hansen
  1 sibling, 1 reply; 22+ messages in thread
From: Arnd Bergmann @ 2006-12-18 22:54 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, kmannth, linux-kernel, hch,
	linux-mm, paulus, mkravetz, gone, cbe-oss-dev, Dave Hansen

On Saturday 16 December 2006 09:03, KAMEZAWA Hiroyuki wrote:
> @@ -273,10 +284,13 @@
>                 if (ret)
>                         goto error;
>         }
> +       atomic_inc(&memory_hotadd_count);
>  
>         /* call arch's memory hotadd */
>         ret = arch_add_memory(nid, start, size);
>  
> +       atomic_dec(&memory_hotadd_count);
> +
>         if (ret < 0)
>                 goto error;
>  

This also doesn't fix the problem on cell, since at the time the bug
happens, we're not calling through this function, or arch_add_memory,
at all, but rather invoke __add_pages directly.  As BenH already mentioned,
we shouldn't really call __add_pages at all.

Let me attempt another fix that might address all cases. This is completely
untested as of now, but also addresses Dave's latest comment.

	Arnd <><

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 1a3d8a2..723d220 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -543,7 +543,7 @@ virtual_memmap_init (u64 start, u64 end, void *arg)
 
 	if (map_start < map_end)
 		memmap_init_zone((unsigned long)(map_end - map_start),
-				 args->nid, args->zone, page_to_pfn(map_start));
+				 args->nid, args->zone, page_to_pfn(map_start), 1);
 	return 0;
 }
 
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 7f2944d..1e52cd1 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -61,7 +61,7 @@ void memmap_init(unsigned long size, int nid, unsigned long zone,
 
 		if (map_start < map_end)
 			memmap_init_zone((unsigned long)(map_end - map_start),
-					 nid, zone, page_to_pfn(map_start));
+					 nid, zone, page_to_pfn(map_start), 1);
 	}
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a17b147..6d85068 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -978,7 +978,7 @@ extern int early_pfn_to_nid(unsigned long pfn);
 #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
 #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
 extern void set_dma_reserve(unsigned long new_dma_reserve);
-extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long);
+extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long, int);
 extern void setup_per_zone_pages_min(void);
 extern void mem_init(void);
 extern void show_mem(void);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0c055a0..16c9930 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -71,7 +71,7 @@ static int __add_zone(struct zone *zone, unsigned long phys_start_pfn)
 		if (ret < 0)
 			return ret;
 	}
-	memmap_init_zone(nr_pages, nid, zone_type, phys_start_pfn);
+	memmap_init_zone(nr_pages, nid, zone_type, phys_start_pfn, 0);
 	return 0;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8c1a116..60d1ac8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1953,17 +1953,19 @@ static inline unsigned long wait_table_bits(unsigned long size)
  * done. Non-atomic initialization, single-pass.
  */
 void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
-		unsigned long start_pfn)
+		unsigned long start_pfn, int early)
 {
 	struct page *page;
 	unsigned long end_pfn = start_pfn + size;
 	unsigned long pfn;
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
-		if (!early_pfn_valid(pfn))
-			continue;
-		if (!early_pfn_in_nid(pfn, nid))
-			continue;
+		if (early) {
+			if (!early_pfn_valid(pfn))
+				continue;
+			if (!early_pfn_in_nid(pfn, nid))
+				continue;
+		}
 		page = pfn_to_page(pfn);
 		set_page_links(page, zone, nid, pfn);
 		init_page_count(page);
@@ -1990,7 +1992,7 @@ void zone_init_free_lists(struct pglist_data *pgdat, struct zone *zone,
 
 #ifndef __HAVE_ARCH_MEMMAP_INIT
 #define memmap_init(size, nid, zone, start_pfn) \
-	memmap_init_zone((size), (nid), (zone), (start_pfn))
+	memmap_init_zone((size), (nid), (zone), (start_pfn), 1)
 #endif
 
 static int __cpuinit zone_batchsize(struct zone *zone)

* Re: [PATCH] Fix sparsemem on Cell
  2006-12-18 22:54         ` Arnd Bergmann
@ 2006-12-18 23:16           ` Dave Hansen
  2006-12-19  0:16             ` KAMEZAWA Hiroyuki
  2006-12-19  8:59             ` Arnd Bergmann
  0 siblings, 2 replies; 22+ messages in thread
From: Dave Hansen @ 2006-12-18 23:16 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linuxppc-dev, KAMEZAWA Hiroyuki, Andrew Morton, kmannth,
	linux-kernel, hch, linux-mm, paulus, mkravetz, gone, cbe-oss-dev

On Mon, 2006-12-18 at 23:54 +0100, Arnd Bergmann wrote:
>  #ifndef __HAVE_ARCH_MEMMAP_INIT
>  #define memmap_init(size, nid, zone, start_pfn) \
> -	memmap_init_zone((size), (nid), (zone), (start_pfn))
> +	memmap_init_zone((size), (nid), (zone), (start_pfn), 1)
>  #endif

This is what I was thinking of.  Sometimes I find these kinds of calls a
bit annoying:

	foo(0, 1, 1, 0, 99, 22)

It only takes a minute to look up what all of the numbers do, but that
is one minute too many. :)

How about an enum, or a pair of #defines?

enum context
{
        EARLY,
        HOTPLUG
};
extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long,
                                enum context);
...

So, the call I quoted above would become:

	memmap_init_zone((size), (nid), (zone), (start_pfn), EARLY)

-- Dave


* Re: [PATCH] Fix sparsemem on Cell
  2006-12-18 23:16           ` Dave Hansen
@ 2006-12-19  0:16             ` KAMEZAWA Hiroyuki
  2006-12-19  8:59             ` Arnd Bergmann
  1 sibling, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-12-19  0:16 UTC (permalink / raw)
  To: Dave Hansen
  Cc: arnd, linuxppc-dev, akpm, kmannth, linux-kernel, hch, linux-mm,
	paulus, mkravetz, gone, cbe-oss-dev

On Mon, 18 Dec 2006 15:16:20 -0800
Dave Hansen <haveblue@us.ibm.com> wrote:

> enum context
> {
>         EARLY,
>         HOTPLUG
> };
I like this :)

Thanks,
-Kame


* Re: [PATCH] Fix sparsemem on Cell
  2006-12-18 23:16           ` Dave Hansen
  2006-12-19  0:16             ` KAMEZAWA Hiroyuki
@ 2006-12-19  8:59             ` Arnd Bergmann
  2006-12-19 19:34               ` Dave Hansen
  2007-01-06  1:10               ` [PATCH] Fix sparsemem on Cell (take 3) Dave Hansen
  1 sibling, 2 replies; 22+ messages in thread
From: Arnd Bergmann @ 2006-12-19  8:59 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linuxppc-dev, KAMEZAWA Hiroyuki, Andrew Morton, kmannth,
	linux-kernel, hch, linux-mm, paulus, mkravetz, gone, cbe-oss-dev

On Tuesday 19 December 2006 00:16, Dave Hansen wrote:
> How about an enum, or a pair of #defines?
> 
> enum context
> {
>         EARLY,
>         HOTPLUG
> };

Sounds good, but since this is in a global header file, it needs
to be in an appropriate name space, like

enum memmap_context {
	MEMMAP_EARLY,
	MEMMAP_HOTPLUG,
};

	Arnd <><

* Re: [PATCH] Fix sparsemem on Cell
  2006-12-19  8:59             ` Arnd Bergmann
@ 2006-12-19 19:34               ` Dave Hansen
  2007-01-06  1:10               ` [PATCH] Fix sparsemem on Cell (take 3) Dave Hansen
  1 sibling, 0 replies; 22+ messages in thread
From: Dave Hansen @ 2006-12-19 19:34 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linuxppc-dev, KAMEZAWA Hiroyuki, Andrew Morton, kmannth,
	linux-kernel, hch, linux-mm, paulus, mkravetz, gone

Pulling cbe... list off the cc because it is giving annoying 'too many
recipients' warnings.

On Tue, 2006-12-19 at 09:59 +0100, Arnd Bergmann wrote:
> On Tuesday 19 December 2006 00:16, Dave Hansen wrote:
> > How about an enum, or a pair of #defines?
> > 
> > enum context
> > {
> >         EARLY,
> >         HOTPLUG
> > };

This patch should do the trick.  Arnd, can you try this to make sure it
solves the issue?  Also, if you get a chance, could you do a quick s390
boot of it?  It was the only arch touched by this whole thing.

It compiles on all of my i386 .configs.

--

The following patch fixes an oops experienced on the Cell architecture
when init-time functions, early_*(), are called at runtime.  It alters
the call paths to make sure that the callers explicitly say whether the
call is being made on behalf of a hotplug event, or happening at
boot-time. 

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>

---

 lxc-dave/arch/s390/mm/vmem.c    |    3 ++-
 lxc-dave/include/linux/mm.h     |    7 ++++++-
 lxc-dave/include/linux/mmzone.h |    3 ++-
 lxc-dave/mm/memory_hotplug.c    |    6 ++++--
 lxc-dave/mm/page_alloc.c        |   25 +++++++++++++++++--------
 5 files changed, 31 insertions(+), 13 deletions(-)

diff -puN mm/page_alloc.c~sparsemem-enum1 mm/page_alloc.c
--- lxc/mm/page_alloc.c~sparsemem-enum1	2006-12-19 09:38:34.000000000 -0800
+++ lxc-dave/mm/page_alloc.c	2006-12-19 11:18:24.000000000 -0800
@@ -2062,17 +2062,24 @@ static inline unsigned long wait_table_b
  * done. Non-atomic initialization, single-pass.
  */
 void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
-		unsigned long start_pfn)
+		unsigned long start_pfn, enum memmap_context context)
 {
 	struct page *page;
 	unsigned long end_pfn = start_pfn + size;
 	unsigned long pfn;
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
-		if (!early_pfn_valid(pfn))
-			continue;
-		if (!early_pfn_in_nid(pfn, nid))
-			continue;
+		/*
+		 * There can be holes in boot-time mem_map[]s
+		 * handed to this function.  They do not
+		 * exist on hotplugged memory.
+		 */
+		if (context == MEMMAP_EARLY) {
+			if (!early_pfn_valid(pfn))
+				continue;
+			if (!early_pfn_in_nid(pfn, nid))
+				continue;
+		}
 		page = pfn_to_page(pfn);
 		set_page_links(page, zone, nid, pfn);
 		init_page_count(page);
@@ -2102,7 +2109,7 @@ void zone_init_free_lists(struct pglist_
 
 #ifndef __HAVE_ARCH_MEMMAP_INIT
 #define memmap_init(size, nid, zone, start_pfn) \
-	memmap_init_zone((size), (nid), (zone), (start_pfn))
+	memmap_init_zone((size), (nid), (zone), (start_pfn), MEMMAP_EARLY)
 #endif
 
 static int __cpuinit zone_batchsize(struct zone *zone)
@@ -2348,7 +2355,8 @@ static __meminit void zone_pcp_init(stru
 
 __meminit int init_currently_empty_zone(struct zone *zone,
 					unsigned long zone_start_pfn,
-					unsigned long size)
+					unsigned long size,
+					enum memmap_context context)
 {
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	int ret;
@@ -2792,7 +2800,8 @@ static void __meminit free_area_init_cor
 		if (!size)
 			continue;
 
-		ret = init_currently_empty_zone(zone, zone_start_pfn, size);
+		ret = init_currently_empty_zone(zone, zone_start_pfn,
+						size, MEMMAP_EARLY);
 		BUG_ON(ret);
 		zone_start_pfn += size;
 	}
diff -puN include/linux/mm.h~sparsemem-enum1 include/linux/mm.h
--- lxc/include/linux/mm.h~sparsemem-enum1	2006-12-19 09:38:45.000000000 -0800
+++ lxc-dave/include/linux/mm.h	2006-12-19 09:50:47.000000000 -0800
@@ -979,7 +979,12 @@ extern int early_pfn_to_nid(unsigned lon
 #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
 #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
 extern void set_dma_reserve(unsigned long new_dma_reserve);
-extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long);
+enum memmap_context {
+	MEMMAP_EARLY,
+	MEMMAP_HOTPLUG,
+};
+extern void memmap_init_zone(unsigned long, int, unsigned long,
+				unsigned long, enum memmap_context);
 extern void setup_per_zone_pages_min(void);
 extern void mem_init(void);
 extern void show_mem(void);
diff -puN mm/memory_hotplug.c~sparsemem-enum1 mm/memory_hotplug.c
--- lxc/mm/memory_hotplug.c~sparsemem-enum1	2006-12-19 09:39:19.000000000 -0800
+++ lxc-dave/mm/memory_hotplug.c	2006-12-19 09:50:24.000000000 -0800
@@ -67,11 +67,13 @@ static int __add_zone(struct zone *zone,
 	zone_type = zone - pgdat->node_zones;
 	if (!populated_zone(zone)) {
 		int ret = 0;
-		ret = init_currently_empty_zone(zone, phys_start_pfn, nr_pages);
+		ret = init_currently_empty_zone(zone, phys_start_pfn,
+						nr_pages, MEMMAP_HOTPLUG);
 		if (ret < 0)
 			return ret;
 	}
-	memmap_init_zone(nr_pages, nid, zone_type, phys_start_pfn);
+	memmap_init_zone(nr_pages, nid, zone_type,
+			 phys_start_pfn, MEMMAP_HOTPLUG);
 	return 0;
 }
 
diff -puN arch/s390/mm/vmem.c~sparsemem-enum1 arch/s390/mm/vmem.c
--- lxc/arch/s390/mm/vmem.c~sparsemem-enum1	2006-12-19 09:42:26.000000000 -0800
+++ lxc-dave/arch/s390/mm/vmem.c	2006-12-19 11:17:49.000000000 -0800
@@ -61,7 +61,8 @@ void memmap_init(unsigned long size, int
 
 		if (map_start < map_end)
 			memmap_init_zone((unsigned long)(map_end - map_start),
-					 nid, zone, page_to_pfn(map_start));
+					 nid, zone, page_to_pfn(map_start),
+					 context);
 	}
 }
 
diff -puN include/linux/mmzone.h~sparsemem-enum1 include/linux/mmzone.h
--- lxc/include/linux/mmzone.h~sparsemem-enum1	2006-12-19 09:48:58.000000000 -0800
+++ lxc-dave/include/linux/mmzone.h	2006-12-19 09:53:42.000000000 -0800
@@ -474,7 +474,8 @@ int zone_watermark_ok(struct zone *z, in
 		int classzone_idx, int alloc_flags);
 
 extern int init_currently_empty_zone(struct zone *zone, unsigned long start_pfn,
-				     unsigned long size);
+				     unsigned long size,
+				     enum memmap_context context);
 
 #ifdef CONFIG_HAVE_MEMORY_PRESENT
 void memory_present(int nid, unsigned long start, unsigned long end);
_


-- Dave


* [PATCH] Fix sparsemem on Cell (take 3)
  2006-12-19  8:59             ` Arnd Bergmann
  2006-12-19 19:34               ` Dave Hansen
@ 2007-01-06  1:10               ` Dave Hansen
  2007-01-06  4:52                 ` John Rose
  1 sibling, 1 reply; 22+ messages in thread
From: Dave Hansen @ 2007-01-06  1:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linuxppc-dev, KAMEZAWA Hiroyuki, Andrew Morton, kmannth,
	linux-kernel, hch, linux-mm, paulus, mkravetz, gone, cbe-oss-dev,
	Keith Mannthey, John Rose, Arnd Bergmann

I dropped this on the floor over Christmas.  This has had a few smoke
tests on ppc64 and i386 and is ready for -mm.  Against 2.6.20-rc2-mm1.

The following patch fixes an oops experienced on the Cell architecture
when init-time functions, early_*(), are called at runtime.  It alters
the call paths to make sure that the callers explicitly say whether the
call is being made on behalf of a hotplug event, or happening at
boot-time. 

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>

---

 lxc-dave/arch/s390/mm/vmem.c    |    3 ++-
 lxc-dave/include/linux/mm.h     |    7 ++++++-
 lxc-dave/include/linux/mmzone.h |    3 ++-
 lxc-dave/mm/memory_hotplug.c    |    6 ++++--
 lxc-dave/mm/page_alloc.c        |   25 +++++++++++++++++--------
 5 files changed, 31 insertions(+), 13 deletions(-)

diff -puN mm/page_alloc.c~sparsemem-enum1 mm/page_alloc.c
--- lxc/mm/page_alloc.c~sparsemem-enum1	2006-12-19 09:38:34.000000000 -0800
+++ lxc-dave/mm/page_alloc.c	2006-12-19 11:18:24.000000000 -0800
@@ -2062,17 +2062,24 @@ static inline unsigned long wait_table_b
  * done. Non-atomic initialization, single-pass.
  */
 void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
-		unsigned long start_pfn)
+		unsigned long start_pfn, enum memmap_context context)
 {
 	struct page *page;
 	unsigned long end_pfn = start_pfn + size;
 	unsigned long pfn;
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
-		if (!early_pfn_valid(pfn))
-			continue;
-		if (!early_pfn_in_nid(pfn, nid))
-			continue;
+		/*
+		 * There can be holes in boot-time mem_map[]s
+		 * handed to this function.  They do not
+		 * exist on hotplugged memory.
+		 */
+		if (context == MEMMAP_EARLY) {
+			if (!early_pfn_valid(pfn))
+				continue;
+			if (!early_pfn_in_nid(pfn, nid))
+				continue;
+		}
 		page = pfn_to_page(pfn);
 		set_page_links(page, zone, nid, pfn);
 		init_page_count(page);
@@ -2102,7 +2109,7 @@ void zone_init_free_lists(struct pglist_
 
 #ifndef __HAVE_ARCH_MEMMAP_INIT
 #define memmap_init(size, nid, zone, start_pfn) \
-	memmap_init_zone((size), (nid), (zone), (start_pfn))
+	memmap_init_zone((size), (nid), (zone), (start_pfn), MEMMAP_EARLY)
 #endif
 
 static int __cpuinit zone_batchsize(struct zone *zone)
@@ -2348,7 +2355,8 @@ static __meminit void zone_pcp_init(stru
 
 __meminit int init_currently_empty_zone(struct zone *zone,
 					unsigned long zone_start_pfn,
-					unsigned long size)
+					unsigned long size,
+					enum memmap_context context)
 {
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	int ret;
@@ -2792,7 +2800,8 @@ static void __meminit free_area_init_cor
 		if (!size)
 			continue;
 
-		ret = init_currently_empty_zone(zone, zone_start_pfn, size);
+		ret = init_currently_empty_zone(zone, zone_start_pfn,
+						size, MEMMAP_EARLY);
 		BUG_ON(ret);
 		zone_start_pfn += size;
 	}
diff -puN include/linux/mm.h~sparsemem-enum1 include/linux/mm.h
--- lxc/include/linux/mm.h~sparsemem-enum1	2006-12-19 09:38:45.000000000 -0800
+++ lxc-dave/include/linux/mm.h	2006-12-19 09:50:47.000000000 -0800
@@ -979,7 +979,12 @@ extern int early_pfn_to_nid(unsigned lon
 #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
 #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
 extern void set_dma_reserve(unsigned long new_dma_reserve);
-extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long);
+enum memmap_context {
+	MEMMAP_EARLY,
+	MEMMAP_HOTPLUG,
+};
+extern void memmap_init_zone(unsigned long, int, unsigned long,
+				unsigned long, enum memmap_context);
 extern void setup_per_zone_pages_min(void);
 extern void mem_init(void);
 extern void show_mem(void);
diff -puN mm/memory_hotplug.c~sparsemem-enum1 mm/memory_hotplug.c
--- lxc/mm/memory_hotplug.c~sparsemem-enum1	2006-12-19 09:39:19.000000000 -0800
+++ lxc-dave/mm/memory_hotplug.c	2006-12-19 09:50:24.000000000 -0800
@@ -67,11 +67,13 @@ static int __add_zone(struct zone *zone,
 	zone_type = zone - pgdat->node_zones;
 	if (!populated_zone(zone)) {
 		int ret = 0;
-		ret = init_currently_empty_zone(zone, phys_start_pfn, nr_pages);
+		ret = init_currently_empty_zone(zone, phys_start_pfn,
+						nr_pages, MEMMAP_HOTPLUG);
 		if (ret < 0)
 			return ret;
 	}
-	memmap_init_zone(nr_pages, nid, zone_type, phys_start_pfn);
+	memmap_init_zone(nr_pages, nid, zone_type,
+			 phys_start_pfn, MEMMAP_HOTPLUG);
 	return 0;
 }
 
diff -puN arch/s390/mm/vmem.c~sparsemem-enum1 arch/s390/mm/vmem.c
--- lxc/arch/s390/mm/vmem.c~sparsemem-enum1	2006-12-19 09:42:26.000000000 -0800
+++ lxc-dave/arch/s390/mm/vmem.c	2006-12-19 11:17:49.000000000 -0800
@@ -61,7 +61,8 @@ void memmap_init(unsigned long size, int
 
 		if (map_start < map_end)
 			memmap_init_zone((unsigned long)(map_end - map_start),
-					 nid, zone, page_to_pfn(map_start));
+					 nid, zone, page_to_pfn(map_start),
+					 context);
 	}
 }
 
diff -puN include/linux/mmzone.h~sparsemem-enum1 include/linux/mmzone.h
--- lxc/include/linux/mmzone.h~sparsemem-enum1	2006-12-19 09:48:58.000000000 -0800
+++ lxc-dave/include/linux/mmzone.h	2006-12-19 09:53:42.000000000 -0800
@@ -474,7 +474,8 @@ int zone_watermark_ok(struct zone *z, in
 		int classzone_idx, int alloc_flags);
 
 extern int init_currently_empty_zone(struct zone *zone, unsigned long start_pfn,
-				     unsigned long size);
+				     unsigned long size,
+				     enum memmap_context context);
 
 #ifdef CONFIG_HAVE_MEMORY_PRESENT
 void memory_present(int nid, unsigned long start, unsigned long end);
_


-- Dave


* Re: [PATCH] Fix sparsemem on Cell (take 3)
  2007-01-06  1:10               ` [PATCH] Fix sparsemem on Cell (take 3) Dave Hansen
@ 2007-01-06  4:52                 ` John Rose
  2007-01-07  8:58                   ` Dave Hansen
  0 siblings, 1 reply; 22+ messages in thread
From: John Rose @ 2007-01-06  4:52 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, External List, KAMEZAWA Hiroyuki, kmannth, lkml,
	hch, linux-mm, Paul Mackerras, mkravetz, gone, cbe-oss-dev,
	Arnd Bergmann

> I dropped this on the floor over Christmas.  This has had a few smoke
> tests on ppc64 and i386 and is ready for -mm.  Against 2.6.20-rc2-mm1.

Could this break ia64, given that it uses memmap_init_zone()?



* Re: [PATCH] Fix sparsemem on Cell (take 3)
  2007-01-06  4:52                 ` John Rose
@ 2007-01-07  8:58                   ` Dave Hansen
  2007-01-07 12:07                     ` Arnd Bergmann
  2007-01-08  6:47                     ` Tim Pepper
  0 siblings, 2 replies; 22+ messages in thread
From: Dave Hansen @ 2007-01-07  8:58 UTC (permalink / raw)
  To: John Rose
  Cc: Andrew Morton, External List, KAMEZAWA Hiroyuki, kmannth, lkml,
	hch, linux-mm, Paul Mackerras, mkravetz, gone, Arnd Bergmann

On Fri, 2007-01-05 at 22:52 -0600, John Rose wrote:
> > I dropped this on the floor over Christmas.  This has had a few smoke
> > tests on ppc64 and i386 and is ready for -mm.  Against 2.6.20-rc2-mm1.
> 
> Could this break ia64, given that it uses memmap_init_zone()?

You are right, I think it does.

Here's an updated patch to replace the earlier one.  I had to move the
enum definition over to a different header because ia64 evidently has a
different include order.

---

The following patch fixes an oops experienced on the Cell architecture
when init-time functions, early_*(), are called at runtime.  It alters
the call paths to make sure that the callers explicitly say whether the
call is being made on behalf of a hotplug event, or happening at
boot-time. 

It has been compile tested on ia64, s390, i386 and x86_64.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>

---

 lxc-dave/arch/ia64/mm/init.c    |    5 +++--
 lxc-dave/arch/s390/mm/vmem.c    |    3 ++-
 lxc-dave/include/linux/mm.h     |    3 ++-
 lxc-dave/include/linux/mmzone.h |    8 ++++++--
 lxc-dave/mm/memory_hotplug.c    |    6 ++++--
 lxc-dave/mm/page_alloc.c        |   25 +++++++++++++++++--------
 6 files changed, 34 insertions(+), 16 deletions(-)

diff -puN arch/s390/mm/vmem.c~Re-_PATCH_Fix_sparsemem_on_Cell arch/s390/mm/vmem.c
--- lxc/arch/s390/mm/vmem.c~Re-_PATCH_Fix_sparsemem_on_Cell	2007-01-05 15:38:23.000000000 -0800
+++ lxc-dave/arch/s390/mm/vmem.c	2007-01-07 00:47:02.000000000 -0800
@@ -61,7 +61,8 @@ void memmap_init(unsigned long size, int
 
 		if (map_start < map_end)
 			memmap_init_zone((unsigned long)(map_end - map_start),
-					 nid, zone, page_to_pfn(map_start));
+					 nid, zone, page_to_pfn(map_start),
+					 MEMMAP_EARLY);
 	}
 }
 
diff -puN include/linux/mm.h~Re-_PATCH_Fix_sparsemem_on_Cell include/linux/mm.h
--- lxc/include/linux/mm.h~Re-_PATCH_Fix_sparsemem_on_Cell	2007-01-05 15:38:23.000000000 -0800
+++ lxc-dave/include/linux/mm.h	2007-01-06 23:57:59.000000000 -0800
@@ -979,7 +979,8 @@ extern int early_pfn_to_nid(unsigned lon
 #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
 #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
 extern void set_dma_reserve(unsigned long new_dma_reserve);
-extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long);
+extern void memmap_init_zone(unsigned long, int, unsigned long,
+				unsigned long, enum memmap_context);
 extern void setup_per_zone_pages_min(void);
 extern void mem_init(void);
 extern void show_mem(void);
diff -puN include/linux/mmzone.h~Re-_PATCH_Fix_sparsemem_on_Cell include/linux/mmzone.h
--- lxc/include/linux/mmzone.h~Re-_PATCH_Fix_sparsemem_on_Cell	2007-01-05 15:38:23.000000000 -0800
+++ lxc-dave/include/linux/mmzone.h	2007-01-06 23:58:15.000000000 -0800
@@ -471,9 +471,13 @@ void build_all_zonelists(void);
 void wakeup_kswapd(struct zone *zone, int order);
 int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 		int classzone_idx, int alloc_flags);
-
+enum memmap_context {
+	MEMMAP_EARLY,
+	MEMMAP_HOTPLUG,
+};
 extern int init_currently_empty_zone(struct zone *zone, unsigned long start_pfn,
-				     unsigned long size);
+				     unsigned long size,
+				     enum memmap_context context);
 
 #ifdef CONFIG_HAVE_MEMORY_PRESENT
 void memory_present(int nid, unsigned long start, unsigned long end);
diff -puN mm/memory_hotplug.c~Re-_PATCH_Fix_sparsemem_on_Cell mm/memory_hotplug.c
--- lxc/mm/memory_hotplug.c~Re-_PATCH_Fix_sparsemem_on_Cell	2007-01-05 15:38:23.000000000 -0800
+++ lxc-dave/mm/memory_hotplug.c	2007-01-05 15:38:23.000000000 -0800
@@ -67,11 +67,13 @@ static int __add_zone(struct zone *zone,
 	zone_type = zone - pgdat->node_zones;
 	if (!populated_zone(zone)) {
 		int ret = 0;
-		ret = init_currently_empty_zone(zone, phys_start_pfn, nr_pages);
+		ret = init_currently_empty_zone(zone, phys_start_pfn,
+						nr_pages, MEMMAP_HOTPLUG);
 		if (ret < 0)
 			return ret;
 	}
-	memmap_init_zone(nr_pages, nid, zone_type, phys_start_pfn);
+	memmap_init_zone(nr_pages, nid, zone_type,
+			 phys_start_pfn, MEMMAP_HOTPLUG);
 	return 0;
 }
 
diff -puN mm/page_alloc.c~Re-_PATCH_Fix_sparsemem_on_Cell mm/page_alloc.c
--- lxc/mm/page_alloc.c~Re-_PATCH_Fix_sparsemem_on_Cell	2007-01-05 15:38:23.000000000 -0800
+++ lxc-dave/mm/page_alloc.c	2007-01-07 00:35:27.000000000 -0800
@@ -2062,17 +2062,24 @@ static inline unsigned long wait_table_b
  * done. Non-atomic initialization, single-pass.
  */
 void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
-		unsigned long start_pfn)
+		unsigned long start_pfn, enum memmap_context context)
 {
 	struct page *page;
 	unsigned long end_pfn = start_pfn + size;
 	unsigned long pfn;
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
-		if (!early_pfn_valid(pfn))
-			continue;
-		if (!early_pfn_in_nid(pfn, nid))
-			continue;
+		/*
+		 * There can be holes in boot-time mem_map[]s
+		 * handed to this function.  They do not
+		 * exist on hotplugged memory.
+		 */
+		if (context == MEMMAP_EARLY) {
+			if (!early_pfn_valid(pfn))
+				continue;
+			if (!early_pfn_in_nid(pfn, nid))
+				continue;
+		}
 		page = pfn_to_page(pfn);
 		set_page_links(page, zone, nid, pfn);
 		init_page_count(page);
@@ -2102,7 +2109,7 @@ void zone_init_free_lists(struct pglist_
 
 #ifndef __HAVE_ARCH_MEMMAP_INIT
 #define memmap_init(size, nid, zone, start_pfn) \
-	memmap_init_zone((size), (nid), (zone), (start_pfn))
+	memmap_init_zone((size), (nid), (zone), (start_pfn), MEMMAP_EARLY)
 #endif
 
 static int __cpuinit zone_batchsize(struct zone *zone)
@@ -2348,7 +2355,8 @@ static __meminit void zone_pcp_init(stru
 
 __meminit int init_currently_empty_zone(struct zone *zone,
 					unsigned long zone_start_pfn,
-					unsigned long size)
+					unsigned long size,
+					enum memmap_context context)
 {
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	int ret;
@@ -2792,7 +2800,8 @@ static void __meminit free_area_init_cor
 		if (!size)
 			continue;
 
-		ret = init_currently_empty_zone(zone, zone_start_pfn, size);
+		ret = init_currently_empty_zone(zone, zone_start_pfn,
+						size, MEMMAP_EARLY);
 		BUG_ON(ret);
 		zone_start_pfn += size;
 	}
diff -puN arch/ia64/mm/init.c~Re-_PATCH_Fix_sparsemem_on_Cell arch/ia64/mm/init.c
--- lxc/arch/ia64/mm/init.c~Re-_PATCH_Fix_sparsemem_on_Cell	2007-01-06 23:58:55.000000000 -0800
+++ lxc-dave/arch/ia64/mm/init.c	2007-01-07 00:08:01.000000000 -0800
@@ -541,7 +541,8 @@ virtual_memmap_init (u64 start, u64 end,
 
 	if (map_start < map_end)
 		memmap_init_zone((unsigned long)(map_end - map_start),
-				 args->nid, args->zone, page_to_pfn(map_start));
+				 args->nid, args->zone, page_to_pfn(map_start),
+				 MEMMAP_EARLY);
 	return 0;
 }
 
@@ -550,7 +551,7 @@ memmap_init (unsigned long size, int nid
 	     unsigned long start_pfn)
 {
 	if (!vmem_map)
-		memmap_init_zone(size, nid, zone, start_pfn);
+		memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY);
 	else {
 		struct page *start;
 		struct memmap_init_callback_data args;
_


-- Dave


* Re: [PATCH] Fix sparsemem on Cell (take 3)
  2007-01-07  8:58                   ` Dave Hansen
@ 2007-01-07 12:07                     ` Arnd Bergmann
  2007-01-08  6:47                     ` Tim Pepper
  1 sibling, 0 replies; 22+ messages in thread
From: Arnd Bergmann @ 2007-01-07 12:07 UTC (permalink / raw)
  To: Dave Hansen
  Cc: John Rose, Andrew Morton, External List, KAMEZAWA Hiroyuki,
	kmannth, lkml, hch, linux-mm, Paul Mackerras, mkravetz, gone

On Sunday 07 January 2007 09:58, Dave Hansen wrote:
> The following patch fixes an oops experienced on the Cell architecture
> when init-time functions, early_*(), are called at runtime.  It alters
> the call paths to make sure that the callers explicitly say whether the
> call is being made on behalf of a hotplug event, or happening at
> boot-time. 
> 
> It has been compile tested on ia64, s390, i386 and x86_64.

I can't test it here, since I'm travelling at the moment, but
this version looks good to me. Thanks for picking it up again!

> Signed-off-by: Dave Hansen <haveblue@us.ibm.com>

Acked-by: Arnd Bergmann <arndb@de.ibm.com>

* Re: [PATCH] Fix sparsemem on Cell (take 3)
  2007-01-07  8:58                   ` Dave Hansen
  2007-01-07 12:07                     ` Arnd Bergmann
@ 2007-01-08  6:47                     ` Tim Pepper
  1 sibling, 0 replies; 22+ messages in thread
From: Tim Pepper @ 2007-01-08  6:47 UTC (permalink / raw)
  To: Dave Hansen
  Cc: John Rose, Andrew Morton, External List, KAMEZAWA Hiroyuki,
	kmannth, lkml, hch, linux-mm, Paul Mackerras, mkravetz, gone,
	Arnd Bergmann

On 1/7/07, Dave Hansen <haveblue@us.ibm.com> wrote:
> On Fri, 2007-01-05 at 22:52 -0600, John Rose wrote:
> > Could this break ia64, given that it uses memmap_init_zone()?
>
> You are right, I think it does.

Boot tested OK on ia64 with this latest version of the patch.

(forgot to click plain text on gmail the first time; sorry if you got
html mail or a repeat)


Tim

* Re: [PATCH] Fix sparsemem on Cell
  2006-12-15 17:14 Dave Hansen
@ 2006-12-17 23:02 ` Arnd Bergmann
  0 siblings, 0 replies; 22+ messages in thread
From: Arnd Bergmann @ 2006-12-17 23:02 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Dave Hansen, cbe-oss-dev, akpm, linux-mm, linux-kernel, hch,
	paulus, mkravetz, gone

On Friday 15 December 2006 18:14, Dave Hansen wrote:
> +       if (system_state >= SYSTEM_RUNNING)
> +               return 1;
> +       if (!early_pfn_valid(pfn))
> +               return 0;
> +       if (!early_pfn_in_nid(pfn, nid))
> +               return 0;

I haven't tried it, but I assume this is still wrong. On cell,
we didn't actually hit the case where the init sections have
been overwritten, since we call __add_pages from an initcall.

However, the pages we add are not part of the early_node_map,
so early_pfn_in_nid() returns a bogus result, causing some
page structs not to get initialized. I believe your patch
is going in the right direction, but it does not solve the
bug we have...

	Arnd <><

* [PATCH] Fix sparsemem on Cell
@ 2006-12-15 17:14 Dave Hansen
  2006-12-17 23:02 ` Arnd Bergmann
  0 siblings, 1 reply; 22+ messages in thread
From: Dave Hansen @ 2006-12-15 17:14 UTC (permalink / raw)
  To: cbe-oss-dev
  Cc: linuxppc-dev, linux-mm, apw, mkravetz, hch, jk, linux-kernel,
	akpm, paulus, benh, gone, Dave Hansen


I think the comments added say it pretty well, but I'll repeat it here.

This fix is pretty similar in concept to the one that Arnd posted
as a temporary workaround, but I've added a few comments explaining
what the actual assumptions are, and improved it a wee little bit.

The end goal here is to simply avoid calling the early_*() functions
when it is _not_ early.  Those functions stop working as soon as
free_initmem() is called.  system_state is set to SYSTEM_RUNNING
just after free_initmem() is called, so it seems appropriate to use
here.

I did think twice about actually using SYSTEM_RUNNING because we
moved away from it in other parts of memory hotplug, but those
were actually for _allocations_ in favor of slab_is_available(),
and we really don't care about the slab here.

The only other assumption is that all memory-hotplug-time pages
given to memmap_init_zone() are valid and able to be onlined into
any zone after the system is running.  The "valid" part is
really just a question of whether or not a 'struct page' is there
for the pfn, and *not* whether there is actual memory.  Since
all sparsemem sections have contiguous mem_map[]s within them,
and we only memory hotplug entire sparsemem sections, we can
be confident that this assumption will hold.

As for the memory being in the right node, we'll assume that
memory hotplug is putting things in the right node.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>

---

 lxc-dave/init/main.c     |    4 ++++
 lxc-dave/mm/page_alloc.c |   28 +++++++++++++++++++++++++---
 2 files changed, 29 insertions(+), 3 deletions(-)

diff -puN init/main.c~sparsemem-fix init/main.c
--- lxc/init/main.c~sparsemem-fix	2006-12-15 08:49:53.000000000 -0800
+++ lxc-dave/init/main.c	2006-12-15 08:49:53.000000000 -0800
@@ -770,6 +770,10 @@ static int init(void * unused)
 	free_initmem();
 	unlock_kernel();
 	mark_rodata_ro();
+	/*
+	 * Memory hotplug requires that this system_state transition
+	 * happen after free_initmem().  (see memmap_init_zone())
+	 */
 	system_state = SYSTEM_RUNNING;
 	numa_default_policy();
 
diff -puN mm/page_alloc.c~sparsemem-fix mm/page_alloc.c
--- lxc/mm/page_alloc.c~sparsemem-fix	2006-12-15 08:49:53.000000000 -0800
+++ lxc-dave/mm/page_alloc.c	2006-12-15 08:49:53.000000000 -0800
@@ -2056,6 +2056,30 @@ static inline unsigned long wait_table_b
 
 #define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
 
+static int can_online_pfn_into_nid(unsigned long pfn, int nid)
+{
+	/*
+	 * There are two things that make this work:
+	 * 1. The early_pfn...() functions are __init and
+	 *    use __initdata.  If the system is < SYSTEM_RUNNING,
+	 *    those functions and their data will still exist.
+	 * 2. We also assume that all actual memory hotplug
+	 *    (as opposed to boot-time) calls to this are only
+	 *    for contiguous memory regions.  With sparsemem,
+	 *    this guarantee is easy because all sections are
+	 *    contiguous and we never online more than one
+	 *    section at a time.  Boot-time memory can have holes
+	 *    anywhere.
+	 */
+	if (system_state >= SYSTEM_RUNNING)
+		return 1;
+	if (!early_pfn_valid(pfn))
+		return 0;
+	if (!early_pfn_in_nid(pfn, nid))
+		return 0;
+	return 1;
+}
+
 /*
  * Initially all pages are reserved - free ones are freed
  * up by free_all_bootmem() once the early boot process is
@@ -2069,9 +2093,7 @@ void __meminit memmap_init_zone(unsigned
 	unsigned long pfn;
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
-		if (!early_pfn_valid(pfn))
-			continue;
-		if (!early_pfn_in_nid(pfn, nid))
+		if (!can_online_pfn_into_nid(pfn, nid))
 			continue;
 		page = pfn_to_page(pfn);
 		set_page_links(page, zone, nid, pfn);
_


end of thread, other threads:[~2007-01-08  6:47 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-12-15 16:53 [PATCH] Fix sparsemem on Cell Dave Hansen
2006-12-15 17:11 ` Andy Whitcroft
2006-12-15 17:24   ` Dave Hansen
2006-12-15 19:45     ` Andrew Morton
2006-12-16  8:03       ` KAMEZAWA Hiroyuki
2006-12-18 21:13         ` Dave Hansen
2006-12-18 22:15           ` Christoph Hellwig
2006-12-18 22:54         ` Arnd Bergmann
2006-12-18 23:16           ` Dave Hansen
2006-12-19  0:16             ` KAMEZAWA Hiroyuki
2006-12-19  8:59             ` Arnd Bergmann
2006-12-19 19:34               ` Dave Hansen
2007-01-06  1:10               ` [PATCH] Fix sparsemem on Cell (take 3) Dave Hansen
2007-01-06  4:52                 ` John Rose
2007-01-07  8:58                   ` Dave Hansen
2007-01-07 12:07                     ` Arnd Bergmann
2007-01-08  6:47                     ` Tim Pepper
2006-12-15 17:22 ` [PATCH] Fix sparsemem on Cell Michael Buesch
2006-12-15 17:57   ` Dave Hansen
2006-12-15 20:21 ` Benjamin Herrenschmidt
2006-12-15 17:14 Dave Hansen
2006-12-17 23:02 ` Arnd Bergmann
