linux-kernel.vger.kernel.org archive mirror
* [PATCH] [RESEND] x86_64: add memory hotremove config option
@ 2008-09-05 17:21 Gary Hade
  2008-09-05 17:44 ` Ingo Molnar
  2008-09-05 18:04 ` [PATCH] [RESEND] x86_64: " Andi Kleen
  0 siblings, 2 replies; 31+ messages in thread
From: Gary Hade @ 2008-09-05 17:21 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Yasunori Goto, Badari Pulavarty, Mel Gorman,
	Chris McDermott, Gary Hade, linux-kernel, x86, Ingo Molnar

Resending with linux-kernel@vger.kernel.org and x86@kernel.org copied
this time.  No changes other than this and modified Subject line.  The
only response so far on linux-mm has been an Acked-by: from 
Yasunori Goto <y-goto@jp.fujitsu.com>


Add memory hotremove config option to x86_64

Memory hotremove functionality can currently be configured into
the ia64, powerpc, and s390 kernels.  This patch makes it possible
to configure the memory hotremove functionality into the x86_64
kernel as well. 

Signed-off-by: Gary Hade <garyhade@us.ibm.com>

---
 arch/x86/Kconfig      |    3 +++
 arch/x86/mm/init_64.c |   18 ++++++++++++++++++
 2 files changed, 21 insertions(+)

Index: linux-2.6.27-rc5/arch/x86/Kconfig
===================================================================
--- linux-2.6.27-rc5.orig/arch/x86/Kconfig	2008-09-03 13:33:59.000000000 -0700
+++ linux-2.6.27-rc5/arch/x86/Kconfig	2008-09-03 13:34:55.000000000 -0700
@@ -1384,6 +1384,9 @@
 	def_bool y
 	depends on X86_64 || (X86_32 && HIGHMEM)

+config ARCH_ENABLE_MEMORY_HOTREMOVE
+	def_bool y
+
 config HAVE_ARCH_EARLY_PFN_TO_NID
 	def_bool X86_64
 	depends on NUMA
Index: linux-2.6.27-rc5/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.27-rc5.orig/arch/x86/mm/init_64.c	2008-09-03 13:34:08.000000000 -0700
+++ linux-2.6.27-rc5/arch/x86/mm/init_64.c	2008-09-03 13:34:55.000000000 -0700
@@ -740,6 +740,24 @@
 EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 #endif

+#ifdef CONFIG_MEMORY_HOTREMOVE
+int remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn, end_pfn;
+	unsigned long timeout = 120 * HZ;
+	int ret;
+	start_pfn = start >> PAGE_SHIFT;
+	end_pfn = start_pfn + (size >> PAGE_SHIFT);
+	ret = offline_pages(start_pfn, end_pfn, timeout);
+	if (ret)
+		goto out;
+	/* Arch-specific calls go here */
+out:
+	return ret;
+}
+EXPORT_SYMBOL_GPL(remove_memory);
+#endif /* CONFIG_MEMORY_HOTREMOVE */
+
 #endif /* CONFIG_MEMORY_HOTPLUG */

 /*

^ permalink raw reply	[flat|nested] 31+ messages in thread
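
Editor's illustration (not part of the thread): the address arithmetic in
the proposed remove_memory() can be checked in user space. This is a
minimal sketch, assuming 4 KiB pages as on x86_64; SKETCH_PAGE_SHIFT,
struct pfn_range and bytes_to_pfn_range() are hypothetical names, not
kernel symbols.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages. The kernel takes this value from PAGE_SHIFT. */
#define SKETCH_PAGE_SHIFT 12

/* Half-open page frame range [start_pfn, end_pfn). */
struct pfn_range {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/*
 * Mirror the conversion done in the patch: shift the byte address and
 * the byte length down to page frame numbers. A size that is not a
 * multiple of the page size is truncated, exactly as in the patch.
 */
static struct pfn_range bytes_to_pfn_range(uint64_t start, uint64_t size)
{
	struct pfn_range r;

	r.start_pfn = (unsigned long)(start >> SKETCH_PAGE_SHIFT);
	r.end_pfn = r.start_pfn + (unsigned long)(size >> SKETCH_PAGE_SHIFT);
	return r;
}
```

For a 128 MiB range starting at 4 GiB (start 0x100000000, size 0x8000000)
this gives start_pfn 0x100000 and end_pfn 0x108000, which is the range the
patch would hand to offline_pages().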

* Re: [PATCH] [RESEND] x86_64: add memory hotremove config option
  2008-09-05 17:21 [PATCH] [RESEND] x86_64: add memory hotremove config option Gary Hade
@ 2008-09-05 17:44 ` Ingo Molnar
  2008-09-05 18:14   ` Badari Pulavarty
  2008-09-05 18:04 ` [PATCH] [RESEND] x86_64: " Andi Kleen
  1 sibling, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2008-09-05 17:44 UTC (permalink / raw)
  To: Gary Hade
  Cc: linux-mm, Andrew Morton, Yasunori Goto, Badari Pulavarty,
	Mel Gorman, Chris McDermott, linux-kernel, x86


* Gary Hade <garyhade@us.ibm.com> wrote:

> Add memory hotremove config option to x86_64
> 
> Memory hotremove functionality can currently be configured into the 
> ia64, powerpc, and s390 kernels.  This patch makes it possible to 
> configure the memory hotremove functionality into the x86_64 kernel as 
> well.

hm, why is it for 64-bit only?

> +++ linux-2.6.27-rc5/arch/x86/Kconfig	2008-09-03 13:34:55.000000000 -0700
> @@ -1384,6 +1384,9 @@
>  	def_bool y
>  	depends on X86_64 || (X86_32 && HIGHMEM)
> 
> +config ARCH_ENABLE_MEMORY_HOTREMOVE
> +	def_bool y

so this will break the build on 32-bit, if CONFIG_MEMORY_HOTREMOVE=y? 
mm/memory_hotplug.c assumes that remove_memory() is provided by the 
architecture.

> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +int remove_memory(u64 start, u64 size)
> +{
> +	unsigned long start_pfn, end_pfn;
> +	unsigned long timeout = 120 * HZ;
> +	int ret;
> +	start_pfn = start >> PAGE_SHIFT;
> +	end_pfn = start_pfn + (size >> PAGE_SHIFT);
> +	ret = offline_pages(start_pfn, end_pfn, timeout);
> +	if (ret)
> +		goto out;
> +	/* Arch-specific calls go here */
> +out:
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(remove_memory);
> +#endif /* CONFIG_MEMORY_HOTREMOVE */

hm, nothing appears to be arch-specific about this trivial wrapper 
around offline_pages().

Shouldnt this be moved to the CONFIG_MEMORY_HOTREMOVE portion of 
mm/memory_hotplug.c instead, as a weak function? That way architectures 
only have to enable ARCH_ENABLE_MEMORY_HOTREMOVE - and architectures 
with different/special needs can override it.

	Ingo
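
Editor's illustration (not part of the thread): the weak-function idea
suggested above can be sketched in user space with the GCC/Clang weak
attribute (the kernel spells it __weak). Everything here is a stand-in:
offline_pages_stub() is a hypothetical replacement for the kernel's
offline_pages(), and remove_memory_sketch() is not the function that was
eventually merged.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages. The kernel takes this value from PAGE_SHIFT. */
#define SKETCH_PAGE_SHIFT 12

/*
 * Hypothetical stand-in for offline_pages(). It only checks that the
 * range is non-empty; the real function isolates and migrates pages.
 */
static int offline_pages_stub(unsigned long start_pfn, unsigned long end_pfn,
			      unsigned long timeout)
{
	(void)timeout;
	return start_pfn < end_pfn ? 0 : -1;
}

/*
 * Generic default. Because the definition is weak, an object file that
 * provides a normal (strong) definition with the same name replaces it
 * at link time, which is how an architecture with special needs could
 * override the generic version.
 */
__attribute__((weak)) int remove_memory_sketch(uint64_t start, uint64_t size)
{
	unsigned long start_pfn = (unsigned long)(start >> SKETCH_PAGE_SHIFT);
	unsigned long end_pfn = start_pfn +
				(unsigned long)(size >> SKETCH_PAGE_SHIFT);
	unsigned long timeout = 120;	/* placeholder for 120 * HZ */

	return offline_pages_stub(start_pfn, end_pfn, timeout);
}
```

If every architecture would use the identical body, the weak attribute
buys nothing and a plain function in common code is simpler.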


* Re: [PATCH] [RESEND] x86_64: add memory hotremove config option
  2008-09-05 17:21 [PATCH] [RESEND] x86_64: add memory hotremove config option Gary Hade
  2008-09-05 17:44 ` Ingo Molnar
@ 2008-09-05 18:04 ` Andi Kleen
  2008-09-05 18:31   ` Badari Pulavarty
  2008-09-05 19:53   ` Gary Hade
  1 sibling, 2 replies; 31+ messages in thread
From: Andi Kleen @ 2008-09-05 18:04 UTC (permalink / raw)
  To: Gary Hade
  Cc: linux-mm, Andrew Morton, Yasunori Goto, Badari Pulavarty,
	Mel Gorman, Chris McDermott, linux-kernel, x86, Ingo Molnar

Gary Hade <garyhade@us.ibm.com> writes:
>
> Add memory hotremove config option to x86_64
>
> Memory hotremove functionality can currently be configured into
> the ia64, powerpc, and s390 kernels.  This patch makes it possible
> to configure the memory hotremove functionality into the x86_64
> kernel as well. 

You forgot to describe how you tested it. Does it actually work?
And why do you want to do it? What's the use case?

The general understanding was that it doesn't work very well on a real
machine at least because it cannot be controlled how that memory maps
to real pluggable hardware (and you cannot completely empty a node at runtime)
and a Hypervisor would likely use different interfaces anyways.

-Andi


* Re: [PATCH] [RESEND] x86_64: add memory hotremove config option
  2008-09-05 17:44 ` Ingo Molnar
@ 2008-09-05 18:14   ` Badari Pulavarty
  2008-09-05 18:17     ` Ingo Molnar
  0 siblings, 1 reply; 31+ messages in thread
From: Badari Pulavarty @ 2008-09-05 18:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Gary Hade, linux-mm, Andrew Morton, Yasunori Goto, Mel Gorman,
	Chris McDermott, linux-kernel, x86


On Fri, 2008-09-05 at 19:44 +0200, Ingo Molnar wrote:
> * Gary Hade <garyhade@us.ibm.com> wrote:
> 
> > Add memory hotremove config option to x86_64
> > 
> > Memory hotremove functionality can currently be configured into the 
> > ia64, powerpc, and s390 kernels.  This patch makes it possible to 
> > configure the memory hotremove functionality into the x86_64 kernel as 
> > well.
> 
> hm, why is it for 64-bit only?
> 
> > +++ linux-2.6.27-rc5/arch/x86/Kconfig	2008-09-03 13:34:55.000000000 -0700
> > @@ -1384,6 +1384,9 @@
> >  	def_bool y
> >  	depends on X86_64 || (X86_32 && HIGHMEM)
> > 
> > +config ARCH_ENABLE_MEMORY_HOTREMOVE
> > +	def_bool y
> 
> so this will break the build on 32-bit, if CONFIG_MEMORY_HOTREMOVE=y? 
> mm/memory_hotplug.c assumes that remove_memory() is provided by the 
> architecture.
> 
> > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > +int remove_memory(u64 start, u64 size)
> > +{
> > +	unsigned long start_pfn, end_pfn;
> > +	unsigned long timeout = 120 * HZ;
> > +	int ret;
> > +	start_pfn = start >> PAGE_SHIFT;
> > +	end_pfn = start_pfn + (size >> PAGE_SHIFT);
> > +	ret = offline_pages(start_pfn, end_pfn, timeout);
> > +	if (ret)
> > +		goto out;
> > +	/* Arch-specific calls go here */
> > +out:
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(remove_memory);
> > +#endif /* CONFIG_MEMORY_HOTREMOVE */
> 
> hm, nothing appears to be arch-specific about this trivial wrapper 
> around offline_pages().

Yes. All the archs (ppc64, ia64, s390, x86_64) have exact same
function. No architecture needed special handling so far (initial
versions of ppc64 needed extra handling, but I moved the code
to different place). 

We can make this generic and kill all arch-specific ones.
Initially, we didn't know if any arch needs special handling -
so ended up having private functions for each arch.  
I think it's time to merge them all.

> 
> Shouldnt this be moved to the CONFIG_MEMORY_HOTREMOVE portion of 
> mm/memory_hotplug.c instead, as a weak function? That way architectures 
> only have to enable ARCH_ENABLE_MEMORY_HOTREMOVE - and architectures 
> with different/special needs can override it.

Yes. We should do that. I will send out a patch.

Thanks,
Badari



* Re: [PATCH] [RESEND] x86_64: add memory hotremove config option
  2008-09-05 18:14   ` Badari Pulavarty
@ 2008-09-05 18:17     ` Ingo Molnar
  2008-09-08 21:52       ` [PATCH] Cleanup to make remove_memory() arch neutral Badari Pulavarty
  2008-09-08 21:56       ` [PATCH] x86: add memory hotremove config option Badari Pulavarty
  0 siblings, 2 replies; 31+ messages in thread
From: Ingo Molnar @ 2008-09-05 18:17 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Gary Hade, linux-mm, Andrew Morton, Yasunori Goto, Mel Gorman,
	Chris McDermott, linux-kernel, x86


* Badari Pulavarty <pbadari@us.ibm.com> wrote:

> 
> On Fri, 2008-09-05 at 19:44 +0200, Ingo Molnar wrote:
> > * Gary Hade <garyhade@us.ibm.com> wrote:
> > 
> > > Add memory hotremove config option to x86_64
> > > 
> > > Memory hotremove functionality can currently be configured into the 
> > > ia64, powerpc, and s390 kernels.  This patch makes it possible to 
> > > configure the memory hotremove functionality into the x86_64 kernel as 
> > > well.
> > 
> > hm, why is it for 64-bit only?
> > 
> > > +++ linux-2.6.27-rc5/arch/x86/Kconfig	2008-09-03 13:34:55.000000000 -0700
> > > @@ -1384,6 +1384,9 @@
> > >  	def_bool y
> > >  	depends on X86_64 || (X86_32 && HIGHMEM)
> > > 
> > > +config ARCH_ENABLE_MEMORY_HOTREMOVE
> > > +	def_bool y
> > 
> > so this will break the build on 32-bit, if CONFIG_MEMORY_HOTREMOVE=y? 
> > mm/memory_hotplug.c assumes that remove_memory() is provided by the 
> > architecture.
> > 
> > > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > > +int remove_memory(u64 start, u64 size)
> > > +{
> > > +	unsigned long start_pfn, end_pfn;
> > > +	unsigned long timeout = 120 * HZ;
> > > +	int ret;
> > > +	start_pfn = start >> PAGE_SHIFT;
> > > +	end_pfn = start_pfn + (size >> PAGE_SHIFT);
> > > +	ret = offline_pages(start_pfn, end_pfn, timeout);
> > > +	if (ret)
> > > +		goto out;
> > > +	/* Arch-specific calls go here */
> > > +out:
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL_GPL(remove_memory);
> > > +#endif /* CONFIG_MEMORY_HOTREMOVE */
> > 
> > hm, nothing appears to be arch-specific about this trivial wrapper 
> > around offline_pages().
> 
> Yes. All the archs (ppc64, ia64, s390, x86_64) have exact same
> function. No architecture needed special handling so far (initial
> versions of ppc64 needed extra handling, but I moved the code
> to different place). 
> 
> We can make this generic and kill all arch-specific ones.
> Initially, we didn't know if any arch needs special handling -
> so ended up having private functions for each arch.  
> I think it's time to merge them all.
>
> > Shouldnt this be moved to the CONFIG_MEMORY_HOTREMOVE portion of 
> > mm/memory_hotplug.c instead, as a weak function? That way architectures 
> > only have to enable ARCH_ENABLE_MEMORY_HOTREMOVE - and architectures 
> > with different/special needs can override it.
> 
> Yes. We should do that. I will send out a patch.

ok - if all architectures have the same function then please make it a 
regular function not a weak one, and remove all the duplications.

	Ingo


* Re: [PATCH] [RESEND] x86_64: add memory hotremove config option
  2008-09-05 18:04 ` [PATCH] [RESEND] x86_64: " Andi Kleen
@ 2008-09-05 18:31   ` Badari Pulavarty
  2008-09-05 18:54     ` Andi Kleen
  2008-09-05 19:53   ` Gary Hade
  1 sibling, 1 reply; 31+ messages in thread
From: Badari Pulavarty @ 2008-09-05 18:31 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Gary Hade, linux-mm, Andrew Morton, Yasunori Goto, Mel Gorman,
	Chris McDermott, linux-kernel, x86, Ingo Molnar


On Fri, 2008-09-05 at 20:04 +0200, Andi Kleen wrote:
> Gary Hade <garyhade@us.ibm.com> writes:
> >
> > Add memory hotremove config option to x86_64
> >
> > Memory hotremove functionality can currently be configured into
> > the ia64, powerpc, and s390 kernels.  This patch makes it possible
> > to configure the memory hotremove functionality into the x86_64
> > kernel as well. 
> 
> You forgot to describe how you tested it. Does it actually work?
> And why do you want to do it? What's the use case?

I will let Gary answer these :)

> The general understanding was that it doesn't work very well on a real
> machine at least because it cannot be controlled how that memory maps
> to real pluggable hardware (and you cannot completely empty a node at runtime)
> and a Hypervisor would likely use different interfaces anyways.

At this time we are interested in node removal (on x86_64).
It doesn't really work well at this time - due to some of the structures
(pgdat etc) being striped across all nodes. There is no easy way to
relocate them. Yasunori Goto is working on patches to address some of
these issues.

But we are considering adding support to restrict/skip bootmem
allocations on selected nodes. That way, we should be able to do
node remove.

(BTW, on ppc64 this works fine - since we are interested mostly in
removing *some* sections of memory to give it back to hypervisor - 
not entire node removal).

Thanks,
Badari



* Re: [PATCH] [RESEND] x86_64: add memory hotremove config option
  2008-09-05 18:31   ` Badari Pulavarty
@ 2008-09-05 18:54     ` Andi Kleen
  2008-09-05 22:34       ` Badari Pulavarty
  0 siblings, 1 reply; 31+ messages in thread
From: Andi Kleen @ 2008-09-05 18:54 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Andi Kleen, Gary Hade, linux-mm, Andrew Morton, Yasunori Goto,
	Mel Gorman, Chris McDermott, linux-kernel, x86, Ingo Molnar

> At this time we are interested in node removal (on x86_64).
> It doesn't really work well at this time - 

That's a quite euphemistic way to put it.

> due to some of the structures

That means you can never put any slab data on specific nodes.
And all the kernel subsystems on that node will not ever get local
memory.  How are you going to solve that?  And if you disallow
kernel allocations in so large memory areas you get many of the highmem
issues that plagued 32bit back in the 64bit kernel.

There are lots of other issues. It's quite questionable if this
whole exercise makes sense at all.

> (BTW, on ppc64 this works fine - since we are interested mostly in
> removing *some* sections of memory to give it back to hypervisor - 
> not entire node removal).

Ok for hypervisors you can do it reasonably easy on x86 too, but it's likely
that some hypercall interface is better than going through
sysfs. 

-Andi


* Re: [PATCH] [RESEND] x86_64: add memory hotremove config option
  2008-09-05 18:04 ` [PATCH] [RESEND] x86_64: " Andi Kleen
  2008-09-05 18:31   ` Badari Pulavarty
@ 2008-09-05 19:53   ` Gary Hade
  2008-09-05 20:04     ` Andi Kleen
  1 sibling, 1 reply; 31+ messages in thread
From: Gary Hade @ 2008-09-05 19:53 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Gary Hade, linux-mm, Andrew Morton, Yasunori Goto,
	Badari Pulavarty, Mel Gorman, Chris McDermott, linux-kernel, x86,
	Ingo Molnar

On Fri, Sep 05, 2008 at 08:04:55PM +0200, Andi Kleen wrote:
> Gary Hade <garyhade@us.ibm.com> writes:
> >
> > Add memory hotremove config option to x86_64
> >
> > Memory hotremove functionality can currently be configured into
> > the ia64, powerpc, and s390 kernels.  This patch makes it possible
> > to configure the memory hotremove functionality into the x86_64
> > kernel as well. 
> 
> You forgot to describe how you tested it. Does it actually work?

So far, I have tested it on a 2-node IBM x460, 2-node IBM x3950, and
a 4-node IBM x3950 M2 and have been able to successfully offline and
re-online all memory sections marked as removable multiple times with
no apparent problems.

By directing the change to -mm our hope is that others will try it
on their systems and help us shake out any issues that they may find.

> And why do you want to do it? What's the use case?

A baby step towards eventual total node removal.

> 
> The general understanding was that it doesn't work very well on a real
> machine at least because it cannot be controlled how that memory maps
> to real pluggable hardware (and you cannot completely empty a node at runtime)
> and a Hypervisor would likely use different interfaces anyways.

The inability to offline all non-primary node memory sections
certainly needs to be addressed.  The pgdat removal work that
Yasunori Goto has started will hopefully continue and help resolve
this issue.  We have only just started thinking about issues related
to resources other than CPUs and memory that will need to be released
in preparation for node removal (e.g. memory and i/o resources
assigned to PCI devices on a node targeted for removal).  Much of
this is new territory for us so any suggestions that you and others
can offer will be much appreciated.

Thanks for asking.

Gary

-- 
Gary Hade
System x Enablement
IBM Linux Technology Center
503-578-4503  IBM T/L: 775-4503
garyhade@us.ibm.com
http://www.ibm.com/linux/ltc



* Re: [PATCH] [RESEND] x86_64: add memory hotremove config option
  2008-09-05 19:53   ` Gary Hade
@ 2008-09-05 20:04     ` Andi Kleen
  2008-09-05 21:54       ` Gary Hade
  0 siblings, 1 reply; 31+ messages in thread
From: Andi Kleen @ 2008-09-05 20:04 UTC (permalink / raw)
  To: Gary Hade
  Cc: Andi Kleen, linux-mm, Andrew Morton, Yasunori Goto,
	Badari Pulavarty, Mel Gorman, Chris McDermott, linux-kernel, x86,
	Ingo Molnar

> The inability to offline all non-primary node memory sections
> certainly needs to be addressed.  The pgdat removal work that
> Yasunori Goto has started will hopefully continue and help resolve
> this issue. 

You make it sound like it's just some minor technical hurdle
that needs to be addressed. But from all analysis of these issues
I've seen so far it's extremely hard and all possible solutions
have serious issues. So before doing some baby steps there
should be at least some general idea how this thing is supposed
to work in the end.

> We have only just started thinking about issues related
> to resources other than CPUs and memory that will need to be released
> in preparation for node removal (e.g. memory and i/o resources
> assigned to PCI devices on a node targeted for removal). 

That's the easy stuff. The hard parts are all the kernel objects
that you cannot move.

-Andi



* Re: [PATCH] [RESEND] x86_64: add memory hotremove config option
  2008-09-05 20:04     ` Andi Kleen
@ 2008-09-05 21:54       ` Gary Hade
  2008-09-06  0:01         ` Andi Kleen
  0 siblings, 1 reply; 31+ messages in thread
From: Gary Hade @ 2008-09-05 21:54 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Gary Hade, linux-mm, Andrew Morton, Yasunori Goto,
	Badari Pulavarty, Mel Gorman, Chris McDermott, linux-kernel, x86,
	Ingo Molnar

On Fri, Sep 05, 2008 at 10:04:01PM +0200, Andi Kleen wrote:
> > The inability to offline all non-primary node memory sections
> > certainly needs to be addressed.  The pgdat removal work that
> > Yasunori Goto has started will hopefully continue and help resolve
> > this issue. 
> 
> You make it sound like it's just some minor technical hurdle
> that needs to be addressed.

Sorry, that was not my intent.

> But from all analysis of these issues
> I've seen so far it's extremely hard and all possible solutions
> have serious issues. So before doing some baby steps there
> should be at least some general idea how this thing is supposed
> to work in the end.

I am not sure if I understand why you appear to be opposed to
enabling the hotremove function before all the issues related
to an eventual goal of being able to free all memory on a node
are addressed.  Even in the absence of solutions for these issues
it seems like there could still be other possible benefits such
as the ability to selectively expand and shrink available memory
for testing or debugging purposes.  I believe it would also be
helpful to those working on or testing possible solutions for
the removal issues.

Gary

-- 
Gary Hade
System x Enablement
IBM Linux Technology Center
503-578-4503  IBM T/L: 775-4503
garyhade@us.ibm.com
http://www.ibm.com/linux/ltc



* Re: [PATCH] [RESEND] x86_64: add memory hotremove config option
  2008-09-05 18:54     ` Andi Kleen
@ 2008-09-05 22:34       ` Badari Pulavarty
  0 siblings, 0 replies; 31+ messages in thread
From: Badari Pulavarty @ 2008-09-05 22:34 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Gary Hade, linux-mm, Andrew Morton, Yasunori Goto, Mel Gorman,
	Chris McDermott, linux-kernel, x86, Ingo Molnar


On Fri, 2008-09-05 at 20:54 +0200, Andi Kleen wrote:
> > At this time we are interested in node removal (on x86_64).
> > It doesn't really work well at this time - 
> 
> That's a quite euphemistic way to put it.
> 
> > due to some of the structures
> 
> That means you can never put any slab data on specific nodes.
> And all the kernel subsystems on that node will not ever get local
> memory.  How are you going to solve that?  And if you disallow
> kernel allocations in so large memory areas you get many of the highmem
> issues that plagued 32bit back in the 64bit kernel.

You are absolutely correct. There is no easy solution - one has 
to lose performance in order to support node removal, along with
some old x86 issues :(

We were contemplating the idea of limiting node removal to a few
select nodes as a compromise - but it didn't sound right :(

> 
> There are lots of other issues. It's quite questionable if this
> whole exercise makes sense at all.

Same issues exist with ia64 and x86_64 won't be any worse off.
Gary was trying to enable the functionality so that we can at least
test offlining memory sections more easily (test page migration,
isolation code and hash out issues).

Another possible idea being considered (still a lot of unknowns) is
to make use of the offline memory section feature for power management
(*cough*).

Anyway, as you can see this patch doesn't add any code - just
enables config option for x86_64. (if you are worried about
code bloat).

> > (BTW, on ppc64 this works fine - since we are interested mostly in
> > removing *some* sections of memory to give it back to hypervisor - 
> > not entire node removal).
> 
> Ok for hypervisors you can do it reasonably easy on x86 too, but it's likely
> that some hypercall interface is better than going through
> sysfs. 

sysfs interface already exists to offline sections of memory. (same
interface as online).

The proposed patch provides an easy way to find out which sections of
memory belong to which node. (could be useful on its own).

Thanks,
Badari
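
Editor's illustration (not part of the thread): the sysfs interface
mentioned above exposes one directory per memory section under
/sys/devices/system/memory/, each with a "state" file that accepts
"online" or "offline". This is a minimal user-space sketch, assuming that
layout; memory_state_path() and set_memory_state() are hypothetical helper
names, and the write needs root plus a kernel built with memory hotplug
support.

```c
#include <stdio.h>
#include <string.h>

/* Build "/sys/devices/system/memory/memory<N>/state" into buf. */
static int memory_state_path(char *buf, size_t len, unsigned int section)
{
	int n = snprintf(buf, len,
			 "/sys/devices/system/memory/memory%u/state", section);

	/* Fail on an encoding error or if the path did not fit. */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Request a state change for one memory section. Returns 0 on success
 * and -1 on any failure, including the kernel refusing the request
 * (for example because the section holds unmovable pages).
 */
static int set_memory_state(unsigned int section, const char *state)
{
	char path[64];
	FILE *f;
	int ret = 0;

	if (strcmp(state, "online") != 0 && strcmp(state, "offline") != 0)
		return -1;
	if (memory_state_path(path, sizeof(path), section) != 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs(state, f) == EOF)
		ret = -1;
	/* The write is flushed on close, so errors can surface here too. */
	if (fclose(f) != 0)
		ret = -1;
	return ret;
}
```

A test loop like the one Gary describes would call
set_memory_state(n, "offline") followed by set_memory_state(n, "online")
for each section that reports itself as removable.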



* Re: [PATCH] [RESEND] x86_64: add memory hotremove config option
  2008-09-05 21:54       ` Gary Hade
@ 2008-09-06  0:01         ` Andi Kleen
  2008-09-06  7:06           ` Yasunori Goto
  0 siblings, 1 reply; 31+ messages in thread
From: Andi Kleen @ 2008-09-06  0:01 UTC (permalink / raw)
  To: Gary Hade
  Cc: Andi Kleen, linux-mm, Andrew Morton, Yasunori Goto,
	Badari Pulavarty, Mel Gorman, Chris McDermott, linux-kernel, x86,
	Ingo Molnar

> I am not sure if I understand why you appear to be opposed to
> enabling the hotremove function before all the issues related

I'm quite sceptical that it can be ever made to work in a useful
way for real hardware (as opposed to an hypervisor para virtual setup
for which this interface is not the right way -- it should be done
in some specific driver instead) 

And if it cannot be made to work then it will be a false promise
to the user. They will see it and think it will work, but it will
not.

This means I don't see a real use case for this feature.

-Andi



* Re: [PATCH] [RESEND] x86_64: add memory hotremove config option
  2008-09-06  0:01         ` Andi Kleen
@ 2008-09-06  7:06           ` Yasunori Goto
  2008-09-06  8:53             ` Andi Kleen
                               ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: Yasunori Goto @ 2008-09-06  7:06 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Gary Hade, linux-mm, Andrew Morton, Badari Pulavarty, Mel Gorman,
	Chris McDermott, linux-kernel, x86, Ingo Molnar

> > I am not sure if I understand why you appear to be opposed to
> > enabling the hotremove function before all the issues related
> 
> I'm quite sceptical that it can be ever made to work in a useful
> way for real hardware (as opposed to an hypervisor para virtual setup
> for which this interface is not the right way -- it should be done
> in some specific driver instead) 
> And if it cannot be made to work then it will be a false promise
> to the user. They will see it and think it will work, but it will
> not.
> 
> This means I don't see a real use case for this feature.

I don't think its driver is almighty.
IIRC, balloon driver can be cause of fragmentation for 24-7 system.

In addition, I have heard that memory hotplug would be useful for reducing
of power consumption of DIMM.

I have to admit that memory hotplug has many issues, but I would like to
solve them step by step.


Thanks.
-- 
Yasunori Goto 




* Re: [PATCH] [RESEND] x86_64: add memory hotremove config option
  2008-09-06  7:06           ` Yasunori Goto
@ 2008-09-06  8:53             ` Andi Kleen
  2008-09-08  5:52               ` Nick Piggin
  2008-09-06 14:33             ` Ingo Molnar
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 31+ messages in thread
From: Andi Kleen @ 2008-09-06  8:53 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: Andi Kleen, Gary Hade, linux-mm, Andrew Morton, Badari Pulavarty,
	Mel Gorman, Chris McDermott, linux-kernel, x86, Ingo Molnar

On Sat, Sep 06, 2008 at 04:06:38PM +0900, Yasunori Goto wrote:
> > not.
> > 
> > This means I don't see a real use case for this feature.
> 
> I don't think its driver is almighty.
> IIRC, balloon driver can be cause of fragmentation for 24-7 system.

Sure the balloon driver can be likely improved too, it's just
that I don't think a balloon driver should call into the function
the original patch in the series hooked up.
> 
> In addition, I have heard that memory hotplug would be useful for reducing
> of power consumption of DIMM.

It's unclear that memory hotplug is the right model for DIMM power management.
The problem is that DIMMs are interleaved, so you again have to completely
free a quite large area. It's not much easier than node hotplug.

> I have to admit that memory hotplug has many issues, but I would like to

Let's call it "node" or "hardware" memory hot unplug, so that no
one confuses it with the easier VM based hot unplug or the really
easy hotadd.

> solve them step by step.

The question is if they are even solvable in a useful way.
I'm not sure it's that useful to start and then find out
that it doesn't work anyways.

-Andi



* Re: [PATCH] [RESEND] x86_64: add memory hotremove config option
  2008-09-06  7:06           ` Yasunori Goto
  2008-09-06  8:53             ` Andi Kleen
@ 2008-09-06 14:33             ` Ingo Molnar
  2008-09-06 16:00             ` kamezawa.hiroyu
  2008-09-06 16:05             ` kamezawa.hiroyu
  3 siblings, 0 replies; 31+ messages in thread
From: Ingo Molnar @ 2008-09-06 14:33 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: Andi Kleen, Gary Hade, linux-mm, Andrew Morton, Badari Pulavarty,
	Mel Gorman, Chris McDermott, linux-kernel, x86


* Yasunori Goto <y-goto@jp.fujitsu.com> wrote:

> I don't think its driver is almighty. IIRC, balloon driver can be 
> cause of fragmentation for 24-7 system.
> 
> In addition, I have heard that memory hotplug would be useful for 
> reducing of power consumption of DIMM.
> 
> I have to admit that memory hotplug has many issues, but I would like 
> to solve them step by step.

What would be nice is to insert the information both during bootup and 
in /proc/meminfo and 'free' output that hot-removable memory segments 
are not generic free memory, it's currently a limited resource that 
might or might not be sufficient to serve a given workload.

Perhaps even exclude it from 'total' memory reported by meminfo - to be 
on the safe side of user expectations. In terms of user-space memory it 
is already generic swappable memory but in terms of kernel-space 
allocations it is not.

As i said it earlier in the thread, i certainly have no objections from 
the x86 maintenance side - nothing is worse than a generic kernel 
feature only available on certain less frequently used platforms. Memory 
hotplug has been available for some time in the MM and it's not really 
causing any maintenance trouble at the moment and it is not enabled by 
default either.

Having said that, i have my doubts about its generic utility (the power 
saving aspects are likely not realizable - nobody really wants DIMMs to 
just sit there unused and the cost of dynamic migration is just 
horrendous) - but as long as it's opt-in there's no reason to limit the 
availability of an in-kernel feature artificially.

Removing those limitations of kernel-space allocations should indeed be 
done in baby steps - and whether it's worth turning such memory into 
completely generic kernel memory is an open question.

But the fact that a piece of memory is not fully generic is no reason 
not to allow users to create special, capability-limited RAM resources 
like they can already do via hugetlbfs or ramfs, as long as the 
capability limitations are advertised clearly.

Yes, memory hotplug has limitations we all understand, but still it's an 
arguably useful feature in some circumstances. If we never give a 
feature a chance to evolve on the main Linux platform that 90%+ of our 
users use it wont ever be truly useful.

Please send the new patches against -git or -tip and we can put them 
into a separate standalone feature topic and can test it on various x86 
boxes and send them towards linux-next if Andrew agrees with that 
process too.

Btw., it would be nice if memory hotplug had a self-test that could be 
activated from the .config and would run autonomously (a bit like 
rcu-torture): it would mark say 10% of all RAM as hot-pluggable during 
bootup and would periodically hot-plug and hot-unplug that memory, every 
10 seconds or 30 seconds or so, transparently. That would also test the 
x86 architecture's pagetable init code, the page migration code, etc. 
(Disabled by default and dependent on DEBUG_KERNEL && EXPERIMENTAL.)

	Ingo


* Re: Re: [PATCH] [RESEND] x86_64: add memory hotremove config option
  2008-09-06  7:06           ` Yasunori Goto
  2008-09-06  8:53             ` Andi Kleen
  2008-09-06 14:33             ` Ingo Molnar
@ 2008-09-06 16:00             ` kamezawa.hiroyu
  2008-09-06 16:17               ` Ingo Molnar
  2008-09-06 16:05             ` kamezawa.hiroyu
  3 siblings, 1 reply; 31+ messages in thread
From: kamezawa.hiroyu @ 2008-09-06 16:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Yasunori Goto, Andi Kleen, Gary Hade, linux-mm, Andrew Morton,
	Badari Pulavarty, Mel Gorman, Chris McDermott, linux-kernel, x86

----- Original Message -----
>* Yasunori Goto <y-goto@jp.fujitsu.com> wrote:
>
>> I don't think its driver is almighty. IIRC, balloon driver can be 
>> cause of fragmentation for 24-7 system.
>> 
>> In addition, I have heard that memory hotplug would be useful for 
>> reducing of power consumption of DIMM.
>> 
>> I have to admit that memory hotplug has many issues, but I would like 
>> to solve them step by step.
>
>What would be nice is to insert the information both during bootup and 
>in /proc/meminfo and 'free' output that hot-removable memory segments 
>are not generic free memory, it's currently a limited resource that 
>might or might not be sufficient to serve a given workload.
>
>Perhaps even exclude it from 'total' memory reported by meminfo - to be 
>on the safe side of user expectations. In terms of user-space memory it 
>is already generic swappable memory but in terms of kernel-space 
>allocations it is not.
>
I wonder why nobody is talking about ZONE_MOVABLE... When I wrote memory
hotplug, I assumed the help of ZONE_MOVABLE and SPARSEMEM. It is shown in
meminfo. (I think memory hotplug is useful only when ZONE_MOVABLE is used.)

Most of the problems Goto wrote about are mainly about the placement of
memmap, pgdat, and zones. One example is that "when SPARSEMEM_VMEMMAP is
enabled, memmap is not removed even when memory is removed."


>As i said it earlier in the thread, i certainly have no objections from 
>the x86 maintenance side - nothing is worse than a generic kernel 
>feature only available on certain less frequently used platforms. Memory 
>hotplug has been available for some time in the MM and it's not really 
>causing any maintenance trouble at the moment and it is not enabled by 
>default either.
>
>Having said that, i have my doubts about its generic utility (the power 
>saving aspects are likely not realizable - nobody really wants DIMMs to 
>just sit there unused and the cost of dynamic migration is just 
>horrendous) - but as long as it's opt-in there's no reason to limit the 
>availability of an in-kernel feature artificially.

Nobody? Maybe it is just a trade-off on the user side.
Even without DIMM hotplug or a DIMM power-save mode, is making a DIMM idle
of no use? I think memory consumes a lot of power when it is in use.
Memory hotplug and ZONE_MOVABLE can make some memory idle.
(I'm sorry if my thinking is wrong.)

>
>Removing those limitations of kernel-space allocations should indeed be 
>done in baby steps - and whether it's worth turning such memory into 
>completely generic kernel memory is an open question.
>
I think generic kernel space memory hotplug will never be available.

>But the fact that a piece of memory is not fully generic is no reason 
>not to allow users to create special, capability-limited RAM resources 
>like they can already do via hugetlbfs or ramfs, as long as the 
>capability limitations are advertised clearly.
>
Hmm, would adding a feature like
 - offline some memory at boot.
 - online-memory-as-hugetlb mode
  
be useful for generic PC users?

Regards,
-Kame

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: Re: [PATCH] [RESEND] x86_64: add memory hotremove config option
  2008-09-06  7:06           ` Yasunori Goto
                               ` (2 preceding siblings ...)
  2008-09-06 16:00             ` kamezawa.hiroyu
@ 2008-09-06 16:05             ` kamezawa.hiroyu
  3 siblings, 0 replies; 31+ messages in thread
From: kamezawa.hiroyu @ 2008-09-06 16:05 UTC (permalink / raw)
  To: kamezawa.hiroyu
  Cc: Ingo Molnar, Yasunori Goto, Andi Kleen, Gary Hade, linux-mm,
	Andrew Morton, Badari Pulavarty, Mel Gorman, Chris McDermott,
	linux-kernel, x86

----- Original Message -----
>>Having said that, i have my doubts about its generic utility (the power 
>>saving aspects are likely not realizable - nobody really wants DIMMs to 
>>just sit there unused and the cost of dynamic migration is just 
>>horrendous) - but as long as it's opt-in there's no reason to limit the 
>>availability of an in-kernel feature artificially.
>
>Nobody? Maybe it is just a trade-off on the user side.
>Even without DIMM hotplug or a DIMM power-save mode, is making a DIMM idle
>of no use? I think memory consumes a lot of power when it is in use.
>Memory hotplug and ZONE_MOVABLE can make some memory idle.
>(I'm sorry if my thinking is wrong.)
>
But I have to point out that HDD access consumes far more power than memory.
It is a trade-off that depends on usage, anyway.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: [PATCH] [RESEND] x86_64: add memory hotremove config option
  2008-09-06 16:00             ` kamezawa.hiroyu
@ 2008-09-06 16:17               ` Ingo Molnar
  0 siblings, 0 replies; 31+ messages in thread
From: Ingo Molnar @ 2008-09-06 16:17 UTC (permalink / raw)
  To: kamezawa.hiroyu
  Cc: Yasunori Goto, Andi Kleen, Gary Hade, linux-mm, Andrew Morton,
	Badari Pulavarty, Mel Gorman, Chris McDermott, linux-kernel, x86


* kamezawa.hiroyu@jp.fujitsu.com <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > Removing those limitations of kernel-space allocations should indeed 
> > be done in baby steps - and whether it's worth turning such memory 
> > into completely generic kernel memory is an open question.
>
> I think generic kernel space memory hotplug will never be available.

yeah, most likely. (It's possible technically even on a native kernel - 
just very expensive to various aspects of the kernel.)

> > But the fact that a piece of memory is not fully generic is no 
> > reason not to allow users to create special, capability-limited RAM 
> > resources like they can already do via hugetlbfs or ramfs, as long 
> > as the capability limitations are advertised clearly.
>
> Hmm, adding a feature like
>  - offline some memory at boot.
>  - online-memory-as-hugetlb mode
>   
> is useful for generic pc users ?

yeah - it's actually the way how hugetlb should be done. Plus expand 
gbpages to hugetlbfs and hotplug memory on Barcelona CPUs and you can do 
user-space apps that can run for a long time without any TLB misses. 
_That_ might make sense to explore in practice. (i'm not holding my 
breath though, TLB misses are _fast_ on the best x86 CPUs.)

But we won't be able to make such experiments without having the 
capability on x86. So i'd like to break the catch-22 by accepting all 
this into arch/x86, it certainly is simple and makes some sense, it's 
just that i'm not that convinced about it personally at the moment.

So feel free to turn it all into a killer feature (make hugetlb backed 
memory transparent to user-space, etc. etc.) that high-performance 
computing users strive for and all that will change. Please send the 
reshaped patches so we can move past the 'what if' discussion phase ;-)

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] [RESEND] x86_64: add memory hotremove config option
  2008-09-06  8:53             ` Andi Kleen
@ 2008-09-08  5:52               ` Nick Piggin
  2008-09-08  9:36                 ` Andi Kleen
  0 siblings, 1 reply; 31+ messages in thread
From: Nick Piggin @ 2008-09-08  5:52 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Yasunori Goto, Gary Hade, linux-mm, Andrew Morton,
	Badari Pulavarty, Mel Gorman, Chris McDermott, linux-kernel, x86,
	Ingo Molnar

On Saturday 06 September 2008 18:53, Andi Kleen wrote:
> On Sat, Sep 06, 2008 at 04:06:38PM +0900, Yasunori Goto wrote:
> > > not.
> > >
> > > This means I don't see a real use case for this feature.
> >
> > I don't think its driver is almighty.
> > IIRC, balloon driver can be cause of fragmentation for 24-7 system.
>
> Sure the balloon driver can be likely improved too, it's just
> that I don't think a balloon driver should call into the function
> the original patch in the series hooked up.
>
> > In addition, I have heard that memory hotplug would be useful for
> > reducing of power consumption of DIMM.
>
> It's unclear that memory hotplug is the right model for DIMM power
> management. The problem is that DIMMs are interleaved, so you again have to
> completely free a quite large area. It's not much easier than node hotplug.
>
> > I have to admit that memory hotplug has many issues, but I would like to
>
> Let's call it "node" or "hardware" memory hot unplug, not that
> anyone confuses it with the easier VM based hot unplug or the really
> easy hotadd.
>
> > solve them step by step.
>
> The question is if they are even solvable in a useful way.
> I'm not sure it's that useful to start and then find out
> that it doesn't work anyways.

You use non-linear mappings for the kernel, so that kernel data is
not tied to a specific physical address. AFAIK, that is the only way
to really do it completely (like the fragmentation problem).

Of course, I don't think that would be a good idea to do that in the
foreseeable future.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] [RESEND] x86_64: add memory hotremove config option
  2008-09-08  5:52               ` Nick Piggin
@ 2008-09-08  9:36                 ` Andi Kleen
  2008-09-08  9:46                   ` Nick Piggin
  0 siblings, 1 reply; 31+ messages in thread
From: Andi Kleen @ 2008-09-08  9:36 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Yasunori Goto, Gary Hade, linux-mm, Andrew Morton,
	Badari Pulavarty, Mel Gorman, Chris McDermott, linux-kernel, x86,
	Ingo Molnar

> You use non-linear mappings for the kernel, so that kernel data is
> not tied to a specific physical address. AFAIK, that is the only way
> to really do it completely (like the fragmentation problem).

Even with that there are lots of issues, like keeping track of 
DMAs or handling executing kernel code.

> 
> Of course, I don't think that would be a good idea to do that in the
> foreseeable future.

Agreed.

-Andi

-- 
ak@linux.intel.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] [RESEND] x86_64: add memory hotremove config option
  2008-09-08  9:36                 ` Andi Kleen
@ 2008-09-08  9:46                   ` Nick Piggin
  2008-09-08 10:30                     ` Andi Kleen
  0 siblings, 1 reply; 31+ messages in thread
From: Nick Piggin @ 2008-09-08  9:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Yasunori Goto, Gary Hade, linux-mm, Andrew Morton,
	Badari Pulavarty, Mel Gorman, Chris McDermott, linux-kernel, x86,
	Ingo Molnar

On Monday 08 September 2008 19:36, Andi Kleen wrote:
> > You use non-linear mappings for the kernel, so that kernel data is
> > not tied to a specific physical address. AFAIK, that is the only way
> > to really do it completely (like the fragmentation problem).
>
> Even with that there are lots of issues, like keeping track of
> DMAs or handling executing kernel code.

Right, but the "high level" software solution is to have nonlinear
kernel mappings. Executing kernel code should not be so hard because
it could be handled just like executing user code (ie. the CPU that
is executing will subsequently fault and be blocked until the
relocation is complete).

DMAs aren't trivial at all, but I guess there could be say, a method
to submit and revoke areas of memory for DMA, and the submit would
block if the memory is currently being relocated underneath it (then
it would be able to find the new address).

Anyway, whatever the case, yeah I'm not trying to say it is trivial
at all. Even without thinking about DMA it would be costly.


> > Of course, I don't think that would be a good idea to do that in the
> > foreseeable future.
>
> Agreed.

Same as the "anti-frag" patches. We must not proceed with this kind of
thing on the justification that "in future we'll be able to unplug any
bit of memory". Because it is not just a matter of logical steps to
reach that point, but basically a fundamental rethink of how the kernel
memory mapping should work.

Other realistic justifications are OK, but if someone wants to unplug
everything, then please put effort into *first* making the kernel
mapping nonlinear, and then we can look at the complexity and
performance costs of that fundamental step.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] [RESEND] x86_64: add memory hotremove config option
  2008-09-08  9:46                   ` Nick Piggin
@ 2008-09-08 10:30                     ` Andi Kleen
  2008-09-08 11:19                       ` Nick Piggin
  0 siblings, 1 reply; 31+ messages in thread
From: Andi Kleen @ 2008-09-08 10:30 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Yasunori Goto, Gary Hade, linux-mm, Andrew Morton,
	Badari Pulavarty, Mel Gorman, Chris McDermott, linux-kernel, x86,
	Ingo Molnar

On Mon, Sep 08, 2008 at 07:46:30PM +1000, Nick Piggin wrote:
> On Monday 08 September 2008 19:36, Andi Kleen wrote:
> > > You use non-linear mappings for the kernel, so that kernel data is
> > > not tied to a specific physical address. AFAIK, that is the only way
> > > to really do it completely (like the fragmentation problem).
> >
> > Even with that there are lots of issues, like keeping track of
> > DMAs or handling executing kernel code.
> 
> Right, but the "high level" software solution is to have nonlinear
> kernel mappings. Executing kernel code should not be so hard because
> it could be handled just like executing user code (ie. the CPU that
> is executing will subsequently fault and be blocked until the
> relocation is complete).

First, blocking arbitrary code is hard. There are some code parts
which are not allowed to block arbitrarily. Machine check or NMI
handlers come to mind, but there are likely more.

Then that would be essentially a hypervisor or micro kernel approach.
e.g. Xen does that already kind of, but even there it would
be quite hard to do fully in a general way. And for hardware hotplug
only the fully general way is actually useful unfortunately.

-Andi

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] [RESEND] x86_64: add memory hotremove config option
  2008-09-08 10:30                     ` Andi Kleen
@ 2008-09-08 11:19                       ` Nick Piggin
  2008-09-08 11:30                         ` Andi Kleen
  0 siblings, 1 reply; 31+ messages in thread
From: Nick Piggin @ 2008-09-08 11:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Yasunori Goto, Gary Hade, linux-mm, Andrew Morton,
	Badari Pulavarty, Mel Gorman, Chris McDermott, linux-kernel, x86,
	Ingo Molnar

On Monday 08 September 2008 20:30, Andi Kleen wrote:
> On Mon, Sep 08, 2008 at 07:46:30PM +1000, Nick Piggin wrote:
> > On Monday 08 September 2008 19:36, Andi Kleen wrote:
> > > > You use non-linear mappings for the kernel, so that kernel data is
> > > > not tied to a specific physical address. AFAIK, that is the only way
> > > > to really do it completely (like the fragmentation problem).
> > >
> > > Even with that there are lots of issues, like keeping track of
> > > DMAs or handling executing kernel code.
> >
> > Right, but the "high level" software solution is to have nonlinear
> > kernel mappings. Executing kernel code should not be so hard because
> > it could be handled just like executing user code (ie. the CPU that
> > is executing will subsequently fault and be blocked until the
> > relocation is complete).
>
> First, blocking arbitrary code is hard. There are some code parts
> which are not allowed to block arbitrarily. Machine check or NMI
> handlers come to mind, but there are likely more.

Sorry, by "block", I really mean spin I guess. I mean that the CPU will
be forced to stop executing due to the page fault during this sequence:

for prot RO:
alloc new page
memcpy(new, old)
ptep_clear_flush(ptep)         <--- from here
set_pte(ptep, newpte)          <--- until here

for prot RW, the window also would include the memcpy, however if that
adds too much latency for execute/reads, then it can be mapped RO first,
then memcpy, then flushed and switched.
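
The read-only sequence above can be sketched as a small user-space simulation. Everything here is a stand-in: struct sim_pte and the sim_* helpers are hypothetical names that only mirror the ordering of ptep_clear_flush() and set_pte(); there is no real TLB flush, and the code is single-threaded, so the fault-and-spin window is marked in comments rather than observable.

```c
#include <stdlib.h>
#include <string.h>

#define SIM_PAGE_SIZE 4096u

/* Hypothetical stand-in for a kernel PTE: one pointer plus a present bit. */
struct sim_pte {
	void *page;	/* backing page, NULL while unmapped */
	int present;	/* 0 inside the "fault and spin" window */
};

/* Stand-in for ptep_clear_flush(): unmap and hand back the old page. */
static void *sim_ptep_clear_flush(struct sim_pte *ptep)
{
	void *old = ptep->page;

	ptep->page = NULL;
	ptep->present = 0;
	return old;
}

/* Stand-in for set_pte(): install the new mapping. */
static void sim_set_pte(struct sim_pte *ptep, void *page)
{
	ptep->page = page;
	ptep->present = 1;
}

/*
 * Relocate a read-only page in the order given in the mail:
 * allocate, copy, clear+flush, set. Returns 0 on success, or -1 if the
 * allocation fails, in which case the old mapping is left untouched.
 */
static int sim_migrate_ro_page(struct sim_pte *ptep)
{
	void *new_page = malloc(SIM_PAGE_SIZE);
	void *old_page;

	if (!new_page)
		return -1;

	memcpy(new_page, ptep->page, SIM_PAGE_SIZE);
	old_page = sim_ptep_clear_flush(ptep);	/* window opens here */
	sim_set_pte(ptep, new_page);		/* window closes here */
	free(old_page);
	return 0;
}
```

Because the copy finishes before the mapping is cleared, readers only stall between the two marked lines; for a writable page the copy would have to move inside that window, or the page would first be remapped read-only as described.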
 

> Then that would be essentially a hypervisor or micro kernel approach.

What would be? Blocking in interrupts? Or non-linear kernel mapping in
general? Nonlinear kernel mapping I don't think anyone disputes is the
only way to defragment (for unplug or large allocations) arbitrary
physical memory with any sort of guarantee. In the future if TLB costs
grow very much larger, I think this might be worth considering.

But until that becomes inevitable, I really don't want to hack the VM
with crap like transparent variable order mappings etc. but rather
"encourage" CPU manufacturers to have big fast TLBs :)


> e.g. Xen does that already kind of, but even there it would
> be quite hard to do fully in a general way. And for hardware hotplug
> only the fully general way is actually useful unfortunately.

Yeah I don't really get the hardware hotplug thing. For reliability or
anything it should all be done in hardware (eg. warm/hot spare memory
module). For power I guess there is some argument, but I would prefer
to wait the trends out longer before committing to something big: non
volatile ram replacement for dram for example might be achieved in
future.

But if anybody disagrees, they are sure free to implement non-linear
kernel mappings and physical defragmentation and shut me up with
real numbers!


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] [RESEND] x86_64: add memory hotremove config option
  2008-09-08 11:19                       ` Nick Piggin
@ 2008-09-08 11:30                         ` Andi Kleen
  2008-09-08 13:48                           ` Nick Piggin
  0 siblings, 1 reply; 31+ messages in thread
From: Andi Kleen @ 2008-09-08 11:30 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, Yasunori Goto, Gary Hade, linux-mm, Andrew Morton,
	Badari Pulavarty, Mel Gorman, Chris McDermott, linux-kernel, x86,
	Ingo Molnar

> Sorry, by "block", I really mean spin I guess. I mean that the CPU will
> be forced to stop executing due to the page fault during this sequence:

It's hard for NMIs at least. They cannot execute faults.

In the end you would need to define a core kernel which 
cannot be remapped and the rest which can and you end up
with even more micro kernel like mess.

> ptep_clear_flush(ptep)         <--- from here
> set_pte(ptep, newpte)          <--- until here
> 
> for prot RW, the window also would include the memcpy, however if that
> adds too much latency for execute/reads, then it can be mapped RO first,
> then memcpy, then flushed and switched.
>  
> 
> > Then that would be essentially a hypervisor or micro kernel approach.
> 
> What would be? Blocking in interrupts? Or non-linear kernel mapping in

Well in general someone remapping all the memory beyond you.
That's essentially a hypervisor in my book.

-Andi

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] [RESEND] x86_64: add memory hotremove config option
  2008-09-08 11:30                         ` Andi Kleen
@ 2008-09-08 13:48                           ` Nick Piggin
  0 siblings, 0 replies; 31+ messages in thread
From: Nick Piggin @ 2008-09-08 13:48 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Yasunori Goto, Gary Hade, linux-mm, Andrew Morton,
	Badari Pulavarty, Mel Gorman, Chris McDermott, linux-kernel, x86,
	Ingo Molnar

On Monday 08 September 2008 21:30, Andi Kleen wrote:
> > Sorry, by "block", I really mean spin I guess. I mean that the CPU will
> > be forced to stop executing due to the page fault during this sequence:
>
> It's hard for NMIs at least. They cannot execute faults.

Well, just for executing code (and reading RO data), then it shouldn't
matter at all actually if the CPU starts executing from the new page
or the old page, so long as there is a way to quiesce NMIs before freeing
the old page.

So the NMI can run, and read data, but it may have a problem with stores.
At least, some kind of redesign of NMI handlers might be required so that
they can make a note of the pending operation and try to do something
sane in that case. Or, there could be a small region of memory; a page or
two, which does not get migrated and NMIs can write to it. I don't think
you need to go so far as saying the entire kernel image must be non
movable just for NMIs.


> In the end you would need to define a core kernel which
> cannot be remapped and the rest which can and you end up
> with even more micro kernel like mess.

Are there any important NMIs that really can't fit with this?


> > ptep_clear_flush(ptep)         <--- from here
> > set_pte(ptep, newpte)          <--- until here
> >
> > for prot RW, the window also would include the memcpy, however if that
> > adds too much latency for execute/reads, then it can be mapped RO first,
> > then memcpy, then flushed and switched.
> >
> > > Then that would be essentially a hypervisor or micro kernel approach.
> >
> > What would be? Blocking in interrupts? Or non-linear kernel mapping in
>
> Well in general someone remapping all the memory beyond you.
> That's essentially a hypervisor in my book.

I don't see it. It is one of the things a hypervisor may do.
But anyway, call it what you will.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH] Cleanup to make  remove_memory() arch neutral
  2008-09-05 18:17     ` Ingo Molnar
@ 2008-09-08 21:52       ` Badari Pulavarty
  2008-09-09  0:56         ` Andrew Morton
  2008-09-08 21:56       ` [PATCH] x86: add memory hotremove config option Badari Pulavarty
  1 sibling, 1 reply; 31+ messages in thread
From: Badari Pulavarty @ 2008-09-08 21:52 UTC (permalink / raw)
  To: Andrew Morton, Andrew Morton
  Cc: Gary Hade, linux-mm, Yasunori Goto, Mel Gorman, Chris McDermott,
	linux-kernel, x86, Ingo Molnar

There is nothing architecture specific about remove_memory().
remove_memory() function is common for all architectures which
support hotplug memory remove. Instead of duplicating it in every
architecture, collapse them into arch neutral function.

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>

 arch/ia64/mm/init.c   |   17 -----------------
 arch/powerpc/mm/mem.c |   17 -----------------
 arch/s390/mm/init.c   |   11 -----------
 mm/memory_hotplug.c   |   10 ++++++++++
 4 files changed, 10 insertions(+), 45 deletions(-)

Index: linux-2.6.27-rc5/arch/ia64/mm/init.c
===================================================================
--- linux-2.6.27-rc5.orig/arch/ia64/mm/init.c	2008-08-28 15:52:02.000000000 -0700
+++ linux-2.6.27-rc5/arch/ia64/mm/init.c	2008-09-08 12:38:59.000000000 -0700
@@ -701,23 +701,6 @@ int arch_add_memory(int nid, u64 start, 
 
 	return ret;
 }
-#ifdef CONFIG_MEMORY_HOTREMOVE
-int remove_memory(u64 start, u64 size)
-{
-	unsigned long start_pfn, end_pfn;
-	unsigned long timeout = 120 * HZ;
-	int ret;
-	start_pfn = start >> PAGE_SHIFT;
-	end_pfn = start_pfn + (size >> PAGE_SHIFT);
-	ret = offline_pages(start_pfn, end_pfn, timeout);
-	if (ret)
-		goto out;
-	/* we can free mem_map at this point */
-out:
-	return ret;
-}
-EXPORT_SYMBOL_GPL(remove_memory);
-#endif /* CONFIG_MEMORY_HOTREMOVE */
 #endif
 
 /*
Index: linux-2.6.27-rc5/arch/powerpc/mm/mem.c
===================================================================
--- linux-2.6.27-rc5.orig/arch/powerpc/mm/mem.c	2008-08-28 15:52:02.000000000 -0700
+++ linux-2.6.27-rc5/arch/powerpc/mm/mem.c	2008-09-08 12:39:19.000000000 -0700
@@ -135,23 +135,6 @@ int arch_add_memory(int nid, u64 start, 
 
 	return __add_pages(zone, start_pfn, nr_pages);
 }
-
-#ifdef CONFIG_MEMORY_HOTREMOVE
-int remove_memory(u64 start, u64 size)
-{
-	unsigned long start_pfn, end_pfn;
-	int ret;
-
-	start_pfn = start >> PAGE_SHIFT;
-	end_pfn = start_pfn + (size >> PAGE_SHIFT);
-	ret = offline_pages(start_pfn, end_pfn, 120 * HZ);
-	if (ret)
-		goto out;
-	/* Arch-specific calls go here - next patch */
-out:
-	return ret;
-}
-#endif /* CONFIG_MEMORY_HOTREMOVE */
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 /*
Index: linux-2.6.27-rc5/arch/s390/mm/init.c
===================================================================
--- linux-2.6.27-rc5.orig/arch/s390/mm/init.c	2008-08-28 15:52:02.000000000 -0700
+++ linux-2.6.27-rc5/arch/s390/mm/init.c	2008-09-08 12:40:41.000000000 -0700
@@ -189,14 +189,3 @@ int arch_add_memory(int nid, u64 start, 
 	return rc;
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
-
-#ifdef CONFIG_MEMORY_HOTREMOVE
-int remove_memory(u64 start, u64 size)
-{
-	unsigned long start_pfn, end_pfn;
-
-	start_pfn = PFN_DOWN(start);
-	end_pfn = start_pfn + PFN_DOWN(size);
-	return offline_pages(start_pfn, end_pfn, 120 * HZ);
-}
-#endif /* CONFIG_MEMORY_HOTREMOVE */
Index: linux-2.6.27-rc5/mm/memory_hotplug.c
===================================================================
--- linux-2.6.27-rc5.orig/mm/memory_hotplug.c	2008-08-28 15:52:02.000000000 -0700
+++ linux-2.6.27-rc5/mm/memory_hotplug.c	2008-09-08 12:41:37.000000000 -0700
@@ -26,6 +26,7 @@
 #include <linux/delay.h>
 #include <linux/migrate.h>
 #include <linux/page-isolation.h>
+#include <linux/pfn.h>
 
 #include <asm/tlbflush.h>
 
@@ -849,6 +850,15 @@ failed_removal:
 
 	return ret;
 }
+
+int remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn, end_pfn;
+
+	start_pfn = PFN_DOWN(start);
+	end_pfn = start_pfn + PFN_DOWN(size);
+	return offline_pages(start_pfn, end_pfn, 120 * HZ);
+}
 #else
 int remove_memory(u64 start, u64 size)
 {
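
The address-to-PFN arithmetic in the consolidated remove_memory() can be checked in user space with a minimal sketch. It assumes 4 KiB pages (a shift of 12); the SKETCH_* macros, struct pfn_range and sketch_remove_range() are hypothetical names standing in for PAGE_SHIFT, PFN_DOWN() and the function body in the patch, and end_pfn is treated as exclusive.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages; the kernel takes the real value from PAGE_SHIFT. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PFN_DOWN(x) ((x) >> SKETCH_PAGE_SHIFT)

struct pfn_range {
	unsigned long start_pfn;
	unsigned long end_pfn;	/* first PFN past the range */
};

/* Mirror of the patch's arithmetic: round start and size down separately. */
static struct pfn_range sketch_remove_range(uint64_t start, uint64_t size)
{
	struct pfn_range r;

	r.start_pfn = (unsigned long)SKETCH_PFN_DOWN(start);
	r.end_pfn = r.start_pfn + (unsigned long)SKETCH_PFN_DOWN(size);
	return r;
}
```

For a 128 MiB region starting at 1 GiB this gives PFNs 0x40000 up to 0x48000, that is 32768 pages. Since start and size are rounded down independently, unaligned arguments can describe fewer pages than the byte range touches, so the sketch only matches the intent for page-aligned inputs.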



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH] x86: add memory hotremove config option
  2008-09-05 18:17     ` Ingo Molnar
  2008-09-08 21:52       ` [PATCH] Cleanup to make remove_memory() arch neutral Badari Pulavarty
@ 2008-09-08 21:56       ` Badari Pulavarty
  1 sibling, 0 replies; 31+ messages in thread
From: Badari Pulavarty @ 2008-09-08 21:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Gary Hade, linux-mm, Andrew Morton, Yasunori Goto, Mel Gorman,
	Chris McDermott, linux-kernel, x86

Cleaned-up patch without remove_memory().
Depends on the "make remove_memory() arch neutral" patch.

Thanks,
Badari

Add memory hotremove config option to x86

Memory hotremove functionality can currently be configured into
the ia64, powerpc, and s390 kernels.  This patch makes it possible
to configure the memory hotremove functionality into the x86
kernel as well. 

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Gary Hade <garyhade@us.ibm.com>
---
 arch/x86/Kconfig |    4 ++++
 1 file changed, 4 insertions(+)

Index: linux-2.6.27-rc5/arch/x86/Kconfig
===================================================================
--- linux-2.6.27-rc5.orig/arch/x86/Kconfig	2008-09-08 12:36:06.000000000 -0700
+++ linux-2.6.27-rc5/arch/x86/Kconfig	2008-09-08 12:45:30.000000000 -0700
@@ -1384,6 +1384,10 @@ config ARCH_ENABLE_MEMORY_HOTPLUG
 	def_bool y
 	depends on X86_64 || (X86_32 && HIGHMEM)
 
+config ARCH_ENABLE_MEMORY_HOTREMOVE
+	def_bool y
+	depends on MEMORY_HOTPLUG
+
 config HAVE_ARCH_EARLY_PFN_TO_NID
 	def_bool X86_64
 	depends on NUMA



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Cleanup to make  remove_memory() arch neutral
  2008-09-08 21:52       ` [PATCH] Cleanup to make remove_memory() arch neutral Badari Pulavarty
@ 2008-09-09  0:56         ` Andrew Morton
  2008-09-09  1:14           ` Randy Dunlap
  2008-09-09  1:21           ` Yasunori Goto
  0 siblings, 2 replies; 31+ messages in thread
From: Andrew Morton @ 2008-09-09  0:56 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: garyhade, linux-mm, y-goto, mel, lcm, linux-kernel, x86, mingo

On Mon, 08 Sep 2008 14:52:34 -0700
Badari Pulavarty <pbadari@us.ibm.com> wrote:

> There is nothing architecture specific about remove_memory().
> remove_memory() function is common for all architectures which
> support hotplug memory remove. Instead of duplicating it in every
> architecture, collapse them into arch neutral function.
> 
> Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
> 
>  arch/ia64/mm/init.c   |   17 -----------------
>  arch/powerpc/mm/mem.c |   17 -----------------
>  arch/s390/mm/init.c   |   11 -----------
>  mm/memory_hotplug.c   |   10 ++++++++++
>  4 files changed, 10 insertions(+), 45 deletions(-)

I spent some time trying to build-test this on ia64 and gave up.  How
the heck do you turn on memory hotplug on ia64?


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Cleanup to make  remove_memory() arch neutral
  2008-09-09  0:56         ` Andrew Morton
@ 2008-09-09  1:14           ` Randy Dunlap
  2008-09-09  1:21           ` Yasunori Goto
  1 sibling, 0 replies; 31+ messages in thread
From: Randy Dunlap @ 2008-09-09  1:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Badari Pulavarty, garyhade, linux-mm, y-goto, mel, lcm,
	linux-kernel, x86, mingo

On Mon, 8 Sep 2008 17:56:21 -0700 Andrew Morton wrote:

> On Mon, 08 Sep 2008 14:52:34 -0700
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
> 
> > There is nothing architecture specific about remove_memory().
> > remove_memory() function is common for all architectures which
> > support hotplug memory remove. Instead of duplicating it in every
> > architecture, collapse them into arch neutral function.
> > 
> > Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
> > 
> >  arch/ia64/mm/init.c   |   17 -----------------
> >  arch/powerpc/mm/mem.c |   17 -----------------
> >  arch/s390/mm/init.c   |   11 -----------
> >  mm/memory_hotplug.c   |   10 ++++++++++
> >  4 files changed, 10 insertions(+), 45 deletions(-)
> 
> I spent some time trying to build-test this on ia64 and gave up.  How
> the heck do you turn on memory hotplug on ia64?

After using ia64 defconfig, all I had to do was enable Sparse Memory model
instead of Discontiguous.


---
~Randy
Linux Plumbers Conference, 17-19 September 2008, Portland, Oregon USA
http://linuxplumbersconf.org/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Cleanup to make  remove_memory() arch neutral
  2008-09-09  0:56         ` Andrew Morton
  2008-09-09  1:14           ` Randy Dunlap
@ 2008-09-09  1:21           ` Yasunori Goto
  2008-09-09 15:12             ` Badari Pulavarty
  1 sibling, 1 reply; 31+ messages in thread
From: Yasunori Goto @ 2008-09-09  1:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Badari Pulavarty, garyhade, linux-mm, mel, lcm, linux-kernel, x86, mingo

> On Mon, 08 Sep 2008 14:52:34 -0700
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
> 
> > There is nothing architecture specific about remove_memory().
> > remove_memory() function is common for all architectures which
> > support hotplug memory remove. Instead of duplicating it in every
> > architecture, collapse them into arch neutral function.
> > 
> > Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
> > 
> >  arch/ia64/mm/init.c   |   17 -----------------
> >  arch/powerpc/mm/mem.c |   17 -----------------
> >  arch/s390/mm/init.c   |   11 -----------
> >  mm/memory_hotplug.c   |   10 ++++++++++
> >  4 files changed, 10 insertions(+), 45 deletions(-)
> 
> I spent some time trying to build-test this on ia64 and gave up.  How
> the heck do you turn on memory hotplug on ia64?
> 

EXPORT_SYMBOL_GPL(remove_memory) is removed.
It is required by drivers/acpi/acpi_memhotplug.ko.


-- 
Yasunori Goto 



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Cleanup to make  remove_memory() arch neutral
  2008-09-09  1:21           ` Yasunori Goto
@ 2008-09-09 15:12             ` Badari Pulavarty
  0 siblings, 0 replies; 31+ messages in thread
From: Badari Pulavarty @ 2008-09-09 15:12 UTC (permalink / raw)
  To: Yasunori Goto
  Cc: Andrew Morton, garyhade, linux-mm, mel, lcm, linux-kernel, x86, mingo


On Tue, 2008-09-09 at 10:21 +0900, Yasunori Goto wrote:
> > On Mon, 08 Sep 2008 14:52:34 -0700
> > Badari Pulavarty <pbadari@us.ibm.com> wrote:
> > 
> > > There is nothing architecture specific about remove_memory().
> > > remove_memory() function is common for all architectures which
> > > support hotplug memory remove. Instead of duplicating it in every
> > > architecture, collapse them into arch neutral function.
> > > 
> > > Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
> > > 
> > >  arch/ia64/mm/init.c   |   17 -----------------
> > >  arch/powerpc/mm/mem.c |   17 -----------------
> > >  arch/s390/mm/init.c   |   11 -----------
> > >  mm/memory_hotplug.c   |   10 ++++++++++
> > >  4 files changed, 10 insertions(+), 45 deletions(-)
> > 
> > I spent some time trying to build-test this on ia64 and gave up.  How
> > the heck do you turn on memory hotplug on ia64?
> > 
> 
> This patch removes EXPORT_SYMBOL_GPL(remove_memory), but the export
> is required by drivers/acpi/acpi_memhotplug.ko.

Thanks for catching it. I forgot that it was being used
by ACPI. Since we didn't export it for ppc and s390,
I assumed it was safe to remove the export. Sorry!
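
For reference, here is a rough user-space sketch of what the arch-neutral
remove_memory() could look like with the export kept. It is not the
actual patch. The function body (byte range converted to page frame
numbers, then handed to offline_pages()), the 120 * HZ timeout, and the
PAGE_SHIFT and HZ values are assumptions. offline_pages() is stubbed and
EXPORT_SYMBOL_GPL() is a placeholder macro so the file compiles outside
the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* User-space stand-ins for kernel definitions (assumed values). */
#define PAGE_SHIFT 12
#define HZ 250
/* Placeholder: in the kernel this makes the symbol visible to GPL modules. */
#define EXPORT_SYMBOL_GPL(sym) struct export_marker_##sym

typedef uint64_t u64;

/* Last request seen by the stub, kept so the sketch can be inspected. */
static unsigned long last_start_pfn, last_end_pfn, last_timeout;

/* Hypothetical stub; the real offline_pages() is in mm/memory_hotplug.c. */
static int offline_pages(unsigned long start_pfn, unsigned long end_pfn,
			 unsigned long timeout)
{
	last_start_pfn = start_pfn;
	last_end_pfn = end_pfn;
	last_timeout = timeout;
	return 0;
}

/* Arch-neutral: nothing here depends on ia64, powerpc, s390, or x86. */
int remove_memory(u64 start, u64 size)
{
	unsigned long start_pfn = (unsigned long)(start >> PAGE_SHIFT);
	unsigned long end_pfn = start_pfn + (unsigned long)(size >> PAGE_SHIFT);

	return offline_pages(start_pfn, end_pfn, 120 * HZ);
}
/* Without this line, a modular caller such as acpi_memhotplug.ko cannot link. */
EXPORT_SYMBOL_GPL(remove_memory);
```

The point is only that the export has to move into the common file along
with the function, because a module built as a .ko can resolve
remove_memory() only if it is exported.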

Thanks,
Badari



Thread overview: 31+ messages
2008-09-05 17:21 [PATCH] [RESEND] x86_64: add memory hotremove config option Gary Hade
2008-09-05 17:44 ` Ingo Molnar
2008-09-05 18:14   ` Badari Pulavarty
2008-09-05 18:17     ` Ingo Molnar
2008-09-08 21:52       ` [PATCH] Cleanup to make remove_memory() arch neutral Badari Pulavarty
2008-09-09  0:56         ` Andrew Morton
2008-09-09  1:14           ` Randy Dunlap
2008-09-09  1:21           ` Yasunori Goto
2008-09-09 15:12             ` Badari Pulavarty
2008-09-08 21:56       ` [PATCH] x86: add memory hotremove config option Badari Pulavarty
2008-09-05 18:04 ` [PATCH] [RESEND] x86_64: " Andi Kleen
2008-09-05 18:31   ` Badari Pulavarty
2008-09-05 18:54     ` Andi Kleen
2008-09-05 22:34       ` Badari Pulavarty
2008-09-05 19:53   ` Gary Hade
2008-09-05 20:04     ` Andi Kleen
2008-09-05 21:54       ` Gary Hade
2008-09-06  0:01         ` Andi Kleen
2008-09-06  7:06           ` Yasunori Goto
2008-09-06  8:53             ` Andi Kleen
2008-09-08  5:52               ` Nick Piggin
2008-09-08  9:36                 ` Andi Kleen
2008-09-08  9:46                   ` Nick Piggin
2008-09-08 10:30                     ` Andi Kleen
2008-09-08 11:19                       ` Nick Piggin
2008-09-08 11:30                         ` Andi Kleen
2008-09-08 13:48                           ` Nick Piggin
2008-09-06 14:33             ` Ingo Molnar
2008-09-06 16:00             ` kamezawa.hiroyu
2008-09-06 16:17               ` Ingo Molnar
2008-09-06 16:05             ` kamezawa.hiroyu
