* [0/8,v3] NUMA Hotplug Emulator - Introduction & Feedbacks
From: shaohui.zheng @ 2010-11-17  2:07 UTC
  To: akpm, linux-mm; +Cc: linux-kernel, haicheng.li, lethal, ak, shaohui.zheng

* PATCHSET INTRODUCTION

patch 1: Add a function to hide memory regions via the e820 table. The emulator
	     then uses these memory regions to fake offlined NUMA nodes.
patch 2: Infrastructure of NUMA hotplug emulation; introduces "hide node".
patch 3: Provide a userland interface to hotplug-add fake offlined nodes.
patch 4: Abstract the cpu register functions to make these interfaces friendly
		 to cpu hotplug emulation.
patch 5: Support cpu probe/release on x86; it provides a software method to
		 hot-add/remove CPUs through a sysfs interface.
patch 6: Fake a CPU socket with a logical CPU on x86, to prevent the scheduling
		 domain from building an incorrect hierarchy.
patch 7: Extend the memory probe interface to support NUMA, so that memory can
		 be added to a specified node through the interface.
patch 8: Documentation

* FEEDBACKS & RESPONSES

1) Patch 0
Balbir & Greg: Suggest using git/quilt to manage and send the patchset.
Response: Thanks for the recommendation. With help from Fengguang, I got quilt
		  working; it is a great tool.

2) Patch 2
Jaswinder Singh: if (hidden_num) is not required in patch 2
Response: Good catch, it is removed in v2.


3) Patch 3
Dave Hansen: Suggest creating a dedicated sysfs file for each possible node.
Greg: 	  How big would this "list" be?  What will it look like exactly?
Haicheng: It should follow "one value per file". It is intended to show the
		  acceptable parameters.

		  For example, if we have 4 fake offlined nodes, such as nodes 2-5, then:
			   $ cat /sys/devices/system/node/probe
				 2-5

		  Then the user hot-adds node 3 to the system:
			   $ echo 3 > /sys/devices/system/node/probe
			   $ cat /sys/devices/system/node/probe
				 2,4-5

Greg:   As you are trying to add a new sysfs file, please create the matching
		Documentation/ABI/ file as well.
Response: We missed it; it was added in v2.

4) Patch 4 & 5
Paul Mundt: This looks like an incredibly painful interface. How about scrapping all
of this _emu() mess and just reworking the register_cpu() interface?
Response: We accepted Paul's suggestion and removed the cpu _emu functions.

5) Patch 7
Dave Hansen: If we're going to put multiple values into the file now and
		 add to the ABI, can we be more explicit about it?
		echo "physical_address=0x40000000 numa_node=3" > memory/probe
Response: Dave's new interface was accepted, and we still keep the old
	      format for compatibility. We documented these interfaces in
		  Documentation/ABI in v2.
Greg: 	Suggests using configfs to replace the memory probe interface.
Andi: 	This is a debugging interface. It doesn't need to have the
	  	most pretty interface in the world, because it will be only used for
	  	QA by a few people. it's just a QA interface, not the next generation
		of POSIX.
Response: We still keep it as a sysfs interface, since the node/cpu/memory probe
		  interfaces are all in sysfs. We can create another group of patches to
		  support configfs if there is a strong requirement for it in the future.


* WHAT IS HOTPLUG EMULATOR 

NUMA hotplug emulator is the collective name for the hotplug emulation;
it emulates NUMA node hotplug purely in software. It is intended to
help people easily debug and test node/cpu/memory hotplug related code
on a machine without NUMA hotplug support, even a UMA machine.

The emulator provides a mechanism to emulate the process of physical cpu/mem
hot-add, which lets kernel developers debug CPU and memory hotplug on machines
without NUMA support. It offers an interface for testing cpu and memory
hotplug.

* WHY DO WE USE HOTPLUG EMULATOR

We have been focusing on hotplug emulation for a few months. The emulator helps
the team reproduce all the major hotplug bugs. It plays an important role in
hotplug code quality assurance. Because of the hotplug emulator, we have already
moved most of the debugging work to a virtual environment.

* PRINCIPLES & USAGE

The NUMA hotplug emulator includes 3 different parts. We add a menu item to
menuconfig to enable/disable them.

1) Node hotplug emulation:

The emulator first hides RAM via the E820 table, and then it can
fake offlined nodes with the hidden RAM.

After system bootup, the user is able to hotplug-add these offlined
nodes, which is similar to the behavior of real hotplug hardware.

Use the boot option "numa=hide=N*size" to fake offlined nodes:
	- N is the number of hidden nodes
	- size is the memory size (in MB) per hidden node.
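
To make the option format concrete, here is a minimal user-space sketch of how
a value such as "2*512,1024" could be expanded into per-node sizes. It is not
the kernel code from patch 2; parse_hide_option() and MAX_HIDDEN_NODES are
hypothetical names for this illustration. The 64 MB granularity is taken from
FAKE_NODE_MIN_SIZE in the patch, and a comma-separated list of entries is
assumed because the patch's parser splits on ','.

```c
#include <stdint.h>
#include <stdlib.h>

#define FAKE_NODE_MIN_SIZE ((uint64_t)64 << 20)	/* 64 MB, as in patch 2 */
#define MAX_HIDDEN_NODES   64			/* hypothetical cap for this sketch */

/*
 * Parse a "numa=hide=" value such as "2*512,1024" into per-node sizes in
 * bytes. Returns the number of hidden nodes, or -1 on malformed input.
 */
static int parse_hide_option(const char *s, uint64_t sizes[MAX_HIDDEN_NODES])
{
	int count = 0;

	while (*s) {
		char *end;
		unsigned long long n = 1, mb;

		mb = strtoull(s, &end, 10);
		if (end == s)
			return -1;	/* no digits where a number was expected */
		if (*end == '*') {	/* "N*size": the first number was the count */
			n = mb;
			s = end + 1;
			mb = strtoull(s, &end, 10);
			if (end == s)
				return -1;
		}

		/* Convert MB to bytes, round down to a FAKE_NODE_MIN_SIZE multiple. */
		uint64_t bytes = ((uint64_t)mb << 20) & ~(FAKE_NODE_MIN_SIZE - 1);

		/* Entries below the minimum node size round to 0 and are skipped. */
		for (unsigned long long i = 0; bytes && i < n; i++) {
			if (count == MAX_HIDDEN_NODES)
				return -1;
			sizes[count++] = bytes;
		}

		if (*end == ',')
			s = end + 1;
		else if (*end == '\0')
			s = end;
		else
			return -1;
	}
	return count;
}
```

With this sketch, "2*512,1024" yields three nodes of 512 MB, 512 MB and
1024 MB, while a 100 MB entry is rounded down to 64 MB.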

There is a sysfs entry "probe" under /sys/devices/system/node/ that lets the
user hotplug the fake offlined nodes:

 - to show all fake offlined nodes:
    $ cat /sys/devices/system/node/probe

 - to hot-add a fake offlined node, e.g. with node id N:
    $ echo N > /sys/devices/system/node/probe
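
Reading the probe file returns a range list such as "2-5" or "2,4-5" (see the
Patch 3 feedback above). The following user-space sketch shows one way such a
list can be produced from a set of hidden node ids. It only illustrates the
output format; format_node_list() is a hypothetical helper, not the function
used by the patches.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format the set entries of hidden[0..n-1] as a range list such as "2,4-5".
 * Returns the number of characters written (excluding the NUL), or -1 if
 * buf is too small.
 */
static int format_node_list(const bool *hidden, int n, char *buf, size_t len)
{
	size_t used = 0;
	int i = 0;

	if (len == 0)
		return -1;
	buf[0] = '\0';

	while (i < n) {
		if (!hidden[i]) {
			i++;
			continue;
		}

		int start = i;

		/* Extend the run while the next node is hidden as well. */
		while (i + 1 < n && hidden[i + 1])
			i++;

		int w;

		if (start == i)
			w = snprintf(buf + used, len - used, "%s%d",
				     used ? "," : "", start);
		else
			w = snprintf(buf + used, len - used, "%s%d-%d",
				     used ? "," : "", start, i);
		if (w < 0 || (size_t)w >= len - used)
			return -1;
		used += (size_t)w;
		i++;
	}
	return (int)used;
}
```

With nodes 2-5 hidden this produces "2-5"; after node 3 is hot-added and
cleared from the set, it produces "2,4-5".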

2) CPU hotplug emulation:

The emulator reserves CPUs through a grub (kernel boot) parameter; the reserved
CPUs can then be hot-added/hot-removed in software.

When hotplugging a CPU with the emulator, we use a logical CPU to emulate the CPU
hotplug process. On CPUs that support SMT, some logical CPUs are in the same
socket, but with the emulator they may be located in different NUMA nodes.  We
therefore put each logical CPU into a fake CPU socket and assign it a unique
phys_proc_id. Each fake socket holds only one logical CPU.

 - to hide CPUs
	- Use the boot option "maxcpus=N" to hide CPUs;
	  N is the number of CPUs to initialize
	- Use the boot option "cpu_hpe=on" to enable cpu hotplug emulation;
	  when cpu_hpe is enabled, the remaining CPUs will not be initialized

 - to hot-add a CPU to a node
	$ echo nid > cpu/probe

 - to hot-remove a CPU
	$ echo nid > cpu/release

3) Memory hotplug emulation:

The emulator reserves memory before the OS boots. The reserved memory region
is removed from the e820 table, and it can be hot-added via the probe interface.
This interface was extended to support adding memory to a specified node; it
maintains backwards compatibility.

The difficulty of memory release is well known; we have no plan for it so far.

 - reserve memory through a grub (kernel boot) parameter
 	mem=1024m

 - add a memory section to node 3
    $ echo 0x40000000,3 > memory/probe
	OR
    $ echo 1024m,3 > memory/probe
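
As an illustration of the "address,node" format above, here is a small
user-space sketch of a parser for such a string. It is an assumption-laden
model, not the code from patch 7: parse_mem_probe() is a hypothetical name,
the k/m/g suffix handling merely imitates the spirit of the kernel's
memparse(), and treating a missing node id as -1 is this sketch's own way of
modelling the old address-only format.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse "addr[,nid]" as written to memory/probe in the examples above.
 * addr may carry a k/m/g suffix. On success returns 0 and fills *addr and
 * *nid (*nid is -1 if no node id was given); returns -1 on malformed input.
 */
static int parse_mem_probe(const char *s, uint64_t *addr, int *nid)
{
	char *end;
	uint64_t v = strtoull(s, &end, 0);	/* base 0: accepts 0x... and decimal */

	if (end == s)
		return -1;

	switch (*end) {
	case 'g': case 'G': v <<= 10;	/* fall through */
	case 'm': case 'M': v <<= 10;	/* fall through */
	case 'k': case 'K': v <<= 10; end++; break;
	default: break;
	}

	*nid = -1;
	if (*end == ',') {
		const char *p = end + 1;
		long n = strtol(p, &end, 10);

		if (end == p || n < 0)
			return -1;
		*nid = (int)n;
	}
	if (*end != '\0' && *end != '\n')	/* echo appends a newline */
		return -1;

	*addr = v;
	return 0;
}
```

Both "0x40000000,3" and "1024m,3" parse to address 0x40000000 on node 3,
which is why the two echo commands above are equivalent.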

* ACKNOWLEDGMENT 

The hotplug emulator is the result of a team effort; thanks to all of them.
They are:
Andi Kleen, Haicheng Li, Shaohui Zheng, Fengguang Wu and Yongkang You

-- 
Thanks & Regards,
Shaohui




* [1/8,v3] NUMA Hotplug Emulator: add function to hide memory region via e820 table.
From: shaohui.zheng @ 2010-11-17  2:08 UTC
  To: akpm, linux-mm
  Cc: linux-kernel, haicheng.li, lethal, ak, shaohui.zheng, Yinghai Lu,
	Haicheng Li, Shaohui Zheng

[-- Attachment #1: 001-hotplug-emulator-x86-add-function-to-hide-memory-region-via-e820.patch --]
[-- Type: text/plain, Size: 2284 bytes --]

From: Haicheng Li <haicheng.li@intel.com>

The NUMA hotplug emulator needs to hide memory regions at the very
beginning of kernel boot. The emulator then uses these
memory regions to fake offlined NUMA nodes.
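
The address arithmetic behind the hiding step can be modelled in user space as
follows. This is only a sketch under simplifying assumptions: struct ram_model
and hide_mem() are hypothetical names, RAM is treated as a single contiguous
range ending at end_of_ram, and the size check is an addition of this sketch
rather than something taken from the patch.

```c
#include <stdint.h>

/* Toy model of the top of usable RAM; stands in for the e820 map. */
struct ram_model {
	uint64_t end_of_ram;	/* first byte past the last usable RAM address */
};

/*
 * Hide mem_size bytes at the top of RAM and return the start of the hidden
 * region, or 0 if mem_size is zero or not smaller than the RAM available.
 */
static uint64_t hide_mem(struct ram_model *m, uint64_t mem_size)
{
	if (mem_size == 0 || mem_size >= m->end_of_ram)
		return 0;

	uint64_t start = m->end_of_ram - mem_size;

	m->end_of_ram = start;	/* later calls carve out space below this region */
	return start;
}
```

With 4 GB of RAM, hiding 1 GB returns a start address of 3 GB; a second call
hiding 128 MB then carves its region directly below that, which mirrors how
repeated calls lower the limit in the patch.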

CC: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Haicheng Li <haicheng.li@intel.com>
Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
---
 arch/x86/include/asm/e820.h |    1 +
 arch/x86/kernel/e820.c      |   19 ++++++++++++++++++-
 2 files changed, 19 insertions(+), 1 deletions(-)

Index: linux-hpe4/arch/x86/include/asm/e820.h
===================================================================
--- linux-hpe4.orig/arch/x86/include/asm/e820.h	2010-11-15 17:13:02.483461667 +0800
+++ linux-hpe4/arch/x86/include/asm/e820.h	2010-11-15 17:13:07.083461581 +0800
@@ -129,6 +129,7 @@
 extern void e820_register_active_regions(int nid, unsigned long start_pfn,
 					 unsigned long end_pfn);
 extern u64 e820_hole_size(u64 start, u64 end);
+extern u64 e820_hide_mem(u64 mem_size);
 extern void finish_e820_parsing(void);
 extern void e820_reserve_resources(void);
 extern void e820_reserve_resources_late(void);
Index: linux-hpe4/arch/x86/kernel/e820.c
===================================================================
--- linux-hpe4.orig/arch/x86/kernel/e820.c	2010-11-15 17:13:02.483461667 +0800
+++ linux-hpe4/arch/x86/kernel/e820.c	2010-11-15 17:13:07.083461581 +0800
@@ -971,6 +971,7 @@
 }
 
 static int userdef __initdata;
+static u64 max_mem_size __initdata = ULLONG_MAX;
 
 /* "mem=nopentium" disables the 4MB page tables. */
 static int __init parse_memopt(char *p)
@@ -989,12 +990,28 @@
 
 	userdef = 1;
 	mem_size = memparse(p, &p);
-	e820_remove_range(mem_size, ULLONG_MAX - mem_size, E820_RAM, 1);
+	e820_remove_range(mem_size, max_mem_size - mem_size, E820_RAM, 1);
+	max_mem_size = mem_size;
 
 	return 0;
 }
 early_param("mem", parse_memopt);
 
+#ifdef CONFIG_NODE_HOTPLUG_EMU
+u64 __init e820_hide_mem(u64 mem_size)
+{
+	u64 start, end_pfn;
+
+	userdef = 1;
+	end_pfn = e820_end_of_ram_pfn();
+	start = (end_pfn << PAGE_SHIFT) - mem_size;
+	e820_remove_range(start, max_mem_size - start, E820_RAM, 1);
+	max_mem_size = start;
+
+	return start;
+}
+#endif
+
 static int __init parse_memmap_opt(char *p)
 {
 	char *oldp;

-- 
Thanks & Regards,
Shaohui




* [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation
From: shaohui.zheng @ 2010-11-17  2:08 UTC
  To: akpm, linux-mm
  Cc: linux-kernel, haicheng.li, lethal, ak, shaohui.zheng, Yinghai Lu,
	Haicheng Li, Shaohui Zheng

[-- Attachment #1: 002-hotplug-emulator-x86-infrastructure-of-node-hotplug-emulation.patch --]
[-- Type: text/plain, Size: 7052 bytes --]

From: Haicheng Li <haicheng.li@intel.com>

The NUMA hotplug emulator introduces a new node state, N_HIDDEN, to
identify a fake offlined node. It first hides RAM via the E820
table and then emulates fake offlined nodes with the hidden RAM.

After system bootup, the user is able to hotplug-add these offlined
nodes, which is similar to the behavior of real hotplug hardware.

Use the boot option "numa=hide=N*size" to fake offlined nodes:
	- N is the number of hidden nodes
	- size is the memory size (in MB) per hidden node.

OPEN: The kernel might use part of the hidden memory region as a RAM buffer,
      so for now the emulator directly hides 128M of extra space to work
      around this issue.  Is there a better way to avoid this conflict?
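
For readers who want the layout spelled out: the hidden nodes are given
back-to-back address ranges that begin at the start of the hidden region. The
user-space sketch below models only that bookkeeping. struct node_range and
layout_hidden_nodes() are hypothetical names, and the output array is indexed
by hidden-node order rather than by node id.

```c
#include <stdint.h>

struct node_range {
	uint64_t start;
	uint64_t end;	/* exclusive */
};

/*
 * Assign consecutive [start, end) ranges to count hidden nodes, beginning
 * at hp_start. Returns the end address of the last range.
 */
static uint64_t layout_hidden_nodes(uint64_t hp_start, const uint64_t *sizes,
				    int count, struct node_range *out)
{
	uint64_t cursor = hp_start;

	for (int i = 0; i < count; i++) {
		out[i].start = cursor;
		out[i].end = cursor + sizes[i];
		cursor = out[i].end;	/* the next node starts where this one ends */
	}
	return cursor;
}
```

For example, two 512 MB nodes laid out from 3 GB occupy [3 GB, 3.5 GB) and
[3.5 GB, 4 GB).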

CC: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Haicheng Li <haicheng.li@intel.com>
Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
---
Index: linux-hpe4/arch/x86/include/asm/numa_64.h
===================================================================
--- linux-hpe4.orig/arch/x86/include/asm/numa_64.h	2010-11-15 17:13:02.453461462 +0800
+++ linux-hpe4/arch/x86/include/asm/numa_64.h	2010-11-15 17:13:07.093461818 +0800
@@ -37,7 +37,7 @@
 extern void __cpuinit numa_add_cpu(int cpu);
 extern void __cpuinit numa_remove_cpu(int cpu);
 
-#ifdef CONFIG_NUMA_EMU
+#if defined(CONFIG_NUMA_EMU) || defined(CONFIG_NODE_HOTPLUG_EMU)
 #define FAKE_NODE_MIN_SIZE	((u64)64 << 20)
 #define FAKE_NODE_MIN_HASH_MASK	(~(FAKE_NODE_MIN_SIZE - 1UL))
 #endif /* CONFIG_NUMA_EMU */
Index: linux-hpe4/arch/x86/mm/numa_64.c
===================================================================
--- linux-hpe4.orig/arch/x86/mm/numa_64.c	2010-11-15 17:13:02.463461371 +0800
+++ linux-hpe4/arch/x86/mm/numa_64.c	2010-11-15 17:21:05.510961676 +0800
@@ -304,6 +304,123 @@
 	}
 }
 
+#ifdef CONFIG_NODE_HOTPLUG_EMU
+static char *hp_cmdline __initdata;
+static struct bootnode *hidden_nodes;
+static u64 hp_start;
+static long hidden_num, hp_size;
+static u64 nodes_size[MAX_NUMNODES] __initdata;
+
+int hotadd_hidden_nodes(int nid)
+{
+	int ret;
+
+	if (!node_hidden(nid))
+		return -EINVAL;
+
+	ret = add_memory(nid, hidden_nodes[nid].start,
+			 hidden_nodes[nid].end - hidden_nodes[nid].start);
+	if (!ret) {
+		node_clear_hidden(nid);
+		return 0;
+	} else {
+		return -EEXIST;
+	}
+}
+
+/* parse the command line for numa=hide */
+static long __init parse_hide_nodes(char *hp_cmdline)
+{
+	int coef = 1, nid = 0;
+	u64 size = 0;
+	long total = 0;
+	char buf[512], *p;
+
+	/* parse numa=hide command-line */
+	hidden_num = 0;
+	p = buf;
+	while (1) {
+		if (*hp_cmdline == ',' || *hp_cmdline == '\0') {
+			*p = '\0';
+			size = simple_strtoul(buf, NULL, 0);
+			printk(KERN_ERR "size: %dM buf:%s coef: %d.\n", (int)size, buf, coef);
+			if (!((size<<20) & FAKE_NODE_MIN_HASH_MASK))
+				printk(KERN_ERR "%d M is less than minimum node size, ignore it.\n", (int)size);
+
+			size <<= 20;
+			/* Round down to nearest FAKE_NODE_MIN_SIZE. */
+			size &= FAKE_NODE_MIN_HASH_MASK;
+
+			if (size) {
+				int i;
+				total += size * coef;
+				for (i = 0; i < coef; i++)
+					nodes_size[nid++] = size;
+				hidden_num += coef;
+			}
+
+			coef = 1;
+			p = buf;
+			if (*hp_cmdline  == '\0')
+				break;
+			hp_cmdline++;
+		} else if (*hp_cmdline ==  '*') {
+			*p++ = '\0';
+			coef = simple_strtoul(buf, NULL, 0);
+			p = buf;
+			hp_cmdline++;
+		} else if (!isdigit(*hp_cmdline)) {
+			break;
+		}
+
+		*p++ = *hp_cmdline++;
+	}
+
+	return total;
+}
+
+static void __init numa_hide_nodes(void)
+{
+	hp_size = parse_hide_nodes(hp_cmdline);
+
+	hp_start = e820_hide_mem(hp_size);
+	if (hp_start <= 0) {
+		printk(KERN_ERR "Hide too much memory, disable node hotplug emualtion.");
+		hidden_num = 0;
+		return;
+	}
+
+	/* leave 128M space for possible RAM buffer usage later
+	 any other better way to avoid this conflict?*/
+
+	e820_hide_mem(128*1024*1024);
+}
+
+static void __init numa_hotplug_emulation(void)
+{
+	int i, num_nodes = 0, nid;
+
+	for_each_online_node(i)
+		if (i > num_nodes)
+			num_nodes = i;
+
+	i = num_nodes + hidden_num;
+	if (!hidden_nodes) {
+		hidden_nodes = alloc_bootmem(sizeof(struct bootnode) * i);
+		memset(hidden_nodes, 0, sizeof(struct bootnode) * i);
+	}
+
+	nid = num_nodes + 1;
+	for (i = 0; i < hidden_num; i++) {
+		node_set(nid, node_possible_map);
+		hidden_nodes[nid].start = hp_start;
+		hidden_nodes[nid].end = hp_start + (nodes_size[i]);
+		hp_start = hidden_nodes[nid].end;
+		node_set_hidden(nid++);
+	}
+}
+#endif /* CONFIG_NODE_HOTPLUG_EMU */
+
 #ifdef CONFIG_NUMA_EMU
 /* Numa emulation */
 static struct bootnode nodes[MAX_NUMNODES] __initdata;
@@ -658,7 +775,7 @@
 
 #ifdef CONFIG_NUMA_EMU
 	if (cmdline && !numa_emulation(start_pfn, last_pfn, acpi, k8))
-		return;
+		goto done;
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
 #endif
@@ -666,14 +783,14 @@
 #ifdef CONFIG_ACPI_NUMA
 	if (!numa_off && acpi && !acpi_scan_nodes(start_pfn << PAGE_SHIFT,
 						  last_pfn << PAGE_SHIFT))
-		return;
+		goto done;
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
 #endif
 
 #ifdef CONFIG_K8_NUMA
 	if (!numa_off && k8 && !k8_scan_nodes())
-		return;
+		goto done;
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
 #endif
@@ -693,6 +810,13 @@
 		numa_set_node(i, 0);
 	e820_register_active_regions(0, start_pfn, last_pfn);
 	setup_node_bootmem(0, start_pfn << PAGE_SHIFT, last_pfn << PAGE_SHIFT);
+
+done:
+#ifdef CONFIG_NODE_HOTPLUG_EMU
+	if (hidden_num)
+		numa_hotplug_emulation();
+#endif
+	return;
 }
 
 unsigned long __init numa_free_all_bootmem(void)
@@ -720,6 +844,12 @@
 	if (!strncmp(opt, "fake=", 5))
 		cmdline = opt + 5;
 #endif
+#ifdef CONFIG_NODE_HOTPLUG_EMU
+	if (!strncmp(opt, "hide=", 5)) {
+		hp_cmdline = opt + 5;
+		numa_hide_nodes();
+	}
+#endif
 #ifdef CONFIG_ACPI_NUMA
 	if (!strncmp(opt, "noacpi", 6))
 		acpi_numa = -1;
Index: linux-hpe4/include/linux/nodemask.h
===================================================================
--- linux-hpe4.orig/include/linux/nodemask.h	2010-11-15 17:13:02.463461371 +0800
+++ linux-hpe4/include/linux/nodemask.h	2010-11-15 17:13:07.093461818 +0800
@@ -371,6 +371,10 @@
  */
 enum node_states {
 	N_POSSIBLE,		/* The node could become online at some point */
+#ifdef CONFIG_NODE_HOTPLUG_EMU
+	N_HIDDEN,		/* The node is hidden at booting time, could be
+				 * onlined in run time */
+#endif
 	N_ONLINE,		/* The node is online */
 	N_NORMAL_MEMORY,	/* The node has regular memory */
 #ifdef CONFIG_HIGHMEM
@@ -470,6 +474,13 @@
 #define node_online(node)	node_state((node), N_ONLINE)
 #define node_possible(node)	node_state((node), N_POSSIBLE)
 
+#ifdef CONFIG_NODE_HOTPLUG_EMU
+#define node_set_hidden(node)	   node_set_state((node), N_HIDDEN)
+#define node_clear_hidden(node)	   node_clear_state((node), N_HIDDEN)
+#define node_hidden(node)	node_state((node), N_HIDDEN)
+extern int hotadd_hidden_nodes(int nid);
+#endif
+
 #define for_each_node(node)	   for_each_node_state(node, N_POSSIBLE)
 #define for_each_online_node(node) for_each_node_state(node, N_ONLINE)
 

-- 
Thanks & Regards,
Shaohui




* [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation
@ 2010-11-17  2:08   ` shaohui.zheng
  0 siblings, 0 replies; 139+ messages in thread
From: shaohui.zheng @ 2010-11-17  2:08 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: linux-kernel, haicheng.li, lethal, ak, shaohui.zheng, Yinghai Lu,
	Haicheng Li, Shaohui Zheng

[-- Attachment #1: 002-hotplug-emulator-x86-infrastructure-of-node-hotplug-emulation.patch --]
[-- Type: text/plain, Size: 7348 bytes --]

From: Haicheng Li <haicheng.li@intel.com>

The NUMA hotplug emulator introduces a new node state, N_HIDDEN, to
identify a fake offlined node. It first hides RAM via the E820
table and then emulates fake offlined nodes with the hidden RAM.

After system bootup, the user is able to hotplug-add these offlined
nodes, which is similar to the behavior of real hotplug hardware.

Use the boot option "numa=hide=N*size" to fake offlined nodes:
	- N is the number of hidden nodes
	- size is the memory size (in MB) per hidden node.

OPEN: The kernel might use part of the hidden memory region as a RAM buffer,
      so for now the emulator directly hides 128M of extra space to work
      around this issue.  Is there a better way to avoid this conflict?

CC: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Haicheng Li <haicheng.li@intel.com>
Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
---
Index: linux-hpe4/arch/x86/include/asm/numa_64.h
===================================================================
--- linux-hpe4.orig/arch/x86/include/asm/numa_64.h	2010-11-15 17:13:02.453461462 +0800
+++ linux-hpe4/arch/x86/include/asm/numa_64.h	2010-11-15 17:13:07.093461818 +0800
@@ -37,7 +37,7 @@
 extern void __cpuinit numa_add_cpu(int cpu);
 extern void __cpuinit numa_remove_cpu(int cpu);
 
-#ifdef CONFIG_NUMA_EMU
+#if defined(CONFIG_NUMA_EMU) || defined(CONFIG_NODE_HOTPLUG_EMU)
 #define FAKE_NODE_MIN_SIZE	((u64)64 << 20)
 #define FAKE_NODE_MIN_HASH_MASK	(~(FAKE_NODE_MIN_SIZE - 1UL))
 #endif /* CONFIG_NUMA_EMU */
Index: linux-hpe4/arch/x86/mm/numa_64.c
===================================================================
--- linux-hpe4.orig/arch/x86/mm/numa_64.c	2010-11-15 17:13:02.463461371 +0800
+++ linux-hpe4/arch/x86/mm/numa_64.c	2010-11-15 17:21:05.510961676 +0800
@@ -304,6 +304,123 @@
 	}
 }
 
+#ifdef CONFIG_NODE_HOTPLUG_EMU
+static char *hp_cmdline __initdata;
+static struct bootnode *hidden_nodes;
+static u64 hp_start;
+static long hidden_num, hp_size;
+static u64 nodes_size[MAX_NUMNODES] __initdata;
+
+int hotadd_hidden_nodes(int nid)
+{
+	int ret;
+
+	if (!node_hidden(nid))
+		return -EINVAL;
+
+	ret = add_memory(nid, hidden_nodes[nid].start,
+			 hidden_nodes[nid].end - hidden_nodes[nid].start);
+	if (!ret) {
+		node_clear_hidden(nid);
+		return 0;
+	} else {
+		return -EEXIST;
+	}
+}
+
+/* parse the command line for numa=hide */
+static long __init parse_hide_nodes(char *hp_cmdline)
+{
+	int coef = 1, nid = 0;
+	u64 size = 0;
+	long total = 0;
+	char buf[512], *p;
+
+	/* parse numa=hide command-line */
+	hidden_num = 0;
+	p = buf;
+	while (1) {
+		if (*hp_cmdline == ',' || *hp_cmdline == '\0') {
+			*p = '\0';
+			size = simple_strtoul(buf, NULL, 0);
+			printk(KERN_ERR "size: %dM buf:%s coef: %d.\n", (int)size, buf, coef);
+			if (!((size<<20) & FAKE_NODE_MIN_HASH_MASK))
+				printk(KERN_ERR "%d M is less than minimum node size, ignore it.\n", (int)size);
+
+			size <<= 20;
+			/* Round down to nearest FAKE_NODE_MIN_SIZE. */
+			size &= FAKE_NODE_MIN_HASH_MASK;
+
+			if (size) {
+				int i;
+				total += size * coef;
+				for (i = 0; i < coef; i++)
+					nodes_size[nid++] = size;
+				hidden_num += coef;
+			}
+
+			coef = 1;
+			p = buf;
+			if (*hp_cmdline  == '\0')
+				break;
+			hp_cmdline++;
+		} else if (*hp_cmdline ==  '*') {
+			*p++ = '\0';
+			coef = simple_strtoul(buf, NULL, 0);
+			p = buf;
+			hp_cmdline++;
+		} else if (!isdigit(*hp_cmdline)) {
+			break;
+		}
+
+		*p++ = *hp_cmdline++;
+	}
+
+	return total;
+}
+
+static void __init numa_hide_nodes(void)
+{
+	hp_size = parse_hide_nodes(hp_cmdline);
+
+	hp_start = e820_hide_mem(hp_size);
+	if (hp_start <= 0) {
+		printk(KERN_ERR "Hide too much memory, disable node hotplug emualtion.");
+		hidden_num = 0;
+		return;
+	}
+
+	/* leave 128M space for possible RAM buffer usage later
+	 any other better way to avoid this conflict?*/
+
+	e820_hide_mem(128*1024*1024);
+}
+
+static void __init numa_hotplug_emulation(void)
+{
+	int i, num_nodes = 0, nid;
+
+	for_each_online_node(i)
+		if (i > num_nodes)
+			num_nodes = i;
+
+	i = num_nodes + hidden_num;
+	if (!hidden_nodes) {
+		hidden_nodes = alloc_bootmem(sizeof(struct bootnode) * i);
+		memset(hidden_nodes, 0, sizeof(struct bootnode) * i);
+	}
+
+	nid = num_nodes + 1;
+	for (i = 0; i < hidden_num; i++) {
+		node_set(nid, node_possible_map);
+		hidden_nodes[nid].start = hp_start;
+		hidden_nodes[nid].end = hp_start + (nodes_size[i]);
+		hp_start = hidden_nodes[nid].end;
+		node_set_hidden(nid++);
+	}
+}
+#endif /* CONFIG_NODE_HOTPLUG_EMU */
+
 #ifdef CONFIG_NUMA_EMU
 /* Numa emulation */
 static struct bootnode nodes[MAX_NUMNODES] __initdata;
@@ -658,7 +775,7 @@
 
 #ifdef CONFIG_NUMA_EMU
 	if (cmdline && !numa_emulation(start_pfn, last_pfn, acpi, k8))
-		return;
+		goto done;
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
 #endif
@@ -666,14 +783,14 @@
 #ifdef CONFIG_ACPI_NUMA
 	if (!numa_off && acpi && !acpi_scan_nodes(start_pfn << PAGE_SHIFT,
 						  last_pfn << PAGE_SHIFT))
-		return;
+		goto done;
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
 #endif
 
 #ifdef CONFIG_K8_NUMA
 	if (!numa_off && k8 && !k8_scan_nodes())
-		return;
+		goto done;
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
 #endif
@@ -693,6 +810,13 @@
 		numa_set_node(i, 0);
 	e820_register_active_regions(0, start_pfn, last_pfn);
 	setup_node_bootmem(0, start_pfn << PAGE_SHIFT, last_pfn << PAGE_SHIFT);
+
+done:
+#ifdef CONFIG_NODE_HOTPLUG_EMU
+	if (hidden_num)
+		numa_hotplug_emulation();
+#endif
+	return;
 }
 
 unsigned long __init numa_free_all_bootmem(void)
@@ -720,6 +844,12 @@
 	if (!strncmp(opt, "fake=", 5))
 		cmdline = opt + 5;
 #endif
+#ifdef CONFIG_NODE_HOTPLUG_EMU
+	if (!strncmp(opt, "hide=", 5)) {
+		hp_cmdline = opt + 5;
+		numa_hide_nodes();
+	}
+#endif
 #ifdef CONFIG_ACPI_NUMA
 	if (!strncmp(opt, "noacpi", 6))
 		acpi_numa = -1;
Index: linux-hpe4/include/linux/nodemask.h
===================================================================
--- linux-hpe4.orig/include/linux/nodemask.h	2010-11-15 17:13:02.463461371 +0800
+++ linux-hpe4/include/linux/nodemask.h	2010-11-15 17:13:07.093461818 +0800
@@ -371,6 +371,10 @@
  */
 enum node_states {
 	N_POSSIBLE,		/* The node could become online at some point */
+#ifdef CONFIG_NODE_HOTPLUG_EMU
+	N_HIDDEN,		/* The node is hidden at boot time and can be
+				 * onlined at run time */
+#endif
 	N_ONLINE,		/* The node is online */
 	N_NORMAL_MEMORY,	/* The node has regular memory */
 #ifdef CONFIG_HIGHMEM
@@ -470,6 +474,13 @@
 #define node_online(node)	node_state((node), N_ONLINE)
 #define node_possible(node)	node_state((node), N_POSSIBLE)
 
+#ifdef CONFIG_NODE_HOTPLUG_EMU
+#define node_set_hidden(node)	   node_set_state((node), N_HIDDEN)
+#define node_clear_hidden(node)	   node_clear_state((node), N_HIDDEN)
+#define node_hidden(node)	node_state((node), N_HIDDEN)
+extern int hotadd_hidden_nodes(int nid);
+#endif
+
 #define for_each_node(node)	   for_each_node_state(node, N_POSSIBLE)
 #define for_each_online_node(node) for_each_node_state(node, N_ONLINE)
 

-- 
Thanks & Regards,
Shaohui
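
For readers following the "numa=hide=" syntax handled by parse_hide_nodes()
in the patch above, here is a minimal user-space sketch of the same
"count*size,size,..." grammar. It is not the kernel code: parse_hide_spec()
and its parameters are hypothetical names, it parses in base 10 only, and it
rejects malformed or zero-sized entries instead of silently skipping them.
Sizes are kept in megabytes, as in the patch.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Parse a "numa=hide=" style argument such as "2*1024,512".
 * Each comma-separated entry is either "size" or "count*size", with
 * size in megabytes. Per-node sizes (in MB) are stored in sizes[].
 * Returns the number of nodes parsed, or -1 on malformed input.
 * (Hypothetical helper, not part of the patch.)
 */
int parse_hide_spec(const char *s, unsigned long sizes[], int max)
{
	int n = 0;

	while (*s) {
		char *end;
		unsigned long count = 1, size;

		if (!isdigit((unsigned char)*s))
			return -1;
		size = strtoul(s, &end, 10);
		if (*end == '*') {
			/* the number just read was the repeat count */
			count = size;
			s = end + 1;
			if (!isdigit((unsigned char)*s))
				return -1;
			size = strtoul(s, &end, 10);
		}
		if (*end != ',' && *end != '\0')
			return -1;
		/* reject empty nodes and entries that would overflow sizes[] */
		if (size == 0 || count > (unsigned long)(max - n))
			return -1;
		while (count--)
			sizes[n++] = size;
		s = (*end == ',') ? end + 1 : end;
	}
	return n;
}
```

With an 8-entry array, "2*1024,512" yields three nodes of 1024, 1024 and
512 MB, while "abc" or an entry list longer than the array is rejected.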



^ permalink raw reply	[flat|nested] 139+ messages in thread

* [3/8,v3] NUMA Hotplug Emulator: Userland interface to hotplug-add fake offlined nodes.
  2010-11-17  2:07 ` shaohui.zheng
@ 2010-11-17  2:08   ` shaohui.zheng
  -1 siblings, 0 replies; 139+ messages in thread
From: shaohui.zheng @ 2010-11-17  2:08 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: linux-kernel, haicheng.li, lethal, ak, shaohui.zheng,
	Dave Hansen, Christoph Lameter, Haicheng Li, Shaohui Zheng

[-- Attachment #1: 003-hotplug-emulator-userland-interface-to-add-fake-node.patch --]
[-- Type: text/plain, Size: 4571 bytes --]

From: Haicheng Li <haicheng.li@intel.com>

Add a sysfs entry "probe" under /sys/devices/system/node/:

 - to show all fake offlined nodes:
    $ cat /sys/devices/system/node/probe

 - to hotadd a fake offlined node, e.g. nodeid is N:
    $ echo N > /sys/devices/system/node/probe
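
As an illustration of how a value written to such a probe file could be
validated, here is a minimal user-space sketch. It is an assumption-laden
example rather than the patch's store routine: parse_nid() is a hypothetical
helper, it uses the C library's strtol() instead of the kernel's parsing
functions, and it returns -EINVAL for every kind of bad input.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a node id from a sysfs-style write buffer such as "3\n".
 * nr_node_ids is the exclusive upper bound on valid ids.
 * Returns the id, or a negative errno value on bad input.
 * (Hypothetical helper, not part of the patch.)
 */
long parse_nid(const char *buf, long nr_node_ids)
{
	char *end;
	long nid;

	errno = 0;
	nid = strtol(buf, &end, 0);
	if (end == buf || errno == ERANGE)
		return -EINVAL;
	/* allow a single trailing newline, as written by echo */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;
	if (nid < 0 || nid >= nr_node_ids)
		return -EINVAL;
	return nid;
}
```

The point of the sketch is that the parse result is checked before the
range test, so non-numeric input is never mistaken for node 0.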

CC: Dave Hansen <haveblue@us.ibm.com>
CC: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Haicheng Li <haicheng.li@intel.com>
Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
---
Index: linux-hpe4/Documentation/ABI/testing/sysfs-devices-node
===================================================================
--- linux-hpe4.orig/Documentation/ABI/testing/sysfs-devices-node	2010-11-15 17:13:02.433461413 +0800
+++ linux-hpe4/Documentation/ABI/testing/sysfs-devices-node	2010-11-15 17:13:07.093461818 +0800
@@ -5,3 +5,11 @@
 		When this file is written to, all memory within that node
 		will be compacted. When it completes, memory will be freed
 		into blocks which have as many contiguous pages as possible
+
+What:		/sys/devices/system/node/probe
+Date:		Jun 2010
+Contact:	Haicheng Li <haicheng.li@intel.com>
+Description:
+		This file lists all the available hidden nodes. When a nid
+		that appears in this list is written to the file, the hidden
+		node becomes visible.
Index: linux-hpe4/drivers/base/node.c
===================================================================
--- linux-hpe4.orig/drivers/base/node.c	2010-11-15 17:13:02.433461413 +0800
+++ linux-hpe4/drivers/base/node.c	2010-11-15 17:13:07.093461818 +0800
@@ -538,6 +538,25 @@
 	unregister_node(&node_devices[nid]);
 }
 
+#ifdef CONFIG_NODE_HOTPLUG_EMU
+static ssize_t store_nodes_probe(struct sysdev_class *class,
+				  struct sysdev_class_attribute *attr,
+				  const char *buf, size_t count)
+{
+	long nid;
+
+	strict_strtol(buf, 0, &nid);
+	if (nid < 0 || nid > nr_node_ids - 1) {
+		printk(KERN_ERR "Invalid NUMA node id: %ld (0 <= nid < %d).\n",
+			nid, nr_node_ids);
+		return -EPERM;
+	}
+	hotadd_hidden_nodes(nid);
+
+	return count;
+}
+#endif
+
 /*
  * node states attributes
  */
@@ -566,26 +585,35 @@
 	return print_nodes_state(na->state, buf);
 }
 
-#define _NODE_ATTR(name, state) \
+#define _NODE_ATTR_RO(name, state) \
 	{ _SYSDEV_CLASS_ATTR(name, 0444, show_node_state, NULL), state }
 
+#define _NODE_ATTR_RW(name, store_func, state) \
+	{ _SYSDEV_CLASS_ATTR(name, 0644, show_node_state, store_func), state }
+
 static struct node_attr node_state_attr[] = {
-	_NODE_ATTR(possible, N_POSSIBLE),
-	_NODE_ATTR(online, N_ONLINE),
-	_NODE_ATTR(has_normal_memory, N_NORMAL_MEMORY),
-	_NODE_ATTR(has_cpu, N_CPU),
+	[N_POSSIBLE] = _NODE_ATTR_RO(possible, N_POSSIBLE),
+#ifdef CONFIG_NODE_HOTPLUG_EMU
+	[N_HIDDEN] = _NODE_ATTR_RW(probe, store_nodes_probe, N_HIDDEN),
+#endif
+	[N_ONLINE] = _NODE_ATTR_RO(online, N_ONLINE),
+	[N_NORMAL_MEMORY] = _NODE_ATTR_RO(has_normal_memory, N_NORMAL_MEMORY),
 #ifdef CONFIG_HIGHMEM
-	_NODE_ATTR(has_high_memory, N_HIGH_MEMORY),
+	[N_HIGH_MEMORY] = _NODE_ATTR_RO(has_high_memory, N_HIGH_MEMORY),
 #endif
+	[N_CPU] = _NODE_ATTR_RO(has_cpu, N_CPU),
 };
 
 static struct sysdev_class_attribute *node_state_attrs[] = {
-	&node_state_attr[0].attr,
-	&node_state_attr[1].attr,
-	&node_state_attr[2].attr,
-	&node_state_attr[3].attr,
+	&node_state_attr[N_POSSIBLE].attr,
+#ifdef CONFIG_NODE_HOTPLUG_EMU
+	&node_state_attr[N_HIDDEN].attr,
+#endif
+	&node_state_attr[N_ONLINE].attr,
+	&node_state_attr[N_NORMAL_MEMORY].attr,
+	&node_state_attr[N_CPU].attr,
 #ifdef CONFIG_HIGHMEM
-	&node_state_attr[4].attr,
+	&node_state_attr[N_HIGH_MEMORY].attr,
 #endif
 	NULL
 };
Index: linux-hpe4/mm/Kconfig
===================================================================
--- linux-hpe4.orig/mm/Kconfig	2010-11-15 17:13:02.443461606 +0800
+++ linux-hpe4/mm/Kconfig	2010-11-15 17:21:05.535335091 +0800
@@ -147,6 +147,21 @@
 	depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
 	depends on MIGRATION
 
+config NUMA_HOTPLUG_EMU
+	bool "NUMA hotplug emulator"
+	depends on X86_64 && NUMA && MEMORY_HOTPLUG
+
+	---help---
+
+config NODE_HOTPLUG_EMU
+	bool "Node hotplug emulation"
+	depends on NUMA_HOTPLUG_EMU && MEMORY_HOTPLUG
+	---help---
+	  Enable Node hotplug emulation. The machine will be set up with
+	  hidden virtual nodes when booted with "numa=hide=N*size", where
+	  N is the number of hidden nodes and size is the memory size (in
+	  megabytes) of each hidden node. This is only useful for debugging.
+
 #
 # If we have space for more page flags then we can enable additional
 # optimizations and functionality.

-- 
Thanks & Regards,
Shaohui
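
The node.c change in the patch above switches the attribute table from
positional entries to entries indexed by the node_states enum. A minimal
stand-alone sketch of that C technique (designated initializers keyed by an
enum) follows; the enum, struct and function names here are hypothetical
stand-ins, not the kernel's types.

```c
#include <stddef.h>

/* hypothetical stand-in for the kernel's enum node_states */
enum node_state_id {
	ST_POSSIBLE,
	ST_HIDDEN,	/* optional entry inserted in the middle */
	ST_ONLINE,
	ST_NR
};

struct state_attr {
	const char *name;
	unsigned int mode;
};

/*
 * Each slot is keyed by its enum value, so inserting a new state in the
 * middle of the enum cannot misalign the table.
 */
const struct state_attr state_attrs[ST_NR] = {
	[ST_POSSIBLE] = { "possible", 0444 },
	[ST_HIDDEN]   = { "probe",    0644 },
	[ST_ONLINE]   = { "online",   0444 },
};

/* look up the attribute name for a state, or NULL if out of range */
const char *state_name(enum node_state_id id)
{
	if ((int)id < 0 || id >= ST_NR)
		return NULL;
	return state_attrs[id].name;
}
```

With positional entries, adding a state in the middle of the enum shifts
every later index; with designated initializers the table follows the enum
automatically.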



^ permalink raw reply	[flat|nested] 139+ messages in thread

* [4/8,v3] NUMA Hotplug Emulator: Abstract cpu register functions
  2010-11-17  2:07 ` shaohui.zheng
@ 2010-11-17  2:08   ` shaohui.zheng
  -1 siblings, 0 replies; 139+ messages in thread
From: shaohui.zheng @ 2010-11-17  2:08 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: linux-kernel, haicheng.li, lethal, ak, shaohui.zheng, Shaohui Zheng

[-- Attachment #1: 004-hotplug-emulator-x86-abstract-cpu-register-functions.patch --]
[-- Type: text/plain, Size: 3359 bytes --]

From: Shaohui Zheng <shaohui.zheng@intel.com>

Abstract the cpu register functions and provide a more flexible interface,
register_cpu_node. The new interface makes it convenient to add a cpu
to a specified node; we can use it to add a cpu to a fake node.
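
The refactoring pattern described here (a general function that takes the
node explicitly, plus a thin wrapper that keeps the old signature) can be
sketched in plain C as below. All names are hypothetical stand-ins:
struct cpu_dev, default_node_of() and the two register functions only mimic
the shape of the kernel interfaces and do no real registration.

```c
#include <stddef.h>

/* hypothetical stand-in for the kernel's struct cpu */
struct cpu_dev {
	int id;
	int node_id;
};

/* hypothetical stand-in for the kernel's cpu_to_node() lookup */
int default_node_of(int num)
{
	return num / 4;	/* pretend there are four CPUs per node */
}

/* general form: the caller chooses the node */
int register_cpu_on_node(struct cpu_dev *cpu, int num, int nid)
{
	if (cpu == NULL || num < 0 || nid < 0)
		return -1;
	cpu->id = num;
	cpu->node_id = nid;
	return 0;
}

/* legacy form: old callers keep working by delegating with the default node */
int register_cpu_default(struct cpu_dev *cpu, int num)
{
	return register_cpu_on_node(cpu, num, default_node_of(num));
}
```

Existing callers continue to use the two-argument form unchanged, while
emulation code can pass an arbitrary (possibly fake) node id.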

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
---
Index: linux-hpe4/arch/x86/include/asm/cpu.h
===================================================================
--- linux-hpe4.orig/arch/x86/include/asm/cpu.h	2010-11-17 09:00:59.742608402 +0800
+++ linux-hpe4/arch/x86/include/asm/cpu.h	2010-11-17 09:01:10.192838977 +0800
@@ -27,6 +27,7 @@
 
 #ifdef CONFIG_HOTPLUG_CPU
 extern int arch_register_cpu(int num);
+extern int arch_register_cpu_node(int num, int nid);
 extern void arch_unregister_cpu(int);
 #endif
 
Index: linux-hpe4/arch/x86/kernel/topology.c
===================================================================
--- linux-hpe4.orig/arch/x86/kernel/topology.c	2010-11-17 09:01:01.053461766 +0800
+++ linux-hpe4/arch/x86/kernel/topology.c	2010-11-17 10:05:32.934085248 +0800
@@ -52,6 +52,15 @@
 }
 EXPORT_SYMBOL(arch_register_cpu);
 
+int __ref arch_register_cpu_node(int num, int nid)
+{
+	if (num)
+		per_cpu(cpu_devices, num).cpu.hotpluggable = 1;
+
+	return register_cpu_node(&per_cpu(cpu_devices, num).cpu, num, nid);
+}
+EXPORT_SYMBOL(arch_register_cpu_node);
+
 void arch_unregister_cpu(int num)
 {
 	unregister_cpu(&per_cpu(cpu_devices, num).cpu);
Index: linux-hpe4/drivers/base/cpu.c
===================================================================
--- linux-hpe4.orig/drivers/base/cpu.c	2010-11-17 09:01:01.053461766 +0800
+++ linux-hpe4/drivers/base/cpu.c	2010-11-17 10:05:32.943465010 +0800
@@ -208,17 +208,18 @@
 static SYSDEV_CLASS_ATTR(offline, 0444, print_cpus_offline, NULL);
 
 /*
- * register_cpu - Setup a sysfs device for a CPU.
+ * register_cpu_node - Setup a sysfs device for a CPU.
  * @cpu - cpu->hotpluggable field set to 1 will generate a control file in
  *	  sysfs for this CPU.
  * @num - CPU number to use when creating the device.
+ * @nid - Node ID to use, if any.
  *
  * Initialize and register the CPU device.
  */
-int __cpuinit register_cpu(struct cpu *cpu, int num)
+int __cpuinit register_cpu_node(struct cpu *cpu, int num, int nid)
 {
 	int error;
-	cpu->node_id = cpu_to_node(num);
+	cpu->node_id = nid;
 	cpu->sysdev.id = num;
 	cpu->sysdev.cls = &cpu_sysdev_class;
 
@@ -229,7 +230,7 @@
 	if (!error)
 		per_cpu(cpu_sys_devices, num) = &cpu->sysdev;
 	if (!error)
-		register_cpu_under_node(num, cpu_to_node(num));
+		register_cpu_under_node(num, nid);
 
 #ifdef CONFIG_KEXEC
 	if (!error)
Index: linux-hpe4/include/linux/cpu.h
===================================================================
--- linux-hpe4.orig/include/linux/cpu.h	2010-11-17 09:00:59.772898926 +0800
+++ linux-hpe4/include/linux/cpu.h	2010-11-17 10:05:32.954085309 +0800
@@ -30,7 +30,13 @@
 	struct sys_device sysdev;
 };
 
-extern int register_cpu(struct cpu *cpu, int num);
+extern int register_cpu_node(struct cpu *cpu, int num, int nid);
+
+static inline int register_cpu(struct cpu *cpu, int num)
+{
+	return register_cpu_node(cpu, num, cpu_to_node(num));
+}
+
 extern struct sys_device *get_cpu_sysdev(unsigned cpu);
 
 extern int cpu_add_sysdev_attr(struct sysdev_attribute *attr);

-- 
Thanks & Regards,
Shaohui



^ permalink raw reply	[flat|nested] 139+ messages in thread

* [5/8,v3] NUMA Hotplug Emulator: support cpu probe/release in x86
  2010-11-17  2:07 ` shaohui.zheng
@ 2010-11-17  2:08   ` shaohui.zheng
  -1 siblings, 0 replies; 139+ messages in thread
From: shaohui.zheng @ 2010-11-17  2:08 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: linux-kernel, haicheng.li, lethal, ak, shaohui.zheng,
	Ingo Molnar, Len Brown, Yinghai Lu, Shaohui Zheng, Haicheng Li

[-- Attachment #1: 005-hotplug-emulator-x86-support-cpu-probe-release-in-x86.patch --]
[-- Type: text/plain, Size: 10627 bytes --]

From: Shaohui Zheng <shaohui.zheng@intel.com>

Add a cpu probe/release interface under sysfs for x86. Users can use this
interface to emulate the cpu hot-add process for cpu hotplug testing.
Add a kernel option, CONFIG_ARCH_CPU_PROBE_RELEASE, for this feature.

This interface provides a mechanism to emulate cpu hotplug in software,
which makes cpu hotplug automation and stress testing possible.

Directions:
*) Reserve CPUs through a grub parameter like:
	maxcpus=4

The remaining CPUs will not be initialized.

*) Probe a CPU
Use the probe interface to hot-add a new CPU:
	echo nid > /sys/devices/system/cpu/probe

*) Release a CPU
	echo cpu > /sys/devices/system/cpu/release

A reserved CPU will be hot-added to the specified node:
1) nid == 0: the CPU is added to the real node that the CPU
belongs to
2) nid != 0: the CPU is added to node nid, even if it is a fake node.
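
The probe step has to pick a reserved CPU that is present but has not been
registered yet. A minimal user-space sketch of that selection is shown
below. It is only an illustration under assumptions: the boolean arrays
stand in for the kernel's present mask and per-cpu device pointers,
first_unregistered_cpu() is a hypothetical name, and the sketch returns an
explicit -1 when no free CPU exists.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return the index of the first CPU that is present but not yet
 * registered, or -1 if every present CPU is already registered.
 * (Hypothetical helper, not part of the patch.)
 */
int first_unregistered_cpu(const bool present[], const bool registered[],
			   size_t ncpus)
{
	size_t i;

	for (i = 0; i < ncpus; i++) {
		if (present[i] && !registered[i])
			return (int)i;
	}
	return -1;
}
```

Using a distinct "not found" value keeps the no-free-CPU case separate from
a valid CPU index such as 0.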

CC: Ingo Molnar <mingo@elte.hu>
CC: Len Brown <len.brown@intel.com>
CC: Yinghai Lu <Yinghai.Lu@Sun.COM>
Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
Signed-off-by: Haicheng Li <haicheng.li@intel.com>
---
Index: linux-hpe4/arch/x86/kernel/acpi/boot.c
===================================================================
--- linux-hpe4.orig/arch/x86/kernel/acpi/boot.c	2010-11-17 09:00:59.742608402 +0800
+++ linux-hpe4/arch/x86/kernel/acpi/boot.c	2010-11-17 09:01:10.202837209 +0800
@@ -647,8 +647,44 @@
 }
 EXPORT_SYMBOL(acpi_map_lsapic);
 
+#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
+static void acpi_map_cpu2node_emu(int cpu, int physid, int nid)
+{
+#ifdef CONFIG_ACPI_NUMA
+#ifdef CONFIG_X86_64
+	apicid_to_node[physid] = nid;
+	numa_set_node(cpu, nid);
+#else /* CONFIG_X86_32 */
+	apicid_2_node[physid] = nid;
+	cpu_to_node_map[cpu] = nid;
+#endif
+#endif
+}
+
+static u16 cpu_to_apicid_saved[CONFIG_NR_CPUS];
+int __ref acpi_map_lsapic_emu(int pcpu, int nid)
+{
+	/* backup cpu apicid to array cpu_to_apicid_saved */
+	if (cpu_to_apicid_saved[pcpu] == 0 &&
+		per_cpu(x86_cpu_to_apicid, pcpu) != BAD_APICID)
+		cpu_to_apicid_saved[pcpu] = per_cpu(x86_cpu_to_apicid, pcpu);
+
+	per_cpu(x86_cpu_to_apicid, pcpu) = cpu_to_apicid_saved[pcpu];
+	acpi_map_cpu2node_emu(pcpu, per_cpu(x86_cpu_to_apicid, pcpu), nid);
+
+	return pcpu;
+}
+EXPORT_SYMBOL(acpi_map_lsapic_emu);
+#endif
+
 int acpi_unmap_lsapic(int cpu)
 {
+#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
+	/* backup cpu apicid to array cpu_to_apicid_saved */
+	if (cpu_to_apicid_saved[cpu] == 0 &&
+		per_cpu(x86_cpu_to_apicid, cpu) != BAD_APICID)
+		cpu_to_apicid_saved[cpu] = per_cpu(x86_cpu_to_apicid, cpu);
+#endif
 	per_cpu(x86_cpu_to_apicid, cpu) = -1;
 	set_cpu_present(cpu, false);
 	num_processors--;
Index: linux-hpe4/arch/x86/kernel/smpboot.c
===================================================================
--- linux-hpe4.orig/arch/x86/kernel/smpboot.c	2010-11-17 09:00:59.753464132 +0800
+++ linux-hpe4/arch/x86/kernel/smpboot.c	2010-11-17 10:05:26.913464702 +0800
@@ -107,8 +107,6 @@
         mutex_unlock(&x86_cpu_hotplug_driver_mutex);
 }
 
-ssize_t arch_cpu_probe(const char *buf, size_t count) { return -1; }
-ssize_t arch_cpu_release(const char *buf, size_t count) { return -1; }
 #else
 static struct task_struct *idle_thread_array[NR_CPUS] __cpuinitdata ;
 #define get_idle_for_cpu(x)      (idle_thread_array[(x)])
Index: linux-hpe4/arch/x86/kernel/topology.c
===================================================================
--- linux-hpe4.orig/arch/x86/kernel/topology.c	2010-11-17 09:01:10.192838977 +0800
+++ linux-hpe4/arch/x86/kernel/topology.c	2010-11-17 10:05:26.924085712 +0800
@@ -30,6 +30,9 @@
 #include <linux/init.h>
 #include <linux/smp.h>
 #include <asm/cpu.h>
+#include <linux/cpu.h>
+#include <linux/topology.h>
+#include <linux/acpi.h>
 
 static DEFINE_PER_CPU(struct x86_cpu, cpu_devices);
 
@@ -66,6 +69,74 @@
 	unregister_cpu(&per_cpu(cpu_devices, num).cpu);
 }
 EXPORT_SYMBOL(arch_unregister_cpu);
+
+ssize_t arch_cpu_probe(const char *buf, size_t count)
+{
+	int nid = 0;
+	int num = 0, selected = 0;
+
+	/* check parameters */
+	if (!buf || count < 2)
+		return -EPERM;
+
+	nid = simple_strtoul(buf, NULL, 0);
+	printk(KERN_DEBUG "Add a cpu to node : %d\n", nid);
+
+	if (nid < 0 || nid > nr_node_ids - 1) {
+		printk(KERN_ERR "Invalid NUMA node id: %d (0 <= nid < %d).\n",
+			nid, nr_node_ids);
+		return -EPERM;
+	}
+
+	if (!node_online(nid)) {
+		printk(KERN_ERR "NUMA node %d is not online, give up.\n", nid);
+		return -EPERM;
+	}
+
+	/* find first uninitialized cpu */
+	for_each_present_cpu(num) {
+		if (per_cpu(cpu_sys_devices, num) == NULL) {
+			selected = num;
+			break;
+		}
+	}
+
+	if (selected >= num_possible_cpus()) {
+		printk(KERN_ERR "No free cpu, give up cpu probing.\n");
+		return -EPERM;
+	}
+
+	/* register cpu */
+	arch_register_cpu_node(selected, nid);
+	acpi_map_lsapic_emu(selected, nid);
+
+	return count;
+}
+EXPORT_SYMBOL(arch_cpu_probe);
+
+ssize_t arch_cpu_release(const char *buf, size_t count)
+{
+	int cpu = 0;
+
+	cpu =  simple_strtoul(buf, NULL, 0);
+	/* cpu 0 is not hotpluggable */
+	if (cpu == 0) {
+		printk(KERN_ERR "can not release cpu 0.\n");
+		return -EPERM;
+	}
+
+	if (cpu_online(cpu)) {
+		printk(KERN_DEBUG "offline cpu %d.\n", cpu);
+		cpu_down(cpu);
+	}
+
+	arch_unregister_cpu(cpu);
+	acpi_unmap_lsapic(cpu);
+
+	return count;
+}
+EXPORT_SYMBOL(arch_cpu_release);
+
 #else /* CONFIG_HOTPLUG_CPU */
 
 static int __init arch_register_cpu(int num)
@@ -83,8 +154,14 @@
 		register_one_node(i);
 #endif
 
-	for_each_present_cpu(i)
-		arch_register_cpu(i);
+	/*
+	 * When cpu hotplug emulation is enabled, register only the online
+	 * cpus; the rest are reserved for cpu probe.
+	 */
+	for_each_present_cpu(i) {
+		if ((cpu_hpe_on && cpu_online(i)) || !cpu_hpe_on)
+			arch_register_cpu(i);
+	}
 
 	return 0;
 }
Index: linux-hpe4/arch/x86/mm/numa_64.c
===================================================================
--- linux-hpe4.orig/arch/x86/mm/numa_64.c	2010-11-17 09:01:10.132837502 +0800
+++ linux-hpe4/arch/x86/mm/numa_64.c	2010-11-17 09:01:10.202837209 +0800
@@ -12,6 +12,7 @@
 #include <linux/module.h>
 #include <linux/nodemask.h>
 #include <linux/sched.h>
+#include <linux/cpu.h>
 
 #include <asm/e820.h>
 #include <asm/proto.h>
@@ -915,6 +916,19 @@
 }
 #endif
 
+#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
+static __init int cpu_hpe_setup(char *opt)
+{
+	if (!opt)
+		return -EINVAL;
+
+	if (!strncmp(opt, "on", 2) || !strncmp(opt, "1", 1))
+		cpu_hpe_on = 1;
+
+	return 0;
+}
+early_param("cpu_hpe", cpu_hpe_setup);
+#endif  /* CONFIG_ARCH_CPU_PROBE_RELEASE */
 
 void __cpuinit numa_set_node(int cpu, int node)
 {
Index: linux-hpe4/drivers/acpi/processor_driver.c
===================================================================
--- linux-hpe4.orig/drivers/acpi/processor_driver.c	2010-11-17 09:00:59.765335724 +0800
+++ linux-hpe4/drivers/acpi/processor_driver.c	2010-11-17 09:01:10.212839478 +0800
@@ -530,6 +530,14 @@
 		goto err_free_cpumask;
 
 	sysdev = get_cpu_sysdev(pr->id);
+	/*
+	 * Reserve cpu for hotplug emulation, the reserved cpu can be hot-added
+	 * through the cpu probe interface. Return directly.
+	 */
+	if (sysdev == NULL) {
+		goto out;
+	}
+
 	if (sysfs_create_link(&device->dev.kobj, &sysdev->kobj, "sysdev")) {
 		result = -EFAULT;
 		goto err_remove_fs;
@@ -570,6 +578,7 @@
 		goto err_remove_sysfs;
 	}
 
+out:
 	return 0;
 
 err_remove_sysfs:
Index: linux-hpe4/drivers/base/cpu.c
===================================================================
--- linux-hpe4.orig/drivers/base/cpu.c	2010-11-17 09:01:10.192838977 +0800
+++ linux-hpe4/drivers/base/cpu.c	2010-11-17 09:01:10.212839478 +0800
@@ -22,9 +22,15 @@
 };
 EXPORT_SYMBOL(cpu_sysdev_class);
 
-static DEFINE_PER_CPU(struct sys_device *, cpu_sys_devices);
+DEFINE_PER_CPU(struct sys_device *, cpu_sys_devices);
 
 #ifdef CONFIG_HOTPLUG_CPU
+/*
+ * cpu_hpe_on is a switch to enable/disable cpu hotplug emulation. It is
+ * disabled by default; it can be enabled through the grub parameter cpu_hpe=on
+ */
+int cpu_hpe_on;
+
 static ssize_t show_online(struct sys_device *dev, struct sysdev_attribute *attr,
 			   char *buf)
 {
Index: linux-hpe4/include/linux/acpi.h
===================================================================
--- linux-hpe4.orig/include/linux/acpi.h	2010-11-17 09:00:59.772898926 +0800
+++ linux-hpe4/include/linux/acpi.h	2010-11-17 09:01:10.212839478 +0800
@@ -102,6 +102,7 @@
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
 /* Arch dependent functions for cpu hotplug support */
 int acpi_map_lsapic(acpi_handle handle, int *pcpu);
+int acpi_map_lsapic_emu(int pcpu, int nid);
 int acpi_unmap_lsapic(int cpu);
 #endif /* CONFIG_ACPI_HOTPLUG_CPU */
 
Index: linux-hpe4/include/linux/cpu.h
===================================================================
--- linux-hpe4.orig/include/linux/cpu.h	2010-11-17 09:01:10.192838977 +0800
+++ linux-hpe4/include/linux/cpu.h	2010-11-17 09:01:10.212839478 +0800
@@ -30,6 +30,8 @@
 	struct sys_device sysdev;
 };
 
+DECLARE_PER_CPU(struct sys_device *, cpu_sys_devices);
+
 extern int register_cpu_node(struct cpu *cpu, int num, int nid);
 
 static inline int register_cpu(struct cpu *cpu, int num)
@@ -149,6 +151,7 @@
 #define register_hotcpu_notifier(nb)	register_cpu_notifier(nb)
 #define unregister_hotcpu_notifier(nb)	unregister_cpu_notifier(nb)
 int cpu_down(unsigned int cpu);
+extern int cpu_hpe_on;
 
 #ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
 extern void cpu_hotplug_driver_lock(void);
@@ -171,6 +174,7 @@
 /* These aren't inline functions due to a GCC bug. */
 #define register_hotcpu_notifier(nb)	({ (void)(nb); 0; })
 #define unregister_hotcpu_notifier(nb)	({ (void)(nb); })
+static int cpu_hpe_on;
 #endif		/* CONFIG_HOTPLUG_CPU */
 
 #ifdef CONFIG_PM_SLEEP_SMP
Index: linux-hpe4/mm/Kconfig
===================================================================
--- linux-hpe4.orig/mm/Kconfig	2010-11-17 09:01:10.192838977 +0800
+++ linux-hpe4/mm/Kconfig	2010-11-17 10:05:20.994710783 +0800
@@ -162,6 +162,17 @@
 	  N is the number of hidden nodes, size is the memory size per
 	  hidden node. This is only useful for debugging.
 
+config ARCH_CPU_PROBE_RELEASE
+	def_bool y
+	bool "CPU hotplug emulation"
+	depends on NUMA_HOTPLUG_EMU
+	---help---
+	  Enable cpu hotplug emulation. Reserve cpus with the boot parameter
+	  "maxcpus=N", where N is the initial number of CPUs; the remaining
+	  physical CPUs will not be initialized. A probe/release interface
+	  allows cpu hot-add/hot-remove to a specified node from software.
+	  This is for debugging and testing purposes.
+
 #
 # If we have space for more page flags then we can enable additional
 # optimizations and functionality.

-- 
Thanks & Regards,
Shaohui



^ permalink raw reply	[flat|nested] 139+ messages in thread

* [5/8,v3] NUMA Hotplug Emulator: support cpu probe/release in x86
@ 2010-11-17  2:08   ` shaohui.zheng
  0 siblings, 0 replies; 139+ messages in thread
From: shaohui.zheng @ 2010-11-17  2:08 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: linux-kernel, haicheng.li, lethal, ak, shaohui.zheng,
	Ingo Molnar, Len Brown, Yinghai Lu, Shaohui Zheng, Haicheng Li

[-- Attachment #1: 005-hotplug-emulator-x86-support-cpu-probe-release-in-x86.patch --]
[-- Type: text/plain, Size: 10923 bytes --]

From: Shaohui Zheng <shaohui.zheng@intel.com>

Add a cpu probe/release interface under sysfs for x86. Users can use this
interface to emulate the cpu hot-add process; it is intended for cpu hotplug
testing. Add a kernel option, CONFIG_ARCH_CPU_PROBE_RELEASE, for this
feature.

This interface provides a mechanism to emulate cpu hotplug in software,
which makes it possible to automate and stress-test cpu hotplug.

Usage:
*) Reserve CPUs through a boot parameter such as:
	maxcpus=4

The remaining CPUs will not be initialized.

*) Probe a CPU
Use the probe interface to hot-add a new CPU:
	echo nid > /sys/devices/system/cpu/probe

*) Release a CPU
	echo cpu > /sys/devices/system/cpu/release

A reserved CPU will be hot-added to the specified node.
1) nid == 0: the CPU will be added to the real node that the CPU
should be in.
2) nid != 0: the CPU is added to node nid even though it is a fake node.
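
As a rough illustration of the probe step, the user-space sketch below
models how a reserved CPU could be chosen: it returns the first CPU that is
present but has no sysfs device registered yet. The function name
pick_reserved_cpu() and the two arrays are hypothetical stand-ins for the
kernel's present mask and the per-cpu cpu_sys_devices pointers, and
returning -1 when nothing is free is an assumption of this sketch, not
necessarily what the patch does.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical user-space model of the CPU selection done on probe.
 * present[i]    - CPU i is physically present
 * registered[i] - CPU i already has a sysfs device (brought up at boot
 *                 or probed earlier)
 * Returns the first reserved CPU, or -1 if none is left (assumed policy).
 */
int pick_reserved_cpu(const bool *present, const bool *registered,
		      size_t ncpus)
{
	for (size_t i = 0; i < ncpus; i++) {
		if (present[i] && !registered[i])
			return (int)i;
	}
	return -1;
}
```

In this model, booting an 8-CPU machine with maxcpus=4 leaves CPUs 4-7
reserved, so the first probe would select CPU 4.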

CC: Ingo Molnar <mingo@elte.hu>
CC: Len Brown <len.brown@intel.com>
CC: Yinghai Lu <Yinghai.Lu@Sun.COM>
Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
Signed-off-by: Haicheng Li <haicheng.li@intel.com>
---
Index: linux-hpe4/arch/x86/kernel/acpi/boot.c
===================================================================
--- linux-hpe4.orig/arch/x86/kernel/acpi/boot.c	2010-11-17 09:00:59.742608402 +0800
+++ linux-hpe4/arch/x86/kernel/acpi/boot.c	2010-11-17 09:01:10.202837209 +0800
@@ -647,8 +647,44 @@
 }
 EXPORT_SYMBOL(acpi_map_lsapic);
 
+#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
+static void acpi_map_cpu2node_emu(int cpu, int physid, int nid)
+{
+#ifdef CONFIG_ACPI_NUMA
+#ifdef CONFIG_X86_64
+	apicid_to_node[physid] = nid;
+	numa_set_node(cpu, nid);
+#else /* CONFIG_X86_32 */
+	apicid_2_node[physid] = nid;
+	cpu_to_node_map[cpu] = nid;
+#endif
+#endif
+}
+
+static u16 cpu_to_apicid_saved[CONFIG_NR_CPUS];
+int __ref acpi_map_lsapic_emu(int pcpu, int nid)
+{
+	/* backup cpu apicid to array cpu_to_apicid_saved */
+	if (cpu_to_apicid_saved[pcpu] == 0 &&
+		per_cpu(x86_cpu_to_apicid, pcpu) != BAD_APICID)
+		cpu_to_apicid_saved[pcpu] = per_cpu(x86_cpu_to_apicid, pcpu);
+
+	per_cpu(x86_cpu_to_apicid, pcpu) = cpu_to_apicid_saved[pcpu];
+	acpi_map_cpu2node_emu(pcpu, per_cpu(x86_cpu_to_apicid, pcpu), nid);
+
+	return pcpu;
+}
+EXPORT_SYMBOL(acpi_map_lsapic_emu);
+#endif
+
 int acpi_unmap_lsapic(int cpu)
 {
+#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
+	/* backup cpu apicid to array cpu_to_apicid_saved */
+	if (cpu_to_apicid_saved[cpu] == 0 &&
+		per_cpu(x86_cpu_to_apicid, cpu) != BAD_APICID)
+		cpu_to_apicid_saved[cpu] = per_cpu(x86_cpu_to_apicid, cpu);
+#endif
 	per_cpu(x86_cpu_to_apicid, cpu) = -1;
 	set_cpu_present(cpu, false);
 	num_processors--;
Index: linux-hpe4/arch/x86/kernel/smpboot.c
===================================================================
--- linux-hpe4.orig/arch/x86/kernel/smpboot.c	2010-11-17 09:00:59.753464132 +0800
+++ linux-hpe4/arch/x86/kernel/smpboot.c	2010-11-17 10:05:26.913464702 +0800
@@ -107,8 +107,6 @@
         mutex_unlock(&x86_cpu_hotplug_driver_mutex);
 }
 
-ssize_t arch_cpu_probe(const char *buf, size_t count) { return -1; }
-ssize_t arch_cpu_release(const char *buf, size_t count) { return -1; }
 #else
 static struct task_struct *idle_thread_array[NR_CPUS] __cpuinitdata ;
 #define get_idle_for_cpu(x)      (idle_thread_array[(x)])
Index: linux-hpe4/arch/x86/kernel/topology.c
===================================================================
--- linux-hpe4.orig/arch/x86/kernel/topology.c	2010-11-17 09:01:10.192838977 +0800
+++ linux-hpe4/arch/x86/kernel/topology.c	2010-11-17 10:05:26.924085712 +0800
@@ -30,6 +30,9 @@
 #include <linux/init.h>
 #include <linux/smp.h>
 #include <asm/cpu.h>
+#include <linux/cpu.h>
+#include <linux/topology.h>
+#include <linux/acpi.h>
 
 static DEFINE_PER_CPU(struct x86_cpu, cpu_devices);
 
@@ -66,6 +69,74 @@
 	unregister_cpu(&per_cpu(cpu_devices, num).cpu);
 }
 EXPORT_SYMBOL(arch_unregister_cpu);
+
+ssize_t arch_cpu_probe(const char *buf, size_t count)
+{
+	int nid = 0;
+	int num = 0, selected = 0;
+
+	/* check parameters */
+	if (!buf || count < 2)
+		return -EPERM;
+
+	nid = simple_strtoul(buf, NULL, 0);
+	printk(KERN_DEBUG "Add a cpu to node : %d\n", nid);
+
+	if (nid < 0 || nid > nr_node_ids - 1) {
+		printk(KERN_ERR "Invalid NUMA node id: %d (0 <= nid < %d).\n",
+			nid, nr_node_ids);
+		return -EPERM;
+	}
+
+	if (!node_online(nid)) {
+		printk(KERN_ERR "NUMA node %d is not online, give up.\n", nid);
+		return -EPERM;
+	}
+
+	/* find first uninitialized cpu */
+	for_each_present_cpu(num) {
+		if (per_cpu(cpu_sys_devices, num) == NULL) {
+			selected = num;
+			break;
+		}
+	}
+
+	if (selected >= num_possible_cpus()) {
+		printk(KERN_ERR "No free cpu, give up cpu probing.\n");
+		return -EPERM;
+	}
+
+	/* register cpu */
+	arch_register_cpu_node(selected, nid);
+	acpi_map_lsapic_emu(selected, nid);
+
+	return count;
+}
+EXPORT_SYMBOL(arch_cpu_probe);
+
+ssize_t arch_cpu_release(const char *buf, size_t count)
+{
+	int cpu = 0;
+
+	cpu = simple_strtoul(buf, NULL, 0);
+	/* cpu 0 is not hotpluggable */
+	if (cpu == 0) {
+		printk(KERN_ERR "can not release cpu 0.\n");
+		return -EPERM;
+	}
+
+	if (cpu_online(cpu)) {
+		printk(KERN_DEBUG "offline cpu %d.\n", cpu);
+		cpu_down(cpu);
+	}
+
+	arch_unregister_cpu(cpu);
+	acpi_unmap_lsapic(cpu);
+
+	return count;
+}
+EXPORT_SYMBOL(arch_cpu_release);
+
 #else /* CONFIG_HOTPLUG_CPU */
 
 static int __init arch_register_cpu(int num)
@@ -83,8 +154,14 @@
 		register_one_node(i);
 #endif
 
-	for_each_present_cpu(i)
-		arch_register_cpu(i);
+	/*
+	 * When cpu hotplug emulation is enabled, register only the online
+	 * cpus; the rest are reserved for cpu probe.
+	 */
+	for_each_present_cpu(i) {
+		if (!cpu_hpe_on || cpu_online(i))
+			arch_register_cpu(i);
+	}
 
 	return 0;
 }
Index: linux-hpe4/arch/x86/mm/numa_64.c
===================================================================
--- linux-hpe4.orig/arch/x86/mm/numa_64.c	2010-11-17 09:01:10.132837502 +0800
+++ linux-hpe4/arch/x86/mm/numa_64.c	2010-11-17 09:01:10.202837209 +0800
@@ -12,6 +12,7 @@
 #include <linux/module.h>
 #include <linux/nodemask.h>
 #include <linux/sched.h>
+#include <linux/cpu.h>
 
 #include <asm/e820.h>
 #include <asm/proto.h>
@@ -915,6 +916,19 @@
 }
 #endif
 
+#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
+static __init int cpu_hpe_setup(char *opt)
+{
+	if (!opt)
+		return -EINVAL;
+
+	if (!strncmp(opt, "on", 2) || !strncmp(opt, "1", 1))
+		cpu_hpe_on = 1;
+
+	return 0;
+}
+early_param("cpu_hpe", cpu_hpe_setup);
+#endif  /* CONFIG_ARCH_CPU_PROBE_RELEASE */
 
 void __cpuinit numa_set_node(int cpu, int node)
 {
Index: linux-hpe4/drivers/acpi/processor_driver.c
===================================================================
--- linux-hpe4.orig/drivers/acpi/processor_driver.c	2010-11-17 09:00:59.765335724 +0800
+++ linux-hpe4/drivers/acpi/processor_driver.c	2010-11-17 09:01:10.212839478 +0800
@@ -530,6 +530,14 @@
 		goto err_free_cpumask;
 
 	sysdev = get_cpu_sysdev(pr->id);
+	/*
+	 * Reserve cpu for hotplug emulation; the reserved cpu can be hot-added
+	 * through the cpu probe interface. Return directly.
+	 */
+	if (sysdev == NULL) {
+		goto out;
+	}
+
 	if (sysfs_create_link(&device->dev.kobj, &sysdev->kobj, "sysdev")) {
 		result = -EFAULT;
 		goto err_remove_fs;
@@ -570,6 +578,7 @@
 		goto err_remove_sysfs;
 	}
 
+out:
 	return 0;
 
 err_remove_sysfs:
Index: linux-hpe4/drivers/base/cpu.c
===================================================================
--- linux-hpe4.orig/drivers/base/cpu.c	2010-11-17 09:01:10.192838977 +0800
+++ linux-hpe4/drivers/base/cpu.c	2010-11-17 09:01:10.212839478 +0800
@@ -22,9 +22,15 @@
 };
 EXPORT_SYMBOL(cpu_sysdev_class);
 
-static DEFINE_PER_CPU(struct sys_device *, cpu_sys_devices);
+DEFINE_PER_CPU(struct sys_device *, cpu_sys_devices);
 
 #ifdef CONFIG_HOTPLUG_CPU
+/*
+ * cpu_hpe_on is a switch to enable/disable cpu hotplug emulation. It is
+ * disabled by default; enable it with the boot parameter cpu_hpe=on.
+ */
+int cpu_hpe_on;
+
 static ssize_t show_online(struct sys_device *dev, struct sysdev_attribute *attr,
 			   char *buf)
 {
Index: linux-hpe4/include/linux/acpi.h
===================================================================
--- linux-hpe4.orig/include/linux/acpi.h	2010-11-17 09:00:59.772898926 +0800
+++ linux-hpe4/include/linux/acpi.h	2010-11-17 09:01:10.212839478 +0800
@@ -102,6 +102,7 @@
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
 /* Arch dependent functions for cpu hotplug support */
 int acpi_map_lsapic(acpi_handle handle, int *pcpu);
+int acpi_map_lsapic_emu(int pcpu, int nid);
 int acpi_unmap_lsapic(int cpu);
 #endif /* CONFIG_ACPI_HOTPLUG_CPU */
 
Index: linux-hpe4/include/linux/cpu.h
===================================================================
--- linux-hpe4.orig/include/linux/cpu.h	2010-11-17 09:01:10.192838977 +0800
+++ linux-hpe4/include/linux/cpu.h	2010-11-17 09:01:10.212839478 +0800
@@ -30,6 +30,8 @@
 	struct sys_device sysdev;
 };
 
+DECLARE_PER_CPU(struct sys_device *, cpu_sys_devices);
+
 extern int register_cpu_node(struct cpu *cpu, int num, int nid);
 
 static inline int register_cpu(struct cpu *cpu, int num)
@@ -149,6 +151,7 @@
 #define register_hotcpu_notifier(nb)	register_cpu_notifier(nb)
 #define unregister_hotcpu_notifier(nb)	unregister_cpu_notifier(nb)
 int cpu_down(unsigned int cpu);
+extern int cpu_hpe_on;
 
 #ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
 extern void cpu_hotplug_driver_lock(void);
@@ -171,6 +174,7 @@
 /* These aren't inline functions due to a GCC bug. */
 #define register_hotcpu_notifier(nb)	({ (void)(nb); 0; })
 #define unregister_hotcpu_notifier(nb)	({ (void)(nb); })
+static int cpu_hpe_on;
 #endif		/* CONFIG_HOTPLUG_CPU */
 
 #ifdef CONFIG_PM_SLEEP_SMP
Index: linux-hpe4/mm/Kconfig
===================================================================
--- linux-hpe4.orig/mm/Kconfig	2010-11-17 09:01:10.192838977 +0800
+++ linux-hpe4/mm/Kconfig	2010-11-17 10:05:20.994710783 +0800
@@ -162,6 +162,17 @@
 	  N is the number of hidden nodes, size is the memory size per
 	  hidden node. This is only useful for debugging.
 
+config ARCH_CPU_PROBE_RELEASE
+	def_bool y
+	bool "CPU hotplug emulation"
+	depends on NUMA_HOTPLUG_EMU
+	---help---
+	  Enable cpu hotplug emulation. Reserve cpus with the boot parameter
+	  "maxcpus=N", where N is the initial number of CPUs; the remaining
+	  physical CPUs will not be initialized. A probe/release interface
+	  allows cpu hot-add/hot-remove to a specified node from software.
+	  This is for debugging and testing purposes.
+
 #
 # If we have space for more page flags then we can enable additional
 # optimizations and functionality.

-- 
Thanks & Regards,
Shaohui



^ permalink raw reply	[flat|nested] 139+ messages in thread

* [6/8,v3] NUMA Hotplug Emulator: Fake CPU socket with logical CPU on x86
  2010-11-17  2:07 ` shaohui.zheng
@ 2010-11-17  2:08   ` shaohui.zheng
  -1 siblings, 0 replies; 139+ messages in thread
From: shaohui.zheng @ 2010-11-17  2:08 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: linux-kernel, haicheng.li, lethal, ak, shaohui.zheng,
	Sam Ravnborg, Shaohui Zheng

[-- Attachment #1: 006-hotplug-emulator-fake_socket_with_logic_cpu_on_x86.patch --]
[-- Type: text/plain, Size: 7642 bytes --]

From: Shaohui Zheng <shaohui.zheng@intel.com>

When hotplugging a CPU with the emulator, we use a logical CPU to emulate the
CPU hotplug process. On CPUs that support SMT, some logical CPUs are in the
same socket, but with the emulator they may end up in different NUMA nodes.
This misleads the scheduler into building an incorrect scheduling-domain
hierarchy, and causes the following call trace when rebalancing the
scheduling domains:

divide error: 0000 [#1] SMP 
last sysfs file: /sys/devices/system/cpu/cpu8/online
CPU 0 
Modules linked in: fbcon tileblit font bitblit softcursor radeon ttm drm_kms_helper e1000e usbhid via_rhine mii drm i2c_algo_bit igb dca
Pid: 0, comm: swapper Not tainted 2.6.32hpe #78 X8DTN
RIP: 0010:[<ffffffff81051da5>]  [<ffffffff81051da5>] find_busiest_group+0x6c5/0xa10
RSP: 0018:ffff880028203c30  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000015ac0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff880277e8cfa0 RDI: 0000000000000000
RBP: ffff880028203dc0 R08: ffff880277e8cfa0 R09: 0000000000000040
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f16cfc85770 CR3: 0000000001001000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff81822000, task ffffffff8184a600)
Stack:
 ffff880028203d60 ffff880028203cd0 ffff8801c204ff08 ffff880028203e38
<0> 0101ffff81018c59 ffff880028203e44 00000001810806bd ffff8801c204fe00
<0> 0000000528200000 ffffffff00000000 0000000000000018 0000000000015ac0
Call Trace:
 <IRQ> 
 [<ffffffff81088ee0>] ? tick_dev_program_event+0x40/0xd0
 [<ffffffff81053b2c>] rebalance_domains+0x17c/0x570
 [<ffffffff81018c89>] ? read_tsc+0x9/0x20
 [<ffffffff81088ee0>] ? tick_dev_program_event+0x40/0xd0
 [<ffffffff810569ed>] run_rebalance_domains+0xbd/0xf0
 [<ffffffff8106471f>] __do_softirq+0xaf/0x1e0
 [<ffffffff810b7d18>] ? handle_IRQ_event+0x58/0x160
 [<ffffffff810130ac>] call_softirq+0x1c/0x30
 [<ffffffff81014a85>] do_softirq+0x65/0xa0
 [<ffffffff810645cd>] irq_exit+0x7d/0x90
 [<ffffffff81013ff0>] do_IRQ+0x70/0xe0
 [<ffffffff810128d3>] ret_from_intr+0x0/0x11
 <EOI> 
 [<ffffffff8133387f>] ? acpi_idle_enter_bm+0x281/0x2b5
 [<ffffffff81333878>] ? acpi_idle_enter_bm+0x27a/0x2b5
 [<ffffffff8145dc8f>] ? cpuidle_idle_call+0x9f/0x130
 [<ffffffff81010e2b>] ? cpu_idle+0xab/0x100
 [<ffffffff8158aee6>] ? rest_init+0x66/0x70
 [<ffffffff81905d90>] ? start_kernel+0x3e3/0x3ef
 [<ffffffff8190533a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81905438>] ? x86_64_start_kernel+0xfa/0x109
Code: 00 00 e9 4c fb ff ff 0f 1f 80 00 00 00 00 48 8b b5 d8 fe ff ff 48 8b 45 a8 4d 29 ef 8b 56 08 48 c1 e0 0a 49 89 f0 48 89 d7 31 d2 <48> f7 f7 31 d2 48 89 45 a0 8b 76 08 4c 89 f0 48 c1 e0 0a 48 f7 
RIP  [<ffffffff81051da5>] find_busiest_group+0x6c5/0xa10
 RSP <ffff880028203c30>

Solution:

We put the logical CPU into a fake CPU socket and assign it a unique
phys_proc_id. Each fake socket holds only one logical CPU. This
method fixes the above bug.
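
The id selection described above can be sketched in user space as follows.
This is only a model under stated assumptions: next_fake_phys_id() is a
hypothetical name, and the phys_ids[] array stands in for the phys_proc_id
field of each present CPU's cpuinfo_x86.

```c
#include <stddef.h>

/*
 * Hypothetical model: return an id one larger than the largest package
 * id currently in use, so that the emulated CPU lands in a socket of
 * its own. The search starts from 0, so an empty list yields 1.
 */
int next_fake_phys_id(const int *phys_ids, size_t ncpus)
{
	int max_id = 0;

	for (size_t i = 0; i < ncpus; i++) {
		if (phys_ids[i] > max_id)
			max_id = phys_ids[i];
	}
	return max_id + 1;
}
```

Because every probed CPU takes an id above the current maximum, no two
emulated CPUs share a package id with each other or with a real socket in
this model.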

CC: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
---
Index: linux-hpe4/arch/x86/include/asm/processor.h
===================================================================
--- linux-hpe4.orig/arch/x86/include/asm/processor.h	2010-11-17 09:00:51.354100239 +0800
+++ linux-hpe4/arch/x86/include/asm/processor.h	2010-11-17 09:01:10.222837594 +0800
@@ -113,6 +113,15 @@
 	/* Index into per_cpu list: */
 	u16			cpu_index;
 #endif
+
+#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
+	/*
+	 * Use a logical cpu to emulate a physical cpu's hotplug. We put the
+	 * logical cpu into a fake socket, assign a fake physical id to it,
+	 * and create a fake core.
+	 */
+	__u8		cpu_probe_on; /* A flag to enable cpu probe/release */
+#endif
 } __attribute__((__aligned__(SMP_CACHE_BYTES)));
 
 #define X86_VENDOR_INTEL	0
Index: linux-hpe4/arch/x86/kernel/smpboot.c
===================================================================
--- linux-hpe4.orig/arch/x86/kernel/smpboot.c	2010-11-17 09:01:10.202837209 +0800
+++ linux-hpe4/arch/x86/kernel/smpboot.c	2010-11-17 09:01:10.222837594 +0800
@@ -97,6 +97,7 @@
  */
 static DEFINE_MUTEX(x86_cpu_hotplug_driver_mutex);
 
+#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
 void cpu_hotplug_driver_lock()
 {
         mutex_lock(&x86_cpu_hotplug_driver_mutex);
@@ -106,6 +107,7 @@
 {
         mutex_unlock(&x86_cpu_hotplug_driver_mutex);
 }
+#endif
 
 #else
 static struct task_struct *idle_thread_array[NR_CPUS] __cpuinitdata ;
@@ -198,6 +200,8 @@
 {
 	int cpuid, phys_id;
 	unsigned long timeout;
+	u8 cpu_probe_on = 0;
+	struct cpuinfo_x86 *c;
 
 	/*
 	 * If waken up by an INIT in an 82489DX configuration
@@ -277,7 +281,20 @@
 	/*
 	 * Save our processor parameters
 	 */
+	c = &cpu_data(cpuid);
+#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
+	cpu_probe_on = c->cpu_probe_on;
+	phys_id = c->phys_proc_id;
+#endif
+
 	smp_store_cpu_info(cpuid);
+#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
+	if (cpu_probe_on) {
+		c->phys_proc_id = phys_id; /* restore the fake phys_proc_id */
+		c->cpu_core_id = 0; /* force the logical cpu to core 0 */
+		c->cpu_probe_on = cpu_probe_on;
+	}
+#endif
 
 	notify_cpu_starting(cpuid);
 
@@ -400,6 +417,11 @@
 {
 	int i;
 	struct cpuinfo_x86 *c = &cpu_data(cpu);
+	int cpu_probe_on = 0;
+
+#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
+	cpu_probe_on = c->cpu_probe_on;
+#endif
 
 	cpumask_set_cpu(cpu, cpu_sibling_setup_mask);
 
@@ -431,7 +453,8 @@
 
 	for_each_cpu(i, cpu_sibling_setup_mask) {
 		if (per_cpu(cpu_llc_id, cpu) != BAD_APICID &&
-		    per_cpu(cpu_llc_id, cpu) == per_cpu(cpu_llc_id, i)) {
+		    per_cpu(cpu_llc_id, cpu) == per_cpu(cpu_llc_id, i) &&
+			cpu_probe_on == 0) {
 			cpumask_set_cpu(i, c->llc_shared_map);
 			cpumask_set_cpu(cpu, cpu_data(i).llc_shared_map);
 		}
Index: linux-hpe4/arch/x86/kernel/topology.c
===================================================================
--- linux-hpe4.orig/arch/x86/kernel/topology.c	2010-11-17 09:01:10.202837209 +0800
+++ linux-hpe4/arch/x86/kernel/topology.c	2010-11-17 09:01:10.222837594 +0800
@@ -70,6 +70,36 @@
 }
 EXPORT_SYMBOL(arch_unregister_cpu);
 
+#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
+/*
+ * Put the logical cpu into a new socket, and place it in core 0.
+ */
+static void fake_cpu_socket_info(int cpu)
+{
+	struct cpuinfo_x86 *c = &cpu_data(cpu);
+	int i, phys_id = 0;
+
+	/* calculate the max phys_id */
+	for_each_present_cpu(i) {
+		struct cpuinfo_x86 *c = &cpu_data(i);
+		if (phys_id < c->phys_proc_id)
+			phys_id = c->phys_proc_id;
+	}
+
+	c->phys_proc_id = phys_id + 1; /* pick an unused phys_proc_id */
+	c->cpu_core_id = 0; /* always put the logical cpu to core 0 */
+	c->cpu_probe_on = 1;
+}
+
+static void clear_cpu_socket_info(int cpu)
+{
+	struct cpuinfo_x86 *c = &cpu_data(cpu);
+	c->phys_proc_id = 0;
+	c->cpu_core_id = 0;
+	c->cpu_probe_on = 0;
+}
+
+
 ssize_t arch_cpu_probe(const char *buf, size_t count)
 {
 	int nid = 0;
@@ -109,6 +139,7 @@
 	/* register cpu */
 	arch_register_cpu_node(selected, nid);
 	acpi_map_lsapic_emu(selected, nid);
+	fake_cpu_socket_info(selected);
 
 	return count;
 }
@@ -132,10 +163,13 @@
 
 	arch_unregister_cpu(cpu);
 	acpi_unmap_lsapic(cpu);
+	clear_cpu_socket_info(cpu);
+	set_cpu_present(cpu, true);
 
 	return count;
 }
 EXPORT_SYMBOL(arch_cpu_release);
+#endif /* CONFIG_ARCH_CPU_PROBE_RELEASE */
 
 #else /* CONFIG_HOTPLUG_CPU */
 

-- 
Thanks & Regards,
Shaohui



^ permalink raw reply	[flat|nested] 139+ messages in thread

* [7/8,v3] NUMA Hotplug Emulator: extend memory probe interface to support NUMA
  2010-11-17  2:07 ` shaohui.zheng
@ 2010-11-17  2:08   ` shaohui.zheng
  -1 siblings, 0 replies; 139+ messages in thread
From: shaohui.zheng @ 2010-11-17  2:08 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: linux-kernel, haicheng.li, lethal, ak, shaohui.zheng,
	Dave Hansen, Shaohui Zheng, Haicheng Li, Wu Fengguang

[-- Attachment #1: 007-hotplug-emulator-extend-memory-probe-interface-to-support-numa.patch --]
[-- Type: text/plain, Size: 5602 bytes --]

Extend the memory probe interface to support an extra parameter, nid;
the reserved memory is added to this node if the node exists.

Add a memory section (128M) to node 3 (system booted with mem=1024m):

	echo 0x40000000,3 > memory/probe

To make it friendlier, addresses with size suffixes are also accepted:

	echo 3g > memory/probe
	echo 1024m,3 > memory/probe

Backwards compatibility is maintained.

Another format, suggested by Dave Hansen:

	echo physical_address=0x40000000 numa_node=3 > memory/probe

This format makes the meaning of each parameter more explicit.
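
For illustration, the two input formats could be parsed in user space along
the lines below. This is a minimal sketch, not the code in this patch:
parse_size() and parse_probe() are hypothetical names, the suffix handling
only approximates what the kernel's memparse() accepts, and reporting an
unspecified node as -1 is an assumption of the sketch.

```c
#include <stdlib.h>
#include <string.h>

/* Parse a number with an optional k/m/g suffix, e.g. "0x40000000", "3g". */
static unsigned long long parse_size(const char *s)
{
	char *end;
	unsigned long long v = strtoull(s, &end, 0);

	switch (*end) {
	case 'g': case 'G':
		v <<= 10;
		/* fall through */
	case 'm': case 'M':
		v <<= 10;
		/* fall through */
	case 'k': case 'K':
		v <<= 10;
		break;
	default:
		break;
	}
	return v;
}

/*
 * Hypothetical parser for both probe formats.
 * Returns 0 on success and fills *addr; *nid is -1 if no node was given.
 */
int parse_probe(const char *input, unsigned long long *addr, int *nid)
{
	char buf[128];
	char *tok, *next;
	int have_addr = 0;

	if (strlen(input) >= sizeof(buf))
		return -1;
	strcpy(buf, input);
	*nid = -1;

	if (!strchr(buf, '=')) {
		/* format 1: physical_address[,numa_node] */
		char *comma = strchr(buf, ',');

		if (comma) {
			*comma++ = '\0';
			if (*comma)
				*nid = (int)strtoul(comma, NULL, 0);
		}
		if (buf[0] == '\0')
			return -1;
		*addr = parse_size(buf);
		return 0;
	}

	/* format 2: space-separated key=value pairs, in any order */
	for (tok = buf; tok; tok = next) {
		char *val;

		next = strchr(tok, ' ');
		if (next)
			*next++ = '\0';
		if (*tok == '\0')
			continue;	/* skip repeated spaces */
		val = strchr(tok, '=');
		if (!val)
			return -1;
		*val++ = '\0';
		if (strcmp(tok, "physical_address") == 0) {
			*addr = parse_size(val);
			have_addr = 1;
		} else if (strcmp(tok, "numa_node") == 0) {
			*nid = (int)strtoul(val, NULL, 0);
		} else {
			return -1;	/* unknown key */
		}
	}
	return have_addr ? 0 : -1;
}
```

The sketch rejects unknown keys outright and works on a private copy of the
input, so the caller's buffer is never modified.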

CC: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
Signed-off-by: Haicheng Li <haicheng.li@intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
Index: linux-hpe4/Documentation/ABI/testing/sysfs-devices-memory
===================================================================
--- linux-hpe4.orig/Documentation/ABI/testing/sysfs-devices-memory	2010-11-17 09:00:50.653461798 +0800
+++ linux-hpe4/Documentation/ABI/testing/sysfs-devices-memory	2010-11-17 09:01:10.262838849 +0800
@@ -60,6 +60,23 @@
 Users:		hotplug memory remove tools
 		http://www.ibm.com/developerworks/wikis/display/LinuxP/powerpc-utils
 
+What:		/sys/devices/system/memory/probe
+Date:		Nov 2010
+Contact:	Linux Memory Management list <linux-mm@kvack.org>
+Description:
+		The memory probe interface is for memory hotplug emulation. It is a
+		software interface to test memory hotplug. Given a start address and a
+		NUMA node id, it adds a memory section to the specified node.
+
+		Add a memory section (128M) to node 3 (booted with mem=1024m):
+			echo 0x40000000,3 > memory/probe
+
+		A friendlier method:
+			echo 3g > memory/probe
+			echo 1024m,3 > memory/probe
+
+		Another format suggested by Dave Hansen:
+			echo physical_address=0x40000000 numa_node=3 > memory/probe
 
 What:		/sys/devices/system/memoryX/nodeY
 Date:		October 2009
Index: linux-hpe4/arch/x86/Kconfig
===================================================================
--- linux-hpe4.orig/arch/x86/Kconfig	2010-11-17 09:00:50.673463029 +0800
+++ linux-hpe4/arch/x86/Kconfig	2010-11-17 09:01:10.282838829 +0800
@@ -1276,10 +1276,6 @@
 	def_bool y
 	depends on ARCH_SPARSEMEM_ENABLE
 
-config ARCH_MEMORY_PROBE
-	def_bool X86_64
-	depends on MEMORY_HOTPLUG
-
 config ILLEGAL_POINTER_VALUE
        hex
        default 0 if X86_32
Index: linux-hpe4/drivers/base/memory.c
===================================================================
--- linux-hpe4.orig/drivers/base/memory.c	2010-11-17 09:00:50.673463029 +0800
+++ linux-hpe4/drivers/base/memory.c	2010-11-17 09:01:10.302838792 +0800
@@ -329,6 +329,9 @@
  * will not need to do it from userspace.  The fake hot-add code
  * as well as ppc64 will do all of their discovery in userspace
  * and will require this interface.
+ *
+ * Parameter format 1: physical_address,numa_node
+ * Parameter format 2: physical_address=0x40000000 numa_node=3
  */
 #ifdef CONFIG_ARCH_MEMORY_PROBE
 static ssize_t
@@ -336,13 +339,38 @@
 		   const char *buf, size_t count)
 {
 	u64 phys_addr;
-	int nid;
+	int nid = 0;
 	int ret;
+	const char *p, *q;
+
+	if (strchr(buf, '=') != NULL) {
+		/* format 2: physical_address=0x40000000 numa_node=3 */
+		p = strstr(buf, "physical_address=");
+		if (p == NULL)
+			return -EINVAL;
+		phys_addr = memparse(p + strlen("physical_address="), NULL);
+		q = strstr(buf, "numa_node=");
+		if (q != NULL)
+			nid = simple_strtoul(q + strlen("numa_node="), NULL, 0);
+		else
+			nid = memory_add_physaddr_to_nid(phys_addr);
+	} else {
+		/* format 1: physical_address[,numa_node]; buf is const, do not modify it */
+		phys_addr = memparse(buf, NULL);
+		p = strchr(buf, ',');
+		if (p != NULL && strlen(p + 1) > 0)
+			nid = simple_strtoul(p + 1, NULL, 0);
+		else
+			nid = memory_add_physaddr_to_nid(phys_addr);
+	}
 
-	phys_addr = simple_strtoull(buf, NULL, 0);
-
-	nid = memory_add_physaddr_to_nid(phys_addr);
-	ret = add_memory(nid, phys_addr, PAGES_PER_SECTION << PAGE_SHIFT);
+	if (nid < 0 || nid > nr_node_ids - 1) {
+		printk(KERN_ERR "Invalid node id %d(0<=nid<%d).\n", nid, nr_node_ids);
+		ret = -EINVAL;
+	} else {
+		printk(KERN_INFO "Add a memory section to node: %d.\n", nid);
+		ret = add_memory(nid, phys_addr, PAGES_PER_SECTION << PAGE_SHIFT);
+	}
 
 	if (ret)
 		count = ret;
Index: linux-hpe4/mm/Kconfig
===================================================================
--- linux-hpe4.orig/mm/Kconfig	2010-11-17 09:01:10.212839478 +0800
+++ linux-hpe4/mm/Kconfig	2010-11-17 09:01:10.302838792 +0800
@@ -173,6 +173,17 @@
 	  is for cpu hot-add/hot-remove to specified node in software method.
 	  This is for debuging and testing purpose
 
+config ARCH_MEMORY_PROBE
+	def_bool y
+	bool "Memory hotplug emulation"
+	depends on NUMA_HOTPLUG_EMU
+	---help---
+	  Enable memory hotplug emulation. Reserve memory with the boot parameter
+	  "mem=N" (such as mem=1024M), where N is the initial memory size; the
+	  remaining physical memory is removed from the e820 table. The memory
+	  probe interface then hot-adds that memory to a specified node in software.
+	  This is for debugging and testing purposes.
+
 #
 # If we have space for more page flags then we can enable additional
 # optimizations and functionality.

-- 
Thanks & Regards,
Shaohui
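
The two probe string formats described in this patch can be illustrated with a
small userspace sketch. This is not the kernel code: parse_size() is a
hypothetical stand-in for the kernel's memparse(), parse_probe() is an
illustrative name, and returning nid = -1 when no node is given is an
assumption made here so that a caller could fall back to a default lookup.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-in for the kernel's memparse():
 * parse a number with an optional k/m/g suffix (case-insensitive). */
static unsigned long long parse_size(const char *s)
{
	char *end;
	unsigned long long v = strtoull(s, &end, 0);

	switch (*end) {
	case 'g': case 'G':
		v <<= 10;
		/* fall through */
	case 'm': case 'M':
		v <<= 10;
		/* fall through */
	case 'k': case 'K':
		v <<= 10;
		break;
	default:
		break;
	}
	return v;
}

/*
 * Parse a probe string in either format without modifying it.
 * *nid is set to -1 when no node is given (an assumption of this sketch).
 * Returns 0 on success, -EINVAL if a key=value string lacks the address.
 */
int parse_probe(const char *buf, unsigned long long *addr, int *nid)
{
	static const char akey[] = "physical_address=";
	static const char nkey[] = "numa_node=";
	const char *p;

	*nid = -1;
	if (strchr(buf, '=')) {
		/* format 2: physical_address=<addr> [numa_node=<nid>] */
		p = strstr(buf, akey);
		if (!p)
			return -EINVAL;
		*addr = parse_size(p + strlen(akey));
		p = strstr(buf, nkey);
		if (p)
			*nid = (int)strtoul(p + strlen(nkey), NULL, 0);
		return 0;
	}
	/* format 1: <addr>[,<nid>] */
	*addr = parse_size(buf);
	p = strchr(buf, ',');
	if (p && p[1] != '\0' && p[1] != '\n')
		*nid = (int)strtoul(p + 1, NULL, 0);
	return 0;
}
```

Searching for each key with strstr() makes the order of the two key=value pairs
irrelevant and leaves the input buffer untouched, which matters when the buffer
is declared const as it is in a sysfs store handler.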



^ permalink raw reply	[flat|nested] 139+ messages in thread

* [8/8,v3] NUMA Hotplug Emulator: documentation
  2010-11-17  2:07 ` shaohui.zheng
@ 2010-11-17  2:08   ` shaohui.zheng
  -1 siblings, 0 replies; 139+ messages in thread
From: shaohui.zheng @ 2010-11-17  2:08 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: linux-kernel, haicheng.li, lethal, ak, shaohui.zheng,
	Haicheng Li, Shaohui Zheng

[-- Attachment #1: 008-hotplug-emulator-doc-x86_64-of-numa-hotplug-emulator.patch --]
[-- Type: text/plain, Size: 4784 bytes --]

From: Shaohui Zheng <shaohui.zheng@intel.com>

Add a text file, Documentation/x86/x86_64/numa_hotplug_emulator.txt,
that explains how to use the hotplug emulator.

Signed-off-by: Haicheng Li <haicheng.li@intel.com>
Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
---
Index: linux-hpe4/Documentation/x86/x86_64/numa_hotplug_emulator.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-hpe4/Documentation/x86/x86_64/numa_hotplug_emulator.txt	2010-11-17 09:01:10.342836513 +0800
@@ -0,0 +1,92 @@
+NUMA Hotplug Emulator for x86
+---------------------------------------------------
+
+The NUMA hotplug emulator emulates NUMA node hotplug purely in
+software. It is intended to help people debug and test
+node/CPU/memory hotplug code on machines without NUMA hotplug
+support, including UMA machines and virtual
+environments.
+
+1) Node hotplug emulation:
+
+The emulator first hides RAM via the E820 table, then uses the
+hidden RAM to fake offlined nodes.
+
+After the system boots, the user can hot-add these offlined
+nodes, much as real hotplug hardware would behave.
+
+Use the boot option "numa=hide=N*size" to fake offlined nodes:
+	- N is the number of hidden nodes
+	- size is the memory size (in MB) per hidden node.
+
+There is a sysfs entry "probe" under /sys/devices/system/node/ that lets the
+user hotplug the fake offlined nodes:
+
+ - to show all fake offlined nodes:
+    $ cat /sys/devices/system/node/probe
+
+ - to hot-add a fake offlined node, e.g. with node id N:
+    $ echo N > /sys/devices/system/node/probe
+
+2) CPU hotplug emulation:
+
+The emulator reserves CPUs through a boot parameter; the reserved CPUs can be
+hot-added and hot-removed in software, emulating the process of physical
+CPU hotplug.
+
+When hotplugging a CPU with the emulator, a logical CPU is used to emulate CPU
+socket hotplug. On CPUs that support SMT, several logical CPUs share one
+socket, but under the emulator they may be placed in different NUMA nodes.
+Each hot-added logical CPU is therefore put into a fake CPU socket and given a
+unique phys_proc_id. Each fake socket holds exactly one logical CPU.
+
+ - to hide CPUs
+	- Use the boot option "maxcpus=N" to hide CPUs;
+	  N is the number of CPUs initialized at boot
+	- Use the boot option "cpu_hpe=on" to enable CPU hotplug emulation;
+	  when cpu_hpe is enabled, the remaining CPUs are not initialized
+
+ - to hot-add CPU to node
+	$ echo nid > cpu/probe
+
+ - to hot-remove CPU
+	$ echo nid > cpu/release
+
+3) Memory hotplug emulation:
+
+The emulator reserves memory before the OS boots. The reserved memory region
+is removed from the e820 table and can be hot-added via the probe interface.
+This interface was extended to support adding memory to a specified node; it
+maintains backwards compatibility.
+
+Memory release is known to be difficult; there is currently no plan to support it.
+
+ - reserve memory through a boot parameter
+ 	mem=1024m
+
+ - add a memory section to node 3
+    $ echo 0x40000000,3 > memory/probe
+	OR
+    $ echo 1024m,3 > memory/probe
+	OR
+    $ echo "physical_address=0x40000000 numa_node=3" > memory/probe
+
+4) Script for hotplug testing
+
+These scripts are convenient when hot-adding memory/CPUs in batch.
+
+- Online all memory sections:
+for m in /sys/devices/system/memory/memory*;
+do
+	echo online > $m/state;
+done
+
+- CPU Online:
+for c in /sys/devices/system/cpu/cpu*;
+do
+	echo 1 > $c/online;
+done
+
+- Haicheng Li <haicheng.li@intel.com>
+- Shaohui Zheng <shaohui.zheng@intel.com>
+  Nov 2010
Index: linux-hpe4/Documentation/x86/x86_64/boot-options.txt
===================================================================
--- linux-hpe4.orig/Documentation/x86/x86_64/boot-options.txt	2010-11-17 10:01:37.093461435 +0800
+++ linux-hpe4/Documentation/x86/x86_64/boot-options.txt	2010-11-17 10:03:10.881043878 +0800
@@ -173,6 +173,13 @@
   numa=fake=<N>
 		If given as an integer, fills all system RAM with N fake nodes
 		interleaved over physical nodes.
+  numa=hide=N*size1[,size2,...]
+		A comma-separated string; each substring describes a series of nodes.
+		The system reserves an area in which to create the hidden NUMA nodes.
+
+		For example: numa=hide=2*512,256
+			The system reserves (2*512 + 256) MB for 3 hidden nodes: 2 nodes with
+			512 MB of memory each and 1 node with 256 MB.
 
 ACPI
 
@@ -316,3 +323,8 @@
 		Do not use GB pages for kernel direct mappings.
 	gbpages
 		Use GB pages for kernel direct mappings.
+	cpu_hpe=on/off
+		Enable/disable software CPU hotplug emulation. When cpu_hpe=on,
+		sysfs provides probe/release interfaces to hot-add/remove CPUs dynamically.
+		This option is disabled by default.
+			

-- 
Thanks & Regards,
Shaohui
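
As an illustration of the "numa=hide=N*size1[,size2,...]" syntax documented
above, here is a minimal userspace sketch of a parser. It is not taken from the
patch set: parse_hide_spec() and MAX_HIDDEN_NODES are hypothetical names, and
the sketch assumes plain decimal sizes in MB, as the documentation describes.

```c
#include <stdlib.h>

#define MAX_HIDDEN_NODES 16	/* illustrative limit, not from the patch */

/*
 * Parse "N*size1[,size2,...]": each comma-separated item is either
 * "count*size" or a bare "size", with sizes in MB.
 * Fills sizes_mb[] with one entry per hidden node and returns the number
 * of nodes, or -1 on a malformed string or too many nodes.
 */
int parse_hide_spec(const char *s, unsigned long sizes_mb[MAX_HIDDEN_NODES])
{
	int n = 0;

	while (*s) {
		char *end;
		unsigned long count = 1;
		unsigned long val = strtoul(s, &end, 10);

		if (end == s)
			return -1;	/* no digits where a number was expected */
		if (*end == '*') {	/* "count*size" form */
			count = val;
			s = end + 1;
			val = strtoul(s, &end, 10);
			if (end == s)
				return -1;
		}
		if (val == 0 || count == 0 ||
		    count > (unsigned long)(MAX_HIDDEN_NODES - n))
			return -1;
		while (count--)
			sizes_mb[n++] = val;
		if (*end == ',')
			s = end + 1;
		else if (*end == '\0')
			s = end;
		else
			return -1;
	}
	return n;
}
```

With the documented example "2*512,256", this sketch yields three entries
(512, 512, 256), matching the "3 hidden nodes" described in boot-options.txt.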



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [0/8,v3] NUMA Hotplug Emulator - Introduction & Feedbacks
  2010-11-17  2:07 ` shaohui.zheng
@ 2010-11-17  5:22   ` Paul Mundt
  -1 siblings, 0 replies; 139+ messages in thread
From: Paul Mundt @ 2010-11-17  5:22 UTC (permalink / raw)
  To: shaohui.zheng
  Cc: akpm, linux-mm, linux-kernel, haicheng.li, ak, shaohui.zheng

On Wed, Nov 17, 2010 at 10:07:59AM +0800, shaohui.zheng@intel.com wrote:
> * PATCHSET INTRODUCTION
> 
> patch 1: Add function to hide memory region via e820 table. Then emulator will
> 	     use these memory regions to fake offlined numa nodes.
> patch 2: Infrastructure of NUMA hotplug emulation, introduce "hide node".
> patch 3: Provide an userland interface to hotplug-add fake offlined nodes.
> patch 4: Abstract cpu register functions, make these interface friend for cpu
> 		 hotplug emulation
> patch 5: Support cpu probe/release in x86, it provide a software method to hot
> 		 add/remove cpu with sysfs interface.
> patch 6: Fake CPU socket with logical CPU on x86, to prevent the scheduling
> 		 domain to build the incorrect hierarchy.
> patch 7: extend memory probe interface to support NUMA, we can add the memory to
> 		 a specified node with the interface.
> patch 8: Documentations
> 
> * FEEDBACKS & RESPONSES
> 
I had some comments on the other patches in the series that possibly got
missed because of the mail-followup-to confusion:

http://lkml.org/lkml/2010/11/15/11
http://lkml.org/lkml/2010/11/15/14
http://lkml.org/lkml/2010/11/15/15

The other one you've already dealt with.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation
  2010-11-17  8:16     ` David Rientjes
@ 2010-11-17  7:51       ` Shaohui Zheng
  -1 siblings, 0 replies; 139+ messages in thread
From: Shaohui Zheng @ 2010-11-17  7:51 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, linux-mm, linux-kernel, haicheng.li, lethal, ak,
	shaohui.zheng, Yinghai Lu, Haicheng Li

On Wed, Nov 17, 2010 at 12:16:47AM -0800, David Rientjes wrote:
> On Wed, 17 Nov 2010, shaohui.zheng@intel.com wrote:
> 
> > From: Haicheng Li <haicheng.li@intel.com>
> > 
> > NUMA hotplug emulator introduces a new node state N_HIDDEN to
> > identify the fake offlined node. It firstly hides RAM via E820
> > table and then emulates fake offlined nodes with the hidden RAM.
> > 
> 
> Hmm, why can't you use numa=hide to hide a specified quantity of memory 
> from the kernel and then use the add_memory() interface to hot-add the 
> offlined memory in the desired quantity?  In other words, why do you need 
> to track the offlined nodes with a state?
> 
> The userspace interface would take a desired size of hidden memory to 
> hot-add and the node id would be the first_unset_node(node_online_map).
Yes, it is a good idea; your solution is in fact what we did in our first two
versions.  We used mem=memsize to hide memory and called the add_memory()
interface to hot-add the offlined memory in the desired quantity, and we could
also add it to the desired nodes (even if the nodes did not exist).  It is a
very flexible solution.

However, that solution was rejected: NUMA emulation already exists, and we
should reuse it.

Currently, our solution creates static nodes when the OS boots; only nodes in
state N_HIDDEN can be hot-added with the node/probe interface, and we can query


> 
> > After system bootup, user is able to hotplug-add these offlined
> > nodes, which is just similar to a real hardware hotplug behavior.
> > 
> > Using boot option "numa=hide=N*size" to fake offlined nodes:
> > 	- N is the number of hidden nodes
> > 	- size is the memory size (in MB) per hidden node.
> > 
> 
> size should be parsed with memparse() so users can specify 'M' or 'G', it 
> would even make your parsing code simpler.
Agreed. With memparse(), users can specify 'M' or 'G'; we will add it in the
next version.

-- 
Thanks & Regards,
Shaohui


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [1/8,v3] NUMA Hotplug Emulator: add function to hide memory region via e820 table.
  2010-11-17  2:08   ` shaohui.zheng
@ 2010-11-17  8:16     ` David Rientjes
  -1 siblings, 0 replies; 139+ messages in thread
From: David Rientjes @ 2010-11-17  8:16 UTC (permalink / raw)
  To: Shaohui Zheng
  Cc: akpm, linux-mm, linux-kernel, haicheng.li, lethal, ak,
	shaohui.zheng, Yinghai Lu, Haicheng Li

On Wed, 17 Nov 2010, shaohui.zheng@intel.com wrote:

> Index: linux-hpe4/arch/x86/kernel/e820.c
> ===================================================================
> --- linux-hpe4.orig/arch/x86/kernel/e820.c	2010-11-15 17:13:02.483461667 +0800
> +++ linux-hpe4/arch/x86/kernel/e820.c	2010-11-15 17:13:07.083461581 +0800
> @@ -971,6 +971,7 @@
>  }
>  
>  static int userdef __initdata;
> +static u64 max_mem_size __initdata = ULLONG_MAX;
>  
>  /* "mem=nopentium" disables the 4MB page tables. */
>  static int __init parse_memopt(char *p)
> @@ -989,12 +990,28 @@
>  
>  	userdef = 1;
>  	mem_size = memparse(p, &p);
> -	e820_remove_range(mem_size, ULLONG_MAX - mem_size, E820_RAM, 1);
> +	e820_remove_range(mem_size, max_mem_size - mem_size, E820_RAM, 1);
> +	max_mem_size = mem_size;
>  
>  	return 0;
>  }

This needs memmap= support as well, right?

>  early_param("mem", parse_memopt);
>  
> +#ifdef CONFIG_NODE_HOTPLUG_EMU
> +u64 __init e820_hide_mem(u64 mem_size)
> +{
> +	u64 start, end_pfn;
> +
> +	userdef = 1;
> +	end_pfn = e820_end_of_ram_pfn();
> +	start = (end_pfn << PAGE_SHIFT) - mem_size;
> +	e820_remove_range(start, max_mem_size - start, E820_RAM, 1);
> +	max_mem_size = start;
> +
> +	return start;
> +}
> +#endif

This doesn't have any sanity checking for whether e820_remove_range() will 
leave any significant amount of memory behind so the kernel will even boot 
(probably should have a guaranteed FAKE_NODE_MIN_SIZE left behind?).

> +
>  static int __init parse_memmap_opt(char *p)
>  {
>  	char *oldp;
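
A minimal sketch of the kind of sanity check raised above, written as plain
userspace C rather than kernel code. hide_mem_start() and MIN_VISIBLE_BYTES are
hypothetical names; the 32 MB floor is an assumption standing in for
FAKE_NODE_MIN_SIZE, whose actual value should be taken from the kernel source.

```c
#include <stdint.h>

/* Assumed floor; the kernel's FAKE_NODE_MIN_SIZE may differ. */
#define MIN_VISIBLE_BYTES ((uint64_t)32 << 20)

/*
 * Compute the start of a region of hide_size bytes taken from the top of
 * RAM ending at ram_end.  Returns 0 and stores the start address in *start
 * on success; returns -1 if the request is empty, exceeds RAM, or would
 * leave less than MIN_VISIBLE_BYTES visible.
 */
int hide_mem_start(uint64_t ram_end, uint64_t hide_size, uint64_t *start)
{
	if (hide_size == 0 || hide_size > ram_end)
		return -1;
	if (ram_end - hide_size < MIN_VISIBLE_BYTES)
		return -1;
	*start = ram_end - hide_size;
	return 0;
}
```

Checking hide_size against ram_end before subtracting avoids unsigned
wrap-around, which would otherwise produce a bogus start address near the top
of the 64-bit range.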

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation
  2010-11-17  2:08   ` shaohui.zheng
@ 2010-11-17  8:16     ` David Rientjes
  -1 siblings, 0 replies; 139+ messages in thread
From: David Rientjes @ 2010-11-17  8:16 UTC (permalink / raw)
  To: Shaohui Zheng
  Cc: Andrew Morton, linux-mm, linux-kernel, haicheng.li, lethal, ak,
	shaohui.zheng, Yinghai Lu, Haicheng Li

On Wed, 17 Nov 2010, shaohui.zheng@intel.com wrote:

> From: Haicheng Li <haicheng.li@intel.com>
> 
> NUMA hotplug emulator introduces a new node state N_HIDDEN to
> identify the fake offlined node. It firstly hides RAM via E820
> table and then emulates fake offlined nodes with the hidden RAM.
> 

Hmm, why can't you use numa=hide to hide a specified quantity of memory 
from the kernel and then use the add_memory() interface to hot-add the 
offlined memory in the desired quantity?  In other words, why do you need 
to track the offlined nodes with a state?

The userspace interface would take a desired size of hidden memory to 
hot-add and the node id would be the first_unset_node(node_online_map).
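
To illustrate the node id selection, here is a userspace sketch of what "first unset node" means for a small bitmask. The function first_unset() and the 64-bit MAX_NODES limit are stand-ins invented for this example; the kernel helper operates on a nodemask_t instead:

```c
#include <stdint.h>

#define MAX_NODES 64	/* assumed width of the toy node mask */

/* Return the lowest node id whose bit is clear, or MAX_NODES if all are set. */
static int first_unset(uint64_t online_mask)
{
	for (int nid = 0; nid < MAX_NODES; nid++)
		if (!(online_mask & (1ULL << nid)))
			return nid;
	return MAX_NODES;
}
```

With nodes 0 and 1 online (mask 0x3), the next hot-added node would get id 2.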

> After system bootup, user is able to hotplug-add these offlined
> nodes, which is just similar to a real hardware hotplug behavior.
> 
> Using boot option "numa=hide=N*size" to fake offlined nodes:
> 	- N is the number of hidden nodes
> 	- size is the memory size (in MB) per hidden node.
> 

size should be parsed with memparse() so users can specify 'M' or 'G'; it 
would even make your parsing code simpler.
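
A rough userspace sketch of how "N*size" parsing could look with a memparse()-style helper. parse_size() is only a stand-in for the kernel's memparse(), and parse_hide_opt() is a hypothetical name, not a function from the patch:

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Userspace stand-in for the kernel's memparse(): parse a number with an
 * optional K/M/G suffix and advance *endp past what was consumed.
 */
static uint64_t parse_size(const char *s, char **endp)
{
	uint64_t val = strtoull(s, endp, 0);

	switch (**endp) {
	case 'G': case 'g':
		val <<= 10;
		/* fall through */
	case 'M': case 'm':
		val <<= 10;
		/* fall through */
	case 'K': case 'k':
		val <<= 10;
		(*endp)++;
		break;
	default:
		break;
	}
	return val;
}

/*
 * Parse "N*size" (e.g. "2*512M").  Returns 0 on success and fills in the
 * node count and per-node size in bytes; returns -1 on malformed input.
 */
static int parse_hide_opt(const char *opt, unsigned int *nodes, uint64_t *size)
{
	char *p;
	unsigned long n = strtoul(opt, &p, 10);

	/* Require a leading count followed by '*'. */
	if (p == opt || *p != '*')
		return -1;

	/* Require a non-zero size with nothing trailing after it. */
	*size = parse_size(p + 1, &p);
	if (*size == 0 || *p != '\0')
		return -1;

	*nodes = (unsigned int)n;
	return 0;
}
```

With this shape, "numa=hide=2*512M" and "numa=hide=4*1G" both parse without any unit-specific code in the caller.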

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [3/8,v3] NUMA Hotplug Emulator: Userland interface to hotplug-add fake offlined nodes.
  2010-11-17  2:08   ` shaohui.zheng
@ 2010-11-17  8:16     ` David Rientjes
  -1 siblings, 0 replies; 139+ messages in thread
From: David Rientjes @ 2010-11-17  8:16 UTC (permalink / raw)
  To: Shaohui Zheng
  Cc: akpm, linux-mm, linux-kernel, haicheng.li, lethal, ak,
	shaohui.zheng, Dave Hansen, Christoph Lameter, Haicheng Li

On Wed, 17 Nov 2010, shaohui.zheng@intel.com wrote:

> From: Haicheng Li <haicheng.li@intel.com>
> 
> Add a sysfs entry "probe" under /sys/devices/system/node/:
> 
>  - to show all fake offlined nodes:
>     $ cat /sys/devices/system/node/probe
> 
>  - to hotadd a fake offlined node, e.g. nodeid is N:
>     $ echo N > /sys/devices/system/node/probe
> 

This would be much more powerful if we just reserved an amount of memory 
at boot and then allowed users to hot-add a given amount with a 
non-online node id.  Then we can test nodes of various sizes rather than 
being statically committed at boot.

This should be fairly straightforward by faking 
ACPI_SRAT_MEM_HOT_PLUGGABLE entries, for example.

> Index: linux-hpe4/mm/Kconfig
> ===================================================================
> --- linux-hpe4.orig/mm/Kconfig	2010-11-15 17:13:02.443461606 +0800
> +++ linux-hpe4/mm/Kconfig	2010-11-15 17:21:05.535335091 +0800
> @@ -147,6 +147,21 @@
>  	depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
>  	depends on MIGRATION
>  
> +config NUMA_HOTPLUG_EMU
> +	bool "NUMA hotplug emulator"
> +	depends on X86_64 && NUMA && MEMORY_HOTPLUG
> +
> +	---help---
> +
> +config NODE_HOTPLUG_EMU
> +	bool "Node hotplug emulation"
> +	depends on NUMA_HOTPLUG_EMU && MEMORY_HOTPLUG
> +	---help---
> +	  Enable Node hotplug emulation. The machine will be setup with
> +	  hidden virtual nodes when booted with "numa=hide=N*size", where
> +	  N is the number of hidden nodes, size is the memory size per
> +	  hidden node. This is only useful for debugging.
> +

That's clearly wrong, but I don't see why this needs to be a new Kconfig 
option to begin with.  Can't we enable all of this functionality by default 
under CONFIG_NUMA_EMU && CONFIG_MEMORY_HOTPLUG?

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [0/8,v3] NUMA Hotplug Emulator - Introduction & Feedbacks
  2010-11-17  2:07 ` shaohui.zheng
@ 2010-11-17  9:26   ` Yinghai Lu
  -1 siblings, 0 replies; 139+ messages in thread
From: Yinghai Lu @ 2010-11-17  9:26 UTC (permalink / raw)
  To: shaohui.zheng
  Cc: akpm, linux-mm, linux-kernel, haicheng.li, lethal, ak, shaohui.zheng

On Tue, Nov 16, 2010 at 6:07 PM,  <shaohui.zheng@intel.com> wrote:
>
> * WHAT IS HOTPLUG EMULATOR
>
> NUMA hotplug emulator is the collective name for the hotplug emulation;
> it is able to emulate NUMA node hotplug in pure software. It
> intends to help people easily debug and test node/cpu/memory hotplug
> related code on a machine without NUMA hotplug support, even a UMA machine.
>
> The emulator provides a mechanism to emulate the process of physical cpu/mem
> hotadd; it lets kernel developers debug CPU and memory hotplug on machines
> without NUMA support. It offers an interface for cpu
> and memory hotplug testing.
>
> * WHY DO WE USE HOTPLUG EMULATOR
>
> We have been focusing on hotplug emulation for a few months. The emulator helps
> the team reproduce all the major hotplug bugs. It plays an important role in
> hotplug code quality assurance. Because of the hotplug emulator, we have already
> moved most of the debugging work to a virtual environment.

You should extend KVM to support NUMA hotplug in guests
instead of messing up the Linux NUMA code.

Thanks

Yinghai

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [7/8,v3] NUMA Hotplug Emulator: extend memory probe interface to support NUMA
  2010-11-17  2:08   ` shaohui.zheng
@ 2010-11-17 18:50     ` Dave Hansen
  -1 siblings, 0 replies; 139+ messages in thread
From: Dave Hansen @ 2010-11-17 18:50 UTC (permalink / raw)
  To: shaohui.zheng
  Cc: akpm, linux-mm, linux-kernel, haicheng.li, lethal, ak,
	shaohui.zheng, Haicheng Li, Wu Fengguang, Greg KH

On Wed, 2010-11-17 at 10:08 +0800, shaohui.zheng@intel.com wrote:
> And to make it friendlier, it is possible to add memory with
> 
>         echo 3g > memory/probe
>         echo 1024m,3 > memory/probe
> 
> It maintains backwards compatibility.
> 
> Another format suggested by Dave Hansen:
> 
>         echo physical_address=0x40000000 numa_node=3 > memory/probe
> 
> it is more explicit to show meaning of the parameters.

The other thing that Greg suggested was to use configfs.  Looking back
on it, that makes a lot of sense.  We can do better than these "probe"
files.

In your case, it might be useful to tell the kernel to be able to add
memory in a node and add the node all in one go.  That'll probably be
closer to what the hardware will do, and will exercise different code
paths than the separate "add node", "then add memory" steps that you're
using here.

For the emulator, I also have to wonder if using debugfs is the right
way since its ABI is a bit more, well, _flexible_ over time. :)

> +       depends on NUMA_HOTPLUG_EMU
> +       ---help---
> +         Enable memory hotplug emulation. Reserve memory with grub parameter
> +         "mem=N"(such as mem=1024M), where N is the initial memory size, the
> +         rest physical memory will be removed from e820 table; the memory probe
> +         interface is for memory hot-add to specified node in software method.
> +         This is for debuging and testing purpose

mem= actually sets the largest physical address that we're trying to
use.  If you have a 256MB hole at 768MB, then mem=1G will only get you
768MB of memory.  We probably get this wrong in a number of other places
in the documentation, but we might as well get it right here.
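
The arithmetic in that example can be sketched in userspace C. The struct ram_range type and the ram_below_limit() helper are made up for illustration and only model usable-RAM ranges as [start, end) pairs; this is not how the e820 code is structured:

```c
#include <stddef.h>
#include <stdint.h>

/* One usable-RAM range from the firmware map: [start, end). */
struct ram_range {
	uint64_t start;
	uint64_t end;
};

/* Sum the RAM that lies below 'limit', i.e. what mem=limit leaves usable. */
static uint64_t ram_below_limit(const struct ram_range *map, size_t n,
				uint64_t limit)
{
	uint64_t total = 0;

	for (size_t i = 0; i < n; i++) {
		/* Clip each range at the limit; ranges above it add nothing. */
		uint64_t end = map[i].end < limit ? map[i].end : limit;

		if (end > map[i].start)
			total += end - map[i].start;
	}
	return total;
}
```

For a map with RAM at 0-768MB and 1024MB-2048MB (a 256MB hole at 768MB), a limit of 1GB gives 768MB, which matches the case described above.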

Maybe something like:
        
        Enable emulation of hotplug of NUMA nodes.  To use this, you
        must also boot with the kernel command-line parameter
        "mem=N"(such as mem=1024M), where N is the highest physical
        address you would like to use at boot.  The rest of physical
        memory will be removed from firmware tables and may then be
        hotplugged with this feature. This is for debugging and testing
        purposes.
        
        Note that you can still examine the original, non-modified
        firmware tables in: /sys/firmware/memmap
        
-- Dave


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation
  2010-11-17  7:51       ` Shaohui Zheng
@ 2010-11-17 21:10         ` David Rientjes
  -1 siblings, 0 replies; 139+ messages in thread
From: David Rientjes @ 2010-11-17 21:10 UTC (permalink / raw)
  To: Shaohui Zheng
  Cc: Andrew Morton, linux-mm, linux-kernel, haicheng.li, lethal, ak,
	shaohui.zheng, Yinghai Lu, Haicheng Li

On Wed, 17 Nov 2010, Shaohui Zheng wrote:

> > Hmm, why can't you use numa=hide to hide a specified quantity of memory 
> > from the kernel and then use the add_memory() interface to hot-add the 
> > offlined memory in the desired quantity?  In other words, why do you need 
> > to track the offlined nodes with a state?
> > 
> > The userspace interface would take a desired size of hidden memory to 
> > hot-add and the node id would be the first_unset_node(node_online_map).
> Yes, it is a good idea; your solution is indeed what we did in our first 2
> versions.  We used mem=memsize to hide memory, and we called the add_memory interface
> to hot-add offlined memory in the desired quantity, and we could also add it to
> desired nodes (even though the nodes do not exist yet). It is a very flexible
> solution.
> 
> However, this solution was rejected: since NUMA emulation already exists, we should
> reuse it.
> 

I don't understand why that's a requirement; NUMA emulation is a separate 
feature.  Although both are primarily used to test and instrument other VM 
and kernel code, NUMA emulation is restricted to only being used at boot 
to fake nodes on smaller machines and can be used to test things like the 
slab allocator.  The NUMA hotplug emulator that you're developing here is 
primarily used to test the hotplug callbacks; for that use-case, it seems 
particularly helpful if nodes of various sizes and node ids can be 
hotplugged rather than having static characteristics that cannot be 
changed without a reboot.

> Currently, our solution creates static nodes when OS boots, only the node with 
> state N_HIDDEN can be hot-added with node/probe interface, and we can query 
> 

The idea that I've proposed (and you've apparently thought about and even 
implemented at one point) is much more powerful than that.  We need not 
query the state of hidden nodes that we've setup at boot but can rather 
use the amount of hidden memory to setup the nodes in any way that we want 
at runtime (various sizes, interleaved node ids, etc).

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [7/8,v3] NUMA Hotplug Emulator: extend memory probe interface to support NUMA
  2010-11-17 18:50     ` Dave Hansen
@ 2010-11-17 21:18       ` David Rientjes
  -1 siblings, 0 replies; 139+ messages in thread
From: David Rientjes @ 2010-11-17 21:18 UTC (permalink / raw)
  To: Dave Hansen
  Cc: shaohui.zheng, akpm, linux-mm, linux-kernel, haicheng.li, lethal,
	ak, shaohui.zheng, Haicheng Li, Wu Fengguang, Greg KH

On Wed, 17 Nov 2010, Dave Hansen wrote:

> The other thing that Greg suggested was to use configfs.  Looking back
> on it, that makes a lot of sense.  We can do better than these "probe"
> files.
> 
> In your case, it might be useful to tell the kernel to be able to add
> memory in a node and add the node all in one go.  That'll probably be
> closer to what the hardware will do, and will exercise different code
> paths than the separate "add node", "then add memory" steps that you're
> using here.
> 

That seems like a separate issue of moving the memory hotplug interface 
over to configfs and that seems like it will cause a lot of userspace 
breakage.  The memory hotplug interface can already add memory to a node 
without using the ACPI notifier, so what does it have to do with this 
patchset?

I think what this patchset really wants to do is map offline hot-added 
memory to a different node id before it is onlined.  It needs no 
additional command-line interface or kconfig options; users just need to 
physically hot-add memory at runtime or use mem= when booting to reserve 
present memory from being used.

Then, export the amount of memory that is actually physically present in 
the e820 but was truncated by mem= and allow users to hot-add the memory 
via the probe interface.  Add a writeable 'node' file to offlined memory 
section directories and allow it to be changed prior to online.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [7/8,v3] NUMA Hotplug Emulator: extend memory probe interface to support NUMA
  2010-11-17 21:18       ` David Rientjes
@ 2010-11-17 21:55         ` Dave Hansen
  -1 siblings, 0 replies; 139+ messages in thread
From: Dave Hansen @ 2010-11-17 21:55 UTC (permalink / raw)
  To: David Rientjes
  Cc: shaohui.zheng, akpm, linux-mm, linux-kernel, haicheng.li, lethal,
	ak, shaohui.zheng, Haicheng Li, Wu Fengguang, Greg KH

On Wed, 2010-11-17 at 13:18 -0800, David Rientjes wrote:
> On Wed, 17 Nov 2010, Dave Hansen wrote:
> > The other thing that Greg suggested was to use configfs.  Looking back
> > on it, that makes a lot of sense.  We can do better than these "probe"
> > files.
> > 
> > In your case, it might be useful to tell the kernel to be able to add
> > memory in a node and add the node all in one go.  That'll probably be
> > closer to what the hardware will do, and will exercise different code
> > paths than the separate "add node", "then add memory" steps that you're
> > using here.
> 
> That seems like a separate issue of moving the memory hotplug interface 
> over to configfs and that seems like it will cause a lot of userspace 
> breakage.  The memory hotplug interface can already add memory to a node 
> without using the ACPI notifier, so what does it have to do with this 
> patchset?

I was actually just thinking of the node hotplug interface not using a
'probe' file.  But, you make a good point.  They _have_ to be tied
together, and doing one via configfs would mean that we probably have to
do the other that way.  We wouldn't have to _remove_ the ...memory/probe
interface (breaking userspace), but we would add some redundancy.

> I think what this patchset really wants to do is map offline hot-added 
> memory to a different node id before it is onlined.  It needs no 
> additional command-line interface or kconfig options; users just need to 
> physically hot-add memory at runtime or use mem= when booting to reserve 
> present memory from being used.
> 
> Then, export the amount of memory that is actually physically present in 
> the e820 but was truncated by mem=

I _think_ that's already effectively done in /sys/firmware/memmap.   

> and allow users to hot-add the memory 
> via the probe interface.  Add a writeable 'node' file to offlined memory 
> section directories and allow it to be changed prior to online.

That would work, in theory.  But, in practice, we allocate the mem_map[]
at probe time.  So, we've already effectively picked a node at probe.
That was done because the probe is equivalent to the hardware "add"
event.  Once the hardware knows where in the address space the memory is,
it always also knows the node.

But, I guess it also wouldn't be horrible if we just hot-removed and
hot-added an offline section if someone did write to a node file like
you're suggesting.  It might actually exercise some interesting code
paths.

-- Dave


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [7/8,v3] NUMA Hotplug Emulator: extend memory probe interface to support NUMA
@ 2010-11-17 21:55         ` Dave Hansen
  0 siblings, 0 replies; 139+ messages in thread
From: Dave Hansen @ 2010-11-17 21:55 UTC (permalink / raw)
  To: David Rientjes
  Cc: shaohui.zheng, akpm, linux-mm, linux-kernel, haicheng.li, lethal,
	ak, shaohui.zheng, Haicheng Li, Wu Fengguang, Greg KH

> On Wed, 2010-11-17 at 13:18 -0800, David Rientjes wrote:
> On Wed, 17 Nov 2010, Dave Hansen wrote:
> > The other thing that Greg suggested was to use configfs.  Looking back
> > on it, that makes a lot of sense.  We can do better than these "probe"
> > files.
> > 
> > In your case, it might be useful to tell the kernel to be able to add
> > memory in a node and add the node all in one go.  That'll probably be
> > closer to what the hardware will do, and will exercise different code
> > paths that the separate "add node", "then add memory" steps that you're
> > using here.
> 
> That seems like a seperate issue of moving the memory hotplug interface 
> over to configfs and that seems like it will cause a lot of userspace 
> breakage.  The memory hotplug interface can already add memory to a node 
> without using the ACPI notifier, so what does it have to do with this 
> patchset?

I was actually just thinking of the node hotplug interface not using a
'probe' file.  But, you make a good point.  They _have_ to be tied
together, and doing one via configfs would mean that we probably have to
do the other that way.  We wouldn't have to _remove_ the ...memory/probe
interface (breaking userspace), but we would add some redundancy.

> I think what this patchset really wants to do is map offline hot-added 
> memory to a different node id before it is onlined.  It needs no 
> additional command-line interface or kconfig options, users just need to 
> physically hot-add memory at runtime or use mem= when booting to reserve 
> present memory from being used.
> 
> Then, export the amount of memory that is actually physically present in 
> the e820 but was truncated by mem=

I _think_ that's already effectively done in /sys/firmware/memmap.   

> and allow users to hot-add the memory 
> via the probe interface.  Add a writeable 'node' file to offlined memory 
> section directories and allow it to be changed prior to online.

That would work, in theory.  But, in practice, we allocate the mem_map[]
at probe time.  So, we've already effectively picked a node at probe.
That was done because the probe is equivalent to the hardware "add"
event.  Once the hardware knows where in the address space the memory
is, it always also knows the node.

But, I guess it also wouldn't be horrible if we just hot-removed and
hot-added an offline section if someone did write to a node file like
you're suggesting.  It might actually exercise some interesting code
paths.

-- Dave

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [7/8,v3] NUMA Hotplug Emulator: extend memory probe interface to support NUMA
  2010-11-17 21:55         ` Dave Hansen
@ 2010-11-17 22:44           ` David Rientjes
  -1 siblings, 0 replies; 139+ messages in thread
From: David Rientjes @ 2010-11-17 22:44 UTC (permalink / raw)
  To: Dave Hansen
  Cc: shaohui.zheng, Andrew Morton, linux-mm, linux-kernel,
	haicheng.li, lethal, ak, shaohui.zheng, Haicheng Li,
	Wu Fengguang, Greg KH, Aaron Durbin

On Wed, 17 Nov 2010, Dave Hansen wrote:

> > Then, export the amount of memory that is actually physically present in 
> > the e820 but was truncated by mem=
> 
> I _think_ that's already effectively done in /sys/firmware/memmap.   
> 

Ok.

It's a little complicated because we don't export each online node's 
physical address range so you have to parse the dmesg to find what nodes 
were allocated at boot and determine how much physically present memory 
you have that's hidden but can be hotplugged using the probe files.
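
As a rough illustration of the bookkeeping described above (a sketch, not
part of this patchset): the firmware memory map under /sys/firmware/memmap
keeps one directory per range with start, end and type files, so the total
firmware-reported RAM can be summed in shell. The function name
firmware_ram_bytes and its optional directory argument are hypothetical,
added only to make the sketch self-contained.

```shell
# Sum the sizes of all "System RAM" ranges found under a firmware memmap
# directory. The directory argument is an assumption made for testing;
# it defaults to the real sysfs location.
firmware_ram_bytes() {
    dir=${1:-/sys/firmware/memmap}
    total=0
    for entry in "$dir"/*; do
        # Skip anything that is not a readable memmap entry.
        [ -r "$entry/type" ] || continue
        [ "$(cat "$entry/type")" = "System RAM" ] || continue
        start=$(cat "$entry/start")
        end=$(cat "$entry/end")
        # start and end are hexadecimal; end is inclusive, hence the +1.
        total=$((total + $end - $start + 1))
    done
    echo "$total"
}
```

Comparing the result with MemTotal in /proc/meminfo gives only an
approximate idea of how much RAM mem= has hidden, since MemTotal also
excludes memory the kernel reserves for itself.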

Adding Aaron Durbin <adurbin@google.com> to the cc because he has a patch 
that exports the physical address range of each node in their sysfs 
directories.

> > and allow users to hot-add the memory 
> > via the probe interface.  Add a writeable 'node' file to offlined memory 
> > section directories and allow it to be changed prior to online.
> 
> That would work, in theory.  But, in practice, we allocate the mem_map[]
> at probe time.  So, we've already effectively picked a node at probe.
> That was done because the probe is equivalent to the hardware "add"
> event.  Once the hardware knows where in the address space the memory
> is, it always also knows the node.
> 
> But, I guess it also wouldn't be horrible if we just hot-removed and
> hot-added an offline section if someone did write to a node file like
> you're suggesting.  It might actually exercise some interesting code
> paths.
> 

Since the pages are offline you should be able to modify the memmap when 
the 'node' file is written and use populate_memnodemap() since that file 
is only writeable in an offline state.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [7/8,v3] NUMA Hotplug Emulator: extend memory probe interface to support NUMA
  2010-11-17 22:44           ` David Rientjes
@ 2010-11-17 23:00             ` Dave Hansen
  -1 siblings, 0 replies; 139+ messages in thread
From: Dave Hansen @ 2010-11-17 23:00 UTC (permalink / raw)
  To: David Rientjes
  Cc: shaohui.zheng, Andrew Morton, linux-mm, linux-kernel,
	haicheng.li, lethal, ak, shaohui.zheng, Haicheng Li,
	Wu Fengguang, Greg KH, Aaron Durbin

On Wed, 2010-11-17 at 14:44 -0800, David Rientjes wrote:
> > That would work, in theory.  But, in practice, we allocate the mem_map[]
> > at probe time.  So, we've already effectively picked a node at probe.
> > That was done because the probe is equivalent to the hardware "add"
> > event.  Once the hardware knows where in the address space the memory
> > is, it always also knows the node.
> > 
> > But, I guess it also wouldn't be horrible if we just hot-removed and
> > hot-added an offline section if someone did write to a node file like
> > you're suggesting.  It might actually exercise some interesting code
> > paths.
> 
> Since the pages are offline you should be able to modify the memmap when 
> the 'node' file is written and use populate_memnodemap() since that file 
> is only writeable in an offline state.

It's not just the mem_map[], though.  When a section is sitting
"offline", it's pretty much all ready to go, except that its pages
aren't in the allocators.  But, all of the other mm structures have
already been modified to make room for the pages.  Zones have been added
or modified, pgdats resized, 'struct page's initialized.

Changing the node implies changing _all_ of those, which requires
unrolling most of what happened when the "echo $foo > probe" operation
happened in the first place.

This is all _doable_, but it's not trivial.  

-- Dave


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [8/8,v3] NUMA Hotplug Emulator: documentation
  2010-11-17  2:08   ` shaohui.zheng
@ 2010-11-17 23:06     ` Randy Dunlap
  -1 siblings, 0 replies; 139+ messages in thread
From: Randy Dunlap @ 2010-11-17 23:06 UTC (permalink / raw)
  To: shaohui.zheng
  Cc: akpm, linux-mm, linux-kernel, haicheng.li, lethal, ak,
	shaohui.zheng, Haicheng Li

On Wed, 17 Nov 2010 10:08:07 +0800 shaohui.zheng@intel.com wrote:

> From: Shaohui Zheng <shaohui.zheng@intel.com>
> 
> add a text file Documentation/x86/x86_64/numa_hotplug_emulator.txt
> to explain the usage for the hotplug emulator.
> 
> Signed-off-by: Haicheng Li <haicheng.li@intel.com>
> Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
> ---
> Index: linux-hpe4/Documentation/x86/x86_64/numa_hotplug_emulator.txt
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-hpe4/Documentation/x86/x86_64/numa_hotplug_emulator.txt	2010-11-17 09:01:10.342836513 +0800
> @@ -0,0 +1,92 @@
> +NUMA Hotplug Emulator for x86

(I'm only looking at the documentation file.)

Is this only for x86_64?  If so, please change the line above (for x86).
If not, then don't put this file into the /x86_64/ sub-directory.


> +---------------------------------------------------
> +
> +NUMA hotplug emulator is able to emulate NUMA Node Hotplug
> +thru a pure software way. It intends to help people easily debug
> +and test node/cpu/memory hotplug related stuff on a

                 CPU

> +none-numa-hotplug-support machine, even a UMA machine and virtual

   non-NUMA-hotplug-support machine,

> +environment.
> +
> +1) Node hotplug emulation:
> +
> +The emulator firstly hides RAM via E820 table, and then it can
> +fake offlined nodes with the hidden RAM.
> +
> +After system bootup, user is able to hotplug-add these offlined
> +nodes, which is just similar to a real hotplug hardware behavior.
> +
> +Using boot option "numa=hide=N*size" to fake offlined nodes:
> +	- N is the number of hidden nodes
> +	- size is the memory size (in MB) per hidden node.
> +
> +There is a sysfs entry "probe" under /sys/devices/system/node/ for user
> +to hotplug the fake offlined nodes:
> +
> + - to show all fake offlined nodes:
> +    $ cat /sys/devices/system/node/probe
> +
> + - to hotadd a fake offlined node, e.g. nodeid is N:
> +    $ echo N > /sys/devices/system/node/probe
> +
> +2) CPU hotplug emulation:
> +
> +The emulator reserve CPUs throu grub parameter, the reserved CPUs can be

                             thru a kernel boot parameter;
(hopefully any boot loader will work, not just grub)

> +hot-add/hot-remove in software method, it emulates the process of physical
> +cpu hotplug.

   CPU

> +
> +When hotplug a CPU with emulator, we are using a logical CPU to emulate the CPU

        hotplugging

> +socket hotplug process. For the CPU supported SMT, some logical CPUs are in the
> +same socket, but it may located in different NUMA node after we have emulator.
> +We put the logical CPU into a fake CPU socket, and assign it an unique

                                                                a unique

> +phys_proc_id. For the fake socket, we put one logical CPU in only.
> +
> + - to hide CPUs
> +	- Using boot option "maxcpus=N" hide CPUs
> +	  N is the number of initialize CPUs

	  N is the number of CPUs to initialize; the rest will be hidden.

> +	- Using boot option "cpu_hpe=on" to enable cpu hotplug emulation

	                                           CPU
    
> +      when cpu_hpe is enabled, the rest CPUs will not be initialized

	                              rest of the CPUs

> +
> + - to hot-add CPU to node
> +	$ echo nid > cpu/probe
> +
> + - to hot-remove CPU
> +	$ echo nid > cpu/release
> +
> +3) Memory hotplug emulation:
> +
> +The emulator reserve memory before OS booting, the reserved memory region

                reserves memory before the OS boots; the reserved

> +is remove from e820 table, and they can be hot-added via the probe interface,

      removed                                                         interface.

> +this interface was extend to support add memory to the specified node, It

   This interface was extended to support adding memory to the specified node. It

> +maintains backwards compatibility.
> +
> +The difficulty of Memory Release is well-known, we have no plan for it until now.
> +
> + - reserve memory throu grub parameter

                     thru a kernel boot parameter

> + 	mem=1024m
> +
> + - add a memory section to node 3
> +    $ echo 0x40000000,3 > memory/probe
> +	OR
> +    $ echo 1024m,3 > memory/probe
> +	OR
> +    $ echo "physical_address=0x40000000 numa_node=3" > memory/probe
> +
> +4) Script for hotplug testing
> +
> +These scripts provides convenience when we hot-add memory/cpu in batch.
> +
> +- Online all memory sections:
> +for m in /sys/devices/system/memory/memory*;
> +do
> +	echo online > $m/state;
> +done
> +
> +- CPU Online:
> +for c in /sys/devices/system/cpu/cpu*;
> +do
> +	echo 1 > $c/online;
> +done
> +
> +- Haicheng Li <haicheng.li@intel.com>
> +- Shaohui Zheng <shaohui.zheng@intel.com>
> +  Nov 2010
> Index: linux-hpe4/Documentation/x86/x86_64/boot-options.txt
> ===================================================================
> --- linux-hpe4.orig/Documentation/x86/x86_64/boot-options.txt	2010-11-17 10:01:37.093461435 +0800
> +++ linux-hpe4/Documentation/x86/x86_64/boot-options.txt	2010-11-17 10:03:10.881043878 +0800
> @@ -173,6 +173,13 @@
>    numa=fake=<N>
>  		If given as an integer, fills all system RAM with N fake nodes
>  		interleaved over physical nodes.
> +  numa=hide=N*size1[,size2,...]
> +		Give an string seperated by comma, each sub string stands for a serie nodes.

		Give a string separated by commas; each substring stands for a node size.
??


> +		system will reserve an area to create hide numa nodes for them.

		System will reserve an area to create or hide NUMA nodes.

> +
> +		for example: numa=hide=2*512,256
> +			system will reserve (2*512 + 256) M for 3 hide nodes. 2 nodes with 512M memory,

			                                  MB for 3 hidden nodes: 2 nodes with
			512 MB memory and 1 node with 256 MB memory

> +			and 1 node with 256 memory 
>  
>  ACPI
>  
> @@ -316,3 +323,8 @@
>  		Do not use GB pages for kernel direct mappings.
>  	gbpages
>  		Use GB pages for kernel direct mappings.
> +	cpu_hpe=on/off
> +		Enable/disable cpu hotplug emulation with software method. when cpu_hpe=on,

		               CPU                                 method. When

> +		sysfs provides probe/release interface to hot add/remove cpu dynamically.

                                                                         CPUs

> +		this option is disabled in default.

		This                    by default.
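
The numa=hide= arithmetic in the example above can be checked with a small
shell sketch. It assumes the syntax given in the quoted text, where a term
is either N*size or a bare size (in MB) counted as one node; the helper
name hide_spec_summary is hypothetical and is not part of the patch.

```shell
# Print "<nodes> <total_mb>" for a numa=hide= value such as "2*512,256".
# Assumption: a term without "N*" describes a single hidden node.
hide_spec_summary() {
    rest=$1
    nodes=0
    total_mb=0
    while [ -n "$rest" ]; do
        term=${rest%%,*}              # first comma-separated term
        case $rest in
            *,*) rest=${rest#*,} ;;   # drop the term just consumed
            *)   rest= ;;             # that was the last term
        esac
        case $term in
            *\**) count=${term%%\**}; size=${term#*\*} ;;
            *)    count=1; size=$term ;;
        esac
        nodes=$((nodes + count))
        total_mb=$((total_mb + count * size))
    done
    echo "$nodes $total_mb"
}
```

For the documented example, hide_spec_summary '2*512,256' reports 3 nodes
and 1280 MB, which matches (2*512 + 256) MB.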



---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [7/8,v3] NUMA Hotplug Emulator: extend memory probe interface to support NUMA
  2010-11-17 23:00             ` Dave Hansen
@ 2010-11-17 23:17               ` David Rientjes
  -1 siblings, 0 replies; 139+ messages in thread
From: David Rientjes @ 2010-11-17 23:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: shaohui.zheng, Andrew Morton, linux-mm, linux-kernel,
	haicheng.li, lethal, ak, shaohui.zheng, Haicheng Li,
	Wu Fengguang, Greg KH, Aaron Durbin

On Wed, 17 Nov 2010, Dave Hansen wrote:

> It's not just the mem_map[], though.  When a section is sitting
> "offline", it's pretty much all ready to go, except that its pages
> aren't in the allocators.  But, all of the other mm structures have
> already been modified to make room for the pages.  Zones have been added
> or modified, pgdats resized, 'struct page's initialized.
> 

Ok, so let's create an interface that complements the probe interface: it 
takes a quantity of memory to be hot-added from the hidden RAM, fakes the 
nodes_add array for each section within that quantity by calling 
update_nodes_add(), and only then loops over the sections calling 
add_memory().
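
From userspace, the per-section loop can be approximated today by issuing
one probe write per section. A minimal sketch, assuming the "addr,node"
probe format proposed in patch 7 of this series (it is not a mainline
format); the helper name probe_lines and its arguments are hypothetical,
and it only prints the lines rather than writing to sysfs:

```shell
# Print one "addr,node" probe line per memory section covering
# [start, start + size). Arguments: start size section_size node.
probe_lines() {
    addr=$(($1))
    end=$(($1 + $2))
    section=$(($3))
    node=$4
    # Refuse a zero or negative section size, which would loop forever.
    [ "$section" -gt 0 ] || return 1
    while [ "$addr" -lt "$end" ]; do
        printf '0x%x,%s\n' "$addr" "$node"
        addr=$((addr + section))
    done
}
```

On a real system the section size would presumably be read from
/sys/devices/system/memory/block_size_bytes, and each printed line would be
written to the memory probe file in turn.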

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [0/8,v3] NUMA Hotplug Emulator - Introduction & Feedbacks
  2010-11-17  9:26   ` Yinghai Lu
@ 2010-11-18  2:03     ` Shaohui Zheng
  -1 siblings, 0 replies; 139+ messages in thread
From: Shaohui Zheng @ 2010-11-18  2:03 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: akpm, linux-mm, linux-kernel, haicheng.li, lethal, ak, shaohui.zheng

On Wed, Nov 17, 2010 at 01:26:59AM -0800, Yinghai Lu wrote:
> On Tue, Nov 16, 2010 at 6:07 PM,  <shaohui.zheng@intel.com> wrote:
> >
> > * WHAT IS HOTPLUG EMULATOR
> >
> > NUMA hotplug emulator is collectively named for the hotplug emulation
> > it is able to emulate NUMA Node Hotplug thru a pure software way. It
> > intends to help people easily debug and test node/cpu/memory hotplug
> > related stuff on a none-numa-hotplug-support machine, even an UMA machine.
> >
> > The emulator provides mechanism to emulate the process of physical cpu/mem
> > hotadd, it provides possibility to debug CPU and memory hotplug on the machines
> > without NUMA support for kernel developers. It offers an interface for cpu
> > and memory hotplug test purpose.
> >
> > * WHY DO WE USE HOTPLUG EMULATOR
> >
> > We are focusing on the hotplug emulation for a few months. The emulator helps
> >  team to reproduce all the major hotplug bugs. It plays an important role to
> > the hotplug code quality assurance. Because of the hotplug emulator, we already
> > move most of the debug working to virtual environment.
> 
> You should extend kvm to make it support NUMA hotplug guest.
> instead of messing up linux numa code.
Yinghai,
	The purpose of the hotplug emulator is to test Linux CPU/memory hotplug, so
it should cover as much of the Linux hotplug code path as possible. That is why we
chose to work inside the Linux kernel, and it has proved helpful for hotplug
debugging in the kernel.

	NUMA hotplug in a KVM guest is another project.

-- 
Thanks & Regards,
Shaohui


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [8/8,v3] NUMA Hotplug Emulator: documentation
  2010-11-17 23:06     ` Randy Dunlap
@ 2010-11-18  2:31       ` Shaohui Zheng
  -1 siblings, 0 replies; 139+ messages in thread
From: Shaohui Zheng @ 2010-11-18  2:31 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: akpm, linux-mm, linux-kernel, haicheng.li, lethal, ak,
	shaohui.zheng, Haicheng Li

On Wed, Nov 17, 2010 at 03:06:59PM -0800, Randy Dunlap wrote:
> On Wed, 17 Nov 2010 10:08:07 +0800 shaohui.zheng@intel.com wrote:
> 
> > From: Shaohui Zheng <shaohui.zheng@intel.com>
> > 
> > add a text file Documentation/x86/x86_64/numa_hotplug_emulator.txt
> > to explain the usage for the hotplug emulator.
> > 
> > Signed-off-by: Haicheng Li <haicheng.li@intel.com>
> > Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com>
> > ---
> > Index: linux-hpe4/Documentation/x86/x86_64/numa_hotplug_emulator.txt
> > ===================================================================
> > --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> > +++ linux-hpe4/Documentation/x86/x86_64/numa_hotplug_emulator.txt	2010-11-17 09:01:10.342836513 +0800
> > @@ -0,0 +1,92 @@
> > +NUMA Hotplug Emulator for x86
> 
> (I'm only looking at the documentation file.)
> 
> Is this only for x86_64?  if so, please change the line above (for x86).
> If not, then don't put this file into the /x86_64/ sub-directory.

There is only a little x86_64-specific code in the patch series, so it should
work for both x86_64 and i386. Currently cpu/memory hotplug is stable on the
x86_64 kernel but still has many issues on i386, so we cannot test the
emulator on an i386 kernel. I'd prefer to keep the document for x86_64 only.

> ---
> ~Randy
> *** Remember to use Documentation/SubmitChecklist when testing your code ***
I will check the documentation once more. Thanks to Randy for the careful review.

-- 
Thanks & Regards,
Shaohui


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation
  2010-11-17 21:10         ` David Rientjes
@ 2010-11-18  4:14           ` Shaohui Zheng
  -1 siblings, 0 replies; 139+ messages in thread
From: Shaohui Zheng @ 2010-11-18  4:14 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, linux-mm, linux-kernel, haicheng.li, lethal, ak,
	shaohui.zheng, Yinghai Lu, Haicheng Li

On Wed, Nov 17, 2010 at 01:10:50PM -0800, David Rientjes wrote:
> On Wed, 17 Nov 2010, Shaohui Zheng wrote:
> 
> > > Hmm, why can't you use numa=hide to hide a specified quantity of memory 
> > > from the kernel and then use the add_memory() interface to hot-add the 
> > > offlined memory in the desired quantity?  In other words, why do you need 
> > > to track the offlined nodes with a state?
> > > 
> > > The userspace interface would take a desired size of hidden memory to 
> > > hot-add and the node id would be the first_unset_node(node_online_map).
> > Yes, it is a good idea, your solution is what we indeed do in our first 2
> > versions.  We use mem=memsize to hide memory, and we call add_memory interface
> > to hot-add offlined memory with desired quantity, and we can also add to
> > desired nodes (even though the nodes do not exist). It is a very flexible
> > solution.
> > 
> > However, this solution was denied since we notice NUMA emulation, we should
> > reuse it.
> > 
> 
> I don't understand why that's a requirement, NUMA emulation is a separate 
> feature.  Although both are primarily used to test and instrument other VM 
> and kernel code, NUMA emulation is restricted to only being used at boot 
> to fake nodes on smaller machines and can be used to test things like the 
> slab allocator.  The NUMA hotplug emulator that you're developing here is 
> primarily used to test the hotplug callbacks; for that use-case, it seems 
> particularly helpful if nodes can be hotplugged of various sizes and node 
> ids rather than having static characteristics that cannot be changed with 
> a reboot.
> 
I agree with you. The early emulator did the same thing as you describe, but
NUMA emulation already exists to create fake nodes, and our emulator also
creates fake nodes. We worried that we would draw criticism from the community,
so we dropped the original design.

I do not know whether other engineers share your view. I think I can publish
both implementations and let the community decide which one is preferred.

Personally, both methods are acceptable to me.

> > Currently, our solution creates static nodes when OS boots, only the node with 
> > state N_HIDDEN can be hot-added with node/probe interface, and we can query 
> > 
> 
> The idea that I've proposed (and you've apparently thought about and even 
> implemented at one point) is much more powerful than that.  We need not 
> query the state of hidden nodes that we've setup at boot but can rather 
> use the amount of hidden memory to setup the nodes in any way that we want 
> at runtime (various sizes, interleaved node ids, etc).

Yes, if we select your proposal, we just mark all the nodes as POSSIBLE nodes.
There are no hidden nodes any more; a node is created the first time memory
is added to it.

This is the early patch (not very formal, it is just an internal version):

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 454997c..9dc6a02 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -73,6 +73,7 @@
  *
  * node_set_online(node)		set bit 'node' in node_online_map
  * node_set_offline(node)		clear bit 'node' in node_online_map
+ * node_set_possible(node)		set bit 'node' in node_possible_map
  *
  * for_each_node(node)			for-loop node over node_possible_map
  * for_each_online_node(node)		for-loop node over node_online_map
@@ -432,6 +433,11 @@ static inline void node_set_offline(int nid)
 	node_clear_state(nid, N_ONLINE);
 	nr_online_nodes = num_node_state(N_ONLINE);
 }
+
+static inline void node_set_possible(int nid)
+{
+	node_set_state(nid, N_POSSIBLE);
+}
 #else
 
 static inline int node_state(int node, enum node_states state)
@@ -462,6 +468,7 @@ static inline int num_node_state(enum node_states state)
 
 #define node_set_online(node)	   node_set_state((node), N_ONLINE)
 #define node_set_offline(node)	   node_clear_state((node), N_ONLINE)
+#define node_set_possible(node)	   node_set_state((node), N_POSSIBLE)
 #endif
 
 #define node_online_map 	node_states[N_ONLINE]

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb40925..059ebf0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1602,6 +1602,9 @@ config HOTPLUG_CPU
 	  ( Note: power management support will enable this option
 	    automatically on SMP systems. )
 	  Say N if you want to disable CPU hotplug.
+config ARCH_CPU_PROBE_RELEASE
+	def_bool y
+	depends on HOTPLUG_CPU
 
 config COMPAT_VDSO
 	def_bool y
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 550df48..52094bc 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -26,12 +26,11 @@ void __init setup_node_to_cpumask_map(void)
 {
 	unsigned int node, num = 0;
 
-	/* setup nr_node_ids if not done yet */
-	if (nr_node_ids == MAX_NUMNODES) {
-		for_each_node_mask(node, node_possible_map)
-			num = node;
-		nr_node_ids = num + 1;
-	}
+	/* re-setup nr_node_ids, when CONFIG_ARCH_MEMORY_PROBE enabled and mem=XXX
+	specified, nr_node_ids will be set as the maximum value  */
+	for_each_node_mask(node, node_possible_map)
+		num = node;
+	nr_node_ids = num + 1;
 
 	/* allocate the map */
 	for (node = 0; node < nr_node_ids; node++)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index bd02505..3d0e37c 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -327,6 +327,8 @@ static int block_size_init(void)
  * will not need to do it from userspace.  The fake hot-add code
  * as well as ppc64 will do all of their discovery in userspace
  * and will require this interface.
+ *
+ * Parameter format: start_addr, nid
  */
 #ifdef CONFIG_ARCH_MEMORY_PROBE
 static ssize_t
@@ -336,10 +338,26 @@ memory_probe_store(struct class *class, const char *buf, size_t count)
 	int nid;
 	int ret;
 
-	phys_addr = simple_strtoull(buf, NULL, 0);
+	char *p = strchr(buf, ',');
+
+	if (p != NULL && strlen(p+1) > 0) {
+		/* nid specified */
+		*p++ = '\0';
+		nid = simple_strtoul(p, NULL, 0);
+		phys_addr = simple_strtoull(buf, NULL, 0);
+	} else {
+		phys_addr = simple_strtoull(buf, NULL, 0);
+		nid = memory_add_physaddr_to_nid(phys_addr);
+	}
 
-	nid = memory_add_physaddr_to_nid(phys_addr);
-	ret = add_memory(nid, phys_addr, PAGES_PER_SECTION << PAGE_SHIFT);
+	if (nid < 0 || nid > nr_node_ids - 1) {
+		printk(KERN_ERR "Invalid node id %d(0<=nid<%d).\n", nid, nr_node_ids);
+	} else {
+		printk(KERN_INFO "Add a memory section to node: %d.\n", nid);
+		ret = add_memory(nid, phys_addr, PAGES_PER_SECTION << PAGE_SHIFT);
+		if (ret)
+			count = ret;
+	}
 
 	if (ret)
 		count = ret;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8deb9d0..0d7eeea 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3946,9 +3946,19 @@ static void __init setup_nr_node_ids(void)
 	unsigned int node;
 	unsigned int highest = 0;
 
+	#ifdef CONFIG_ARCH_MEMORY_PROBE
+	/* grub parameter mem=XXX specified */
+	if (1){
+		int cnt;
+		for (cnt = 0; cnt < MAX_NUMNODES; cnt++)
+			node_set_possible(cnt);
+	}
+	#endif
+
 	for_each_node_mask(node, node_possible_map)
 		highest = node;
 	nr_node_ids = highest + 1;
+	printk(KERN_INFO "setup_nr_node_ids: nr_node_ids : %d.\n", nr_node_ids);
 }
 #else
 static inline void setup_nr_node_ids(void)
-- 
Thanks & Regards,
Shaohui


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* Re: [7/8,v3] NUMA Hotplug Emulator: extend memory probe interface to support NUMA
  2010-11-17 18:50     ` Dave Hansen
@ 2010-11-18  4:36       ` Shaohui Zheng
  -1 siblings, 0 replies; 139+ messages in thread
From: Shaohui Zheng @ 2010-11-18  4:36 UTC (permalink / raw)
  To: Dave Hansen
  Cc: akpm, linux-mm, linux-kernel, haicheng.li, lethal, ak,
	shaohui.zheng, Haicheng Li, Wu Fengguang, Greg KH

On Wed, Nov 17, 2010 at 10:50:07AM -0800, Dave Hansen wrote:
> On Wed, 2010-11-17 at 10:08 +0800, shaohui.zheng@intel.com wrote:
> > And more we make it friendly, it is possible to add memory to do
> > 
> >         echo 3g > memory/probe
> >         echo 1024m,3 > memory/probe
> > 
> > It maintains backwards compatibility.
> > 
> > Another format suggested by Dave Hansen:
> > 
> >         echo physical_address=0x40000000 numa_node=3 > memory/probe
> > 
> > it is more explicit to show meaning of the parameters.
> 
> The other thing that Greg suggested was to use configfs.  Looking back
> on it, that makes a lot of sense.  We can do better than these "probe"
> files.
> 
> In your case, it might be useful to tell the kernel to be able to add
> memory in a node and add the node all in one go.  That'll probably be
> closer to what the hardware will do, and will exercise different code
> paths than the separate "add node", "then add memory" steps that you're
> using here.
> 
> For the emulator, I also have to wonder if using debugfs is the right
> way since its ABI is a bit more, well, _flexible_ over time. :)

First, the emulator is just for testing purposes, and I believe that only very
few people will use it, so we did not want to make too many modifications.
The more you change, the more bugs you may get. The memory/probe interface
is already enough to test memory hot-add.

Second, if we want to use configfs and debugfs for the cpu/memory probe interface,
it should be implemented in a separate patch series since it is not part of the
emulator. We already have 8 patches in this patchset; it would become a very long
series if we added everything in.
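
To make the comma-separated form quoted above concrete, here is a minimal
userspace sketch of parsing a probe string such as "1024m,3" or
"0x40000000,3". It is not the kernel code from this patchset: the name
parse_probe(), the max_nid parameter and the convention of returning
nid = -1 when no node is given are assumptions made for illustration only.
The sketch reads the input without modifying it and rejects an out-of-range
node id with an explicit error.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not from the patchset): parse a probe string of
 * the form "<addr>[k|m|g][,<nid>]".
 *
 * On success returns 0 and fills *addr; *nid is set to -1 when no node
 * id was given, so the caller can fall back to an address-based lookup.
 * Returns -EINVAL on malformed input or when nid is outside [0, max_nid).
 */
int parse_probe(const char *buf, unsigned long long *addr, int *nid,
		int max_nid)
{
	char *end;
	unsigned long long val;
	long node = -1;

	/* Require a leading digit so signs and whitespace are rejected. */
	if (!isdigit((unsigned char)buf[0]))
		return -EINVAL;

	errno = 0;
	val = strtoull(buf, &end, 0);
	if (end == buf || errno)
		return -EINVAL;

	/* Optional size suffix, as in "3g" or "1024m". */
	switch (tolower((unsigned char)*end)) {
	case 'g':
		val <<= 10;
		/* fall through */
	case 'm':
		val <<= 10;
		/* fall through */
	case 'k':
		val <<= 10;
		end++;
		break;
	default:
		break;
	}

	/* Optional ",<nid>" part. */
	if (*end == ',') {
		const char *p = end + 1;

		if (!isdigit((unsigned char)*p))
			return -EINVAL;
		errno = 0;
		node = strtol(p, &end, 0);
		if (errno || node < 0 || node >= max_nid)
			return -EINVAL;
	}

	/* Allow one trailing newline, as "echo" would append. */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	*addr = val;
	*nid = (int)node;
	return 0;
}
```

With max_nid = 8, "1024m,3" yields addr 0x40000000 and nid 3, while "3g"
yields addr 0xC0000000 and nid -1, which leaves the caller to pick the node
from the address as the existing interface does.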

> 
> > +       depends on NUMA_HOTPLUG_EMU
> > +       ---help---
> > +         Enable memory hotplug emulation. Reserve memory with grub parameter
> > +         "mem=N"(such as mem=1024M), where N is the initial memory size, the
> > +         rest physical memory will be removed from e820 table; the memory probe
> > +         interface is for memory hot-add to specified node in software method.
> > +         This is for debugging and testing purposes
> 
> mem= actually sets the largest physical address that we're trying to
> use.  If you have a 256MB hole at 768MB, then mem=1G will only get you
> 768MB of memory.  We probably get this wrong in a number of other places
> in the documentation, but we might as well get it right here.
> 
> Maybe something like:
>         
>         Enable emulation of hotplug of NUMA nodes.  To use this, you
>         must also boot with the kernel command-line parameter
>         "mem=N"(such as mem=1024M), where N is the highest physical
>         address you would like to use at boot.  The rest of physical
>         memory will be removed from firmware tables and may then be
>         hotplugged with this feature. This is for debugging and testing
>         purposes.
>         
>         Note that you can still examine the original, non-modified
>         firmware tables in: /sys/firmware/memmap
>         
> -- Dave
I was not aware of the memory hole here, good catch.
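
Dave's point about mem= can be checked with simple arithmetic. The sketch
below, in plain userspace C, sums the RAM that lies below a physical address
limit. struct ram_range and usable_below() are hypothetical names, and the
range list is a simplified stand-in for a firmware memory map, not the real
e820 structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative RAM range [start, end) in bytes; not a real e820 entry. */
struct ram_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Sum the RAM that lies below the physical address limit.
 * A range entirely above the limit contributes nothing; a range that
 * straddles the limit is clipped to it.
 */
uint64_t usable_below(const struct ram_range *map, size_t n, uint64_t limit)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		uint64_t end = map[i].end < limit ? map[i].end : limit;

		if (end > map[i].start)
			total += end - map[i].start;
	}
	return total;
}
```

For RAM at [0, 768MB) and [1GB, 2GB), a limit of 1GB gives 768MB usable,
matching the example above; the 1GB of RAM above the limit is what would be
left over for hot-add.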

-- 
Thanks & Regards,
Shaohui


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [7/8,v3] NUMA Hotplug Emulator: extend memory probe interface to support NUMA
  2010-11-17 21:18       ` David Rientjes
@ 2010-11-18  4:48         ` Shaohui Zheng
  -1 siblings, 0 replies; 139+ messages in thread
From: Shaohui Zheng @ 2010-11-18  4:48 UTC (permalink / raw)
  To: David Rientjes
  Cc: Dave Hansen, akpm, linux-mm, linux-kernel, haicheng.li, lethal,
	ak, shaohui.zheng, Haicheng Li, Wu Fengguang, Greg KH

On Wed, Nov 17, 2010 at 01:18:50PM -0800, David Rientjes wrote:
> On Wed, 17 Nov 2010, Dave Hansen wrote:
> 
> > The other thing that Greg suggested was to use configfs.  Looking back
> > on it, that makes a lot of sense.  We can do better than these "probe"
> > files.
> > 
> > In your case, it might be useful to tell the kernel to be able to add
> > memory in a node and add the node all in one go.  That'll probably be
> > closer to what the hardware will do, and will exercise different code
> > paths that the separate "add node", "then add memory" steps that you're
> > using here.
> > 
> 
> That seems like a separate issue of moving the memory hotplug interface 
> over to configfs and that seems like it will cause a lot of userspace 
> breakage.  The memory hotplug interface can already add memory to a node 
> without using the ACPI notifier, so what does it have to do with this 
> patchset?

I agree with you; I do not suggest implementing it in this patchset.

> 
> I think what this patchset really wants to do is map offline hot-added 
> memory to a different node id before it is onlined.  It needs no 
> additional command-line interface or kconfig options, users just need to 
> physically hot-add memory at runtime or use mem= when booting to reserve 
> present memory from being used.

I already sent out the implementation in another email; could you help
review it?

> 
> Then, export the amount of memory that is actually physically present in 
> the e820 but was truncated by mem= and allow users to hot-add the memory 
> via the probe interface.  Add a writeable 'node' file to offlined memory 
> section directories and allow it to be changed prior to online.

Memory offlining is known to be difficult, and it is not well supported
in the current kernel, so I do not suggest providing an offline interface
in the emulator; it would just cause more pain. We can consider adding it
once memory offlining works well.

-- 
Thanks & Regards,
Shaohui


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation
  2010-11-18  6:27             ` Paul Mundt
@ 2010-11-18  5:27               ` Shaohui Zheng
  -1 siblings, 0 replies; 139+ messages in thread
From: Shaohui Zheng @ 2010-11-18  5:27 UTC (permalink / raw)
  To: Paul Mundt
  Cc: David Rientjes, Andrew Morton, linux-mm, linux-kernel,
	haicheng.li, ak, shaohui.zheng, Yinghai Lu, Haicheng Li

On Thu, Nov 18, 2010 at 03:27:15PM +0900, Paul Mundt wrote:
> On Thu, Nov 18, 2010 at 12:14:07PM +0800, Shaohui Zheng wrote:
> > On Wed, Nov 17, 2010 at 01:10:50PM -0800, David Rientjes wrote:
> > > The idea that I've proposed (and you've apparently thought about and even 
> > > implemented at one point) is much more powerful than that.  We need not 
> > > query the state of hidden nodes that we've setup at boot but can rather 
> > > use the amount of hidden memory to setup the nodes in any way that we want 
> > > at runtime (various sizes, interleaved node ids, etc).
> > 
> > Yes, if we select your proposal, we just mark all the nodes as POSSIBLE.
> > There are no hidden nodes any more; a node is created the first time
> > memory is added to it.
> > 
> This is roughly what I had in mind in my N_HIDDEN review, so I quite
> favour this approach.

Our testing shows that it is a feasible approach, and it works well.
However, there is still one problem we should worry about.

In our draft patch, we set up nr_node_ids again when CONFIG_ARCH_MEMORY_PROBE is
enabled and mem=XXX is specified in grub. We set nr_node_ids to MAX_NUMNODES + 1
because we do not know how many nodes will be hot-added through the memory/probe
interface. This might waste a little memory.

-- 
Thanks & Regards,
Shaohui


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [7/8,v3] NUMA Hotplug Emulator: extend memory probe interface to support NUMA
  2010-11-18  4:48         ` Shaohui Zheng
@ 2010-11-18  6:24           ` Paul Mundt
  -1 siblings, 0 replies; 139+ messages in thread
From: Paul Mundt @ 2010-11-18  6:24 UTC (permalink / raw)
  To: Shaohui Zheng
  Cc: David Rientjes, Dave Hansen, akpm, linux-mm, linux-kernel,
	haicheng.li, ak, shaohui.zheng, Haicheng Li, Wu Fengguang,
	Greg KH

On Thu, Nov 18, 2010 at 12:48:50PM +0800, Shaohui Zheng wrote:
> On Wed, Nov 17, 2010 at 01:18:50PM -0800, David Rientjes wrote:
> > Then, export the amount of memory that is actually physically present in 
> > the e820 but was truncated by mem= and allow users to hot-add the memory 
> > via the probe interface.  Add a writeable 'node' file to offlined memory 
> > section directories and allow it to be changed prior to online.
> 
> Memory offlining is known to be difficult and is not well supported in the
> current kernel, so I do not suggest providing an offline interface in the
> emulator; it would only add pain. We can consider adding it once memory
> offlining works well.
> 
This is all stuff that the memblock API can deal with; I'm not sure why
there seems to be an insistence on wedging all manner of unrelated bits
into e820. Many platforms using memblock today already offline large
amounts of contiguous physical memory for use in drivers. If you were to
follow this scheme and simply layer a node creation shim on top of that,
you would end up with something that is almost entirely generic.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation
  2010-11-18  4:14           ` Shaohui Zheng
@ 2010-11-18  6:27             ` Paul Mundt
  -1 siblings, 0 replies; 139+ messages in thread
From: Paul Mundt @ 2010-11-18  6:27 UTC (permalink / raw)
  To: Shaohui Zheng
  Cc: David Rientjes, Andrew Morton, linux-mm, linux-kernel,
	haicheng.li, ak, shaohui.zheng, Yinghai Lu, Haicheng Li

On Thu, Nov 18, 2010 at 12:14:07PM +0800, Shaohui Zheng wrote:
> On Wed, Nov 17, 2010 at 01:10:50PM -0800, David Rientjes wrote:
> > The idea that I've proposed (and you've apparently thought about and even 
> > implemented at one point) is much more powerful than that.  We need not 
> > query the state of hidden nodes that we've setup at boot but can rather 
> > use the amount of hidden memory to setup the nodes in any way that we want 
> > at runtime (various sizes, interleaved node ids, etc).
> 
> Yes, if we select your proposal, we just mark all the nodes as POSSIBLE.
> There are no hidden nodes any more; a node is created the first time
> memory is added to it.
> 
This is roughly what I had in mind in my N_HIDDEN review, so I quite
favour this approach.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [1/8,v3] NUMA Hotplug Emulator: add function to hide memory region via e820 table.
  2010-11-17  8:16     ` David Rientjes
@ 2010-11-18  9:20       ` Shaohui Zheng
  -1 siblings, 0 replies; 139+ messages in thread
From: Shaohui Zheng @ 2010-11-18  9:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: akpm, linux-mm, linux-kernel, haicheng.li, lethal, ak,
	shaohui.zheng, Yinghai Lu, Haicheng Li

On Wed, Nov 17, 2010 at 12:16:34AM -0800, David Rientjes wrote:
> On Wed, 17 Nov 2010, shaohui.zheng@intel.com wrote:
> 
> > Index: linux-hpe4/arch/x86/kernel/e820.c
> > ===================================================================
> > --- linux-hpe4.orig/arch/x86/kernel/e820.c	2010-11-15 17:13:02.483461667 +0800
> > +++ linux-hpe4/arch/x86/kernel/e820.c	2010-11-15 17:13:07.083461581 +0800
> > @@ -971,6 +971,7 @@
> >  }
> >  
> >  static int userdef __initdata;
> > +static u64 max_mem_size __initdata = ULLONG_MAX;
> >  
> >  /* "mem=nopentium" disables the 4MB page tables. */
> >  static int __init parse_memopt(char *p)
> > @@ -989,12 +990,28 @@
> >  
> >  	userdef = 1;
> >  	mem_size = memparse(p, &p);
> > -	e820_remove_range(mem_size, ULLONG_MAX - mem_size, E820_RAM, 1);
> > +	e820_remove_range(mem_size, max_mem_size - mem_size, E820_RAM, 1);
> > +	max_mem_size = mem_size;
> >  
> >  	return 0;
> >  }
> 
> This needs memmap= support as well, right?
We have not tested the combination of memmap= and the numa=hide parameter.
I think the result should be similar to mem=XX, since both remove a memory
region from the e820 table.

> 
> >  early_param("mem", parse_memopt);
> >  
> > +#ifdef CONFIG_NODE_HOTPLUG_EMU
> > +u64 __init e820_hide_mem(u64 mem_size)
> > +{
> > +	u64 start, end_pfn;
> > +
> > +	userdef = 1;
> > +	end_pfn = e820_end_of_ram_pfn();
> > +	start = (end_pfn << PAGE_SHIFT) - mem_size;
> > +	e820_remove_range(start, max_mem_size - start, E820_RAM, 1);
> > +	max_mem_size = start;
> > +
> > +	return start;
> > +}
> > +#endif
> 
> This doesn't have any sanity checking for whether e820_remove_range() will 
> leave any significant amount of memory behind so the kernel will even boot 
> (probably should have a guaranteed FAKE_NODE_MIN_SIZE left behind?).

It should not be checked here; it should be checked by the function that calls
e820_hide_mem(), which should truncate mem_size with FAKE_NODE_MIN_SIZE.

> 
> > +
> >  static int __init parse_memmap_opt(char *p)
> >  {
> >  	char *oldp;

-- 
Thanks & Regards,
Shaohui


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [7/8,v3] NUMA Hotplug Emulator: extend memory probe interface to support NUMA
  2010-11-17 22:44           ` David Rientjes
@ 2010-11-18 16:59             ` Aaron Durbin
  -1 siblings, 0 replies; 139+ messages in thread
From: Aaron Durbin @ 2010-11-18 16:59 UTC (permalink / raw)
  To: David Rientjes
  Cc: Dave Hansen, shaohui.zheng, Andrew Morton, linux-mm,
	linux-kernel, haicheng.li, lethal, ak, shaohui.zheng,
	Haicheng Li, Wu Fengguang, Greg KH

On Wed, Nov 17, 2010 at 2:44 PM, David Rientjes <rientjes@google.com> wrote:
> On Wed, 17 Nov 2010, Dave Hansen wrote:
>
>> > Then, export the amount of memory that is actually physically present in
>> > the e820 but was truncated by mem=
>>
>> I _think_ that's already effectively done in /sys/firmware/memmap.
>>
>
> Ok.
>
> It's a little complicated because we don't export each online node's
> physical address range so you have to parse the dmesg to find what nodes
> were allocated at boot and determine how much physically present memory
> you have that's hidden but can be hotplugged using the probe files.
>
> Adding Aaron Durbin <adurbin@google.com> to the cc because he has a patch
> that exports the physical address range of each node in their sysfs
> directories.

Is this something that is needed upstream? I can post it if that is the case.
Sorry, I don't have a lot of context w.r.t. this thread.

>
>> > and allow users to hot-add the memory
>> > via the probe interface.  Add a writeable 'node' file to offlined memory
>> > section directories and allow it to be changed prior to online.
>>
>> That would work, in theory.  But, in practice, we allocate the mem_map[]
>> at probe time.  So, we've already effectively picked a node at probe.
>> That was done because the probe is equivalent to the hardware "add"
>> event.  Once the hardware knows where in the address space the memory is,
>> it always also knows the node.
>>
>> But, I guess it also wouldn't be horrible if we just hot-removed and
>> hot-added an offline section if someone did write to a node file like
>> you're suggesting.  It might actually exercise some interesting code
>> paths.
>>
>
> Since the pages are offline you should be able to modify the memmap when
> the 'node' file is written and use populate_memnodemap() since that file
> is only writeable in an offline state.
>

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [1/8,v3] NUMA Hotplug Emulator: add function to hide memory region via e820 table.
  2010-11-18  9:20       ` Shaohui Zheng
@ 2010-11-18 21:16         ` David Rientjes
  -1 siblings, 0 replies; 139+ messages in thread
From: David Rientjes @ 2010-11-18 21:16 UTC (permalink / raw)
  To: Shaohui Zheng
  Cc: akpm, linux-mm, linux-kernel, haicheng.li, lethal, ak,
	shaohui.zheng, Yinghai Lu, Haicheng Li

On Thu, 18 Nov 2010, Shaohui Zheng wrote:

> > > Index: linux-hpe4/arch/x86/kernel/e820.c
> > > ===================================================================
> > > --- linux-hpe4.orig/arch/x86/kernel/e820.c	2010-11-15 17:13:02.483461667 +0800
> > > +++ linux-hpe4/arch/x86/kernel/e820.c	2010-11-15 17:13:07.083461581 +0800
> > > @@ -971,6 +971,7 @@
> > >  }
> > >  
> > >  static int userdef __initdata;
> > > +static u64 max_mem_size __initdata = ULLONG_MAX;
> > >  
> > >  /* "mem=nopentium" disables the 4MB page tables. */
> > >  static int __init parse_memopt(char *p)
> > > @@ -989,12 +990,28 @@
> > >  
> > >  	userdef = 1;
> > >  	mem_size = memparse(p, &p);
> > > -	e820_remove_range(mem_size, ULLONG_MAX - mem_size, E820_RAM, 1);
> > > +	e820_remove_range(mem_size, max_mem_size - mem_size, E820_RAM, 1);
> > > +	max_mem_size = mem_size;
> > >  
> > >  	return 0;
> > >  }
> > 
> > This needs memmap= support as well, right?
> We have not tested the combination of memmap= and the numa=hide parameter.
> I think the result should be similar to mem=XX, since both remove a memory
> region from the e820 table.
> 

You've modified the parser for mem= but not memmap= so the change needs 
additional support for the latter.

> > >  early_param("mem", parse_memopt);
> > >  
> > > +#ifdef CONFIG_NODE_HOTPLUG_EMU
> > > +u64 __init e820_hide_mem(u64 mem_size)
> > > +{
> > > +	u64 start, end_pfn;
> > > +
> > > +	userdef = 1;
> > > +	end_pfn = e820_end_of_ram_pfn();
> > > +	start = (end_pfn << PAGE_SHIFT) - mem_size;
> > > +	e820_remove_range(start, max_mem_size - start, E820_RAM, 1);
> > > +	max_mem_size = start;
> > > +
> > > +	return start;
> > > +}
> > > +#endif
> > 
> > This doesn't have any sanity checking for whether e820_remove_range() will 
> > leave any significant amount of memory behind so the kernel will even boot 
> > (probably should have a guaranteed FAKE_NODE_MIN_SIZE left behind?).
> 
> It should not be checked here; it should be checked by the function that calls
> e820_hide_mem(), which should truncate mem_size with FAKE_NODE_MIN_SIZE.
> 

Your patchset doesn't do that.  I'm talking specifically about the amount 
of memory left behind so that the kernel at least still boots.  It seems 
to be the job of e820_hide_mem() to do some sanity checking so that we 
actually still get a kernel, rather than the responsibility of the 
command-line parser.
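
The sanity check discussed above can be sketched as a small userspace model;
it is not kernel code and not part of the posted patch. It assumes 4 KiB pages
and a 64 MiB floor. MIN_REMAINING_BYTES and hide_mem_start() are hypothetical
names used only for illustration; the thread merely suggests that something
like FAKE_NODE_MIN_SIZE should be left behind.

```c
#include <stdint.h>

#define PAGE_SHIFT 12				/* assumed: 4 KiB pages */
#define MIN_REMAINING_BYTES (64ULL << 20)	/* assumed floor, hypothetical name */

/*
 * Userspace model of the clamping idea: given the end of RAM (as a page
 * frame number) and the number of bytes the caller wants to hide, return
 * the start address of the hidden region.  The request is clamped so that
 * at least MIN_REMAINING_BYTES of RAM stay visible below the hidden part.
 */
uint64_t hide_mem_start(uint64_t end_pfn, uint64_t mem_size)
{
	uint64_t end = end_pfn << PAGE_SHIFT;

	/* Not enough RAM to spare anything: hide nothing. */
	if (end <= MIN_REMAINING_BYTES)
		return end;

	/* Never hide more than what lies above the floor. */
	if (mem_size > end - MIN_REMAINING_BYTES)
		mem_size = end - MIN_REMAINING_BYTES;

	return end - mem_size;
}
```

With 4 GiB of RAM in this model, hiding 1 GiB yields a start of 0xC0000000,
while a request to hide 8 GiB is clamped and yields 0x4000000, so the floor
stays usable.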

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation
  2010-11-18  4:14           ` Shaohui Zheng
@ 2010-11-18 21:19             ` David Rientjes
  -1 siblings, 0 replies; 139+ messages in thread
From: David Rientjes @ 2010-11-18 21:19 UTC (permalink / raw)
  To: Shaohui Zheng
  Cc: Andrew Morton, linux-mm, linux-kernel, haicheng.li, lethal, ak,
	shaohui.zheng, Yinghai Lu, Haicheng Li

On Thu, 18 Nov 2010, Shaohui Zheng wrote:

> On Wed, Nov 17, 2010 at 01:10:50PM -0800, David Rientjes wrote:
> > I don't understand why that's a requirement, NUMA emulation is a separate 
> > feature.  Although both are primarily used to test and instrument other VM 
> > and kernel code, NUMA emulation is restricted to only being used at boot 
> > to fake nodes on smaller machines and can be used to test things like the 
> > slab allocator.  The NUMA hotplug emulator that you're developing here is 
> > primarily used to test the hotplug callbacks; for that use-case, it seems 
> > particularly helpful if nodes can be hotplugged of various sizes and node 
> > ids rather than having static characteristics that cannot be changed with 
> > a reboot.
> > 
> I agree with you. The early emulator did the same thing you describe, but
> NUMA emulation already creates fake nodes, and our emulator also creates
> fake nodes. We worried that we would be criticized by the community,
> so we dropped the original design.
> 
> I do not know whether other engineers share your view. I think
> I can publish both versions and let the community decide which one is preferred.
> 
> Personally, I find both methods acceptable.
> 
> 

The way that I've proposed it in my email to Dave was different: we use 
the memory hotplug interface to add and online the memory only after an 
interface has been added that will change the node mappings to 
first_unset_node(node_online_map).  The memory hotplug interface may 
create a new pgdat, so this is the node creation mechanism that should be 
used as opposed to those in NUMA emulation.
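
To illustrate the node-id choice mentioned above, here is a userspace model of
what first_unset_node() computes on a node mask. It is only a sketch: the mask
is reduced to a single 64-bit word, and MODEL_MAX_NODES and
first_unset_node_model() are hypothetical names, not the kernel's nodemask
implementation.

```c
#include <stdint.h>

#define MODEL_MAX_NODES 64	/* assumed: the whole mask fits in one word */

/*
 * Return the lowest node id whose bit is clear in the online map, or
 * MODEL_MAX_NODES if every node id is already taken.
 */
int first_unset_node_model(uint64_t node_online_map)
{
	int nid;

	for (nid = 0; nid < MODEL_MAX_NODES; nid++)
		if (!(node_online_map & (1ULL << nid)))
			return nid;

	return MODEL_MAX_NODES;
}
```

For example, with nodes 0 and 1 online (mask 0x3) the model picks node 2 for
the newly added memory; with nodes 0 and 2 online (mask 0x5) it picks node 1.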

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation
  2010-11-18  5:27               ` Shaohui Zheng
@ 2010-11-18 21:24                 ` David Rientjes
  -1 siblings, 0 replies; 139+ messages in thread
From: David Rientjes @ 2010-11-18 21:24 UTC (permalink / raw)
  To: Shaohui Zheng
  Cc: Paul Mundt, Andrew Morton, linux-mm, linux-kernel, haicheng.li,
	ak, shaohui.zheng, Yinghai Lu, Haicheng Li

On Thu, 18 Nov 2010, Shaohui Zheng wrote:

> In our draft patch, we set up nr_node_ids again when CONFIG_ARCH_MEMORY_PROBE is
> enabled and mem=XXX is specified in grub. We set nr_node_ids to MAX_NUMNODES + 1
> because we do not know how many nodes will be hot-added through the memory/probe
> interface. This might waste a little memory.
> 

nr_node_ids need not be set to anything different at boot, the 
MEM_GOING_ONLINE callback should be used for anything (like the slab 
allocators) where a new node is introduced and needs to be dealt with 
accordingly; this is how regular memory hotplug works, we need no 
additional code in this regard because it's emulated.  If a subsystem 
needs to change in response to a new node going online and doesn't as a 
result of using your emulator, that's a bug and either needs to be fixed 
or prohibited from use with CONFIG_MEMORY_HOTPLUG.

(See the MEM_GOING_ONLINE callback in mm/slub.c, for instance, which deals 
only with the case of node hotplug.)
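
The callback pattern described above can be modelled in userspace as follows.
This is a sketch only, not the mm/slub.c code: every name ending in _model is
a hypothetical stand-in, the per-node state is a plain array, and treating a
negative node id as "no node changes state" is an assumption of the model.
Only the idea is taken from the message: per-node data is set up when a
notification says a new node is about to come online, instead of being sized
for every possible node at boot.

```c
#include <stdlib.h>

#define MODEL_MAX_NODES 8	/* assumed node limit for the model */

/* Hypothetical stand-ins for the memory hotplug notification actions. */
enum mem_action_model {
	GOING_ONLINE_MODEL,
	CANCEL_ONLINE_MODEL,
};

/* Node that is about to come online, or negative if none (assumption). */
struct memory_notify_model {
	int status_change_nid;
};

/* Per-node state a subsystem might keep, allocated lazily per node. */
int *per_node_state[MODEL_MAX_NODES];

/* Returns 0 on success, -1 if the per-node state could not be allocated. */
int mem_callback_model(enum mem_action_model action,
		       const struct memory_notify_model *arg)
{
	int nid = arg->status_change_nid;

	/* No new node involved in this event: nothing to do. */
	if (nid < 0 || nid >= MODEL_MAX_NODES)
		return 0;

	switch (action) {
	case GOING_ONLINE_MODEL:
		/* First memory for this node: create its state now. */
		if (!per_node_state[nid]) {
			per_node_state[nid] = calloc(1, sizeof(int));
			if (!per_node_state[nid])
				return -1;
		}
		break;
	case CANCEL_ONLINE_MODEL:
		/* Onlining was aborted: drop the state again. */
		free(per_node_state[nid]);
		per_node_state[nid] = NULL;
		break;
	}

	return 0;
}
```

In this model nothing needs to be known at boot about how many nodes will
appear later; each subsystem reacts when a node actually shows up.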

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [7/8,v3] NUMA Hotplug Emulator: extend memory probe interface to support NUMA
  2010-11-18  6:24           ` Paul Mundt
@ 2010-11-18 21:28             ` David Rientjes
  -1 siblings, 0 replies; 139+ messages in thread
From: David Rientjes @ 2010-11-18 21:28 UTC (permalink / raw)
  To: Paul Mundt
  Cc: Shaohui Zheng, Dave Hansen, akpm, linux-mm, linux-kernel,
	haicheng.li, ak, shaohui.zheng, Haicheng Li, Wu Fengguang,
	Greg KH

On Thu, 18 Nov 2010, Paul Mundt wrote:

> This is all stuff that the memblock API can deal with, I'm not sure why
> there seems to be an insistence on wedging all manner of unrelated bits
> in to e820. Many platforms using memblock today already offline large
> amounts of contiguous physical memory for use in drivers, if you were to
> follow this scheme and simply layer a node creation shim on top of that
> you would end up with something that is almost entirely generic.
> 

I don't see why this patchset needs to use the memblock API at all, it 
should be built entirely on the generic mem-hotplug API.  The only 
extension needed is the remapping of removed memory to a new node id (done 
on x86 with update_nodes_add()) prior to add_memory() for each arch that 
supports onlining new nodes.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [7/8,v3] NUMA Hotplug Emulator: extend memory probe interface to support NUMA
  2010-11-18  4:48         ` Shaohui Zheng
@ 2010-11-18 21:31           ` David Rientjes
  -1 siblings, 0 replies; 139+ messages in thread
From: David Rientjes @ 2010-11-18 21:31 UTC (permalink / raw)
  To: Shaohui Zheng
  Cc: Dave Hansen, akpm, linux-mm, linux-kernel, haicheng.li, lethal,
	ak, shaohui.zheng, Haicheng Li, Wu Fengguang, Greg KH

On Thu, 18 Nov 2010, Shaohui Zheng wrote:

> > Then, export the amount of memory that is actually physically present in 
> > the e820 but was truncated by mem= and allow users to hot-add the memory 
> > via the probe interface.  Add a writeable 'node' file to offlined memory 
> > section directories and allow it to be changed prior to online.
> 
> for memory offlining, it is a known diffcult thing, and it is not supported 
> well in current kernel, so I do not suggest to provide the offline interface
> in the emulator, it just take more pains. We can consider to add it when
> the memory offlining works well.
> 

You're referring to the inability to remove memory sections for 
CONFIG_SPARSEMEM_VMEMMAP?  You should still be able to test the offlining 
of emulated nodes with other memory models by using the generic support 
already implemented for CONFIG_MEMORY_HOTREMOVE; the short answer is that 
it probably shouldn't matter at all since we already support node 
hot-remove and the fact that they are emulated nodes isn't really of 
interest.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [1/8,v3] NUMA Hotplug Emulator: add function to hide memory region via e820 table.
  2010-11-18 21:16         ` David Rientjes
@ 2010-11-19  0:12           ` Shaohui Zheng
  -1 siblings, 0 replies; 139+ messages in thread
From: Shaohui Zheng @ 2010-11-19  0:12 UTC (permalink / raw)
  To: David Rientjes
  Cc: akpm, linux-mm, linux-kernel, haicheng.li, lethal, ak,
	shaohui.zheng, Yinghai Lu, Haicheng Li

On Thu, Nov 18, 2010 at 01:16:07PM -0800, David Rientjes wrote:
> On Thu, 18 Nov 2010, Shaohui Zheng wrote:
> 
> > > > Index: linux-hpe4/arch/x86/kernel/e820.c
> > > > ===================================================================
> > > > --- linux-hpe4.orig/arch/x86/kernel/e820.c	2010-11-15 17:13:02.483461667 +0800
> > > > +++ linux-hpe4/arch/x86/kernel/e820.c	2010-11-15 17:13:07.083461581 +0800
> > > > @@ -971,6 +971,7 @@
> > > >  }
> > > >  
> > > >  static int userdef __initdata;
> > > > +static u64 max_mem_size __initdata = ULLONG_MAX;
> > > >  
> > > >  /* "mem=nopentium" disables the 4MB page tables. */
> > > >  static int __init parse_memopt(char *p)
> > > > @@ -989,12 +990,28 @@
> > > >  
> > > >  	userdef = 1;
> > > >  	mem_size = memparse(p, &p);
> > > > -	e820_remove_range(mem_size, ULLONG_MAX - mem_size, E820_RAM, 1);
> > > > +	e820_remove_range(mem_size, max_mem_size - mem_size, E820_RAM, 1);
> > > > +	max_mem_size = mem_size;
> > > >  
> > > >  	return 0;
> > > >  }
> > > 
> > > This needs memmap= support as well, right?
> > we did not do the testing after combine both memmap and numa=hide paramter, 
> > I think that the result should similar with mem=XX, they both remove a memory
> > region from the e820 table.
> > 
> 
> You've modified the parser for mem= but not memmap= so the change needs 
> additional support for the latter.
> 

The parser for mem= is not modified; the changed parser is numa=, to which I
added an additional option, numa=hide=.

From the current discussion, the numa=hide= interface should be removed; we will
use mem= to hide memory.

> > > >  early_param("mem", parse_memopt);
> > > >  
> > > > +#ifdef CONFIG_NODE_HOTPLUG_EMU
> > > > +u64 __init e820_hide_mem(u64 mem_size)
> > > > +{
> > > > +	u64 start, end_pfn;
> > > > +
> > > > +	userdef = 1;
> > > > +	end_pfn = e820_end_of_ram_pfn();
> > > > +	start = (end_pfn << PAGE_SHIFT) - mem_size;
> > > > +	e820_remove_range(start, max_mem_size - start, E820_RAM, 1);
> > > > +	max_mem_size = start;
> > > > +
> > > > +	return start;
> > > > +}
> > > > +#endif
> > > 
> > > This doesn't have any sanity checking for whether e820_remove_range() will 
> > > leave any significant amount of memory behind so the kernel will even boot 
> > > (probably should have a guaranteed FAKE_NODE_MIN_SIZE left behind?).
> > 
> > it should not be checked here, it should be checked by the function who call
> >  e820_hide_mem, and truncate the mem_size with FAKE_NODE_MIN_SIZE.
> > 
> 
> Your patchset doesn't do that, I'm talking specifically about the amount 
> of memory left behind so that the kernel at least still boots.  That seems 
> to be a function of e820_hide_mem() to do some sanity checking so we 
> actually still get a kernel rather than the responsibility of the 
> command-line parser.

How much memory is enough to make sure the kernel can still boot? It is very
hard to measure, and it is almost impossible to get an exact figure. When I tried
leaving very little memory to the kernel (hiding most memory with numa=hide), it
caused a panic directly.

I have no idea how to handle this; do you have any suggestions?

Another example:
when I tried adding the parameter "mem=1M", it complained "Select item can not fit
into memory". I did not find where the error message comes from; I guess it is
printed by grub.
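
One possible shape for the sanity check under discussion, as a userspace
sketch in C rather than a patch. Everything here is an assumption for
illustration: MODEL_MIN_BOOT_MEM is an arbitrary placeholder (the thread only
suggests something like FAKE_NODE_MIN_SIZE and gives no value), and
model_hide_mem_start() is a hypothetical stand-in for the address arithmetic in
e820_hide_mem(). It ignores holes in the physical map and does not answer how
large the floor really has to be.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical floor; 32 MiB is an arbitrary placeholder, not a real limit. */
#define MODEL_MIN_BOOT_MEM (32ULL << 20)

/*
 * Return the start address of the hidden region, given the end of RAM and
 * the requested amount to hide. The request is clamped so that at least
 * MODEL_MIN_BOOT_MEM stays visible; if RAM is already at or below the
 * floor, nothing is hidden and end_of_ram is returned.
 */
static uint64_t model_hide_mem_start(uint64_t end_of_ram, uint64_t hide_size)
{
	uint64_t max_hide;

	if (end_of_ram <= MODEL_MIN_BOOT_MEM)
		return end_of_ram;
	max_hide = end_of_ram - MODEL_MIN_BOOT_MEM;
	if (hide_size > max_hide)
		hide_size = max_hide;
	assert(hide_size <= end_of_ram);
	return end_of_ram - hide_size;
}
```

With 4 GiB of RAM, a request to hide 1 GiB returns a start of 3 GiB, while a
request to hide 8 GiB is clamped so that the floor stays visible. Doing the
clamp inside the hiding function keeps the check independent of whichever
command-line parser calls it.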

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation
  2010-11-18 21:24                 ` David Rientjes
@ 2010-11-19  0:32                   ` Shaohui Zheng
  -1 siblings, 0 replies; 139+ messages in thread
From: Shaohui Zheng @ 2010-11-19  0:32 UTC (permalink / raw)
  To: David Rientjes
  Cc: Paul Mundt, Andrew Morton, linux-mm, linux-kernel, haicheng.li,
	ak, shaohui.zheng, Yinghai Lu, Haicheng Li

On Thu, Nov 18, 2010 at 01:24:52PM -0800, David Rientjes wrote:
> On Thu, 18 Nov 2010, Shaohui Zheng wrote:
> 
> > in our draft patch, we re-setup nr_node_ids when CONFIG_ARCH_MEMORY_PROBE enabled 
> > and mem=XXX was specified in grub. we set nr_node_ids as MAX_NUMNODES + 1, because
> >  we do not know how many nodes will be hot-added through memory/probe interface. 
> >  it might be a little wasting of memory.
> > 
> 
> nr_node_ids need not be set to anything different at boot, the 
> MEM_GOING_ONLINE callback should be used for anything (like the slab 
> allocators) where a new node is introduced and needs to be dealt with 
> accordingly; this is how regular memory hotplug works, we need no 
> additional code in this regard because it's emulated.  If a subsystem 
> needs to change in response to a new node going online and doesn't as a 
> result of using your emulator, that's a bug and either needs to be fixed 
> or prohibited from use with CONFIG_MEMORY_HOTPLUG.
> 
> (See the MEM_GOING_ONLINE callback in mm/slub.c, for instance, which deals 
> only with the case of node hotplug.)

nr_node_ids is the number of possible node ids. When we do a regular memory
online, the memory is onlined to a possible node, which is already counted in
nr_node_ids.

If you increment nr_node_ids dynamically when a node comes online, it causes a
lot of problems, because much data is initialized according to nr_node_ids.
That is our experience from debugging the emulator.

mm/page_alloc.c:
/*
 * Figure out the number of possible node ids.
 */
static void __init setup_nr_node_ids(void)
{
	unsigned int node;
	unsigned int highest = 0;

	for_each_node_mask(node, node_possible_map)
		highest = node;
	nr_node_ids = highest + 1;
}

There is no conflict between the emulator and CONFIG_MEMORY_HOTPLUG. A real node
can be onlined because it is already marked as _possible_; if the emulator is
enabled, all nodes are marked as _possible_, real nodes included.
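
The effect of marking every node possible can be seen in a small userspace
model of the function quoted above. This is a sketch under assumptions:
MODEL_MAX_NUMNODES and model_nodemask_t are hypothetical stand-ins for
MAX_NUMNODES and node_possible_map, and a 64-bit mask replaces the kernel's
nodemask type.

```c
#include <assert.h>

#define MODEL_MAX_NUMNODES 64	/* hypothetical; stands in for MAX_NUMNODES */

/* One bit per node id, standing in for node_possible_map. */
typedef unsigned long long model_nodemask_t;

/* Same logic as setup_nr_node_ids(): highest possible node id plus one. */
static unsigned int model_nr_node_ids(model_nodemask_t possible)
{
	unsigned int node, highest = 0;

	/* The mask type must be wide enough to hold every modelled node. */
	assert(sizeof(model_nodemask_t) * 8 >= MODEL_MAX_NUMNODES);

	for (node = 0; node < MODEL_MAX_NUMNODES; node++)
		if (possible & (1ULL << node))
			highest = node;
	return highest + 1;
}
```

With only nodes 0 and 1 possible the result is 2; with every bit set it is
MODEL_MAX_NUMNODES, which is where the extra memory cost mentioned earlier in
the thread comes from.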

-- 
Thanks & Regards,
Shaohui


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [0/8,v3] NUMA Hotplug Emulator - Introduction & Feedbacks
  2010-11-17  5:22   ` Paul Mundt
@ 2010-11-19  5:54     ` Shaohui Zheng
  -1 siblings, 0 replies; 139+ messages in thread
From: Shaohui Zheng @ 2010-11-19  5:54 UTC (permalink / raw)
  To: Paul Mundt; +Cc: akpm, linux-mm, linux-kernel, haicheng.li, ak, shaohui.zheng

On Wed, Nov 17, 2010 at 02:22:13PM +0900, Paul Mundt wrote:
> On Wed, Nov 17, 2010 at 10:07:59AM +0800, shaohui.zheng@intel.com wrote:
> > * PATCHSET INTRODUCTION
> > 
> > patch 1: Add function to hide memory region via e820 table. Then emulator will
> > 	     use these memory regions to fake offlined numa nodes.
> > patch 2: Infrastructure of NUMA hotplug emulation, introduce "hide node".
> > patch 3: Provide an userland interface to hotplug-add fake offlined nodes.
> > patch 4: Abstract cpu register functions, make these interface friend for cpu
> > 		 hotplug emulation
> > patch 5: Support cpu probe/release in x86, it provide a software method to hot
> > 		 add/remove cpu with sysfs interface.
> > patch 6: Fake CPU socket with logical CPU on x86, to prevent the scheduling
> > 		 domain to build the incorrect hierarchy.
> > patch 7: extend memory probe interface to support NUMA, we can add the memory to
> > 		 a specified node with the interface.
> > patch 8: Documentations
> > 
> > * FEEDBACKS & RESPONSES
> > 
> I had some comments on the other patches in the series that possibly got
> missed because of the mail-followup-to confusion:
> 
> http://lkml.org/lkml/2010/11/15/11
The memblock API is a good set of interfaces for managing memory regions. If all
the e820 wrapper functions used the memblock API, the code would be very clean.
Currently nobody uses memblock in the e820 wrappers, so we should keep things as
they are unless we decide to rewrite those wrappers.

Anyway, we have already chosen another way to hide memory, so we will not add
any more wrappers on the e820 table.

> http://lkml.org/lkml/2010/11/15/14

I understand: the macros are not functions, so they will not consume memory after
compilation. The #ifdef should be removed.

> http://lkml.org/lkml/2010/11/15/15
I think you mean ARCH_ENABLE_NUMA_HOTPLUG_EMU here, not ARCH_ENABLE_NUMA_EMU.
The option NUMA_HOTPLUG_EMU is a dummy item: it does not control any code, it
just groups the node/memory/cpu hotplug emulation options together, which is
convenient when a user wants to enable them.


> 
> The other one you've already dealt with.

-- 
Thanks & Regards,
Shaohui


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [7/8,v3] NUMA Hotplug Emulator: extend memory probe interface to support NUMA
  2010-11-17 18:50     ` Dave Hansen
                       ` (2 preceding siblings ...)
  (?)
@ 2010-11-19  7:51     ` Shaohui Zheng
  2010-11-19 16:36         ` Dave Hansen
  -1 siblings, 1 reply; 139+ messages in thread
From: Shaohui Zheng @ 2010-11-19  7:51 UTC (permalink / raw)
  To: Dave Hansen
  Cc: akpm, linux-mm, linux-kernel, haicheng.li, lethal, ak,
	shaohui.zheng, Haicheng Li, Wu Fengguang, Greg KH

On Wed, Nov 17, 2010 at 10:50:07AM -0800, Dave Hansen wrote:
> On Wed, 2010-11-17 at 10:08 +0800, shaohui.zheng@intel.com wrote:
> > And more we make it friendly, it is possible to add memory to do
> > 
> >         echo 3g > memory/probe
> >         echo 1024m,3 > memory/probe
> > 
> > It maintains backwards compatibility.
> > 
> > Another format suggested by Dave Hansen:
> > 
> >         echo physical_address=0x40000000 numa_node=3 > memory/probe
> > 
> > it is more explicit to show meaning of the parameters.
> 
> The other thing that Greg suggested was to use configfs.  Looking back
> on it, that makes a lot of sense.  We can do better than these "probe"
> files.
> 
> In your case, it might be useful to tell the kernel to be able to add
> memory in a node and add the node all in one go.  That'll probably be
> closer to what the hardware will do, and will exercise different code
> paths that the separate "add node", "then add memory" steps that you're
> using here.
> 
> For the emulator, I also have to wonder if using debugfs is the right
> was since its ABI is a bit more, well, _flexible_ over time. :)

There are a lot of problems to solve if we decide to use configfs or debugfs.
I have no good way to solve them, so I would like to hear some advice.

1) How to design the probe interface
I cannot find a good way to replace the current memory/probe interface with
configfs.

As we know, a configfs config_item is created via an explicit userspace mkdir
operation. When we add a memory section, we need to convert it to a mkdir
action. The following implementation is a possible solution.

node/memory hotplug:
/configfs/node
To hot-add a node, we create a directory with the command:
	mkdir /configfs/node/nodeX

and export a probe interface,
/configfs/node/nodeX/probe, which we can use to hot-add memory sections to this
node.

After memory is hot-added through the probe interface, there should be an entry
for each memory section under this directory.

cpu hotplug:
/configfs/cpu/
To hot-add a CPU:
	mkdir /configfs/cpu/cpuX
To hot-remove a CPU:
	rmdir /configfs/cpu/cpuX

I do not know whether this is the expected interface on configfs.

2) co-existence for sysfs and configfs

If we keep both interfaces, things become complicated. When we hot-add memory/cpu
through sysfs, we must create the sysfs entries for it and also create the
configfs entries. Vice versa, when we hot-add/remove cpu/memory through
configfs, we must keep sysfs in sync, too.

Maintaining both the configfs and sysfs interfaces is very complicated; we
should not mix them, and we need to keep it simple.

the purpose of hotplug emulator is providing a possible solution for cpu/memory
hotplug testing, the interface upgrading is not part of emulator. Let's forget
configfs here.

-- 
Thanks & Regards,
Shaohui

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [7/8,v3] NUMA Hotplug Emulator: extend memory probe interface to support NUMA
  2010-11-19  7:51     ` Shaohui Zheng
@ 2010-11-19 16:36         ` Dave Hansen
  0 siblings, 0 replies; 139+ messages in thread
From: Dave Hansen @ 2010-11-19 16:36 UTC (permalink / raw)
  To: Shaohui Zheng
  Cc: akpm, linux-mm, linux-kernel, haicheng.li, lethal, ak,
	shaohui.zheng, Haicheng Li, Wu Fengguang, Greg KH

On Fri, 2010-11-19 at 15:51 +0800, Shaohui Zheng wrote:
> the purpose of hotplug emulator is providing a possible solution for cpu/memory
> hotplug testing, the interface upgrading is not part of emulator. Let's forget
> configfs here. 

If it's just for testing, you're right, we probably shouldn't go to the
trouble of making a new interface.  At the same time, we shouldn't put
something in /sys or configfs that we're not committed to, long-term.

So, not to replace the memory probe file, but _only_ to drive the new
debug-only node hot-add, I think its appropriate place is debugfs.

-- Dave


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [1/8,v3] NUMA Hotplug Emulator: add function to hide memory region via e820 table.
  2010-11-19  0:12           ` Shaohui Zheng
@ 2010-11-21  0:45             ` David Rientjes
  -1 siblings, 0 replies; 139+ messages in thread
From: David Rientjes @ 2010-11-21  0:45 UTC (permalink / raw)
  To: Shaohui Zheng
  Cc: Andrew Morton, linux-mm, linux-kernel, haicheng.li, lethal,
	Andi Kleen, Yinghai Lu, Haicheng Li

On Fri, 19 Nov 2010, Shaohui Zheng wrote:

> > > > > Index: linux-hpe4/arch/x86/kernel/e820.c
> > > > > ===================================================================
> > > > > --- linux-hpe4.orig/arch/x86/kernel/e820.c	2010-11-15 17:13:02.483461667 +0800
> > > > > +++ linux-hpe4/arch/x86/kernel/e820.c	2010-11-15 17:13:07.083461581 +0800
> > > > > @@ -971,6 +971,7 @@
> > > > >  }
> > > > >  
> > > > >  static int userdef __initdata;
> > > > > +static u64 max_mem_size __initdata = ULLONG_MAX;
> > > > >  
> > > > >  /* "mem=nopentium" disables the 4MB page tables. */
> > > > >  static int __init parse_memopt(char *p)
> > > > > @@ -989,12 +990,28 @@
> > > > >  
> > > > >  	userdef = 1;
> > > > >  	mem_size = memparse(p, &p);
> > > > > -	e820_remove_range(mem_size, ULLONG_MAX - mem_size, E820_RAM, 1);
> > > > > +	e820_remove_range(mem_size, max_mem_size - mem_size, E820_RAM, 1);
> > > > > +	max_mem_size = mem_size;
> > > > >  
> > > > >  	return 0;
> > > > >  }
> > > > 
> > > > This needs memmap= support as well, right?
> > > We have not tested combining memmap= with the numa=hide parameter, but I
> > > think the result should be similar to mem=XX: both remove a memory region
> > > from the e820 table.
> > > 
> > 
> > You've modified the parser for mem= but not memmap= so the change needs 
> > additional support for the latter.
> > 
> 
> The parser for mem= is not modified; the changed parser is numa=, to which I
> added an additional option, numa=hide=.
> 

The above hunk is modifying the x86 parser for the mem= parameter.

> > Your patchset doesn't do that, I'm talking specifically about the amount 
> > of memory left behind so that the kernel at least still boots.  That seems 
> > to be a function of e820_hide_mem() to do some sanity checking so we 
> > actually still get a kernel rather than the responsibility of the 
> > command-line parser.
> 
> How much memory is enough to make sure the kernel can still boot? It is very
> hard to measure, and almost impossible to get exact data. When I left very
> little memory to the kernel (hiding most memory with numa=hide), it panicked
> immediately.
> 
> I have no idea how to handle this; do you have any suggestions?
> 

Yes, I think we should use FAKE_NODE_MIN_SIZE to represent the smallest 
node that may be added and so the appropriate behavior of e820_hide_mem() 
would be to leave at least this quantity behind for the kernel to be 
loaded.
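
A minimal sketch of such a sanity check follows. It is not code from the
patchset: clamp_hidden_size() is a hypothetical helper, and
MIN_VISIBLE_BYTES is an illustrative stand-in for FAKE_NODE_MIN_SIZE whose
32 MB value is an assumption (the real value depends on the kernel tree).

```c
#include <stdint.h>

/* Illustrative stand-in for FAKE_NODE_MIN_SIZE; 32 MB is an assumption. */
#define MIN_VISIBLE_BYTES ((uint64_t)32 << 20)

/*
 * Hypothetical helper: given the total amount of RAM and the amount the
 * user asked to hide, return how much may actually be hidden so that at
 * least MIN_VISIBLE_BYTES stays visible to the booting kernel.
 */
uint64_t clamp_hidden_size(uint64_t total, uint64_t requested)
{
	uint64_t max_hidden;

	if (total <= MIN_VISIBLE_BYTES)
		return 0;	/* nothing can be hidden safely */

	max_hidden = total - MIN_VISIBLE_BYTES;
	return requested < max_hidden ? requested : max_hidden;
}
```

With 4G of RAM, a request to hide 1G is honoured unchanged, while a request
to hide all 4G is reduced so that MIN_VISIBLE_BYTES remains.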

* Re: [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation
  2010-11-19  0:32                   ` Shaohui Zheng
@ 2010-11-21  0:48                     ` David Rientjes
  -1 siblings, 0 replies; 139+ messages in thread
From: David Rientjes @ 2010-11-21  0:48 UTC (permalink / raw)
  To: Shaohui Zheng
  Cc: Paul Mundt, Andrew Morton, linux-mm, linux-kernel, haicheng.li,
	ak, shaohui.zheng, Yinghai Lu, Haicheng Li

On Fri, 19 Nov 2010, Shaohui Zheng wrote:

> nr_node_ids is the number of possible nodes. When we do a regular memory
> online, the memory is onlined to a possible node, which is already counted
> in nr_node_ids.
> 
> If you increment nr_node_ids dynamically when a node comes online, it causes
> a lot of problems, because many data structures are sized according to
> nr_node_ids. That was our experience when debugging the emulator.
> 

I think what we'll end up wanting to do is something like this, which adds 
a numa=possible=<N> parameter for x86; this will add an additional N 
possible nodes to node_possible_map that we can use to online later.  It 
also adds a new /sys/devices/system/memory/add_node file which takes a 
typical "size@start" value to hot-add an emulated node.  For example, 
using "mem=2G numa=possible=1" on the command line and doing 
echo 128M@0x80000000 > /sys/devices/system/memory/add_node would hot-add 
a node of 128M.

Comments?
---
diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -33,6 +33,7 @@ s16 apicid_to_node[MAX_LOCAL_APIC] __cpuinitdata = {
 int numa_off __initdata;
 static unsigned long __initdata nodemap_addr;
 static unsigned long __initdata nodemap_size;
+static unsigned long __initdata numa_possible_nodes;
 
 /*
  * Map cpu index to node index
@@ -611,7 +612,7 @@ void __init initmem_init(unsigned long start_pfn, unsigned long last_pfn,
 
 #ifdef CONFIG_NUMA_EMU
 	if (cmdline && !numa_emulation(start_pfn, last_pfn, acpi, k8))
-		return;
+		goto out;
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
 #endif
@@ -619,14 +620,14 @@ void __init initmem_init(unsigned long start_pfn, unsigned long last_pfn,
 #ifdef CONFIG_ACPI_NUMA
 	if (!numa_off && acpi && !acpi_scan_nodes(start_pfn << PAGE_SHIFT,
 						  last_pfn << PAGE_SHIFT))
-		return;
+		goto out;
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
 #endif
 
 #ifdef CONFIG_K8_NUMA
 	if (!numa_off && k8 && !k8_scan_nodes())
-		return;
+		goto out;
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
 #endif
@@ -646,6 +647,15 @@ void __init initmem_init(unsigned long start_pfn, unsigned long last_pfn,
 		numa_set_node(i, 0);
 	memblock_x86_register_active_regions(0, start_pfn, last_pfn);
 	setup_node_bootmem(0, start_pfn << PAGE_SHIFT, last_pfn << PAGE_SHIFT);
+out: __maybe_unused
+	for (i = 0; i < numa_possible_nodes; i++) {
+		int nid;
+
+		nid = first_unset_node(node_possible_map);
+		if (nid == MAX_NUMNODES)
+			break;
+		node_set(nid, node_possible_map);
+	}
 }
 
 unsigned long __init numa_free_all_bootmem(void)
@@ -675,6 +685,8 @@ static __init int numa_setup(char *opt)
 	if (!strncmp(opt, "noacpi", 6))
 		acpi_numa = -1;
 #endif
+	if (!strncmp(opt, "possible=", 9))
+		numa_possible_nodes = simple_strtoul(opt + 9, NULL, 0);
 	return 0;
 }
 early_param("numa", numa_setup);
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -353,10 +353,44 @@ memory_probe_store(struct class *class, struct class_attribute *attr,
 }
 static CLASS_ATTR(probe, S_IWUSR, NULL, memory_probe_store);
 
+static ssize_t
+memory_add_node_store(struct class *class, struct class_attribute *attr,
+		      const char *buf, size_t count)
+{
+	nodemask_t mask;
+	u64 start, size;
+	char *p;
+	int nid;
+	int ret;
+
+	size = memparse(buf, &p);
+	if (size < (PAGES_PER_SECTION << PAGE_SHIFT))
+		return -EINVAL;
+	if (*p != '@')
+		return -EINVAL;
+
+	start = simple_strtoull(p + 1, NULL, 0);
+
+	nodes_andnot(mask, node_possible_map, node_online_map);
+	nid = first_node(mask);
+	if (nid == MAX_NUMNODES)
+		return -EINVAL;
+
+	ret = add_memory(nid, start, size);
+	return ret ? ret : count;
+}
+static CLASS_ATTR(add_node, S_IWUSR, NULL, memory_add_node_store);
+
 static int memory_probe_init(void)
 {
-	return sysfs_create_file(&memory_sysdev_class.kset.kobj,
+	int err;
+
+	err = sysfs_create_file(&memory_sysdev_class.kset.kobj,
 				&class_attr_probe.attr);
+	if (err)
+		return err;
+	return sysfs_create_file(&memory_sysdev_class.kset.kobj,
+				&class_attr_add_node.attr);
 }
 #else
 static inline int memory_probe_init(void)

* [patch 1/2] x86: add numa=possible command line option
  2010-11-21  0:48                     ` David Rientjes
@ 2010-11-21  2:28                       ` David Rientjes
  -1 siblings, 0 replies; 139+ messages in thread
From: David Rientjes @ 2010-11-21  2:28 UTC (permalink / raw)
  To: Ingo Molnar, H. Peter Anvin, Thomas Gleixner
  Cc: Greg Kroah-Hartman, Shaohui Zheng, Paul Mundt, Andrew Morton,
	Andi Kleen, Yinghai Lu, Haicheng Li, Randy Dunlap, linux-kernel,
	linux-mm, x86

Adds a numa=possible=<N> command line option to set an additional N nodes
as being possible for memory hotplug.  This set of possible nodes
controls nr_node_ids and the sizes of several dynamically allocated node
arrays.

This allows memory hotplug to create new nodes for newly added memory
rather than binding it to existing nodes.

The first use-case for this will be node hotplug emulation which will use
these possible nodes to create new nodes to test the memory hotplug
callbacks and surrounding memory hotplug code.
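
As a rough userspace illustration of the marking step (not the kernel code
itself), the loop below sets the lowest clear bits of a 64-bit mask. The
names add_possible_nodes() and MAX_NODES are hypothetical; the patch uses
first_unset_node() and node_set() on node_possible_map, bounded by
MAX_NUMNODES.

```c
#include <stdint.h>

#define MAX_NODES 64	/* illustrative limit, stands in for MAX_NUMNODES */

/*
 * Mark up to 'extra' additional node ids as possible by setting the
 * lowest clear bits of *possible.  Returns the number of bits actually
 * set, which is smaller than 'extra' once the mask is full.
 */
unsigned int add_possible_nodes(uint64_t *possible, unsigned int extra)
{
	unsigned int added = 0;
	unsigned int nid;

	for (nid = 0; nid < MAX_NODES && added < extra; nid++) {
		if (*possible & ((uint64_t)1 << nid))
			continue;	/* already possible */
		*possible |= (uint64_t)1 << nid;
		added++;
	}
	return added;
}
```

Starting from a mask with only node 0 possible, asking for three extra
nodes leaves nodes 0-3 possible.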

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/x86/x86_64/boot-options.txt |    4 ++++
 arch/x86/mm/numa_64.c                     |   18 +++++++++++++++---
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
--- a/Documentation/x86/x86_64/boot-options.txt
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -174,6 +174,10 @@ NUMA
 		If given as an integer, fills all system RAM with N fake nodes
 		interleaved over physical nodes.
 
+  numa=possible=<N>
+		Sets an additional N nodes as being possible for memory
+		hotplug.
+
 ACPI
 
   acpi=off	Don't enable ACPI
diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -33,6 +33,7 @@ s16 apicid_to_node[MAX_LOCAL_APIC] __cpuinitdata = {
 int numa_off __initdata;
 static unsigned long __initdata nodemap_addr;
 static unsigned long __initdata nodemap_size;
+static unsigned long __initdata numa_possible_nodes;
 
 /*
  * Map cpu index to node index
@@ -611,7 +612,7 @@ void __init initmem_init(unsigned long start_pfn, unsigned long last_pfn,
 
 #ifdef CONFIG_NUMA_EMU
 	if (cmdline && !numa_emulation(start_pfn, last_pfn, acpi, k8))
-		return;
+		goto out;
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
 #endif
@@ -619,14 +620,14 @@ void __init initmem_init(unsigned long start_pfn, unsigned long last_pfn,
 #ifdef CONFIG_ACPI_NUMA
 	if (!numa_off && acpi && !acpi_scan_nodes(start_pfn << PAGE_SHIFT,
 						  last_pfn << PAGE_SHIFT))
-		return;
+		goto out;
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
 #endif
 
 #ifdef CONFIG_K8_NUMA
 	if (!numa_off && k8 && !k8_scan_nodes())
-		return;
+		goto out;
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
 #endif
@@ -646,6 +647,15 @@ void __init initmem_init(unsigned long start_pfn, unsigned long last_pfn,
 		numa_set_node(i, 0);
 	memblock_x86_register_active_regions(0, start_pfn, last_pfn);
 	setup_node_bootmem(0, start_pfn << PAGE_SHIFT, last_pfn << PAGE_SHIFT);
+out: __maybe_unused
+	for (i = 0; i < numa_possible_nodes; i++) {
+		int nid;
+
+		nid = first_unset_node(node_possible_map);
+		if (nid == MAX_NUMNODES)
+			break;
+		node_set(nid, node_possible_map);
+	}
 }
 
 unsigned long __init numa_free_all_bootmem(void)
@@ -675,6 +685,8 @@ static __init int numa_setup(char *opt)
 	if (!strncmp(opt, "noacpi", 6))
 		acpi_numa = -1;
 #endif
+	if (!strncmp(opt, "possible=", 9))
+		numa_possible_nodes = simple_strtoul(opt + 9, NULL, 0);
 	return 0;
 }
 early_param("numa", numa_setup);

* [patch 2/2] mm: add node hotplug emulation
  2010-11-21  2:28                       ` David Rientjes
@ 2010-11-21  2:28                         ` David Rientjes
  -1 siblings, 0 replies; 139+ messages in thread
From: David Rientjes @ 2010-11-21  2:28 UTC (permalink / raw)
  To: Andrew Morton, Greg Kroah-Hartman
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Shaohui Zheng,
	Paul Mundt, Andi Kleen, Yinghai Lu, Haicheng Li, Randy Dunlap,
	linux-kernel, linux-mm, x86


Add an interface to allow new nodes to be added when performing memory
hot-add.  This provides a convenient interface to test memory hotplug
notifier callbacks and surrounding hotplug code when new nodes are
onlined without actually having a machine with such hotpluggable SRAT
entries.

This adds a new interface at /sys/devices/system/memory/add_node that
behaves in a similar way to the memory hot-add "probe" interface.  Its
format is size@start, where "size" is the size of the new node to be
added and "start" is the physical address of the new memory.
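
The size@start format can be illustrated with the userspace sketch below.
It is an approximation, not the kernel parser: parse_size_at_start() is a
hypothetical name, strtoull() stands in for memparse() and
simple_strtoull(), and only the K, M and G suffixes are handled.  Unlike
the patch, it does not enforce the minimum size of one memory section, and
it additionally rejects an empty start address.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a "size@start" string such as "128M@0x80000000".
 * Returns 0 on success and -EINVAL on malformed input.
 */
int parse_size_at_start(const char *buf, uint64_t *size, uint64_t *start)
{
	const char *s;
	char *end;
	uint64_t val;

	val = strtoull(buf, &end, 0);
	if (end == buf)
		return -EINVAL;		/* no digits before the suffix */

	switch (*end) {
	case 'G': case 'g':
		val <<= 10;
		/* fall through */
	case 'M': case 'm':
		val <<= 10;
		/* fall through */
	case 'K': case 'k':
		val <<= 10;
		end++;
		break;
	default:
		break;
	}

	if (*end != '@')
		return -EINVAL;		/* separator is mandatory */

	s = end + 1;
	*start = strtoull(s, &end, 0);
	if (end == s)
		return -EINVAL;		/* no digits after '@' */

	*size = val;
	return 0;
}
```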

The new node id is a currently offline, but possible, node.  The bit must
be set in node_possible_map so that nr_node_ids is sized appropriately.
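
The node selection described above amounts to "possible and not online,
lowest id first".  A simplified sketch using a 64-bit mask is shown here;
first_offline_possible_node() and MAX_NODES are hypothetical names, whereas
the patch uses nodes_andnot() and first_node() on the kernel's nodemask_t.

```c
#include <stdint.h>

#define MAX_NODES 64	/* illustrative limit, stands in for MAX_NUMNODES */

/*
 * Return the lowest node id that is possible but not online, or
 * MAX_NODES if every possible node is already online.
 */
unsigned int first_offline_possible_node(uint64_t possible, uint64_t online)
{
	uint64_t candidates = possible & ~online;
	unsigned int nid;

	for (nid = 0; nid < MAX_NODES; nid++)
		if (candidates & ((uint64_t)1 << nid))
			return nid;
	return MAX_NODES;
}
```

When the function returns MAX_NODES there is no node left to hot-add, which
corresponds to the -EINVAL path in the patch.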

For emulation on x86, for example, it would be possible to set aside
memory for hotplugged nodes (say, anything above 2G) and to add an
additional three nodes as being possible on boot with

	mem=2G numa=possible=3

and then creating a new 128M node at runtime:

	# echo 128M@0x80000000 > /sys/devices/system/memory/add_node
	On node 1 totalpages: 0
	init_memory_mapping: 0000000080000000-0000000088000000
	 0080000000 - 0088000000 page 2M

Once the new node has been added, its memory can be onlined.  If this
memory represents memory section 16, for example:

	# echo online > /sys/devices/system/memory/memory16/state
	Built 2 zonelists in Node order, mobility grouping on.  Total pages: 514846
	Policy zone: Normal

 [ The memory section(s) mapped to a particular node are visible via
   /sys/devices/system/node/node1, in this example. ]

The new node is now hotplugged and ready for testing.
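
The memory section number used in the example can be derived from the
physical address.  The sketch below assumes 128M sections
(SECTION_SIZE_BITS of 27, as on x86_64) and that each memoryN directory in
sysfs corresponds to one section; both are assumptions about the running
kernel, and phys_to_section_nr() is a hypothetical helper name.

```c
#include <stdint.h>

/* Assumed section size: 2^27 bytes (128M), as on x86_64. */
#define SECTION_SIZE_BITS 27

/* Index N of the memoryN directory that covers a physical address. */
uint64_t phys_to_section_nr(uint64_t phys_addr)
{
	return phys_addr >> SECTION_SIZE_BITS;
}
```

Under these assumptions, 0x80000000 is 2^31, so the section number is
2^31 / 2^27 = 16, and a 128M node spans exactly that one section.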

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/memory-hotplug.txt |   24 ++++++++++++++++++++++++
 drivers/base/memory.c            |   36 +++++++++++++++++++++++++++++++++++-
 2 files changed, 59 insertions(+), 1 deletions(-)

diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -18,6 +18,7 @@ be changed often.
 4. Physical memory hot-add phase
   4.1 Hardware(Firmware) Support
   4.2 Notify memory hot-add event by hand
+  4.3 Node hotplug emulation
 5. Logical Memory hot-add phase
   5.1. State of memory
   5.2. How to online memory
@@ -215,6 +216,29 @@ current implementation). You'll have to online memory by yourself.
 Please see "How to online memory" in this text.
 
 
+4.3 Node hotplug emulation
+------------
+It is possible to test node hotplug by assigning the newly added memory to a
+new node id when using a different interface with a similar behavior to
+"probe" described in section 4.2.  If a node id is possible (there are bits
+in /sys/devices/system/node/possible that are not online), then it may be
+used to emulate a newly added node as the result of memory hotplug by using
+the "add_node" interface.
+
+The add_node interface is located at
+/sys/devices/system/memory/add_node
+
+You can create a new node of a specified size starting at the physical
+address of new memory by
+
+% echo size@start_address_of_new_memory > /sys/devices/system/memory/add_node
+
+Where "size" can be represented in megabytes or gigabytes (for example,
+"128M" or "1G").  The minimum size is that of a memory section.
+
+Once the new node has been added, it is possible to online the memory by
+toggling the "state" of its memory section(s) as described in section 5.1.
+
 
 ------------------------------
 5. Logical Memory hot-add phase
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -353,10 +353,44 @@ memory_probe_store(struct class *class, struct class_attribute *attr,
 }
 static CLASS_ATTR(probe, S_IWUSR, NULL, memory_probe_store);
 
+static ssize_t
+memory_add_node_store(struct class *class, struct class_attribute *attr,
+		      const char *buf, size_t count)
+{
+	nodemask_t mask;
+	u64 start, size;
+	char *p;
+	int nid;
+	int ret;
+
+	size = memparse(buf, &p);
+	if (size < (PAGES_PER_SECTION << PAGE_SHIFT))
+		return -EINVAL;
+	if (*p != '@')
+		return -EINVAL;
+
+	start = simple_strtoull(p + 1, NULL, 0);
+
+	nodes_andnot(mask, node_possible_map, node_online_map);
+	nid = first_node(mask);
+	if (nid == MAX_NUMNODES)
+		return -EINVAL;
+
+	ret = add_memory(nid, start, size);
+	return ret ? ret : count;
+}
+static CLASS_ATTR(add_node, S_IWUSR, NULL, memory_add_node_store);
+
 static int memory_probe_init(void)
 {
-	return sysfs_create_file(&memory_sysdev_class.kset.kobj,
+	int err;
+
+	err = sysfs_create_file(&memory_sysdev_class.kset.kobj,
 				&class_attr_probe.attr);
+	if (err)
+		return err;
+	return sysfs_create_file(&memory_sysdev_class.kset.kobj,
+				&class_attr_add_node.attr);
 }
 #else
 static inline int memory_probe_init(void)

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 2/2] mm: add node hotplug emulation
@ 2010-11-21  2:28                         ` David Rientjes
  0 siblings, 0 replies; 139+ messages in thread
From: David Rientjes @ 2010-11-21  2:28 UTC (permalink / raw)
  To: Andrew Morton, Greg Kroah-Hartman
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Shaohui Zheng,
	Paul Mundt, Andi Kleen, Yinghai Lu, Haicheng Li, Randy Dunlap,
	linux-kernel, linux-mm, x86


Add an interface to allow new nodes to be added when performing memory
hot-add.  This provides a convenient interface to test memory hotplug
notifier callbacks and surrounding hotplug code when new nodes are
onlined without actually having a machine with such hotpluggable SRAT
entries.

This adds a new interface at /sys/devices/system/memory/add_node that
behaves in a similar way to the memory hot-add "probe" interface.  Its
format is size@start, where "size" is the size of the new node to be
added and "start" is the physical address of the new memory.

The new node id is a currently offline, but possible, node.  The bit must
be set in node_possible_map so that nr_node_ids is sized appropriately.

For emulation on x86, for example, it would be possible to set aside
memory for hotplugged nodes (say, anything above 2G) and to add an
additional three nodes as being possible on boot with

	mem=2G numa=possible=3

and then creating a new 128M node at runtime:

	# echo 128M@0x80000000 > /sys/devices/system/memory/add_node
	On node 1 totalpages: 0
	init_memory_mapping: 0000000080000000-0000000088000000
	 0080000000 - 0088000000 page 2M

Once the new node has been added, its memory can be onlined.  If this
memory represents memory section 16, for example:

	# echo online > /sys/devices/system/memory/memory16/state
	Built 2 zonelists in Node order, mobility grouping on.  Total pages: 514846
	Policy zone: Normal

 [ The memory section(s) mapped to a particular node are visible via
   /sys/devices/system/node/node1, in this example. ]

The new node is now hotplugged and ready for testing.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/memory-hotplug.txt |   24 ++++++++++++++++++++++++
 drivers/base/memory.c            |   36 +++++++++++++++++++++++++++++++++++-
 2 files changed, 59 insertions(+), 1 deletions(-)

diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -18,6 +18,7 @@ be changed often.
 4. Physical memory hot-add phase
   4.1 Hardware(Firmware) Support
   4.2 Notify memory hot-add event by hand
+  4.3 Node hotplug emulation
 5. Logical Memory hot-add phase
   5.1. State of memory
   5.2. How to online memory
@@ -215,6 +216,29 @@ current implementation). You'll have to online memory by yourself.
 Please see "How to online memory" in this text.
 
 
+4.3 Node hotplug emulation
+------------
+It is possible to test node hotplug by assigning the newly added memory to a
+new node id when using a different interface with a similar behavior to
+"probe" described in section 4.2.  If a node id is possible (there are bits
+in /sys/devices/system/memory/possible that are not online), then it may be
+used to emulate a newly added node as the result of memory hotplug by using
+the "add_node" interface.
+
+The add_node interface is located at
+/sys/devices/system/memory/add_node
+
+You can create a new node of a specified size starting at the physical
+address of new memory by
+
+% echo size@start_address_of_new_memory > /sys/devices/system/memory/add_node
+
+Where "size" can be represented in megabytes or gigabytes (for example,
+"128M" or "1G").  The minimum size is that of a memory section.
+
+Once the new node has been added, it is possible to online the memory by
+toggling the "state" of its memory section(s) as described in section 5.1.
+
 
 ------------------------------
 5. Logical Memory hot-add phase
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -353,10 +353,44 @@ memory_probe_store(struct class *class, struct class_attribute *attr,
 }
 static CLASS_ATTR(probe, S_IWUSR, NULL, memory_probe_store);
 
+static ssize_t
+memory_add_node_store(struct class *class, struct class_attribute *attr,
+		      const char *buf, size_t count)
+{
+	nodemask_t mask;
+	u64 start, size;
+	char *p;
+	int nid;
+	int ret;
+
+	size = memparse(buf, &p);
+	if (size < (PAGES_PER_SECTION << PAGE_SHIFT))
+		return -EINVAL;
+	if (*p != '@')
+		return -EINVAL;
+
+	start = simple_strtoull(p + 1, NULL, 0);
+
+	nodes_andnot(mask, node_possible_map, node_online_map);
+	nid = first_node(mask);
+	if (nid == MAX_NUMNODES)
+		return -EINVAL;
+
+	ret = add_memory(nid, start, size);
+	return ret ? ret : count;
+}
+static CLASS_ATTR(add_node, S_IWUSR, NULL, memory_add_node_store);
+
 static int memory_probe_init(void)
 {
-	return sysfs_create_file(&memory_sysdev_class.kset.kobj,
+	int err;
+
+	err = sysfs_create_file(&memory_sysdev_class.kset.kobj,
 				&class_attr_probe.attr);
+	if (err)
+		return err;
+	return sysfs_create_file(&memory_sysdev_class.kset.kobj,
+				&class_attr_add_node.attr);
 }
 #else
 static inline int memory_probe_init(void)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 139+ messages in thread
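
The "size@start" parsing done by memory_add_node_store() in the patch above can be sketched in userspace roughly as follows. This is only an illustration, not the kernel code: parse_size() is a hypothetical stand-in for the kernel's memparse(), parse_add_node() is a made-up name, and the 128 MiB section size is an assumption (the real minimum is PAGES_PER_SECTION << PAGE_SHIFT and depends on the architecture).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed memory section size (128 MiB); the thread does not state it. */
#define SECTION_SIZE (128ULL << 20)

/* Hypothetical stand-in for memparse(): a number plus optional K/M/G suffix. */
uint64_t parse_size(const char *s, char **end)
{
	uint64_t v = strtoull(s, end, 0);

	switch (**end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		(*end)++;
		break;
	default:
		break;
	}
	return v;
}

/*
 * Parse "size@start" in the same order as the store handler:
 * size first, then the minimum-size check, then the '@' separator.
 * Returns 0 on success and -EINVAL on malformed input.
 */
int parse_add_node(const char *buf, uint64_t *start, uint64_t *size)
{
	char *p;

	assert(buf && start && size);

	*size = parse_size(buf, &p);
	if (*size < SECTION_SIZE)
		return -EINVAL;
	if (*p != '@')
		return -EINVAL;

	*start = strtoull(p + 1, NULL, 0);
	return 0;
}
```

With these assumptions, "128M@0x80000000" yields a size of 128 MiB at physical address 0x80000000, while "64M@0x80000000" and a bare "1G" are both rejected with -EINVAL.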

* Re: [1/8,v3] NUMA Hotplug Emulator: add function to hide memory region via e820 table.
  2010-11-21  0:45             ` David Rientjes
@ 2010-11-21 14:00               ` Américo Wang
  -1 siblings, 0 replies; 139+ messages in thread
From: Américo Wang @ 2010-11-21 14:00 UTC (permalink / raw)
  To: David Rientjes
  Cc: Shaohui Zheng, Andrew Morton, linux-mm, linux-kernel,
	haicheng.li, lethal, Andi Kleen, Yinghai Lu, Haicheng Li

On Sat, Nov 20, 2010 at 04:45:06PM -0800, David Rientjes wrote:
>On Fri, 19 Nov 2010, Shaohui Zheng wrote:
>
>> > > > > Index: linux-hpe4/arch/x86/kernel/e820.c
>> > > > > ===================================================================
>> > > > > --- linux-hpe4.orig/arch/x86/kernel/e820.c	2010-11-15 17:13:02.483461667 +0800
>> > > > > +++ linux-hpe4/arch/x86/kernel/e820.c	2010-11-15 17:13:07.083461581 +0800
>> > > > > @@ -971,6 +971,7 @@
>> > > > >  }
>> > > > >  
>> > > > >  static int userdef __initdata;
>> > > > > +static u64 max_mem_size __initdata = ULLONG_MAX;
>> > > > >  
>> > > > >  /* "mem=nopentium" disables the 4MB page tables. */
>> > > > >  static int __init parse_memopt(char *p)
>> > > > > @@ -989,12 +990,28 @@
>> > > > >  
>> > > > >  	userdef = 1;
>> > > > >  	mem_size = memparse(p, &p);
>> > > > > -	e820_remove_range(mem_size, ULLONG_MAX - mem_size, E820_RAM, 1);
>> > > > > +	e820_remove_range(mem_size, max_mem_size - mem_size, E820_RAM, 1);
>> > > > > +	max_mem_size = mem_size;
>> > > > >  
>> > > > >  	return 0;
>> > > > >  }
>> > > > 
>> > > > This needs memmap= support as well, right?
>> > > we did not test combining the memmap and numa=hide parameters,
>> > > but I think the result should be similar to mem=XX: they both remove a
>> > > memory region from the e820 table.
>> > > 
>> > 
>> > You've modified the parser for mem= but not memmap= so the change needs 
>> > additional support for the latter.
>> > 
>> 
>> the parser for mem= is not modified; the changed parser is numa=, where I add
>> an additional option, numa=hide=.
>> 
>
>The above hunk is modifying the x86 parser for the mem= parameter.
>

That is fine as long as "mem=" is parsed before "numa=".

I think "mem=" should always be parsed before "numa=", no matter what
order they are specified in on the command line, since we need to know
how much total memory we have first.

Thanks.

^ permalink raw reply	[flat|nested] 139+ messages in thread
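
The ordering concern can be illustrated with a small userspace model of the quoted parse_memopt() hunk. This is a sketch under assumptions: limit_memory() and struct removed_range are hypothetical names, the function only reports the range that would be passed to e820_remove_range(), and the assert() is a guard added for the sketch (the quoted hunk has no such check).

```c
#include <assert.h>
#include <stdint.h>

/* Range that would be handed to e820_remove_range() (hypothetical type). */
struct removed_range {
	uint64_t start;
	uint64_t size;
};

/* Mirrors "static u64 max_mem_size __initdata = ULLONG_MAX" in the hunk. */
uint64_t max_mem_size = UINT64_MAX;

/*
 * Model of the patched tail of parse_memopt(): remove everything from
 * mem_size up to the previous limit, then remember the new limit.
 */
struct removed_range limit_memory(uint64_t mem_size)
{
	struct removed_range r;

	/* Guard added for this sketch: limits may only shrink. */
	assert(mem_size <= max_mem_size);

	r.start = mem_size;
	r.size = max_mem_size - mem_size;
	max_mem_size = mem_size;
	return r;
}
```

In this model the first call (for example for "mem=2G") removes everything above 2G, and a later call with a smaller limit removes only the difference between the old and new limits. A call with a larger limit than the saved one would make the subtraction wrap, which is why the order of the calls matters.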

* Re: [patch 1/2] x86: add numa=possible command line option
  2010-11-21  2:28                       ` David Rientjes
@ 2010-11-21 14:26                         ` Américo Wang
  -1 siblings, 0 replies; 139+ messages in thread
From: Américo Wang @ 2010-11-21 14:26 UTC (permalink / raw)
  To: David Rientjes
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Greg Kroah-Hartman,
	Shaohui Zheng, Paul Mundt, Andrew Morton, Andi Kleen, Yinghai Lu,
	Haicheng Li, Randy Dunlap, linux-kernel, linux-mm, x86


Hi, David

On Sat, Nov 20, 2010 at 06:28:31PM -0800, David Rientjes wrote:
>Adds a numa=possible=<N> command line option to set an additional N nodes
>as being possible for memory hotplug.  This set of possible nodes
>controls nr_node_ids and the sizes of several dynamically allocated node
>arrays.
>
>This allows memory hotplug to create new nodes for newly added memory
>rather than binding it to existing nodes.
>
>The first use-case for this will be node hotplug emulation which will use
>these possible nodes to create new nodes to test the memory hotplug
>callbacks and surrounding memory hotplug code.
>


I am not sure how much value there is in making this dynamic.
For CPUs, we do this at compile time, i.e. NR_CPUS,
so how about NR_NODES?

Also, numa=possible= is not as clear as numa=max=, for me at least.

Thanks.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [5/8,v3] NUMA Hotplug Emulator: support cpu probe/release in x86
  2010-11-17  2:08   ` shaohui.zheng
@ 2010-11-21 14:45     ` Américo Wang
  -1 siblings, 0 replies; 139+ messages in thread
From: Américo Wang @ 2010-11-21 14:45 UTC (permalink / raw)
  To: shaohui.zheng
  Cc: akpm, linux-mm, linux-kernel, haicheng.li, lethal, ak,
	shaohui.zheng, Ingo Molnar, Len Brown, Yinghai Lu, Haicheng Li

On Wed, Nov 17, 2010 at 10:08:04AM +0800, shaohui.zheng@intel.com wrote:
>From: Shaohui Zheng <shaohui.zheng@intel.com>
>
>Add a cpu probe/release interface under sysfs for x86. Users can use this
>interface to emulate the cpu hot-add process for cpu hotplug testing.
>Add a kernel option, CONFIG_ARCH_CPU_PROBE_RELEASE, for this feature.
>
>This interface provides a mechanism to emulate cpu hotplug in software,
>which makes cpu hotplug automation and stress testing possible.
>

Huh? We already have CPU online/offline...

Can you describe more about the difference?

Thanks.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [8/8,v3] NUMA Hotplug Emulator: documentation
  2010-11-17  2:08   ` shaohui.zheng
@ 2010-11-21 15:03     ` Américo Wang
  -1 siblings, 0 replies; 139+ messages in thread
From: Américo Wang @ 2010-11-21 15:03 UTC (permalink / raw)
  To: shaohui.zheng
  Cc: akpm, linux-mm, linux-kernel, haicheng.li, lethal, ak,
	shaohui.zheng, Haicheng Li

On Wed, Nov 17, 2010 at 10:08:07AM +0800, shaohui.zheng@intel.com wrote:
>+2) CPU hotplug emulation:
>+
>+The emulator reserves CPUs through a grub (boot) parameter. The reserved CPUs
>+can be hot-added and hot-removed in software, which emulates the process of
>+physical cpu hotplug.
>+
>+When hotplugging a CPU with the emulator, we use a logical CPU to emulate the
>+CPU socket hotplug process. On CPUs that support SMT, some logical CPUs share
>+a socket, but with the emulator they may end up in different NUMA nodes. We
>+therefore put each logical CPU into a fake CPU socket and assign it a unique
>+phys_proc_id. Each fake socket holds exactly one logical CPU.
>+
>+ - to hide CPUs
>+	- Use boot option "maxcpus=N" to hide CPUs;
>+	  N is the number of CPUs to initialize
>+	- Use boot option "cpu_hpe=on" to enable cpu hotplug emulation;
>+	  when cpu_hpe is enabled, the remaining CPUs are not initialized
>+
>+ - to hot-add CPU to node
>+	$ echo nid > cpu/probe
>+
>+ - to hot-remove CPU
>+	$ echo nid > cpu/release
>+

Again, we already have software CPU hotplug,
i.e. /sys/devices/system/cpu/cpuX/online.

You need to pick another name for this.

From your documentation above, it looks like you are trying
to move one CPU between nodes?

>+	cpu_hpe=on/off
>+		Enable/disable software cpu hotplug emulation. When cpu_hpe=on,
>+		sysfs provides a probe/release interface to hot-add/remove cpus dynamically.
>+		This option is disabled by default.
>+			

Why not just a CONFIG? IOW, why do we need to make another boot
parameter for this?

^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation
  2010-11-21  0:48                     ` David Rientjes
@ 2010-11-21 15:14                       ` Li, Haicheng
  -1 siblings, 0 replies; 139+ messages in thread
From: Li, Haicheng @ 2010-11-21 15:14 UTC (permalink / raw)
  To: David Rientjes, Zheng, Shaohui
  Cc: Paul Mundt, Andrew Morton, linux-mm, linux-kernel, haicheng.li,
	ak, shaohui.zheng, Yinghai Lu

David Rientjes wrote:
> On Fri, 19 Nov 2010, Shaohui Zheng wrote:
> 
>> nr_node_ids is the number of possible nodes. When we do a regular memory
>> online, the memory is onlined to a possible node that is already counted
>> in nr_node_ids.
>>
>> If you increment nr_node_ids dynamically when a node comes online, it
>> causes a lot of problems. Many data structures are sized according to
>> nr_node_ids. That was our experience when debugging the emulator.
>> 
> 
> I think what we'll end up wanting to do is something like this, which
> adds a numa=possible=<N> parameter for x86; this will add an additional N
> possible nodes to node_possible_map that we can use to online later.  It
> also adds a new /sys/devices/system/memory/add_node file which takes a
> typical "size@start" value to hot-add an emulated node.  For example,
> using "mem=2G numa=possible=1" on the command line and doing
> "echo 128M@0x80000000 > /sys/devices/system/memory/add_node" would
> hot-add a node of 128M.
> 
> Comments?

Sorry for the late response; I have been on a business trip recently.

David, your original concern was about power and flexibility. I'm sure our implementation can better meet such requirements.

IMHO, I don't see any additional power or flexibility in your patch compared to our original implementation. It just makes things more complex and messy.

Why not use "numa=hide=N*size" as originally implemented?
- Later you just need to online the node whenever you want. This naturally and exactly emulates the behavior that current HW provides.
- N is the number of possible (hidden) nodes. We can use 128M as the default size for each hidden node if the user doesn't specify a size.
- If the user wants more memory for the hidden nodes, he just needs to specify the "size".
- Besides, the user can also use "mem=" to hide more memory and later use the mem-add interface to freely attach more memory to a hidden node at runtime.

Your patch introduces an additional dependency on "mem=", but ours is simple and flexibly compatible with "mem=" and "numa=emu".


-haicheng

^ permalink raw reply	[flat|nested] 139+ messages in thread
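
To make the "numa=hide=N*size" format discussed above concrete, here is a minimal userspace sketch of how such an argument could be parsed. It is an assumption-laden illustration, not the emulator's parser (which is not shown in this part of the thread): parse_numa_hide() is a hypothetical name, the 128M default comes from the message above, and only uppercase K/M/G suffixes are handled.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Default per-node size suggested in the message above. */
#define DEFAULT_HIDDEN_NODE_SIZE (128ULL << 20)

/*
 * Parse "N" or "N*size" (the part after "numa=hide=").
 * Returns 0 on success and -EINVAL on malformed input.
 */
int parse_numa_hide(const char *arg, unsigned int *nodes, uint64_t *size)
{
	char *p;
	unsigned long n;
	uint64_t v;

	assert(arg && nodes && size);

	n = strtoul(arg, &p, 10);
	if (p == arg || n == 0)
		return -EINVAL;

	if (*p == '\0') {
		/* No "*size" part: fall back to the default size. */
		*nodes = (unsigned int)n;
		*size = DEFAULT_HIDDEN_NODE_SIZE;
		return 0;
	}
	if (*p != '*')
		return -EINVAL;

	arg = p + 1;
	v = strtoull(arg, &p, 0);
	if (p == arg)
		return -EINVAL;

	switch (*p) {
	case 'G':
		v <<= 10;
		/* fall through */
	case 'M':
		v <<= 10;
		/* fall through */
	case 'K':
		v <<= 10;
		p++;
		break;
	default:
		break;
	}
	if (*p != '\0' || v == 0)
		return -EINVAL;

	*nodes = (unsigned int)n;
	*size = v;
	return 0;
}
```

Under these assumptions, "2*256M" describes two hidden nodes of 256 MiB each, and a bare "3" describes three hidden nodes of the default 128 MiB.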

* RE: [8/8,v3] NUMA Hotplug Emulator: documentation
  2010-11-21 15:03     ` Américo Wang
@ 2010-11-21 15:16       ` Li, Haicheng
  -1 siblings, 0 replies; 139+ messages in thread
From: Li, Haicheng @ 2010-11-21 15:16 UTC (permalink / raw)
  To: Américo Wang, Zheng, Shaohui
  Cc: akpm, linux-mm, linux-kernel, haicheng.li, lethal, ak, shaohui.zheng

Américo Wang wrote:
> On Wed, Nov 17, 2010 at 10:08:07AM +0800, shaohui.zheng@intel.com wrote:
>> +2) CPU hotplug emulation:
>> +
>> +The emulator reserves CPUs through a grub (boot) parameter. The reserved CPUs
>> +can be hot-added and hot-removed in software, which emulates the process of
>> +physical cpu hotplug.
>> +
>> +When hotplugging a CPU with the emulator, we use a logical CPU to emulate the
>> +CPU socket hotplug process. On CPUs that support SMT, some logical CPUs share
>> +a socket, but with the emulator they may end up in different NUMA nodes. We
>> +therefore put each logical CPU into a fake CPU socket and assign it a unique
>> +phys_proc_id. Each fake socket holds exactly one logical CPU.
>> +
>> + - to hide CPUs
>> +	- Use boot option "maxcpus=N" to hide CPUs;
>> +	  N is the number of CPUs to initialize
>> +	- Use boot option "cpu_hpe=on" to enable cpu hotplug emulation;
>> +	  when cpu_hpe is enabled, the remaining CPUs are not initialized
>> +
>> + - to hot-add CPU to node
>> +	$ echo nid > cpu/probe
>> +
>> + - to hot-remove CPU
>> +	$ echo nid > cpu/release
>> +
> 
> Again, we already have software CPU hotplug,
> i.e. /sys/devices/system/cpu/cpuX/online.

The online file here only brings a logical CPU online. What we are trying to achieve here is to emulate physical CPU hot-add.


-haicheng

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 2/2] mm: add node hotplug emulation
  2010-11-21  2:28                         ` David Rientjes
@ 2010-11-21 17:34                           ` Greg KH
  -1 siblings, 0 replies; 139+ messages in thread
From: Greg KH @ 2010-11-21 17:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Shaohui Zheng, Paul Mundt, Andi Kleen, Yinghai Lu, Haicheng Li,
	Randy Dunlap, linux-kernel, linux-mm, x86

On Sat, Nov 20, 2010 at 06:28:38PM -0800, David Rientjes wrote:
> 
> Add an interface to allow new nodes to be added when performing memory
> hot-add.  This provides a convenient interface to test memory hotplug
> notifier callbacks and surrounding hotplug code when new nodes are
> onlined without actually having a machine with such hotpluggable SRAT
> entries.
> 
> This adds a new interface at /sys/devices/system/memory/add_node that
> behaves in a similar way to the memory hot-add "probe" interface.  Its
> format is size@start, where "size" is the size of the new node to be
> added and "start" is the physical address of the new memory.

Ick, we are trying to clean up the system devices right now which would
prevent this type of tree being added.

> The new node id is a currently offline, but possible, node.  The bit must
> be set in node_possible_map so that nr_node_ids is sized appropriately.
> 
> For emulation on x86, for example, it would be possible to set aside
> memory for hotplugged nodes (say, anything above 2G) and to add an
> additional three nodes as being possible on boot with
> 
> 	mem=2G numa=possible=3
> 
> and then creating a new 128M node at runtime:
> 
> 	# echo 128M@0x80000000 > /sys/devices/system/memory/add_node
> 	On node 1 totalpages: 0
> 	init_memory_mapping: 0000000080000000-0000000088000000
> 	 0080000000 - 0088000000 page 2M
> 
> Once the new node has been added, its memory can be onlined.  If this
> memory represents memory section 16, for example:
> 
> 	# echo online > /sys/devices/system/memory/memory16/state
> 	Built 2 zonelists in Node order, mobility grouping on.  Total pages: 514846
> 	Policy zone: Normal
> 
>  [ The memory section(s) mapped to a particular node are visible via
>    /sys/devices/system/node/node1, in this example. ]
> 
> The new node is now hotplugged and ready for testing.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  Documentation/memory-hotplug.txt |   24 ++++++++++++++++++++++++
>  drivers/base/memory.c            |   36 +++++++++++++++++++++++++++++++++++-
>  2 files changed, 59 insertions(+), 1 deletions(-)

When adding sysfs files you need to document them in Documentation/ABI
instead.

But as this is a debugging thing, why not just put it in debugfs
instead?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [1/8,v3] NUMA Hotplug Emulator: add function to hide memory region via e820 table.
  2010-11-21 14:00               ` Américo Wang
@ 2010-11-21 21:33                 ` David Rientjes
  -1 siblings, 0 replies; 139+ messages in thread
From: David Rientjes @ 2010-11-21 21:33 UTC (permalink / raw)
  To: Américo Wang
  Cc: Shaohui Zheng, Andrew Morton, linux-mm, linux-kernel,
	haicheng.li, lethal, Andi Kleen, Yinghai Lu, Haicheng Li

On Sun, 21 Nov 2010, Américo Wang wrote:

> >> > > > > Index: linux-hpe4/arch/x86/kernel/e820.c
> >> > > > > ===================================================================
> >> > > > > --- linux-hpe4.orig/arch/x86/kernel/e820.c	2010-11-15 17:13:02.483461667 +0800
> >> > > > > +++ linux-hpe4/arch/x86/kernel/e820.c	2010-11-15 17:13:07.083461581 +0800
> >> > > > > @@ -971,6 +971,7 @@
> >> > > > >  }
> >> > > > >  
> >> > > > >  static int userdef __initdata;
> >> > > > > +static u64 max_mem_size __initdata = ULLONG_MAX;
> >> > > > >  
> >> > > > >  /* "mem=nopentium" disables the 4MB page tables. */
> >> > > > >  static int __init parse_memopt(char *p)
> >> > > > > @@ -989,12 +990,28 @@
> >> > > > >  
> >> > > > >  	userdef = 1;
> >> > > > >  	mem_size = memparse(p, &p);
> >> > > > > -	e820_remove_range(mem_size, ULLONG_MAX - mem_size, E820_RAM, 1);
> >> > > > > +	e820_remove_range(mem_size, max_mem_size - mem_size, E820_RAM, 1);
> >> > > > > +	max_mem_size = mem_size;
> >> > > > >  
> >> > > > >  	return 0;
> >> > > > >  }
> >> > > > 
> >> > > > This needs memmap= support as well, right?
> >> > > we did not test the combination of memmap and the numa=hide parameter,
> >> > > but I think the result should be similar to mem=XX; they both remove a memory
> >> > > region from the e820 table.
> >> > > 
> >> > 
> >> > You've modified the parser for mem= but not memmap= so the change needs 
> >> > additional support for the latter.
> >> > 
> >> 
> >> the parser for mem= is not modified; the changed parser is numa=, where I add an
> >> additional option, numa=hide=.
> >> 
> >
> >The above hunk is modifying the x86 parser for the mem= parameter.
> >
> 
> That is fine as long as "mem=" is parsed before "numa=".
> 

If you'll read the discussion, I had no problem with modifying the mem= 
parser.  I merely suggested that Shaohui modify the memmap= parser in the 
same way to save max_mem_size so users can use it as well for the hidden 
nodes, which are now obsolete.  Apparently that was misunderstood by both 
of you although it looks pretty clear above, I dunno.
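
A minimal userspace sketch of the arithmetic in the hunk quoted above,
assuming only what the hunk shows.  The names removed_range and
model_parse_memopt are hypothetical, and e820_remove_range() itself is not
modeled; the sketch only computes the (start, size) pair that the modified
parser would pass to it.

```c
#include <stdint.h>

/* Hypothetical userspace model of the quoted hunk; this is not kernel code. */
struct removed_range {
	uint64_t start;	/* first byte removed from the map */
	uint64_t size;	/* number of bytes removed */
};

/* Stands in for the patch's "static u64 max_mem_size = ULLONG_MAX". */
static uint64_t max_mem_size = UINT64_MAX;

/*
 * Mirrors the two changed lines of parse_memopt(): the removed range
 * starts at the new limit and extends up to the previously saved limit,
 * and the new limit is then saved for the next caller.
 */
static struct removed_range model_parse_memopt(uint64_t mem_size)
{
	struct removed_range r = { mem_size, max_mem_size - mem_size };

	max_mem_size = mem_size;
	return r;
}
```

With a first limit of 2G and a second limit of 1G, the second call covers
only the range between 1G and 2G.  If a later limit were larger than the
saved one, the unsigned subtraction would wrap, so a parser reusing
max_mem_size would presumably need to guard against that case.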

^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation
  2010-11-21 15:14                       ` Li, Haicheng
@ 2010-11-21 21:42                         ` David Rientjes
  -1 siblings, 0 replies; 139+ messages in thread
From: David Rientjes @ 2010-11-21 21:42 UTC (permalink / raw)
  To: Li, Haicheng
  Cc: Zheng, Shaohui, Paul Mundt, Andrew Morton, linux-mm,
	linux-kernel, haicheng.li, ak, shaohui.zheng, Yinghai Lu

On Sun, 21 Nov 2010, Li, Haicheng wrote:

> > I think what we'll end up wanting to do is something like this, which
> > adds a numa=possible=<N> parameter for x86; this will add an additional
> > N possible nodes to node_possible_map that we can use to online later.
> > It also adds a new /sys/devices/system/memory/add_node file which takes
> > a typical "size@start" value to hot-add an emulated node.  For example,
> > using "mem=2G numa=possible=1" on the command line and doing
> > echo 128M@0x80000000 > /sys/devices/system/memory/add_node would
> > hot-add a node of 128M.
> > 
> > Comments?
> 
> Sorry for the late response; I have been on a business trip recently.
> 
> David, your original concern was just about power/flexibility. I'm 
> sure our implementation can better meet such requirements.
> 

Not with hacky hidden nodes or being unnecessarily tied to e820, it can't.

> IMHO, I don't see any added power/flexibility in your patch compared to 
> our original implementation. You just make things more complex and messy.
> Why not use "numa=hide=N*size" as originally implemented?

Hidden nodes are a hack and completely unnecessary for node hotplug 
emulation; there's no need to have additional nodemasks or node states 
throughout the kernel.  They also require that you define the node sizes 
at boot; mine allows you to hotplug multiple node sizes of your choice at 
runtime.

> - later you just need to online the node once you want. And it 
> naturally/exactly emulates the behavior that current HW provides.

My proposal allows you to hotplug various node sizes; the nodes can be 
offlined, resized, and then re-hotplugged.  
It's a very dynamic and flexible model that allows you to emulate all 
possible combinations of node hotplug without constantly rebooting.

> - N is the possible node number. And we can use 128M as the default 
> size for each hidden node if user doesn't specify a size.

My model allows you to define the node size you'd like to add at runtime.

> - If the user wants more mem for a hidden node, he just needs to specify the 
> "size".
> - Besides, the user can also use "mem=" to hide more mem and later use the 
> mem-add i/f to freely attach more mem to the hidden node during runtime.
> 

Each of these requires a reboot; with your model you cannot emulate 
hotplugging a node, offlining it, removing the memory, and re-hotplugging 
the same node with a larger amount of added memory.

> Your patch introduces additional dependency on "mem=", but ours is 
> simple and flexibly compatible with "mem=" and "numa=emu". 
> 

This is the natural use case of mem=, to truncate the memory map to only 
allow the kernel to have a portion of usable memory.  The remainder can be 
used by this new interface, if desired, with complete power and control 
over the size of nodes you're adding without having to conform to hidden 
node sizes that you've specified at boot.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 1/2] x86: add numa=possible command line option
  2010-11-21 14:26                         ` Américo Wang
@ 2010-11-21 21:46                           ` David Rientjes
  -1 siblings, 0 replies; 139+ messages in thread
From: David Rientjes @ 2010-11-21 21:46 UTC (permalink / raw)
  To: Américo Wang
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Greg Kroah-Hartman,
	Shaohui Zheng, Paul Mundt, Andrew Morton, Andi Kleen, Yinghai Lu,
	Haicheng Li, Randy Dunlap, linux-kernel, linux-mm, x86

On Sun, 21 Nov 2010, Américo Wang wrote:

> I am not sure how much value there is in making this dynamic;
> for CPUs, we do this at compile time, i.e. NR_CPUS,
> so how about NR_NODES?
> 

This is outside the scope of node hotplug emulation; it needs to be built 
on top of whatever the kernel implements.

> Also, numa=possible= is not as clear as numa=max=, for me at least.
> 

I like the name, but it requires that you know how many nodes the system 
already has.  In other words, numa=possible=4 allows you to specify that 4 
additional nodes will be possible, but initially offline, for this or 
other purposes.  numa=max=4 would be a no-op if the system actually had 4 
nodes.

I chose numa=possible over numa=additional because it is more clearly tied 
to node_possible_map, which is the only thing it modifies.
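
To illustrate the difference between the two spellings, here is a small
userspace sketch.  It assumes at most 64 nodes tracked in a single 64-bit
mask, and that additional possible nodes take the lowest unused ids; neither
assumption comes from the patch.  add_possible() and ensure_max() are
hypothetical names, and numa=max= is only the alternative proposed above,
not an existing option.

```c
#include <stdint.h>

/*
 * Model of numa=possible=<extra>: mark <extra> further node ids as
 * possible, on top of whatever is already set in the mask.
 */
static uint64_t add_possible(uint64_t possible, unsigned int extra)
{
	for (unsigned int nid = 0; nid < 64 && extra > 0; nid++) {
		if (!(possible & (1ULL << nid))) {
			possible |= 1ULL << nid;
			extra--;
		}
	}
	return possible;
}

/*
 * Model of the proposed numa=max=<max>: make sure node ids 0..max-1 are
 * possible.  Nothing changes if the system already has that many nodes.
 */
static uint64_t ensure_max(uint64_t possible, unsigned int max)
{
	for (unsigned int nid = 0; nid < max && nid < 64; nid++)
		possible |= 1ULL << nid;
	return possible;
}
```

On a system whose mask already has four nodes set, add_possible(mask, 4)
yields eight possible nodes while ensure_max(mask, 4) leaves the mask
unchanged, which is the no-op case described above.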

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 2/2] mm: add node hotplug emulation
  2010-11-21 17:34                           ` Greg KH
@ 2010-11-21 21:48                             ` David Rientjes
  -1 siblings, 0 replies; 139+ messages in thread
From: David Rientjes @ 2010-11-21 21:48 UTC (permalink / raw)
  To: Greg KH
  Cc: Andrew Morton, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Shaohui Zheng, Paul Mundt, Andi Kleen, Yinghai Lu, Haicheng Li,
	Randy Dunlap, linux-kernel, linux-mm, x86

On Sun, 21 Nov 2010, Greg KH wrote:

> But as this is a debugging thing, why not just put it in debugfs
> instead?
> 

Ok, I think Paul had a similar suggestion during the discussion of 
Shaohui's patchset.  I'll move it, thanks.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [patch 2/2 v2] mm: add node hotplug emulation
  2010-11-21 21:48                             ` David Rientjes
@ 2010-11-21 23:08                               ` David Rientjes
  -1 siblings, 0 replies; 139+ messages in thread
From: David Rientjes @ 2010-11-21 23:08 UTC (permalink / raw)
  To: Andrew Morton, Greg KH
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Shaohui Zheng,
	Paul Mundt, Andi Kleen, Yinghai Lu, Haicheng Li, Randy Dunlap,
	linux-kernel, linux-mm, x86

Add an interface to allow new nodes to be added when performing memory
hot-add.  This provides a convenient interface to test memory hotplug
notifier callbacks and surrounding hotplug code when new nodes are
onlined without actually having a machine with such hotpluggable SRAT
entries.

This adds a new debugfs interface at /sys/kernel/debug/hotplug/add_node
that behaves in a similar way to the memory hot-add "probe" interface.
Its format is size@start, where "size" is the size of the new node to be
added and "start" is the physical address of the new memory.

The new node id is a currently offline, but possible, node.  The bit must
be set in node_possible_map so that nr_node_ids is sized appropriately.

For emulation on x86, for example, it would be possible to set aside
memory for hotplugged nodes (say, anything above 2G) and to add an
additional four nodes as being possible on boot with

	mem=2G numa=possible=4

and then creating a new 128M node at runtime:

	# echo 128M@0x80000000 > /sys/kernel/debug/hotplug/add_node
	On node 1 totalpages: 0
	init_memory_mapping: 0000000080000000-0000000088000000
	 0080000000 - 0088000000 page 2M

Once the new node has been added, its memory can be onlined.  If this
memory represents memory section 16, for example:

	# echo online > /sys/devices/system/memory/memory16/state
	Built 2 zonelists in Node order, mobility grouping on.  Total pages: 514846
	Policy zone: Normal

 [ The memory section(s) mapped to a particular node are visible via
   /sys/devices/system/node/node1, in this example. ]

The new node is now hotplugged and ready for testing.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 v2: moved to debugfs as suggested by Greg KH

 (patch 1/2: "x86: add numa=possible command line option" is still valid)

 Documentation/memory-hotplug.txt |   24 +++++++++++++++
 mm/memory_hotplug.c              |   59 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 83 insertions(+), 0 deletions(-)

diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -18,6 +18,7 @@ be changed often.
 4. Physical memory hot-add phase
   4.1 Hardware(Firmware) Support
   4.2 Notify memory hot-add event by hand
+  4.3 Node hotplug emulation
 5. Logical Memory hot-add phase
   5.1. State of memory
   5.2. How to online memory
@@ -215,6 +216,29 @@ current implementation). You'll have to online memory by yourself.
 Please see "How to online memory" in this text.
 
 
+4.3 Node hotplug emulation
+------------
+With debugfs, it is possible to test node hotplug by assigning the newly
+added memory to a new node id when using a different interface with a similar
+behavior to "probe" described in section 4.2.  If a node id is possible
+(there are bits in /sys/devices/system/node/possible that are not online),
+then it may be used to emulate a newly added node as the result of memory
+hotplug by using the debugfs "add_node" interface.
+
+The add_node interface is located at "hotplug/add_node" at the debugfs mount
+point.
+
+You can create a new node of a specified size starting at the physical
+address of new memory by
+
+% echo size@start_address_of_new_memory > /sys/kernel/debug/hotplug/add_node
+
+Where "size" can be represented in megabytes or gigabytes (for example,
+"128M" or "1G").  The minimum size is that of a memory section.
+
+Once the new node has been added, it is possible to online the memory by
+toggling the "state" of its memory section(s) as described in section 5.1.
+
 
 ------------------------------
 5. Logical Memory hot-add phase
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -910,3 +910,62 @@ int remove_memory(u64 start, u64 size)
 }
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 EXPORT_SYMBOL_GPL(remove_memory);
+
+#ifdef CONFIG_DEBUG_FS
+#include <linux/debugfs.h>
+
+static struct dentry *hotplug_debug_root;
+
+static ssize_t add_node_store(struct file *file, const char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	nodemask_t mask;
+	u64 start, size;
+	char buffer[64];
+	char *p;
+	int nid;
+	int ret;
+
+	memset(buffer, 0, sizeof(buffer));
+	if (count > sizeof(buffer) - 1)
+		count = sizeof(buffer) - 1;
+	if (copy_from_user(buffer, buf, count))
+		return -EFAULT;
+
+	size = memparse(buffer, &p);
+	if (size < (PAGES_PER_SECTION << PAGE_SHIFT))
+		return -EINVAL;
+	if (*p != '@')
+		return -EINVAL;
+
+	start = simple_strtoull(p + 1, NULL, 0);
+
+	nodes_andnot(mask, node_possible_map, node_online_map);
+	nid = first_node(mask);
+	if (nid == MAX_NUMNODES)
+		return -ENOMEM;
+
+	ret = add_memory(nid, start, size);
+	return ret ? ret : count;
+}
+
+static const struct file_operations add_node_file_ops = {
+	.write		= add_node_store,
+	.llseek		= generic_file_llseek,
+};
+
+static int __init hotplug_debug_init(void)
+{
+	hotplug_debug_root = debugfs_create_dir("hotplug", NULL);
+	if (!hotplug_debug_root)
+		return -ENOMEM;
+
+	if (!debugfs_create_file("add_node", S_IWUSR, hotplug_debug_root,
+			NULL, &add_node_file_ops))
+		return -ENOMEM;
+
+	return 0;
+}
+
+module_init(hotplug_debug_init);
+#endif /* CONFIG_DEBUG_FS */
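
The "size@start" parsing in add_node_store() can be tried out in userspace
with a sketch like the one below.  It is not the kernel code: parse_size()
is a hypothetical stand-in for memparse() that handles only the K, M and G
suffixes, parse_add_node() is a hypothetical name, and the minimum size is
passed in by the caller because the memory section size depends on the
architecture.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for memparse(): a number with an optional K/M/G suffix. */
static uint64_t parse_size(const char *s, char **end)
{
	uint64_t v = strtoull(s, end, 0);

	switch (**end) {
	case 'G': case 'g':
		v <<= 10;	/* fall through */
	case 'M': case 'm':
		v <<= 10;	/* fall through */
	case 'K': case 'k':
		v <<= 10;
		(*end)++;	/* consume the suffix character */
		break;
	default:
		break;
	}
	return v;
}

/*
 * Same checks, in the same order, as add_node_store(): reject sizes below
 * the minimum, require '@', then read the start address in any base.
 */
static int parse_add_node(const char *buf, uint64_t min_size,
			  uint64_t *size, uint64_t *start)
{
	char *p;

	*size = parse_size(buf, &p);
	if (*size < min_size)
		return -EINVAL;
	if (*p != '@')
		return -EINVAL;

	*start = strtoull(p + 1, NULL, 0);
	return 0;
}
```

Assuming a 128M memory section, "128M@0x80000000" parses to a size of
0x8000000 and a start of 0x80000000, so the range ends at 0x88000000 as in
the init_memory_mapping line above, and the start address divided by the
section size gives 16, matching the memory16 example.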

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [8/8,v3] NUMA Hotplug Emulator: documentation
  2010-11-21 15:03     ` Américo Wang
@ 2010-11-21 23:33       ` Shaohui Zheng
  -1 siblings, 0 replies; 139+ messages in thread
From: Shaohui Zheng @ 2010-11-21 23:33 UTC (permalink / raw)
  To: Américo Wang
  Cc: akpm, linux-mm, linux-kernel, haicheng.li, lethal, ak,
	shaohui.zheng, Haicheng Li

On Sun, Nov 21, 2010 at 11:03:45PM +0800, Américo Wang wrote:
> On Wed, Nov 17, 2010 at 10:08:07AM +0800, shaohui.zheng@intel.com wrote:
> >+2) CPU hotplug emulation:
> >+
> >+The emulator reserves CPUs through a grub parameter; the reserved CPUs can be
> >+hot-added/hot-removed in software, which emulates the process of physical
> >+cpu hotplug.
> >+
> >+When hotplugging a CPU with the emulator, we use a logical CPU to emulate the
> >+CPU socket hotplug process. On CPUs that support SMT, some logical CPUs are in
> >+the same socket, but they may be located in different NUMA nodes under the
> >+emulator. We put each logical CPU into a fake CPU socket and assign it a unique
> >+phys_proc_id. Each fake socket holds only one logical CPU.
> >+
> >+ - to hide CPUs
> >+	- Use boot option "maxcpus=N" to hide CPUs;
> >+	  N is the number of CPUs to initialize
> >+	- Use boot option "cpu_hpe=on" to enable cpu hotplug emulation;
> >+	  when cpu_hpe is enabled, the remaining CPUs will not be initialized
> >+
> >+ - to hot-add CPU to node
> >+	$ echo nid > cpu/probe
> >+
> >+ - to hot-remove CPU
> >+	$ echo nid > cpu/release
> >+
> 
> Again, we already have software CPU hotplug,
> i.e. /sys/devices/system/cpu/cpuX/online.
That is CPU online/offline in the current kernel, not physical CPU hot-add or hot-remove.
The emulator is a tool to emulate the process of physical CPU hotplug.
> 
> You need to pick up another name for this.
> 
> From your documentation above, it looks like you are trying
> to move one CPU between nodes?
Yes, you are correct. With the cpu probe/release interface, you can hot-remove a
CPU from one node and hot-add it to another node.
> 
> >+	cpu_hpe=on/off
> >+		Enable/disable cpu hotplug emulation in software. When cpu_hpe=on,
> >+		sysfs provides a probe/release interface to hot-add/remove cpus dynamically.
> >+		This option is disabled by default.
> >+			
> 
> Why not just a CONFIG? IOW, why do we need to make another boot
> parameter for this?
Only developers or QA will use the emulator. We did not want to change the
default behavior for ordinary users who do not care about the hotplug emulator,
so we use a kernel parameter as a switch. Ordinary users need not be aware of
the existence of the emulator.


-- 
Thanks & Regards,
Shaohui


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [5/8,v3] NUMA Hotplug Emulator: support cpu probe/release in x86
  2010-11-21 14:45     ` Américo Wang
@ 2010-11-22  0:01       ` Shaohui Zheng
  -1 siblings, 0 replies; 139+ messages in thread
From: Shaohui Zheng @ 2010-11-22  0:01 UTC (permalink / raw)
  To: Américo Wang
  Cc: akpm, linux-mm, linux-kernel, haicheng.li, lethal, ak,
	shaohui.zheng, Ingo Molnar, Len Brown, Yinghai Lu, Haicheng Li

On Sun, Nov 21, 2010 at 10:45:11PM +0800, Américo Wang wrote:
> On Wed, Nov 17, 2010 at 10:08:04AM +0800, shaohui.zheng@intel.com wrote:
> >From: Shaohui Zheng <shaohui.zheng@intel.com>
> >
> >Add cpu interface probe/release under sysfs for x86. User can use this
> >interface to emulate the cpu hot-add process, it is for cpu hotplug 
> >test purpose. Add a kernel option CONFIG_ARCH_CPU_PROBE_RELEASE for this
> >feature.
> >
> >This interface provides a mechanism to emulate cpu hotplug with software
> > methods, it becomes possible to do cpu hotplug automation and stress
> >testing.
> >
> 
> Huh? We already have CPU online/offline...
> 
> Can you describe more about the difference?
> 
> Thanks.

Again, we have already tried to describe the difference between logical CPU
online/offline and physical CPU hot-add/hot-remove many times.

The following is my reply from another thread.
-------------------------------------------------------------------------------------------
> 
> I don't get it. CPU hotplug can already be tested using echo 0/1 >
> online, and that works on 386. How is this different?
> 
> It seems to add some numa magic. Why is it important?

Pavel,
	The full story is not easy to follow if you have not worked on this project, so the
question is understandable. Let me give a simple introduction to the background.

	We need to distinguish 2 different concepts to understand why we developed
the hotplug emulator.

 - CPU logical online/offline
	This is the existing feature you mentioned. We can online/offline CPUs through the sysfs
interface /sys/devices/system/cpu/cpuX/online (X is an integer, the CPU number):

	echo 0/1 > /sys/devices/system/cpu/cpuX/online
	
	This is logical CPU online/offline. When we do this operation, the CPU is already plugged
into the motherboard and the OS has already initialized it: the data structures and the CPU entries
in sysfs are created, and the CPU present mask and possible mask are set. No physical hardware
change is involved. The CPU status changes from offline to online, and the scheduler can then
run processes on it.
	
	CPU online/offline is controlled by the kernel option CONFIG_HOTPLUG_CPU.
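
For illustration only, the logical online/offline path described above can be driven from a
small shell helper. This is a minimal sketch, not part of the patch set: the function names and
the CPU_SYSFS override are made up for this example, and only the cpuX/online files themselves
are the existing kernel interface.

```shell
#!/bin/sh
# Sketch: logical CPU online/offline through sysfs.
# CPU_SYSFS is a hypothetical override so the helpers can be pointed at a
# scratch directory; on a real system it is /sys/devices/system/cpu.
CPU_SYSFS="${CPU_SYSFS:-/sys/devices/system/cpu}"

# cpu_set_online CPU STATE: write 0 (offline) or 1 (online) to cpuCPU/online.
cpu_set_online() {
    cpu="$1"
    state="$2"
    case "$state" in
        0|1) ;;
        *) echo "state must be 0 or 1" >&2; return 2 ;;
    esac
    file="$CPU_SYSFS/cpu$cpu/online"
    # The file only exists for CPUs the kernel has already initialized.
    if [ ! -e "$file" ]; then
        echo "cpu$cpu: no online file (CPU not initialized or not hot-pluggable)" >&2
        return 1
    fi
    echo "$state" > "$file"
}

# cpu_is_online CPU: print the current logical state (0 or 1).
cpu_is_online() {
    cat "$CPU_SYSFS/cpu$1/online"
}
```

Calling "cpu_set_online 1 0" followed by "cpu_set_online 1 1" (as root) takes CPU 1 offline and
brings it back; no probe or release step is involved because the CPU was initialized at boot.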

 - CPU hot-add/hot-remove

	This is physical CPU hot-add/hot-remove on the motherboard, without shutting down the machine.
After the hot-add operation, the new CPU is powered on and the OS recognizes the new CPUs through SCI
interrupts. The OS then initializes the new CPUs, creates the related CPU structures, and creates sysfs
entries for the new CPUs. Once all of that is done, the CPU is ready for logical online.

The process to hot-add a CPU:
	 1) The physical CPU is hot-added to the motherboard after the machine is powered on
	 2) The BIOS sends SCI interrupts to notify the OS
	 3) The Linux hotplug handler parses the data from the acpi_handle data
	 4) The hotplug handler initializes the CPU structure according to the CPU ACPI data

Current situation:
	1) Provides developers an environment
	 Very little hardware supports CPU hot-add/hot-remove, so we need to create a working environment
	 for developers to write and debug hotplug code even though they do not have such hardware on hand.
	 That is exactly what the NUMA hotplug emulator does. "Physical hotplug emulator" would be a better name.

  	 We had 2 candidate solutions to this problem, and this one was finally selected; if you want to know
	 more about the solutions, we can continue on this thread.

	2) Offers an automated test interface for the Linux CPU hot-add/hot-remove code
	The Linux hot-add/hot-remove code has obvious bugs, but we do not see any automated test suite for it,
	even in the LTP project (LTP has a hotplug suite for logical CPU online/offline).

	Testing physical hot-add/hot-remove code in an automated way is known to be difficult, but the hotplug
	emulator does a good job of it. We reproduced all the major hotplug bugs against the internal emulator
	v2 and v3.
	
We are sharing it with the community and hope that more ideas and talent will be contributed to it.
We want to show an example of software emulation, and we hope more people benefit from it; that is
the purpose of this patch set.
	
PowerPC support
	For ppc, it was added about half a year ago by Nathan Fontenot, but x86 does not have such a feature.
Thanks to lethal for mentioning it. We have already done some research on it, and I will reply in another
thread.

	commit 12633e803a2a556f6469e0933d08233d0844a2d9
	Author: Nathan Fontenot <nfont@austin.ibm.com>
	Date:   Wed Nov 25 17:23:25 2009 +0000

	commit 1a8061c46c46c960f715c597b9d279ea2ba42bd9
	Author: Nathan Fontenot <nfont@austin.ibm.com>
	Date:   Tue Nov 24 21:13:32 2009 +0000


We inherit the naming style from ppc: CPU hot-add/hot-remove is called CPU probe/release in the kernel,
and it is controlled by the kernel option CONFIG_ARCH_CPU_PROBE_RELEASE.
-- 
Thanks & Regards,
Shaohui

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 2/2 v2] mm: add node hotplug emulation
  2010-11-21 23:08                               ` David Rientjes
@ 2010-11-22  0:56                                 ` Greg KH
  -1 siblings, 0 replies; 139+ messages in thread
From: Greg KH @ 2010-11-22  0:56 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Shaohui Zheng, Paul Mundt, Andi Kleen, Yinghai Lu, Haicheng Li,
	Randy Dunlap, linux-kernel, linux-mm, x86

On Sun, Nov 21, 2010 at 03:08:17PM -0800, David Rientjes wrote:
> Add an interface to allow new nodes to be added when performing memory
> hot-add.  This provides a convenient interface to test memory hotplug
> notifier callbacks and surrounding hotplug code when new nodes are
> onlined without actually having a machine with such hotpluggable SRAT
> entries.
> 
> This adds a new debugfs interface at /sys/kernel/debug/hotplug/add_node

The rule for debugfs is "there are no rules", but perhaps you might want
to name "hotplug" a bit more specific for what you are doing?  "hotplug"
means pretty much anything these days, so how about s/hotplug/node/
instead as that is what you are controlling.

Just a suggestion...

greg k-h

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [patch 1/2] x86: add numa=possible command line option
  2010-11-21 21:46                           ` David Rientjes
@ 2010-11-22 15:43                             ` Américo Wang
  -1 siblings, 0 replies; 139+ messages in thread
From: Américo Wang @ 2010-11-22 15:43 UTC (permalink / raw)
  To: David Rientjes
  Cc: Américo Wang, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Greg Kroah-Hartman, Shaohui Zheng, Paul Mundt, Andrew Morton,
	Andi Kleen, Yinghai Lu, Haicheng Li, Randy Dunlap, linux-kernel,
	linux-mm, x86

On Sun, Nov 21, 2010 at 01:46:07PM -0800, David Rientjes wrote:
>On Sun, 21 Nov 2010, Américo Wang wrote:
>> Also, numa=possible= is not as clear as numa=max=, for me at least.
>> 
>
>I like name, but it requires that you know how many nodes that system 
>already has.  In other words, numa=possible=4 allows you to specify that 4 
>additional nodes will be possible, but initially offline, for this or 
>other purposes.  numa=max=4 would be no-op if the system actually had 4 
>nodes.
>
>I chose numa=possible over numa=additional because it is more clearly tied 
>to node_possible_map, which is the only thing it modifies.

Okay, I thought "possible" meant "max", but "possible" means "additional" here.
It's clear to me now.
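
One way to see the effect of such an option is to compare the node masks that sysfs already
exports. The sketch below assumes the standard /sys/devices/system/node/possible and
/sys/devices/system/node/online files; the helper names and the NODE_SYSFS override are
hypothetical, and numa=possible itself is only the option proposed in this patch.

```shell
#!/bin/sh
# Sketch: list nodes that are possible but not (yet) online.
# NODE_SYSFS is a hypothetical override for pointing at a scratch directory.
NODE_SYSFS="${NODE_SYSFS:-/sys/devices/system/node}"

# expand_list LIST: turn a kernel range list such as "0,2-5" into "0 2 3 4 5".
expand_list() {
    out=""
    for part in $(echo "$1" | tr ',' ' '); do
        case "$part" in
            *-*)
                lo="${part%-*}"
                hi="${part#*-}"
                i="$lo"
                while [ "$i" -le "$hi" ]; do
                    out="$out $i"
                    i=$((i + 1))
                done
                ;;
            *) out="$out $part" ;;
        esac
    done
    printf '%s\n' "${out# }"
}

# offline_possible_nodes: print node ids present in "possible" but not in "online".
offline_possible_nodes() {
    possible=$(expand_list "$(cat "$NODE_SYSFS/possible")")
    online=" $(expand_list "$(cat "$NODE_SYSFS/online")") "
    result=""
    for n in $possible; do
        case "$online" in
            *" $n "*) ;;                 # already online, skip
            *) result="$result $n" ;;    # possible but offline
        esac
    done
    printf '%s\n' "${result# }"
}
```

If the option behaves as described, the ids printed by offline_possible_nodes would be the
additional nodes that are available for a later hot-add.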

Thanks!


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [5/8,v3] NUMA Hotplug Emulator: support cpu probe/release in x86
  2010-11-22  0:01       ` Shaohui Zheng
@ 2010-11-22 15:51         ` Américo Wang
  -1 siblings, 0 replies; 139+ messages in thread
From: Américo Wang @ 2010-11-22 15:51 UTC (permalink / raw)
  To: Shaohui Zheng
  Cc: Américo Wang, akpm, linux-mm, linux-kernel, haicheng.li,
	lethal, ak, shaohui.zheng, Ingo Molnar, Len Brown, Yinghai Lu,
	Haicheng Li

On Mon, Nov 22, 2010 at 08:01:04AM +0800, Shaohui Zheng wrote:
>On Sun, Nov 21, 2010 at 10:45:11PM +0800, Américo Wang wrote:
>> On Wed, Nov 17, 2010 at 10:08:04AM +0800, shaohui.zheng@intel.com wrote:
>> >From: Shaohui Zheng <shaohui.zheng@intel.com>
>> >
>> >Add cpu interface probe/release under sysfs for x86. User can use this
>> >interface to emulate the cpu hot-add process, it is for cpu hotplug 
>> >test purpose. Add a kernel option CONFIG_ARCH_CPU_PROBE_RELEASE for this
>> >feature.
>> >
>> >This interface provides a mechanism to emulate cpu hotplug with software
>> > methods, it becomes possible to do cpu hotplug automation and stress
>> >testing.
>> >
>> 
>> Huh? We already have CPU online/offline...
>> 
>> Can you describe more about the difference?
>> 
>> Thanks.
>
>Again, we already try to discribe the difference between logcial cpu
>online/offline and physical cpu online/offline many times.
>

I see. With "maxcpus=" we only have the specified number of CPUs
that can be onlined/offlined, and you are trying to bring up
the rest of the CPUs hidden by "maxcpus=". :) Correct?

I think the idea is cool, but I think you need to improve
the documentation for people like me who don't follow the
hardware concepts. ;)

Thanks.

-- 
Live like a child, think like the god.
 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [8/8,v3] NUMA Hotplug Emulator: documentation
  2010-11-21 23:33       ` Shaohui Zheng
@ 2010-11-22 16:04         ` Américo Wang
  -1 siblings, 0 replies; 139+ messages in thread
From: Américo Wang @ 2010-11-22 16:04 UTC (permalink / raw)
  To: Shaohui Zheng
  Cc: Américo Wang, akpm, linux-mm, linux-kernel, haicheng.li,
	lethal, ak, shaohui.zheng, Haicheng Li

On Mon, Nov 22, 2010 at 07:33:51AM +0800, Shaohui Zheng wrote:
>On Sun, Nov 21, 2010 at 11:03:45PM +0800, Américo Wang wrote:
>> 
>> >From your documentation above, it looks like you are trying
>> to move one CPU between nodes?
>Yes, you are correct. With cpu probe/release interface, you can hot-remove a
>CPU from a node, and hot-add it to another node.


Can I also move the CPU to another node _after_ it is hot-added?
Or do I have to hot-remove it first and then hot-add it again?

>> 
>> >+	cpu_hpe=on/off
>> >+		Enable/disable cpu hotplug emulation with software method. when cpu_hpe=on,
>> >+		sysfs provides probe/release interface to hot add/remove cpu dynamically.
>> >+		this option is disabled in default.
>> >+			
>> 
>> Why not just a CONFIG? IOW, why do we need to make another boot
>> parameter for this?
>Only the developer or QA will use the emulator, we did not want to change the
>default action for common user who does not care the hotplug emulator, so we
>use a kernel parameter as a switch. The common user is not aware the existence
>of the emulator.
>

I think it is also useful to other Linux users, e.g. after I
boot with "maxcpus=1", I can still bring the remaining 3 CPUs
back without a reboot.

Thanks.

-- 
Live like a child, think like the god.
 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [8/8,v3] NUMA Hotplug Emulator: documentation
  2010-11-22 16:04         ` Américo Wang
@ 2010-11-22 23:23           ` Shaohui Zheng
  -1 siblings, 0 replies; 139+ messages in thread
From: Shaohui Zheng @ 2010-11-22 23:23 UTC (permalink / raw)
  To: Américo Wang
  Cc: akpm, linux-mm, linux-kernel, haicheng.li, lethal, ak,
	shaohui.zheng, Haicheng Li

On Tue, Nov 23, 2010 at 12:04:12AM +0800, Américo Wang wrote:
> On Mon, Nov 22, 2010 at 07:33:51AM +0800, Shaohui Zheng wrote:
> >On Sun, Nov 21, 2010 at 11:03:45PM +0800, Américo Wang wrote:
> >> 
> >> >From your documentation above, it looks like you are trying
> >> to move one CPU between nodes?
> >Yes, you are correct. With cpu probe/release interface, you can hot-remove a
> >CPU from a node, and hot-add it to another node.
> 
> 
> Can I also move the CPU to another node _after_ it is hot-added?
> Or I have to hot-remove it first and then hot-add it again?
Of course you can. You hot-remove it via the cpu/release interface, and then hot-add
it again via the cpu/probe interface.

With the cpu probe/release interface, we can design stress test cases that hot-add/remove
CPUs from a script.
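
A stress case of that kind could look roughly like the sketch below. It assumes the probe and
release files sit under /sys/devices/system/cpu and that both take a node id, as in the patch
documentation quoted earlier in this thread ("echo nid > cpu/probe", "echo nid > cpu/release");
the function name and the CPU_SYSFS override are hypothetical.

```shell
#!/bin/sh
# Sketch: repeated CPU hot-add/hot-remove through the emulator interface.
# CPU_SYSFS is a hypothetical override; the probe/release location is assumed.
CPU_SYSFS="${CPU_SYSFS:-/sys/devices/system/cpu}"

# hotplug_stress NID ROUNDS: probe then release on node NID, ROUNDS times.
hotplug_stress() {
    nid="$1"
    rounds="$2"
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # Hot-add a CPU to the node; stop at the first failure.
        echo "$nid" > "$CPU_SYSFS/probe" || {
            echo "probe failed in round $i" >&2
            return 1
        }
        # Hot-remove it again (value follows the quoted documentation).
        echo "$nid" > "$CPU_SYSFS/release" || {
            echo "release failed in round $i" >&2
            return 1
        }
        i=$((i + 1))
    done
    echo "completed $rounds probe/release rounds on node $nid"
}
```

Running "hotplug_stress 1 100" as root on a kernel booted with the emulator enabled would cycle
a CPU in and out of node 1; checking dmesg between rounds is left out of the sketch.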
> 
> >> 
> >> >+	cpu_hpe=on/off
> >> >+		Enable/disable cpu hotplug emulation with software method. when cpu_hpe=on,
> >> >+		sysfs provides probe/release interface to hot add/remove cpu dynamically.
> >> >+		this option is disabled in default.
> >> >+			
> >> 
> >> Why not just a CONFIG? IOW, why do we need to make another boot
> >> parameter for this?
> >Only the developer or QA will use the emulator, we did not want to change the
> >default action for common user who does not care the hotplug emulator, so we
> >use a kernel parameter as a switch. The common user is not aware the existence
> >of the emulator.
> >
> 
> I think it is also useful to other Linux users, e.g. after I
> boot with "maxcpus=1", I can still bring the rest 3 CPU's
> back without reboot.
You understand it very well. Probe/release is already implemented on ppc,
but it was a missing feature on x86, so we implemented it with these patches.

> 
> Thanks.
> 
> -- 
> Live like a child, think like the god.
>  

-- 
Thanks & Regards,
Shaohui


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [5/8,v3] NUMA Hotplug Emulator: support cpu probe/release in x86
  2010-11-22 15:51         ` Américo Wang
@ 2010-11-22 23:29           ` Shaohui Zheng
  -1 siblings, 0 replies; 139+ messages in thread
From: Shaohui Zheng @ 2010-11-22 23:29 UTC (permalink / raw)
  To: Américo Wang
  Cc: akpm, linux-mm, linux-kernel, haicheng.li, lethal, ak,
	shaohui.zheng, Ingo Molnar, Len Brown, Yinghai Lu, Haicheng Li

On Mon, Nov 22, 2010 at 11:51:52PM +0800, Américo Wang wrote:
> On Mon, Nov 22, 2010 at 08:01:04AM +0800, Shaohui Zheng wrote:
> >On Sun, Nov 21, 2010 at 10:45:11PM +0800, Américo Wang wrote:
> >> On Wed, Nov 17, 2010 at 10:08:04AM +0800, shaohui.zheng@intel.com wrote:
> >> >From: Shaohui Zheng <shaohui.zheng@intel.com>
> >> >
> >> >Add cpu interface probe/release under sysfs for x86. User can use this
> >> >interface to emulate the cpu hot-add process, it is for cpu hotplug 
> >> >test purpose. Add a kernel option CONFIG_ARCH_CPU_PROBE_RELEASE for this
> >> >feature.
> >> >
> >> >This interface provides a mechanism to emulate cpu hotplug with software
> >> > methods, it becomes possible to do cpu hotplug automation and stress
> >> >testing.
> >> >
> >> 
> >> Huh? We already have CPU online/offline...
> >> 
> >> Can you describe more about the difference?
> >> 
> >> Thanks.
> >
> >Again, we already try to discribe the difference between logcial cpu
> >online/offline and physical cpu online/offline many times.
> >
> 
> I see, with "maxcpus=" we will only have the specified number
> of CPU's which can be online/offline, you are trying to bring
> the rest of CPU's hidden by "maxcpus=". :) Correct?
Yes. When we bring the remaining CPUs online, it exercises our CPU hot-add code logic.
> 
> I think the idea is cool, but I think you need to improve
> the documentation, for people who don't follow the hardware
> concepts like me. ;)
CPU hot-add is supported by only a few hardware platforms, so many users may
never have seen such hardware. We should document it better; thanks for the
reminder.
> 
> Thanks.
> 
> -- 
> Live like a child, think like the god.
>  

-- 
Thanks & Regards,
Shaohui
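
To make the distinction discussed above concrete: CPUs held back by "maxcpus="
are, as far as I understand, reported as present but not online. The minimal
Python sketch below only reads the standard sysfs mask files under
/sys/devices/system/cpu (online, present); it does not touch the probe/release
files added by this patch, since their exact semantics are not shown in this
message. The function names are illustrative, and the parser assumes the plain
"0-3,5" cpulist form.

```python
from pathlib import Path

# Standard sysfs location of the CPU mask files.
CPU_SYSFS = Path("/sys/devices/system/cpu")


def parse_cpulist(text):
    """Parse a kernel cpulist string such as "0-3,5" into a set of ints."""
    cpus = set()
    for chunk in text.strip().split(","):
        if not chunk:
            continue  # an empty file means an empty set
        first, _, last = chunk.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def read_cpulist(name, root=CPU_SYSFS):
    """Read one of the mask files (for example "online" or "present")."""
    return parse_cpulist((Path(root) / name).read_text())


def not_yet_online(root=CPU_SYSFS):
    """Return CPUs the kernel reports as present but not online."""
    return read_cpulist("present", root) - read_cpulist("online", root)


if __name__ == "__main__":
    print("present but offline:", sorted(not_yet_online()))
```

On a machine booted with "maxcpus=", the printed list would be expected to
name the CPUs that can still be brought up later.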



* Re: [patch 2/2 v2] mm: add node hotplug emulation
  2010-11-22  0:56                                 ` Greg KH
@ 2010-11-28  1:52                                   ` David Rientjes
  -1 siblings, 0 replies; 139+ messages in thread
From: David Rientjes @ 2010-11-28  1:52 UTC (permalink / raw)
  To: Greg KH
  Cc: Andrew Morton, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Shaohui Zheng, Paul Mundt, Andi Kleen, Yinghai Lu, Haicheng Li,
	Randy Dunlap, linux-kernel, linux-mm, x86

On Sun, 21 Nov 2010, Greg KH wrote:

> > Add an interface to allow new nodes to be added when performing memory
> > hot-add.  This provides a convenient interface to test memory hotplug
> > notifier callbacks and surrounding hotplug code when new nodes are
> > onlined without actually having a machine with such hotpluggable SRAT
> > entries.
> > 
> > This adds a new debugfs interface at /sys/kernel/debug/hotplug/add_node
> 
> The rule for debugfs is "there are no rules", but perhaps you might want
> to name "hotplug" a bit more specific for what you are doing?  "hotplug"
> means pretty much anything these days, so how about s/hotplug/node/
> instead as that is what you are controlling.
> 
> Just a suggestion...
> 

Hmm, how strongly do you feel about that?  There's nothing node specific 
in the memory hotplug code where this lives, so we'd probably have to 
define the dentry elsewhere and even then it would only be needed for 
CONFIG_MEMORY_HOTPLUG.

I personally don't see this as node debugging but rather as memory hotplug 
callback debugging.


* Re: [patch 2/2 v2] mm: add node hotplug emulation
  2010-11-28  1:52                                   ` David Rientjes
@ 2010-11-28  5:17                                     ` Greg KH
  -1 siblings, 0 replies; 139+ messages in thread
From: Greg KH @ 2010-11-28  5:17 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Shaohui Zheng, Paul Mundt, Andi Kleen, Yinghai Lu, Haicheng Li,
	Randy Dunlap, linux-kernel, linux-mm, x86

On Sat, Nov 27, 2010 at 05:52:03PM -0800, David Rientjes wrote:
> On Sun, 21 Nov 2010, Greg KH wrote:
> 
> > > Add an interface to allow new nodes to be added when performing memory
> > > hot-add.  This provides a convenient interface to test memory hotplug
> > > notifier callbacks and surrounding hotplug code when new nodes are
> > > onlined without actually having a machine with such hotpluggable SRAT
> > > entries.
> > > 
> > > This adds a new debugfs interface at /sys/kernel/debug/hotplug/add_node
> > 
> > The rule for debugfs is "there are no rules", but perhaps you might want
> > to name "hotplug" a bit more specific for what you are doing?  "hotplug"
> > means pretty much anything these days, so how about s/hotplug/node/
> > instead as that is what you are controlling.
> > 
> > Just a suggestion...
> > 
> 
> Hmm, how strongly do you feel about that?  There's nothing node specific 
> in the memory hotplug code where this lives, so we'd probably have to 
> define the dentry elsewhere and even then it would only be needed for 
> CONFIG_MEMORY_HOTPLUG.
> 
> I personally don't see this as node debugging but rather as memory hotplug 
> callback debugging.

Then name it as such, not the generic "hotplug" like you just did.
"mem_hotplug" would make sense, right?

thanks,

greg k-h


* Re: [patch 2/2 v2] mm: add node hotplug emulation
  2010-11-28  5:17                                     ` Greg KH
@ 2010-11-30  0:04                                       ` David Rientjes
  -1 siblings, 0 replies; 139+ messages in thread
From: David Rientjes @ 2010-11-30  0:04 UTC (permalink / raw)
  To: Greg KH
  Cc: Andrew Morton, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Shaohui Zheng, Paul Mundt, Andi Kleen, Yinghai Lu, Haicheng Li,
	Randy Dunlap, linux-kernel, linux-mm, x86

On Sat, 27 Nov 2010, Greg KH wrote:

> Then name it as such, not the generic "hotplug" like you just did.
> "mem_hotplug" would make sense, right?
> 

Ok, Shaohui has taken my patches to post as part of the larger series, and 
I requested that the debugfs directory be renamed from "hotplug" to 
"mem_hotplug" as you suggested (and you should be cc'd).  I agree it's a 
better name to isolate memory hotplug debugging triggers from others.

Thanks Greg!
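
As a rough illustration of what the rename means for a test script: the
trigger described earlier in this thread lives at
/sys/kernel/debug/hotplug/add_node, and after the change it would be expected
under mem_hotplug/ instead. The Python sketch below only locates the file,
preferring the new name. The helper name is hypothetical, and nothing is
written to the file because the accepted input format is not stated in these
messages.

```python
from pathlib import Path


def find_add_node(debugfs_root="/sys/kernel/debug"):
    """Return the add_node trigger path, preferring the agreed new name."""
    # "mem_hotplug" is the name agreed on in this thread; "hotplug" is the
    # directory name used by the earlier posting of the patch.
    for dirname in ("mem_hotplug", "hotplug"):
        candidate = Path(debugfs_root) / dirname / "add_node"
        if candidate.exists():
            return candidate
    return None  # debugfs not mounted, or the patch is not applied


if __name__ == "__main__":
    print(find_add_node() or "no add_node trigger found")
```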


end of thread, newest message: 2010-11-30  0:04 UTC

Thread overview: 139+ messages
2010-11-17  2:07 [0/8,v3] NUMA Hotplug Emulator - Introduction & Feedbacks shaohui.zheng
2010-11-17  2:08 ` [1/8,v3] NUMA Hotplug Emulator: add function to hide memory region via e820 table shaohui.zheng
2010-11-17  8:16   ` David Rientjes
2010-11-18  9:20     ` Shaohui Zheng
2010-11-18 21:16       ` David Rientjes
2010-11-19  0:12         ` Shaohui Zheng
2010-11-21  0:45           ` David Rientjes
2010-11-21 14:00             ` Américo Wang
2010-11-21 21:33               ` David Rientjes
2010-11-17  2:08 ` [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation shaohui.zheng
2010-11-17  8:16   ` David Rientjes
2010-11-17  7:51     ` Shaohui Zheng
2010-11-17 21:10       ` David Rientjes
2010-11-18  4:14         ` Shaohui Zheng
2010-11-18  6:27           ` Paul Mundt
2010-11-18  5:27             ` Shaohui Zheng
2010-11-18 21:24               ` David Rientjes
2010-11-19  0:32                 ` Shaohui Zheng
2010-11-21  0:48                   ` David Rientjes
2010-11-21  2:28                     ` [patch 1/2] x86: add numa=possible command line option David Rientjes
2010-11-21  2:28                       ` [patch 2/2] mm: add node hotplug emulation David Rientjes
2010-11-21 17:34                         ` Greg KH
2010-11-21 21:48                           ` David Rientjes
2010-11-21 23:08                             ` [patch 2/2 v2] " David Rientjes
2010-11-22  0:56                               ` Greg KH
2010-11-28  1:52                                 ` David Rientjes
2010-11-28  5:17                                   ` Greg KH
2010-11-30  0:04                                     ` David Rientjes
2010-11-21 14:26                       ` [patch 1/2] x86: add numa=possible command line option Américo Wang
2010-11-21 21:46                         ` David Rientjes
2010-11-22 15:43                           ` Américo Wang
2010-11-21 15:14                     ` [2/8,v3] NUMA Hotplug Emulator: infrastructure of NUMA hotplug emulation Li, Haicheng
2010-11-21 21:42                       ` David Rientjes
2010-11-18 21:19           ` David Rientjes
2010-11-17  2:08 ` [3/8,v3] NUMA Hotplug Emulator: Userland interface to hotplug-add fake offlined nodes shaohui.zheng
2010-11-17  8:16   ` David Rientjes
2010-11-17  2:08 ` [4/8,v3] NUMA Hotplug Emulator: Abstract cpu register functions shaohui.zheng
2010-11-17  2:08 ` [5/8,v3] NUMA Hotplug Emulator: support cpu probe/release in x86 shaohui.zheng
2010-11-21 14:45   ` Américo Wang
2010-11-22  0:01     ` Shaohui Zheng
2010-11-22 15:51       ` Américo Wang
2010-11-22 23:29         ` Shaohui Zheng
2010-11-17  2:08 ` [6/8,v3] NUMA Hotplug Emulator: Fake CPU socket with logical CPU on x86 shaohui.zheng
2010-11-17  2:08 ` [7/8,v3] NUMA Hotplug Emulator: extend memory probe interface to support NUMA shaohui.zheng
2010-11-17 18:50   ` Dave Hansen
2010-11-17 21:18     ` David Rientjes
2010-11-17 21:55       ` Dave Hansen
2010-11-17 22:44         ` David Rientjes
2010-11-17 23:00           ` Dave Hansen
2010-11-17 23:17             ` David Rientjes
2010-11-18 16:59           ` Aaron Durbin
2010-11-18  4:48       ` Shaohui Zheng
2010-11-18  6:24         ` Paul Mundt
2010-11-18 21:28           ` David Rientjes
2010-11-18 21:31         ` David Rientjes
2010-11-18  4:36     ` Shaohui Zheng
2010-11-19  7:51     ` Shaohui Zheng
2010-11-19 16:36       ` Dave Hansen
2010-11-17  2:08 ` [8/8,v3] NUMA Hotplug Emulator: documentation shaohui.zheng
2010-11-17 23:06   ` Randy Dunlap
2010-11-18  2:31     ` Shaohui Zheng
2010-11-21 15:03   ` Américo Wang
2010-11-21 15:16     ` Li, Haicheng
2010-11-21 23:33     ` Shaohui Zheng
2010-11-22 16:04       ` Américo Wang
2010-11-22 23:23         ` Shaohui Zheng
2010-11-17  5:22 ` [0/8,v3] NUMA Hotplug Emulator - Introduction & Feedbacks Paul Mundt
2010-11-19  5:54   ` Shaohui Zheng
2010-11-17  9:26 ` Yinghai Lu
2010-11-18  2:03   ` Shaohui Zheng
