linux-kernel.vger.kernel.org archive mirror
* [PATCH v5 00/14] memory-hotplug: hot-remove physical memory
@ 2012-12-24 12:09 Tang Chen
  2012-12-24 12:09 ` [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence Tang Chen
                   ` (13 more replies)
  0 siblings, 14 replies; 47+ messages in thread
From: Tang Chen @ 2012-12-24 12:09 UTC (permalink / raw)
  To: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen,
	hpa, linfeng, laijs, mgorman, yinghai
  Cc: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux

Hi Andrew,

Here is the physical memory hot-remove patch-set, based on 3.8-rc1.

This patch-set aims to implement physical memory hot-removal.

The patches can free/remove the following things:

  - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/14]
  - memmap of sparse-vmemmap                  : [PATCH 6,7,8,10/14]
  - page table of removed memory              : [RFC PATCH 7,8,10/14]
  - node and related sysfs files              : [RFC PATCH 13-14/14]


How to test this patchset?
1. Apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE
   and ACPI_HOTPLUG_MEMORY must be selected.
2. Load the acpi_memhotplug module.
3. Hotplug the memory device (this depends on your hardware).
   You will see the memory device under the directory /sys/bus/acpi/devices/.
   Its name is PNP0C80:XX.
4. Online/offline pages provided by this memory device.
   You can write online/offline to /sys/devices/system/memory/memoryX/state
   to online/offline the pages provided by this memory device.
5. Hot-remove the memory device.
   You can hot-remove the memory device via the hardware, or by writing 1 to
   /sys/bus/acpi/devices/PNP0C80:XX/eject.

Note: if the memory provided by the memory device is in use by the kernel, it
can't be offlined. This is not a bug.
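
The state-file interaction in steps 4-5 can be mocked in user space (a sketch
against a temporary directory with a made-up block number; on real hardware
the files live under /sys/devices/system/memory and /sys/bus/acpi/devices):

```python
# Mock of the sysfs interaction in steps 4-5 (hypothetical paths/numbers).
import tempfile
from pathlib import Path

sysfs = Path(tempfile.mkdtemp())
state = sysfs / "devices/system/memory/memory8/state"
eject = sysfs / "bus/acpi/devices/PNP0C80:00/eject"
state.parent.mkdir(parents=True)
eject.parent.mkdir(parents=True)

state.write_text("online\n")      # step 4: online the block's pages
state.write_text("offline\n")     # step 4: offline them again
eject.write_text("1\n")           # step 5: request hot-remove
print(state.read_text().strip())  # -> offline
```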


Changelogs from v4 to v5:
 Patch7: new patch, move pgdat_resize_lock into sparse_remove_one_section() to
         avoid disabling irqs, because we need to flush the TLB when freeing
         page tables.
 Patch8: new patch, factor out some common APIs that are used to free direct
         mapping and vmemmap page tables.
 Patch9: free direct mapping page tables on the x86_64 arch.
 Patch10: free vmemmap page tables.
 Patch11: since freeing memmap with vmemmap has been implemented, the
          CONFIG_SPARSEMEM_VMEMMAP conditional around __remove_section() is
          no longer needed.
 Patch13: no need to modify acpi_memory_disable_device() since it has been
          removed; also add an nid parameter when calling remove_memory().

Changelogs from v3 to v4:
 Patch7: remove unused code.
 Patch8: fix the nr_pages that is passed to free_map_bootmem().

Changelogs from v2 to v3:
 Patch9: call sync_global_pgds() if pgd is changed
 Patch10: fix a problem in the patch

Changelogs from v1 to v2:
 Patch1: new patch, offline memory twice. 1st iteration: offline every
         non-primary memory block. 2nd iteration: offline the primary
         (i.e. first added) memory block.

 Patch3: new patch, no logical change, just remove redundant code.

 Patch9: merge the patch from wujianguo into this patch. Flush the TLB on all
         CPUs after the page table is changed.

 Patch12: new patch, free node_data when a node is offlined.


Tang Chen (5):
  memory-hotplug: move pgdat_resize_lock into
    sparse_remove_one_section()
  memory-hotplug: remove page table of x86_64 architecture
  memory-hotplug: remove memmap of sparse-vmemmap
  memory-hotplug: Integrated __remove_section() of
    CONFIG_SPARSEMEM_VMEMMAP.
  memory-hotplug: remove sysfs file of node

Wen Congyang (5):
  memory-hotplug: try to offline the memory twice to avoid dependence
  memory-hotplug: remove redundant codes
  memory-hotplug: introduce new function arch_remove_memory() for
    removing page table depends on architecture
  memory-hotplug: Common APIs to support page tables hot-remove
  memory-hotplug: free node_data when a node is offlined

Yasuaki Ishimatsu (4):
  memory-hotplug: check whether all memory blocks are offlined or not
    when removing memory
  memory-hotplug: remove /sys/firmware/memmap/X sysfs
  memory-hotplug: implement register_page_bootmem_info_section of
    sparse-vmemmap
  memory-hotplug: memory_hotplug: clear zone when removing the memory

 arch/arm64/mm/mmu.c                  |    3 +
 arch/ia64/mm/discontig.c             |   10 +
 arch/ia64/mm/init.c                  |   18 ++
 arch/powerpc/mm/init_64.c            |   10 +
 arch/powerpc/mm/mem.c                |   12 +
 arch/s390/mm/init.c                  |   12 +
 arch/s390/mm/vmem.c                  |   10 +
 arch/sh/mm/init.c                    |   17 ++
 arch/sparc/mm/init_64.c              |   10 +
 arch/tile/mm/init.c                  |    8 +
 arch/x86/include/asm/pgtable_types.h |    1 +
 arch/x86/mm/init_32.c                |   12 +
 arch/x86/mm/init_64.c                |  382 ++++++++++++++++++++++++++++++++
 arch/x86/mm/pageattr.c               |   47 ++--
 drivers/acpi/acpi_memhotplug.c       |    8 +-
 drivers/base/memory.c                |    6 +
 drivers/firmware/memmap.c            |   98 ++++++++-
 include/linux/bootmem.h              |    1 +
 include/linux/firmware-map.h         |    6 +
 include/linux/memory_hotplug.h       |   15 +-
 include/linux/mm.h                   |    4 +-
 mm/memory_hotplug.c                  |  406 ++++++++++++++++++++++++++++++++--
 mm/sparse.c                          |    8 +-
 23 files changed, 1043 insertions(+), 61 deletions(-)



* [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence
  2012-12-24 12:09 [PATCH v5 00/14] memory-hotplug: hot-remove physical memory Tang Chen
@ 2012-12-24 12:09 ` Tang Chen
  2012-12-25  8:35   ` Glauber Costa
  2012-12-26  3:02   ` Kamezawa Hiroyuki
  2012-12-24 12:09 ` [PATCH v5 02/14] memory-hotplug: check whether all memory blocks are offlined or not when removing memory Tang Chen
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 47+ messages in thread
From: Tang Chen @ 2012-12-24 12:09 UTC (permalink / raw)
  To: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen,
	hpa, linfeng, laijs, mgorman, yinghai
  Cc: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux

From: Wen Congyang <wency@cn.fujitsu.com>

Memory can't be offlined when CONFIG_MEMCG is selected.
For example: there is a memory device on node 1, and its address range
is [1G, 1.5G). You will find 4 new directories, memory8, memory9, memory10
and memory11, under the directory /sys/devices/system/memory/.

If CONFIG_MEMCG is selected, we allocate memory to store page cgroups
when we online pages. When we online memory8, the memory that stores its
page cgroups is not provided by this memory device. But when we online
memory9, the memory that stores its page cgroups may be provided by
memory8. So we can't offline memory8 now. We should offline the memory
blocks in the reverse order.

When the memory device is hot-removed, we automatically offline the memory
provided by this memory device. But we don't know which memory was onlined
first, so offlining may fail. In such a case, iterate twice to offline the
memory.
1st iteration: offline every non-primary memory block.
2nd iteration: offline the primary (i.e. first added) memory block.

This idea is suggested by KOSAKI Motohiro.
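
The two-pass strategy can be sketched as a user-space toy model (not kernel
code; the page-cgroup dependency between blocks is simulated, with every
non-primary block storing its page cgroups in the primary block):

```python
# Toy model of the two-pass offline in this patch (simulated dependencies).
# dep[i] is the block holding block i's page-cgroup data (None: none).
dep = {8: None, 9: 8, 10: 8, 11: 8}       # 8 is the primary block
online = {b: True for b in dep}

def offline_block(b):
    """Fail (like -EBUSY) while an online block still depends on b."""
    if any(online[o] and dep[o] == b for o in online):
        return -1
    online[b] = False
    return 0

def remove_memory(blocks):
    for return_on_error in (False, True):  # 1st pass ignores errors
        for b in blocks:
            if online[b] and offline_block(b) and return_on_error:
                return -1
        if not any(online.values()):
            return 0
    return 0

assert remove_memory(sorted(dep)) == 0     # primary falls in the 2nd pass
assert not any(online.values())
```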

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 mm/memory_hotplug.c |   16 ++++++++++++++--
 1 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d04ed87..62e04c9 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1388,10 +1388,13 @@ int remove_memory(u64 start, u64 size)
 	unsigned long start_pfn, end_pfn;
 	unsigned long pfn, section_nr;
 	int ret;
+	int return_on_error = 0;
+	int retry = 0;
 
 	start_pfn = PFN_DOWN(start);
 	end_pfn = start_pfn + PFN_DOWN(size);
 
+repeat:
 	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
 		section_nr = pfn_to_section_nr(pfn);
 		if (!present_section_nr(section_nr))
@@ -1410,14 +1413,23 @@ int remove_memory(u64 start, u64 size)
 
 		ret = offline_memory_block(mem);
 		if (ret) {
-			kobject_put(&mem->dev.kobj);
-			return ret;
+			if (return_on_error) {
+				kobject_put(&mem->dev.kobj);
+				return ret;
+			} else {
+				retry = 1;
+			}
 		}
 	}
 
 	if (mem)
 		kobject_put(&mem->dev.kobj);
 
+	if (retry) {
+		return_on_error = 1;
+		goto repeat;
+	}
+
 	return 0;
 }
 #else
-- 
1.7.1



* [PATCH v5 02/14] memory-hotplug: check whether all memory blocks are offlined or not when removing memory
  2012-12-24 12:09 [PATCH v5 00/14] memory-hotplug: hot-remove physical memory Tang Chen
  2012-12-24 12:09 ` [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence Tang Chen
@ 2012-12-24 12:09 ` Tang Chen
  2012-12-26  3:10   ` Kamezawa Hiroyuki
  2012-12-24 12:09 ` [PATCH v5 03/14] memory-hotplug: remove redundant codes Tang Chen
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 47+ messages in thread
From: Tang Chen @ 2012-12-24 12:09 UTC (permalink / raw)
  To: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen,
	hpa, linfeng, laijs, mgorman, yinghai
  Cc: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

We remove the memory like this:
1. lock memory hotplug
2. offline a memory block
3. unlock memory hotplug
4. repeat 1-3 to offline all memory blocks
5. lock memory hotplug
6. remove memory (TODO)
7. unlock memory hotplug

All memory blocks must be offlined before removing memory. But we don't hold
the lock across the whole operation, so we should check whether all memory
blocks are offlined before step 6. Otherwise, the kernel may panic.
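
The re-check can be sketched in user space (a toy model with a simulated lock
and made-up block states, not the kernel API):

```python
# Toy model of the final check: the offline loop drops the hotplug lock
# between blocks, so a block may be re-onlined behind our back; re-verify
# everything under the lock before the actual removal in step 6.
import threading

hotplug_lock = threading.Lock()
state = {8: "offline", 9: "offline", 10: "online"}   # made-up blocks

def is_memblock_offlined(block):
    return state[block] == "offline"

def remove_memory(blocks):
    with hotplug_lock:                 # steps 5-7 hold the lock
        for b in blocks:
            if not is_memblock_offlined(b):
                return -16             # -EBUSY: refuse to remove
        # ... the actual removal would happen here ...
        return 0

assert remove_memory([8, 9, 10]) == -16   # memory10 was re-onlined
state[10] = "offline"
assert remove_memory([8, 9, 10]) == 0
```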

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
---
 drivers/base/memory.c          |    6 +++++
 include/linux/memory_hotplug.h |    1 +
 mm/memory_hotplug.c            |   47 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 54 insertions(+), 0 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 987604d..8300a18 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -693,6 +693,12 @@ int offline_memory_block(struct memory_block *mem)
 	return ret;
 }
 
+/* return true if the memory block is offlined, otherwise, return false */
+bool is_memblock_offlined(struct memory_block *mem)
+{
+	return mem->state == MEM_OFFLINE;
+}
+
 /*
  * Initialize the sysfs support for memory devices...
  */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 4a45c4e..8dd0950 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -247,6 +247,7 @@ extern int add_memory(int nid, u64 start, u64 size);
 extern int arch_add_memory(int nid, u64 start, u64 size);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern int offline_memory_block(struct memory_block *mem);
+extern bool is_memblock_offlined(struct memory_block *mem);
 extern int remove_memory(u64 start, u64 size);
 extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 								int nr_pages);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 62e04c9..d43d97b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1430,6 +1430,53 @@ repeat:
 		goto repeat;
 	}
 
+	lock_memory_hotplug();
+
+	/*
+	 * we have offlined all memory blocks like this:
+	 *   1. lock memory hotplug
+	 *   2. offline a memory block
+	 *   3. unlock memory hotplug
+	 *
+	 * repeat step1-3 to offline the memory block. All memory blocks
+	 * must be offlined before removing memory. But we don't hold the
+	 * lock in the whole operation. So we should check whether all
+	 * memory blocks are offlined.
+	 */
+
+	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+		section_nr = pfn_to_section_nr(pfn);
+		if (!present_section_nr(section_nr))
+			continue;
+
+		section = __nr_to_section(section_nr);
+		/* same memblock? */
+		if (mem)
+			if ((section_nr >= mem->start_section_nr) &&
+			    (section_nr <= mem->end_section_nr))
+				continue;
+
+		mem = find_memory_block_hinted(section, mem);
+		if (!mem)
+			continue;
+
+		ret = is_memblock_offlined(mem);
+		if (!ret) {
+			pr_warn("removing memory fails, because memory "
+				"[%#010llx-%#010llx] is onlined\n",
+				PFN_PHYS(section_nr_to_pfn(mem->start_section_nr)),
+				PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1)) - 1);
+
+			kobject_put(&mem->dev.kobj);
+			unlock_memory_hotplug();
+			return ret;
+		}
+	}
+
+	if (mem)
+		kobject_put(&mem->dev.kobj);
+	unlock_memory_hotplug();
+
 	return 0;
 }
 #else
-- 
1.7.1



* [PATCH v5 03/14] memory-hotplug: remove redundant codes
  2012-12-24 12:09 [PATCH v5 00/14] memory-hotplug: hot-remove physical memory Tang Chen
  2012-12-24 12:09 ` [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence Tang Chen
  2012-12-24 12:09 ` [PATCH v5 02/14] memory-hotplug: check whether all memory blocks are offlined or not when removing memory Tang Chen
@ 2012-12-24 12:09 ` Tang Chen
  2012-12-26  3:20   ` Kamezawa Hiroyuki
  2012-12-24 12:09 ` [PATCH v5 04/14] memory-hotplug: remove /sys/firmware/memmap/X sysfs Tang Chen
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 47+ messages in thread
From: Tang Chen @ 2012-12-24 12:09 UTC (permalink / raw)
  To: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen,
	hpa, linfeng, laijs, mgorman, yinghai
  Cc: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux

From: Wen Congyang <wency@cn.fujitsu.com>

Offlining memory blocks and checking whether memory blocks are offlined
are very similar operations. This patch introduces a new function to
remove the redundant code.
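
The shape of the refactoring, one walker plus two callbacks instead of two
copies of the section-iteration loop, can be sketched as a toy model (the
names follow the patch, but the block objects are simulated):

```python
# Toy model of the walk_memory_range() refactoring (simulated blocks).
blocks = [{"id": 8, "state": "offline"}, {"id": 9, "state": "online"}]

def walk_memory_range(blocks, arg, func):
    for mem in blocks:                 # stands in for the section loop
        ret = func(mem, arg)
        if ret:
            return ret
    return 0

def offline_memory_block_cb(mem, arg):
    mem["state"] = "offline"           # pretend offlining succeeds
    return 0                           # (in the patch, errors go via *arg)

def is_memblock_offlined_cb(mem, arg):
    return 0 if mem["state"] == "offline" else 1

assert walk_memory_range(blocks, None, is_memblock_offlined_cb) == 1
assert walk_memory_range(blocks, None, offline_memory_block_cb) == 0
assert walk_memory_range(blocks, None, is_memblock_offlined_cb) == 0
```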

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 mm/memory_hotplug.c |  101 ++++++++++++++++++++++++++++-----------------------
 1 files changed, 55 insertions(+), 46 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d43d97b..dbb04d8 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1381,20 +1381,14 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ);
 }
 
-int remove_memory(u64 start, u64 size)
+static int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
+		void *arg, int (*func)(struct memory_block *, void *))
 {
 	struct memory_block *mem = NULL;
 	struct mem_section *section;
-	unsigned long start_pfn, end_pfn;
 	unsigned long pfn, section_nr;
 	int ret;
-	int return_on_error = 0;
-	int retry = 0;
-
-	start_pfn = PFN_DOWN(start);
-	end_pfn = start_pfn + PFN_DOWN(size);
 
-repeat:
 	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
 		section_nr = pfn_to_section_nr(pfn);
 		if (!present_section_nr(section_nr))
@@ -1411,22 +1405,61 @@ repeat:
 		if (!mem)
 			continue;
 
-		ret = offline_memory_block(mem);
+		ret = func(mem, arg);
 		if (ret) {
-			if (return_on_error) {
-				kobject_put(&mem->dev.kobj);
-				return ret;
-			} else {
-				retry = 1;
-			}
+			kobject_put(&mem->dev.kobj);
+			return ret;
 		}
 	}
 
 	if (mem)
 		kobject_put(&mem->dev.kobj);
 
-	if (retry) {
-		return_on_error = 1;
+	return 0;
+}
+
+static int offline_memory_block_cb(struct memory_block *mem, void *arg)
+{
+	int *ret = arg;
+	int error = offline_memory_block(mem);
+
+	if (error != 0 && *ret == 0)
+		*ret = error;
+
+	return 0;
+}
+
+static int is_memblock_offlined_cb(struct memory_block *mem, void *arg)
+{
+	int ret = !is_memblock_offlined(mem);
+
+	if (unlikely(ret))
+		pr_warn("removing memory fails, because memory "
+			"[%#010llx-%#010llx] is onlined\n",
+			PFN_PHYS(section_nr_to_pfn(mem->start_section_nr)),
+			PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1))-1);
+
+	return ret;
+}
+
+int remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn, end_pfn;
+	int ret = 0;
+	int retry = 1;
+
+	start_pfn = PFN_DOWN(start);
+	end_pfn = start_pfn + PFN_DOWN(size);
+
+repeat:
+	walk_memory_range(start_pfn, end_pfn, &ret,
+			  offline_memory_block_cb);
+	if (ret) {
+		if (!retry)
+			return ret;
+
+		retry = 0;
+		ret = 0;
 		goto repeat;
 	}
 
@@ -1444,37 +1477,13 @@ repeat:
 	 * memory blocks are offlined.
 	 */
 
-	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
-		section_nr = pfn_to_section_nr(pfn);
-		if (!present_section_nr(section_nr))
-			continue;
-
-		section = __nr_to_section(section_nr);
-		/* same memblock? */
-		if (mem)
-			if ((section_nr >= mem->start_section_nr) &&
-			    (section_nr <= mem->end_section_nr))
-				continue;
-
-		mem = find_memory_block_hinted(section, mem);
-		if (!mem)
-			continue;
-
-		ret = is_memblock_offlined(mem);
-		if (!ret) {
-			pr_warn("removing memory fails, because memory "
-				"[%#010llx-%#010llx] is onlined\n",
-				PFN_PHYS(section_nr_to_pfn(mem->start_section_nr)),
-				PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1)) - 1);
-
-			kobject_put(&mem->dev.kobj);
-			unlock_memory_hotplug();
-			return ret;
-		}
+	ret = walk_memory_range(start_pfn, end_pfn, NULL,
+				is_memblock_offlined_cb);
+	if (ret) {
+		unlock_memory_hotplug();
+		return ret;
 	}
 
-	if (mem)
-		kobject_put(&mem->dev.kobj);
 	unlock_memory_hotplug();
 
 	return 0;
-- 
1.7.1



* [PATCH v5 04/14] memory-hotplug: remove /sys/firmware/memmap/X sysfs
  2012-12-24 12:09 [PATCH v5 00/14] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (2 preceding siblings ...)
  2012-12-24 12:09 ` [PATCH v5 03/14] memory-hotplug: remove redundant codes Tang Chen
@ 2012-12-24 12:09 ` Tang Chen
  2012-12-26  3:30   ` Kamezawa Hiroyuki
  2012-12-24 12:09 ` [PATCH v5 05/14] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture Tang Chen
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 47+ messages in thread
From: Tang Chen @ 2012-12-24 12:09 UTC (permalink / raw)
  To: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen,
	hpa, linfeng, laijs, mgorman, yinghai
  Cc: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

When (hot)adding memory into the system, /sys/firmware/memmap/X/{end, start, type}
sysfs files are created. But there is no code to remove these files. This
patch implements the function to remove them.

Note: The code does not free a firmware_map_entry which was allocated from
      bootmem, so the patch introduces a memory leak. But I think the leaked
      size is very small, and it does not affect the system.
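
The entry bookkeeping can be modeled in user space (a sketch, not the kernel
code; the bootmem flag just marks entries whose memory can never be returned,
matching the note above):

```python
# Toy model of the firmware memmap entry list: entries are matched by
# (start, end, type); entries allocated from bootmem are unlinked from the
# list but their memory is never freed (the small leak described above).
from threading import Lock

map_entries = []
map_entries_lock = Lock()
leaked = []                            # stands in for unfreeable bootmem

def firmware_map_add(start, end, type_, bootmem=False):
    with map_entries_lock:
        map_entries.append({"start": start, "end": end,
                            "type": type_, "bootmem": bootmem})

def firmware_map_remove(start, end, type_):
    # As in the patch, the lookup uses end - 1 (inclusive range end).
    with map_entries_lock:
        for e in map_entries:
            if (e["start"], e["end"], e["type"]) == (start, end - 1, type_):
                map_entries.remove(e)
                if e["bootmem"]:
                    leaked.append(e)   # "release" can't free bootmem
                return 0
    return -22                         # -EINVAL: no such entry

firmware_map_add(0x40000000, 0x5fffffff, "System RAM", bootmem=True)
assert firmware_map_remove(0x40000000, 0x60000000, "System RAM") == 0
assert len(leaked) == 1
assert firmware_map_remove(0x40000000, 0x60000000, "System RAM") == -22
```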

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
---
 drivers/firmware/memmap.c    |   98 +++++++++++++++++++++++++++++++++++++++++-
 include/linux/firmware-map.h |    6 +++
 mm/memory_hotplug.c          |    5 ++-
 3 files changed, 106 insertions(+), 3 deletions(-)

diff --git a/drivers/firmware/memmap.c b/drivers/firmware/memmap.c
index 90723e6..49be12a 100644
--- a/drivers/firmware/memmap.c
+++ b/drivers/firmware/memmap.c
@@ -21,6 +21,7 @@
 #include <linux/types.h>
 #include <linux/bootmem.h>
 #include <linux/slab.h>
+#include <linux/mm.h>
 
 /*
  * Data types ------------------------------------------------------------------
@@ -41,6 +42,7 @@ struct firmware_map_entry {
 	const char		*type;	/* type of the memory range */
 	struct list_head	list;	/* entry for the linked list */
 	struct kobject		kobj;   /* kobject for each entry */
+	unsigned int		bootmem:1; /* allocated from bootmem */
 };
 
 /*
@@ -79,7 +81,26 @@ static const struct sysfs_ops memmap_attr_ops = {
 	.show = memmap_attr_show,
 };
 
+
+static inline struct firmware_map_entry *
+to_memmap_entry(struct kobject *kobj)
+{
+	return container_of(kobj, struct firmware_map_entry, kobj);
+}
+
+static void release_firmware_map_entry(struct kobject *kobj)
+{
+	struct firmware_map_entry *entry = to_memmap_entry(kobj);
+
+	if (entry->bootmem)
+		/* There is no way to free memory allocated from bootmem */
+		return;
+
+	kfree(entry);
+}
+
 static struct kobj_type memmap_ktype = {
+	.release	= release_firmware_map_entry,
 	.sysfs_ops	= &memmap_attr_ops,
 	.default_attrs	= def_attrs,
 };
@@ -94,6 +115,7 @@ static struct kobj_type memmap_ktype = {
  * in firmware initialisation code in one single thread of execution.
  */
 static LIST_HEAD(map_entries);
+static DEFINE_SPINLOCK(map_entries_lock);
 
 /**
  * firmware_map_add_entry() - Does the real work to add a firmware memmap entry.
@@ -118,11 +140,25 @@ static int firmware_map_add_entry(u64 start, u64 end,
 	INIT_LIST_HEAD(&entry->list);
 	kobject_init(&entry->kobj, &memmap_ktype);
 
+	spin_lock(&map_entries_lock);
 	list_add_tail(&entry->list, &map_entries);
+	spin_unlock(&map_entries_lock);
 
 	return 0;
 }
 
+/**
+ * firmware_map_remove_entry() - Does the real work to remove a firmware
+ * memmap entry.
+ * @entry: removed entry.
+ **/
+static inline void firmware_map_remove_entry(struct firmware_map_entry *entry)
+{
+	spin_lock(&map_entries_lock);
+	list_del(&entry->list);
+	spin_unlock(&map_entries_lock);
+}
+
 /*
  * Add memmap entry on sysfs
  */
@@ -144,6 +180,35 @@ static int add_sysfs_fw_map_entry(struct firmware_map_entry *entry)
 	return 0;
 }
 
+/*
+ * Remove memmap entry on sysfs
+ */
+static inline void remove_sysfs_fw_map_entry(struct firmware_map_entry *entry)
+{
+	kobject_put(&entry->kobj);
+}
+
+/*
+ * Search memmap entry
+ */
+
+static struct firmware_map_entry * __meminit
+firmware_map_find_entry(u64 start, u64 end, const char *type)
+{
+	struct firmware_map_entry *entry;
+
+	spin_lock(&map_entries_lock);
+	list_for_each_entry(entry, &map_entries, list)
+		if ((entry->start == start) && (entry->end == end) &&
+		    (!strcmp(entry->type, type))) {
+			spin_unlock(&map_entries_lock);
+			return entry;
+		}
+
+	spin_unlock(&map_entries_lock);
+	return NULL;
+}
+
 /**
  * firmware_map_add_hotplug() - Adds a firmware mapping entry when we do
  * memory hotplug.
@@ -193,9 +258,36 @@ int __init firmware_map_add_early(u64 start, u64 end, const char *type)
 	if (WARN_ON(!entry))
 		return -ENOMEM;
 
+	entry->bootmem = 1;
 	return firmware_map_add_entry(start, end, type, entry);
 }
 
+/**
+ * firmware_map_remove() - remove a firmware mapping entry
+ * @start: Start of the memory range.
+ * @end:   End of the memory range.
+ * @type:  Type of the memory range.
+ *
+ * removes a firmware mapping entry.
+ *
+ * Returns 0 on success, or -EINVAL if no entry.
+ **/
+int __meminit firmware_map_remove(u64 start, u64 end, const char *type)
+{
+	struct firmware_map_entry *entry;
+
+	entry = firmware_map_find_entry(start, end - 1, type);
+	if (!entry)
+		return -EINVAL;
+
+	firmware_map_remove_entry(entry);
+
+	/* remove the memmap entry */
+	remove_sysfs_fw_map_entry(entry);
+
+	return 0;
+}
+
 /*
  * Sysfs functions -------------------------------------------------------------
  */
@@ -217,8 +309,10 @@ static ssize_t type_show(struct firmware_map_entry *entry, char *buf)
 	return snprintf(buf, PAGE_SIZE, "%s\n", entry->type);
 }
 
-#define to_memmap_attr(_attr) container_of(_attr, struct memmap_attribute, attr)
-#define to_memmap_entry(obj) container_of(obj, struct firmware_map_entry, kobj)
+static inline struct memmap_attribute *to_memmap_attr(struct attribute *attr)
+{
+	return container_of(attr, struct memmap_attribute, attr);
+}
 
 static ssize_t memmap_attr_show(struct kobject *kobj,
 				struct attribute *attr, char *buf)
diff --git a/include/linux/firmware-map.h b/include/linux/firmware-map.h
index 43fe52f..71d4fa7 100644
--- a/include/linux/firmware-map.h
+++ b/include/linux/firmware-map.h
@@ -25,6 +25,7 @@
 
 int firmware_map_add_early(u64 start, u64 end, const char *type);
 int firmware_map_add_hotplug(u64 start, u64 end, const char *type);
+int firmware_map_remove(u64 start, u64 end, const char *type);
 
 #else /* CONFIG_FIRMWARE_MEMMAP */
 
@@ -38,6 +39,11 @@ static inline int firmware_map_add_hotplug(u64 start, u64 end, const char *type)
 	return 0;
 }
 
+static inline int firmware_map_remove(u64 start, u64 end, const char *type)
+{
+	return 0;
+}
+
 #endif /* CONFIG_FIRMWARE_MEMMAP */
 
 #endif /* _LINUX_FIRMWARE_MAP_H */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index dbb04d8..1f5b5bb 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1442,7 +1442,7 @@ static int is_memblock_offlined_cb(struct memory_block *mem, void *arg)
 	return ret;
 }
 
-int remove_memory(u64 start, u64 size)
+int __ref remove_memory(u64 start, u64 size)
 {
 	unsigned long start_pfn, end_pfn;
 	int ret = 0;
@@ -1484,6 +1484,9 @@ repeat:
 		return ret;
 	}
 
+	/* remove memmap entry */
+	firmware_map_remove(start, start + size, "System RAM");
+
 	unlock_memory_hotplug();
 
 	return 0;
-- 
1.7.1



* [PATCH v5 05/14] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture
  2012-12-24 12:09 [PATCH v5 00/14] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (3 preceding siblings ...)
  2012-12-24 12:09 ` [PATCH v5 04/14] memory-hotplug: remove /sys/firmware/memmap/X sysfs Tang Chen
@ 2012-12-24 12:09 ` Tang Chen
  2012-12-26  3:37   ` Kamezawa Hiroyuki
  2012-12-24 12:09 ` [PATCH v5 06/14] memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap Tang Chen
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 47+ messages in thread
From: Tang Chen @ 2012-12-24 12:09 UTC (permalink / raw)
  To: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen,
	hpa, linfeng, laijs, mgorman, yinghai
  Cc: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux

From: Wen Congyang <wency@cn.fujitsu.com>

For removing memory, we need to remove its page tables. But this depends
on the architecture, so this patch introduces arch_remove_memory() for
removing page tables. For now it only calls __remove_pages().

Note: __remove_pages() is not implemented for some architectures
      (I don't know how to implement it for s390).
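
The dispatch pattern, a generic remove path calling a per-arch hook, can be
sketched as a toy model (not kernel code; return values follow the patch,
where s390 has no hot-remove trigger and returns -EBUSY):

```python
# Toy model of the per-arch hook: most architectures forward to
# __remove_pages(); s390 returns -EBUSY because there is nothing to do.
def __remove_pages(start_pfn, nr_pages):
    return 0                           # pretend the unplug succeeds

def make_arch_remove_memory(arch):
    PAGE_SHIFT = 12
    def arch_remove_memory(start, size):
        if arch == "s390":
            return -16                 # -EBUSY: nothing to implement
        return __remove_pages(start >> PAGE_SHIFT, size >> PAGE_SHIFT)
    return arch_remove_memory

assert make_arch_remove_memory("x86_64")(0x40000000, 0x20000000) == 0
assert make_arch_remove_memory("s390")(0x40000000, 0x20000000) == -16
```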

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 arch/ia64/mm/init.c            |   18 ++++++++++++++++++
 arch/powerpc/mm/mem.c          |   12 ++++++++++++
 arch/s390/mm/init.c            |   12 ++++++++++++
 arch/sh/mm/init.c              |   17 +++++++++++++++++
 arch/tile/mm/init.c            |    8 ++++++++
 arch/x86/mm/init_32.c          |   12 ++++++++++++
 arch/x86/mm/init_64.c          |   15 +++++++++++++++
 include/linux/memory_hotplug.h |    1 +
 mm/memory_hotplug.c            |    2 ++
 9 files changed, 97 insertions(+), 0 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 082e383..e333822 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -689,6 +689,24 @@ int arch_add_memory(int nid, u64 start, u64 size)
 
 	return ret;
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct zone *zone;
+	int ret;
+
+	zone = page_zone(pfn_to_page(start_pfn));
+	ret = __remove_pages(zone, start_pfn, nr_pages);
+	if (ret)
+		pr_warn("%s: Problem encountered in __remove_pages() as"
+			" ret=%d\n", __func__,  ret);
+
+	return ret;
+}
+#endif
 #endif
 
 /*
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 0dba506..09c6451 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -133,6 +133,18 @@ int arch_add_memory(int nid, u64 start, u64 size)
 
 	return __add_pages(nid, zone, start_pfn, nr_pages);
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct zone *zone;
+
+	zone = page_zone(pfn_to_page(start_pfn));
+	return __remove_pages(zone, start_pfn, nr_pages);
+}
+#endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 /*
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index ae672f4..49ce6bb 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -228,4 +228,16 @@ int arch_add_memory(int nid, u64 start, u64 size)
 		vmem_remove_mapping(start, size);
 	return rc;
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+	/*
+	 * There is no hardware or firmware interface which could trigger a
+	 * hot memory remove on s390. So there is nothing that needs to be
+	 * implemented.
+	 */
+	return -EBUSY;
+}
+#endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 82cc576..1057940 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -558,4 +558,21 @@ int memory_add_physaddr_to_nid(u64 addr)
 EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 #endif
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct zone *zone;
+	int ret;
+
+	zone = page_zone(pfn_to_page(start_pfn));
+	ret = __remove_pages(zone, start_pfn, nr_pages);
+	if (unlikely(ret))
+		pr_warn("%s: Failed, __remove_pages() == %d\n", __func__,
+			ret);
+
+	return ret;
+}
+#endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/tile/mm/init.c b/arch/tile/mm/init.c
index ef29d6c..2749515 100644
--- a/arch/tile/mm/init.c
+++ b/arch/tile/mm/init.c
@@ -935,6 +935,14 @@ int remove_memory(u64 start, u64 size)
 {
 	return -EINVAL;
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+	/* TODO */
+	return -EBUSY;
+}
+#endif
 #endif
 
 struct kmem_cache *pgd_cache;
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 745d66b..3166e78 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -836,6 +836,18 @@ int arch_add_memory(int nid, u64 start, u64 size)
 
 	return __add_pages(nid, zone, start_pfn, nr_pages);
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct zone *zone;
+
+	zone = page_zone(pfn_to_page(start_pfn));
+	return __remove_pages(zone, start_pfn, nr_pages);
+}
+#endif
 #endif
 
 /*
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index e779e0b..f78509c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -682,6 +682,21 @@ int arch_add_memory(int nid, u64 start, u64 size)
 }
 EXPORT_SYMBOL_GPL(arch_add_memory);
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int __ref arch_remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct zone *zone;
+	int ret;
+
+	zone = page_zone(pfn_to_page(start_pfn));
+	ret = __remove_pages(zone, start_pfn, nr_pages);
+	WARN_ON_ONCE(ret);
+
+	return ret;
+}
+#endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 static struct kcore_list kcore_vsyscall;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 8dd0950..31a563b 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -96,6 +96,7 @@ extern void __online_page_free(struct page *page);
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
 extern bool is_pageblock_removable_nolock(struct page *page);
+extern int arch_remove_memory(u64 start, u64 size);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
 /* reasonably generic interface to expand the physical pages in a zone  */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 1f5b5bb..2c5d734 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1487,6 +1487,8 @@ repeat:
 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");
 
+	arch_remove_memory(start, size);
+
 	unlock_memory_hotplug();
 
 	return 0;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v5 06/14] memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap
  2012-12-24 12:09 [PATCH v5 00/14] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (4 preceding siblings ...)
  2012-12-24 12:09 ` [PATCH v5 05/14] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture Tang Chen
@ 2012-12-24 12:09 ` Tang Chen
  2012-12-25  8:09   ` Jianguo Wu
  2012-12-24 12:09 ` [PATCH v5 07/14] memory-hotplug: move pgdat_resize_lock into sparse_remove_one_section() Tang Chen
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 47+ messages in thread
From: Tang Chen @ 2012-12-24 12:09 UTC (permalink / raw)
  To: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen,
	hpa, linfeng, laijs, mgorman, yinghai
  Cc: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

To remove a memmap region of sparse-vmemmap that was allocated from bootmem,
the region's pages need to be registered with get_page_bootmem().
So the patch walks the page tables backing the virtual mapping and registers
each page it finds with get_page_bootmem().

Note: register_page_bootmem_memmap() is not implemented for ia64, ppc, s390,
and sparc.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
---
 arch/ia64/mm/discontig.c       |    6 ++++
 arch/powerpc/mm/init_64.c      |    6 ++++
 arch/s390/mm/vmem.c            |    6 ++++
 arch/sparc/mm/init_64.c        |    6 ++++
 arch/x86/mm/init_64.c          |   52 ++++++++++++++++++++++++++++++++++++++++
 include/linux/memory_hotplug.h |   11 +-------
 include/linux/mm.h             |    3 +-
 mm/memory_hotplug.c            |   33 ++++++++++++++++++++++---
 8 files changed, 109 insertions(+), 14 deletions(-)

diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index c641333..33943db 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -822,4 +822,10 @@ int __meminit vmemmap_populate(struct page *start_page,
 {
 	return vmemmap_populate_basepages(start_page, size, node);
 }
+
+void register_page_bootmem_memmap(unsigned long section_nr,
+				  struct page *start_page, unsigned long size)
+{
+	/* TODO */
+}
 #endif
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 95a4529..6466440 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -297,5 +297,11 @@ int __meminit vmemmap_populate(struct page *start_page,
 
 	return 0;
 }
+
+void register_page_bootmem_memmap(unsigned long section_nr,
+				  struct page *start_page, unsigned long size)
+{
+	/* TODO */
+}
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 6ed1426..2c14bc2 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -272,6 +272,12 @@ out:
 	return ret;
 }
 
+void register_page_bootmem_memmap(unsigned long section_nr,
+				  struct page *start_page, unsigned long size)
+{
+	/* TODO */
+}
+
 /*
  * Add memory segment to the segment list if it doesn't overlap with
  * an already present segment.
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 85be1ca..7e28c9e 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2231,6 +2231,12 @@ void __meminit vmemmap_populate_print_last(void)
 		node_start = 0;
 	}
 }
+
+void register_page_bootmem_memmap(unsigned long section_nr,
+				  struct page *start_page, unsigned long size)
+{
+	/* TODO */
+}
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
 static void prot_init_common(unsigned long page_none,
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index f78509c..aeaa27e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1000,6 +1000,58 @@ vmemmap_populate(struct page *start_page, unsigned long size, int node)
 	return 0;
 }
 
+void register_page_bootmem_memmap(unsigned long section_nr,
+				  struct page *start_page, unsigned long size)
+{
+	unsigned long addr = (unsigned long)start_page;
+	unsigned long end = (unsigned long)(start_page + size);
+	unsigned long next;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	for (; addr < end; addr = next) {
+		pte_t *pte = NULL;
+
+		pgd = pgd_offset_k(addr);
+		if (pgd_none(*pgd)) {
+			next = (addr + PAGE_SIZE) & PAGE_MASK;
+			continue;
+		}
+		get_page_bootmem(section_nr, pgd_page(*pgd), MIX_SECTION_INFO);
+
+		pud = pud_offset(pgd, addr);
+		if (pud_none(*pud)) {
+			next = (addr + PAGE_SIZE) & PAGE_MASK;
+			continue;
+		}
+		get_page_bootmem(section_nr, pud_page(*pud), MIX_SECTION_INFO);
+
+		if (!cpu_has_pse) {
+			next = (addr + PAGE_SIZE) & PAGE_MASK;
+			pmd = pmd_offset(pud, addr);
+			if (pmd_none(*pmd))
+				continue;
+			get_page_bootmem(section_nr, pmd_page(*pmd),
+					 MIX_SECTION_INFO);
+
+			pte = pte_offset_kernel(pmd, addr);
+			if (pte_none(*pte))
+				continue;
+			get_page_bootmem(section_nr, pte_page(*pte),
+					 SECTION_INFO);
+		} else {
+			next = pmd_addr_end(addr, end);
+
+			pmd = pmd_offset(pud, addr);
+			if (pmd_none(*pmd))
+				continue;
+			get_page_bootmem(section_nr, pmd_page(*pmd),
+					 SECTION_INFO);
+		}
+	}
+}
+
 void __meminit vmemmap_populate_print_last(void)
 {
 	if (p_start) {
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 31a563b..2441f36 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -174,17 +174,10 @@ static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
 #endif /* CONFIG_NUMA */
 #endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
 
-#ifdef CONFIG_SPARSEMEM_VMEMMAP
-static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
-{
-}
-static inline void put_page_bootmem(struct page *page)
-{
-}
-#else
 extern void register_page_bootmem_info_node(struct pglist_data *pgdat);
 extern void put_page_bootmem(struct page *page);
-#endif
+extern void get_page_bootmem(unsigned long info, struct page *page,
+			     unsigned long type);
 
 /*
  * Lock for memory hotplug guarantees 1) all callbacks for memory hotplug
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6320407..1eca498 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1709,7 +1709,8 @@ int vmemmap_populate_basepages(struct page *start_page,
 						unsigned long pages, int node);
 int vmemmap_populate(struct page *start_page, unsigned long pages, int node);
 void vmemmap_populate_print_last(void);
-
+void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
+				  unsigned long size);
 
 enum mf_flags {
 	MF_COUNT_INCREASED = 1 << 0,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2c5d734..34c656b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -91,9 +91,8 @@ static void release_memory_resource(struct resource *res)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
-#ifndef CONFIG_SPARSEMEM_VMEMMAP
-static void get_page_bootmem(unsigned long info,  struct page *page,
-			     unsigned long type)
+void get_page_bootmem(unsigned long info,  struct page *page,
+		      unsigned long type)
 {
 	page->lru.next = (struct list_head *) type;
 	SetPagePrivate(page);
@@ -128,6 +127,7 @@ void __ref put_page_bootmem(struct page *page)
 
 }
 
+#ifndef CONFIG_SPARSEMEM_VMEMMAP
 static void register_page_bootmem_info_section(unsigned long start_pfn)
 {
 	unsigned long *usemap, mapsize, section_nr, i;
@@ -161,6 +161,32 @@ static void register_page_bootmem_info_section(unsigned long start_pfn)
 		get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
 
 }
+#else
+static void register_page_bootmem_info_section(unsigned long start_pfn)
+{
+	unsigned long *usemap, mapsize, section_nr, i;
+	struct mem_section *ms;
+	struct page *page, *memmap;
+
+	if (!pfn_valid(start_pfn))
+		return;
+
+	section_nr = pfn_to_section_nr(start_pfn);
+	ms = __nr_to_section(section_nr);
+
+	memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
+
+	register_page_bootmem_memmap(section_nr, memmap, PAGES_PER_SECTION);
+
+	usemap = __nr_to_section(section_nr)->pageblock_flags;
+	page = virt_to_page(usemap);
+
+	mapsize = PAGE_ALIGN(usemap_size()) >> PAGE_SHIFT;
+
+	for (i = 0; i < mapsize; i++, page++)
+		get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
+}
+#endif
 
 void register_page_bootmem_info_node(struct pglist_data *pgdat)
 {
@@ -203,7 +229,6 @@ void register_page_bootmem_info_node(struct pglist_data *pgdat)
 			register_page_bootmem_info_section(pfn);
 	}
 }
-#endif /* !CONFIG_SPARSEMEM_VMEMMAP */
 
 static void grow_zone_span(struct zone *zone, unsigned long start_pfn,
 			   unsigned long end_pfn)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v5 07/14] memory-hotplug: move pgdat_resize_lock into sparse_remove_one_section()
  2012-12-24 12:09 [PATCH v5 00/14] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (5 preceding siblings ...)
  2012-12-24 12:09 ` [PATCH v5 06/14] memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap Tang Chen
@ 2012-12-24 12:09 ` Tang Chen
  2012-12-26  3:47   ` Kamezawa Hiroyuki
  2012-12-24 12:09 ` [PATCH v5 08/14] memory-hotplug: Common APIs to support page tables hot-remove Tang Chen
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 47+ messages in thread
From: Tang Chen @ 2012-12-24 12:09 UTC (permalink / raw)
  To: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen,
	hpa, linfeng, laijs, mgorman, yinghai
  Cc: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux

In __remove_section(), we took pgdat_resize_lock around the whole call to
sparse_remove_one_section(). This lock disables irqs, but we don't need it
held for the whole function. Once free_section_usemap() starts freeing
pagetables, it has to call flush_tlb_all(), which requires irqs to be
enabled; otherwise the WARN_ON_ONCE() in smp_call_function_many() is
triggered.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 mm/memory_hotplug.c |    4 ----
 mm/sparse.c         |    5 ++++-
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 34c656b..c12bd55 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -442,8 +442,6 @@ static int __remove_section(struct zone *zone, struct mem_section *ms)
 #else
 static int __remove_section(struct zone *zone, struct mem_section *ms)
 {
-	unsigned long flags;
-	struct pglist_data *pgdat = zone->zone_pgdat;
 	int ret = -EINVAL;
 
 	if (!valid_section(ms))
@@ -453,9 +451,7 @@ static int __remove_section(struct zone *zone, struct mem_section *ms)
 	if (ret)
 		return ret;
 
-	pgdat_resize_lock(pgdat, &flags);
 	sparse_remove_one_section(zone, ms);
-	pgdat_resize_unlock(pgdat, &flags);
 	return 0;
 }
 #endif
diff --git a/mm/sparse.c b/mm/sparse.c
index aadbb2a..05ca73a 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -796,8 +796,10 @@ static inline void clear_hwpoisoned_pages(struct page *memmap, int nr_pages)
 void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 {
 	struct page *memmap = NULL;
-	unsigned long *usemap = NULL;
+	unsigned long *usemap = NULL, flags;
+	struct pglist_data *pgdat = zone->zone_pgdat;
 
+	pgdat_resize_lock(pgdat, &flags);
 	if (ms->section_mem_map) {
 		usemap = ms->pageblock_flags;
 		memmap = sparse_decode_mem_map(ms->section_mem_map,
@@ -805,6 +807,7 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 		ms->section_mem_map = 0;
 		ms->pageblock_flags = NULL;
 	}
+	pgdat_resize_unlock(pgdat, &flags);
 
 	clear_hwpoisoned_pages(memmap, PAGES_PER_SECTION);
 	free_section_usemap(memmap, usemap);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v5 08/14] memory-hotplug: Common APIs to support page tables hot-remove
  2012-12-24 12:09 [PATCH v5 00/14] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (6 preceding siblings ...)
  2012-12-24 12:09 ` [PATCH v5 07/14] memory-hotplug: move pgdat_resize_lock into sparse_remove_one_section() Tang Chen
@ 2012-12-24 12:09 ` Tang Chen
  2012-12-25  8:17   ` Jianguo Wu
  2012-12-24 12:09 ` [PATCH v5 09/14] memory-hotplug: remove page table of x86_64 architecture Tang Chen
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 47+ messages in thread
From: Tang Chen @ 2012-12-24 12:09 UTC (permalink / raw)
  To: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen,
	hpa, linfeng, laijs, mgorman, yinghai
  Cc: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux

From: Wen Congyang <wency@cn.fujitsu.com>

When memory is removed, the corresponding pagetables should also be removed.
This patch introduces some common APIs to support removing vmemmap pagetables
and x86_64 direct-mapping pagetables.

A page used as a pagetable cannot be freed if it maps not only the removed
memory but also other memory. So the patch uses the following scheme to
decide whether such a page can be freed:

 1. When removing memory, the parts of the virtual mapping covering the
    removed memory are filled with 0xFD.
 2. When a PT/PMD page is wholly filled with 0xFD, the PT/PMD can be cleared.
    In this case, the page used as PT/PMD can be freed.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/include/asm/pgtable_types.h |    1 +
 arch/x86/mm/init_64.c                |  297 ++++++++++++++++++++++++++++++++++
 arch/x86/mm/pageattr.c               |   47 +++---
 include/linux/bootmem.h              |    1 +
 4 files changed, 324 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 3c32db8..4b6fd2a 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -352,6 +352,7 @@ static inline void update_page_count(int level, unsigned long pages) { }
  * as a pte too.
  */
 extern pte_t *lookup_address(unsigned long address, unsigned int *level);
+extern int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase);
 
 #endif	/* !__ASSEMBLY__ */
 
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index aeaa27e..b30df3c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -682,6 +682,303 @@ int arch_add_memory(int nid, u64 start, u64 size)
 }
 EXPORT_SYMBOL_GPL(arch_add_memory);
 
+#define PAGE_INUSE 0xFD
+
+static void __meminit free_pagetable(struct page *page, int order)
+{
+	struct zone *zone;
+	bool bootmem = false;
+	unsigned long magic;
+
+	/* bootmem page has reserved flag */
+	if (PageReserved(page)) {
+		__ClearPageReserved(page);
+		bootmem = true;
+
+		magic = (unsigned long)page->lru.next;
+		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO)
+			put_page_bootmem(page);
+		else
+			__free_pages_bootmem(page, order);
+	} else
+		free_pages((unsigned long)page_address(page), order);
+
+	/*
+	 * SECTION_INFO pages and MIX_SECTION_INFO pages
+	 * are all allocated by bootmem.
+	 */
+	if (bootmem) {
+		zone = page_zone(page);
+		zone_span_writelock(zone);
+		zone->present_pages++;
+		zone_span_writeunlock(zone);
+		totalram_pages++;
+	}
+}
+
+static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
+{
+	pte_t *pte;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PTE; i++) {
+		pte = pte_start + i;
+		if (pte_val(*pte))
+			return;
+	}
+
+	/* free a pte table */
+	free_pagetable(pmd_page(*pmd), 0);
+	spin_lock(&init_mm.page_table_lock);
+	pmd_clear(pmd);
+	spin_unlock(&init_mm.page_table_lock);
+}
+
+static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
+{
+	pmd_t *pmd;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		pmd = pmd_start + i;
+		if (pmd_val(*pmd))
+			return;
+	}
+
+	/* free a pmd table */
+	free_pagetable(pud_page(*pud), 0);
+	spin_lock(&init_mm.page_table_lock);
+	pud_clear(pud);
+	spin_unlock(&init_mm.page_table_lock);
+}
+
+/* Return true if pgd is changed, otherwise return false. */
+static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
+{
+	pud_t *pud;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		pud = pud_start + i;
+		if (pud_val(*pud))
+			return false;
+	}
+
+	/* free a pud table */
+	free_pagetable(pgd_page(*pgd), 0);
+	spin_lock(&init_mm.page_table_lock);
+	pgd_clear(pgd);
+	spin_unlock(&init_mm.page_table_lock);
+
+	return true;
+}
+
+static void __meminit
+remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
+		 bool direct)
+{
+	unsigned long next, pages = 0;
+	pte_t *pte;
+	void *page_addr;
+	phys_addr_t phys_addr;
+
+	pte = pte_start + pte_index(addr);
+	for (; addr < end; addr = next, pte++) {
+		next = (addr + PAGE_SIZE) & PAGE_MASK;
+		if (next > end)
+			next = end;
+
+		if (!pte_present(*pte))
+			continue;
+
+		/*
+		 * We mapped [0,1G) memory as identity mapping when
+		 * initializing, in arch/x86/kernel/head_64.S. These
+		 * pagetables cannot be removed.
+		 */
+		phys_addr = pte_val(*pte) + (addr & PAGE_MASK);
+		if (phys_addr < (phys_addr_t)0x40000000)
+			return;
+
+		if (IS_ALIGNED(addr, PAGE_SIZE) &&
+		    IS_ALIGNED(next, PAGE_SIZE)) {
+			if (!direct) {
+				free_pagetable(pte_page(*pte), 0);
+				pages++;
+			}
+
+			spin_lock(&init_mm.page_table_lock);
+			pte_clear(&init_mm, addr, pte);
+			spin_unlock(&init_mm.page_table_lock);
+		} else {
+			/*
+			 * If we are not removing the whole page, it means
+			 * other ptes in this page are being used and we cannot
+			 * remove them. So fill the unused ptes with 0xFD, and
+			 * remove the page when it is wholly filled with 0xFD.
+			 */
+			memset((void *)addr, PAGE_INUSE, next - addr);
+			page_addr = page_address(pte_page(*pte));
+
+			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
+				free_pagetable(pte_page(*pte), 0);
+				pages++;
+
+				spin_lock(&init_mm.page_table_lock);
+				pte_clear(&init_mm, addr, pte);
+				spin_unlock(&init_mm.page_table_lock);
+			}
+		}
+	}
+
+	/* Call free_pte_table() in remove_pmd_table(). */
+	flush_tlb_all();
+	if (direct)
+		update_page_count(PG_LEVEL_4K, -pages);
+}
+
+static void __meminit
+remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
+		 bool direct)
+{
+	unsigned long pte_phys, next, pages = 0;
+	pte_t *pte_base;
+	pmd_t *pmd;
+
+	pmd = pmd_start + pmd_index(addr);
+	for (; addr < end; addr = next, pmd++) {
+		next = pmd_addr_end(addr, end);
+
+		if (!pmd_present(*pmd))
+			continue;
+
+		if (pmd_large(*pmd)) {
+			if (IS_ALIGNED(addr, PMD_SIZE) &&
+			    IS_ALIGNED(next, PMD_SIZE)) {
+				if (!direct) {
+					free_pagetable(pmd_page(*pmd),
+						       get_order(PMD_SIZE));
+					pages++;
+				}
+
+				spin_lock(&init_mm.page_table_lock);
+				pmd_clear(pmd);
+				spin_unlock(&init_mm.page_table_lock);
+				continue;
+			}
+
+			/*
+			 * We use 2M page, but we need to remove part of them,
+			 * so split 2M page to 4K page.
+			 */
+			pte_base = (pte_t *)alloc_low_page(&pte_phys);
+			BUG_ON(!pte_base);
+			__split_large_page((pte_t *)pmd, addr,
+					   (pte_t *)pte_base);
+
+			spin_lock(&init_mm.page_table_lock);
+			pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
+			spin_unlock(&init_mm.page_table_lock);
+
+			flush_tlb_all();
+		}
+
+		pte_base = (pte_t *)map_low_page((pte_t *)pmd_page_vaddr(*pmd));
+		remove_pte_table(pte_base, addr, next, direct);
+		free_pte_table(pte_base, pmd);
+		unmap_low_page(pte_base);
+	}
+
+	/* Call free_pmd_table() in remove_pud_table(). */
+	if (direct)
+		update_page_count(PG_LEVEL_2M, -pages);
+}
+
+static void __meminit
+remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
+		 bool direct)
+{
+	unsigned long pmd_phys, next, pages = 0;
+	pmd_t *pmd_base;
+	pud_t *pud;
+
+	pud = pud_start + pud_index(addr);
+	for (; addr < end; addr = next, pud++) {
+		next = pud_addr_end(addr, end);
+
+		if (!pud_present(*pud))
+			continue;
+
+		if (pud_large(*pud)) {
+			if (IS_ALIGNED(addr, PUD_SIZE) &&
+			    IS_ALIGNED(next, PUD_SIZE)) {
+				if (!direct) {
+					free_pagetable(pud_page(*pud),
+						       get_order(PUD_SIZE));
+					pages++;
+				}
+
+				spin_lock(&init_mm.page_table_lock);
+				pud_clear(pud);
+				spin_unlock(&init_mm.page_table_lock);
+				continue;
+			}
+
+			/*
+			 * We use 1G page, but we need to remove part of them,
+			 * so split 1G page to 2M page.
+			 */
+			pmd_base = (pmd_t *)alloc_low_page(&pmd_phys);
+			BUG_ON(!pmd_base);
+			__split_large_page((pte_t *)pud, addr,
+					   (pte_t *)pmd_base);
+
+			spin_lock(&init_mm.page_table_lock);
+			pud_populate(&init_mm, pud, __va(pmd_phys));
+			spin_unlock(&init_mm.page_table_lock);
+
+			flush_tlb_all();
+		}
+
+		pmd_base = (pmd_t *)map_low_page((pmd_t *)pud_page_vaddr(*pud));
+		remove_pmd_table(pmd_base, addr, next, direct);
+		free_pmd_table(pmd_base, pud);
+		unmap_low_page(pmd_base);
+	}
+
+	if (direct)
+		update_page_count(PG_LEVEL_1G, -pages);
+}
+
+/* start and end are both virtual address. */
+static void __meminit
+remove_pagetable(unsigned long start, unsigned long end, bool direct)
+{
+	unsigned long next;
+	pgd_t *pgd;
+	pud_t *pud;
+	bool pgd_changed = false;
+
+	for (; start < end; start = next) {
+		pgd = pgd_offset_k(start);
+		if (!pgd_present(*pgd))
+			continue;
+
+		next = pgd_addr_end(start, end);
+
+		pud = (pud_t *)map_low_page((pud_t *)pgd_page_vaddr(*pgd));
+		remove_pud_table(pud, start, next, direct);
+		if (free_pud_table(pud, pgd))
+			pgd_changed = true;
+		unmap_low_page(pud);
+	}
+
+	if (pgd_changed)
+		sync_global_pgds(start, end - 1);
+
+	flush_tlb_all();
+}
+
 #ifdef CONFIG_MEMORY_HOTREMOVE
 int __ref arch_remove_memory(u64 start, u64 size)
 {
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index a718e0d..7dcb6f9 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -501,21 +501,13 @@ out_unlock:
 	return do_split;
 }
 
-static int split_large_page(pte_t *kpte, unsigned long address)
+int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
 {
 	unsigned long pfn, pfninc = 1;
 	unsigned int i, level;
-	pte_t *pbase, *tmp;
+	pte_t *tmp;
 	pgprot_t ref_prot;
-	struct page *base;
-
-	if (!debug_pagealloc)
-		spin_unlock(&cpa_lock);
-	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
-	if (!debug_pagealloc)
-		spin_lock(&cpa_lock);
-	if (!base)
-		return -ENOMEM;
+	struct page *base = virt_to_page(pbase);
 
 	spin_lock(&pgd_lock);
 	/*
@@ -523,10 +515,11 @@ static int split_large_page(pte_t *kpte, unsigned long address)
 	 * up for us already:
 	 */
 	tmp = lookup_address(address, &level);
-	if (tmp != kpte)
-		goto out_unlock;
+	if (tmp != kpte) {
+		spin_unlock(&pgd_lock);
+		return 1;
+	}
 
-	pbase = (pte_t *)page_address(base);
 	paravirt_alloc_pte(&init_mm, page_to_pfn(base));
 	ref_prot = pte_pgprot(pte_clrhuge(*kpte));
 	/*
@@ -579,17 +572,27 @@ static int split_large_page(pte_t *kpte, unsigned long address)
 	 * going on.
 	 */
 	__flush_tlb_all();
+	spin_unlock(&pgd_lock);
 
-	base = NULL;
+	return 0;
+}
 
-out_unlock:
-	/*
-	 * If we dropped out via the lookup_address check under
-	 * pgd_lock then stick the page back into the pool:
-	 */
-	if (base)
+static int split_large_page(pte_t *kpte, unsigned long address)
+{
+	pte_t *pbase;
+	struct page *base;
+
+	if (!debug_pagealloc)
+		spin_unlock(&cpa_lock);
+	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
+	if (!debug_pagealloc)
+		spin_lock(&cpa_lock);
+	if (!base)
+		return -ENOMEM;
+
+	pbase = (pte_t *)page_address(base);
+	if (__split_large_page(kpte, address, pbase))
 		__free_page(base);
-	spin_unlock(&pgd_lock);
 
 	return 0;
 }
diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index 3f778c2..190ff06 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -53,6 +53,7 @@ extern void free_bootmem_node(pg_data_t *pgdat,
 			      unsigned long size);
 extern void free_bootmem(unsigned long physaddr, unsigned long size);
 extern void free_bootmem_late(unsigned long physaddr, unsigned long size);
+extern void __free_pages_bootmem(struct page *page, unsigned int order);
 
 /*
  * Flags for reserve_bootmem (also if CONFIG_HAVE_ARCH_BOOTMEM_NODE,
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v5 09/14] memory-hotplug: remove page table of x86_64 architecture
  2012-12-24 12:09 [PATCH v5 00/14] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (7 preceding siblings ...)
  2012-12-24 12:09 ` [PATCH v5 08/14] memory-hotplug: Common APIs to support page tables hot-remove Tang Chen
@ 2012-12-24 12:09 ` Tang Chen
  2012-12-24 12:09 ` [PATCH v5 10/14] memory-hotplug: remove memmap of sparse-vmemmap Tang Chen
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 47+ messages in thread
From: Tang Chen @ 2012-12-24 12:09 UTC (permalink / raw)
  To: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen,
	hpa, linfeng, laijs, mgorman, yinghai
  Cc: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux

This patch walks the page tables covering the removed memory and clears
them, for the x86_64 architecture.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/mm/init_64.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index b30df3c..4b160d8 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -979,6 +979,15 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
 	flush_tlb_all();
 }
 
+void __meminit
+kernel_physical_mapping_remove(unsigned long start, unsigned long end)
+{
+	start = (unsigned long)__va(start);
+	end = (unsigned long)__va(end);
+
+	remove_pagetable(start, end, true);
+}
+
 #ifdef CONFIG_MEMORY_HOTREMOVE
 int __ref arch_remove_memory(u64 start, u64 size)
 {
@@ -988,6 +997,7 @@ int __ref arch_remove_memory(u64 start, u64 size)
 	int ret;
 
 	zone = page_zone(pfn_to_page(start_pfn));
+	kernel_physical_mapping_remove(start, start + size);
 	ret = __remove_pages(zone, start_pfn, nr_pages);
 	WARN_ON_ONCE(ret);
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v5 10/14] memory-hotplug: remove memmap of sparse-vmemmap
  2012-12-24 12:09 [PATCH v5 00/14] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (8 preceding siblings ...)
  2012-12-24 12:09 ` [PATCH v5 09/14] memory-hotplug: remove page table of x86_64 architecture Tang Chen
@ 2012-12-24 12:09 ` Tang Chen
  2012-12-24 12:09 ` [PATCH v5 11/14] memory-hotplug: Integrated __remove_section() of CONFIG_SPARSEMEM_VMEMMAP Tang Chen
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 47+ messages in thread
From: Tang Chen @ 2012-12-24 12:09 UTC (permalink / raw)
  To: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen,
	hpa, linfeng, laijs, mgorman, yinghai
  Cc: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux

This patch introduces a new API, vmemmap_free(), to free and remove
vmemmap pagetables. Since pagetable implementations differ, each
architecture has to provide its own version of vmemmap_free(), just
like vmemmap_populate().

Note: vmemmap_free() is not implemented for ia64, ppc, s390, and sparc.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/arm64/mm/mmu.c       |    3 +++
 arch/ia64/mm/discontig.c  |    4 ++++
 arch/powerpc/mm/init_64.c |    4 ++++
 arch/s390/mm/vmem.c       |    4 ++++
 arch/sparc/mm/init_64.c   |    4 ++++
 arch/x86/mm/init_64.c     |    8 ++++++++
 include/linux/mm.h        |    1 +
 mm/sparse.c               |    3 ++-
 8 files changed, 30 insertions(+), 1 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index a6885d8..9834886 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -392,4 +392,7 @@ int __meminit vmemmap_populate(struct page *start_page,
 	return 0;
 }
 #endif	/* CONFIG_ARM64_64K_PAGES */
+void vmemmap_free(struct page *memmap, unsigned long nr_pages)
+{
+}
 #endif	/* CONFIG_SPARSEMEM_VMEMMAP */
diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index 33943db..882a0fd 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -823,6 +823,10 @@ int __meminit vmemmap_populate(struct page *start_page,
 	return vmemmap_populate_basepages(start_page, size, node);
 }
 
+void vmemmap_free(struct page *memmap, unsigned long nr_pages)
+{
+}
+
 void register_page_bootmem_memmap(unsigned long section_nr,
 				  struct page *start_page, unsigned long size)
 {
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 6466440..2969591 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -298,6 +298,10 @@ int __meminit vmemmap_populate(struct page *start_page,
 	return 0;
 }
 
+void vmemmap_free(struct page *memmap, unsigned long nr_pages)
+{
+}
+
 void register_page_bootmem_memmap(unsigned long section_nr,
 				  struct page *start_page, unsigned long size)
 {
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 2c14bc2..81e6ba3 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -272,6 +272,10 @@ out:
 	return ret;
 }
 
+void vmemmap_free(struct page *memmap, unsigned long nr_pages)
+{
+}
+
 void register_page_bootmem_memmap(unsigned long section_nr,
 				  struct page *start_page, unsigned long size)
 {
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 7e28c9e..9cd1ec0 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2232,6 +2232,10 @@ void __meminit vmemmap_populate_print_last(void)
 	}
 }
 
+void vmemmap_free(struct page *memmap, unsigned long nr_pages)
+{
+}
+
 void register_page_bootmem_memmap(unsigned long section_nr,
 				  struct page *start_page, unsigned long size)
 {
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 4b160d8..029c0b9 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1307,6 +1307,14 @@ vmemmap_populate(struct page *start_page, unsigned long size, int node)
 	return 0;
 }
 
+void __ref vmemmap_free(struct page *memmap, unsigned long nr_pages)
+{
+	unsigned long start = (unsigned long)memmap;
+	unsigned long end = (unsigned long)(memmap + nr_pages);
+
+	remove_pagetable(start, end, false);
+}
+
 void register_page_bootmem_memmap(unsigned long section_nr,
 				  struct page *start_page, unsigned long size)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1eca498..31d5e5d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1709,6 +1709,7 @@ int vmemmap_populate_basepages(struct page *start_page,
 						unsigned long pages, int node);
 int vmemmap_populate(struct page *start_page, unsigned long pages, int node);
 void vmemmap_populate_print_last(void);
+void vmemmap_free(struct page *memmap, unsigned long nr_pages);
 void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
 				  unsigned long size);
 
diff --git a/mm/sparse.c b/mm/sparse.c
index 05ca73a..cff9796 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -615,10 +615,11 @@ static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
 }
 static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages)
 {
-	return; /* XXX: Not implemented yet */
+	vmemmap_free(memmap, nr_pages);
 }
 static void free_map_bootmem(struct page *memmap, unsigned long nr_pages)
 {
+	vmemmap_free(memmap, nr_pages);
 }
 #else
 static struct page *__kmalloc_section_memmap(unsigned long nr_pages)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v5 11/14] memory-hotplug: Integrated __remove_section() of CONFIG_SPARSEMEM_VMEMMAP.
  2012-12-24 12:09 [PATCH v5 00/14] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (9 preceding siblings ...)
  2012-12-24 12:09 ` [PATCH v5 10/14] memory-hotplug: remove memmap of sparse-vmemmap Tang Chen
@ 2012-12-24 12:09 ` Tang Chen
  2012-12-24 12:09 ` [PATCH v5 12/14] memory-hotplug: memory_hotplug: clear zone when removing the memory Tang Chen
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 47+ messages in thread
From: Tang Chen @ 2012-12-24 12:09 UTC (permalink / raw)
  To: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen,
	hpa, linfeng, laijs, mgorman, yinghai
  Cc: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux

Currently, __remove_section() for SPARSEMEM_VMEMMAP does nothing. But even
when SPARSEMEM_VMEMMAP is in use, we can still unregister the memory_section.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 mm/memory_hotplug.c |   11 -----------
 1 files changed, 0 insertions(+), 11 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c12bd55..71cb656 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -430,16 +430,6 @@ static int __meminit __add_section(int nid, struct zone *zone,
 	return register_new_memory(nid, __pfn_to_section(phys_start_pfn));
 }
 
-#ifdef CONFIG_SPARSEMEM_VMEMMAP
-static int __remove_section(struct zone *zone, struct mem_section *ms)
-{
-	/*
-	 * XXX: Freeing memmap with vmemmap is not implement yet.
-	 *      This should be removed later.
-	 */
-	return -EBUSY;
-}
-#else
 static int __remove_section(struct zone *zone, struct mem_section *ms)
 {
 	int ret = -EINVAL;
@@ -454,7 +444,6 @@ static int __remove_section(struct zone *zone, struct mem_section *ms)
 	sparse_remove_one_section(zone, ms);
 	return 0;
 }
-#endif
 
 /*
  * Reasonably generic function for adding memory.  It is
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v5 12/14] memory-hotplug: memory_hotplug: clear zone when removing the memory
  2012-12-24 12:09 [PATCH v5 00/14] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (10 preceding siblings ...)
  2012-12-24 12:09 ` [PATCH v5 11/14] memory-hotplug: Integrated __remove_section() of CONFIG_SPARSEMEM_VMEMMAP Tang Chen
@ 2012-12-24 12:09 ` Tang Chen
  2012-12-24 12:09 ` [PATCH v5 13/14] memory-hotplug: remove sysfs file of node Tang Chen
  2012-12-24 12:09 ` [PATCH v5 14/14] memory-hotplug: free node_data when a node is offlined Tang Chen
  13 siblings, 0 replies; 47+ messages in thread
From: Tang Chen @ 2012-12-24 12:09 UTC (permalink / raw)
  To: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen,
	hpa, linfeng, laijs, mgorman, yinghai
  Cc: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

When memory is added, __add_zone() updates the zone's and pgdat's start_pfn
and spanned_pages, so we should revert them when the memory is removed.

The patch adds a new function __remove_zone() to do this.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 mm/memory_hotplug.c |  207 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 207 insertions(+), 0 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 71cb656..a1b0632 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -430,8 +430,211 @@ static int __meminit __add_section(int nid, struct zone *zone,
 	return register_new_memory(nid, __pfn_to_section(phys_start_pfn));
 }
 
+/* find the smallest valid pfn in the range [start_pfn, end_pfn) */
+static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
+				     unsigned long start_pfn,
+				     unsigned long end_pfn)
+{
+	struct mem_section *ms;
+
+	for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) {
+		ms = __pfn_to_section(start_pfn);
+
+		if (unlikely(!valid_section(ms)))
+			continue;
+
+		if (unlikely(pfn_to_nid(start_pfn) != nid))
+			continue;
+
+		if (zone && zone != page_zone(pfn_to_page(start_pfn)))
+			continue;
+
+		return start_pfn;
+	}
+
+	return 0;
+}
+
+/* find the biggest valid pfn in the range [start_pfn, end_pfn). */
+static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
+				    unsigned long start_pfn,
+				    unsigned long end_pfn)
+{
+	struct mem_section *ms;
+	unsigned long pfn;
+
+	/* pfn is the last pfn in the range [start_pfn, end_pfn). */
+	pfn = end_pfn - 1;
+	for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
+		ms = __pfn_to_section(pfn);
+
+		if (unlikely(!valid_section(ms)))
+			continue;
+
+		if (unlikely(pfn_to_nid(pfn) != nid))
+			continue;
+
+		if (zone && zone != page_zone(pfn_to_page(pfn)))
+			continue;
+
+		return pfn;
+	}
+
+	return 0;
+}
+
+static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
+			     unsigned long end_pfn)
+{
+	unsigned long zone_start_pfn =  zone->zone_start_pfn;
+	unsigned long zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
+	unsigned long pfn;
+	struct mem_section *ms;
+	int nid = zone_to_nid(zone);
+
+	zone_span_writelock(zone);
+	if (zone_start_pfn == start_pfn) {
+		/*
+		 * If the section is the smallest in the zone, we need to
+		 * shrink zone->zone_start_pfn and zone->spanned_pages.
+		 * In this case, find the next smallest valid mem_section
+		 * to shrink the zone to.
+		 */
+		pfn = find_smallest_section_pfn(nid, zone, end_pfn,
+						zone_end_pfn);
+		if (pfn) {
+			zone->zone_start_pfn = pfn;
+			zone->spanned_pages = zone_end_pfn - pfn;
+		}
+	} else if (zone_end_pfn == end_pfn) {
+		/*
+		 * If the section is the biggest in the zone, we only need
+		 * to shrink zone->spanned_pages.
+		 * In this case, find the next biggest valid mem_section
+		 * to shrink the zone to.
+		 */
+		pfn = find_biggest_section_pfn(nid, zone, zone_start_pfn,
+					       start_pfn);
+		if (pfn)
+			zone->spanned_pages = pfn - zone_start_pfn + 1;
+	}
+
+	/*
+	 * If the section is neither the biggest nor the smallest
+	 * mem_section in the zone, removing it only creates a hole, so
+	 * the zone's span need not change. But the zone may now consist
+	 * entirely of holes, so check whether any valid section remains.
+	 */
+	pfn = zone_start_pfn;
+	for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) {
+		ms = __pfn_to_section(pfn);
+
+		if (unlikely(!valid_section(ms)))
+			continue;
+
+		if (page_zone(pfn_to_page(pfn)) != zone)
+			continue;
+
+		/* Skip the section that is being removed. */
+		if (start_pfn == pfn)
+			continue;
+
+		/* If we find a valid section, there is nothing more to do. */
+		zone_span_writeunlock(zone);
+		return;
+	}
+
+	/* The zone has no valid section */
+	zone->zone_start_pfn = 0;
+	zone->spanned_pages = 0;
+	zone_span_writeunlock(zone);
+}
+
+static void shrink_pgdat_span(struct pglist_data *pgdat,
+			      unsigned long start_pfn, unsigned long end_pfn)
+{
+	unsigned long pgdat_start_pfn =  pgdat->node_start_pfn;
+	unsigned long pgdat_end_pfn =
+		pgdat->node_start_pfn + pgdat->node_spanned_pages;
+	unsigned long pfn;
+	struct mem_section *ms;
+	int nid = pgdat->node_id;
+
+	if (pgdat_start_pfn == start_pfn) {
+		/*
+		 * If the section is the smallest in the pgdat, we need to
+		 * shrink pgdat->node_start_pfn and pgdat->node_spanned_pages.
+		 * In this case, find the next smallest valid mem_section
+		 * to shrink the pgdat to.
+		 */
+		pfn = find_smallest_section_pfn(nid, NULL, end_pfn,
+						pgdat_end_pfn);
+		if (pfn) {
+			pgdat->node_start_pfn = pfn;
+			pgdat->node_spanned_pages = pgdat_end_pfn - pfn;
+		}
+	} else if (pgdat_end_pfn == end_pfn) {
+		/*
+		 * If the section is the biggest in the pgdat, we only need
+		 * to shrink pgdat->node_spanned_pages.
+		 * In this case, find the next biggest valid mem_section
+		 * to shrink the pgdat to.
+		 */
+		pfn = find_biggest_section_pfn(nid, NULL, pgdat_start_pfn,
+					       start_pfn);
+		if (pfn)
+			pgdat->node_spanned_pages = pfn - pgdat_start_pfn + 1;
+	}
+
+	/*
+	 * If the section is neither the biggest nor the smallest
+	 * mem_section in the pgdat, removing it only creates a hole,
+	 * so the pgdat's span need not change.
+	 * But the pgdat may now consist entirely of holes, so check
+	 * whether any valid section remains.
+	 */
+	pfn = pgdat_start_pfn;
+	for (; pfn < pgdat_end_pfn; pfn += PAGES_PER_SECTION) {
+		ms = __pfn_to_section(pfn);
+
+		if (unlikely(!valid_section(ms)))
+			continue;
+
+		if (pfn_to_nid(pfn) != nid)
+			continue;
+
+		/* Skip the section that is being removed. */
+		if (start_pfn == pfn)
+			continue;
+
+		/* If we find a valid section, there is nothing more to do. */
+		return;
+	}
+
+	/* The pgdat has no valid section */
+	pgdat->node_start_pfn = 0;
+	pgdat->node_spanned_pages = 0;
+}
+
+static void __remove_zone(struct zone *zone, unsigned long start_pfn)
+{
+	struct pglist_data *pgdat = zone->zone_pgdat;
+	int nr_pages = PAGES_PER_SECTION;
+	int zone_type;
+	unsigned long flags;
+
+	zone_type = zone - pgdat->node_zones;
+
+	pgdat_resize_lock(zone->zone_pgdat, &flags);
+	shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
+	shrink_pgdat_span(pgdat, start_pfn, start_pfn + nr_pages);
+	pgdat_resize_unlock(zone->zone_pgdat, &flags);
+}
+
 static int __remove_section(struct zone *zone, struct mem_section *ms)
 {
+	unsigned long start_pfn;
+	int scn_nr;
 	int ret = -EINVAL;
 
 	if (!valid_section(ms))
@@ -441,6 +644,10 @@ static int __remove_section(struct zone *zone, struct mem_section *ms)
 	if (ret)
 		return ret;
 
+	scn_nr = __section_nr(ms);
+	start_pfn = section_nr_to_pfn(scn_nr);
+	__remove_zone(zone, start_pfn);
+
 	sparse_remove_one_section(zone, ms);
 	return 0;
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v5 13/14] memory-hotplug: remove sysfs file of node
  2012-12-24 12:09 [PATCH v5 00/14] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (11 preceding siblings ...)
  2012-12-24 12:09 ` [PATCH v5 12/14] memory-hotplug: memory_hotplug: clear zone when removing the memory Tang Chen
@ 2012-12-24 12:09 ` Tang Chen
  2012-12-24 12:09 ` [PATCH v5 14/14] memory-hotplug: free node_data when a node is offlined Tang Chen
  13 siblings, 0 replies; 47+ messages in thread
From: Tang Chen @ 2012-12-24 12:09 UTC (permalink / raw)
  To: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen,
	hpa, linfeng, laijs, mgorman, yinghai
  Cc: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux

This patch introduces a new function, try_offline_node(), to remove the
sysfs files of a node once all memory sections of this node have been
removed. If some memory sections of the node remain, the function does
nothing.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 drivers/acpi/acpi_memhotplug.c |    8 ++++-
 include/linux/memory_hotplug.h |    2 +-
 mm/memory_hotplug.c            |   58 ++++++++++++++++++++++++++++++++++++++-
 3 files changed, 63 insertions(+), 5 deletions(-)

diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index eb30e5a..9c53cc6 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -295,9 +295,11 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
 
 static int acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
 {
-	int result = 0;
+	int result = 0, nid;
 	struct acpi_memory_info *info, *n;
 
+	nid = acpi_get_node(mem_device->device->handle);
+
 	list_for_each_entry_safe(info, n, &mem_device->res_list, list) {
 		if (info->failed)
 			/* The kernel does not use this memory block */
@@ -310,7 +312,9 @@ static int acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
 			 */
 			return -EBUSY;
 
-		result = remove_memory(info->start_addr, info->length);
+		if (nid < 0)
+			nid = memory_add_physaddr_to_nid(info->start_addr);
+		result = remove_memory(nid, info->start_addr, info->length);
 		if (result)
 			return result;
 
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 2441f36..f60e728 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -242,7 +242,7 @@ extern int arch_add_memory(int nid, u64 start, u64 size);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern int offline_memory_block(struct memory_block *mem);
 extern bool is_memblock_offlined(struct memory_block *mem);
-extern int remove_memory(u64 start, u64 size);
+extern int remove_memory(int nid, u64 start, u64 size);
 extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 								int nr_pages);
 extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a1b0632..f8a1d2f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -29,6 +29,7 @@
 #include <linux/suspend.h>
 #include <linux/mm_inline.h>
 #include <linux/firmware-map.h>
+#include <linux/stop_machine.h>
 
 #include <asm/tlbflush.h>
 
@@ -1659,7 +1660,58 @@ static int is_memblock_offlined_cb(struct memory_block *mem, void *arg)
 	return ret;
 }
 
-int __ref remove_memory(u64 start, u64 size)
+static int check_cpu_on_node(void *data)
+{
+	struct pglist_data *pgdat = data;
+	int cpu;
+
+	for_each_present_cpu(cpu) {
+		if (cpu_to_node(cpu) == pgdat->node_id)
+			/*
+			 * the cpu on this node isn't removed, and we can't
+			 * offline this node.
+			 */
+			return -EBUSY;
+	}
+
+	return 0;
+}
+
+/* offline the node if all memory sections of this node are removed */
+static void try_offline_node(int nid)
+{
+	unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn;
+	unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages;
+	unsigned long pfn;
+
+	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+		unsigned long section_nr = pfn_to_section_nr(pfn);
+
+		if (!present_section_nr(section_nr))
+			continue;
+
+		if (pfn_to_nid(pfn) != nid)
+			continue;
+
+		/*
+		 * Some memory sections of this node have not been removed,
+		 * so we cannot offline the node now.
+		 */
+		return;
+	}
+
+	if (stop_machine(check_cpu_on_node, NODE_DATA(nid), NULL))
+		return;
+
+	/*
+	 * All memory and CPUs of this node have been removed, so we can
+	 * offline the node now.
+	 */
+	node_set_offline(nid);
+	unregister_one_node(nid);
+}
+
+int __ref remove_memory(int nid, u64 start, u64 size)
 {
 	unsigned long start_pfn, end_pfn;
 	int ret = 0;
@@ -1706,6 +1758,8 @@ repeat:
 
 	arch_remove_memory(start, size);
 
+	try_offline_node(nid);
+
 	unlock_memory_hotplug();
 
 	return 0;
@@ -1715,7 +1769,7 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 {
 	return -EINVAL;
 }
-int remove_memory(u64 start, u64 size)
+int remove_memory(int nid, u64 start, u64 size)
 {
 	return -EINVAL;
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v5 14/14] memory-hotplug: free node_data when a node is offlined
  2012-12-24 12:09 [PATCH v5 00/14] memory-hotplug: hot-remove physical memory Tang Chen
                   ` (12 preceding siblings ...)
  2012-12-24 12:09 ` [PATCH v5 13/14] memory-hotplug: remove sysfs file of node Tang Chen
@ 2012-12-24 12:09 ` Tang Chen
  2012-12-26  3:55   ` Kamezawa Hiroyuki
  13 siblings, 1 reply; 47+ messages in thread
From: Tang Chen @ 2012-12-24 12:09 UTC (permalink / raw)
  To: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, tangchen,
	hpa, linfeng, laijs, mgorman, yinghai
  Cc: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux

From: Wen Congyang <wency@cn.fujitsu.com>

We call hotadd_new_pgdat() to allocate the memory that stores node_data, so
we should free it when removing a node.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 mm/memory_hotplug.c |   20 +++++++++++++++++++-
 1 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f8a1d2f..447fa24 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1680,9 +1680,12 @@ static int check_cpu_on_node(void *data)
 /* offline the node if all memory sections of this node are removed */
 static void try_offline_node(int nid)
 {
+	pg_data_t *pgdat = NODE_DATA(nid);
 	unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn;
-	unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages;
+	unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
 	unsigned long pfn;
+	struct page *pgdat_page = virt_to_page(pgdat);
+	int i;
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
 		unsigned long section_nr = pfn_to_section_nr(pfn);
@@ -1709,6 +1712,21 @@ static void try_offline_node(int nid)
 	 */
 	node_set_offline(nid);
 	unregister_one_node(nid);
+
+	if (!PageSlab(pgdat_page) && !PageCompound(pgdat_page))
+		/* node data is allocated from boot memory */
+		return;
+
+	/* free the wait_table of each zone */
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		struct zone *zone = pgdat->node_zones + i;
+
+		if (zone->wait_table)
+			vfree(zone->wait_table);
+	}
+
+	arch_refresh_nodedata(nid, NULL);
+	arch_free_nodedata(pgdat);
 }
 
 int __ref remove_memory(int nid, u64 start, u64 size)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v5 06/14] memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap
  2012-12-24 12:09 ` [PATCH v5 06/14] memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap Tang Chen
@ 2012-12-25  8:09   ` Jianguo Wu
  2012-12-26  3:21     ` Tang Chen
  0 siblings, 1 reply; 47+ messages in thread
From: Jianguo Wu @ 2012-12-25  8:09 UTC (permalink / raw)
  To: Tang Chen
  Cc: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wency, hpa, linfeng, laijs,
	mgorman, yinghai, x86, linux-mm, linux-kernel, linuxppc-dev,
	linux-acpi, linux-s390, linux-sh, linux-ia64, cmetcalf,
	sparclinux

On 2012/12/24 20:09, Tang Chen wrote:

> From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> 
> To remove a sparse-vmemmap memmap region that was allocated from bootmem,
> the region needs to be registered by get_page_bootmem(). So the patch walks
> the pages of the virtual mapping and registers them with
> get_page_bootmem().
> 
> Note: register_page_bootmem_memmap() is not implemented for ia64, ppc, s390,
> and sparc.
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> ---
>  arch/ia64/mm/discontig.c       |    6 ++++
>  arch/powerpc/mm/init_64.c      |    6 ++++
>  arch/s390/mm/vmem.c            |    6 ++++
>  arch/sparc/mm/init_64.c        |    6 ++++
>  arch/x86/mm/init_64.c          |   52 ++++++++++++++++++++++++++++++++++++++++
>  include/linux/memory_hotplug.h |   11 +-------
>  include/linux/mm.h             |    3 +-
>  mm/memory_hotplug.c            |   33 ++++++++++++++++++++++---
>  8 files changed, 109 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
> index c641333..33943db 100644
> --- a/arch/ia64/mm/discontig.c
> +++ b/arch/ia64/mm/discontig.c
> @@ -822,4 +822,10 @@ int __meminit vmemmap_populate(struct page *start_page,
>  {
>  	return vmemmap_populate_basepages(start_page, size, node);
>  }
> +
> +void register_page_bootmem_memmap(unsigned long section_nr,
> +				  struct page *start_page, unsigned long size)
> +{
> +	/* TODO */
> +}
>  #endif
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index 95a4529..6466440 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -297,5 +297,11 @@ int __meminit vmemmap_populate(struct page *start_page,
>  
>  	return 0;
>  }
> +
> +void register_page_bootmem_memmap(unsigned long section_nr,
> +				  struct page *start_page, unsigned long size)
> +{
> +	/* TODO */
> +}
>  #endif /* CONFIG_SPARSEMEM_VMEMMAP */
>  
> diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
> index 6ed1426..2c14bc2 100644
> --- a/arch/s390/mm/vmem.c
> +++ b/arch/s390/mm/vmem.c
> @@ -272,6 +272,12 @@ out:
>  	return ret;
>  }
>  
> +void register_page_bootmem_memmap(unsigned long section_nr,
> +				  struct page *start_page, unsigned long size)
> +{
> +	/* TODO */
> +}
> +
>  /*
>   * Add memory segment to the segment list if it doesn't overlap with
>   * an already present segment.
> diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
> index 85be1ca..7e28c9e 100644
> --- a/arch/sparc/mm/init_64.c
> +++ b/arch/sparc/mm/init_64.c
> @@ -2231,6 +2231,12 @@ void __meminit vmemmap_populate_print_last(void)
>  		node_start = 0;
>  	}
>  }
> +
> +void register_page_bootmem_memmap(unsigned long section_nr,
> +				  struct page *start_page, unsigned long size)
> +{
> +	/* TODO */
> +}
>  #endif /* CONFIG_SPARSEMEM_VMEMMAP */
>  
>  static void prot_init_common(unsigned long page_none,
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index f78509c..aeaa27e 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1000,6 +1000,58 @@ vmemmap_populate(struct page *start_page, unsigned long size, int node)
>  	return 0;
>  }
>  
> +void register_page_bootmem_memmap(unsigned long section_nr,
> +				  struct page *start_page, unsigned long size)
> +{
> +	unsigned long addr = (unsigned long)start_page;
> +	unsigned long end = (unsigned long)(start_page + size);
> +	unsigned long next;
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +
> +	for (; addr < end; addr = next) {
> +		pte_t *pte = NULL;
> +
> +		pgd = pgd_offset_k(addr);
> +		if (pgd_none(*pgd)) {
> +			next = (addr + PAGE_SIZE) & PAGE_MASK;
> +			continue;
> +		}
> +		get_page_bootmem(section_nr, pgd_page(*pgd), MIX_SECTION_INFO);
> +
> +		pud = pud_offset(pgd, addr);
> +		if (pud_none(*pud)) {
> +			next = (addr + PAGE_SIZE) & PAGE_MASK;
> +			continue;
> +		}
> +		get_page_bootmem(section_nr, pud_page(*pud), MIX_SECTION_INFO);
> +
> +		if (!cpu_has_pse) {
> +			next = (addr + PAGE_SIZE) & PAGE_MASK;
> +			pmd = pmd_offset(pud, addr);
> +			if (pmd_none(*pmd))
> +				continue;
> +			get_page_bootmem(section_nr, pmd_page(*pmd),
> +					 MIX_SECTION_INFO);
> +
> +			pte = pte_offset_kernel(pmd, addr);
> +			if (pte_none(*pte))
> +				continue;
> +			get_page_bootmem(section_nr, pte_page(*pte),
> +					 SECTION_INFO);
> +		} else {
> +			next = pmd_addr_end(addr, end);
> +
> +			pmd = pmd_offset(pud, addr);
> +			if (pmd_none(*pmd))
> +				continue;
> +			get_page_bootmem(section_nr, pmd_page(*pmd),
> +					 SECTION_INFO);

Hi Tang,
	In this case, the pmd maps 512 pages, but you call get_page_bootmem() only
on the first page. I think get_page_bootmem() should be called on all 512
pages. What do you think?

Thanks,
Jianguo Wu

> +		}
> +	}
> +}
> +
>  void __meminit vmemmap_populate_print_last(void)
>  {
>  	if (p_start) {
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 31a563b..2441f36 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -174,17 +174,10 @@ static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
>  #endif /* CONFIG_NUMA */
>  #endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
>  
> -#ifdef CONFIG_SPARSEMEM_VMEMMAP
> -static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
> -{
> -}
> -static inline void put_page_bootmem(struct page *page)
> -{
> -}
> -#else
>  extern void register_page_bootmem_info_node(struct pglist_data *pgdat);
>  extern void put_page_bootmem(struct page *page);
> -#endif
> +extern void get_page_bootmem(unsigned long ingo, struct page *page,
> +			     unsigned long type);
>  
>  /*
>   * Lock for memory hotplug guarantees 1) all callbacks for memory hotplug
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 6320407..1eca498 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1709,7 +1709,8 @@ int vmemmap_populate_basepages(struct page *start_page,
>  						unsigned long pages, int node);
>  int vmemmap_populate(struct page *start_page, unsigned long pages, int node);
>  void vmemmap_populate_print_last(void);
> -
> +void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
> +				  unsigned long size);
>  
>  enum mf_flags {
>  	MF_COUNT_INCREASED = 1 << 0,
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 2c5d734..34c656b 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -91,9 +91,8 @@ static void release_memory_resource(struct resource *res)
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
> -#ifndef CONFIG_SPARSEMEM_VMEMMAP
> -static void get_page_bootmem(unsigned long info,  struct page *page,
> -			     unsigned long type)
> +void get_page_bootmem(unsigned long info,  struct page *page,
> +		      unsigned long type)
>  {
>  	page->lru.next = (struct list_head *) type;
>  	SetPagePrivate(page);
> @@ -128,6 +127,7 @@ void __ref put_page_bootmem(struct page *page)
>  
>  }
>  
> +#ifndef CONFIG_SPARSEMEM_VMEMMAP
>  static void register_page_bootmem_info_section(unsigned long start_pfn)
>  {
>  	unsigned long *usemap, mapsize, section_nr, i;
> @@ -161,6 +161,32 @@ static void register_page_bootmem_info_section(unsigned long start_pfn)
>  		get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
>  
>  }
> +#else
> +static void register_page_bootmem_info_section(unsigned long start_pfn)
> +{
> +	unsigned long *usemap, mapsize, section_nr, i;
> +	struct mem_section *ms;
> +	struct page *page, *memmap;
> +
> +	if (!pfn_valid(start_pfn))
> +		return;
> +
> +	section_nr = pfn_to_section_nr(start_pfn);
> +	ms = __nr_to_section(section_nr);
> +
> +	memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
> +
> +	register_page_bootmem_memmap(section_nr, memmap, PAGES_PER_SECTION);
> +
> +	usemap = __nr_to_section(section_nr)->pageblock_flags;
> +	page = virt_to_page(usemap);
> +
> +	mapsize = PAGE_ALIGN(usemap_size()) >> PAGE_SHIFT;
> +
> +	for (i = 0; i < mapsize; i++, page++)
> +		get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
> +}
> +#endif
>  
>  void register_page_bootmem_info_node(struct pglist_data *pgdat)
>  {
> @@ -203,7 +229,6 @@ void register_page_bootmem_info_node(struct pglist_data *pgdat)
>  			register_page_bootmem_info_section(pfn);
>  	}
>  }
> -#endif /* !CONFIG_SPARSEMEM_VMEMMAP */
>  
>  static void grow_zone_span(struct zone *zone, unsigned long start_pfn,
>  			   unsigned long end_pfn)




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v5 08/14] memory-hotplug: Common APIs to support page tables hot-remove
  2012-12-24 12:09 ` [PATCH v5 08/14] memory-hotplug: Common APIs to support page tables hot-remove Tang Chen
@ 2012-12-25  8:17   ` Jianguo Wu
  2012-12-26  2:49     ` Tang Chen
  0 siblings, 1 reply; 47+ messages in thread
From: Jianguo Wu @ 2012-12-25  8:17 UTC (permalink / raw)
  To: Tang Chen
  Cc: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wency, hpa, linfeng, laijs,
	mgorman, yinghai, x86, linux-mm, linux-kernel, linuxppc-dev,
	linux-acpi, linux-s390, linux-sh, linux-ia64, cmetcalf,
	sparclinux

On 2012/12/24 20:09, Tang Chen wrote:

> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> When memory is removed, the corresponding page tables should also be removed.
> This patch introduces some common APIs to support removing vmemmap page
> tables and x86_64 architecture page tables.
> 
> Not all pages of the virtual mapping in the removed memory can be freed,
> because some pages used as PGD/PUD may cover other memory in addition to
> the removed memory. So the patch uses the following way to check whether a
> page can be freed:
> 
>  1. When removing memory, the page structs of the removed memory are filled
>     with 0xFD.
>  2. When all page structs on a PT/PMD page are filled with 0xFD, the PT/PMD
>     can be cleared. In this case, the page used as the PT/PMD can be freed.
> 
> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> ---
>  arch/x86/include/asm/pgtable_types.h |    1 +
>  arch/x86/mm/init_64.c                |  297 ++++++++++++++++++++++++++++++++++
>  arch/x86/mm/pageattr.c               |   47 +++---
>  include/linux/bootmem.h              |    1 +
>  4 files changed, 324 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 3c32db8..4b6fd2a 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -352,6 +352,7 @@ static inline void update_page_count(int level, unsigned long pages) { }
>   * as a pte too.
>   */
>  extern pte_t *lookup_address(unsigned long address, unsigned int *level);
> +extern int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase);
>  
>  #endif	/* !__ASSEMBLY__ */
>  
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index aeaa27e..b30df3c 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -682,6 +682,303 @@ int arch_add_memory(int nid, u64 start, u64 size)
>  }
>  EXPORT_SYMBOL_GPL(arch_add_memory);
>  
> +#define PAGE_INUSE 0xFD
> +
> +static void __meminit free_pagetable(struct page *page, int order)
> +{
> +	struct zone *zone;
> +	bool bootmem = false;
> +	unsigned long magic;
> +
> +	/* bootmem page has reserved flag */
> +	if (PageReserved(page)) {
> +		__ClearPageReserved(page);
> +		bootmem = true;
> +
> +		magic = (unsigned long)page->lru.next;
> +		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO)
> +			put_page_bootmem(page);

Hi Tang,

When removing a sparse-vmemmap memmap in the cpu_has_pse case, if magic ==
SECTION_INFO, the order will be get_order(PMD_SIZE), so we need a loop here
to put all 512 pages.

Thanks,
Jianguo Wu

> +		else
> +			__free_pages_bootmem(page, order);
> +	} else
> +		free_pages((unsigned long)page_address(page), order);
> +
> +	/*
> +	 * SECTION_INFO pages and MIX_SECTION_INFO pages
> +	 * are all allocated by bootmem.
> +	 */
> +	if (bootmem) {
> +		zone = page_zone(page);
> +		zone_span_writelock(zone);
> +		zone->present_pages++;
> +		zone_span_writeunlock(zone);
> +		totalram_pages++;
> +	}
> +}
> +
> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
> +{
> +	pte_t *pte;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PTE; i++) {
> +		pte = pte_start + i;
> +		if (pte_val(*pte))
> +			return;
> +	}
> +
> +	/* free a pte table */
> +	free_pagetable(pmd_page(*pmd), 0);
> +	spin_lock(&init_mm.page_table_lock);
> +	pmd_clear(pmd);
> +	spin_unlock(&init_mm.page_table_lock);
> +}
> +
> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
> +{
> +	pmd_t *pmd;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PMD; i++) {
> +		pmd = pmd_start + i;
> +		if (pmd_val(*pmd))
> +			return;
> +	}
> +
> +	/* free a pmd table */
> +	free_pagetable(pud_page(*pud), 0);
> +	spin_lock(&init_mm.page_table_lock);
> +	pud_clear(pud);
> +	spin_unlock(&init_mm.page_table_lock);
> +}
> +
> +/* Return true if pgd is changed, otherwise return false. */
> +static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
> +{
> +	pud_t *pud;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PUD; i++) {
> +		pud = pud_start + i;
> +		if (pud_val(*pud))
> +			return false;
> +	}
> +
> +	/* free a pud table */
> +	free_pagetable(pgd_page(*pgd), 0);
> +	spin_lock(&init_mm.page_table_lock);
> +	pgd_clear(pgd);
> +	spin_unlock(&init_mm.page_table_lock);
> +
> +	return true;
> +}
> +
> +static void __meminit
> +remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
> +		 bool direct)
> +{
> +	unsigned long next, pages = 0;
> +	pte_t *pte;
> +	void *page_addr;
> +	phys_addr_t phys_addr;
> +
> +	pte = pte_start + pte_index(addr);
> +	for (; addr < end; addr = next, pte++) {
> +		next = (addr + PAGE_SIZE) & PAGE_MASK;
> +		if (next > end)
> +			next = end;
> +
> +		if (!pte_present(*pte))
> +			continue;
> +
> +		/*
> +		 * We mapped [0,1G) memory as identity mapping when
> +		 * initializing, in arch/x86/kernel/head_64.S. These
> +		 * pagetables cannot be removed.
> +		 */
> +		phys_addr = pte_val(*pte) + (addr & PAGE_MASK);
> +		if (phys_addr < (phys_addr_t)0x40000000)
> +			return;
> +
> +		if (IS_ALIGNED(addr, PAGE_SIZE) &&
> +		    IS_ALIGNED(next, PAGE_SIZE)) {
> +			if (!direct) {
> +				free_pagetable(pte_page(*pte), 0);
> +				pages++;
> +			}
> +
> +			spin_lock(&init_mm.page_table_lock);
> +			pte_clear(&init_mm, addr, pte);
> +			spin_unlock(&init_mm.page_table_lock);
> +		} else {
> +			/*
> +			 * If we are not removing the whole page, it means
> +		 * other ptes in this page are being used and we cannot
> +			 * remove them. So fill the unused ptes with 0xFD, and
> +			 * remove the page when it is wholly filled with 0xFD.
> +			 */
> +			memset((void *)addr, PAGE_INUSE, next - addr);
> +			page_addr = page_address(pte_page(*pte));
> +
> +			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
> +				free_pagetable(pte_page(*pte), 0);
> +				pages++;
> +
> +				spin_lock(&init_mm.page_table_lock);
> +				pte_clear(&init_mm, addr, pte);
> +				spin_unlock(&init_mm.page_table_lock);
> +			}
> +		}
> +	}
> +
> +	/* Call free_pte_table() in remove_pmd_table(). */
> +	flush_tlb_all();
> +	if (direct)
> +		update_page_count(PG_LEVEL_4K, -pages);
> +}
> +
> +static void __meminit
> +remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
> +		 bool direct)
> +{
> +	unsigned long pte_phys, next, pages = 0;
> +	pte_t *pte_base;
> +	pmd_t *pmd;
> +
> +	pmd = pmd_start + pmd_index(addr);
> +	for (; addr < end; addr = next, pmd++) {
> +		next = pmd_addr_end(addr, end);
> +
> +		if (!pmd_present(*pmd))
> +			continue;
> +
> +		if (pmd_large(*pmd)) {
> +			if (IS_ALIGNED(addr, PMD_SIZE) &&
> +			    IS_ALIGNED(next, PMD_SIZE)) {
> +				if (!direct) {
> +					free_pagetable(pmd_page(*pmd),
> +						       get_order(PMD_SIZE));
> +					pages++;
> +				}
> +
> +				spin_lock(&init_mm.page_table_lock);
> +				pmd_clear(pmd);
> +				spin_unlock(&init_mm.page_table_lock);
> +				continue;
> +			}
> +
> +			/*
> +			 * We are using a 2M page, but need to remove part of
> +			 * it, so split the 2M page into 4K pages.
> +			 */
> +			pte_base = (pte_t *)alloc_low_page(&pte_phys);
> +			BUG_ON(!pte_base);
> +			__split_large_page((pte_t *)pmd, addr,
> +					   (pte_t *)pte_base);
> +
> +			spin_lock(&init_mm.page_table_lock);
> +			pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
> +			spin_unlock(&init_mm.page_table_lock);
> +
> +			flush_tlb_all();
> +		}
> +
> +		pte_base = (pte_t *)map_low_page((pte_t *)pmd_page_vaddr(*pmd));
> +		remove_pte_table(pte_base, addr, next, direct);
> +		free_pte_table(pte_base, pmd);
> +		unmap_low_page(pte_base);
> +	}
> +
> +	/* Call free_pmd_table() in remove_pud_table(). */
> +	if (direct)
> +		update_page_count(PG_LEVEL_2M, -pages);
> +}
> +
> +static void __meminit
> +remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
> +		 bool direct)
> +{
> +	unsigned long pmd_phys, next, pages = 0;
> +	pmd_t *pmd_base;
> +	pud_t *pud;
> +
> +	pud = pud_start + pud_index(addr);
> +	for (; addr < end; addr = next, pud++) {
> +		next = pud_addr_end(addr, end);
> +
> +		if (!pud_present(*pud))
> +			continue;
> +
> +		if (pud_large(*pud)) {
> +			if (IS_ALIGNED(addr, PUD_SIZE) &&
> +			    IS_ALIGNED(next, PUD_SIZE)) {
> +				if (!direct) {
> +					free_pagetable(pud_page(*pud),
> +						       get_order(PUD_SIZE));
> +					pages++;
> +				}
> +
> +				spin_lock(&init_mm.page_table_lock);
> +				pud_clear(pud);
> +				spin_unlock(&init_mm.page_table_lock);
> +				continue;
> +			}
> +
> +			/*
> +			 * We are using a 1G page, but need to remove part of
> +			 * it, so split the 1G page into 2M pages.
> +			 */
> +			pmd_base = (pmd_t *)alloc_low_page(&pmd_phys);
> +			BUG_ON(!pmd_base);
> +			__split_large_page((pte_t *)pud, addr,
> +					   (pte_t *)pmd_base);
> +
> +			spin_lock(&init_mm.page_table_lock);
> +			pud_populate(&init_mm, pud, __va(pmd_phys));
> +			spin_unlock(&init_mm.page_table_lock);
> +
> +			flush_tlb_all();
> +		}
> +
> +		pmd_base = (pmd_t *)map_low_page((pmd_t *)pud_page_vaddr(*pud));
> +		remove_pmd_table(pmd_base, addr, next, direct);
> +		free_pmd_table(pmd_base, pud);
> +		unmap_low_page(pmd_base);
> +	}
> +
> +	if (direct)
> +		update_page_count(PG_LEVEL_1G, -pages);
> +}
> +
> +/* start and end are both virtual address. */
> +static void __meminit
> +remove_pagetable(unsigned long start, unsigned long end, bool direct)
> +{
> +	unsigned long next;
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	bool pgd_changed = false;
> +
> +	for (; start < end; start = next) {
> +		pgd = pgd_offset_k(start);
> +		if (!pgd_present(*pgd))
> +			continue;
> +
> +		next = pgd_addr_end(start, end);
> +
> +		pud = (pud_t *)map_low_page((pud_t *)pgd_page_vaddr(*pgd));
> +		remove_pud_table(pud, start, next, direct);
> +		if (free_pud_table(pud, pgd))
> +			pgd_changed = true;
> +		unmap_low_page(pud);
> +	}
> +
> +	if (pgd_changed)
> +		sync_global_pgds(start, end - 1);
> +
> +	flush_tlb_all();
> +}
> +
>  #ifdef CONFIG_MEMORY_HOTREMOVE
>  int __ref arch_remove_memory(u64 start, u64 size)
>  {
> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> index a718e0d..7dcb6f9 100644
> --- a/arch/x86/mm/pageattr.c
> +++ b/arch/x86/mm/pageattr.c
> @@ -501,21 +501,13 @@ out_unlock:
>  	return do_split;
>  }
>  
> -static int split_large_page(pte_t *kpte, unsigned long address)
> +int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
>  {
>  	unsigned long pfn, pfninc = 1;
>  	unsigned int i, level;
> -	pte_t *pbase, *tmp;
> +	pte_t *tmp;
>  	pgprot_t ref_prot;
> -	struct page *base;
> -
> -	if (!debug_pagealloc)
> -		spin_unlock(&cpa_lock);
> -	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
> -	if (!debug_pagealloc)
> -		spin_lock(&cpa_lock);
> -	if (!base)
> -		return -ENOMEM;
> +	struct page *base = virt_to_page(pbase);
>  
>  	spin_lock(&pgd_lock);
>  	/*
> @@ -523,10 +515,11 @@ static int split_large_page(pte_t *kpte, unsigned long address)
>  	 * up for us already:
>  	 */
>  	tmp = lookup_address(address, &level);
> -	if (tmp != kpte)
> -		goto out_unlock;
> +	if (tmp != kpte) {
> +		spin_unlock(&pgd_lock);
> +		return 1;
> +	}
>  
> -	pbase = (pte_t *)page_address(base);
>  	paravirt_alloc_pte(&init_mm, page_to_pfn(base));
>  	ref_prot = pte_pgprot(pte_clrhuge(*kpte));
>  	/*
> @@ -579,17 +572,27 @@ static int split_large_page(pte_t *kpte, unsigned long address)
>  	 * going on.
>  	 */
>  	__flush_tlb_all();
> +	spin_unlock(&pgd_lock);
>  
> -	base = NULL;
> +	return 0;
> +}
>  
> -out_unlock:
> -	/*
> -	 * If we dropped out via the lookup_address check under
> -	 * pgd_lock then stick the page back into the pool:
> -	 */
> -	if (base)
> +static int split_large_page(pte_t *kpte, unsigned long address)
> +{
> +	pte_t *pbase;
> +	struct page *base;
> +
> +	if (!debug_pagealloc)
> +		spin_unlock(&cpa_lock);
> +	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
> +	if (!debug_pagealloc)
> +		spin_lock(&cpa_lock);
> +	if (!base)
> +		return -ENOMEM;
> +
> +	pbase = (pte_t *)page_address(base);
> +	if (__split_large_page(kpte, address, pbase))
>  		__free_page(base);
> -	spin_unlock(&pgd_lock);
>  
>  	return 0;
>  }
> diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
> index 3f778c2..190ff06 100644
> --- a/include/linux/bootmem.h
> +++ b/include/linux/bootmem.h
> @@ -53,6 +53,7 @@ extern void free_bootmem_node(pg_data_t *pgdat,
>  			      unsigned long size);
>  extern void free_bootmem(unsigned long physaddr, unsigned long size);
>  extern void free_bootmem_late(unsigned long physaddr, unsigned long size);
> +extern void __free_pages_bootmem(struct page *page, unsigned int order);
>  
>  /*
>   * Flags for reserve_bootmem (also if CONFIG_HAVE_ARCH_BOOTMEM_NODE,




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence
  2012-12-24 12:09 ` [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence Tang Chen
@ 2012-12-25  8:35   ` Glauber Costa
  2012-12-30  5:58     ` Wen Congyang
  2012-12-26  3:02   ` Kamezawa Hiroyuki
  1 sibling, 1 reply; 47+ messages in thread
From: Glauber Costa @ 2012-12-25  8:35 UTC (permalink / raw)
  To: Tang Chen
  Cc: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, x86, linux-mm, linux-kernel,
	linuxppc-dev, linux-acpi, linux-s390, linux-sh, linux-ia64,
	cmetcalf, sparclinux

On 12/24/2012 04:09 PM, Tang Chen wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> memory can't be offlined when CONFIG_MEMCG is selected.
> For example: there is a memory device on node 1. The address range
> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
> and memory11 under the directory /sys/devices/system/memory/.
> 
> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
> when we online pages. When we online memory8, the memory stored page cgroup
> is not provided by this memory device. But when we online memory9, the memory
> stored page cgroup may be provided by memory8. So we can't offline memory8
> now. We should offline the memory in the reversed order.
> 
> When the memory device is hotremoved, we will auto offline memory provided
> by this memory device. But we don't know which memory is onlined first, so
> offlining memory may fail. In such case, iterate twice to offline the memory.
> 1st iterate: offline every non primary memory block.
> 2nd iterate: offline primary (i.e. first added) memory block.
> 
> This idea is suggested by KOSAKI Motohiro.
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>

Maybe there is something here that I am missing - I admit that I came
late to this one, but this really sounds like a very ugly hack that
has no place here.

Retrying, of course, may make sense, if we have reasonable belief that
we may now succeed. If this is the case, you need to document - in the
code - why that is.

The memcg argument, however, doesn't really cut it. Why can't we make
all page_cgroup allocations local to the node they are describing? If
memcg is the culprit here, we should fix it, and not retry. If there is
still any benefit in retrying, then we retry being very specific about why.




* Re: [PATCH v5 08/14] memory-hotplug: Common APIs to support page tables hot-remove
  2012-12-25  8:17   ` Jianguo Wu
@ 2012-12-26  2:49     ` Tang Chen
  2012-12-26  3:11       ` Tang Chen
  0 siblings, 1 reply; 47+ messages in thread
From: Tang Chen @ 2012-12-26  2:49 UTC (permalink / raw)
  To: Jianguo Wu
  Cc: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wency, hpa, linfeng, laijs,
	mgorman, yinghai, x86, linux-mm, linux-kernel, linuxppc-dev,
	linux-acpi, linux-s390, linux-sh, linux-ia64, cmetcalf,
	sparclinux

On 12/25/2012 04:17 PM, Jianguo Wu wrote:
>> +
>> +static void __meminit free_pagetable(struct page *page, int order)
>> +{
>> +	struct zone *zone;
>> +	bool bootmem = false;
>> +	unsigned long magic;
>> +
>> +	/* bootmem page has reserved flag */
>> +	if (PageReserved(page)) {
>> +		__ClearPageReserved(page);
>> +		bootmem = true;
>> +
>> +		magic = (unsigned long)page->lru.next;
>> +		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO)
>> +			put_page_bootmem(page);
>
> Hi Tang,
>
> For removing memmap of sparse-vmemmap, in cpu_has_pse case, if magic == SECTION_INFO,
> the order will be get_order(PMD_SIZE), so we need a loop here to put all the 512 pages.
>
Hi Wu,

Thanks for reminding me that. I truly missed it.

And since in register_page_bootmem_info_section(), a whole memory
section will be set as SECTION_INFO, I think we don't need to check
the page magic one by one, just the first one is enough. :)

I will fix it, thanks. :)



* Re: [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence
  2012-12-24 12:09 ` [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence Tang Chen
  2012-12-25  8:35   ` Glauber Costa
@ 2012-12-26  3:02   ` Kamezawa Hiroyuki
  2012-12-30  5:49     ` Wen Congyang
  1 sibling, 1 reply; 47+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-26  3:02 UTC (permalink / raw)
  To: Tang Chen
  Cc: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, x86, linux-mm, linux-kernel,
	linuxppc-dev, linux-acpi, linux-s390, linux-sh, linux-ia64,
	cmetcalf, sparclinux

(2012/12/24 21:09), Tang Chen wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> memory can't be offlined when CONFIG_MEMCG is selected.
> For example: there is a memory device on node 1. The address range
> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
> and memory11 under the directory /sys/devices/system/memory/.
> 
> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
> when we online pages. When we online memory8, the memory stored page cgroup
> is not provided by this memory device. But when we online memory9, the memory
> stored page cgroup may be provided by memory8. So we can't offline memory8
> now. We should offline the memory in the reversed order.
> 

If memory8 is onlined as NORMAL memory ...right ?

IIUC, vmalloc() uses __GFP_HIGHMEM but doesn't use __GFP_MOVABLE.

> When the memory device is hotremoved, we will auto offline memory provided
> by this memory device. But we don't know which memory is onlined first, so
> offlining memory may fail. In such case, iterate twice to offline the memory.
> 1st iterate: offline every non primary memory block.
> 2nd iterate: offline primary (i.e. first added) memory block.
> 
> This idea is suggested by KOSAKI Motohiro.
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>

I'm not sure but the whole DIMM should be onlined as MOVABLE mem ?

Anyway, I agree this kind of retry is required if memory is onlined as NORMAL mem.
But retry-once is ok ?

Thanks,
-Kame

> ---
>   mm/memory_hotplug.c |   16 ++++++++++++++--
>   1 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index d04ed87..62e04c9 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1388,10 +1388,13 @@ int remove_memory(u64 start, u64 size)
>   	unsigned long start_pfn, end_pfn;
>   	unsigned long pfn, section_nr;
>   	int ret;
> +	int return_on_error = 0;
> +	int retry = 0;
>   
>   	start_pfn = PFN_DOWN(start);
>   	end_pfn = start_pfn + PFN_DOWN(size);
>   
> +repeat:
>   	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>   		section_nr = pfn_to_section_nr(pfn);
>   		if (!present_section_nr(section_nr))
> @@ -1410,14 +1413,23 @@ int remove_memory(u64 start, u64 size)
>   
>   		ret = offline_memory_block(mem);
>   		if (ret) {
> -			kobject_put(&mem->dev.kobj);
> -			return ret;
> +			if (return_on_error) {
> +				kobject_put(&mem->dev.kobj);
> +				return ret;
> +			} else {
> +				retry = 1;
> +			}
>   		}
>   	}
>   
>   	if (mem)
>   		kobject_put(&mem->dev.kobj);
>   
> +	if (retry) {
> +		return_on_error = 1;
> +		goto repeat;
> +	}
> +
>   	return 0;
>   }
>   #else
> 




* Re: [PATCH v5 02/14] memory-hotplug: check whether all memory blocks are offlined or not when removing memory
  2012-12-24 12:09 ` [PATCH v5 02/14] memory-hotplug: check whether all memory blocks are offlined or not when removing memory Tang Chen
@ 2012-12-26  3:10   ` Kamezawa Hiroyuki
  2012-12-27  3:10     ` Tang Chen
  0 siblings, 1 reply; 47+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-26  3:10 UTC (permalink / raw)
  To: Tang Chen
  Cc: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, x86, linux-mm, linux-kernel,
	linuxppc-dev, linux-acpi, linux-s390, linux-sh, linux-ia64,
	cmetcalf, sparclinux

(2012/12/24 21:09), Tang Chen wrote:
> From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> 
> We remove the memory like this:
> 1. lock memory hotplug
> 2. offline a memory block
> 3. unlock memory hotplug
> 4. repeat 1-3 to offline all memory blocks
> 5. lock memory hotplug
> 6. remove memory(TODO)
> 7. unlock memory hotplug
> 
> All memory blocks must be offlined before removing memory. But we don't hold
> the lock in the whole operation. So we should check whether all memory blocks
> are offlined before step 6. Otherwise, the kernel may panic.
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

a nitpick below.

> ---
>   drivers/base/memory.c          |    6 +++++
>   include/linux/memory_hotplug.h |    1 +
>   mm/memory_hotplug.c            |   47 ++++++++++++++++++++++++++++++++++++++++
>   3 files changed, 54 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 987604d..8300a18 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -693,6 +693,12 @@ int offline_memory_block(struct memory_block *mem)
>   	return ret;
>   }
>   
> +/* return true if the memory block is offlined, otherwise, return false */
> +bool is_memblock_offlined(struct memory_block *mem)
> +{
> +	return mem->state == MEM_OFFLINE;
> +}
> +
>   /*
>    * Initialize the sysfs support for memory devices...
>    */
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 4a45c4e..8dd0950 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -247,6 +247,7 @@ extern int add_memory(int nid, u64 start, u64 size);
>   extern int arch_add_memory(int nid, u64 start, u64 size);
>   extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
>   extern int offline_memory_block(struct memory_block *mem);
> +extern bool is_memblock_offlined(struct memory_block *mem);
>   extern int remove_memory(u64 start, u64 size);
>   extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
>   								int nr_pages);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 62e04c9..d43d97b 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1430,6 +1430,53 @@ repeat:
>   		goto repeat;
>   	}
>   
> +	lock_memory_hotplug();
> +
> +	/*
> +	 * we have offlined all memory blocks like this:
> +	 *   1. lock memory hotplug
> +	 *   2. offline a memory block
> +	 *   3. unlock memory hotplug
> +	 *
> +	 * repeat step1-3 to offline the memory block. All memory blocks
> +	 * must be offlined before removing memory. But we don't hold the
> +	 * lock in the whole operation. So we should check whether all
> +	 * memory blocks are offlined.
> +	 */
> +
> +	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {

I prefer adding mem = NULL at the start of this for().

> +		section_nr = pfn_to_section_nr(pfn);
> +		if (!present_section_nr(section_nr))
> +			continue;
> +
> +		section = __nr_to_section(section_nr);
> +		/* same memblock? */
> +		if (mem)
> +			if ((section_nr >= mem->start_section_nr) &&
> +			    (section_nr <= mem->end_section_nr))
> +				continue;
> +

Thanks,
-Kame




* Re: [PATCH v5 08/14] memory-hotplug: Common APIs to support page tables hot-remove
  2012-12-26  2:49     ` Tang Chen
@ 2012-12-26  3:11       ` Tang Chen
  2012-12-26  3:19         ` Tang Chen
  0 siblings, 1 reply; 47+ messages in thread
From: Tang Chen @ 2012-12-26  3:11 UTC (permalink / raw)
  To: Jianguo Wu
  Cc: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wency, hpa, linfeng, laijs,
	mgorman, yinghai, x86, linux-mm, linux-kernel, linuxppc-dev,
	linux-acpi, linux-s390, linux-sh, linux-ia64, cmetcalf,
	sparclinux

On 12/26/2012 10:49 AM, Tang Chen wrote:
> On 12/25/2012 04:17 PM, Jianguo Wu wrote:
>>> +
>>> +static void __meminit free_pagetable(struct page *page, int order)
>>> +{
>>> + struct zone *zone;
>>> + bool bootmem = false;
>>> + unsigned long magic;
>>> +
>>> + /* bootmem page has reserved flag */
>>> + if (PageReserved(page)) {
>>> + __ClearPageReserved(page);
>>> + bootmem = true;
>>> +
>>> + magic = (unsigned long)page->lru.next;
>>> + if (magic == SECTION_INFO || magic == MIX_SECTION_INFO)

And also, I think we don't need to check MIX_SECTION_INFO since it is
for the pageblock_flags, not the memmap in the section.

Thanks. :)

>>> + put_page_bootmem(page);
>>
>> Hi Tang,
>>
>> For removing memmap of sparse-vmemmap, in cpu_has_pse case, if magic
>> == SECTION_INFO,
>> the order will be get_order(PMD_SIZE), so we need a loop here to put
>> all the 512 pages.
>>
> Hi Wu,
>
> Thanks for reminding me that. I truly missed it.
>
> And since in register_page_bootmem_info_section(), a whole memory
> section will be set as SECTION_INFO, I think we don't need to check
> the page magic one by one, just the first one is enough. :)
>
> I will fix it, thanks. :)
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>



* Re: [PATCH v5 08/14] memory-hotplug: Common APIs to support page tables hot-remove
  2012-12-26  3:11       ` Tang Chen
@ 2012-12-26  3:19         ` Tang Chen
  0 siblings, 0 replies; 47+ messages in thread
From: Tang Chen @ 2012-12-26  3:19 UTC (permalink / raw)
  To: Jianguo Wu
  Cc: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wency, hpa, linfeng, laijs,
	mgorman, yinghai, x86, linux-mm, linux-kernel, linuxppc-dev,
	linux-acpi, linux-s390, linux-sh, linux-ia64, cmetcalf,
	sparclinux

On 12/26/2012 11:11 AM, Tang Chen wrote:
> On 12/26/2012 10:49 AM, Tang Chen wrote:
>> On 12/25/2012 04:17 PM, Jianguo Wu wrote:
>>>> +
>>>> +static void __meminit free_pagetable(struct page *page, int order)
>>>> +{
>>>> + struct zone *zone;
>>>> + bool bootmem = false;
>>>> + unsigned long magic;
>>>> +
>>>> + /* bootmem page has reserved flag */
>>>> + if (PageReserved(page)) {
>>>> + __ClearPageReserved(page);
>>>> + bootmem = true;
>>>> +
>>>> + magic = (unsigned long)page->lru.next;
>>>> + if (magic == SECTION_INFO || magic == MIX_SECTION_INFO)
>
> And also, I think we don't need to check MIX_SECTION_INFO since it is
> for the pageblock_flags, not the memmap in the section.

Oh, no :)

We also need to check MIX_SECTION_INFO because we set pgd, pud, pmd
pages as MIX_SECTION_INFO in register_page_bootmem_memmap() in patch6.

Thanks. :)

>
> Thanks. :)
>
>>>> + put_page_bootmem(page);
>>>
>>> Hi Tang,
>>>
>>> For removing memmap of sparse-vmemmap, in cpu_has_pse case, if magic
>>> == SECTION_INFO,
>>> the order will be get_order(PMD_SIZE), so we need a loop here to put
>>> all the 512 pages.
>>>
>> Hi Wu,
>>
>> Thanks for reminding me that. I truly missed it.
>>
>> And since in register_page_bootmem_info_section(), a whole memory
>> section will be set as SECTION_INFO, I think we don't need to check
>> the page magic one by one, just the first one is enough. :)
>>
>> I will fix it, thanks. :)
>>
>



* Re: [PATCH v5 03/14] memory-hotplug: remove redundant codes
  2012-12-24 12:09 ` [PATCH v5 03/14] memory-hotplug: remove redundant codes Tang Chen
@ 2012-12-26  3:20   ` Kamezawa Hiroyuki
  2012-12-27  3:09     ` Tang Chen
  0 siblings, 1 reply; 47+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-26  3:20 UTC (permalink / raw)
  To: Tang Chen
  Cc: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, x86, linux-mm, linux-kernel,
	linuxppc-dev, linux-acpi, linux-s390, linux-sh, linux-ia64,
	cmetcalf, sparclinux

(2012/12/24 21:09), Tang Chen wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> offlining memory blocks and checking whether memory blocks are offlined
> are very similar. This patch introduces a new function to remove
> redundant codes.
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> ---
>   mm/memory_hotplug.c |  101 ++++++++++++++++++++++++++++-----------------------
>   1 files changed, 55 insertions(+), 46 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index d43d97b..dbb04d8 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1381,20 +1381,14 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>   	return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ);
>   }
>   
> -int remove_memory(u64 start, u64 size)

please add an explanation of this function here. If (*func) returns a value
other than 0, this function will fail and return the callback's return
value... right?


> +static int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
> +		void *arg, int (*func)(struct memory_block *, void *))
>   {
>   	struct memory_block *mem = NULL;
>   	struct mem_section *section;
> -	unsigned long start_pfn, end_pfn;
>   	unsigned long pfn, section_nr;
>   	int ret;
> -	int return_on_error = 0;
> -	int retry = 0;
> -
> -	start_pfn = PFN_DOWN(start);
> -	end_pfn = start_pfn + PFN_DOWN(size);
>   
> -repeat:

Shouldn't we check that the lock is held here? (VM_BUG_ON(!mutex_is_locked(&mem_hotplug_mutex));)


>   	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>   		section_nr = pfn_to_section_nr(pfn);
>   		if (!present_section_nr(section_nr))
> @@ -1411,22 +1405,61 @@ repeat:
>   		if (!mem)
>   			continue;
>   
> -		ret = offline_memory_block(mem);
> +		ret = func(mem, arg);
>   		if (ret) {
> -			if (return_on_error) {
> -				kobject_put(&mem->dev.kobj);
> -				return ret;
> -			} else {
> -				retry = 1;
> -			}
> +			kobject_put(&mem->dev.kobj);
> +			return ret;
>   		}
>   	}
>   
>   	if (mem)
>   		kobject_put(&mem->dev.kobj);
>   
> -	if (retry) {
> -		return_on_error = 1;
> +	return 0;
> +}
> +
> +static int offline_memory_block_cb(struct memory_block *mem, void *arg)
> +{
> +	int *ret = arg;
> +	int error = offline_memory_block(mem);
> +
> +	if (error != 0 && *ret == 0)
> +		*ret = error;
> +
> +	return 0;

Always returns 0 and runs through all mem blocks for scan-and-retry, right?
You need an explanation here!


> +}
> +
> +static int is_memblock_offlined_cb(struct memory_block *mem, void *arg)
> +{
> +	int ret = !is_memblock_offlined(mem);
> +
> +	if (unlikely(ret))
> +		pr_warn("removing memory fails, because memory "
> +			"[%#010llx-%#010llx] is onlined\n",
> +			PFN_PHYS(section_nr_to_pfn(mem->start_section_nr)),
> +			PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1))-1);
> +
> +	return ret;
> +}
> +
> +int remove_memory(u64 start, u64 size)
> +{
> +	unsigned long start_pfn, end_pfn;
> +	int ret = 0;
> +	int retry = 1;
> +
> +	start_pfn = PFN_DOWN(start);
> +	end_pfn = start_pfn + PFN_DOWN(size);
> +
> +repeat:

please explain why you repeat here.

> +	walk_memory_range(start_pfn, end_pfn, &ret,
> +			  offline_memory_block_cb);
> +	if (ret) {
> +		if (!retry)
> +			return ret;
> +
> +		retry = 0;
> +		ret = 0;
>   		goto repeat;
>   	}
>   
> @@ -1444,37 +1477,13 @@ repeat:
>   	 * memory blocks are offlined.
>   	 */
>   
> -	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
> -		section_nr = pfn_to_section_nr(pfn);
> -		if (!present_section_nr(section_nr))
> -			continue;
> -
> -		section = __nr_to_section(section_nr);
> -		/* same memblock? */
> -		if (mem)
> -			if ((section_nr >= mem->start_section_nr) &&
> -			    (section_nr <= mem->end_section_nr))
> -				continue;
> -
> -		mem = find_memory_block_hinted(section, mem);
> -		if (!mem)
> -			continue;
> -
> -		ret = is_memblock_offlined(mem);
> -		if (!ret) {
> -			pr_warn("removing memory fails, because memory "
> -				"[%#010llx-%#010llx] is onlined\n",
> -				PFN_PHYS(section_nr_to_pfn(mem->start_section_nr)),
> -				PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1)) - 1);
> -
> -			kobject_put(&mem->dev.kobj);
> -			unlock_memory_hotplug();
> -			return ret;
> -		}

please explain what you do here: confirming all memory blocks are offlined
before returning 0... right?

> +	ret = walk_memory_range(start_pfn, end_pfn, NULL,
> +				is_memblock_offlined_cb);
> +	if (ret) {
> +		unlock_memory_hotplug();
> +		return ret;
>   	}
>   
> -	if (mem)
> -		kobject_put(&mem->dev.kobj);
>   	unlock_memory_hotplug();
>   
>   	return 0;
> 

Thanks,
-Kame



* Re: [PATCH v5 06/14] memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap
  2012-12-25  8:09   ` Jianguo Wu
@ 2012-12-26  3:21     ` Tang Chen
  0 siblings, 0 replies; 47+ messages in thread
From: Tang Chen @ 2012-12-26  3:21 UTC (permalink / raw)
  To: Jianguo Wu
  Cc: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wency, hpa, linfeng, laijs,
	mgorman, yinghai, x86, linux-mm, linux-kernel, linuxppc-dev,
	linux-acpi, linux-s390, linux-sh, linux-ia64, cmetcalf,
	sparclinux

On 12/25/2012 04:09 PM, Jianguo Wu wrote:
>> +
>> +		if (!cpu_has_pse) {
>> +			next = (addr + PAGE_SIZE) & PAGE_MASK;
>> +			pmd = pmd_offset(pud, addr);
>> +			if (pmd_none(*pmd))
>> +				continue;
>> +			get_page_bootmem(section_nr, pmd_page(*pmd),
>> +					 MIX_SECTION_INFO);
>> +
>> +			pte = pte_offset_kernel(pmd, addr);
>> +			if (pte_none(*pte))
>> +				continue;
>> +			get_page_bootmem(section_nr, pte_page(*pte),
>> +					 SECTION_INFO);
>> +		} else {
>> +			next = pmd_addr_end(addr, end);
>> +
>> +			pmd = pmd_offset(pud, addr);
>> +			if (pmd_none(*pmd))
>> +				continue;
>> +			get_page_bootmem(section_nr, pmd_page(*pmd),
>> +					 SECTION_INFO);
>
> Hi Tang,
> 	In this case, pmd maps 512 pages, but you only call get_page_bootmem() on the first page.
> I think get_page_bootmem() should be called on the whole 512 pages, what do you think?
>
Hi Wu,

Yes, thanks. I will fix it. :)

Thanks. :)


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v5 04/14] memory-hotplug: remove /sys/firmware/memmap/X sysfs
  2012-12-24 12:09 ` [PATCH v5 04/14] memory-hotplug: remove /sys/firmware/memmap/X sysfs Tang Chen
@ 2012-12-26  3:30   ` Kamezawa Hiroyuki
  2012-12-27  3:09     ` Tang Chen
  0 siblings, 1 reply; 47+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-26  3:30 UTC (permalink / raw)
  To: Tang Chen
  Cc: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, x86, linux-mm, linux-kernel,
	linuxppc-dev, linux-acpi, linux-s390, linux-sh, linux-ia64,
	cmetcalf, sparclinux

(2012/12/24 21:09), Tang Chen wrote:
> From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> 
> When (hot)adding memory into system, /sys/firmware/memmap/X/{end, start, type}
> sysfs files are created. But there is no code to remove these files. The patch
> implements the function to remove them.
> 
> Note: The code does not free firmware_map_entry which is allocated by bootmem.
>        So the patch causes a memory leak. But I think the leaked memory is
>        very small, and it does not affect the system.
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> ---
>   drivers/firmware/memmap.c    |   98 +++++++++++++++++++++++++++++++++++++++++-
>   include/linux/firmware-map.h |    6 +++
>   mm/memory_hotplug.c          |    5 ++-
>   3 files changed, 106 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/firmware/memmap.c b/drivers/firmware/memmap.c
> index 90723e6..49be12a 100644
> --- a/drivers/firmware/memmap.c
> +++ b/drivers/firmware/memmap.c
> @@ -21,6 +21,7 @@
>   #include <linux/types.h>
>   #include <linux/bootmem.h>
>   #include <linux/slab.h>
> +#include <linux/mm.h>
>   
>   /*
>    * Data types ------------------------------------------------------------------
> @@ -41,6 +42,7 @@ struct firmware_map_entry {
>   	const char		*type;	/* type of the memory range */
>   	struct list_head	list;	/* entry for the linked list */
>   	struct kobject		kobj;   /* kobject for each entry */
> +	unsigned int		bootmem:1; /* allocated from bootmem */
>   };

Can't we detect whether the object was allocated from slab or bootmem ?

Hm, for example,

    PageReserved(virt_to_page(address_of_obj)) ?
    PageSlab(virt_to_page(address_of_obj)) ?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v5 05/14] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture
  2012-12-24 12:09 ` [PATCH v5 05/14] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture Tang Chen
@ 2012-12-26  3:37   ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 47+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-26  3:37 UTC (permalink / raw)
  To: Tang Chen
  Cc: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, x86, linux-mm, linux-kernel,
	linuxppc-dev, linux-acpi, linux-s390, linux-sh, linux-ia64,
	cmetcalf, sparclinux

(2012/12/24 21:09), Tang Chen wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> For removing memory, we need to remove the page tables. But this depends
> on the architecture. So the patch introduces arch_remove_memory() for
> removing page tables. For now it only calls __remove_pages().
> 
> Note: __remove_pages() is not implemented for some architectures
>        (I don't know how to implement it for s390).
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>

Then, the remove code will be symmetric to the add code.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v5 07/14] memory-hotplug: move pgdat_resize_lock into sparse_remove_one_section()
  2012-12-24 12:09 ` [PATCH v5 07/14] memory-hotplug: move pgdat_resize_lock into sparse_remove_one_section() Tang Chen
@ 2012-12-26  3:47   ` Kamezawa Hiroyuki
  2012-12-26  6:20     ` Tang Chen
  0 siblings, 1 reply; 47+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-26  3:47 UTC (permalink / raw)
  To: Tang Chen
  Cc: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, x86, linux-mm, linux-kernel,
	linuxppc-dev, linux-acpi, linux-s390, linux-sh, linux-ia64,
	cmetcalf, sparclinux

(2012/12/24 21:09), Tang Chen wrote:
> In __remove_section(), we locked pgdat_resize_lock when calling
> sparse_remove_one_section(). This lock will disable irq. But we don't need
> to lock the whole function. If we do some work to free pagetables in
> free_section_usemap(), we need to call flush_tlb_all(), which needs
> irq enabled. Otherwise the WARN_ON_ONCE() in smp_call_function_many()
> will be triggered.
> 
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>

If this is a bug fix, a call trace in your log and BUGFIX or -fix- in the patch
title would be appreciated, I think.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v5 14/14] memory-hotplug: free node_data when a node is offlined
  2012-12-24 12:09 ` [PATCH v5 14/14] memory-hotplug: free node_data when a node is offlined Tang Chen
@ 2012-12-26  3:55   ` Kamezawa Hiroyuki
  2012-12-27 12:16     ` Wen Congyang
  0 siblings, 1 reply; 47+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-26  3:55 UTC (permalink / raw)
  To: Tang Chen
  Cc: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, x86, linux-mm, linux-kernel,
	linuxppc-dev, linux-acpi, linux-s390, linux-sh, linux-ia64,
	cmetcalf, sparclinux

(2012/12/24 21:09), Tang Chen wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> We call hotadd_new_pgdat() to allocate memory to store node_data. So we
> should free it when removing a node.
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>

I'm sorry but is it safe to remove pgdat ? Are all zone caches and zonelists
properly cleared/rebuilt in a synchronous way ? And are no threads visiting
zones in vmscan.c ?

Thanks,
-Kame

> ---
>   mm/memory_hotplug.c |   20 +++++++++++++++++++-
>   1 files changed, 19 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index f8a1d2f..447fa24 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1680,9 +1680,12 @@ static int check_cpu_on_node(void *data)
>   /* offline the node if all memory sections of this node are removed */
>   static void try_offline_node(int nid)
>   {
> +	pg_data_t *pgdat = NODE_DATA(nid);
>   	unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn;
> -	unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages;
> +	unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
>   	unsigned long pfn;
> +	struct page *pgdat_page = virt_to_page(pgdat);
> +	int i;
>   
>   	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>   		unsigned long section_nr = pfn_to_section_nr(pfn);
> @@ -1709,6 +1712,21 @@ static void try_offline_node(int nid)
>   	 */
>   	node_set_offline(nid);
>   	unregister_one_node(nid);
> +
> +	if (!PageSlab(pgdat_page) && !PageCompound(pgdat_page))
> +		/* node data is allocated from boot memory */
> +		return;
> +
> +	/* free waittable in each zone */
> +	for (i = 0; i < MAX_NR_ZONES; i++) {
> +		struct zone *zone = pgdat->node_zones + i;
> +
> +		if (zone->wait_table)
> +			vfree(zone->wait_table);
> +	}
> +
> +	arch_refresh_nodedata(nid, NULL);
> +	arch_free_nodedata(pgdat);
>   }
>   
>   int __ref remove_memory(int nid, u64 start, u64 size)
> 



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v5 07/14] memory-hotplug: move pgdat_resize_lock into sparse_remove_one_section()
  2012-12-26  3:47   ` Kamezawa Hiroyuki
@ 2012-12-26  6:20     ` Tang Chen
  0 siblings, 0 replies; 47+ messages in thread
From: Tang Chen @ 2012-12-26  6:20 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, x86, linux-mm, linux-kernel,
	linuxppc-dev, linux-acpi, linux-s390, linux-sh, linux-ia64,
	cmetcalf, sparclinux

On 12/26/2012 11:47 AM, Kamezawa Hiroyuki wrote:
> (2012/12/24 21:09), Tang Chen wrote:
>> In __remove_section(), we locked pgdat_resize_lock when calling
>> sparse_remove_one_section(). This lock will disable irq. But we don't need
>> to lock the whole function. If we do some work to free pagetables in
>> free_section_usemap(), we need to call flush_tlb_all(), which needs
>> irq enabled. Otherwise the WARN_ON_ONCE() in smp_call_function_many()
>> will be triggered.
>>
>> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
>> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> 
> If this is a bug fix, call-trace in your log and BUGFIX or -fix- in patch title
> will be appreciated, I think.
> 
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
Hi Kamezawa-san,

Thanks for the reviewing.

I don't think this would be a bug. It is OK to lock the whole
sparse_remove_one_section() if there is no tlb flushing in free_section_usemap().

But we need to flush tlb in free_section_usemap(), so we need to take
free_section_usemap() out of the lock. :)

I added the call trace to the patch so that people can review it more
easily.

And here is the call trace for this version:

[  454.796248] ------------[ cut here ]------------
[  454.851408] WARNING: at kernel/smp.c:461 smp_call_function_many+0xbd/0x260()
[  454.935620] Hardware name: PRIMEQUEST 1800E
......
[  455.652201] Call Trace:
[  455.681391]  [<ffffffff8106e73f>] warn_slowpath_common+0x7f/0xc0
[  455.753151]  [<ffffffff810560a0>] ? leave_mm+0x50/0x50
[  455.814527]  [<ffffffff8106e79a>] warn_slowpath_null+0x1a/0x20
[  455.884208]  [<ffffffff810e7a9d>] smp_call_function_many+0xbd/0x260
[  455.959082]  [<ffffffff810e7ecb>] smp_call_function+0x3b/0x50
[  456.027722]  [<ffffffff810560a0>] ? leave_mm+0x50/0x50
[  456.089098]  [<ffffffff810e7f4b>] on_each_cpu+0x3b/0xc0
[  456.151512]  [<ffffffff81055f0c>] flush_tlb_all+0x1c/0x20
[  456.216004]  [<ffffffff8104f8de>] remove_pagetable+0x14e/0x1d0
[  456.285683]  [<ffffffff8104f978>] vmemmap_free+0x18/0x20
[  456.349139]  [<ffffffff811b8797>] sparse_remove_one_section+0xf7/0x100
[  456.427126]  [<ffffffff811c5fc2>] __remove_section+0xa2/0xb0
[  456.494726]  [<ffffffff811c6070>] __remove_pages+0xa0/0xd0
[  456.560258]  [<ffffffff81669c7b>] arch_remove_memory+0x6b/0xc0
[  456.629937]  [<ffffffff8166ad28>] remove_memory+0xb8/0xf0
[  456.694431]  [<ffffffff813e686f>] acpi_memory_device_remove+0x53/0x96
[  456.771379]  [<ffffffff813b33c4>] acpi_device_remove+0x90/0xb2
[  456.841059]  [<ffffffff8144b02c>] __device_release_driver+0x7c/0xf0
[  456.915928]  [<ffffffff8144b1af>] device_release_driver+0x2f/0x50
[  456.988719]  [<ffffffff813b4476>] acpi_bus_remove+0x32/0x6d
[  457.055285]  [<ffffffff813b4542>] acpi_bus_trim+0x91/0x102
[  457.120814]  [<ffffffff813b463b>] acpi_bus_hot_remove_device+0x88/0x16b
[  457.199840]  [<ffffffff813afda7>] acpi_os_execute_deferred+0x27/0x34
[  457.275756]  [<ffffffff81091ece>] process_one_work+0x20e/0x5c0
[  457.345434]  [<ffffffff81091e5f>] ? process_one_work+0x19f/0x5c0
[  457.417190]  [<ffffffff813afd80>] ? acpi_os_wait_events_complete+0x23/0x23
[  457.499332]  [<ffffffff81093f6e>] worker_thread+0x12e/0x370
[  457.565896]  [<ffffffff81093e40>] ? manage_workers+0x180/0x180
[  457.635574]  [<ffffffff8109a09e>] kthread+0xee/0x100
[  457.694871]  [<ffffffff810dfaf9>] ? __lock_release+0x129/0x190
[  457.764552]  [<ffffffff81099fb0>] ? __init_kthread_worker+0x70/0x70
[  457.839427]  [<ffffffff81690aac>] ret_from_fork+0x7c/0xb0
[  457.903914]  [<ffffffff81099fb0>] ? __init_kthread_worker+0x70/0x70
[  457.978784] ---[ end trace 25e85300f542aa01 ]---

Thanks. :)




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v5 03/14] memory-hotplug: remove redundant codes
  2012-12-26  3:20   ` Kamezawa Hiroyuki
@ 2012-12-27  3:09     ` Tang Chen
  0 siblings, 0 replies; 47+ messages in thread
From: Tang Chen @ 2012-12-27  3:09 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, x86, linux-mm, linux-kernel,
	linuxppc-dev, linux-acpi, linux-s390, linux-sh, linux-ia64,
	cmetcalf, sparclinux

Hi Kamezawa-san,

Thanks for the reviewing. Please see below. :)

On 12/26/2012 11:20 AM, Kamezawa Hiroyuki wrote:
> (2012/12/24 21:09), Tang Chen wrote:
>> From: Wen Congyang <wency@cn.fujitsu.com>
>>
>> offlining memory blocks and checking whether memory blocks are offlined
>> are very similar. This patch introduces a new function to remove
>> redundant codes.
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> ---
>>    mm/memory_hotplug.c |  101 ++++++++++++++++++++++++++++-----------------------
>>    1 files changed, 55 insertions(+), 46 deletions(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index d43d97b..dbb04d8 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1381,20 +1381,14 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>>    	return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ);
>>    }
>>
>> -int remove_memory(u64 start, u64 size)
> 
> please add an explanation of this function here. If (*func) returns a value other than 0,
> this function will fail and return the callback's return value ... right ?
> 

Yes, it will always return the func()'s return value. I'll add the
comment here. :)

> 
>> +static int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
>> +		void *arg, int (*func)(struct memory_block *, void *))
>>    {
>>    	struct memory_block *mem = NULL;
>>    	struct mem_section *section;
>> -	unsigned long start_pfn, end_pfn;
>>    	unsigned long pfn, section_nr;
>>    	int ret;
>> -	int return_on_error = 0;
>> -	int retry = 0;
>> -
>> -	start_pfn = PFN_DOWN(start);
>> -	end_pfn = start_pfn + PFN_DOWN(size);
>>
>> -repeat:
> 
> Shouldn't we check that the lock is held here ? VM_BUG_ON(!mutex_is_locked(&mem_hotplug_mutex));

Well, I think that after applying this patch, walk_memory_range() will be
a separate function, and it can be used somewhere else where we don't
hold this lock. But for now, we can do this check. :)

> 
> 
>>    	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>>    		section_nr = pfn_to_section_nr(pfn);
>>    		if (!present_section_nr(section_nr))
>> @@ -1411,22 +1405,61 @@ repeat:
>>    		if (!mem)
>>    			continue;
>>
>> -		ret = offline_memory_block(mem);
>> +		ret = func(mem, arg);
>>    		if (ret) {
>> -			if (return_on_error) {
>> -				kobject_put(&mem->dev.kobj);
>> -				return ret;
>> -			} else {
>> -				retry = 1;
>> -			}
>> +			kobject_put(&mem->dev.kobj);
>> +			return ret;
>>    		}
>>    	}
>>
>>    	if (mem)
>>    		kobject_put(&mem->dev.kobj);
>>
>> -	if (retry) {
>> -		return_on_error = 1;
>> +	return 0;
>> +}
>> +
>> +static int offline_memory_block_cb(struct memory_block *mem, void *arg)
>> +{
>> +	int *ret = arg;
>> +	int error = offline_memory_block(mem);
>> +
>> +	if (error != 0 && *ret == 0)
>> +		*ret = error;
>> +
>> +	return 0;
> 
> Always returns 0 and runs through all mem blocks for scan-and-retry, right ?
> You need an explanation here !

Yes, I'll add the comment. :)

> 
> 
>> +}
>> +
>> +static int is_memblock_offlined_cb(struct memory_block *mem, void *arg)
>> +{
>> +	int ret = !is_memblock_offlined(mem);
>> +
>> +	if (unlikely(ret))
>> +		pr_warn("removing memory fails, because memory "
>> +			"[%#010llx-%#010llx] is onlined\n",
>> +			PFN_PHYS(section_nr_to_pfn(mem->start_section_nr)),
>> +			PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1)) - 1);
>> +
>> +	return ret;
>> +}
>> +
>> +int remove_memory(u64 start, u64 size)
>> +{
>> +	unsigned long start_pfn, end_pfn;
>> +	int ret = 0;
>> +	int retry = 1;
>> +
>> +	start_pfn = PFN_DOWN(start);
>> +	end_pfn = start_pfn + PFN_DOWN(size);
>> +
>> +repeat:
> 
> please explain why you repeat here.

This repeat is added in patch 1. It aims to solve the problem we were
talking about in patch 1. I'll add the comment here. :)

> 
>> +	walk_memory_range(start_pfn, end_pfn, &ret,
>> +			  offline_memory_block_cb);
>> +	if (ret) {
>> +		if (!retry)
>> +			return ret;
>> +
>> +		retry = 0;
>> +		ret = 0;
>>    		goto repeat;
>>    	}
>>
>> @@ -1444,37 +1477,13 @@ repeat:
>>    	 * memory blocks are offlined.
>>    	 */
>>
>> -	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>> -		section_nr = pfn_to_section_nr(pfn);
>> -		if (!present_section_nr(section_nr))
>> -			continue;
>> -
>> -		section = __nr_to_section(section_nr);
>> -		/* same memblock? */
>> -		if (mem)
>> -		if ((section_nr >= mem->start_section_nr) &&
>> -		    (section_nr <= mem->end_section_nr))
>> -				continue;
>> -
>> -		mem = find_memory_block_hinted(section, mem);
>> -		if (!mem)
>> -			continue;
>> -
>> -		ret = is_memblock_offlined(mem);
>> -		if (!ret) {
>> -			pr_warn("removing memory fails, because memory "
>> -				"[%#010llx-%#010llx] is onlined\n",
>> -				PFN_PHYS(section_nr_to_pfn(mem->start_section_nr)),
>> -				PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1)) - 1);
>> -
>> -			kobject_put(&mem->dev.kobj);
>> -			unlock_memory_hotplug();
>> -			return ret;
>> -		}
> 
> please explain what you do here. Confirming all memory blocks are offlined
> before returning 0 ... right ?

Will be added. :)

Thanks. :)

> 
>> +	ret = walk_memory_range(start_pfn, end_pfn, NULL,
>> +				is_memblock_offlined_cb);
>> +	if (ret) {
>> +		unlock_memory_hotplug();
>> +		return ret;
>>    	}
>>
>> -	if (mem)
>> -		kobject_put(&mem->dev.kobj);
>>    	unlock_memory_hotplug();
>>
>>    	return 0;
>>
> 
> Thanks,
> -Kame
> 
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v5 04/14] memory-hotplug: remove /sys/firmware/memmap/X sysfs
  2012-12-26  3:30   ` Kamezawa Hiroyuki
@ 2012-12-27  3:09     ` Tang Chen
  2013-01-02 14:24       ` Christoph Lameter
  0 siblings, 1 reply; 47+ messages in thread
From: Tang Chen @ 2012-12-27  3:09 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, x86, linux-mm, linux-kernel,
	linuxppc-dev, linux-acpi, linux-s390, linux-sh, linux-ia64,
	cmetcalf, sparclinux

On 12/26/2012 11:30 AM, Kamezawa Hiroyuki wrote:
>> @@ -41,6 +42,7 @@ struct firmware_map_entry {
>>    	const char		*type;	/* type of the memory range */
>>    	struct list_head	list;	/* entry for the linked list */
>>    	struct kobject		kobj;   /* kobject for each entry */
>> +	unsigned int		bootmem:1; /* allocated from bootmem */
>>    };
> 
> Can't we detect from which the object is allocated from, slab or bootmem ?
> 
> Hm, for example,
> 
>      PageReserved(virt_to_page(address_of_obj)) ?
>      PageSlab(virt_to_page(address_of_obj)) ?
> 

Hi Kamezawa-san,

I think we can detect it without a new member; the bootmem:1 member
is just for convenience. I will remove it. :)

Thanks. :)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v5 02/14] memory-hotplug: check whether all memory blocks are offlined or not when removing memory
  2012-12-26  3:10   ` Kamezawa Hiroyuki
@ 2012-12-27  3:10     ` Tang Chen
  0 siblings, 0 replies; 47+ messages in thread
From: Tang Chen @ 2012-12-27  3:10 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: akpm, rientjes, liuj97, len.brown, benh, paulus, cl, minchan.kim,
	kosaki.motohiro, isimatu.yasuaki, wujianguo, wency, hpa, linfeng,
	laijs, mgorman, yinghai, x86, linux-mm, linux-kernel,
	linuxppc-dev, linux-acpi, linux-s390, linux-sh, linux-ia64,
	cmetcalf, sparclinux

On 12/26/2012 11:10 AM, Kamezawa Hiroyuki wrote:
> (2012/12/24 21:09), Tang Chen wrote:
>> From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>
>> We remove the memory like this:
>> 1. lock memory hotplug
>> 2. offline a memory block
>> 3. unlock memory hotplug
>> 4. repeat 1-3 to offline all memory blocks
>> 5. lock memory hotplug
>> 6. remove memory(TODO)
>> 7. unlock memory hotplug
>>
>> All memory blocks must be offlined before removing memory. But we don't hold
>> the lock during the whole operation. So we should check whether all memory blocks
>> are offlined before step 6. Otherwise, the kernel may panic.
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> 
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> a nitpick below.
> 
>> +
>> +	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
> 
> I prefer adding mem = NULL at the start of this for().

Hi Kamezawa-san,

Added, thanks. :)

> 
>> +		section_nr = pfn_to_section_nr(pfn);
>> +		if (!present_section_nr(section_nr))
>> +			continue;
>> +
>> +		section = __nr_to_section(section_nr);
>> +		/* same memblock? */
>> +		if (mem)
>> +		if ((section_nr >= mem->start_section_nr) &&
>> +		    (section_nr <= mem->end_section_nr))
>> +				continue;
>> +
> 
> Thanks,
> -Kame
> 
> 
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v5 14/14] memory-hotplug: free node_data when a node is offlined
  2012-12-26  3:55   ` Kamezawa Hiroyuki
@ 2012-12-27 12:16     ` Wen Congyang
  2012-12-28  0:28       ` Kamezawa Hiroyuki
  0 siblings, 1 reply; 47+ messages in thread
From: Wen Congyang @ 2012-12-27 12:16 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Tang Chen, akpm, rientjes, liuj97, len.brown, benh, paulus, cl,
	minchan.kim, kosaki.motohiro, isimatu.yasuaki, wujianguo, hpa,
	linfeng, laijs, mgorman, yinghai, x86, linux-mm, linux-kernel,
	linuxppc-dev, linux-acpi, linux-s390, linux-sh, linux-ia64,
	cmetcalf, sparclinux

At 12/26/2012 11:55 AM, Kamezawa Hiroyuki Wrote:
> (2012/12/24 21:09), Tang Chen wrote:
>> From: Wen Congyang <wency@cn.fujitsu.com>
>>
>> We call hotadd_new_pgdat() to allocate memory to store node_data. So we
>> should free it when removing a node.
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> 
> I'm sorry but is it safe to remove pgdat ? Are all zone caches and zonelists
> properly cleared/rebuilt in a synchronous way ? And are no threads visiting
> zones in vmscan.c ?

We rebuild the zonelists when a zone has no memory left after offlining some pages.

Thanks
Wen Congyang

> 
> Thanks,
> -Kame
> 
>> ---
>>   mm/memory_hotplug.c |   20 +++++++++++++++++++-
>>   1 files changed, 19 insertions(+), 1 deletions(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index f8a1d2f..447fa24 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1680,9 +1680,12 @@ static int check_cpu_on_node(void *data)
>>   /* offline the node if all memory sections of this node are removed */
>>   static void try_offline_node(int nid)
>>   {
>> +	pg_data_t *pgdat = NODE_DATA(nid);
>>   	unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn;
>> -	unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages;
>> +	unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
>>   	unsigned long pfn;
>> +	struct page *pgdat_page = virt_to_page(pgdat);
>> +	int i;
>>   
>>   	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>>   		unsigned long section_nr = pfn_to_section_nr(pfn);
>> @@ -1709,6 +1712,21 @@ static void try_offline_node(int nid)
>>   	 */
>>   	node_set_offline(nid);
>>   	unregister_one_node(nid);
>> +
>> +	if (!PageSlab(pgdat_page) && !PageCompound(pgdat_page))
>> +		/* node data is allocated from boot memory */
>> +		return;
>> +
>> +	/* free waittable in each zone */
>> +	for (i = 0; i < MAX_NR_ZONES; i++) {
>> +		struct zone *zone = pgdat->node_zones + i;
>> +
>> +		if (zone->wait_table)
>> +			vfree(zone->wait_table);
>> +	}
>> +
>> +	arch_refresh_nodedata(nid, NULL);
>> +	arch_free_nodedata(pgdat);
>>   }
>>   
>>   int __ref remove_memory(int nid, u64 start, u64 size)
>>
> 
> 
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v5 14/14] memory-hotplug: free node_data when a node is offlined
  2012-12-27 12:16     ` Wen Congyang
@ 2012-12-28  0:28       ` Kamezawa Hiroyuki
  2012-12-30  6:02         ` Wen Congyang
  0 siblings, 1 reply; 47+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-28  0:28 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Tang Chen, akpm, rientjes, liuj97, len.brown, benh, paulus, cl,
	minchan.kim, kosaki.motohiro, isimatu.yasuaki, wujianguo, hpa,
	linfeng, laijs, mgorman, yinghai, x86, linux-mm, linux-kernel,
	linuxppc-dev, linux-acpi, linux-s390, linux-sh, linux-ia64,
	cmetcalf, sparclinux

(2012/12/27 21:16), Wen Congyang wrote:
> At 12/26/2012 11:55 AM, Kamezawa Hiroyuki Wrote:
>> (2012/12/24 21:09), Tang Chen wrote:
>>> From: Wen Congyang <wency@cn.fujitsu.com>
>>>
>>> We call hotadd_new_pgdat() to allocate memory to store node_data. So we
>>> should free it when removing a node.
>>>
>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>
>> I'm sorry but is it safe to remove pgdat ? Are all zone caches and zonelists
>> properly cleared/rebuilt in a synchronous way ? And are no threads visiting
>> zones in vmscan.c ?
> 
> We have rebuilt zonelists when a zone has no memory after offlining some pages.
> 

How do you guarantee that the address of the pgdat/zone is not on the stack of
any kernel thread, or held by other kernel objects, without reference counting
or another syncing method ?


Thanks,
-Kame



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence
  2012-12-26  3:02   ` Kamezawa Hiroyuki
@ 2012-12-30  5:49     ` Wen Congyang
  0 siblings, 0 replies; 47+ messages in thread
From: Wen Congyang @ 2012-12-30  5:49 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Tang Chen, akpm, rientjes, liuj97, len.brown, benh, paulus, cl,
	minchan.kim, kosaki.motohiro, isimatu.yasuaki, wujianguo, hpa,
	linfeng, laijs, mgorman, yinghai, x86, linux-mm, linux-kernel,
	linuxppc-dev, linux-acpi, linux-s390, linux-sh, linux-ia64,
	cmetcalf, sparclinux

At 12/26/2012 11:02 AM, Kamezawa Hiroyuki Wrote:
> (2012/12/24 21:09), Tang Chen wrote:
>> From: Wen Congyang <wency@cn.fujitsu.com>
>>
>> Memory can't be offlined when CONFIG_MEMCG is selected.
>> For example: there is a memory device on node 1. The address range
>> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
>> and memory11 under the directory /sys/devices/system/memory/.
>>
>> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroups
>> when we online pages. When we online memory8, the memory that stores its page
>> cgroups is not provided by this memory device. But when we online memory9, the
>> memory that stores its page cgroups may be provided by memory8. So we can't
>> offline memory8 now. We should offline the memory in the reverse order.
>>
> 
> If memory8 is onlined as NORMAL memory ...right ?

Yes, memory8 is onlined as NORMAL memory. And when we online memory9, we allocate
memory from memory8 to store page cgroup information.

> 
> IIUC, vmalloc() uses __GFP_HIGHMEM but doesn't use __GFP_MOVABLE.
> 
>> When the memory device is hot-removed, we will automatically offline the memory
>> provided by this memory device. But we don't know which memory was onlined first,
>> so offlining memory may fail. In such a case, iterate twice to offline the memory.
>> 1st iterate: offline every non primary memory block.
>> 2nd iterate: offline primary (i.e. first added) memory block.
>>
>> This idea is suggested by KOSAKI Motohiro.
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> 
> I'm not sure but the whole DIMM should be onlined as MOVABLE mem ?

If the whole DIMM is onlined as MOVABLE mem, we can offline it and don't
need to retry.

> 
> Anyway, I agree this kind of retry is required if memory is onlined as NORMAL mem.
> But retry-once is ok ?

I'm not sure, but I think in most cases the user onlines the memory in the order
it was hot-added. So we may always fail the first time, and retrying once can
succeed.

Thanks
Wen Congyang

> 
> Thanks,
> -Kame
> 
>> ---
>>   mm/memory_hotplug.c |   16 ++++++++++++++--
>>   1 files changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index d04ed87..62e04c9 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1388,10 +1388,13 @@ int remove_memory(u64 start, u64 size)
>>   	unsigned long start_pfn, end_pfn;
>>   	unsigned long pfn, section_nr;
>>   	int ret;
>> +	int return_on_error = 0;
>> +	int retry = 0;
>>   
>>   	start_pfn = PFN_DOWN(start);
>>   	end_pfn = start_pfn + PFN_DOWN(size);
>>   
>> +repeat:
>>   	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>>   		section_nr = pfn_to_section_nr(pfn);
>>   		if (!present_section_nr(section_nr))
>> @@ -1410,14 +1413,23 @@ int remove_memory(u64 start, u64 size)
>>   
>>   		ret = offline_memory_block(mem);
>>   		if (ret) {
>> -			kobject_put(&mem->dev.kobj);
>> -			return ret;
>> +			if (return_on_error) {
>> +				kobject_put(&mem->dev.kobj);
>> +				return ret;
>> +			} else {
>> +				retry = 1;
>> +			}
>>   		}
>>   	}
>>   
>>   	if (mem)
>>   		kobject_put(&mem->dev.kobj);
>>   
>> +	if (retry) {
>> +		return_on_error = 1;
>> +		goto repeat;
>> +	}
>> +
>>   	return 0;
>>   }
>>   #else
>>
> 
> 
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence
  2012-12-25  8:35   ` Glauber Costa
@ 2012-12-30  5:58     ` Wen Congyang
  2013-01-09 15:09       ` Glauber Costa
  0 siblings, 1 reply; 47+ messages in thread
From: Wen Congyang @ 2012-12-30  5:58 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Tang Chen, akpm, rientjes, liuj97, len.brown, benh, paulus, cl,
	minchan.kim, kosaki.motohiro, isimatu.yasuaki, wujianguo, hpa,
	linfeng, laijs, mgorman, yinghai, x86, linux-mm, linux-kernel,
	linuxppc-dev, linux-acpi, linux-s390, linux-sh, linux-ia64,
	cmetcalf, sparclinux

At 12/25/2012 04:35 PM, Glauber Costa Wrote:
> On 12/24/2012 04:09 PM, Tang Chen wrote:
>> From: Wen Congyang <wency@cn.fujitsu.com>
>>
>> Memory can't be offlined when CONFIG_MEMCG is selected.
>> For example: there is a memory device on node 1. The address range
>> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
>> and memory11 under the directory /sys/devices/system/memory/.
>>
>> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
>> when we online pages. When we online memory8, the memory stored page cgroup
>> is not provided by this memory device. But when we online memory9, the memory
>> stored page cgroup may be provided by memory8. So we can't offline memory8
>> now. We should offline the memory in the reversed order.
>>
>> When the memory device is hotremoved, we will auto offline memory provided
>> by this memory device. But we don't know which memory is onlined first, so
>> offlining memory may fail. In such case, iterate twice to offline the memory.
>> 1st iterate: offline every non primary memory block.
>> 2nd iterate: offline primary (i.e. first added) memory block.
>>
>> This idea is suggested by KOSAKI Motohiro.
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> 
> Maybe there is something here that I am missing - I admit that I came
> late to this one, but this really sounds like a very ugly hack, that
> really has no place in here.
> 
> Retrying, of course, may make sense, if we have reasonable belief that
> we may now succeed. If this is the case, you need to document - in the
> code - while is that.
> 
> The memcg argument, however, doesn't really cut it. Why can't we make
> all page_cgroup allocations local to the node they are describing? If
> memcg is the culprit here, we should fix it, and not retry. If there is
> still any benefit in retrying, then we retry being very specific about why.

We already try to make all page_cgroup allocations local to the node they
describe. But if the memory being onlined is the first memory in its node, we
have to allocate the page_cgroup from another node.

For example, node 1 has 4 memory blocks (8-11), and we online them from 8 to 11:
1. memory block 8: page_cgroup allocations come from other nodes
2. memory block 9: page_cgroup allocations come from memory block 8

So we should offline memory block 9 before memory block 8. But we don't know in
which order the user onlined the memory blocks.

I think we could modify memcg like this:
allocate the page_cgroup memory from the memory block it describes.

I am not sure it is OK to do so.

Thanks
Wen Congyang




* Re: [PATCH v5 14/14] memory-hotplug: free node_data when a node is offlined
  2012-12-28  0:28       ` Kamezawa Hiroyuki
@ 2012-12-30  6:02         ` Wen Congyang
  2013-01-07  5:30           ` Kamezawa Hiroyuki
  0 siblings, 1 reply; 47+ messages in thread
From: Wen Congyang @ 2012-12-30  6:02 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Tang Chen, akpm, rientjes, liuj97, len.brown, benh, paulus, cl,
	minchan.kim, kosaki.motohiro, isimatu.yasuaki, wujianguo, hpa,
	linfeng, laijs, mgorman, yinghai, x86, linux-mm, linux-kernel,
	linuxppc-dev, linux-acpi, linux-s390, linux-sh, linux-ia64,
	cmetcalf, sparclinux

At 12/28/2012 08:28 AM, Kamezawa Hiroyuki Wrote:
> (2012/12/27 21:16), Wen Congyang wrote:
>> At 12/26/2012 11:55 AM, Kamezawa Hiroyuki Wrote:
>>> (2012/12/24 21:09), Tang Chen wrote:
>>>> From: Wen Congyang <wency@cn.fujitsu.com>
>>>>
>>>> We call hotadd_new_pgdat() to allocate memory to store node_data. So we
>>>> should free it when removing a node.
>>>>
>>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>>
>>> I'm sorry but is it safe to remove pgdat ? All zone cache and zonelists are
>>> properly cleared/rebuilt in a synchronous way ? And no threads are visiting
>>> the zone in vmscan.c ?
>>
>> We have rebuilt zonelists when a zone has no memory after offlining some pages.
>>
> 
> How do you guarantee that the address of pgdat/zone is not on stack of any kernel
> threads or other kernel objects without reference counting or other syncing method ?

There is no way to guarantee this. But the kernel should not use the address of
a pgdat/zone once it is offlined.

Hmm, what about this: reuse the memory when the node is onlined again?

Thanks
Wen Congyang

> 
> 
> Thanks,
> -Kame
> 
> 
> 



* Re: [PATCH v5 04/14] memory-hotplug: remove /sys/firmware/memmap/X sysfs
  2012-12-27  3:09     ` Tang Chen
@ 2013-01-02 14:24       ` Christoph Lameter
  0 siblings, 0 replies; 47+ messages in thread
From: Christoph Lameter @ 2013-01-02 14:24 UTC (permalink / raw)
  To: Tang Chen
  Cc: Kamezawa Hiroyuki, akpm, rientjes, liuj97, len.brown, benh,
	paulus, minchan.kim, kosaki.motohiro, isimatu.yasuaki, wujianguo,
	wency, hpa, linfeng, laijs, mgorman, yinghai, x86, linux-mm,
	linux-kernel, linuxppc-dev, linux-acpi, linux-s390, linux-sh,
	linux-ia64, cmetcalf, sparclinux

On Thu, 27 Dec 2012, Tang Chen wrote:

> On 12/26/2012 11:30 AM, Kamezawa Hiroyuki wrote:
> >> @@ -41,6 +42,7 @@ struct firmware_map_entry {
> >>    	const char		*type;	/* type of the memory range */
> >>    	struct list_head	list;	/* entry for the linked list */
> >>    	struct kobject		kobj;   /* kobject for each entry */
> >> +	unsigned int		bootmem:1; /* allocated from bootmem */
> >>    };
> >
> > Can't we detect from which the object is allocated from, slab or bootmem ?
> >
> > Hm, for example,
> >
> >      PageReserved(virt_to_page(address_of_obj)) ?
> >      PageSlab(virt_to_page(address_of_obj)) ?
> >
>
> Hi Kamezawa-san,
>
> I think we can detect it without a new member. I think bootmem:1 member
> is just for convenience. I think I can remove it. :)

Larger slab allocations may fall back to the page allocator, but then the
slab allocator does not track the allocation. That memory can be freed using
the page allocator.

If you see PageSlab then you can always free it using the slab allocator.
Otherwise the page allocator should work (unless it was some
special-case bootmem allocation).



* Re: [PATCH v5 14/14] memory-hotplug: free node_data when a node is offlined
  2012-12-30  6:02         ` Wen Congyang
@ 2013-01-07  5:30           ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 47+ messages in thread
From: Kamezawa Hiroyuki @ 2013-01-07  5:30 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Tang Chen, akpm, rientjes, liuj97, len.brown, benh, paulus, cl,
	minchan.kim, kosaki.motohiro, isimatu.yasuaki, wujianguo, hpa,
	linfeng, laijs, mgorman, yinghai, x86, linux-mm, linux-kernel,
	linuxppc-dev, linux-acpi, linux-s390, linux-sh, linux-ia64,
	cmetcalf, sparclinux

(2012/12/30 15:02), Wen Congyang wrote:
> At 12/28/2012 08:28 AM, Kamezawa Hiroyuki Wrote:
>> (2012/12/27 21:16), Wen Congyang wrote:
>>> At 12/26/2012 11:55 AM, Kamezawa Hiroyuki Wrote:
>>>> (2012/12/24 21:09), Tang Chen wrote:
>>>>> From: Wen Congyang <wency@cn.fujitsu.com>
>>>>>
>>>>> We call hotadd_new_pgdat() to allocate memory to store node_data. So we
>>>>> should free it when removing a node.
>>>>>
>>>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>>>
>>>> I'm sorry but is it safe to remove pgdat ? All zone cache and zonelists are
>>>> properly cleared/rebuilt in a synchronous way ? And no threads are visiting
>>>> the zone in vmscan.c ?
>>>
>>> We have rebuilt zonelists when a zone has no memory after offlining some pages.
>>>
>>
>> How do you guarantee that the address of pgdat/zone is not on stack of any kernel
>> threads or other kernel objects without reference counting or other syncing method ?
> 
> There is no way to guarantee this. But the kernel should not use the address of
> a pgdat/zone once it is offlined.
> 
> Hmm, what about this: reuse the memory when the node is onlined again?
> 

That's the only way we can go for now. Please don't free it.

Thanks,
-Kame




* Re: [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence
  2012-12-30  5:58     ` Wen Congyang
@ 2013-01-09 15:09       ` Glauber Costa
  2013-01-10  1:38         ` Tang Chen
  2013-02-06  3:07         ` Tang Chen
  0 siblings, 2 replies; 47+ messages in thread
From: Glauber Costa @ 2013-01-09 15:09 UTC (permalink / raw)
  To: Wen Congyang
  Cc: Tang Chen, akpm, rientjes, liuj97, len.brown, benh, paulus, cl,
	minchan.kim, kosaki.motohiro, isimatu.yasuaki, wujianguo, hpa,
	linfeng, laijs, mgorman, yinghai, x86, linux-mm, linux-kernel,
	linuxppc-dev, linux-acpi, linux-s390, linux-sh, linux-ia64,
	cmetcalf, sparclinux, KAMEZAWA Hiroyuki

On 12/30/2012 09:58 AM, Wen Congyang wrote:
> At 12/25/2012 04:35 PM, Glauber Costa Wrote:
>> On 12/24/2012 04:09 PM, Tang Chen wrote:
>>> From: Wen Congyang <wency@cn.fujitsu.com>
>>>
>>> memory can't be offlined when CONFIG_MEMCG is selected.
>>> For example: there is a memory device on node 1. The address range
>>> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
>>> and memory11 under the directory /sys/devices/system/memory/.
>>>
>>> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
>>> when we online pages. When we online memory8, the memory stored page cgroup
>>> is not provided by this memory device. But when we online memory9, the memory
>>> stored page cgroup may be provided by memory8. So we can't offline memory8
>>> now. We should offline the memory in the reversed order.
>>>
>>> When the memory device is hotremoved, we will auto offline memory provided
>>> by this memory device. But we don't know which memory is onlined first, so
>>> offlining memory may fail. In such case, iterate twice to offline the memory.
>>> 1st iterate: offline every non primary memory block.
>>> 2nd iterate: offline primary (i.e. first added) memory block.
>>>
>>> This idea is suggested by KOSAKI Motohiro.
>>>
>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>>
>> Maybe there is something here that I am missing - I admit that I came
>> late to this one, but this really sounds like a very ugly hack, that
>> really has no place in here.
>>
>> Retrying, of course, may make sense, if we have reasonable belief that
>> we may now succeed. If this is the case, you need to document - in the
>> code - while is that.
>>
>> The memcg argument, however, doesn't really cut it. Why can't we make
>> all page_cgroup allocations local to the node they are describing? If
>> memcg is the culprit here, we should fix it, and not retry. If there is
>> still any benefit in retrying, then we retry being very specific about why.
> 
> We try to make all page_cgroup allocations local to the node they are describing
> now. If the memory is the first memory onlined in this node, we will allocate
> it from the other node.
> 
> For example, node1 has 4 memory blocks: 8-11, and we online it from 8 to 11
> 1. memory block 8, page_cgroup allocations are in the other nodes
> 2. memory block 9, page_cgroup allocations are in memory block 8
> 
> So we should offline memory block 9 first. But we don't know in which order
> the user online the memory block.
> 
> I think we can modify memcg like this:
> allocate the memory from the memory block they are describing
> 
> I am not sure it is OK to do so.

I don't see a reason why not.

You would have to tweak a bit the lookup function for page_cgroup, but
assuming you will always have the pfns and limits, it should be easy to do.

I think the only tricky part is that today we have a single
node_page_cgroup, and we would of course have to have one per memory
block. My assumption is that the number of memory blocks is limited and
likely not very big. So even a static array would do.

Kamezawa, do you have any input in here?


* Re: [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence
  2013-01-09 15:09       ` Glauber Costa
@ 2013-01-10  1:38         ` Tang Chen
  2013-02-06  3:07         ` Tang Chen
  1 sibling, 0 replies; 47+ messages in thread
From: Tang Chen @ 2013-01-10  1:38 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Wen Congyang, akpm, rientjes, liuj97, len.brown, benh, paulus,
	cl, minchan.kim, kosaki.motohiro, isimatu.yasuaki, wujianguo,
	hpa, linfeng, laijs, mgorman, yinghai, x86, linux-mm,
	linux-kernel, linuxppc-dev, linux-acpi, linux-s390, linux-sh,
	linux-ia64, cmetcalf, sparclinux, KAMEZAWA Hiroyuki

Hi Glauber,

On 01/09/2013 11:09 PM, Glauber Costa wrote:
>>
>> We try to make all page_cgroup allocations local to the node they are describing
>> now. If the memory is the first memory onlined in this node, we will allocate
>> it from the other node.
>>
>> For example, node1 has 4 memory blocks: 8-11, and we online it from 8 to 11
>> 1. memory block 8, page_cgroup allocations are in the other nodes
>> 2. memory block 9, page_cgroup allocations are in memory block 8
>>
>> So we should offline memory block 9 first. But we don't know in which order
>> the user online the memory block.
>>
>> I think we can modify memcg like this:
>> allocate the memory from the memory block they are describing
>>
>> I am not sure it is OK to do so.
>
> I don't see a reason why not.

I'm not sure, but if we do this, we would introduce a fragment in each
memory block (a memory section, 128MB, right?). Is this a problem when
we use large pages (such as 1GB pages)?

Even if not, will these fragments have any bad effects?

Thanks. :)

>
> You would have to tweak a bit the lookup function for page_cgroup, but
> assuming you will always have the pfns and limits, it should be easy to do.
>
> I think the only tricky part is that today we have a single
> node_page_cgroup, and we would of course have to have one per memory
> block. My assumption is that the number of memory blocks is limited and
> likely not very big. So even a static array would do.
>
> Kamezawa, do you have any input in here?
>



* Re: [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence
  2013-01-09 15:09       ` Glauber Costa
  2013-01-10  1:38         ` Tang Chen
@ 2013-02-06  3:07         ` Tang Chen
  2013-02-06  9:17           ` Tang Chen
  1 sibling, 1 reply; 47+ messages in thread
From: Tang Chen @ 2013-02-06  3:07 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Wen Congyang, akpm, rientjes, liuj97, len.brown, benh, paulus,
	cl, minchan.kim, kosaki.motohiro, isimatu.yasuaki, wujianguo,
	hpa, linfeng, laijs, mgorman, yinghai, x86, linux-mm,
	linux-kernel, linuxppc-dev, linux-acpi, linux-s390, linux-sh,
	linux-ia64, cmetcalf, sparclinux, KAMEZAWA Hiroyuki, Miao Xie

Hi Glauber, all,

An old thing I want to discuss with you. :)

On 01/09/2013 11:09 PM, Glauber Costa wrote:
>>>> memory can't be offlined when CONFIG_MEMCG is selected.
>>>> For example: there is a memory device on node 1. The address range
>>>> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
>>>> and memory11 under the directory /sys/devices/system/memory/.
>>>>
>>>> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
>>>> when we online pages. When we online memory8, the memory stored page cgroup
>>>> is not provided by this memory device. But when we online memory9, the memory
>>>> stored page cgroup may be provided by memory8. So we can't offline memory8
>>>> now. We should offline the memory in the reversed order.
>>>>
>>>> When the memory device is hotremoved, we will auto offline memory provided
>>>> by this memory device. But we don't know which memory is onlined first, so
>>>> offlining memory may fail. In such case, iterate twice to offline the memory.
>>>> 1st iterate: offline every non primary memory block.
>>>> 2nd iterate: offline primary (i.e. first added) memory block.
>>>>
>>>> This idea is suggested by KOSAKI Motohiro.
>>>>
>>>> Signed-off-by: Wen Congyang<wency@cn.fujitsu.com>
>>>
>>> Maybe there is something here that I am missing - I admit that I came
>>> late to this one, but this really sounds like a very ugly hack, that
>>> really has no place in here.
>>>
>>> Retrying, of course, may make sense, if we have reasonable belief that
>>> we may now succeed. If this is the case, you need to document - in the
>>> code - while is that.
>>>
>>> The memcg argument, however, doesn't really cut it. Why can't we make
>>> all page_cgroup allocations local to the node they are describing? If
>>> memcg is the culprit here, we should fix it, and not retry. If there is
>>> still any benefit in retrying, then we retry being very specific about why.
>>
>> We try to make all page_cgroup allocations local to the node they are describing
>> now. If the memory is the first memory onlined in this node, we will allocate
>> it from the other node.
>>
>> For example, node1 has 4 memory blocks: 8-11, and we online it from 8 to 11
>> 1. memory block 8, page_cgroup allocations are in the other nodes
>> 2. memory block 9, page_cgroup allocations are in memory block 8
>>
>> So we should offline memory block 9 first. But we don't know in which order
>> the user online the memory block.
>>
>> I think we can modify memcg like this:
>> allocate the memory from the memory block they are describing
>>
>> I am not sure it is OK to do so.
>
> I don't see a reason why not.
>
> You would have to tweak a bit the lookup function for page_cgroup, but
> assuming you will always have the pfns and limits, it should be easy to do.
>
> I think the only tricky part is that today we have a single
> node_page_cgroup, and we would of course have to have one per memory
> block. My assumption is that the number of memory blocks is limited and
> likely not very big. So even a static array would do.
>

About the idea "allocate the memory from the memory block it describes":

online_pages()
 |-->memory_notify(MEM_GOING_ONLINE, &arg) --- memory of this section is not in buddy yet
      |-->page_cgroup_callback()
           |-->online_page_cgroup()
                |-->init_section_page_cgroup()
                     |-->alloc_page_cgroup() - allocates page_cgroup from the buddy system

When onlining pages, we allocate page_cgroup from the buddy system, but the
pages being onlined are not in the buddy system yet. I think we can reserve
some memory in the section for page_cgroup and return all the rest to the
buddy system.

But when the system is booting,

start_kernel()
 |-->setup_arch()
 |-->mm_init()
 |    |-->mem_init()
 |         |-->numa_free_all_bootmem() ---- all the pages go to the buddy system
 |-->page_cgroup_init()
      |-->init_section_page_cgroup()
           |-->alloc_page_cgroup() -------- I don't know how to reserve memory in each section

So any idea about how to deal with this at boot time?


And one more question: a memory section is 128MB in Linux. If we reserve part
of each section for page_cgroup, then anyone who wants to allocate contiguous
memory larger than 128MB will fail, right?
Is that OK?

Thanks. :)






* Re: [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence
  2013-02-06  3:07         ` Tang Chen
@ 2013-02-06  9:17           ` Tang Chen
  2013-02-06 10:10             ` Tang Chen
  0 siblings, 1 reply; 47+ messages in thread
From: Tang Chen @ 2013-02-06  9:17 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Wen Congyang, akpm, rientjes, liuj97, len.brown, benh, paulus,
	cl, minchan.kim, kosaki.motohiro, isimatu.yasuaki, wujianguo,
	hpa, linfeng, laijs, mgorman, yinghai, x86, linux-mm,
	linux-kernel, linuxppc-dev, linux-acpi, linux-s390, linux-sh,
	linux-ia64, cmetcalf, sparclinux, KAMEZAWA Hiroyuki, Miao Xie

Hi all,

On 02/06/2013 11:07 AM, Tang Chen wrote:
> Hi Glauber, all,
>
> An old thing I want to discuss with you. :)
>
> On 01/09/2013 11:09 PM, Glauber Costa wrote:
>>>>> memory can't be offlined when CONFIG_MEMCG is selected.
>>>>> For example: there is a memory device on node 1. The address range
>>>>> is [1G, 1.5G). You will find 4 new directories memory8, memory9,
>>>>> memory10,
>>>>> and memory11 under the directory /sys/devices/system/memory/.
>>>>>
>>>>> If CONFIG_MEMCG is selected, we will allocate memory to store page
>>>>> cgroup
>>>>> when we online pages. When we online memory8, the memory stored
>>>>> page cgroup
>>>>> is not provided by this memory device. But when we online memory9,
>>>>> the memory
>>>>> stored page cgroup may be provided by memory8. So we can't offline
>>>>> memory8
>>>>> now. We should offline the memory in the reversed order.
>>>>>
>>>>> When the memory device is hotremoved, we will auto offline memory
>>>>> provided
>>>>> by this memory device. But we don't know which memory is onlined
>>>>> first, so
>>>>> offlining memory may fail. In such case, iterate twice to offline
>>>>> the memory.
>>>>> 1st iterate: offline every non primary memory block.
>>>>> 2nd iterate: offline primary (i.e. first added) memory block.
>>>>>
>>>>> This idea is suggested by KOSAKI Motohiro.
>>>>>
>>>>> Signed-off-by: Wen Congyang<wency@cn.fujitsu.com>
>>>>
>>>> Maybe there is something here that I am missing - I admit that I came
>>>> late to this one, but this really sounds like a very ugly hack, that
>>>> really has no place in here.
>>>>
>>>> Retrying, of course, may make sense, if we have reasonable belief that
>>>> we may now succeed. If this is the case, you need to document - in the
>>>> code - while is that.
>>>>
>>>> The memcg argument, however, doesn't really cut it. Why can't we make
>>>> all page_cgroup allocations local to the node they are describing? If
>>>> memcg is the culprit here, we should fix it, and not retry. If there is
>>>> still any benefit in retrying, then we retry being very specific
>>>> about why.
>>>
>>> We try to make all page_cgroup allocations local to the node they are
>>> describing
>>> now. If the memory is the first memory onlined in this node, we will
>>> allocate
>>> it from the other node.
>>>
>>> For example, node1 has 4 memory blocks: 8-11, and we online it from 8
>>> to 11
>>> 1. memory block 8, page_cgroup allocations are in the other nodes
>>> 2. memory block 9, page_cgroup allocations are in memory block 8
>>>
>>> So we should offline memory block 9 first. But we don't know in which
>>> order
>>> the user online the memory block.
>>>
>>> I think we can modify memcg like this:
>>> allocate the memory from the memory block they are describing
>>>
>>> I am not sure it is OK to do so.
>>
>> I don't see a reason why not.
>>
>> You would have to tweak a bit the lookup function for page_cgroup, but
>> assuming you will always have the pfns and limits, it should be easy
>> to do.
>>
>> I think the only tricky part is that today we have a single
>> node_page_cgroup, and we would of course have to have one per memory
>> block. My assumption is that the number of memory blocks is limited and
>> likely not very big. So even a static array would do.
>>
>
> About the idea "allocate the memory from the memory block it describes":
>
> online_pages()
>  |-->memory_notify(MEM_GOING_ONLINE, &arg) --- memory of this section is not in buddy yet
>       |-->page_cgroup_callback()
>            |-->online_page_cgroup()
>                 |-->init_section_page_cgroup()
>                      |-->alloc_page_cgroup() - allocates page_cgroup from the buddy system
>
> When onlining pages, we allocate page_cgroup from the buddy system, but the
> pages being onlined are not in the buddy system yet. I think we can reserve
> some memory in the section for page_cgroup and return all the rest to the
> buddy system.
>
> But when the system is booting,
>
> start_kernel()
>  |-->setup_arch()
>  |-->mm_init()
>  |    |-->mem_init()
>  |         |-->numa_free_all_bootmem() ---- all the pages go to the buddy system
>  |-->page_cgroup_init()
>       |-->init_section_page_cgroup()
>            |-->alloc_page_cgroup() -------- I don't know how to reserve memory in each section
>
> So any idea about how to deal with this at boot time?
>

How about this way:

1) Add a new flag PAGE_CGROUP_INFO, like SECTION_INFO and MIX_SECTION_INFO.
2) In sparse_init(), reserve some of the beginning pages of each section as
   bootmem.
3) In register_page_bootmem_info_section(), mark these pages with
      page->lru.next = PAGE_CGROUP_INFO;

Then these pages will not go to the buddy system.

But I do worry about the fragmentation problem, because part of each section
will be in use from the very beginning.

Thanks. :)

>
> And one more question, a memory section is 128MB in Linux. If we reserve
> part of the them for page_cgroup,
> then anyone who wants to allocate a contiguous memory larger than 128MB,
> it will fail, right ?
> Is it OK ?
>
> Thanks. :)
>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>


* Re: [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence
  2013-02-06  9:17           ` Tang Chen
@ 2013-02-06 10:10             ` Tang Chen
  2013-02-06 14:24               ` Glauber Costa
  0 siblings, 1 reply; 47+ messages in thread
From: Tang Chen @ 2013-02-06 10:10 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Wen Congyang, akpm, rientjes, liuj97, len.brown, benh, paulus,
	cl, minchan.kim, kosaki.motohiro, isimatu.yasuaki, wujianguo,
	hpa, linfeng, laijs, mgorman, yinghai, x86, linux-mm,
	linux-kernel, linuxppc-dev, linux-acpi, linux-s390, linux-sh,
	linux-ia64, cmetcalf, sparclinux, KAMEZAWA Hiroyuki, Miao Xie

On 02/06/2013 05:17 PM, Tang Chen wrote:
> Hi all,
>
> On 02/06/2013 11:07 AM, Tang Chen wrote:
>> Hi Glauber, all,
>>
>> An old thing I want to discuss with you. :)
>>
>> On 01/09/2013 11:09 PM, Glauber Costa wrote:
>>>>>> memory can't be offlined when CONFIG_MEMCG is selected.
>>>>>> For example: there is a memory device on node 1. The address range
>>>>>> is [1G, 1.5G). You will find 4 new directories memory8, memory9,
>>>>>> memory10,
>>>>>> and memory11 under the directory /sys/devices/system/memory/.
>>>>>>
>>>>>> If CONFIG_MEMCG is selected, we will allocate memory to store page
>>>>>> cgroup
>>>>>> when we online pages. When we online memory8, the memory stored
>>>>>> page cgroup
>>>>>> is not provided by this memory device. But when we online memory9,
>>>>>> the memory
>>>>>> stored page cgroup may be provided by memory8. So we can't offline
>>>>>> memory8
>>>>>> now. We should offline the memory in the reversed order.
>>>>>>
>>>>>> When the memory device is hotremoved, we will auto offline memory
>>>>>> provided
>>>>>> by this memory device. But we don't know which memory is onlined
>>>>>> first, so
>>>>>> offlining memory may fail. In such case, iterate twice to offline
>>>>>> the memory.
>>>>>> 1st iterate: offline every non primary memory block.
>>>>>> 2nd iterate: offline primary (i.e. first added) memory block.
>>>>>>
>>>>>> This idea is suggested by KOSAKI Motohiro.
>>>>>>
>>>>>> Signed-off-by: Wen Congyang<wency@cn.fujitsu.com>
>>>>>
>>>>> Maybe there is something here that I am missing - I admit that I came
>>>>> late to this one, but this really sounds like a very ugly hack, that
>>>>> really has no place in here.
>>>>>
>>>>> Retrying, of course, may make sense, if we have reasonable belief that
>>>>> we may now succeed. If this is the case, you need to document - in the
>>>>> code - while is that.
>>>>>
>>>>> The memcg argument, however, doesn't really cut it. Why can't we make
>>>>> all page_cgroup allocations local to the node they are describing? If
>>>>> memcg is the culprit here, we should fix it, and not retry. If
>>>>> there is
>>>>> still any benefit in retrying, then we retry being very specific
>>>>> about why.
>>>>
>>>> We try to make all page_cgroup allocations local to the node they are
>>>> describing
>>>> now. If the memory is the first memory onlined in this node, we will
>>>> allocate
>>>> it from the other node.
>>>>
>>>> For example, node1 has 4 memory blocks: 8-11, and we online it from 8
>>>> to 11
>>>> 1. memory block 8, page_cgroup allocations are in the other nodes
>>>> 2. memory block 9, page_cgroup allocations are in memory block 8
>>>>
>>>> So we should offline memory block 9 first. But we don't know in which
>>>> order
>>>> the user online the memory block.
>>>>
>>>> I think we can modify memcg like this:
>>>> allocate the memory from the memory block they are describing
>>>>
>>>> I am not sure it is OK to do so.
>>>
>>> I don't see a reason why not.
>>>
>>> You would have to tweak a bit the lookup function for page_cgroup, but
>>> assuming you will always have the pfns and limits, it should be easy
>>> to do.
>>>
>>> I think the only tricky part is that today we have a single
>>> node_page_cgroup, and we would of course have to have one per memory
>>> block. My assumption is that the number of memory blocks is limited and
>>> likely not very big. So even a static array would do.
>>>
>>
>> About the idea "allocate the memory from the memory block it describes":
>>
>> online_pages()
>>  |-->memory_notify(MEM_GOING_ONLINE, &arg) --- memory of this section is not in buddy yet
>>       |-->page_cgroup_callback()
>>            |-->online_page_cgroup()
>>                 |-->init_section_page_cgroup()
>>                      |-->alloc_page_cgroup() - allocates page_cgroup from the buddy system
>>
>> When onlining pages, we allocate page_cgroup from the buddy system, but the
>> pages being onlined are not in the buddy system yet. I think we can reserve
>> some memory in the section for page_cgroup and return all the rest to the
>> buddy system.
>>
>> But when the system is booting,
>>
>> start_kernel()
>>  |-->setup_arch()
>>  |-->mm_init()
>>  |    |-->mem_init()
>>  |         |-->numa_free_all_bootmem() ---- all the pages go to the buddy system
>>  |-->page_cgroup_init()
>>       |-->init_section_page_cgroup()
>>            |-->alloc_page_cgroup() -------- I don't know how to reserve memory in each section
>>
>> So any idea about how to deal with this at boot time?
>>
>
> How about this way.
>
> 1) Add a new flag PAGE_CGROUP_INFO, like SECTION_INFO and MIX_SECTION_INFO.
> 2) In sparse_init(), reserve some beginning pages of each section as
> bootmem.

Hi all,

After digging into the bootmem code, I hit another problem.

memblock allocates memory from high addresses to low addresses, using
memblock.current_limit to remember where the upper limit is. What I am doing
would produce a lot of fragments, leaving the memory non-contiguous, so we
would need to modify memblock again.

I don't think that's a good idea. What do you think?

Thanks. :)

> 3) In register_page_bootmem_info_section(), set these pages as
> page->lru.next = PAGE_CGROUP_INFO;
>
> Then these pages will not go to buddy system.
>
> But I do worry about the fragment problem because part of each section will
> be used in the very beginning.
>
> Thanks. :)
>
>>
>> And one more question, a memory section is 128MB in Linux. If we reserve
>> part of the them for page_cgroup,
>> then anyone who wants to allocate a contiguous memory larger than 128MB,
>> it will fail, right ?
>> Is it OK ?
>>
>> Thanks. :)
>>
>>
>>
>>
>


* Re: [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence
  2013-02-06 10:10             ` Tang Chen
@ 2013-02-06 14:24               ` Glauber Costa
  2013-02-07  7:56                 ` Tang Chen
  0 siblings, 1 reply; 47+ messages in thread
From: Glauber Costa @ 2013-02-06 14:24 UTC (permalink / raw)
  To: Tang Chen
  Cc: Wen Congyang, akpm, rientjes, liuj97, len.brown, benh, paulus,
	cl, minchan.kim, kosaki.motohiro, isimatu.yasuaki, wujianguo,
	hpa, linfeng, laijs, mgorman, yinghai, x86, linux-mm,
	linux-kernel, linuxppc-dev, linux-acpi, linux-s390, linux-sh,
	linux-ia64, cmetcalf, sparclinux, KAMEZAWA Hiroyuki, Miao Xie

On 02/06/2013 02:10 PM, Tang Chen wrote:
> On 02/06/2013 05:17 PM, Tang Chen wrote:
>> Hi all,
>>
>> On 02/06/2013 11:07 AM, Tang Chen wrote:
>>> Hi Glauber, all,
>>>
>>> An old thing I want to discuss with you. :)
>>>
>>> On 01/09/2013 11:09 PM, Glauber Costa wrote:
>>>>>>> memory can't be offlined when CONFIG_MEMCG is selected.
>>>>>>> For example: there is a memory device on node 1. The address range
>>>>>>> is [1G, 1.5G). You will find 4 new directories memory8, memory9,
>>>>>>> memory10,
>>>>>>> and memory11 under the directory /sys/devices/system/memory/.
>>>>>>>
>>>>>>> If CONFIG_MEMCG is selected, we allocate memory to store page
>>>>>>> cgroup when we online pages. When we online memory8, the memory
>>>>>>> storing its page cgroup is not provided by this memory device. But
>>>>>>> when we online memory9, the memory storing its page cgroup may be
>>>>>>> provided by memory8. So we can't offline memory8 now. We should
>>>>>>> offline the memory in the reverse order.
>>>>>>>
>>>>>>> When the memory device is hot-removed, we automatically offline the
>>>>>>> memory provided by this memory device. But we don't know which
>>>>>>> memory was onlined first, so offlining memory may fail. In such a
>>>>>>> case, iterate twice to offline the memory:
>>>>>>> 1st iterate: offline every non primary memory block.
>>>>>>> 2nd iterate: offline primary (i.e. first added) memory block.
>>>>>>>
>>>>>>> This idea is suggested by KOSAKI Motohiro.
>>>>>>>
>>>>>>> Signed-off-by: Wen Congyang<wency@cn.fujitsu.com>
>>>>>>
>>>>>> Maybe there is something here that I am missing - I admit that I came
>>>>>> late to this one, but this really sounds like a very ugly hack, that
>>>>>> really has no place in here.
>>>>>>
>>>>>> Retrying, of course, may make sense if we have reasonable belief
>>>>>> that we may now succeed. If this is the case, you need to document
>>>>>> - in the code - why that is.
>>>>>>
>>>>>> The memcg argument, however, doesn't really cut it. Why can't we make
>>>>>> all page_cgroup allocations local to the node they are describing? If
>>>>>> memcg is the culprit here, we should fix it, and not retry. If
>>>>>> there is
>>>>>> still any benefit in retrying, then we retry being very specific
>>>>>> about why.
>>>>>
>>>>> We already try to make all page_cgroup allocations local to the node
>>>>> they are describing. But if the memory is the first memory onlined in
>>>>> a node, we have to allocate its page_cgroup from another node.
>>>>>
>>>>> For example, node1 has 4 memory blocks, 8-11, and we online them from
>>>>> 8 to 11:
>>>>> 1. memory block 8: page_cgroup allocations are in the other nodes
>>>>> 2. memory block 9: page_cgroup allocations are in memory block 8
>>>>>
>>>>> So we should offline memory block 9 first. But we don't know in which
>>>>> order the user onlined the memory blocks.
>>>>>
>>>>> I think we can modify memcg like this:
>>>>> allocate the memory from the memory block they are describing
>>>>>
>>>>> I am not sure it is OK to do so.
>>>>
>>>> I don't see a reason why not.
>>>>
>>>> You would have to tweak the lookup function for page_cgroup a bit,
>>>> but assuming you will always have the pfns and limits, it should be
>>>> easy to do.
>>>>
>>>> I think the only tricky part is that today we have a single
>>>> node_page_cgroup, and we would of course have to have one per memory
>>>> block. My assumption is that the number of memory blocks is limited
>>>> and likely not very big, so even a static array would do.
>>>>
>>>
>>> About the idea "allocate the memory from the memory block they are
>>> describing",
>>>
>>> online_pages()
>>> |-->memory_notify(MEM_GOING_ONLINE, &arg) --- memory of this section
>>>                                               is not in buddy yet.
>>> |-->page_cgroup_callback()
>>>     |-->online_page_cgroup()
>>>         |-->init_section_page_cgroup()
>>>             |-->alloc_page_cgroup() --------- allocates page_cgroup
>>>                                               from the buddy system.
>>>
>>> When onlining pages, we allocate page_cgroup from the buddy system,
>>> but the pages being onlined are not in buddy yet. I think we can
>>> reserve some memory in the section for page_cgroup, and return all
>>> the rest to the buddy.
>>>
>>> But when the system is booting,
>>>
>>> start_kernel()
>>> |-->setup_arch()
>>> |-->mm_init()
>>> |   |-->mem_init()
>>> |       |-->numa_free_all_bootmem() ------ all the pages are in the
>>>                                            buddy system.
>>> |-->page_cgroup_init()
>>>     |-->init_section_page_cgroup()
>>>         |-->alloc_page_cgroup() ---------- I don't know how to reserve
>>>                                            memory in each section here.
>>>
>>> So any idea about how to deal with it when the system is booting please?
>>>
>>
>> How about this way.
>>
>> 1) Add a new flag PAGE_CGROUP_INFO, like SECTION_INFO and
>> MIX_SECTION_INFO.
>> 2) In sparse_init(), reserve some beginning pages of each section as
>> bootmem.
> 
> Hi all,
> 
> After digging into bootmem code, I met another problem.
> 
> memblock allocates memory from high address to low address, using
> memblock.current_limit to remember where the upper limit is. Reserving
> pages at the beginning of each section will produce a lot of fragments,
> and the reserved memory will be non-contiguous. So we would need to
> modify memblock again.
> 
> I don't think it's a good idea. What do you think ?
> 
> Thanks. :)
> 
>> 3) In register_page_bootmem_info_section(), set these pages as
>> page->lru.next = PAGE_CGROUP_INFO;
>>
>> Then these pages will not go to buddy system.
>>
>> But I do worry about the fragment problem because part of each section
>> will
>> be used in the very beginning.
>>
>> Thanks. :)
>>
>>>
>>> And one more question: a memory section is 128MB in Linux. If we reserve
>>> part of each section for page_cgroup, then anyone who wants to allocate
>>> contiguous memory larger than 128MB will fail, right ?
>>> Is it OK ?
No, it is not.

Another take on this: can't we free all the page_cgroup structures before
we actually start removing the sections ? If we do this, we would
basically be left with no problem at all, since by the time your code
starts running we would no longer have any page_cgroup allocated.

All you have to guarantee is that it happens after the memory block is
already isolated and allocations no longer can reach it.

What do you think ?





^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence
  2013-02-06 14:24               ` Glauber Costa
@ 2013-02-07  7:56                 ` Tang Chen
  0 siblings, 0 replies; 47+ messages in thread
From: Tang Chen @ 2013-02-07  7:56 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Wen Congyang, akpm, rientjes, liuj97, len.brown, benh, paulus,
	cl, minchan.kim, kosaki.motohiro, isimatu.yasuaki, wujianguo,
	hpa, linfeng, laijs, mgorman, yinghai, x86, linux-mm,
	linux-kernel, linuxppc-dev, linux-acpi, linux-s390, linux-sh,
	linux-ia64, cmetcalf, sparclinux, KAMEZAWA Hiroyuki, Miao Xie

On 02/06/2013 10:24 PM, Glauber Costa wrote:
>>>> And one more question: a memory section is 128MB in Linux. If we reserve
>>>> part of each section for page_cgroup, then anyone who wants to allocate
>>>> contiguous memory larger than 128MB will fail, right ?
>>>> Is it OK ?
> No, it is not.
>
> Another take on this: Can't we free all the page_cgroup structure before
> we actually start removing the sections ? If we do this, we would be
> basically left with no problem at all, since when your code starts
> running we would no longer have any page_cgroup allocated.
>
> All you have to guarantee is that it happens after the memory block is
> already isolated and allocations no longer can reach it.
>
> What do you think ?

Hi Glauber,

I don't think so. We can offline some of the sections and leave the
rest online.

For example, we store the page_cgroups of memory9~11 in memory8. When we
offline memory8, we can free memory8's own page_cgroup, which is stored
on another section, but we cannot free the page_cgroups stored in
memory8 while memory9~11 are still online.

So we still need to offline memory9~11 first, and then offline memory8,
right ? I think it makes no difference.

Thanks. :)

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2013-02-07  7:57 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-24 12:09 [PATCH v5 00/14] memory-hotplug: hot-remove physical memory Tang Chen
2012-12-24 12:09 ` [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence Tang Chen
2012-12-25  8:35   ` Glauber Costa
2012-12-30  5:58     ` Wen Congyang
2013-01-09 15:09       ` Glauber Costa
2013-01-10  1:38         ` Tang Chen
2013-02-06  3:07         ` Tang Chen
2013-02-06  9:17           ` Tang Chen
2013-02-06 10:10             ` Tang Chen
2013-02-06 14:24               ` Glauber Costa
2013-02-07  7:56                 ` Tang Chen
2012-12-26  3:02   ` Kamezawa Hiroyuki
2012-12-30  5:49     ` Wen Congyang
2012-12-24 12:09 ` [PATCH v5 02/14] memory-hotplug: check whether all memory blocks are offlined or not when removing memory Tang Chen
2012-12-26  3:10   ` Kamezawa Hiroyuki
2012-12-27  3:10     ` Tang Chen
2012-12-24 12:09 ` [PATCH v5 03/14] memory-hotplug: remove redundant codes Tang Chen
2012-12-26  3:20   ` Kamezawa Hiroyuki
2012-12-27  3:09     ` Tang Chen
2012-12-24 12:09 ` [PATCH v5 04/14] memory-hotplug: remove /sys/firmware/memmap/X sysfs Tang Chen
2012-12-26  3:30   ` Kamezawa Hiroyuki
2012-12-27  3:09     ` Tang Chen
2013-01-02 14:24       ` Christoph Lameter
2012-12-24 12:09 ` [PATCH v5 05/14] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture Tang Chen
2012-12-26  3:37   ` Kamezawa Hiroyuki
2012-12-24 12:09 ` [PATCH v5 06/14] memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap Tang Chen
2012-12-25  8:09   ` Jianguo Wu
2012-12-26  3:21     ` Tang Chen
2012-12-24 12:09 ` [PATCH v5 07/14] memory-hotplug: move pgdat_resize_lock into sparse_remove_one_section() Tang Chen
2012-12-26  3:47   ` Kamezawa Hiroyuki
2012-12-26  6:20     ` Tang Chen
2012-12-24 12:09 ` [PATCH v5 08/14] memory-hotplug: Common APIs to support page tables hot-remove Tang Chen
2012-12-25  8:17   ` Jianguo Wu
2012-12-26  2:49     ` Tang Chen
2012-12-26  3:11       ` Tang Chen
2012-12-26  3:19         ` Tang Chen
2012-12-24 12:09 ` [PATCH v5 09/14] memory-hotplug: remove page table of x86_64 architecture Tang Chen
2012-12-24 12:09 ` [PATCH v5 10/14] memory-hotplug: remove memmap of sparse-vmemmap Tang Chen
2012-12-24 12:09 ` [PATCH v5 11/14] memory-hotplug: Integrated __remove_section() of CONFIG_SPARSEMEM_VMEMMAP Tang Chen
2012-12-24 12:09 ` [PATCH v5 12/14] memory-hotplug: memory_hotplug: clear zone when removing the memory Tang Chen
2012-12-24 12:09 ` [PATCH v5 13/14] memory-hotplug: remove sysfs file of node Tang Chen
2012-12-24 12:09 ` [PATCH v5 14/14] memory-hotplug: free node_data when a node is offlined Tang Chen
2012-12-26  3:55   ` Kamezawa Hiroyuki
2012-12-27 12:16     ` Wen Congyang
2012-12-28  0:28       ` Kamezawa Hiroyuki
2012-12-30  6:02         ` Wen Congyang
2013-01-07  5:30           ` Kamezawa Hiroyuki
