* [Patch v4 00/12] memory-hotplug: hot-remove physical memory
@ 2012-11-27 10:00 Wen Congyang
  2012-11-27 10:00 ` [Patch v4 01/12] memory-hotplug: try to offline the memory twice to avoid dependence Wen Congyang
                   ` (12 more replies)
  0 siblings, 13 replies; 40+ messages in thread
From: Wen Congyang @ 2012-11-27 10:00 UTC (permalink / raw)
  To: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux
  Cc: Len Brown, Wen Congyang, Jianguo Wu, Yasuaki Ishimatsu, paulus,
	Minchan Kim, KOSAKI Motohiro, David Rientjes, Christoph Lameter,
	Andrew Morton, Jiang Liu

This patchset was split out from the following thread's patchset:
    https://lkml.org/lkml/2012/9/5/201

The previous version of this patchset:
    https://lkml.org/lkml/2012/11/1/93

If you want to know the reason for the split, please read the following thread:

https://lkml.org/lkml/2012/10/2/83

This patchset contains only the kernel core side of physical memory
hot-remove. So if you use it, please also apply the following patches:

- bug fix for memory hot-remove
  https://lkml.org/lkml/2012/10/31/269
  
- acpi framework
  https://lkml.org/lkml/2012/10/26/175

The patches can free/remove the following things:

  - /sys/firmware/memmap/X/{end, start, type} : [PATCH 2/10]
  - mem_section and related sysfs files       : [PATCH 3-4/10]
  - memmap of sparse-vmemmap                  : [PATCH 5-7/10]
  - page table of removed memory              : [RFC PATCH 8/10]
  - node and related sysfs files              : [RFC PATCH 9-10/10]

* [PATCH 2/10] checks whether the memory can be removed or not.

If you find any missing functionality for physical memory hot-remove,
please let me know.

How to test this patchset?
1. Apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE,
   and ACPI_HOTPLUG_MEMORY must be selected.
2. Load the module acpi_memhotplug.
3. Hotplug the memory device (this depends on your hardware).
   You will see the memory device under the directory /sys/bus/acpi/devices/.
   Its name is PNP0C80:XX.
4. Online/offline the pages provided by this memory device.
   You can write online/offline to /sys/devices/system/memory/memoryX/state
   to online/offline the pages provided by this memory device (a minimal
   sketch of this step follows below).
5. Hot-remove the memory device.
   You can hot-remove the memory device via the hardware, or by writing 1 to
   /sys/bus/acpi/devices/PNP0C80:XX/eject.
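
For step 4, a minimal user-space sketch in C (run as root; the block number
memory8 is just the example used in patch 1, adjust it for your hardware):

#include <stdio.h>

int main(void)
{
	const char *path = "/sys/devices/system/memory/memory8/state";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return 1;
	}
	/* write "online" instead to bring the block back */
	if (fputs("offline", f) == EOF)
		perror(path);
	return fclose(f) ? 1 : 0;
}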

Note: if the memory provided by the memory device is in use by the kernel,
it can't be offlined. That is not a bug.

Known problems:
1. Hot-removing a memory device may cause a kernel panic.
   This bug will be fixed by Jiang Liu's patch:
   https://lkml.org/lkml/2012/7/3/1

Changelogs from v3 to v4:
 Patch7: remove unused code.
 Patch8: fix the nr_pages passed to free_map_bootmem()

Changelogs from v2 to v3:
 Patch9: call sync_global_pgds() if the pgd is changed
 Patch10: fix a problem in the patch

Changelogs from v1 to v2:
 Patch1: new patch, offline memory twice. 1st iteration: offline every
         non-primary memory block. 2nd iteration: offline the primary
         (i.e. first added) memory block.

 Patch3: new patch, no logical change, just removes redundant code.

 Patch9: merge wujianguo's patch into this one; flush the TLB on all CPUs
         after the page table is changed.

 Patch12: new patch, free node_data when a node is offlined

Wen Congyang (6):
  memory-hotplug: try to offline the memory twice to avoid dependence
  memory-hotplug: remove redundant codes
  memory-hotplug: introduce new function arch_remove_memory() for
    removing page table depends on architecture
  memory-hotplug: remove page table of x86_64 architecture
  memory-hotplug: remove sysfs file of node
  memory-hotplug: free node_data when a node is offlined

Yasuaki Ishimatsu (6):
  memory-hotplug: check whether all memory blocks are offlined or not
    when removing memory
  memory-hotplug: remove /sys/firmware/memmap/X sysfs
  memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP
  memory-hotplug: implement register_page_bootmem_info_section of
    sparse-vmemmap
  memory-hotplug: remove memmap of sparse-vmemmap
  memory-hotplug: memory_hotplug: clear zone when removing the memory

 arch/ia64/mm/discontig.c             |  14 ++
 arch/ia64/mm/init.c                  |  18 ++
 arch/powerpc/mm/init_64.c            |  14 ++
 arch/powerpc/mm/mem.c                |  12 +
 arch/s390/mm/init.c                  |  12 +
 arch/s390/mm/vmem.c                  |  14 ++
 arch/sh/mm/init.c                    |  17 ++
 arch/sparc/mm/init_64.c              |  14 ++
 arch/tile/mm/init.c                  |   8 +
 arch/x86/include/asm/pgtable_types.h |   1 +
 arch/x86/mm/init_32.c                |  12 +
 arch/x86/mm/init_64.c                | 417 +++++++++++++++++++++++++++++++++++
 arch/x86/mm/pageattr.c               |  47 ++--
 drivers/acpi/acpi_memhotplug.c       |   8 +-
 drivers/base/memory.c                |   6 +
 drivers/firmware/memmap.c            |  98 +++++++-
 include/linux/firmware-map.h         |   6 +
 include/linux/memory_hotplug.h       |  15 +-
 include/linux/mm.h                   |   5 +-
 mm/memory_hotplug.c                  | 405 ++++++++++++++++++++++++++++++++--
 mm/sparse.c                          |  19 +-
 21 files changed, 1098 insertions(+), 64 deletions(-)

-- 
1.8.0


* [Patch v4 01/12] memory-hotplug: try to offline the memory twice to avoid dependence
  2012-11-27 10:00 [Patch v4 00/12] memory-hotplug: hot-remove physical memory Wen Congyang
@ 2012-11-27 10:00 ` Wen Congyang
  2012-12-04  9:17   ` Tang Chen
  2012-11-27 10:00 ` [Patch v4 02/12] memory-hotplug: check whether all memory blocks are offlined or not when removing memory Wen Congyang
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Wen Congyang @ 2012-11-27 10:00 UTC (permalink / raw)
  To: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux
  Cc: Len Brown, Wen Congyang, Jianguo Wu, Yasuaki Ishimatsu, paulus,
	Minchan Kim, KOSAKI Motohiro, David Rientjes, Christoph Lameter,
	Andrew Morton, Jiang Liu

Memory can't be offlined when CONFIG_MEMCG is selected.
For example: there is a memory device on node 1, and its address range
is [1G, 1.5G). You will find 4 new directories, memory8, memory9,
memory10, and memory11, under the directory /sys/devices/system/memory/.

If CONFIG_MEMCG is selected, we allocate memory to store the page cgroup
when we online pages. When we online memory8, the memory storing its
page cgroup is not provided by this memory device. But when we online
memory9, the memory storing its page cgroup may be provided by memory8.
So we can't offline memory8 now; we should offline the memory in the
reverse order.

When the memory device is hot-removed, we automatically offline the
memory provided by this device. But we don't know which memory was
onlined first, so offlining may fail. In that case, iterate twice to
offline the memory:
1st iteration: offline every non-primary memory block.
2nd iteration: offline the primary (i.e. first added) memory block.

This idea is suggested by KOSAKI Motohiro.
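
In simplified form, the retry added below behaves like this sketch
(offline_blocks_in_range() is a made-up stand-in for the section-walking
loop in the diff, which reuses a single loop via a goto):

static int offline_range_twice(unsigned long start_pfn, unsigned long end_pfn)
{
	int pass, failed;

	for (pass = 0; pass < 2; pass++) {
		/*
		 * Hypothetical helper: try to offline every memory block
		 * in the range and return how many blocks failed. Pass 1
		 * lets dependent (non-primary) blocks go offline first.
		 */
		failed = offline_blocks_in_range(start_pfn, end_pfn);
		if (!failed)
			return 0;	/* everything is offline */
	}
	return -EBUSY;	/* a block is still stuck after the 2nd pass */
}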

CC: David Rientjes <rientjes@google.com>
CC: Jiang Liu <liuj97@gmail.com>
CC: Len Brown <len.brown@intel.com>
CC: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 mm/memory_hotplug.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e4eeaca..b825dbc 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1012,10 +1012,13 @@ int remove_memory(u64 start, u64 size)
 	unsigned long start_pfn, end_pfn;
 	unsigned long pfn, section_nr;
 	int ret;
+	int return_on_error = 0;
+	int retry = 0;
 
 	start_pfn = PFN_DOWN(start);
 	end_pfn = start_pfn + PFN_DOWN(size);
 
+repeat:
 	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
 		section_nr = pfn_to_section_nr(pfn);
 		if (!present_section_nr(section_nr))
@@ -1034,14 +1037,23 @@ int remove_memory(u64 start, u64 size)
 
 		ret = offline_memory_block(mem);
 		if (ret) {
-			kobject_put(&mem->dev.kobj);
-			return ret;
+			if (return_on_error) {
+				kobject_put(&mem->dev.kobj);
+				return ret;
+			} else {
+				retry = 1;
+			}
 		}
 	}
 
 	if (mem)
 		kobject_put(&mem->dev.kobj);
 
+	if (retry) {
+		return_on_error = 1;
+		goto repeat;
+	}
+
 	return 0;
 }
 #else
-- 
1.8.0


* [Patch v4 02/12] memory-hotplug: check whether all memory blocks are offlined or not when removing memory
  2012-11-27 10:00 [Patch v4 00/12] memory-hotplug: hot-remove physical memory Wen Congyang
  2012-11-27 10:00 ` [Patch v4 01/12] memory-hotplug: try to offline the memory twice to avoid dependence Wen Congyang
@ 2012-11-27 10:00 ` Wen Congyang
  2012-12-04  9:22   ` Tang Chen
  2012-11-27 10:00 ` [Patch v4 03/12] memory-hotplug: remove redundant codes Wen Congyang
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Wen Congyang @ 2012-11-27 10:00 UTC (permalink / raw)
  To: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux
  Cc: Len Brown, Wen Congyang, Jianguo Wu, Yasuaki Ishimatsu, paulus,
	Minchan Kim, KOSAKI Motohiro, David Rientjes, Christoph Lameter,
	Andrew Morton, Jiang Liu

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

We remove the memory like this:
1. lock memory hotplug
2. offline a memory block
3. unlock memory hotplug
4. repeat 1-3 to offline all memory blocks
5. lock memory hotplug
6. remove the memory (TODO)
7. unlock memory hotplug

All memory blocks must be offlined before removing the memory. But we don't
hold the lock across the whole operation, so we should check whether all
memory blocks are still offlined before step 6. Otherwise, the kernel may
panic.

CC: David Rientjes <rientjes@google.com>
CC: Jiang Liu <liuj97@gmail.com>
CC: Len Brown <len.brown@intel.com>
CC: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
---
 drivers/base/memory.c          |  6 ++++++
 include/linux/memory_hotplug.h |  1 +
 mm/memory_hotplug.c            | 47 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 54 insertions(+)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 86c8821..badb025 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -675,6 +675,12 @@ int offline_memory_block(struct memory_block *mem)
 	return ret;
 }
 
+/* return true if the memory block is offlined, otherwise, return false */
+bool is_memblock_offlined(struct memory_block *mem)
+{
+	return mem->state == MEM_OFFLINE;
+}
+
 /*
  * Initialize the sysfs support for memory devices...
  */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 95573ec..38675e9 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -236,6 +236,7 @@ extern int add_memory(int nid, u64 start, u64 size);
 extern int arch_add_memory(int nid, u64 start, u64 size);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern int offline_memory_block(struct memory_block *mem);
+extern bool is_memblock_offlined(struct memory_block *mem);
 extern int remove_memory(u64 start, u64 size);
 extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 								int nr_pages);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b825dbc..b6d1101 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1054,6 +1054,53 @@ repeat:
 		goto repeat;
 	}
 
+	lock_memory_hotplug();
+
+	/*
+	 * we have offlined all memory blocks like this:
+	 *   1. lock memory hotplug
+	 *   2. offline a memory block
+	 *   3. unlock memory hotplug
+	 *
+	 * repeat step1-3 to offline the memory block. All memory blocks
+	 * must be offlined before removing memory. But we don't hold the
+	 * lock in the whole operation. So we should check whether all
+	 * memory blocks are offlined.
+	 */
+
+	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+		section_nr = pfn_to_section_nr(pfn);
+		if (!present_section_nr(section_nr))
+			continue;
+
+		section = __nr_to_section(section_nr);
+		/* same memblock? */
+		if (mem)
+			if ((section_nr >= mem->start_section_nr) &&
+			    (section_nr <= mem->end_section_nr))
+				continue;
+
+		mem = find_memory_block_hinted(section, mem);
+		if (!mem)
+			continue;
+
+		ret = is_memblock_offlined(mem);
+		if (!ret) {
+			pr_warn("removing memory fails, because memory "
+				"[%#010llx-%#010llx] is onlined\n",
+				PFN_PHYS(section_nr_to_pfn(mem->start_section_nr)),
+				PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1)) - 1);
+
+			kobject_put(&mem->dev.kobj);
+			unlock_memory_hotplug();
+			return ret;
+		}
+	}
+
+	if (mem)
+		kobject_put(&mem->dev.kobj);
+	unlock_memory_hotplug();
+
 	return 0;
 }
 #else
-- 
1.8.0


* [Patch v4 03/12] memory-hotplug: remove redundant codes
  2012-11-27 10:00 [Patch v4 00/12] memory-hotplug: hot-remove physical memory Wen Congyang
  2012-11-27 10:00 ` [Patch v4 01/12] memory-hotplug: try to offline the memory twice to avoid dependence Wen Congyang
  2012-11-27 10:00 ` [Patch v4 02/12] memory-hotplug: check whether all memory blocks are offlined or not when removing memory Wen Congyang
@ 2012-11-27 10:00 ` Wen Congyang
  2012-12-04  9:22   ` Tang Chen
  2012-11-27 10:00 ` [Patch v4 04/12] memory-hotplug: remove /sys/firmware/memmap/X sysfs Wen Congyang
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Wen Congyang @ 2012-11-27 10:00 UTC (permalink / raw)
  To: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux
  Cc: Len Brown, Wen Congyang, Jianguo Wu, Yasuaki Ishimatsu, paulus,
	Minchan Kim, KOSAKI Motohiro, David Rientjes, Christoph Lameter,
	Andrew Morton, Jiang Liu

Offlining memory blocks and checking whether memory blocks are offlined
are very similar operations. This patch introduces a new function,
walk_memory_range(), to remove the redundant code.
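
For illustration, a caller-side sketch of the new helper
(count_blocks_cb() is a made-up callback, not part of this series):

/* Hypothetical callback: count the memory blocks in a range.
 * walk_memory_range() calls it once per memory block and aborts
 * the walk when it returns non-zero. */
static int count_blocks_cb(struct memory_block *mem, void *arg)
{
	int *count = arg;

	(*count)++;
	return 0;	/* keep walking */
}

static int count_blocks(unsigned long start_pfn, unsigned long end_pfn)
{
	int count = 0;

	walk_memory_range(start_pfn, end_pfn, &count, count_blocks_cb);
	return count;
}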

CC: David Rientjes <rientjes@google.com>
CC: Jiang Liu <liuj97@gmail.com>
CC: Len Brown <len.brown@intel.com>
CC: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 mm/memory_hotplug.c | 101 ++++++++++++++++++++++++++++------------------------
 1 file changed, 55 insertions(+), 46 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b6d1101..6d06488 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1005,20 +1005,14 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ);
 }
 
-int remove_memory(u64 start, u64 size)
+static int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
+		void *arg, int (*func)(struct memory_block *, void *))
 {
 	struct memory_block *mem = NULL;
 	struct mem_section *section;
-	unsigned long start_pfn, end_pfn;
 	unsigned long pfn, section_nr;
 	int ret;
-	int return_on_error = 0;
-	int retry = 0;
-
-	start_pfn = PFN_DOWN(start);
-	end_pfn = start_pfn + PFN_DOWN(size);
 
-repeat:
 	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
 		section_nr = pfn_to_section_nr(pfn);
 		if (!present_section_nr(section_nr))
@@ -1035,22 +1029,61 @@ repeat:
 		if (!mem)
 			continue;
 
-		ret = offline_memory_block(mem);
+		ret = func(mem, arg);
 		if (ret) {
-			if (return_on_error) {
-				kobject_put(&mem->dev.kobj);
-				return ret;
-			} else {
-				retry = 1;
-			}
+			kobject_put(&mem->dev.kobj);
+			return ret;
 		}
 	}
 
 	if (mem)
 		kobject_put(&mem->dev.kobj);
 
-	if (retry) {
-		return_on_error = 1;
+	return 0;
+}
+
+static int offline_memory_block_cb(struct memory_block *mem, void *arg)
+{
+	int *ret = arg;
+	int error = offline_memory_block(mem);
+
+	if (error != 0 && *ret == 0)
+		*ret = error;
+
+	return 0;
+}
+
+static int is_memblock_offlined_cb(struct memory_block *mem, void *arg)
+{
+	int ret = !is_memblock_offlined(mem);
+
+	if (unlikely(ret))
+		pr_warn("removing memory fails, because memory "
+			"[%#010llx-%#010llx] is onlined\n",
+			PFN_PHYS(section_nr_to_pfn(mem->start_section_nr)),
+			PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1))-1);
+
+	return ret;
+}
+
+int remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn, end_pfn;
+	int ret = 0;
+	int retry = 1;
+
+	start_pfn = PFN_DOWN(start);
+	end_pfn = start_pfn + PFN_DOWN(size);
+
+repeat:
+	walk_memory_range(start_pfn, end_pfn, &ret,
+			  offline_memory_block_cb);
+	if (ret) {
+		if (!retry)
+			return ret;
+
+		retry = 0;
+		ret = 0;
 		goto repeat;
 	}
 
@@ -1068,37 +1101,13 @@ repeat:
 	 * memory blocks are offlined.
 	 */
 
-	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
-		section_nr = pfn_to_section_nr(pfn);
-		if (!present_section_nr(section_nr))
-			continue;
-
-		section = __nr_to_section(section_nr);
-		/* same memblock? */
-		if (mem)
-			if ((section_nr >= mem->start_section_nr) &&
-			    (section_nr <= mem->end_section_nr))
-				continue;
-
-		mem = find_memory_block_hinted(section, mem);
-		if (!mem)
-			continue;
-
-		ret = is_memblock_offlined(mem);
-		if (!ret) {
-			pr_warn("removing memory fails, because memory "
-				"[%#010llx-%#010llx] is onlined\n",
-				PFN_PHYS(section_nr_to_pfn(mem->start_section_nr)),
-				PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1)) - 1);
-
-			kobject_put(&mem->dev.kobj);
-			unlock_memory_hotplug();
-			return ret;
-		}
+	ret = walk_memory_range(start_pfn, end_pfn, NULL,
+				is_memblock_offlined_cb);
+	if (ret) {
+		unlock_memory_hotplug();
+		return ret;
 	}
 
-	if (mem)
-		kobject_put(&mem->dev.kobj);
 	unlock_memory_hotplug();
 
 	return 0;
-- 
1.8.0


* [Patch v4 04/12] memory-hotplug: remove /sys/firmware/memmap/X sysfs
  2012-11-27 10:00 [Patch v4 00/12] memory-hotplug: hot-remove physical memory Wen Congyang
                   ` (2 preceding siblings ...)
  2012-11-27 10:00 ` [Patch v4 03/12] memory-hotplug: remove redundant codes Wen Congyang
@ 2012-11-27 10:00 ` Wen Congyang
  2012-11-27 10:00 ` [Patch v4 05/12] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture Wen Congyang
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 40+ messages in thread
From: Wen Congyang @ 2012-11-27 10:00 UTC (permalink / raw)
  To: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux
  Cc: Len Brown, Wen Congyang, Jianguo Wu, Yasuaki Ishimatsu, paulus,
	Minchan Kim, KOSAKI Motohiro, David Rientjes, Christoph Lameter,
	Andrew Morton, Jiang Liu

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

When (hot)adding memory to the system, /sys/firmware/memmap/X/{end, start, type}
sysfs files are created. But there is no code to remove these files. This
patch implements the function to remove them.

Note: The code does not free a firmware_map_entry that was allocated from
      bootmem, so the patch introduces a memory leak. But the leak is very
      small, and it does not affect the system.

CC: David Rientjes <rientjes@google.com>
CC: Jiang Liu <liuj97@gmail.com>
CC: Len Brown <len.brown@intel.com>
CC: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
---
 drivers/firmware/memmap.c    | 98 +++++++++++++++++++++++++++++++++++++++++++-
 include/linux/firmware-map.h |  6 +++
 mm/memory_hotplug.c          |  5 ++-
 3 files changed, 106 insertions(+), 3 deletions(-)

diff --git a/drivers/firmware/memmap.c b/drivers/firmware/memmap.c
index 90723e6..49be12a 100644
--- a/drivers/firmware/memmap.c
+++ b/drivers/firmware/memmap.c
@@ -21,6 +21,7 @@
 #include <linux/types.h>
 #include <linux/bootmem.h>
 #include <linux/slab.h>
+#include <linux/mm.h>
 
 /*
  * Data types ------------------------------------------------------------------
@@ -41,6 +42,7 @@ struct firmware_map_entry {
 	const char		*type;	/* type of the memory range */
 	struct list_head	list;	/* entry for the linked list */
 	struct kobject		kobj;   /* kobject for each entry */
+	unsigned int		bootmem:1; /* allocated from bootmem */
 };
 
 /*
@@ -79,7 +81,26 @@ static const struct sysfs_ops memmap_attr_ops = {
 	.show = memmap_attr_show,
 };
 
+
+static inline struct firmware_map_entry *
+to_memmap_entry(struct kobject *kobj)
+{
+	return container_of(kobj, struct firmware_map_entry, kobj);
+}
+
+static void release_firmware_map_entry(struct kobject *kobj)
+{
+	struct firmware_map_entry *entry = to_memmap_entry(kobj);
+
+	if (entry->bootmem)
+		/* There is no way to free memory allocated from bootmem */
+		return;
+
+	kfree(entry);
+}
+
 static struct kobj_type memmap_ktype = {
+	.release	= release_firmware_map_entry,
 	.sysfs_ops	= &memmap_attr_ops,
 	.default_attrs	= def_attrs,
 };
@@ -94,6 +115,7 @@ static struct kobj_type memmap_ktype = {
  * in firmware initialisation code in one single thread of execution.
  */
 static LIST_HEAD(map_entries);
+static DEFINE_SPINLOCK(map_entries_lock);
 
 /**
  * firmware_map_add_entry() - Does the real work to add a firmware memmap entry.
@@ -118,11 +140,25 @@ static int firmware_map_add_entry(u64 start, u64 end,
 	INIT_LIST_HEAD(&entry->list);
 	kobject_init(&entry->kobj, &memmap_ktype);
 
+	spin_lock(&map_entries_lock);
 	list_add_tail(&entry->list, &map_entries);
+	spin_unlock(&map_entries_lock);
 
 	return 0;
 }
 
+/**
+ * firmware_map_remove_entry() - Does the real work to remove a firmware
+ * memmap entry.
+ * @entry: removed entry.
+ **/
+static inline void firmware_map_remove_entry(struct firmware_map_entry *entry)
+{
+	spin_lock(&map_entries_lock);
+	list_del(&entry->list);
+	spin_unlock(&map_entries_lock);
+}
+
 /*
  * Add memmap entry on sysfs
  */
@@ -144,6 +180,35 @@ static int add_sysfs_fw_map_entry(struct firmware_map_entry *entry)
 	return 0;
 }
 
+/*
+ * Remove memmap entry on sysfs
+ */
+static inline void remove_sysfs_fw_map_entry(struct firmware_map_entry *entry)
+{
+	kobject_put(&entry->kobj);
+}
+
+/*
+ * Search memmap entry
+ */
+
+static struct firmware_map_entry * __meminit
+firmware_map_find_entry(u64 start, u64 end, const char *type)
+{
+	struct firmware_map_entry *entry;
+
+	spin_lock(&map_entries_lock);
+	list_for_each_entry(entry, &map_entries, list)
+		if ((entry->start == start) && (entry->end == end) &&
+		    (!strcmp(entry->type, type))) {
+			spin_unlock(&map_entries_lock);
+			return entry;
+		}
+
+	spin_unlock(&map_entries_lock);
+	return NULL;
+}
+
 /**
  * firmware_map_add_hotplug() - Adds a firmware mapping entry when we do
  * memory hotplug.
@@ -193,9 +258,36 @@ int __init firmware_map_add_early(u64 start, u64 end, const char *type)
 	if (WARN_ON(!entry))
 		return -ENOMEM;
 
+	entry->bootmem = 1;
 	return firmware_map_add_entry(start, end, type, entry);
 }
 
+/**
+ * firmware_map_remove() - remove a firmware mapping entry
+ * @start: Start of the memory range.
+ * @end:   End of the memory range.
+ * @type:  Type of the memory range.
+ *
+ * removes a firmware mapping entry.
+ *
+ * Returns 0 on success, or -EINVAL if no entry.
+ **/
+int __meminit firmware_map_remove(u64 start, u64 end, const char *type)
+{
+	struct firmware_map_entry *entry;
+
+	entry = firmware_map_find_entry(start, end - 1, type);
+	if (!entry)
+		return -EINVAL;
+
+	firmware_map_remove_entry(entry);
+
+	/* remove the memmap entry */
+	remove_sysfs_fw_map_entry(entry);
+
+	return 0;
+}
+
 /*
  * Sysfs functions -------------------------------------------------------------
  */
@@ -217,8 +309,10 @@ static ssize_t type_show(struct firmware_map_entry *entry, char *buf)
 	return snprintf(buf, PAGE_SIZE, "%s\n", entry->type);
 }
 
-#define to_memmap_attr(_attr) container_of(_attr, struct memmap_attribute, attr)
-#define to_memmap_entry(obj) container_of(obj, struct firmware_map_entry, kobj)
+static inline struct memmap_attribute *to_memmap_attr(struct attribute *attr)
+{
+	return container_of(attr, struct memmap_attribute, attr);
+}
 
 static ssize_t memmap_attr_show(struct kobject *kobj,
 				struct attribute *attr, char *buf)
diff --git a/include/linux/firmware-map.h b/include/linux/firmware-map.h
index 43fe52fc..71d4fa7 100644
--- a/include/linux/firmware-map.h
+++ b/include/linux/firmware-map.h
@@ -25,6 +25,7 @@
 
 int firmware_map_add_early(u64 start, u64 end, const char *type);
 int firmware_map_add_hotplug(u64 start, u64 end, const char *type);
+int firmware_map_remove(u64 start, u64 end, const char *type);
 
 #else /* CONFIG_FIRMWARE_MEMMAP */
 
@@ -38,6 +39,11 @@ static inline int firmware_map_add_hotplug(u64 start, u64 end, const char *type)
 	return 0;
 }
 
+static inline int firmware_map_remove(u64 start, u64 end, const char *type)
+{
+	return 0;
+}
+
 #endif /* CONFIG_FIRMWARE_MEMMAP */
 
 #endif /* _LINUX_FIRMWARE_MAP_H */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 6d06488..63d5388 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1066,7 +1066,7 @@ static int is_memblock_offlined_cb(struct memory_block *mem, void *arg)
 	return ret;
 }
 
-int remove_memory(u64 start, u64 size)
+int __ref remove_memory(u64 start, u64 size)
 {
 	unsigned long start_pfn, end_pfn;
 	int ret = 0;
@@ -1108,6 +1108,9 @@ repeat:
 		return ret;
 	}
 
+	/* remove memmap entry */
+	firmware_map_remove(start, start + size, "System RAM");
+
 	unlock_memory_hotplug();
 
 	return 0;
-- 
1.8.0


* [Patch v4 05/12] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture
  2012-11-27 10:00 [Patch v4 00/12] memory-hotplug: hot-remove physical memory Wen Congyang
                   ` (3 preceding siblings ...)
  2012-11-27 10:00 ` [Patch v4 04/12] memory-hotplug: remove /sys/firmware/memmap/X sysfs Wen Congyang
@ 2012-11-27 10:00 ` Wen Congyang
  2012-12-04  9:30   ` Tang Chen
  2012-11-27 10:00 ` [Patch v4 06/12] memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP Wen Congyang
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Wen Congyang @ 2012-11-27 10:00 UTC (permalink / raw)
  To: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux
  Cc: Len Brown, Wen Congyang, Jianguo Wu, Yasuaki Ishimatsu, paulus,
	Minchan Kim, KOSAKI Motohiro, David Rientjes, Christoph Lameter,
	Andrew Morton, Jiang Liu

For removing memory, we need to remove its page table entries. But how
to do that depends on the architecture, so this patch introduces
arch_remove_memory() for removing the page table. For now it only calls
__remove_pages().

Note: __remove_pages() is not implemented for some architectures
      (I don't know how to implement it for s390).

CC: David Rientjes <rientjes@google.com>
CC: Jiang Liu <liuj97@gmail.com>
CC: Len Brown <len.brown@intel.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 arch/ia64/mm/init.c            | 18 ++++++++++++++++++
 arch/powerpc/mm/mem.c          | 12 ++++++++++++
 arch/s390/mm/init.c            | 12 ++++++++++++
 arch/sh/mm/init.c              | 17 +++++++++++++++++
 arch/tile/mm/init.c            |  8 ++++++++
 arch/x86/mm/init_32.c          | 12 ++++++++++++
 arch/x86/mm/init_64.c          | 15 +++++++++++++++
 include/linux/memory_hotplug.h |  1 +
 mm/memory_hotplug.c            |  2 ++
 9 files changed, 97 insertions(+)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 082e383..e333822 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -689,6 +689,24 @@ int arch_add_memory(int nid, u64 start, u64 size)
 
 	return ret;
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct zone *zone;
+	int ret;
+
+	zone = page_zone(pfn_to_page(start_pfn));
+	ret = __remove_pages(zone, start_pfn, nr_pages);
+	if (ret)
+		pr_warn("%s: Problem encountered in __remove_pages() as"
+			" ret=%d\n", __func__,  ret);
+
+	return ret;
+}
+#endif
 #endif
 
 /*
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 0dba506..09c6451 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -133,6 +133,18 @@ int arch_add_memory(int nid, u64 start, u64 size)
 
 	return __add_pages(nid, zone, start_pfn, nr_pages);
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct zone *zone;
+
+	zone = page_zone(pfn_to_page(start_pfn));
+	return __remove_pages(zone, start_pfn, nr_pages);
+}
+#endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 /*
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 81e596c..b565190 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -257,4 +257,16 @@ int arch_add_memory(int nid, u64 start, u64 size)
 		vmem_remove_mapping(start, size);
 	return rc;
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+	/*
+	 * There is no hardware or firmware interface which could trigger a
+	 * hot memory remove on s390. So there is nothing that needs to be
+	 * implemented.
+	 */
+	return -EBUSY;
+}
+#endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 82cc576..1057940 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -558,4 +558,21 @@ int memory_add_physaddr_to_nid(u64 addr)
 EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 #endif
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct zone *zone;
+	int ret;
+
+	zone = page_zone(pfn_to_page(start_pfn));
+	ret = __remove_pages(zone, start_pfn, nr_pages);
+	if (unlikely(ret))
+		pr_warn("%s: Failed, __remove_pages() == %d\n", __func__,
+			ret);
+
+	return ret;
+}
+#endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/tile/mm/init.c b/arch/tile/mm/init.c
index ef29d6c..2749515 100644
--- a/arch/tile/mm/init.c
+++ b/arch/tile/mm/init.c
@@ -935,6 +935,14 @@ int remove_memory(u64 start, u64 size)
 {
 	return -EINVAL;
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+	/* TODO */
+	return -EBUSY;
+}
+#endif
 #endif
 
 struct kmem_cache *pgd_cache;
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 11a5800..b19eba4 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -839,6 +839,18 @@ int arch_add_memory(int nid, u64 start, u64 size)
 
 	return __add_pages(nid, zone, start_pfn, nr_pages);
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct zone *zone;
+
+	zone = page_zone(pfn_to_page(start_pfn));
+	return __remove_pages(zone, start_pfn, nr_pages);
+}
+#endif
 #endif
 
 /*
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3baff25..5675335 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -680,6 +680,21 @@ int arch_add_memory(int nid, u64 start, u64 size)
 }
 EXPORT_SYMBOL_GPL(arch_add_memory);
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int __ref arch_remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct zone *zone;
+	int ret;
+
+	zone = page_zone(pfn_to_page(start_pfn));
+	ret = __remove_pages(zone, start_pfn, nr_pages);
+	WARN_ON_ONCE(ret);
+
+	return ret;
+}
+#endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 static struct kcore_list kcore_vsyscall;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 38675e9..191b2d9 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -85,6 +85,7 @@ extern void __online_page_free(struct page *page);
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
 extern bool is_pageblock_removable_nolock(struct page *page);
+extern int arch_remove_memory(u64 start, u64 size);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
 /* reasonably generic interface to expand the physical pages in a zone  */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 63d5388..e741732 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1111,6 +1111,8 @@ repeat:
 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");
 
+	arch_remove_memory(start, size);
+
 	unlock_memory_hotplug();
 
 	return 0;
-- 
1.8.0


* [Patch v4 06/12] memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP
  2012-11-27 10:00 [Patch v4 00/12] memory-hotplug: hot-remove physical memory Wen Congyang
                   ` (4 preceding siblings ...)
  2012-11-27 10:00 ` [Patch v4 05/12] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture Wen Congyang
@ 2012-11-27 10:00 ` Wen Congyang
  2012-12-04  9:34   ` Tang Chen
  2012-11-27 10:00 ` [Patch v4 07/12] memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap Wen Congyang
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Wen Congyang @ 2012-11-27 10:00 UTC (permalink / raw)
  To: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux
  Cc: Len Brown, Wen Congyang, Jianguo Wu, Yasuaki Ishimatsu, paulus,
	Minchan Kim, KOSAKI Motohiro, David Rientjes, Christoph Lameter,
	Andrew Morton, Jiang Liu

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

Currently, __remove_section() for SPARSEMEM_VMEMMAP does nothing. But even
if we use SPARSEMEM_VMEMMAP, we can still unregister the memory section.

So this patch adds unregister_memory_section() to __remove_section().

CC: David Rientjes <rientjes@google.com>
CC: Jiang Liu <liuj97@gmail.com>
CC: Len Brown <len.brown@intel.com>
CC: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 mm/memory_hotplug.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e741732..171610d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -279,11 +279,14 @@ static int __meminit __add_section(int nid, struct zone *zone,
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 static int __remove_section(struct zone *zone, struct mem_section *ms)
 {
-	/*
-	 * XXX: Freeing memmap with vmemmap is not implement yet.
-	 *      This should be removed later.
-	 */
-	return -EBUSY;
+	int ret = -EINVAL;
+
+	if (!valid_section(ms))
+		return ret;
+
+	ret = unregister_memory_section(ms);
+
+	return ret;
 }
 #else
 static int __remove_section(struct zone *zone, struct mem_section *ms)
-- 
1.8.0


* [Patch v4 07/12] memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap
  2012-11-27 10:00 [Patch v4 00/12] memory-hotplug: hot-remove physical memory Wen Congyang
                   ` (5 preceding siblings ...)
  2012-11-27 10:00 ` [Patch v4 06/12] memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP Wen Congyang
@ 2012-11-27 10:00 ` Wen Congyang
  2012-11-27 10:00 ` [Patch v4 08/12] memory-hotplug: remove memmap " Wen Congyang
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 40+ messages in thread
From: Wen Congyang @ 2012-11-27 10:00 UTC (permalink / raw)
  To: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux
  Cc: Len Brown, Wen Congyang, Jianguo Wu, Yasuaki Ishimatsu, paulus,
	Minchan Kim, KOSAKI Motohiro, David Rientjes, Christoph Lameter,
	Andrew Morton, Jiang Liu

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

To remove a sparse-vmemmap memmap region that was allocated from bootmem,
the region needs to have been registered via get_page_bootmem(). So this
patch walks the pages of the virtual mapping and registers them with
get_page_bootmem().

Note: register_page_bootmem_memmap() is not implemented for ia64, ppc, s390,
and sparc.
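
For reference, the helper being exported here (body as in this era's
mm/memory_hotplug.c; its first lines are visible in the hunk below) records
the bookkeeping that put_page_bootmem() later undoes:

void get_page_bootmem(unsigned long info, struct page *page,
		      unsigned long type)
{
	/* type (SECTION_INFO / MIX_SECTION_INFO) rides in lru.next */
	page->lru.next = (struct list_head *) type;
	SetPagePrivate(page);
	/* info is the section number the page belongs to */
	set_page_private(page, info);
	atomic_inc(&page->_count);
}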

CC: David Rientjes <rientjes@google.com>
CC: Jiang Liu <liuj97@gmail.com>
CC: Len Brown <len.brown@intel.com>
CC: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
---
 arch/ia64/mm/discontig.c       |  6 +++++
 arch/powerpc/mm/init_64.c      |  6 +++++
 arch/s390/mm/vmem.c            |  6 +++++
 arch/sparc/mm/init_64.c        |  6 +++++
 arch/x86/mm/init_64.c          | 52 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/memory_hotplug.h | 11 ++-------
 include/linux/mm.h             |  3 ++-
 mm/memory_hotplug.c            | 33 +++++++++++++++++++++++----
 8 files changed, 109 insertions(+), 14 deletions(-)

diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index c641333..33943db 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -822,4 +822,10 @@ int __meminit vmemmap_populate(struct page *start_page,
 {
 	return vmemmap_populate_basepages(start_page, size, node);
 }
+
+void register_page_bootmem_memmap(unsigned long section_nr,
+				  struct page *start_page, unsigned long size)
+{
+	/* TODO */
+}
 #endif
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 95a4529..6466440 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -297,5 +297,11 @@ int __meminit vmemmap_populate(struct page *start_page,
 
 	return 0;
 }
+
+void register_page_bootmem_memmap(unsigned long section_nr,
+				  struct page *start_page, unsigned long size)
+{
+	/* TODO */
+}
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 387c7c6..4f4803a 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -236,6 +236,12 @@ out:
 	return ret;
 }
 
+void register_page_bootmem_memmap(unsigned long section_nr,
+				  struct page *start_page, unsigned long size)
+{
+	/* TODO */
+}
+
 /*
  * Add memory segment to the segment list if it doesn't overlap with
  * an already present segment.
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 9e28a11..75a984b 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2231,6 +2231,12 @@ void __meminit vmemmap_populate_print_last(void)
 		node_start = 0;
 	}
 }
+
+void register_page_bootmem_memmap(unsigned long section_nr,
+				  struct page *start_page, unsigned long size)
+{
+	/* TODO */
+}
 #endif /* CONFIG_SPARSEMEM_VMEMMAP */
 
 static void prot_init_common(unsigned long page_none,
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5675335..795dae3 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -998,6 +998,58 @@ vmemmap_populate(struct page *start_page, unsigned long size, int node)
 	return 0;
 }
 
+void register_page_bootmem_memmap(unsigned long section_nr,
+				  struct page *start_page, unsigned long size)
+{
+	unsigned long addr = (unsigned long)start_page;
+	unsigned long end = (unsigned long)(start_page + size);
+	unsigned long next;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	for (; addr < end; addr = next) {
+		pte_t *pte = NULL;
+
+		pgd = pgd_offset_k(addr);
+		if (pgd_none(*pgd)) {
+			next = (addr + PAGE_SIZE) & PAGE_MASK;
+			continue;
+		}
+		get_page_bootmem(section_nr, pgd_page(*pgd), MIX_SECTION_INFO);
+
+		pud = pud_offset(pgd, addr);
+		if (pud_none(*pud)) {
+			next = (addr + PAGE_SIZE) & PAGE_MASK;
+			continue;
+		}
+		get_page_bootmem(section_nr, pud_page(*pud), MIX_SECTION_INFO);
+
+		if (!cpu_has_pse) {
+			next = (addr + PAGE_SIZE) & PAGE_MASK;
+			pmd = pmd_offset(pud, addr);
+			if (pmd_none(*pmd))
+				continue;
+			get_page_bootmem(section_nr, pmd_page(*pmd),
+					 MIX_SECTION_INFO);
+
+			pte = pte_offset_kernel(pmd, addr);
+			if (pte_none(*pte))
+				continue;
+			get_page_bootmem(section_nr, pte_page(*pte),
+					 SECTION_INFO);
+		} else {
+			next = pmd_addr_end(addr, end);
+
+			pmd = pmd_offset(pud, addr);
+			if (pmd_none(*pmd))
+				continue;
+			get_page_bootmem(section_nr, pmd_page(*pmd),
+					 SECTION_INFO);
+		}
+	}
+}
+
 void __meminit vmemmap_populate_print_last(void)
 {
 	if (p_start) {
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 191b2d9..d4c4402 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -163,17 +163,10 @@ static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)
 #endif /* CONFIG_NUMA */
 #endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
 
-#ifdef CONFIG_SPARSEMEM_VMEMMAP
-static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
-{
-}
-static inline void put_page_bootmem(struct page *page)
-{
-}
-#else
 extern void register_page_bootmem_info_node(struct pglist_data *pgdat);
 extern void put_page_bootmem(struct page *page);
-#endif
+extern void get_page_bootmem(unsigned long info, struct page *page,
+			     unsigned long type);
 
 /*
  * Lock for memory hotplug guarantees 1) all callbacks for memory hotplug
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bcaab4e..5657670 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1640,7 +1640,8 @@ int vmemmap_populate_basepages(struct page *start_page,
 						unsigned long pages, int node);
 int vmemmap_populate(struct page *start_page, unsigned long pages, int node);
 void vmemmap_populate_print_last(void);
-
+void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
+				  unsigned long size);
 
 enum mf_flags {
 	MF_COUNT_INCREASED = 1 << 0,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 171610d..ccc11b6 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -91,9 +91,8 @@ static void release_memory_resource(struct resource *res)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
-#ifndef CONFIG_SPARSEMEM_VMEMMAP
-static void get_page_bootmem(unsigned long info,  struct page *page,
-			     unsigned long type)
+void get_page_bootmem(unsigned long info,  struct page *page,
+		      unsigned long type)
 {
 	page->lru.next = (struct list_head *) type;
 	SetPagePrivate(page);
@@ -120,6 +119,7 @@ void __ref put_page_bootmem(struct page *page)
 
 }
 
+#ifndef CONFIG_SPARSEMEM_VMEMMAP
 static void register_page_bootmem_info_section(unsigned long start_pfn)
 {
 	unsigned long *usemap, mapsize, section_nr, i;
@@ -153,6 +153,32 @@ static void register_page_bootmem_info_section(unsigned long start_pfn)
 		get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
 
 }
+#else
+static void register_page_bootmem_info_section(unsigned long start_pfn)
+{
+	unsigned long *usemap, mapsize, section_nr, i;
+	struct mem_section *ms;
+	struct page *page, *memmap;
+
+	if (!pfn_valid(start_pfn))
+		return;
+
+	section_nr = pfn_to_section_nr(start_pfn);
+	ms = __nr_to_section(section_nr);
+
+	memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
+
+	register_page_bootmem_memmap(section_nr, memmap, PAGES_PER_SECTION);
+
+	usemap = __nr_to_section(section_nr)->pageblock_flags;
+	page = virt_to_page(usemap);
+
+	mapsize = PAGE_ALIGN(usemap_size()) >> PAGE_SHIFT;
+
+	for (i = 0; i < mapsize; i++, page++)
+		get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
+}
+#endif
 
 void register_page_bootmem_info_node(struct pglist_data *pgdat)
 {
@@ -195,7 +221,6 @@ void register_page_bootmem_info_node(struct pglist_data *pgdat)
 			register_page_bootmem_info_section(pfn);
 	}
 }
-#endif /* !CONFIG_SPARSEMEM_VMEMMAP */
 
 static void grow_zone_span(struct zone *zone, unsigned long start_pfn,
 			   unsigned long end_pfn)
-- 
1.8.0


* [Patch v4 08/12] memory-hotplug: remove memmap of sparse-vmemmap
  2012-11-27 10:00 [Patch v4 00/12] memory-hotplug: hot-remove physical memory Wen Congyang
                   ` (6 preceding siblings ...)
  2012-11-27 10:00 ` [Patch v4 07/12] memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap Wen Congyang
@ 2012-11-27 10:00 ` Wen Congyang
  2012-11-28  9:40   ` Jianguo Wu
  2012-12-04  9:47   ` Tang Chen
  2012-11-27 10:00 ` [Patch v4 09/12] memory-hotplug: remove page table of x86_64 architecture Wen Congyang
                   ` (4 subsequent siblings)
  12 siblings, 2 replies; 40+ messages in thread
From: Wen Congyang @ 2012-11-27 10:00 UTC (permalink / raw)
  To: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux
  Cc: Len Brown, Wen Congyang, Jianguo Wu, Yasuaki Ishimatsu, paulus,
	Minchan Kim, KOSAKI Motohiro, David Rientjes, Christoph Lameter,
	Andrew Morton, Jiang Liu

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

Not all pages of the virtual mapping in removed memory can be freed, since
some pages used as PGD/PUD include not only the removed memory but also
other memory. So the patch checks whether each page can be freed or not.

How do we check whether a page can be freed?
 1. When removing memory, the page structs of the removed memory are filled
    with 0xFD.
 2. When all page structs on a PT/PMD page are filled with 0xFD, the PT/PMD
    entry can be cleared. In this case, the page used as the PT/PMD can be
    freed.

With this patch applied, the two variants of __remove_section() are unified
into one, so the CONFIG_SPARSEMEM_VMEMMAP variant is deleted.

Note:  vmemmap_kfree() and vmemmap_free_bootmem() are not implemented for ia64,
ppc, s390, and sparc.
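
The 0xFD check can be illustrated stand-alone in user-space C
(page_fully_unused() open-codes what the x86 hunk below does with
memchr_inv()):

#include <stdio.h>
#include <string.h>

#define PAGE_INUSE 0xFD

static int page_fully_unused(const unsigned char *page, size_t size)
{
	size_t i;

	for (i = 0; i < size; i++)
		if (page[i] != PAGE_INUSE)
			return 0;	/* part of the page is still used */
	return 1;	/* safe to clear the entry and free the page */
}

int main(void)
{
	unsigned char page[4096];

	memset(page, 0, sizeof(page));
	memset(page, PAGE_INUSE, 2048);	/* only half was removed */
	printf("half removed: %d\n", page_fully_unused(page, sizeof(page)));
	memset(page, PAGE_INUSE, sizeof(page));	/* all of it removed */
	printf("all removed:  %d\n", page_fully_unused(page, sizeof(page)));
	return 0;
}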

CC: David Rientjes <rientjes@google.com>
CC: Jiang Liu <liuj97@gmail.com>
CC: Len Brown <len.brown@intel.com>
CC: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 arch/ia64/mm/discontig.c  |   8 ++++
 arch/powerpc/mm/init_64.c |   8 ++++
 arch/s390/mm/vmem.c       |   8 ++++
 arch/sparc/mm/init_64.c   |   8 ++++
 arch/x86/mm/init_64.c     | 119 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h        |   2 +
 mm/memory_hotplug.c       |  17 +------
 mm/sparse.c               |  19 ++++----
 8 files changed, 165 insertions(+), 24 deletions(-)

diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index 33943db..0d23b69 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -823,6 +823,14 @@ int __meminit vmemmap_populate(struct page *start_page,
 	return vmemmap_populate_basepages(start_page, size, node);
 }
 
+void vmemmap_kfree(struct page *memmap, unsigned long nr_pages)
+{
+}
+
+void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages)
+{
+}
+
 void register_page_bootmem_memmap(unsigned long section_nr,
 				  struct page *start_page, unsigned long size)
 {
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 6466440..df7d155 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -298,6 +298,14 @@ int __meminit vmemmap_populate(struct page *start_page,
 	return 0;
 }
 
+void vmemmap_kfree(struct page *memmap, unsigned long nr_pages)
+{
+}
+
+void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages)
+{
+}
+
 void register_page_bootmem_memmap(unsigned long section_nr,
 				  struct page *start_page, unsigned long size)
 {
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 4f4803a..ab69c34 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -236,6 +236,14 @@ out:
 	return ret;
 }
 
+void vmemmap_kfree(struct page *memmap, unsigned long nr_pages)
+{
+}
+
+void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages)
+{
+}
+
 void register_page_bootmem_memmap(unsigned long section_nr,
 				  struct page *start_page, unsigned long size)
 {
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 75a984b..546855d 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2232,6 +2232,14 @@ void __meminit vmemmap_populate_print_last(void)
 	}
 }
 
+void vmemmap_kfree(struct page *memmap, unsigned long nr_pages)
+{
+}
+
+void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages)
+{
+}
+
 void register_page_bootmem_memmap(unsigned long section_nr,
 				  struct page *start_page, unsigned long size)
 {
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 795dae3..e85626d 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -998,6 +998,125 @@ vmemmap_populate(struct page *start_page, unsigned long size, int node)
 	return 0;
 }
 
+#define PAGE_INUSE 0xFD
+
+unsigned long find_and_clear_pte_page(unsigned long addr, unsigned long end,
+			    struct page **pp, int *page_size)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte = NULL;
+	void *page_addr;
+	unsigned long next;
+
+	*pp = NULL;
+
+	pgd = pgd_offset_k(addr);
+	if (pgd_none(*pgd))
+		return pgd_addr_end(addr, end);
+
+	pud = pud_offset(pgd, addr);
+	if (pud_none(*pud))
+		return pud_addr_end(addr, end);
+
+	if (!cpu_has_pse) {
+		next = (addr + PAGE_SIZE) & PAGE_MASK;
+		pmd = pmd_offset(pud, addr);
+		if (pmd_none(*pmd))
+			return next;
+
+		pte = pte_offset_kernel(pmd, addr);
+		if (pte_none(*pte))
+			return next;
+
+		*page_size = PAGE_SIZE;
+		*pp = pte_page(*pte);
+	} else {
+		next = pmd_addr_end(addr, end);
+
+		pmd = pmd_offset(pud, addr);
+		if (pmd_none(*pmd))
+			return next;
+
+		*page_size = PMD_SIZE;
+		*pp = pmd_page(*pmd);
+	}
+
+	/*
+	 * Removed page structs are filled with 0xFD.
+	 */
+	memset((void *)addr, PAGE_INUSE, next - addr);
+
+	page_addr = page_address(*pp);
+
+	/*
+	 * Check the page is filled with 0xFD or not.
+	 * memchr_inv() returns the address. In this case, we cannot
+	 * clear PTE/PUD entry, since the page is used by other.
+	 * So we cannot also free the page.
+	 *
+	 * memchr_inv() returns NULL. In this case, we can clear
+	 * PTE/PUD entry, since the page is not used by other.
+	 * So we can also free the page.
+	 */
+	if (memchr_inv(page_addr, PAGE_INUSE, *page_size)) {
+		*pp = NULL;
+		return next;
+	}
+
+	if (!cpu_has_pse)
+		pte_clear(&init_mm, addr, pte);
+	else
+		pmd_clear(pmd);
+
+	return next;
+}
+
+void vmemmap_kfree(struct page *memmap, unsigned long nr_pages)
+{
+	unsigned long addr = (unsigned long)memmap;
+	unsigned long end = (unsigned long)(memmap + nr_pages);
+	unsigned long next;
+	struct page *page;
+	int page_size;
+
+	for (; addr < end; addr = next) {
+		page = NULL;
+		page_size = 0;
+		next = find_and_clear_pte_page(addr, end, &page, &page_size);
+		if (!page)
+			continue;
+
+		free_pages((unsigned long)page_address(page),
+			    get_order(page_size));
+		__flush_tlb_one(addr);
+	}
+}
+
+void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages)
+{
+	unsigned long addr = (unsigned long)memmap;
+	unsigned long end = (unsigned long)(memmap + nr_pages);
+	unsigned long next;
+	struct page *page;
+	int page_size;
+	unsigned long magic;
+
+	for (; addr < end; addr = next) {
+		page = NULL;
+		page_size = 0;
+		next = find_and_clear_pte_page(addr, end, &page, &page_size);
+		if (!page)
+			continue;
+
+		magic = (unsigned long) page->lru.next;
+		if (magic == SECTION_INFO)
+			put_page_bootmem(page);
+		flush_tlb_kernel_range(addr, end);
+	}
+}
+
 void register_page_bootmem_memmap(unsigned long section_nr,
 				  struct page *start_page, unsigned long size)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5657670..94d5ccd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1642,6 +1642,8 @@ int vmemmap_populate(struct page *start_page, unsigned long pages, int node);
 void vmemmap_populate_print_last(void);
 void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
 				  unsigned long size);
+void vmemmap_kfree(struct page *memmap, unsigned long nr_pages);
+void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages);
 
 enum mf_flags {
 	MF_COUNT_INCREASED = 1 << 0,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ccc11b6..7797e91 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -301,19 +301,6 @@ static int __meminit __add_section(int nid, struct zone *zone,
 	return register_new_memory(nid, __pfn_to_section(phys_start_pfn));
 }
 
-#ifdef CONFIG_SPARSEMEM_VMEMMAP
-static int __remove_section(struct zone *zone, struct mem_section *ms)
-{
-	int ret = -EINVAL;
-
-	if (!valid_section(ms))
-		return ret;
-
-	ret = unregister_memory_section(ms);
-
-	return ret;
-}
-#else
 static int __remove_section(struct zone *zone, struct mem_section *ms)
 {
 	unsigned long flags;
@@ -330,9 +317,9 @@ static int __remove_section(struct zone *zone, struct mem_section *ms)
 	pgdat_resize_lock(pgdat, &flags);
 	sparse_remove_one_section(zone, ms);
 	pgdat_resize_unlock(pgdat, &flags);
-	return 0;
+
+	return ret;
 }
-#endif
 
 /*
  * Reasonably generic function for adding memory.  It is
diff --git a/mm/sparse.c b/mm/sparse.c
index fac95f2..c723bc2 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -613,12 +613,13 @@ static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
 	/* This will make the necessary allocations eventually. */
 	return sparse_mem_map_populate(pnum, nid);
 }
-static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages)
+static void __kfree_section_memmap(struct page *page, unsigned long nr_pages)
 {
-	return; /* XXX: Not implemented yet */
+	vmemmap_kfree(page, nr_pages);
 }
-static void free_map_bootmem(struct page *page, unsigned long nr_pages)
+static void free_map_bootmem(struct page *page)
 {
+	vmemmap_free_bootmem(page, PAGES_PER_SECTION);
 }
 #else
 static struct page *__kmalloc_section_memmap(unsigned long nr_pages)
@@ -658,10 +659,14 @@ static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages)
 			   get_order(sizeof(struct page) * nr_pages));
 }
 
-static void free_map_bootmem(struct page *page, unsigned long nr_pages)
+static void free_map_bootmem(struct page *page)
 {
 	unsigned long maps_section_nr, removing_section_nr, i;
 	unsigned long magic;
+	unsigned long nr_pages;
+
+	nr_pages = PAGE_ALIGN(PAGES_PER_SECTION * sizeof(struct page))
+		>> PAGE_SHIFT;
 
 	for (i = 0; i < nr_pages; i++, page++) {
 		magic = (unsigned long) page->lru.next;
@@ -688,7 +693,6 @@ static void free_map_bootmem(struct page *page, unsigned long nr_pages)
 static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 {
 	struct page *usemap_page;
-	unsigned long nr_pages;
 
 	if (!usemap)
 		return;
@@ -713,10 +717,7 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 		struct page *memmap_page;
 		memmap_page = virt_to_page(memmap);
 
-		nr_pages = PAGE_ALIGN(PAGES_PER_SECTION * sizeof(struct page))
-			>> PAGE_SHIFT;
-
-		free_map_bootmem(memmap_page, nr_pages);
+		free_map_bootmem(memmap_page);
 	}
 }
 
-- 
1.8.0


* [Patch v4 09/12] memory-hotplug: remove page table of x86_64 architecture
  2012-11-27 10:00 [Patch v4 00/12] memory-hotplug: hot-remove physical memory Wen Congyang
                   ` (7 preceding siblings ...)
  2012-11-27 10:00 ` [Patch v4 08/12] memory-hotplug: remove memmap " Wen Congyang
@ 2012-11-27 10:00 ` Wen Congyang
  2012-12-07  6:43   ` Tang Chen
  2012-11-27 10:00 ` [Patch v4 10/12] memory-hotplug: memory_hotplug: clear zone when removing the memory Wen Congyang
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Wen Congyang @ 2012-11-27 10:00 UTC (permalink / raw)
  To: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux
  Cc: Len Brown, Jiang Liu, Wen Congyang, Jianguo Wu,
	Yasuaki Ishimatsu, paulus, Minchan Kim, KOSAKI Motohiro,
	David Rientjes, Christoph Lameter, Andrew Morton, Jiang Liu

For memory hot-remove, we should also remove the page tables that map
the removed memory. So this patch searches for the page tables covering
the removed memory and clears them.
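
In outline, the removal walk added here (kernel_physical_mapping_remove()
in the patch below) does the following; this is only an illustrative
sketch, not the exact code:

	/*
	 * for each pgd slot covering [start, end):
	 *     walk pud -> pmd -> pte, clearing entries that map the range;
	 *     if a 1G/2M large page is only partially covered, split it
	 *     first with __split_large_page();
	 *     once every entry in a page-table page is cleared, free that
	 *     page and clear the entry pointing to it one level up.
	 */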

CC: David Rientjes <rientjes@google.com>
CC: Jiang Liu <liuj97@gmail.com>
CC: Len Brown <len.brown@intel.com>
CC: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
---
 arch/x86/include/asm/pgtable_types.h |   1 +
 arch/x86/mm/init_64.c                | 231 +++++++++++++++++++++++++++++++++++
 arch/x86/mm/pageattr.c               |  47 +++----
 3 files changed, 257 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index ec8a1fc..fb0c24d 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -332,6 +332,7 @@ static inline void update_page_count(int level, unsigned long pages) { }
  * as a pte too.
  */
 extern pte_t *lookup_address(unsigned long address, unsigned int *level);
+extern int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase);
 
 #endif	/* !__ASSEMBLY__ */
 
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index e85626d..23d932a 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -680,6 +680,235 @@ int arch_add_memory(int nid, u64 start, u64 size)
 }
 EXPORT_SYMBOL_GPL(arch_add_memory);
 
+static inline void free_pagetable(struct page *page)
+{
+	struct zone *zone;
+	bool bootmem = false;
+
+	/* bootmem page has reserved flag */
+	if (PageReserved(page)) {
+		__ClearPageReserved(page);
+		bootmem = true;
+	}
+
+	__free_page(page);
+
+	if (bootmem) {
+		zone = page_zone(page);
+		zone_span_writelock(zone);
+		zone->present_pages++;
+		zone_span_writeunlock(zone);
+		totalram_pages++;
+	}
+}
+
+static void free_pte_table(pte_t *pte_start, pmd_t *pmd)
+{
+	pte_t *pte;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PTE; i++) {
+		pte = pte_start + i;
+		if (pte_val(*pte))
+			return;
+	}
+
+	/* free a pte table */
+	free_pagetable(pmd_page(*pmd));
+	pmd_clear(pmd);
+}
+
+static void free_pmd_table(pmd_t *pmd_start, pud_t *pud)
+{
+	pmd_t *pmd;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		pmd = pmd_start + i;
+		if (pmd_val(*pmd))
+			return;
+	}
+
+	/* free a pmd table */
+	free_pagetable(pud_page(*pud));
+	pud_clear(pud);
+}
+
+/* return true if pgd is changed, otherwise return false */
+static bool free_pud_table(pud_t *pud_start, pgd_t *pgd)
+{
+	pud_t *pud;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		pud = pud_start + i;
+		if (pud_val(*pud))
+			return false;
+	}
+
+	/* free a pud table */
+	free_pagetable(pgd_page(*pgd));
+	pgd_clear(pgd);
+
+	return true;
+}
+
+static void __meminit
+phys_pte_remove(pte_t *pte_page, unsigned long addr, unsigned long end)
+{
+	unsigned long pages = 0;
+	int i = pte_index(addr);
+
+	pte_t *pte = pte_page + pte_index(addr);
+
+	for (; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE, pte++) {
+
+		if (addr >= end)
+			break;
+
+		if (!pte_present(*pte))
+			continue;
+
+		pages++;
+		set_pte(pte, __pte(0));
+	}
+
+	update_page_count(PG_LEVEL_4K, -pages);
+}
+
+static void __meminit
+phys_pmd_remove(pmd_t *pmd_page, unsigned long addr, unsigned long end)
+{
+	unsigned long pages = 0, next;
+	int i = pmd_index(addr);
+
+	for (; i < PTRS_PER_PMD && addr < end; i++, addr = next) {
+		unsigned long pte_phys;
+		pmd_t *pmd = pmd_page + pmd_index(addr);
+		pte_t *pte;
+
+		next = pmd_addr_end(addr, end);
+
+		if (!pmd_present(*pmd))
+			continue;
+
+		if (pmd_large(*pmd)) {
+			if (IS_ALIGNED(addr, PMD_SIZE) &&
+			    IS_ALIGNED(next, PMD_SIZE)) {
+				set_pmd(pmd, __pmd(0));
+				pages++;
+				continue;
+			}
+
+			/*
+			 * We are using a 2M page here, but only part of it
+			 * needs to be removed, so split it into 4K pages.
+			 */
+			pte = alloc_low_page(&pte_phys);
+			BUG_ON(!pte);
+			__split_large_page((pte_t *)pmd,
+					   (unsigned long)__va(addr), pte);
+
+			spin_lock(&init_mm.page_table_lock);
+			pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
+			spin_unlock(&init_mm.page_table_lock);
+
+			/* Do a global flush tlb after splitting a large page */
+			flush_tlb_all();
+		}
+
+		spin_lock(&init_mm.page_table_lock);
+		pte = map_low_page((pte_t *)pmd_page_vaddr(*pmd));
+		phys_pte_remove(pte, addr, next);
+		free_pte_table(pte, pmd);
+		unmap_low_page(pte);
+		spin_unlock(&init_mm.page_table_lock);
+	}
+	update_page_count(PG_LEVEL_2M, -pages);
+}
+
+static void __meminit
+phys_pud_remove(pud_t *pud_page, unsigned long addr, unsigned long end)
+{
+	unsigned long pages = 0, next;
+	int i = pud_index(addr);
+
+	for (; i < PTRS_PER_PUD && addr < end; i++, addr = next) {
+		unsigned long pmd_phys;
+		pud_t *pud = pud_page + pud_index(addr);
+		pmd_t *pmd;
+
+		next = pud_addr_end(addr, end);
+
+		if (!pud_present(*pud))
+			continue;
+
+		if (pud_large(*pud)) {
+			if (IS_ALIGNED(addr, PUD_SIZE) &&
+			    IS_ALIGNED(next, PUD_SIZE)) {
+				set_pud(pud, __pud(0));
+				pages++;
+				continue;
+			}
+
+			/*
+			 * We are using a 1G page here, but only part of it
+			 * needs to be removed, so split it into 2M pages.
+			 */
+			pmd = alloc_low_page(&pmd_phys);
+			BUG_ON(!pmd);
+			__split_large_page((pte_t *)pud,
+					   (unsigned long)__va(addr),
+					   (pte_t *)pmd);
+
+			spin_lock(&init_mm.page_table_lock);
+			pud_populate(&init_mm, pud, __va(pmd_phys));
+			spin_unlock(&init_mm.page_table_lock);
+
+			/* Do a global flush tlb after splitting a large page */
+			flush_tlb_all();
+		}
+
+		pmd = map_low_page((pmd_t *)pud_page_vaddr(*pud));
+		phys_pmd_remove(pmd, addr, next);
+		free_pmd_table(pmd, pud);
+		unmap_low_page(pmd);
+	}
+
+	update_page_count(PG_LEVEL_1G, -pages);
+}
+
+void __meminit
+kernel_physical_mapping_remove(unsigned long start, unsigned long end)
+{
+	unsigned long next;
+	bool pgd_changed = false;
+
+	start = (unsigned long)__va(start);
+	end = (unsigned long)__va(end);
+
+	for (; start < end; start = next) {
+		pgd_t *pgd = pgd_offset_k(start);
+		pud_t *pud;
+
+		next = pgd_addr_end(start, end);
+
+		if (!pgd_present(*pgd))
+			continue;
+
+		pud = map_low_page((pud_t *)pgd_page_vaddr(*pgd));
+		phys_pud_remove(pud, __pa(start), __pa(next));
+		if (free_pud_table(pud, pgd))
+			pgd_changed = true;
+		unmap_low_page(pud);
+	}
+
+	if (pgd_changed)
+		sync_global_pgds(start, end - 1);
+
+	flush_tlb_all();
+}
+
 #ifdef CONFIG_MEMORY_HOTREMOVE
 int __ref arch_remove_memory(u64 start, u64 size)
 {
@@ -692,6 +921,8 @@ int __ref arch_remove_memory(u64 start, u64 size)
 	ret = __remove_pages(zone, start_pfn, nr_pages);
 	WARN_ON_ONCE(ret);
 
+	kernel_physical_mapping_remove(start, start + size);
+
 	return ret;
 }
 #endif
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index a718e0d..7dcb6f9 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -501,21 +501,13 @@ out_unlock:
 	return do_split;
 }
 
-static int split_large_page(pte_t *kpte, unsigned long address)
+int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
 {
 	unsigned long pfn, pfninc = 1;
 	unsigned int i, level;
-	pte_t *pbase, *tmp;
+	pte_t *tmp;
 	pgprot_t ref_prot;
-	struct page *base;
-
-	if (!debug_pagealloc)
-		spin_unlock(&cpa_lock);
-	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
-	if (!debug_pagealloc)
-		spin_lock(&cpa_lock);
-	if (!base)
-		return -ENOMEM;
+	struct page *base = virt_to_page(pbase);
 
 	spin_lock(&pgd_lock);
 	/*
@@ -523,10 +515,11 @@ static int split_large_page(pte_t *kpte, unsigned long address)
 	 * up for us already:
 	 */
 	tmp = lookup_address(address, &level);
-	if (tmp != kpte)
-		goto out_unlock;
+	if (tmp != kpte) {
+		spin_unlock(&pgd_lock);
+		return 1;
+	}
 
-	pbase = (pte_t *)page_address(base);
 	paravirt_alloc_pte(&init_mm, page_to_pfn(base));
 	ref_prot = pte_pgprot(pte_clrhuge(*kpte));
 	/*
@@ -579,17 +572,27 @@ static int split_large_page(pte_t *kpte, unsigned long address)
 	 * going on.
 	 */
 	__flush_tlb_all();
+	spin_unlock(&pgd_lock);
 
-	base = NULL;
+	return 0;
+}
 
-out_unlock:
-	/*
-	 * If we dropped out via the lookup_address check under
-	 * pgd_lock then stick the page back into the pool:
-	 */
-	if (base)
+static int split_large_page(pte_t *kpte, unsigned long address)
+{
+	pte_t *pbase;
+	struct page *base;
+
+	if (!debug_pagealloc)
+		spin_unlock(&cpa_lock);
+	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
+	if (!debug_pagealloc)
+		spin_lock(&cpa_lock);
+	if (!base)
+		return -ENOMEM;
+
+	pbase = (pte_t *)page_address(base);
+	if (__split_large_page(kpte, address, pbase))
 		__free_page(base);
-	spin_unlock(&pgd_lock);
 
 	return 0;
 }
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Patch v4 10/12] memory-hotplug: memory_hotplug: clear zone when removing the memory
  2012-11-27 10:00 [Patch v4 00/12] memory-hotplug: hot-remove physical memory Wen Congyang
                   ` (8 preceding siblings ...)
  2012-11-27 10:00 ` [Patch v4 09/12] memory-hotplug: remove page table of x86_64 architecture Wen Congyang
@ 2012-11-27 10:00 ` Wen Congyang
  2012-12-04 10:09   ` Tang Chen
  2012-11-27 10:00 ` [Patch v4 11/12] memory-hotplug: remove sysfs file of node Wen Congyang
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Wen Congyang @ 2012-11-27 10:00 UTC (permalink / raw)
  To: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux
  Cc: Len Brown, Wen Congyang, Jianguo Wu, Yasuaki Ishimatsu, paulus,
	Minchan Kim, KOSAKI Motohiro, David Rientjes, Christoph Lameter,
	Andrew Morton, Jiang Liu

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

When memory is added, __add_zone() updates the zone's and pgdat's
start_pfn and spanned_pages, so we should revert them when the memory
is removed.

The patch adds a new function __remove_zone() to do this.
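
For illustration (hypothetical numbers): suppose a zone spans pfns
[0x10000, 0x20000) and PAGES_PER_SECTION is 0x8000. After the first
section [0x10000, 0x18000) is removed, __remove_zone() should leave:

	zone->zone_start_pfn = 0x18000;	/* was 0x10000 */
	zone->spanned_pages  = 0x8000;	/* was 0x10000 */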

CC: David Rientjes <rientjes@google.com>
CC: Jiang Liu <liuj97@gmail.com>
CC: Len Brown <len.brown@intel.com>
CC: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 mm/memory_hotplug.c | 207 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 207 insertions(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7797e91..aa97d56 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -301,10 +301,213 @@ static int __meminit __add_section(int nid, struct zone *zone,
 	return register_new_memory(nid, __pfn_to_section(phys_start_pfn));
 }
 
+/* find the smallest valid pfn in the range [start_pfn, end_pfn) */
+static int find_smallest_section_pfn(int nid, struct zone *zone,
+				     unsigned long start_pfn,
+				     unsigned long end_pfn)
+{
+	struct mem_section *ms;
+
+	for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) {
+		ms = __pfn_to_section(start_pfn);
+
+		if (unlikely(!valid_section(ms)))
+			continue;
+
+		if (unlikely(pfn_to_nid(start_pfn) != nid))
+			continue;
+
+		if (zone && zone != page_zone(pfn_to_page(start_pfn)))
+			continue;
+
+		return start_pfn;
+	}
+
+	return 0;
+}
+
+/* find the biggest valid pfn in the range [start_pfn, end_pfn). */
+static int find_biggest_section_pfn(int nid, struct zone *zone,
+				    unsigned long start_pfn,
+				    unsigned long end_pfn)
+{
+	struct mem_section *ms;
+	unsigned long pfn;
+
+	/* pfn is the end pfn of a memory section. */
+	pfn = end_pfn - 1;
+	for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
+		ms = __pfn_to_section(pfn);
+
+		if (unlikely(!valid_section(ms)))
+			continue;
+
+		if (unlikely(pfn_to_nid(pfn) != nid))
+			continue;
+
+		if (zone && zone != page_zone(pfn_to_page(pfn)))
+			continue;
+
+		return pfn;
+	}
+
+	return 0;
+}
+
+static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
+			     unsigned long end_pfn)
+{
+	unsigned long zone_start_pfn =  zone->zone_start_pfn;
+	unsigned long zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
+	unsigned long pfn;
+	struct mem_section *ms;
+	int nid = zone_to_nid(zone);
+
+	zone_span_writelock(zone);
+	if (zone_start_pfn == start_pfn) {
+		/*
+		 * If the removed section is the smallest one in the zone,
+		 * we need to shrink zone->zone_start_pfn and
+		 * zone->spanned_pages. In this case, find the second
+		 * smallest valid mem_section and shrink the zone to it.
+		 */
+		pfn = find_smallest_section_pfn(nid, zone, end_pfn,
+						zone_end_pfn);
+		if (pfn) {
+			zone->zone_start_pfn = pfn;
+			zone->spanned_pages = zone_end_pfn - pfn;
+		}
+	} else if (zone_end_pfn == end_pfn) {
+		/*
+		 * If the removed section is the biggest one in the zone,
+		 * we need to shrink zone->spanned_pages. In this case,
+		 * find the second biggest valid mem_section and shrink
+		 * the zone to end there.
+		 */
+		pfn = find_biggest_section_pfn(nid, zone, zone_start_pfn,
+					       start_pfn);
+		if (pfn)
+			zone->spanned_pages = pfn - zone_start_pfn + 1;
+	}
+
+	/*
+	 * If the section is neither the biggest nor the smallest mem_section
+	 * in the zone, removing it only creates a hole, so the zone's span
+	 * need not change. But the zone may now contain nothing but holes,
+	 * so check whether any valid section remains.
+	 */
+	pfn = zone_start_pfn;
+	for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) {
+		ms = __pfn_to_section(pfn);
+
+		if (unlikely(!valid_section(ms)))
+			continue;
+
+		if (page_zone(pfn_to_page(pfn)) != zone)
+			continue;
+
+		 /* Skip the section that is being removed */
+		if (start_pfn == pfn)
+			continue;
+
+		/* Found a valid section; there is nothing more to do */
+		zone_span_writeunlock(zone);
+		return;
+	}
+
+	/* The zone has no valid section */
+	zone->zone_start_pfn = 0;
+	zone->spanned_pages = 0;
+	zone_span_writeunlock(zone);
+}
+
+static void shrink_pgdat_span(struct pglist_data *pgdat,
+			      unsigned long start_pfn, unsigned long end_pfn)
+{
+	unsigned long pgdat_start_pfn =  pgdat->node_start_pfn;
+	unsigned long pgdat_end_pfn =
+		pgdat->node_start_pfn + pgdat->node_spanned_pages;
+	unsigned long pfn;
+	struct mem_section *ms;
+	int nid = pgdat->node_id;
+
+	if (pgdat_start_pfn == start_pfn) {
+		/*
+		 * If the removed section is the smallest one in the pgdat,
+		 * we need to shrink pgdat->node_start_pfn and
+		 * pgdat->node_spanned_pages. In this case, find the second
+		 * smallest valid mem_section and shrink the pgdat to it.
+		 */
+		pfn = find_smallest_section_pfn(nid, NULL, end_pfn,
+						pgdat_end_pfn);
+		if (pfn) {
+			pgdat->node_start_pfn = pfn;
+			pgdat->node_spanned_pages = pgdat_end_pfn - pfn;
+		}
+	} else if (pgdat_end_pfn == end_pfn) {
+		/*
+		 * If the removed section is the biggest one in the pgdat,
+		 * we need to shrink pgdat->node_spanned_pages. In this
+		 * case, find the second biggest valid mem_section and
+		 * shrink the pgdat to end there.
+		 */
+		pfn = find_biggest_section_pfn(nid, NULL, pgdat_start_pfn,
+					       start_pfn);
+		if (pfn)
+			pgdat->node_spanned_pages = pfn - pgdat_start_pfn + 1;
+	}
+
+	/*
+	 * If the section is neither the biggest nor the smallest mem_section
+	 * in the pgdat, removing it only creates a hole in the pgdat, so the
+	 * pgdat's span need not change.
+	 * But the pgdat may now contain nothing but holes, so check whether
+	 * any valid section remains.
+	 */
+	pfn = pgdat_start_pfn;
+	for (; pfn < pgdat_end_pfn; pfn += PAGES_PER_SECTION) {
+		ms = __pfn_to_section(pfn);
+
+		if (unlikely(!valid_section(ms)))
+			continue;
+
+		if (pfn_to_nid(pfn) != nid)
+			continue;
+
+		 /* Skip the section that is being removed */
+		if (start_pfn == pfn)
+			continue;
+
+		/* Found a valid section; there is nothing more to do */
+		return;
+	}
+
+	/* The pgdat has no valid section */
+	pgdat->node_start_pfn = 0;
+	pgdat->node_spanned_pages = 0;
+}
+
+static void __remove_zone(struct zone *zone, unsigned long start_pfn)
+{
+	struct pglist_data *pgdat = zone->zone_pgdat;
+	int nr_pages = PAGES_PER_SECTION;
+	int zone_type;
+	unsigned long flags;
+
+	zone_type = zone - pgdat->node_zones;
+
+	pgdat_resize_lock(zone->zone_pgdat, &flags);
+	shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
+	shrink_pgdat_span(pgdat, start_pfn, start_pfn + nr_pages);
+	pgdat_resize_unlock(zone->zone_pgdat, &flags);
+}
+
 static int __remove_section(struct zone *zone, struct mem_section *ms)
 {
 	unsigned long flags;
 	struct pglist_data *pgdat = zone->zone_pgdat;
+	unsigned long start_pfn;
+	int scn_nr;
 	int ret = -EINVAL;
 
 	if (!valid_section(ms))
@@ -314,6 +517,10 @@ static int __remove_section(struct zone *zone, struct mem_section *ms)
 	if (ret)
 		return ret;
 
+	scn_nr = __section_nr(ms);
+	start_pfn = section_nr_to_pfn(scn_nr);
+	__remove_zone(zone, start_pfn);
+
 	pgdat_resize_lock(pgdat, &flags);
 	sparse_remove_one_section(zone, ms);
 	pgdat_resize_unlock(pgdat, &flags);
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Patch v4 11/12] memory-hotplug: remove sysfs file of node
  2012-11-27 10:00 [Patch v4 00/12] memory-hotplug: hot-remove physical memory Wen Congyang
                   ` (9 preceding siblings ...)
  2012-11-27 10:00 ` [Patch v4 10/12] memory-hotplug: memory_hotplug: clear zone when removing the memory Wen Congyang
@ 2012-11-27 10:00 ` Wen Congyang
  2012-12-04 10:10   ` Tang Chen
  2012-11-27 10:00 ` [Patch v4 12/12] memory-hotplug: free node_data when a node is offlined Wen Congyang
  2012-11-27 19:27 ` [Patch v4 00/12] memory-hotplug: hot-remove physical memory Andrew Morton
  12 siblings, 1 reply; 40+ messages in thread
From: Wen Congyang @ 2012-11-27 10:00 UTC (permalink / raw)
  To: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux
  Cc: Len Brown, Wen Congyang, Jianguo Wu, Yasuaki Ishimatsu, paulus,
	Minchan Kim, KOSAKI Motohiro, David Rientjes, Christoph Lameter,
	Andrew Morton, Jiang Liu

This patch introduces a new function, try_offline_node(), which removes
the sysfs files of a node once all memory sections of that node have
been removed. If some memory sections of the node remain, the function
does nothing.
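
For example (the node number is hypothetical), once the last memory
section of node 1 has been removed, the directory
/sys/devices/system/node/node1 and the files under it disappear.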

CC: David Rientjes <rientjes@google.com>
CC: Jiang Liu <liuj97@gmail.com>
CC: Len Brown <len.brown@intel.com>
CC: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 drivers/acpi/acpi_memhotplug.c |  8 +++++-
 include/linux/memory_hotplug.h |  2 +-
 mm/memory_hotplug.c            | 58 ++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 64 insertions(+), 4 deletions(-)

diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 24c807f..0780f99 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -310,7 +310,9 @@ static int acpi_memory_disable_device(struct acpi_memory_device *mem_device)
 {
 	int result;
 	struct acpi_memory_info *info, *n;
+	int node;
 
+	node = acpi_get_node(mem_device->device->handle);
 
 	/*
 	 * Ask the VM to offline this memory range.
@@ -318,7 +320,11 @@ static int acpi_memory_disable_device(struct acpi_memory_device *mem_device)
 	 */
 	list_for_each_entry_safe(info, n, &mem_device->res_list, list) {
 		if (info->enabled) {
-			result = remove_memory(info->start_addr, info->length);
+			if (node < 0)
+				node = memory_add_physaddr_to_nid(
+					info->start_addr);
+			result = remove_memory(node, info->start_addr,
+				info->length);
 			if (result)
 				return result;
 		}
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index d4c4402..7b4cfe6 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -231,7 +231,7 @@ extern int arch_add_memory(int nid, u64 start, u64 size);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern int offline_memory_block(struct memory_block *mem);
 extern bool is_memblock_offlined(struct memory_block *mem);
-extern int remove_memory(u64 start, u64 size);
+extern int remove_memory(int node, u64 start, u64 size);
 extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
 								int nr_pages);
 extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index aa97d56..449663e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -29,6 +29,7 @@
 #include <linux/suspend.h>
 #include <linux/mm_inline.h>
 #include <linux/firmware-map.h>
+#include <linux/stop_machine.h>
 
 #include <asm/tlbflush.h>
 
@@ -1288,7 +1289,58 @@ static int is_memblock_offlined_cb(struct memory_block *mem, void *arg)
 	return ret;
 }
 
-int __ref remove_memory(u64 start, u64 size)
+static int check_cpu_on_node(void *data)
+{
+	struct pglist_data *pgdat = data;
+	int cpu;
+
+	for_each_present_cpu(cpu) {
+		if (cpu_to_node(cpu) == pgdat->node_id)
+			/*
+			 * a cpu on this node has not been removed, so we
+			 * can't offline this node.
+			 */
+			return -EBUSY;
+	}
+
+	return 0;
+}
+
+/* offline the node if all memory sections of this node are removed */
+static void try_offline_node(int nid)
+{
+	unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn;
+	unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages;
+	unsigned long pfn;
+
+	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+		unsigned long section_nr = pfn_to_section_nr(pfn);
+
+		if (!present_section_nr(section_nr))
+			continue;
+
+		if (pfn_to_nid(pfn) != nid)
+			continue;
+
+		/*
+		 * some memory sections of this node have not been removed,
+		 * so we can't offline the node now.
+		 */
+		return;
+	}
+
+	if (stop_machine(check_cpu_on_node, NODE_DATA(nid), NULL))
+		return;
+
+	/*
+	 * all memory and cpus of this node have been removed, so we can
+	 * offline this node now.
+	 */
+	node_set_offline(nid);
+	unregister_one_node(nid);
+}
+
+int __ref remove_memory(int nid, u64 start, u64 size)
 {
 	unsigned long start_pfn, end_pfn;
 	int ret = 0;
@@ -1335,6 +1387,8 @@ repeat:
 
 	arch_remove_memory(start, size);
 
+	try_offline_node(nid);
+
 	unlock_memory_hotplug();
 
 	return 0;
@@ -1344,7 +1398,7 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 {
 	return -EINVAL;
 }
-int remove_memory(u64 start, u64 size)
+int remove_memory(int nid, u64 start, u64 size)
 {
 	return -EINVAL;
 }
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [Patch v4 12/12] memory-hotplug: free node_data when a node is offlined
  2012-11-27 10:00 [Patch v4 00/12] memory-hotplug: hot-remove physical memory Wen Congyang
                   ` (10 preceding siblings ...)
  2012-11-27 10:00 ` [Patch v4 11/12] memory-hotplug: remove sysfs file of node Wen Congyang
@ 2012-11-27 10:00 ` Wen Congyang
  2012-12-04 10:10   ` Tang Chen
  2012-11-27 19:27 ` [Patch v4 00/12] memory-hotplug: hot-remove physical memory Andrew Morton
  12 siblings, 1 reply; 40+ messages in thread
From: Wen Congyang @ 2012-11-27 10:00 UTC (permalink / raw)
  To: x86, linux-mm, linux-kernel, linuxppc-dev, linux-acpi,
	linux-s390, linux-sh, linux-ia64, cmetcalf, sparclinux
  Cc: Len Brown, Wen Congyang, Jianguo Wu, Yasuaki Ishimatsu, paulus,
	Minchan Kim, KOSAKI Motohiro, David Rientjes, Christoph Lameter,
	Andrew Morton, Jiang Liu

When a node is hot-added, hotadd_new_pgdat() allocates memory to store
its node_data, so we should free that memory when the node is removed.
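
In rough terms (a sketch; the helpers are the ones hotadd_new_pgdat()
already uses), the allocation this patch pairs a free with is done on
hot-add like this:

	/* on hot-add of a new node (simplified) */
	pgdat = arch_alloc_nodedata(nid);	/* allocates node_data */
	arch_refresh_nodedata(nid, pgdat);

and the code below undoes it with arch_refresh_nodedata(nid, NULL) and
arch_free_nodedata(pgdat), unless node_data was allocated from boot
memory.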

CC: David Rientjes <rientjes@google.com>
CC: Jiang Liu <liuj97@gmail.com>
CC: Len Brown <len.brown@intel.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
---
 mm/memory_hotplug.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 449663e..d1451ab 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1309,9 +1309,12 @@ static int check_cpu_on_node(void *data)
 /* offline the node if all memory sections of this node are removed */
 static void try_offline_node(int nid)
 {
+	pg_data_t *pgdat = NODE_DATA(nid);
 	unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn;
-	unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages;
+	unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
 	unsigned long pfn;
+	struct page *pgdat_page = virt_to_page(pgdat);
+	int i;
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
 		unsigned long section_nr = pfn_to_section_nr(pfn);
@@ -1338,6 +1341,21 @@ static void try_offline_node(int nid)
 	 */
 	node_set_offline(nid);
 	unregister_one_node(nid);
+
+	if (!PageSlab(pgdat_page) && !PageCompound(pgdat_page))
+		/* node data is allocated from boot memory */
+		return;
+
+	/* free the wait_table in each zone */
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		struct zone *zone = pgdat->node_zones + i;
+
+		if (zone->wait_table)
+			vfree(zone->wait_table);
+	}
+
+	arch_refresh_nodedata(nid, NULL);
+	arch_free_nodedata(pgdat);
 }
 
 int __ref remove_memory(int nid, u64 start, u64 size)
-- 
1.8.0

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [Patch v4 00/12] memory-hotplug: hot-remove physical memory
  2012-11-27 10:00 [Patch v4 00/12] memory-hotplug: hot-remove physical memory Wen Congyang
                   ` (11 preceding siblings ...)
  2012-11-27 10:00 ` [Patch v4 12/12] memory-hotplug: free node_data when a node is offlined Wen Congyang
@ 2012-11-27 19:27 ` Andrew Morton
  2012-11-27 19:38   ` Rafael J. Wysocki
                     ` (2 more replies)
  12 siblings, 3 replies; 40+ messages in thread
From: Andrew Morton @ 2012-11-27 19:27 UTC (permalink / raw)
  To: Wen Congyang
  Cc: linux-s390, linux-ia64, Len Brown, linux-acpi, linux-sh, x86,
	linux-kernel, cmetcalf, Jianguo Wu, linux-mm, Yasuaki Ishimatsu,
	paulus, Minchan Kim, KOSAKI Motohiro, David Rientjes, sparclinux,
	Christoph Lameter, linuxppc-dev, Jiang Liu

On Tue, 27 Nov 2012 18:00:10 +0800
Wen Congyang <wency@cn.fujitsu.com> wrote:

> The patch-set was divided from following thread's patch-set.
>     https://lkml.org/lkml/2012/9/5/201
> 
> The last version of this patchset:
>     https://lkml.org/lkml/2012/11/1/93

As we're now at -rc7 I'd prefer to take a look at all of this after the
3.7 release - please resend everything shortly after 3.8-rc1.

> If you want to know the reason, please read following thread.
> 
> https://lkml.org/lkml/2012/10/2/83

Please include the rationale within each version of the patchset rather
than by linking to an old email.  Because

a) this way, more people are likely to read it

b) it permits the text to be maintained as the code evolves

c) it permits the text to be included in the mainline commit, where
   people can find it.

> The patch-set has only the function of kernel core side for physical
> memory hot remove. So if you use the patch, please apply following
> patches.
> 
> - bug fix for memory hot remove
>   https://lkml.org/lkml/2012/10/31/269
>   
> - acpi framework
>   https://lkml.org/lkml/2012/10/26/175

What's happening with the acpi framework?  has it received any feedback
from the ACPI developers?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v4 00/12] memory-hotplug: hot-remove physical memory
  2012-11-27 19:27 ` [Patch v4 00/12] memory-hotplug: hot-remove physical memory Andrew Morton
@ 2012-11-27 19:38   ` Rafael J. Wysocki
  2012-11-28  0:43   ` Yasuaki Ishimatsu
  2012-11-30  6:37   ` Tang Chen
  2 siblings, 0 replies; 40+ messages in thread
From: Rafael J. Wysocki @ 2012-11-27 19:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-s390, linux-ia64, Wen Congyang, linux-acpi, linux-sh,
	Len Brown, x86, linux-kernel, cmetcalf, Jianguo Wu, linux-mm,
	Yasuaki Ishimatsu, paulus, Minchan Kim, KOSAKI Motohiro,
	David Rientjes, sparclinux, Christoph Lameter, linuxppc-dev,
	Jiang Liu

On Tuesday, November 27, 2012 11:27:41 AM Andrew Morton wrote:
> On Tue, 27 Nov 2012 18:00:10 +0800
> Wen Congyang <wency@cn.fujitsu.com> wrote:
> 
> > The patch-set was divided from following thread's patch-set.
> >     https://lkml.org/lkml/2012/9/5/201
> > 
> > The last version of this patchset:
> >     https://lkml.org/lkml/2012/11/1/93
> 
> As we're now at -rc7 I'd prefer to take a look at all of this after the
> 3.7 release - please resend everything shortly after 3.8-rc1.
> 
> > If you want to know the reason, please read following thread.
> > 
> > https://lkml.org/lkml/2012/10/2/83
> 
> Please include the rationale within each version of the patchset rather
> than by linking to an old email.  Because
> 
> a) this way, more people are likely to read it
> 
> b) it permits the text to be maintained as the code evolves
> 
> c) it permits the text to be included in the mainline commit, where
>    people can find it.
> 
> > The patch-set has only the function of kernel core side for physical
> > memory hot remove. So if you use the patch, please apply following
> > patches.
> > 
> > - bug fix for memory hot remove
> >   https://lkml.org/lkml/2012/10/31/269
> >   
> > - acpi framework
> >   https://lkml.org/lkml/2012/10/26/175
> 
> What's happening with the acpi framework?  has it received any feedback
> from the ACPI developers?

This particular series is in my tree waiting for the v3.8 merge window.

There's a new one sent yesterday, and that one is still pending review.  I'm
not sure if the $subject patchset depends on it, though.

It looks like there are too many hotplug patchsets flying left and right and
it's getting hard to keep track of them all.

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v4 00/12] memory-hotplug: hot-remove physical memory
  2012-11-27 19:27 ` [Patch v4 00/12] memory-hotplug: hot-remove physical memory Andrew Morton
  2012-11-27 19:38   ` Rafael J. Wysocki
@ 2012-11-28  0:43   ` Yasuaki Ishimatsu
  2012-11-30  6:37   ` Tang Chen
  2 siblings, 0 replies; 40+ messages in thread
From: Yasuaki Ishimatsu @ 2012-11-28  0:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-s390, linux-ia64, Wen Congyang, linux-acpi, linux-sh,
	Len Brown, x86, linux-kernel, cmetcalf, Jianguo Wu, linux-mm,
	paulus, Minchan Kim, KOSAKI Motohiro, David Rientjes, sparclinux,
	Christoph Lameter, linuxppc-dev, Jiang Liu

Hi Andrew,

2012/11/28 4:27, Andrew Morton wrote:
> On Tue, 27 Nov 2012 18:00:10 +0800
> Wen Congyang <wency@cn.fujitsu.com> wrote:
>
>> The patch-set was divided from following thread's patch-set.
>>      https://lkml.org/lkml/2012/9/5/201
>>
>> The last version of this patchset:
>>      https://lkml.org/lkml/2012/11/1/93
>
> As we're now at -rc7 I'd prefer to take a look at all of this after the
> 3.7 release - please resend everything shortly after 3.8-rc1.

Most of the memory-hotplug patches have been merged into your and Rafael's
trees and are waiting for the v3.8 merge window to open. This patch-set is
all that remains, so we hope it can be merged into v3.8 as well.

With this patch-set merged into v3.8, Linux on x86_64 will support full
memory hotplug, including physical hot-remove.

Thanks,
Yasuaki Ishimatsu

>
>> If you want to know the reason, please read following thread.
>>
>> https://lkml.org/lkml/2012/10/2/83
>
> Please include the rationale within each version of the patchset rather
> than by linking to an old email.  Because
>
> a) this way, more people are likely to read it
>
> b) it permits the text to be maintained as the code evolves
>
> c) it permits the text to be included in the mainline commit, where
>     people can find it.
>
>> The patch-set has only the function of kernel core side for physical
>> memory hot remove. So if you use the patch, please apply following
>> patches.
>>
>> - bug fix for memory hot remove
>>    https://lkml.org/lkml/2012/10/31/269
>>
>> - acpi framework
>>    https://lkml.org/lkml/2012/10/26/175
>
> What's happening with the acpi framework?  has it received any feedback
> from the ACPI developers?
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v4 08/12] memory-hotplug: remove memmap of sparse-vmemmap
  2012-11-27 10:00 ` [Patch v4 08/12] memory-hotplug: remove memmap " Wen Congyang
@ 2012-11-28  9:40   ` Jianguo Wu
  2012-11-30  1:45     ` Wen Congyang
  2012-12-04  9:47   ` Tang Chen
  1 sibling, 1 reply; 40+ messages in thread
From: Jianguo Wu @ 2012-11-28  9:40 UTC (permalink / raw)
  To: Wen Congyang
  Cc: linux-s390, linux-ia64, Len Brown, linux-acpi, linux-sh, x86,
	linux-kernel, cmetcalf, linux-mm, Yasuaki Ishimatsu, paulus,
	Minchan Kim, KOSAKI Motohiro, David Rientjes, sparclinux,
	Christoph Lameter, linuxppc-dev, Andrew Morton, Jiang Liu

Hi Congyang,

I think vmemmap's pgtable pages should be freed after all their entries are cleared; I have a patch to do this.
The code logic is the same as in [Patch v4 09/12] memory-hotplug: remove page table of x86_64 architecture.

What do you think about this?

Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
---
 include/linux/mm.h  |    1 +
 mm/sparse-vmemmap.c |  214 +++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/sparse.c         |    5 +-
 3 files changed, 218 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5657670..1f26af5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1642,6 +1642,7 @@ int vmemmap_populate(struct page *start_page, unsigned long pages, int node);
 void vmemmap_populate_print_last(void);
 void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
 				  unsigned long size);
+void vmemmap_free(struct page *memmap, unsigned long nr_pages);
 
 enum mf_flags {
 	MF_COUNT_INCREASED = 1 << 0,
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 1b7e22a..242cb28 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -29,6 +29,10 @@
 #include <asm/pgalloc.h>
 #include <asm/pgtable.h>
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+#include <asm/tlbflush.h>
+#endif
+
 /*
  * Allocate a block of memory to be used to back the virtual memory map
  * or to back the page tables that are used to create the mapping.
@@ -224,3 +228,213 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
 		vmemmap_buf_end = NULL;
 	}
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+static void vmemmap_free_pages(struct page *page, int order)
+{
+	struct zone *zone;
+	unsigned long magic;
+
+	magic = (unsigned long) page->lru.next;
+	if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
+		put_page_bootmem(page);
+
+		zone = page_zone(page);
+		zone_span_writelock(zone);
+		zone->present_pages++;
+		zone_span_writeunlock(zone);
+		totalram_pages++;
+	} else {
+		if (is_vmalloc_addr(page_address(page)))
+			vfree(page_address(page));
+		else
+			free_pages((unsigned long)page_address(page), order);
+	}
+}
+
+static void free_pte_table(pmd_t *pmd)
+{
+	pte_t *pte, *pte_start;
+	int i;
+
+	pte_start = (pte_t *)pmd_page_vaddr(*pmd);
+	for (i = 0; i < PTRS_PER_PTE; i++) {
+		pte = pte_start + i;
+		if (pte_val(*pte))
+			return;
+	}
+
+	/* free a pte table */
+	vmemmap_free_pages(pmd_page(*pmd), 0);
+	spin_lock(&init_mm.page_table_lock);
+	pmd_clear(pmd);
+	spin_unlock(&init_mm.page_table_lock);
+}
+
+static void free_pmd_table(pud_t *pud)
+{
+	pmd_t *pmd, *pmd_start;
+	int i;
+
+	pmd_start = (pmd_t *)pud_page_vaddr(*pud);
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		pmd = pmd_start + i;
+		if (pmd_val(*pmd))
+			return;
+	}
+
+	/* free a pmd table */
+	vmemmap_free_pages(pud_page(*pud), 0);
+	spin_lock(&init_mm.page_table_lock);
+	pud_clear(pud);
+	spin_unlock(&init_mm.page_table_lock);
+}
+
+static void free_pud_table(pgd_t *pgd)
+{
+	pud_t *pud, *pud_start;
+	int i;
+
+	pud_start = (pud_t *)pgd_page_vaddr(*pgd);
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		pud = pud_start + i;
+		if (pud_val(*pud))
+			return;
+	}
+
+	/* free a pud table */
+	vmemmap_free_pages(pgd_page(*pgd), 0);
+	spin_lock(&init_mm.page_table_lock);
+	pgd_clear(pgd);
+	spin_unlock(&init_mm.page_table_lock);
+}
+
+static int split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
+{
+	struct page *page = pmd_page(*(pmd_t *)kpte);
+	int i = 0;
+	unsigned long magic;
+	unsigned long section_nr;
+
+	__split_large_page(kpte, address, pbase);
+	__flush_tlb_all();
+
+	magic = (unsigned long) page->lru.next;
+	if (magic == SECTION_INFO) {
+		section_nr = pfn_to_section_nr(page_to_pfn(page));
+		while (i < PTRS_PER_PMD) {
+			page++;
+			i++;
+			get_page_bootmem(section_nr, page, SECTION_INFO);
+		}
+	}
+
+	return 0;
+}
+
+static void vmemmap_pte_remove(pmd_t *pmd, unsigned long addr, unsigned long end)
+{
+	pte_t *pte;
+	unsigned long next;
+
+	pte = pte_offset_kernel(pmd, addr);
+	for (; addr < end; pte++, addr += PAGE_SIZE) {
+		next = (addr + PAGE_SIZE) & PAGE_MASK;
+		if (next > end)
+			next = end;
+
+		if (pte_none(*pte))
+			continue;
+		if (IS_ALIGNED(addr, PAGE_SIZE) &&
+		    IS_ALIGNED(end, PAGE_SIZE)) {
+			vmemmap_free_pages(pte_page(*pte), 0);
+			spin_lock(&init_mm.page_table_lock);
+			pte_clear(&init_mm, addr, pte);
+			spin_unlock(&init_mm.page_table_lock);
+		}
+	}
+
+	free_pte_table(pmd);
+	__flush_tlb_all();
+}
+
+static void vmemmap_pmd_remove(pud_t *pud, unsigned long addr, unsigned long end)
+{
+	unsigned long next;
+	pmd_t *pmd;
+
+	pmd = pmd_offset(pud, addr);
+	for (; addr < end; addr = next, pmd++) {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none(*pmd))
+			continue;
+
+		if (cpu_has_pse) {
+			unsigned long pte_base;
+
+			if (IS_ALIGNED(addr, PMD_SIZE) &&
+			    IS_ALIGNED(next, PMD_SIZE)) {
+				vmemmap_free_pages(pmd_page(*pmd),
+						   get_order(PMD_SIZE));
+				spin_lock(&init_mm.page_table_lock);
+				pmd_clear(pmd);
+				spin_unlock(&init_mm.page_table_lock);
+				continue;
+			}
+
+			/*
+			 * We are using a 2M page here, but only part of it
+			 * needs to be removed, so split it into 4K pages.
+			 */
+			pte_base = get_zeroed_page(GFP_ATOMIC | __GFP_NOTRACK);
+			split_large_page((pte_t *)pmd, addr, (pte_t *)pte_base);
+			__flush_tlb_all();
+
+			spin_lock(&init_mm.page_table_lock);
+			pmd_populate_kernel(&init_mm, pmd, (pte_t *)pte_base);
+			spin_unlock(&init_mm.page_table_lock);
+		}
+
+		vmemmap_pte_remove(pmd, addr, next);
+	}
+
+	free_pmd_table(pud);
+	__flush_tlb_all();
+}
+
+static void vmemmap_pud_remove(pgd_t *pgd, unsigned long addr, unsigned long end)
+{
+	unsigned long next;
+	pud_t *pud;
+
+	pud = pud_offset(pgd, addr);
+	for (; addr < end; addr = next, pud++) {
+		next = pud_addr_end(addr, end);
+		if (pud_none(*pud))
+			continue;
+
+		vmemmap_pmd_remove(pud, addr, next);
+	}
+
+	free_pud_table(pgd);
+	__flush_tlb_all();
+}
+
+void vmemmap_free(struct page *memmap, unsigned long nr_pages)
+{
+	unsigned long addr = (unsigned long)memmap;
+	unsigned long end = (unsigned long)(memmap + nr_pages);
+	unsigned long next;
+
+	for (; addr < end; addr = next) {
+		pgd_t *pgd = pgd_offset_k(addr);
+
+		next = pgd_addr_end(addr, end);
+		if (!pgd_present(*pgd))
+			continue;
+
+		vmemmap_pud_remove(pgd, addr, next);
+		sync_global_pgds(addr, next);
+	}
+}
+#endif
diff --git a/mm/sparse.c b/mm/sparse.c
index fac95f2..3a16d68 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -613,12 +613,13 @@ static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
 	/* This will make the necessary allocations eventually. */
 	return sparse_mem_map_populate(pnum, nid);
 }
-static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages)
+static void __kfree_section_memmap(struct page *page, unsigned long nr_pages)
 {
-	return; /* XXX: Not implemented yet */
+	vmemmap_free(page, nr_pages);
 }
 static void free_map_bootmem(struct page *page, unsigned long nr_pages)
 {
+	vmemmap_free(page, nr_pages);
 }
 #else
 static struct page *__kmalloc_section_memmap(unsigned long nr_pages)
-- 
1.7.6.1


On 2012/11/27 18:00, Wen Congyang wrote:

> From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> 
> Not all pages of the virtual mapping of removed memory can be freed, since
> some pages used as PGD/PUD include not only the removed memory but also
> other memory. So the patch checks whether each page can be freed or not.
> 
> How to check whether page can be freed or not?
>  1. When removing memory, the page structs of the removed memory are filled
>     with 0xFD.
>  2. If all page structs on a PT/PMD page are filled with 0xFD, the PT/PMD
>     can be cleared. In this case, the page used as PT/PMD can be freed.
> 
> With this patch applied, the two variants of __remove_section() are merged
> into one, so the CONFIG_SPARSEMEM_VMEMMAP variant of __remove_section() is deleted.
> 
> Note:  vmemmap_kfree() and vmemmap_free_bootmem() are not implemented for ia64,
> ppc, s390, and sparc.
> 
> CC: David Rientjes <rientjes@google.com>
> CC: Jiang Liu <liuj97@gmail.com>
> CC: Len Brown <len.brown@intel.com>
> CC: Christoph Lameter <cl@linux.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> ---
>  arch/ia64/mm/discontig.c  |   8 ++++
>  arch/powerpc/mm/init_64.c |   8 ++++
>  arch/s390/mm/vmem.c       |   8 ++++
>  arch/sparc/mm/init_64.c   |   8 ++++
>  arch/x86/mm/init_64.c     | 119 ++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/mm.h        |   2 +
>  mm/memory_hotplug.c       |  17 +------
>  mm/sparse.c               |  19 ++++----
>  8 files changed, 165 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
> index 33943db..0d23b69 100644
> --- a/arch/ia64/mm/discontig.c
> +++ b/arch/ia64/mm/discontig.c
> @@ -823,6 +823,14 @@ int __meminit vmemmap_populate(struct page *start_page,
>  	return vmemmap_populate_basepages(start_page, size, node);
>  }
>  
> +void vmemmap_kfree(struct page *memmap, unsigned long nr_pages)
> +{
> +}
> +
> +void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages)
> +{
> +}
> +
>  void register_page_bootmem_memmap(unsigned long section_nr,
>  				  struct page *start_page, unsigned long size)
>  {
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index 6466440..df7d155 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -298,6 +298,14 @@ int __meminit vmemmap_populate(struct page *start_page,
>  	return 0;
>  }
>  
> +void vmemmap_kfree(struct page *memmap, unsigned long nr_pages)
> +{
> +}
> +
> +void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages)
> +{
> +}
> +
>  void register_page_bootmem_memmap(unsigned long section_nr,
>  				  struct page *start_page, unsigned long size)
>  {
> diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
> index 4f4803a..ab69c34 100644
> --- a/arch/s390/mm/vmem.c
> +++ b/arch/s390/mm/vmem.c
> @@ -236,6 +236,14 @@ out:
>  	return ret;
>  }
>  
> +void vmemmap_kfree(struct page *memmap, unsigned long nr_pages)
> +{
> +}
> +
> +void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages)
> +{
> +}
> +
>  void register_page_bootmem_memmap(unsigned long section_nr,
>  				  struct page *start_page, unsigned long size)
>  {
> diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
> index 75a984b..546855d 100644
> --- a/arch/sparc/mm/init_64.c
> +++ b/arch/sparc/mm/init_64.c
> @@ -2232,6 +2232,14 @@ void __meminit vmemmap_populate_print_last(void)
>  	}
>  }
>  
> +void vmemmap_kfree(struct page *memmap, unsigned long nr_pages)
> +{
> +}
> +
> +void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages)
> +{
> +}
> +
>  void register_page_bootmem_memmap(unsigned long section_nr,
>  				  struct page *start_page, unsigned long size)
>  {
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 795dae3..e85626d 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -998,6 +998,125 @@ vmemmap_populate(struct page *start_page, unsigned long size, int node)
>  	return 0;
>  }
>  
> +#define PAGE_INUSE 0xFD
> +
> +unsigned long find_and_clear_pte_page(unsigned long addr, unsigned long end,
> +			    struct page **pp, int *page_size)
> +{
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pte_t *pte = NULL;
> +	void *page_addr;
> +	unsigned long next;
> +
> +	*pp = NULL;
> +
> +	pgd = pgd_offset_k(addr);
> +	if (pgd_none(*pgd))
> +		return pgd_addr_end(addr, end);
> +
> +	pud = pud_offset(pgd, addr);
> +	if (pud_none(*pud))
> +		return pud_addr_end(addr, end);
> +
> +	if (!cpu_has_pse) {
> +		next = (addr + PAGE_SIZE) & PAGE_MASK;
> +		pmd = pmd_offset(pud, addr);
> +		if (pmd_none(*pmd))
> +			return next;
> +
> +		pte = pte_offset_kernel(pmd, addr);
> +		if (pte_none(*pte))
> +			return next;
> +
> +		*page_size = PAGE_SIZE;
> +		*pp = pte_page(*pte);
> +	} else {
> +		next = pmd_addr_end(addr, end);
> +
> +		pmd = pmd_offset(pud, addr);
> +		if (pmd_none(*pmd))
> +			return next;
> +
> +		*page_size = PMD_SIZE;
> +		*pp = pmd_page(*pmd);
> +	}
> +
> +	/*
> +	 * Removed page structs are filled with 0xFD.
> +	 */
> +	memset((void *)addr, PAGE_INUSE, next - addr);
> +
> +	page_addr = page_address(*pp);
> +
> +	/*
> +	 * Check whether the page is entirely filled with 0xFD.
> +	 * If memchr_inv() returns an address, the page is still used
> +	 * by others, so we cannot clear the PTE/PMD entry and cannot
> +	 * free the page either.
> +	 *
> +	 * If memchr_inv() returns NULL, the page is not used by
> +	 * others, so we can clear the PTE/PMD entry and also free
> +	 * the page.
> +	 */
> +	if (memchr_inv(page_addr, PAGE_INUSE, *page_size)) {
> +		*pp = NULL;
> +		return next;
> +	}
> +
> +	if (!cpu_has_pse)
> +		pte_clear(&init_mm, addr, pte);
> +	else
> +		pmd_clear(pmd);
> +
> +	return next;
> +}
> +
> +void vmemmap_kfree(struct page *memmap, unsigned long nr_pages)
> +{
> +	unsigned long addr = (unsigned long)memmap;
> +	unsigned long end = (unsigned long)(memmap + nr_pages);
> +	unsigned long next;
> +	struct page *page;
> +	int page_size;
> +
> +	for (; addr < end; addr = next) {
> +		page = NULL;
> +		page_size = 0;
> +		next = find_and_clear_pte_page(addr, end, &page, &page_size);
> +		if (!page)
> +			continue;
> +
> +		free_pages((unsigned long)page_address(page),
> +			    get_order(page_size));
> +		__flush_tlb_one(addr);
> +	}
> +}
> +
> +void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages)
> +{
> +	unsigned long addr = (unsigned long)memmap;
> +	unsigned long end = (unsigned long)(memmap + nr_pages);
> +	unsigned long next;
> +	struct page *page;
> +	int page_size;
> +	unsigned long magic;
> +
> +	for (; addr < end; addr = next) {
> +		page = NULL;
> +		page_size = 0;
> +		next = find_and_clear_pte_page(addr, end, &page, &page_size);
> +		if (!page)
> +			continue;
> +
> +		magic = (unsigned long) page->lru.next;
> +		if (magic == SECTION_INFO)
> +			put_page_bootmem(page);
> +		flush_tlb_kernel_range(addr, end);
> +	}
> +}
> +
>  void register_page_bootmem_memmap(unsigned long section_nr,
>  				  struct page *start_page, unsigned long size)
>  {
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5657670..94d5ccd 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1642,6 +1642,8 @@ int vmemmap_populate(struct page *start_page, unsigned long pages, int node);
>  void vmemmap_populate_print_last(void);
>  void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
>  				  unsigned long size);
> +void vmemmap_kfree(struct page *memmap, unsigned long nr_pages);
> +void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages);
>  
>  enum mf_flags {
>  	MF_COUNT_INCREASED = 1 << 0,
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index ccc11b6..7797e91 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -301,19 +301,6 @@ static int __meminit __add_section(int nid, struct zone *zone,
>  	return register_new_memory(nid, __pfn_to_section(phys_start_pfn));
>  }
>  
> -#ifdef CONFIG_SPARSEMEM_VMEMMAP
> -static int __remove_section(struct zone *zone, struct mem_section *ms)
> -{
> -	int ret = -EINVAL;
> -
> -	if (!valid_section(ms))
> -		return ret;
> -
> -	ret = unregister_memory_section(ms);
> -
> -	return ret;
> -}
> -#else
>  static int __remove_section(struct zone *zone, struct mem_section *ms)
>  {
>  	unsigned long flags;
> @@ -330,9 +317,9 @@ static int __remove_section(struct zone *zone, struct mem_section *ms)
>  	pgdat_resize_lock(pgdat, &flags);
>  	sparse_remove_one_section(zone, ms);
>  	pgdat_resize_unlock(pgdat, &flags);
> -	return 0;
> +
> +	return ret;
>  }
> -#endif
>  
>  /*
>   * Reasonably generic function for adding memory.  It is
> diff --git a/mm/sparse.c b/mm/sparse.c
> index fac95f2..c723bc2 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -613,12 +613,13 @@ static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
>  	/* This will make the necessary allocations eventually. */
>  	return sparse_mem_map_populate(pnum, nid);
>  }
> -static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages)
> +static void __kfree_section_memmap(struct page *page, unsigned long nr_pages)
>  {
> -	return; /* XXX: Not implemented yet */
> +	vmemmap_kfree(page, nr_pages);
>  }
> -static void free_map_bootmem(struct page *page, unsigned long nr_pages)
> +static void free_map_bootmem(struct page *page)
>  {
> +	vmemmap_free_bootmem(page, PAGES_PER_SECTION);
>  }
>  #else
>  static struct page *__kmalloc_section_memmap(unsigned long nr_pages)
> @@ -658,10 +659,14 @@ static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages)
>  			   get_order(sizeof(struct page) * nr_pages));
>  }
>  
> -static void free_map_bootmem(struct page *page, unsigned long nr_pages)
> +static void free_map_bootmem(struct page *page)
>  {
>  	unsigned long maps_section_nr, removing_section_nr, i;
>  	unsigned long magic;
> +	unsigned long nr_pages;
> +
> +	nr_pages = PAGE_ALIGN(PAGES_PER_SECTION * sizeof(struct page))
> +		>> PAGE_SHIFT;
>  
>  	for (i = 0; i < nr_pages; i++, page++) {
>  		magic = (unsigned long) page->lru.next;
> @@ -688,7 +693,6 @@ static void free_map_bootmem(struct page *page, unsigned long nr_pages)
>  static void free_section_usemap(struct page *memmap, unsigned long *usemap)
>  {
>  	struct page *usemap_page;
> -	unsigned long nr_pages;
>  
>  	if (!usemap)
>  		return;
> @@ -713,10 +717,7 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
>  		struct page *memmap_page;
>  		memmap_page = virt_to_page(memmap);
>  
> -		nr_pages = PAGE_ALIGN(PAGES_PER_SECTION * sizeof(struct page))
> -			>> PAGE_SHIFT;
> -
> -		free_map_bootmem(memmap_page, nr_pages);
> +		free_map_bootmem(memmap_page);
>  	}
>  }
>  

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [Patch v4 08/12] memory-hotplug: remove memmap of sparse-vmemmap
  2012-11-28  9:40   ` Jianguo Wu
@ 2012-11-30  1:45     ` Wen Congyang
  2012-11-30  2:47       ` Jianguo Wu
  2012-12-03  2:23       ` Jianguo Wu
  0 siblings, 2 replies; 40+ messages in thread
From: Wen Congyang @ 2012-11-30  1:45 UTC (permalink / raw)
  To: Jianguo Wu
  Cc: linux-s390, linux-ia64, Len Brown, linux-acpi, linux-sh, x86,
	linux-kernel, cmetcalf, linux-mm, Yasuaki Ishimatsu, paulus,
	Minchan Kim, KOSAKI Motohiro, David Rientjes, sparclinux,
	Christoph Lameter, linuxppc-dev, Andrew Morton, Jiang Liu

At 11/28/2012 05:40 PM, Jianguo Wu Wrote:
> Hi Congyang,
> 
> I think vmemmap's pgtable pages should be freed after all their entries are cleared; I have a patch to do this.
> The code logic is the same as in [Patch v4 09/12] memory-hotplug: remove page table of x86_64 architecture.
> 
> What do you think about this?
> 
> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
> Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
> ---
>  include/linux/mm.h  |    1 +
>  mm/sparse-vmemmap.c |  214 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/sparse.c         |    5 +-
>  3 files changed, 218 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5657670..1f26af5 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1642,6 +1642,7 @@ int vmemmap_populate(struct page *start_page, unsigned long pages, int node);
>  void vmemmap_populate_print_last(void);
>  void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
>  				  unsigned long size);
> +void vmemmap_free(struct page *memmap, unsigned long nr_pages);
>  
>  enum mf_flags {
>  	MF_COUNT_INCREASED = 1 << 0,
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index 1b7e22a..242cb28 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -29,6 +29,10 @@
>  #include <asm/pgalloc.h>
>  #include <asm/pgtable.h>
>  
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +#include <asm/tlbflush.h>
> +#endif
> +
>  /*
>   * Allocate a block of memory to be used to back the virtual memory map
>   * or to back the page tables that are used to create the mapping.
> @@ -224,3 +228,213 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
>  		vmemmap_buf_end = NULL;
>  	}
>  }
> +
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +static void vmemmap_free_pages(struct page *page, int order)
> +{
> +	struct zone *zone;
> +	unsigned long magic;
> +
> +	magic = (unsigned long) page->lru.next;
> +	if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
> +		put_page_bootmem(page);
> +
> +		zone = page_zone(page);
> +		zone_span_writelock(zone);
> +		zone->present_pages++;
> +		zone_span_writeunlock(zone);
> +		totalram_pages++;
> +	} else {
> +		if (is_vmalloc_addr(page_address(page)))
> +			vfree(page_address(page));

Hmm, vmemmap doesn't use vmalloc() to allocate memory.
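
i.e. the fallback could simply be (a one-line sketch):

	free_pages((unsigned long)page_address(page), order);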

> +		else
> +			free_pages((unsigned long)page_address(page), order);
> +	}
> +}
> +
> +static void free_pte_table(pmd_t *pmd)
> +{
> +	pte_t *pte, *pte_start;
> +	int i;
> +
> +	pte_start = (pte_t *)pmd_page_vaddr(*pmd);
> +	for (i = 0; i < PTRS_PER_PTE; i++) {
> +		pte = pte_start + i;
> +		if (pte_val(*pte))
> +			return;
> +	}
> +
> +	/* free a pte table */
> +	vmemmap_free_pages(pmd_page(*pmd), 0);
> +	spin_lock(&init_mm.page_table_lock);
> +	pmd_clear(pmd);
> +	spin_unlock(&init_mm.page_table_lock);
> +}
> +
> +static void free_pmd_table(pud_t *pud)
> +{
> +	pmd_t *pmd, *pmd_start;
> +	int i;
> +
> +	pmd_start = (pmd_t *)pud_page_vaddr(*pud);
> +	for (i = 0; i < PTRS_PER_PMD; i++) {
> +		pmd = pmd_start + i;
> +		if (pmd_val(*pmd))
> +			return;
> +	}
> +
> +	/* free a pmd table */
> +	vmemmap_free_pages(pud_page(*pud), 0);
> +	spin_lock(&init_mm.page_table_lock);
> +	pud_clear(pud);
> +	spin_unlock(&init_mm.page_table_lock);
> +}
> +
> +static void free_pud_table(pgd_t *pgd)
> +{
> +	pud_t *pud, *pud_start;
> +	int i;
> +
> +	pud_start = (pud_t *)pgd_page_vaddr(*pgd);
> +	for (i = 0; i < PTRS_PER_PUD; i++) {
> +		pud = pud_start + i;
> +		if (pud_val(*pud))
> +			return;
> +	}
> +
> +	/* free a pud table */
> +	vmemmap_free_pages(pgd_page(*pgd), 0);
> +	spin_lock(&init_mm.page_table_lock);
> +	pgd_clear(pgd);
> +	spin_unlock(&init_mm.page_table_lock);
> +}
> +
> +static int split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
> +{
> +	struct page *page = pmd_page(*(pmd_t *)kpte);
> +	int i = 0;
> +	unsigned long magic;
> +	unsigned long section_nr;
> +
> +	__split_large_page(kpte, address, pbase);
> +	__flush_tlb_all();
> +
> +	magic = (unsigned long) page->lru.next;
> +	if (magic == SECTION_INFO) {
> +		section_nr = pfn_to_section_nr(page_to_pfn(page));
> +		while (i < PTRS_PER_PMD) {
> +			page++;
> +			i++;
> +			get_page_bootmem(section_nr, page, SECTION_INFO);
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static void vmemmap_pte_remove(pmd_t *pmd, unsigned long addr, unsigned long end)
> +{
> +	pte_t *pte;
> +	unsigned long next;
> +
> +	pte = pte_offset_kernel(pmd, addr);
> +	for (; addr < end; pte++, addr += PAGE_SIZE) {
> +		next = (addr + PAGE_SIZE) & PAGE_MASK;
> +		if (next > end)
> +			next = end;
> +
> +		if (pte_none(*pte))
> +			continue;
> +		if (IS_ALIGNED(addr, PAGE_SIZE) &&
> +		    IS_ALIGNED(end, PAGE_SIZE)) {
> +			vmemmap_free_pages(pte_page(*pte), 0);
> +			spin_lock(&init_mm.page_table_lock);
> +			pte_clear(&init_mm, addr, pte);
> +			spin_unlock(&init_mm.page_table_lock);

If addr or end is not aligned with PAGE_SIZE, you may leak some
memory.

> +		}
> +	}
> +
> +	free_pte_table(pmd);
> +	__flush_tlb_all();
> +}
> +
> +static void vmemmap_pmd_remove(pud_t *pud, unsigned long addr, unsigned long end)
> +{
> +	unsigned long next;
> +	pmd_t *pmd;
> +
> +	pmd = pmd_offset(pud, addr);
> +	for (; addr < end; addr = next, pmd++) {
> +		next = pmd_addr_end(addr, end);
> +		if (pmd_none(*pmd))
> +			continue;
> +
> +		if (cpu_has_pse) {
> +			unsigned long pte_base;
> +
> +			if (IS_ALIGNED(addr, PMD_SIZE) &&
> +			    IS_ALIGNED(next, PMD_SIZE)) {
> +				vmemmap_free_pages(pmd_page(*pmd),
> +						   get_order(PMD_SIZE));
> +				spin_lock(&init_mm.page_table_lock);
> +				pmd_clear(pmd);
> +				spin_unlock(&init_mm.page_table_lock);
> +				continue;
> +			}
> +
> +			/*
> +			 * We use 2M pages, but we need to remove part of them,
> +			 * so split the 2M page into 4K pages.
> +			 */
> +			pte_base = get_zeroed_page(GFP_ATOMIC | __GFP_NOTRACK);

get_zeroed_page() may fail. You should handle this error.

> +			split_large_page((pte_t *)pmd, addr, (pte_t *)pte_base);
> +			__flush_tlb_all();
> +
> +			spin_lock(&init_mm.page_table_lock);
> +			pmd_populate_kernel(&init_mm, pmd, (pte_t *)pte_base);
> +			spin_unlock(&init_mm.page_table_lock);
> +		}
> +
> +		vmemmap_pte_remove(pmd, addr, next);
> +	}
> +
> +	free_pmd_table(pud);
> +	__flush_tlb_all();
> +}
> +
> +static void vmemmap_pud_remove(pgd_t *pgd, unsigned long addr, unsigned long end)
> +{
> +	unsigned long next;
> +	pud_t *pud;
> +
> +	pud = pud_offset(pgd, addr);
> +	for (; addr < end; addr = next, pud++) {
> +		next = pud_addr_end(addr, end);
> +		if (pud_none(*pud))
> +			continue;
> +
> +		vmemmap_pmd_remove(pud, addr, next);
> +	}
> +
> +	free_pud_table(pgd);
> +	__flush_tlb_all();
> +}
> +
> +void vmemmap_free(struct page *memmap, unsigned long nr_pages)
> +{
> +	unsigned long addr = (unsigned long)memmap;
> +	unsigned long end = (unsigned long)(memmap + nr_pages);
> +	unsigned long next;
> +
> +	for (; addr < end; addr = next) {
> +		pgd_t *pgd = pgd_offset_k(addr);
> +
> +		next = pgd_addr_end(addr, end);
> +		if (!pgd_present(*pgd))
> +			continue;
> +
> +		vmemmap_pud_remove(pgd, addr, next);
> +		sync_global_pgds(addr, next);

The parameter for sync_global_pgds() is [start, end], not
[start, end)
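So the call should pass the last address of the range, e.g.:

	vmemmap_pud_remove(pgd, addr, next);
	/* sync_global_pgds() takes an inclusive end address */
	sync_global_pgds(addr, next - 1);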

> +	}
> +}
> +#endif
> diff --git a/mm/sparse.c b/mm/sparse.c
> index fac95f2..3a16d68 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -613,12 +613,13 @@ static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
>  	/* This will make the necessary allocations eventually. */
>  	return sparse_mem_map_populate(pnum, nid);
>  }
> -static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages)
> +static void __kfree_section_memmap(struct page *page, unsigned long nr_pages)
Why do you change this line?

>  {
> -	return; /* XXX: Not implemented yet */
> +	vmemmap_free(page, nr_pages);
>  }
>  static void free_map_bootmem(struct page *page, unsigned long nr_pages)
>  {
> +	vmemmap_free(page, nr_pages);
>  }
>  #else
>  static struct page *__kmalloc_section_memmap(unsigned long nr_pages)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v4 08/12] memory-hotplug: remove memmap of sparse-vmemmap
  2012-11-30  1:45     ` Wen Congyang
@ 2012-11-30  2:47       ` Jianguo Wu
  2012-11-30  2:55         ` Yasuaki Ishimatsu
  2012-12-03  2:23       ` Jianguo Wu
  1 sibling, 1 reply; 40+ messages in thread
From: Jianguo Wu @ 2012-11-30  2:47 UTC (permalink / raw)
  To: Wen Congyang
  Cc: linux-s390, linux-ia64, Len Brown, linux-acpi, linux-sh, x86,
	linux-kernel, cmetcalf, linux-mm, Yasuaki Ishimatsu, paulus,
	Minchan Kim, KOSAKI Motohiro, David Rientjes, sparclinux,
	Christoph Lameter, linuxppc-dev, Andrew Morton, Jiang Liu

Hi Congyang,

Thanks for your review and comments.

On 2012/11/30 9:45, Wen Congyang wrote:

> At 11/28/2012 05:40 PM, Jianguo Wu Wrote:
>> Hi Congyang,
>>
>> I think vmemmap's pgtable pages should be freed after all entries are cleared, I have a patch to do this.
>> The code logic is the same as [Patch v4 09/12] memory-hotplug: remove page table of x86_64 architecture.
>>
>> How do you think about this?
>>
>> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
>> Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
>> ---
>>  include/linux/mm.h  |    1 +
>>  mm/sparse-vmemmap.c |  214 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  mm/sparse.c         |    5 +-
>>  3 files changed, 218 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 5657670..1f26af5 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1642,6 +1642,7 @@ int vmemmap_populate(struct page *start_page, unsigned long pages, int node);
>>  void vmemmap_populate_print_last(void);
>>  void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
>>  				  unsigned long size);
>> +void vmemmap_free(struct page *memmap, unsigned long nr_pages);
>>  
>>  enum mf_flags {
>>  	MF_COUNT_INCREASED = 1 << 0,
>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>> index 1b7e22a..242cb28 100644
>> --- a/mm/sparse-vmemmap.c
>> +++ b/mm/sparse-vmemmap.c
>> @@ -29,6 +29,10 @@
>>  #include <asm/pgalloc.h>
>>  #include <asm/pgtable.h>
>>  
>> +#ifdef CONFIG_MEMORY_HOTREMOVE
>> +#include <asm/tlbflush.h>
>> +#endif
>> +
>>  /*
>>   * Allocate a block of memory to be used to back the virtual memory map
>>   * or to back the page tables that are used to create the mapping.
>> @@ -224,3 +228,213 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
>>  		vmemmap_buf_end = NULL;
>>  	}
>>  }
>> +
>> +#ifdef CONFIG_MEMORY_HOTREMOVE
>> +static void vmemmap_free_pages(struct page *page, int order)
>> +{
>> +	struct zone *zone;
>> +	unsigned long magic;
>> +
>> +	magic = (unsigned long) page->lru.next;
>> +	if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
>> +		put_page_bootmem(page);
>> +
>> +		zone = page_zone(page);
>> +		zone_span_writelock(zone);
>> +		zone->present_pages++;
>> +		zone_span_writeunlock(zone);
>> +		totalram_pages++;
>> +	} else {
>> +		if (is_vmalloc_addr(page_address(page)))
>> +			vfree(page_address(page));
> 
> Hmm, vmemmap doesn't use vmalloc() to allocate memory.
> 

yes, this can be removed.

>> +		else
>> +			free_pages((unsigned long)page_address(page), order);
>> +	}
>> +}
>> +
>> +static void free_pte_table(pmd_t *pmd)
>> +{
>> +	pte_t *pte, *pte_start;
>> +	int i;
>> +
>> +	pte_start = (pte_t *)pmd_page_vaddr(*pmd);
>> +	for (i = 0; i < PTRS_PER_PTE; i++) {
>> +		pte = pte_start + i;
>> +		if (pte_val(*pte))
>> +			return;
>> +	}
>> +
>> +	/* free a pte table */
>> +	vmemmap_free_pages(pmd_page(*pmd), 0);
>> +	spin_lock(&init_mm.page_table_lock);
>> +	pmd_clear(pmd);
>> +	spin_unlock(&init_mm.page_table_lock);
>> +}
>> +
>> +static void free_pmd_table(pud_t *pud)
>> +{
>> +	pmd_t *pmd, *pmd_start;
>> +	int i;
>> +
>> +	pmd_start = (pmd_t *)pud_page_vaddr(*pud);
>> +	for (i = 0; i < PTRS_PER_PMD; i++) {
>> +		pmd = pmd_start + i;
>> +		if (pmd_val(*pmd))
>> +			return;
>> +	}
>> +
>> +	/* free a pmd table */
>> +	vmemmap_free_pages(pud_page(*pud), 0);
>> +	spin_lock(&init_mm.page_table_lock);
>> +	pud_clear(pud);
>> +	spin_unlock(&init_mm.page_table_lock);
>> +}
>> +
>> +static void free_pud_table(pgd_t *pgd)
>> +{
>> +	pud_t *pud, *pud_start;
>> +	int i;
>> +
>> +	pud_start = (pud_t *)pgd_page_vaddr(*pgd);
>> +	for (i = 0; i < PTRS_PER_PUD; i++) {
>> +		pud = pud_start + i;
>> +		if (pud_val(*pud))
>> +			return;
>> +	}
>> +
>> +	/* free a pud table */
>> +	vmemmap_free_pages(pgd_page(*pgd), 0);
>> +	spin_lock(&init_mm.page_table_lock);
>> +	pgd_clear(pgd);
>> +	spin_unlock(&init_mm.page_table_lock);
>> +}
>> +
>> +static int split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
>> +{
>> +	struct page *page = pmd_page(*(pmd_t *)kpte);
>> +	int i = 0;
>> +	unsigned long magic;
>> +	unsigned long section_nr;
>> +
>> +	__split_large_page(kpte, address, pbase);
>> +	__flush_tlb_all();
>> +
>> +	magic = (unsigned long) page->lru.next;
>> +	if (magic == SECTION_INFO) {
>> +		section_nr = pfn_to_section_nr(page_to_pfn(page));
>> +		while (i < PTRS_PER_PMD) {
>> +			page++;
>> +			i++;
>> +			get_page_bootmem(section_nr, page, SECTION_INFO);
>> +		}
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static void vmemmap_pte_remove(pmd_t *pmd, unsigned long addr, unsigned long end)
>> +{
>> +	pte_t *pte;
>> +	unsigned long next;
>> +
>> +	pte = pte_offset_kernel(pmd, addr);
>> +	for (; addr < end; pte++, addr += PAGE_SIZE) {
>> +		next = (addr + PAGE_SIZE) & PAGE_MASK;
>> +		if (next > end)
>> +			next = end;
>> +
>> +		if (pte_none(*pte))
>> +			continue;
>> +		if (IS_ALIGNED(addr, PAGE_SIZE) &&
>> +		    IS_ALIGNED(end, PAGE_SIZE)) {
>> +			vmemmap_free_pages(pte_page(*pte), 0);
>> +			spin_lock(&init_mm.page_table_lock);
>> +			pte_clear(&init_mm, addr, pte);
>> +			spin_unlock(&init_mm.page_table_lock);
> 
> If addr or end is not aligned with PAGE_SIZE, you may leak some
> memory.
> 

yes, I think we can handle this situation with the method you mentioned in the change log:
1. When removing memory, the page structs of the removed memory are filled
   with 0xFD.
2. All page structs are filled with 0xFD on PT/PMD, PT/PMD can be cleared.
   In this case, the page used as PT/PMD can be freed.

By the way, why is 0xFD?

>> +		}
>> +	}
>> +
>> +	free_pte_table(pmd);
>> +	__flush_tlb_all();
>> +}
>> +
>> +static void vmemmap_pmd_remove(pud_t *pud, unsigned long addr, unsigned long end)
>> +{
>> +	unsigned long next;
>> +	pmd_t *pmd;
>> +
>> +	pmd = pmd_offset(pud, addr);
>> +	for (; addr < end; addr = next, pmd++) {
>> +		next = pmd_addr_end(addr, end);
>> +		if (pmd_none(*pmd))
>> +			continue;
>> +
>> +		if (cpu_has_pse) {
>> +			unsigned long pte_base;
>> +
>> +			if (IS_ALIGNED(addr, PMD_SIZE) &&
>> +			    IS_ALIGNED(next, PMD_SIZE)) {
>> +				vmemmap_free_pages(pmd_page(*pmd),
>> +						   get_order(PMD_SIZE));
>> +				spin_lock(&init_mm.page_table_lock);
>> +				pmd_clear(pmd);
>> +				spin_unlock(&init_mm.page_table_lock);
>> +				continue;
>> +			}
>> +
>> +			/*
>> +			 * We use 2M page, but we need to remove part of them,
>> +			 * so split 2M page to 4K page.
>> +			 */
>> +			pte_base = get_zeroed_page(GFP_ATOMIC | __GFP_NOTRACK);
> 
> get_zeroed_page() may fail. You should handle this error.
> 

That means the system is out of memory; in that case I will trigger a BUG_ON().
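i.e. something like:

	pte_base = get_zeroed_page(GFP_ATOMIC | __GFP_NOTRACK);
	/* allocation can only fail here if the system is out of memory */
	BUG_ON(!pte_base);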

>> +			split_large_page((pte_t *)pmd, addr, (pte_t *)pte_base);
>> +			__flush_tlb_all();
>> +
>> +			spin_lock(&init_mm.page_table_lock);
>> +			pmd_populate_kernel(&init_mm, pmd, (pte_t *)pte_base);
>> +			spin_unlock(&init_mm.page_table_lock);
>> +		}
>> +
>> +		vmemmap_pte_remove(pmd, addr, next);
>> +	}
>> +
>> +	free_pmd_table(pud);
>> +	__flush_tlb_all();
>> +}
>> +
>> +static void vmemmap_pud_remove(pgd_t *pgd, unsigned long addr, unsigned long end)
>> +{
>> +	unsigned long next;
>> +	pud_t *pud;
>> +
>> +	pud = pud_offset(pgd, addr);
>> +	for (; addr < end; addr = next, pud++) {
>> +		next = pud_addr_end(addr, end);
>> +		if (pud_none(*pud))
>> +			continue;
>> +
>> +		vmemmap_pmd_remove(pud, addr, next);
>> +	}
>> +
>> +	free_pud_table(pgd);
>> +	__flush_tlb_all();
>> +}
>> +
>> +void vmemmap_free(struct page *memmap, unsigned long nr_pages)
>> +{
>> +	unsigned long addr = (unsigned long)memmap;
>> +	unsigned long end = (unsigned long)(memmap + nr_pages);
>> +	unsigned long next;
>> +
>> +	for (; addr < end; addr = next) {
>> +		pgd_t *pgd = pgd_offset_k(addr);
>> +
>> +		next = pgd_addr_end(addr, end);
>> +		if (!pgd_present(*pgd))
>> +			continue;
>> +
>> +		vmemmap_pud_remove(pgd, addr, next);
>> +		sync_global_pgds(addr, next);
> 
> The parameter for sync_global_pgds() is [start, end], not
> [start, end)
> 

yes, thanks.

>> +	}
>> +}
>> +#endif
>> diff --git a/mm/sparse.c b/mm/sparse.c
>> index fac95f2..3a16d68 100644
>> --- a/mm/sparse.c
>> +++ b/mm/sparse.c
>> @@ -613,12 +613,13 @@ static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
>>  	/* This will make the necessary allocations eventually. */
>>  	return sparse_mem_map_populate(pnum, nid);
>>  }
>> -static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages)
>> +static void __kfree_section_memmap(struct page *page, unsigned long nr_pages)
> Why do you change this line?
> 

OK, there is no need to change it.

>>  {
>> -	return; /* XXX: Not implemented yet */
>> +	vmemmap_free(page, nr_pages);
>>  }
>>  static void free_map_bootmem(struct page *page, unsigned long nr_pages)
>>  {
>> +	vmemmap_free(page, nr_pages);
>>  }
>>  #else
>>  static struct page *__kmalloc_section_memmap(unsigned long nr_pages)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v4 08/12] memory-hotplug: remove memmap of sparse-vmemmap
  2012-11-30  2:47       ` Jianguo Wu
@ 2012-11-30  2:55         ` Yasuaki Ishimatsu
  0 siblings, 0 replies; 40+ messages in thread
From: Yasuaki Ishimatsu @ 2012-11-30  2:55 UTC (permalink / raw)
  To: Jianguo Wu
  Cc: linux-s390, linux-ia64, Wen Congyang, linux-acpi, linux-sh,
	Len Brown, x86, linux-kernel, cmetcalf, linux-mm, paulus,
	Minchan Kim, KOSAKI Motohiro, David Rientjes, sparclinux,
	Christoph Lameter, linuxppc-dev, Andrew Morton, Jiang Liu

Hi Jianguo,

2012/11/30 11:47, Jianguo Wu wrote:
> Hi Congyang,
>
> Thanks for your review and comments.
>
> On 2012/11/30 9:45, Wen Congyang wrote:
>
>> At 11/28/2012 05:40 PM, Jianguo Wu Wrote:
>>> Hi Congyang,
>>>
>>> I think vmemmap's pgtable pages should be freed after all entries are cleared, I have a patch to do this.
>>> The code logic is the same as [Patch v4 09/12] memory-hotplug: remove page table of x86_64 architecture.
>>>
>>> How do you think about this?
>>>
>>> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
>>> Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
>>> ---
>>>   include/linux/mm.h  |    1 +
>>>   mm/sparse-vmemmap.c |  214 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>   mm/sparse.c         |    5 +-
>>>   3 files changed, 218 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 5657670..1f26af5 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -1642,6 +1642,7 @@ int vmemmap_populate(struct page *start_page, unsigned long pages, int node);
>>>   void vmemmap_populate_print_last(void);
>>>   void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
>>>   				  unsigned long size);
>>> +void vmemmap_free(struct page *memmap, unsigned long nr_pages);
>>>
>>>   enum mf_flags {
>>>   	MF_COUNT_INCREASED = 1 << 0,
>>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>>> index 1b7e22a..242cb28 100644
>>> --- a/mm/sparse-vmemmap.c
>>> +++ b/mm/sparse-vmemmap.c
>>> @@ -29,6 +29,10 @@
>>>   #include <asm/pgalloc.h>
>>>   #include <asm/pgtable.h>
>>>
>>> +#ifdef CONFIG_MEMORY_HOTREMOVE
>>> +#include <asm/tlbflush.h>
>>> +#endif
>>> +
>>>   /*
>>>    * Allocate a block of memory to be used to back the virtual memory map
>>>    * or to back the page tables that are used to create the mapping.
>>> @@ -224,3 +228,213 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
>>>   		vmemmap_buf_end = NULL;
>>>   	}
>>>   }
>>> +
>>> +#ifdef CONFIG_MEMORY_HOTREMOVE
>>> +static void vmemmap_free_pages(struct page *page, int order)
>>> +{
>>> +	struct zone *zone;
>>> +	unsigned long magic;
>>> +
>>> +	magic = (unsigned long) page->lru.next;
>>> +	if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
>>> +		put_page_bootmem(page);
>>> +
>>> +		zone = page_zone(page);
>>> +		zone_span_writelock(zone);
>>> +		zone->present_pages++;
>>> +		zone_span_writeunlock(zone);
>>> +		totalram_pages++;
>>> +	} else {
>>> +		if (is_vmalloc_addr(page_address(page)))
>>> +			vfree(page_address(page));
>>
>> Hmm, vmemmap doesn't use vmalloc() to allocate memory.
>>
>
> yes, this can be removed.
>
>>> +		else
>>> +			free_pages((unsigned long)page_address(page), order);
>>> +	}
>>> +}
>>> +
>>> +static void free_pte_table(pmd_t *pmd)
>>> +{
>>> +	pte_t *pte, *pte_start;
>>> +	int i;
>>> +
>>> +	pte_start = (pte_t *)pmd_page_vaddr(*pmd);
>>> +	for (i = 0; i < PTRS_PER_PTE; i++) {
>>> +		pte = pte_start + i;
>>> +		if (pte_val(*pte))
>>> +			return;
>>> +	}
>>> +
>>> +	/* free a pte table */
>>> +	vmemmap_free_pages(pmd_page(*pmd), 0);
>>> +	spin_lock(&init_mm.page_table_lock);
>>> +	pmd_clear(pmd);
>>> +	spin_unlock(&init_mm.page_table_lock);
>>> +}
>>> +
>>> +static void free_pmd_table(pud_t *pud)
>>> +{
>>> +	pmd_t *pmd, *pmd_start;
>>> +	int i;
>>> +
>>> +	pmd_start = (pmd_t *)pud_page_vaddr(*pud);
>>> +	for (i = 0; i < PTRS_PER_PMD; i++) {
>>> +		pmd = pmd_start + i;
>>> +		if (pmd_val(*pmd))
>>> +			return;
>>> +	}
>>> +
>>> +	/* free a pmd table */
>>> +	vmemmap_free_pages(pud_page(*pud), 0);
>>> +	spin_lock(&init_mm.page_table_lock);
>>> +	pud_clear(pud);
>>> +	spin_unlock(&init_mm.page_table_lock);
>>> +}
>>> +
>>> +static void free_pud_table(pgd_t *pgd)
>>> +{
>>> +	pud_t *pud, *pud_start;
>>> +	int i;
>>> +
>>> +	pud_start = (pud_t *)pgd_page_vaddr(*pgd);
>>> +	for (i = 0; i < PTRS_PER_PUD; i++) {
>>> +		pud = pud_start + i;
>>> +		if (pud_val(*pud))
>>> +			return;
>>> +	}
>>> +
>>> +	/* free a pud table */
>>> +	vmemmap_free_pages(pgd_page(*pgd), 0);
>>> +	spin_lock(&init_mm.page_table_lock);
>>> +	pgd_clear(pgd);
>>> +	spin_unlock(&init_mm.page_table_lock);
>>> +}
>>> +
>>> +static int split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
>>> +{
>>> +	struct page *page = pmd_page(*(pmd_t *)kpte);
>>> +	int i = 0;
>>> +	unsigned long magic;
>>> +	unsigned long section_nr;
>>> +
>>> +	__split_large_page(kpte, address, pbase);
>>> +	__flush_tlb_all();
>>> +
>>> +	magic = (unsigned long) page->lru.next;
>>> +	if (magic == SECTION_INFO) {
>>> +		section_nr = pfn_to_section_nr(page_to_pfn(page));
>>> +		while (i < PTRS_PER_PMD) {
>>> +			page++;
>>> +			i++;
>>> +			get_page_bootmem(section_nr, page, SECTION_INFO);
>>> +		}
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static void vmemmap_pte_remove(pmd_t *pmd, unsigned long addr, unsigned long end)
>>> +{
>>> +	pte_t *pte;
>>> +	unsigned long next;
>>> +
>>> +	pte = pte_offset_kernel(pmd, addr);
>>> +	for (; addr < end; pte++, addr += PAGE_SIZE) {
>>> +		next = (addr + PAGE_SIZE) & PAGE_MASK;
>>> +		if (next > end)
>>> +			next = end;
>>> +
>>> +		if (pte_none(*pte))
>>> +			continue;
>>> +		if (IS_ALIGNED(addr, PAGE_SIZE) &&
>>> +		    IS_ALIGNED(end, PAGE_SIZE)) {
>>> +			vmemmap_free_pages(pte_page(*pte), 0);
>>> +			spin_lock(&init_mm.page_table_lock);
>>> +			pte_clear(&init_mm, addr, pte);
>>> +			spin_unlock(&init_mm.page_table_lock);
>>
>> If addr or end is not aligned with PAGE_SIZE, you may leak some
>> memory.
>>
>
> yes, I think we can handle this situation with the method you mentioned in the change log:
> 1. When removing memory, the page structs of the removed memory are filled
>     with 0xFD.
> 2. All page structs are filled with 0xFD on PT/PMD, PT/PMD can be cleared.
>     In this case, the page used as PT/PMD can be freed.
>
> By the way, why is 0xFD?

There is no reason. I just filled the page with a unique number.

Thanks,
Yasuaki Ishimatsu

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v4 00/12] memory-hotplug: hot-remove physical memory
  2012-11-27 19:27 ` [Patch v4 00/12] memory-hotplug: hot-remove physical memory Andrew Morton
  2012-11-27 19:38   ` Rafael J. Wysocki
  2012-11-28  0:43   ` Yasuaki Ishimatsu
@ 2012-11-30  6:37   ` Tang Chen
  2 siblings, 0 replies; 40+ messages in thread
From: Tang Chen @ 2012-11-30  6:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-s390, linux-ia64, Wen Congyang, linux-acpi, linux-sh,
	Len Brown, x86, linux-kernel, cmetcalf, Jianguo Wu, linux-mm,
	Yasuaki Ishimatsu, paulus, Minchan Kim, KOSAKI Motohiro,
	David Rientjes, sparclinux, Christoph Lameter, linuxppc-dev,
	Jiang Liu

Hi Andrew,

On 11/28/2012 03:27 AM, Andrew Morton wrote:
>>
>> - acpi framework
>>    https://lkml.org/lkml/2012/10/26/175
>
> What's happening with the acpi framework?  has it received any feedback
> from the ACPI developers?

Regarding the ACPI framework, we are trying to do the following.

     The memory device can be removed in 2 ways:
     1. send an eject request by SCI
     2. echo 1 >/sys/bus/acpi/devices/PNP0C80:XX/eject

     In the 1st case, acpi_memory_disable_device() will be called.
     In the 2nd case, acpi_memory_device_remove() will be called.
     acpi_memory_device_remove() will also be called when we unbind the
     memory device from the driver acpi_memhotplug or when a driver
     initialization fails.

     acpi_memory_disable_device() has already implemented code which
     offlines memory and releases the acpi_memory_info struct. But
     acpi_memory_device_remove() has not implemented it yet.

     So the patch prepares the framework for hot removing memory and
     adds the framework into acpi_memory_device_remove().

All the ACPI related patches have been put into the linux-next branch
of the linux-pm.git tree as v3.8 material. Please refer to the following
URL:
https://lkml.org/lkml/2012/11/2/160

So for now, with this patch set, we can do memory hot-remove on x86_64
Linux.

I do hope you would merge them before 3.8-rc1, so that we can use this
functionality in 3.8.

As we are still testing all the memory hotplug related functionality, I
hope we can do bug fixes during the 3.8 rc cycle.

Thanks. :)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v4 08/12] memory-hotplug: remove memmap of sparse-vmemmap
  2012-11-30  1:45     ` Wen Congyang
  2012-11-30  2:47       ` Jianguo Wu
@ 2012-12-03  2:23       ` Jianguo Wu
  2012-12-04  9:13         ` Tang Chen
  2012-12-07  1:42         ` Tang Chen
  1 sibling, 2 replies; 40+ messages in thread
From: Jianguo Wu @ 2012-12-03  2:23 UTC (permalink / raw)
  To: Wen Congyang
  Cc: linux-s390, linux-ia64, Len Brown, linux-acpi, linux-sh, x86,
	linux-kernel, cmetcalf, linux-mm, Yasuaki Ishimatsu, paulus,
	Minchan Kim, KOSAKI Motohiro, David Rientjes, sparclinux,
	Christoph Lameter, linuxppc-dev, Andrew Morton, Jiang Liu

Hi Congyang,

This is the new version.

Thanks,
Jianguo Wu.


Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
---
 include/linux/mm.h  |    1 +
 mm/sparse-vmemmap.c |  231 +++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/sparse.c         |    3 +-
 3 files changed, 234 insertions(+), 1 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5657670..1f26af5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1642,6 +1642,7 @@ int vmemmap_populate(struct page *start_page, unsigned long pages, int node);
 void vmemmap_populate_print_last(void);
 void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
 				  unsigned long size);
+void vmemmap_free(struct page *memmap, unsigned long nr_pages);
 
 enum mf_flags {
 	MF_COUNT_INCREASED = 1 << 0,
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 1b7e22a..748732d 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -29,6 +29,10 @@
 #include <asm/pgalloc.h>
 #include <asm/pgtable.h>
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+#include <asm/tlbflush.h>
+#endif
+
 /*
  * Allocate a block of memory to be used to back the virtual memory map
  * or to back the page tables that are used to create the mapping.
@@ -224,3 +228,230 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
 		vmemmap_buf_end = NULL;
 	}
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+
+#define PAGE_INUSE 0xFD
+
+static void vmemmap_free_pages(struct page *page, int order)
+{
+	struct zone *zone;
+	unsigned long magic;
+
+	magic = (unsigned long) page->lru.next;
+	if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
+		put_page_bootmem(page);
+
+		zone = page_zone(page);
+		zone_span_writelock(zone);
+		zone->present_pages++;
+		zone_span_writeunlock(zone);
+		totalram_pages++;
+	} else
+		free_pages((unsigned long)page_address(page), order);
+}
+
+static void free_pte_table(pmd_t *pmd)
+{
+	pte_t *pte, *pte_start;
+	int i;
+
+	pte_start = (pte_t *)pmd_page_vaddr(*pmd);
+	for (i = 0; i < PTRS_PER_PTE; i++) {
+		pte = pte_start + i;
+		if (pte_val(*pte))
+			return;
+	}
+
+	/* free a pte table */
+	vmemmap_free_pages(pmd_page(*pmd), 0);
+	spin_lock(&init_mm.page_table_lock);
+	pmd_clear(pmd);
+	spin_unlock(&init_mm.page_table_lock);
+}
+
+static void free_pmd_table(pud_t *pud)
+{
+	pmd_t *pmd, *pmd_start;
+	int i;
+
+	pmd_start = (pmd_t *)pud_page_vaddr(*pud);
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		pmd = pmd_start + i;
+		if (pmd_val(*pmd))
+			return;
+	}
+
+	/* free a pmd table */
+	vmemmap_free_pages(pud_page(*pud), 0);
+	spin_lock(&init_mm.page_table_lock);
+	pud_clear(pud);
+	spin_unlock(&init_mm.page_table_lock);
+}
+
+static void free_pud_table(pgd_t *pgd)
+{
+	pud_t *pud, *pud_start;
+	int i;
+
+	pud_start = (pud_t *)pgd_page_vaddr(*pgd);
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		pud = pud_start + i;
+		if (pud_val(*pud))
+			return;
+	}
+
+	/* free a pud table */
+	vmemmap_free_pages(pgd_page(*pgd), 0);
+	spin_lock(&init_mm.page_table_lock);
+	pgd_clear(pgd);
+	spin_unlock(&init_mm.page_table_lock);
+}
+
+static int split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
+{
+	struct page *page = pmd_page(*(pmd_t *)kpte);
+	int i = 0;
+	unsigned long magic;
+	unsigned long section_nr;
+
+	__split_large_page(kpte, address, pbase);
+	__flush_tlb_all();
+
+	magic = (unsigned long) page->lru.next;
+	if (magic == SECTION_INFO) {
+		section_nr = pfn_to_section_nr(page_to_pfn(page));
+		while (i < PTRS_PER_PMD) {
+			page++;
+			i++;
+			get_page_bootmem(section_nr, page, SECTION_INFO);
+		}
+	}
+
+	return 0;
+}
+
+static void vmemmap_pte_remove(pmd_t *pmd, unsigned long addr, unsigned long end)
+{
+	pte_t *pte;
+	unsigned long next;
+	void *page_addr;
+
+	pte = pte_offset_kernel(pmd, addr);
+	for (; addr < end; pte++, addr += PAGE_SIZE) {
+		next = (addr + PAGE_SIZE) & PAGE_MASK;
+		if (next > end)
+			next = end;
+
+		if (pte_none(*pte))
+			continue;
+		if (IS_ALIGNED(addr, PAGE_SIZE) &&
+		    IS_ALIGNED(next, PAGE_SIZE)) {
+			vmemmap_free_pages(pte_page(*pte), 0);
+			spin_lock(&init_mm.page_table_lock);
+			pte_clear(&init_mm, addr, pte);
+			spin_unlock(&init_mm.page_table_lock);
+		} else {
+			/*
+			 * Removed page structs are filled with 0xFD.
+			 */
+			memset((void *)addr, PAGE_INUSE, next - addr);
+			page_addr = page_address(pte_page(*pte));
+
+			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
+				spin_lock(&init_mm.page_table_lock);
+				pte_clear(&init_mm, addr, pte);
+				spin_unlock(&init_mm.page_table_lock);
+			}
+		}
+	}
+
+	free_pte_table(pmd);
+	__flush_tlb_all();
+}
+
+static void vmemmap_pmd_remove(pud_t *pud, unsigned long addr, unsigned long end)
+{
+	unsigned long next;
+	pmd_t *pmd;
+
+	pmd = pmd_offset(pud, addr);
+	for (; addr < end; addr = next, pmd++) {
+		next = pmd_addr_end(addr, end);
+		if (pmd_none(*pmd))
+			continue;
+
+		if (cpu_has_pse) {
+			unsigned long pte_base;
+
+			if (IS_ALIGNED(addr, PMD_SIZE) &&
+			    IS_ALIGNED(next, PMD_SIZE)) {
+				vmemmap_free_pages(pmd_page(*pmd),
+						   get_order(PMD_SIZE));
+				spin_lock(&init_mm.page_table_lock);
+				pmd_clear(pmd);
+				spin_unlock(&init_mm.page_table_lock);
+				continue;
+			}
+
+			/*
+			 * We use 2M pages, but we need to remove part of them,
+			 * so split the 2M page into 4K pages.
+			 */
+			pte_base = get_zeroed_page(GFP_ATOMIC | __GFP_NOTRACK);
+			if (!pte_base) {
+				WARN_ON(1);
+				continue;
+			}
+
+			split_large_page((pte_t *)pmd, addr, (pte_t *)pte_base);
+			__flush_tlb_all();
+
+			spin_lock(&init_mm.page_table_lock);
+			pmd_populate_kernel(&init_mm, pmd, (pte_t *)pte_base);
+			spin_unlock(&init_mm.page_table_lock);
+		}
+
+		vmemmap_pte_remove(pmd, addr, next);
+	}
+
+	free_pmd_table(pud);
+	__flush_tlb_all();
+}
+
+static void vmemmap_pud_remove(pgd_t *pgd, unsigned long addr, unsigned long end)
+{
+	unsigned long next;
+	pud_t *pud;
+
+	pud = pud_offset(pgd, addr);
+	for (; addr < end; addr = next, pud++) {
+		next = pud_addr_end(addr, end);
+		if (pud_none(*pud))
+			continue;
+
+		vmemmap_pmd_remove(pud, addr, next);
+	}
+
+	free_pud_table(pgd);
+	__flush_tlb_all();
+}
+
+void vmemmap_free(struct page *memmap, unsigned long nr_pages)
+{
+	unsigned long addr = (unsigned long)memmap;
+	unsigned long end = (unsigned long)(memmap + nr_pages);
+	unsigned long next;
+
+	for (; addr < end; addr = next) {
+		pgd_t *pgd = pgd_offset_k(addr);
+
+		next = pgd_addr_end(addr, end);
+		if (!pgd_present(*pgd))
+			continue;
+
+		vmemmap_pud_remove(pgd, addr, next);
+		sync_global_pgds(addr, next - 1);
+	}
+}
+#endif
diff --git a/mm/sparse.c b/mm/sparse.c
index fac95f2..4060229 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -615,10 +615,11 @@ static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
 }
 static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages)
 {
-	return; /* XXX: Not implemented yet */
+	vmemmap_free(memmap, nr_pages);
 }
 static void free_map_bootmem(struct page *page, unsigned long nr_pages)
 {
+	vmemmap_free(page, nr_pages);
 }
 #else
 static struct page *__kmalloc_section_memmap(unsigned long nr_pages)
-- 
1.7.6.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [Patch v4 08/12] memory-hotplug: remove memmap of sparse-vmemmap
  2012-12-03  2:23       ` Jianguo Wu
@ 2012-12-04  9:13         ` Tang Chen
  2012-12-04 12:20           ` Jianguo Wu
  2012-12-07  1:42         ` Tang Chen
  1 sibling, 1 reply; 40+ messages in thread
From: Tang Chen @ 2012-12-04  9:13 UTC (permalink / raw)
  To: Jianguo Wu
  Cc: linux-s390, linux-ia64, Wen Congyang, linux-acpi, linux-sh,
	Len Brown, x86, linux-kernel, cmetcalf, linux-mm,
	Yasuaki Ishimatsu, paulus, Minchan Kim, KOSAKI Motohiro,
	David Rientjes, sparclinux, Christoph Lameter, linuxppc-dev,
	Andrew Morton, Jiang Liu

Hi Wu,

Sorry for the noise here. Please see below. :)

On 12/03/2012 10:23 AM, Jianguo Wu wrote:
> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
> Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
> ---
>   include/linux/mm.h  |    1 +
>   mm/sparse-vmemmap.c |  231 +++++++++++++++++++++++++++++++++++++++++++++++++++
>   mm/sparse.c         |    3 +-
>   3 files changed, 234 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5657670..1f26af5 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1642,6 +1642,7 @@ int vmemmap_populate(struct page *start_page, unsigned long pages, int node);
>   void vmemmap_populate_print_last(void);
>   void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
>   				  unsigned long size);
> +void vmemmap_free(struct page *memmap, unsigned long nr_pages);
>
>   enum mf_flags {
>   	MF_COUNT_INCREASED = 1 << 0,
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index 1b7e22a..748732d 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -29,6 +29,10 @@
>   #include <asm/pgalloc.h>
>   #include <asm/pgtable.h>
>
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +#include <asm/tlbflush.h>
> +#endif
> +
>   /*
>    * Allocate a block of memory to be used to back the virtual memory map
>    * or to back the page tables that are used to create the mapping.
> @@ -224,3 +228,230 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
>   		vmemmap_buf_end = NULL;
>   	}
>   }
> +
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +
> +#define PAGE_INUSE 0xFD
> +
> +static void vmemmap_free_pages(struct page *page, int order)
> +{
> +	struct zone *zone;
> +	unsigned long magic;
> +
> +	magic = (unsigned long) page->lru.next;
> +	if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
> +		put_page_bootmem(page);
> +
> +		zone = page_zone(page);
> +		zone_span_writelock(zone);
> +		zone->present_pages++;
> +		zone_span_writeunlock(zone);
> +		totalram_pages++;

It seems that we have different ways to handle pages allocated by bootmem
and by the regular allocator. Is the check used in [PATCH 09/12]
applicable here?

+	/* bootmem page has reserved flag */
+	if (PageReserved(page)) {
......
+	}

If so, I think we can just merge these two functions.
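Just to illustrate the shape (untested, names illustrative; the bootmem
branch would keep whatever [PATCH 09/12] does, including the zone
accounting from your vmemmap_free_pages() above):

static void free_pagetable(struct page *page, int order)
{
	/* bootmem-allocated pages carry PG_reserved */
	if (PageReserved(page))
		put_page_bootmem(page);
	else
		free_pages((unsigned long)page_address(page), order);
}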

> +	} else
> +		free_pages((unsigned long)page_address(page), order);
> +}
> +
> +static void free_pte_table(pmd_t *pmd)
> +{
> +	pte_t *pte, *pte_start;
> +	int i;
> +
> +	pte_start = (pte_t *)pmd_page_vaddr(*pmd);
> +	for (i = 0; i < PTRS_PER_PTE; i++) {
> +		pte = pte_start + i;
> +		if (pte_val(*pte))
> +			return;
> +	}
> +
> +	/* free a pte table */
> +	vmemmap_free_pages(pmd_page(*pmd), 0);
> +	spin_lock(&init_mm.page_table_lock);
> +	pmd_clear(pmd);
> +	spin_unlock(&init_mm.page_table_lock);
> +}
> +
> +static void free_pmd_table(pud_t *pud)
> +{
> +	pmd_t *pmd, *pmd_start;
> +	int i;
> +
> +	pmd_start = (pmd_t *)pud_page_vaddr(*pud);
> +	for (i = 0; i < PTRS_PER_PMD; i++) {
> +		pmd = pmd_start + i;
> +		if (pmd_val(*pmd))
> +			return;
> +	}
> +
> +	/* free a pmd table */
> +	vmemmap_free_pages(pud_page(*pud), 0);
> +	spin_lock(&init_mm.page_table_lock);
> +	pud_clear(pud);
> +	spin_unlock(&init_mm.page_table_lock);
> +}
> +
> +static void free_pud_table(pgd_t *pgd)
> +{
> +	pud_t *pud, *pud_start;
> +	int i;
> +
> +	pud_start = (pud_t *)pgd_page_vaddr(*pgd);
> +	for (i = 0; i < PTRS_PER_PUD; i++) {
> +		pud = pud_start + i;
> +		if (pud_val(*pud))
> +			return;
> +	}
> +
> +	/* free a pud table */
> +	vmemmap_free_pages(pgd_page(*pgd), 0);
> +	spin_lock(&init_mm.page_table_lock);
> +	pgd_clear(pgd);
> +	spin_unlock(&init_mm.page_table_lock);
> +}

All the free_xxx_table() functions are very similar to the ones in
[PATCH 09/12]. Could we reuse them somehow?
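For example, the empty-table scan could become one helper (untested;
this assumes page table entries are plain words, which holds on
x86_64):

static bool pgtable_is_empty(unsigned long *tbl, int nr_entries)
{
	int i;

	for (i = 0; i < nr_entries; i++)
		if (tbl[i])		/* slot still in use */
			return false;
	return true;
}

Then free_pte_table()/free_pmd_table()/free_pud_table() would only
differ in which upper-level entry they clear afterwards.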

> +
> +static int split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
> +{
> +	struct page *page = pmd_page(*(pmd_t *)kpte);
> +	int i = 0;
> +	unsigned long magic;
> +	unsigned long section_nr;
> +
> +	__split_large_page(kpte, address, pbase);

Is this patch going to replace [PATCH 08/12]?

If so, __split_large_page() was added and exported in [PATCH 09/12],
then we should move it here, right?

If not, free_map_bootmem() and __kfree_section_memmap() were changed in
[PATCH 08/12], and we need to handle this.

> +	__flush_tlb_all();
> +
> +	magic = (unsigned long) page->lru.next;
> +	if (magic == SECTION_INFO) {
> +		section_nr = pfn_to_section_nr(page_to_pfn(page));
> +		while (i < PTRS_PER_PMD) {
> +			page++;
> +			i++;
> +			get_page_bootmem(section_nr, page, SECTION_INFO);
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static void vmemmap_pte_remove(pmd_t *pmd, unsigned long addr, unsigned long end)
> +{
> +	pte_t *pte;
> +	unsigned long next;
> +	void *page_addr;
> +
> +	pte = pte_offset_kernel(pmd, addr);
> +	for (; addr < end; pte++, addr += PAGE_SIZE) {
> +		next = (addr + PAGE_SIZE) & PAGE_MASK;
> +		if (next > end)
> +			next = end;
> +
> +		if (pte_none(*pte))
> +			continue;
> +		if (IS_ALIGNED(addr, PAGE_SIZE) &&
> +		    IS_ALIGNED(next, PAGE_SIZE)) {
> +			vmemmap_free_pages(pte_page(*pte), 0);
> +			spin_lock(&init_mm.page_table_lock);
> +			pte_clear(&init_mm, addr, pte);
> +			spin_unlock(&init_mm.page_table_lock);
> +		} else {
> +			/*
> +			 * Removed page structs are filled with 0xFD.
> +			 */
> +			memset((void *)addr, PAGE_INUSE, next - addr);
> +			page_addr = page_address(pte_page(*pte));
> +
> +			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
> +				spin_lock(&init_mm.page_table_lock);
> +				pte_clear(&init_mm, addr, pte);
> +				spin_unlock(&init_mm.page_table_lock);
> +			}
> +		}
> +	}
> +
> +	free_pte_table(pmd);
> +	__flush_tlb_all();
> +}
> +
> +static void vmemmap_pmd_remove(pud_t *pud, unsigned long addr, unsigned long end)
> +{
> +	unsigned long next;
> +	pmd_t *pmd;
> +
> +	pmd = pmd_offset(pud, addr);
> +	for (; addr < end; addr = next, pmd++) {
> +		next = pmd_addr_end(addr, end);
> +		if (pmd_none(*pmd))
> +			continue;
> +
> +		if (cpu_has_pse) {
> +			unsigned long pte_base;
> +
> +			if (IS_ALIGNED(addr, PMD_SIZE) &&
> +			    IS_ALIGNED(next, PMD_SIZE)) {
> +				vmemmap_free_pages(pmd_page(*pmd),
> +						   get_order(PMD_SIZE));
> +				spin_lock(&init_mm.page_table_lock);
> +				pmd_clear(pmd);
> +				spin_unlock(&init_mm.page_table_lock);
> +				continue;
> +			}
> +
> +			/*
> +			 * We use 2M pages, but we need to remove part of them,
> +			 * so split the 2M page into 4K pages.
> +			 */
> +			pte_base = get_zeroed_page(GFP_ATOMIC | __GFP_NOTRACK);
> +			if (!pte_base) {
> +				WARN_ON(1);
> +				continue;
> +			}
> +
> +			split_large_page((pte_t *)pmd, addr, (pte_t *)pte_base);
> +			__flush_tlb_all();
> +
> +			spin_lock(&init_mm.page_table_lock);
> +			pmd_populate_kernel(&init_mm, pmd, (pte_t *)pte_base);
> +			spin_unlock(&init_mm.page_table_lock);
> +		}
> +
> +		vmemmap_pte_remove(pmd, addr, next);
> +	}
> +
> +	free_pmd_table(pud);
> +	__flush_tlb_all();
> +}
> +
> +static void vmemmap_pud_remove(pgd_t *pgd, unsigned long addr, unsigned long end)
> +{
> +	unsigned long next;
> +	pud_t *pud;
> +
> +	pud = pud_offset(pgd, addr);
> +	for (; addr < end; addr = next, pud++) {
> +		next = pud_addr_end(addr, end);
> +		if (pud_none(*pud))
> +			continue;
> +
> +		vmemmap_pmd_remove(pud, addr, next);
> +	}
> +
> +	free_pud_table(pgd);
> +	__flush_tlb_all();
> +}
> +
> +void vmemmap_free(struct page *memmap, unsigned long nr_pages)
> +{
> +	unsigned long addr = (unsigned long)memmap;
> +	unsigned long end = (unsigned long)(memmap + nr_pages);
> +	unsigned long next;
> +
> +	for (; addr < end; addr = next) {
> +		pgd_t *pgd = pgd_offset_k(addr);
> +
> +		next = pgd_addr_end(addr, end);
> +		if (!pgd_present(*pgd))
> +			continue;
> +
> +		vmemmap_pud_remove(pgd, addr, next);
> +		sync_global_pgds(addr, next - 1);
> +	}
> +}
> +#endif
> diff --git a/mm/sparse.c b/mm/sparse.c
> index fac95f2..4060229 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -615,10 +615,11 @@ static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
>   }
>   static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages)
>   {
> -	return; /* XXX: Not implemented yet */
> +	vmemmap_free(memmap, nr_pages);
>   }
>   static void free_map_bootmem(struct page *page, unsigned long nr_pages)

In the latest kernel, this line was:
static void free_map_bootmem(struct page *memmap, unsigned long nr_pages)

>   {
> +	vmemmap_free(page, nr_pages);
>   }
>   #else
>   static struct page *__kmalloc_section_memmap(unsigned long nr_pages)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v4 01/12] memory-hotplug: try to offline the memory twice to avoid dependence
  2012-11-27 10:00 ` [Patch v4 01/12] memory-hotplug: try to offline the memory twice to avoid dependence Wen Congyang
@ 2012-12-04  9:17   ` Tang Chen
  0 siblings, 0 replies; 40+ messages in thread
From: Tang Chen @ 2012-12-04  9:17 UTC (permalink / raw)
  To: Wen Congyang
  Cc: linux-s390, linux-ia64, Len Brown, linux-acpi, linux-sh, x86,
	linux-kernel, cmetcalf, Jianguo Wu, linux-mm, Yasuaki Ishimatsu,
	paulus, Minchan Kim, KOSAKI Motohiro, David Rientjes, sparclinux,
	Christoph Lameter, linuxppc-dev, Andrew Morton, Jiang Liu

On 11/27/2012 06:00 PM, Wen Congyang wrote:
> memory can't be offlined when CONFIG_MEMCG is selected.
> For example: there is a memory device on node 1. The address range
> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
> and memory11 under the directory /sys/devices/system/memory/.
>
> If CONFIG_MEMCG is selected, we will allocate memory to store the page cgroup
> when we online pages. When we online memory8, the memory that stores its page
> cgroup is not provided by this memory device. But when we online memory9, the
> memory that stores its page cgroup may be provided by memory8. So we can't
> offline memory8 now. We should offline the memory in the reverse order.
>
> When the memory device is hot-removed, we will automatically offline the
> memory provided by this memory device. But we don't know which memory was
> onlined first, so offlining memory may fail. In such a case, iterate twice to
> offline the memory.
> 1st iterate: offline every non primary memory block.
> 2nd iterate: offline primary (i.e. first added) memory block.
>
> This idea is suggested by KOSAKI Motohiro.
>
> CC: David Rientjes <rientjes@google.com>
> CC: Jiang Liu <liuj97@gmail.com>
> CC: Len Brown <len.brown@intel.com>
> CC: Christoph Lameter <cl@linux.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> CC: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>

Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>

> ---
>   mm/memory_hotplug.c | 16 ++++++++++++++--
>   1 file changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e4eeaca..b825dbc 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1012,10 +1012,13 @@ int remove_memory(u64 start, u64 size)
>   	unsigned long start_pfn, end_pfn;
>   	unsigned long pfn, section_nr;
>   	int ret;
> +	int return_on_error = 0;
> +	int retry = 0;
>
>   	start_pfn = PFN_DOWN(start);
>   	end_pfn = start_pfn + PFN_DOWN(size);
>
> +repeat:
>   	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>   		section_nr = pfn_to_section_nr(pfn);
>   		if (!present_section_nr(section_nr))
> @@ -1034,14 +1037,23 @@ int remove_memory(u64 start, u64 size)
>
>   		ret = offline_memory_block(mem);
>   		if (ret) {
> -			kobject_put(&mem->dev.kobj);
> -			return ret;
> +			if (return_on_error) {
> +				kobject_put(&mem->dev.kobj);
> +				return ret;
> +			} else {
> +				retry = 1;
> +			}
>   		}
>   	}
>
>   	if (mem)
>   		kobject_put(&mem->dev.kobj);
>
> +	if (retry) {
> +		return_on_error = 1;
> +		goto repeat;
> +	}
> +
>   	return 0;
>   }
>   #else

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v4 02/12] memory-hotplug: check whether all memory blocks are offlined or not when removing memory
  2012-11-27 10:00 ` [Patch v4 02/12] memory-hotplug: check whether all memory blocks are offlined or not when removing memory Wen Congyang
@ 2012-12-04  9:22   ` Tang Chen
  0 siblings, 0 replies; 40+ messages in thread
From: Tang Chen @ 2012-12-04  9:22 UTC (permalink / raw)
  To: Wen Congyang
  Cc: linux-s390, linux-ia64, Len Brown, linux-acpi, linux-sh, x86,
	linux-kernel, cmetcalf, Jianguo Wu, linux-mm, Yasuaki Ishimatsu,
	paulus, Minchan Kim, KOSAKI Motohiro, David Rientjes, sparclinux,
	Christoph Lameter, linuxppc-dev, Andrew Morton, Jiang Liu

On 11/27/2012 06:00 PM, Wen Congyang wrote:
> From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>
> We remove the memory like this:
> 1. lock memory hotplug
> 2. offline a memory block
> 3. unlock memory hotplug
> 4. repeat 1-3 to offline all memory blocks
> 5. lock memory hotplug
> 6. remove memory(TODO)
> 7. unlock memory hotplug
>
> All memory blocks must be offlined before removing memory. But we don't hold
> the lock for the whole operation. So we should check whether all memory blocks
> are offlined before step 6. Otherwise, the kernel may panic.
>
> CC: David Rientjes <rientjes@google.com>
> CC: Jiang Liu <liuj97@gmail.com>
> CC: Len Brown <len.brown@intel.com>
> CC: Christoph Lameter <cl@linux.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>

> ---
>   drivers/base/memory.c          |  6 ++++++
>   include/linux/memory_hotplug.h |  1 +
>   mm/memory_hotplug.c            | 47 ++++++++++++++++++++++++++++++++++++++++++
>   3 files changed, 54 insertions(+)
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 86c8821..badb025 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -675,6 +675,12 @@ int offline_memory_block(struct memory_block *mem)
>   	return ret;
>   }
>
> +/* return true if the memory block is offlined, otherwise return false */
> +bool is_memblock_offlined(struct memory_block *mem)
> +{
> +	return mem->state == MEM_OFFLINE;
> +}
> +
>   /*
>    * Initialize the sysfs support for memory devices...
>    */
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 95573ec..38675e9 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -236,6 +236,7 @@ extern int add_memory(int nid, u64 start, u64 size);
>   extern int arch_add_memory(int nid, u64 start, u64 size);
>   extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
>   extern int offline_memory_block(struct memory_block *mem);
> +extern bool is_memblock_offlined(struct memory_block *mem);
>   extern int remove_memory(u64 start, u64 size);
>   extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
>   								int nr_pages);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index b825dbc..b6d1101 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1054,6 +1054,53 @@ repeat:
>   		goto repeat;
>   	}
>
> +	lock_memory_hotplug();
> +
> +	/*
> +	 * we have offlined all memory blocks like this:
> +	 *   1. lock memory hotplug
> +	 *   2. offline a memory block
> +	 *   3. unlock memory hotplug
> +	 *
> +	 * repeat steps 1-3 to offline each memory block. All memory blocks
> +	 * must be offlined before removing memory. But we don't hold the
> +	 * lock across the whole operation. So we should check whether all
> +	 * memory blocks are offlined.
> +	 */
> +
> +	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
> +		section_nr = pfn_to_section_nr(pfn);
> +		if (!present_section_nr(section_nr))
> +			continue;
> +
> +		section = __nr_to_section(section_nr);
> +		/* same memblock? */
> +		if (mem)
> +			if ((section_nr >= mem->start_section_nr) &&
> +			    (section_nr <= mem->end_section_nr))
> +				continue;
> +
> +		mem = find_memory_block_hinted(section, mem);
> +		if (!mem)
> +			continue;
> +
> +		ret = is_memblock_offlined(mem);
> +		if (!ret) {
> +			pr_warn("removing memory fails, because memory "
> +				"[%#010llx-%#010llx] is onlined\n",
> +				PFN_PHYS(section_nr_to_pfn(mem->start_section_nr)),
> +				PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1)) - 1);
> +
> +			kobject_put(&mem->dev.kobj);
> +			unlock_memory_hotplug();
> +			return ret;
> +		}
> +	}
> +
> +	if (mem)
> +		kobject_put(&mem->dev.kobj);
> +	unlock_memory_hotplug();
> +
>   	return 0;
>   }
>   #else

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v4 03/12] memory-hotplug: remove redundant codes
  2012-11-27 10:00 ` [Patch v4 03/12] memory-hotplug: remove redundant codes Wen Congyang
@ 2012-12-04  9:22   ` Tang Chen
  2012-12-04 10:31     ` Tang Chen
  0 siblings, 1 reply; 40+ messages in thread
From: Tang Chen @ 2012-12-04  9:22 UTC (permalink / raw)
  To: Wen Congyang
  Cc: linux-s390, linux-ia64, Len Brown, linux-acpi, linux-sh, x86,
	linux-kernel, cmetcalf, Jianguo Wu, linux-mm, Yasuaki Ishimatsu,
	paulus, Minchan Kim, KOSAKI Motohiro, David Rientjes, sparclinux,
	Christoph Lameter, linuxppc-dev, Andrew Morton, Jiang Liu

On 11/27/2012 06:00 PM, Wen Congyang wrote:
> Offlining memory blocks and checking whether memory blocks are offlined
> are very similar operations. This patch introduces a new function to
> remove the redundant code.
>
> CC: David Rientjes <rientjes@google.com>
> CC: Jiang Liu <liuj97@gmail.com>
> CC: Len Brown <len.brown@intel.com>
> CC: Christoph Lameter <cl@linux.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> CC: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>

Can we merge this patch with [PATCH 03/12] ?

Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
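
For context, the callback style introduced below can be used like this;
a minimal sketch that assumes only the walk_memory_range() signature
from the diff (count_blocks_cb is a made-up example callback):

/* made-up example: count the memory blocks spanning a pfn range */
static int count_blocks_cb(struct memory_block *mem, void *arg)
{
	(*(int *)arg)++;
	return 0;	/* a non-zero return would abort the walk */
}

static int count_memory_blocks(unsigned long start_pfn, unsigned long end_pfn)
{
	int count = 0;

	walk_memory_range(start_pfn, end_pfn, &count, count_blocks_cb);
	return count;
}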

> ---
>   mm/memory_hotplug.c | 101 ++++++++++++++++++++++++++++------------------------
>   1 file changed, 55 insertions(+), 46 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index b6d1101..6d06488 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1005,20 +1005,14 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>   	return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ);
>   }
>
> -int remove_memory(u64 start, u64 size)
> +static int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
> +		void *arg, int (*func)(struct memory_block *, void *))
>   {
>   	struct memory_block *mem = NULL;
>   	struct mem_section *section;
> -	unsigned long start_pfn, end_pfn;
>   	unsigned long pfn, section_nr;
>   	int ret;
> -	int return_on_error = 0;
> -	int retry = 0;
> -
> -	start_pfn = PFN_DOWN(start);
> -	end_pfn = start_pfn + PFN_DOWN(size);
>
> -repeat:
>   	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>   		section_nr = pfn_to_section_nr(pfn);
>   		if (!present_section_nr(section_nr))
> @@ -1035,22 +1029,61 @@ repeat:
>   		if (!mem)
>   			continue;
>
> -		ret = offline_memory_block(mem);
> +		ret = func(mem, arg);
>   		if (ret) {
> -			if (return_on_error) {
> -				kobject_put(&mem->dev.kobj);
> -				return ret;
> -			} else {
> -				retry = 1;
> -			}
> +			kobject_put(&mem->dev.kobj);
> +			return ret;
>   		}
>   	}
>
>   	if (mem)
>   		kobject_put(&mem->dev.kobj);
>
> -	if (retry) {
> -		return_on_error = 1;
> +	return 0;
> +}
> +
> +static int offline_memory_block_cb(struct memory_block *mem, void *arg)
> +{
> +	int *ret = arg;
> +	int error = offline_memory_block(mem);
> +
> +	if (error != 0 && *ret == 0)
> +		*ret = error;
> +
> +	return 0;
> +}
> +
> +static int is_memblock_offlined_cb(struct memory_block *mem, void *arg)
> +{
> +	int ret = !is_memblock_offlined(mem);
> +
> +	if (unlikely(ret))
> +		pr_warn("removing memory fails, because memory "
> +			"[%#010llx-%#010llx] is onlined\n",
> +			PFN_PHYS(section_nr_to_pfn(mem->start_section_nr)),
> +			PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1)) - 1);
> +
> +	return ret;
> +}
> +
> +int remove_memory(u64 start, u64 size)
> +{
> +	unsigned long start_pfn, end_pfn;
> +	int ret = 0;
> +	int retry = 1;
> +
> +	start_pfn = PFN_DOWN(start);
> +	end_pfn = start_pfn + PFN_DOWN(size);
> +
> +repeat:
> +	walk_memory_range(start_pfn, end_pfn, &ret,
> +			  offline_memory_block_cb);
> +	if (ret) {
> +		if (!retry)
> +			return ret;
> +
> +		retry = 0;
> +		ret = 0;
>   		goto repeat;
>   	}
>
> @@ -1068,37 +1101,13 @@ repeat:
>   	 * memory blocks are offlined.
>   	 */
>
> -	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
> -		section_nr = pfn_to_section_nr(pfn);
> -		if (!present_section_nr(section_nr))
> -			continue;
> -
> -		section = __nr_to_section(section_nr);
> -		/* same memblock? */
> -		if (mem)
> -		if ((section_nr >= mem->start_section_nr) &&
> -		    (section_nr <= mem->end_section_nr))
> -				continue;
> -
> -		mem = find_memory_block_hinted(section, mem);
> -		if (!mem)
> -			continue;
> -
> -		ret = is_memblock_offlined(mem);
> -		if (!ret) {
> -			pr_warn("removing memory fails, because memory "
> -				"[%#010llx-%#010llx] is onlined\n",
> -				PFN_PHYS(section_nr_to_pfn(mem->start_section_nr)),
> -				PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1)) - 1);
> -
> -			kobject_put(&mem->dev.kobj);
> -			unlock_memory_hotplug();
> -			return ret;
> -		}
> +	ret = walk_memory_range(start_pfn, end_pfn, NULL,
> +				is_memblock_offlined_cb);
> +	if (ret) {
> +		unlock_memory_hotplug();
> +		return ret;
>   	}
>
> -	if (mem)
> -		kobject_put(&mem->dev.kobj);
>   	unlock_memory_hotplug();
>
>   	return 0;

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v4 05/12] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture
  2012-11-27 10:00 ` [Patch v4 05/12] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture Wen Congyang
@ 2012-12-04  9:30   ` Tang Chen
  0 siblings, 0 replies; 40+ messages in thread
From: Tang Chen @ 2012-12-04  9:30 UTC (permalink / raw)
  To: Wen Congyang
  Cc: linux-s390, linux-ia64, Len Brown, linux-acpi, linux-sh, x86,
	linux-kernel, cmetcalf, Jianguo Wu, linux-mm, Yasuaki Ishimatsu,
	paulus, Minchan Kim, KOSAKI Motohiro, David Rientjes, sparclinux,
	Christoph Lameter, linuxppc-dev, Andrew Morton, Jiang Liu

On 11/27/2012 06:00 PM, Wen Congyang wrote:
> For removing memory, we need to remove the page tables. But how to do
> this depends on the architecture. So this patch introduces
> arch_remove_memory() for removing page tables. For now it only calls
> __remove_pages().
>
> Note: __remove_pages() is not implemented for some architectures
>        (I don't know how to implement it for s390).
>
> CC: David Rientjes <rientjes@google.com>
> CC: Jiang Liu <liuj97@gmail.com>
> CC: Len Brown <len.brown@intel.com>
> CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> CC: Paul Mackerras <paulus@samba.org>
> CC: Christoph Lameter <cl@linux.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> CC: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> ---
>   arch/ia64/mm/init.c            | 18 ++++++++++++++++++
>   arch/powerpc/mm/mem.c          | 12 ++++++++++++
>   arch/s390/mm/init.c            | 12 ++++++++++++
>   arch/sh/mm/init.c              | 17 +++++++++++++++++
>   arch/tile/mm/init.c            |  8 ++++++++
>   arch/x86/mm/init_32.c          | 12 ++++++++++++
>   arch/x86/mm/init_64.c          | 15 +++++++++++++++
>   include/linux/memory_hotplug.h |  1 +
>   mm/memory_hotplug.c            |  2 ++
>   9 files changed, 97 insertions(+)
>
> diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
> index 082e383..e333822 100644
> --- a/arch/ia64/mm/init.c
> +++ b/arch/ia64/mm/init.c
> @@ -689,6 +689,24 @@ int arch_add_memory(int nid, u64 start, u64 size)
>
>   	return ret;
>   }
> +
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +int arch_remove_memory(u64 start, u64 size)
> +{
> +	unsigned long start_pfn = start >> PAGE_SHIFT;
> +	unsigned long nr_pages = size >> PAGE_SHIFT;
> +	struct zone *zone;
> +	int ret;
> +
> +	zone = page_zone(pfn_to_page(start_pfn));
> +	ret = __remove_pages(zone, start_pfn, nr_pages);
> +	if (ret)
> +		pr_warn("%s: Problem encountered in __remove_pages() as"
> +			" ret=%d\n", __func__,  ret);
> +
> +	return ret;

Just a little question: why do we handle ret differently on different
platforms? Sometimes we print a message, sometimes we just return, and
sometimes we give a WARN_ON(). But no big deal. :)
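
For reference, the per-arch variants below all reduce to the same core
shape; a condensed sketch (not actual tree code), where only the handling
of a non-zero __remove_pages() return differs between architectures:

int arch_remove_memory(u64 start, u64 size)
{
	unsigned long start_pfn = start >> PAGE_SHIFT;
	unsigned long nr_pages = size >> PAGE_SHIFT;
	struct zone *zone = page_zone(pfn_to_page(start_pfn));
	int ret;

	ret = __remove_pages(zone, start_pfn, nr_pages);
	if (ret)	/* pr_warn(), WARN_ON_ONCE(), or nothing at all */
		pr_warn("%s: __remove_pages() failed: %d\n", __func__, ret);

	return ret;
}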

Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>

> +}
> +#endif
>   #endif
>
>   /*
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index 0dba506..09c6451 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -133,6 +133,18 @@ int arch_add_memory(int nid, u64 start, u64 size)
>
>   	return __add_pages(nid, zone, start_pfn, nr_pages);
>   }
> +
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +int arch_remove_memory(u64 start, u64 size)
> +{
> +	unsigned long start_pfn = start >> PAGE_SHIFT;
> +	unsigned long nr_pages = size >> PAGE_SHIFT;
> +	struct zone *zone;
> +
> +	zone = page_zone(pfn_to_page(start_pfn));
> +	return __remove_pages(zone, start_pfn, nr_pages);
> +}
> +#endif
>   #endif /* CONFIG_MEMORY_HOTPLUG */
>
>   /*
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index 81e596c..b565190 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -257,4 +257,16 @@ int arch_add_memory(int nid, u64 start, u64 size)
>   		vmem_remove_mapping(start, size);
>   	return rc;
>   }
> +
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +int arch_remove_memory(u64 start, u64 size)
> +{
> +	/*
> +	 * There is no hardware or firmware interface which could trigger a
> +	 * hot memory remove on s390. So there is nothing that needs to be
> +	 * implemented.
> +	 */
> +	return -EBUSY;
> +}
> +#endif
>   #endif /* CONFIG_MEMORY_HOTPLUG */
> diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
> index 82cc576..1057940 100644
> --- a/arch/sh/mm/init.c
> +++ b/arch/sh/mm/init.c
> @@ -558,4 +558,21 @@ int memory_add_physaddr_to_nid(u64 addr)
>   EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
>   #endif
>
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +int arch_remove_memory(u64 start, u64 size)
> +{
> +	unsigned long start_pfn = start >> PAGE_SHIFT;
> +	unsigned long nr_pages = size >> PAGE_SHIFT;
> +	struct zone *zone;
> +	int ret;
> +
> +	zone = page_zone(pfn_to_page(start_pfn));
> +	ret = __remove_pages(zone, start_pfn, nr_pages);
> +	if (unlikely(ret))
> +		pr_warn("%s: Failed, __remove_pages() == %d\n", __func__,
> +			ret);
> +
> +	return ret;
> +}
> +#endif
>   #endif /* CONFIG_MEMORY_HOTPLUG */
> diff --git a/arch/tile/mm/init.c b/arch/tile/mm/init.c
> index ef29d6c..2749515 100644
> --- a/arch/tile/mm/init.c
> +++ b/arch/tile/mm/init.c
> @@ -935,6 +935,14 @@ int remove_memory(u64 start, u64 size)
>   {
>   	return -EINVAL;
>   }
> +
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +int arch_remove_memory(u64 start, u64 size)
> +{
> +	/* TODO */
> +	return -EBUSY;
> +}
> +#endif
>   #endif
>
>   struct kmem_cache *pgd_cache;
> diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> index 11a5800..b19eba4 100644
> --- a/arch/x86/mm/init_32.c
> +++ b/arch/x86/mm/init_32.c
> @@ -839,6 +839,18 @@ int arch_add_memory(int nid, u64 start, u64 size)
>
>   	return __add_pages(nid, zone, start_pfn, nr_pages);
>   }
> +
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +int arch_remove_memory(u64 start, u64 size)
> +{
> +	unsigned long start_pfn = start >> PAGE_SHIFT;
> +	unsigned long nr_pages = size >> PAGE_SHIFT;
> +	struct zone *zone;
> +
> +	zone = page_zone(pfn_to_page(start_pfn));
> +	return __remove_pages(zone, start_pfn, nr_pages);
> +}
> +#endif
>   #endif
>
>   /*
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 3baff25..5675335 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -680,6 +680,21 @@ int arch_add_memory(int nid, u64 start, u64 size)
>   }
>   EXPORT_SYMBOL_GPL(arch_add_memory);
>
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +int __ref arch_remove_memory(u64 start, u64 size)
> +{
> +	unsigned long start_pfn = start >> PAGE_SHIFT;
> +	unsigned long nr_pages = size >> PAGE_SHIFT;
> +	struct zone *zone;
> +	int ret;
> +
> +	zone = page_zone(pfn_to_page(start_pfn));
> +	ret = __remove_pages(zone, start_pfn, nr_pages);
> +	WARN_ON_ONCE(ret);
> +
> +	return ret;
> +}
> +#endif
>   #endif /* CONFIG_MEMORY_HOTPLUG */
>
>   static struct kcore_list kcore_vsyscall;
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 38675e9..191b2d9 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -85,6 +85,7 @@ extern void __online_page_free(struct page *page);
>
>   #ifdef CONFIG_MEMORY_HOTREMOVE
>   extern bool is_pageblock_removable_nolock(struct page *page);
> +extern int arch_remove_memory(u64 start, u64 size);
>   #endif /* CONFIG_MEMORY_HOTREMOVE */
>
>   /* reasonably generic interface to expand the physical pages in a zone  */
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 63d5388..e741732 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1111,6 +1111,8 @@ repeat:
>   	/* remove memmap entry */
>   	firmware_map_remove(start, start + size, "System RAM");
>
> +	arch_remove_memory(start, size);
> +
>   	unlock_memory_hotplug();
>
>   	return 0;

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v4 06/12] memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP
  2012-11-27 10:00 ` [Patch v4 06/12] memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP Wen Congyang
@ 2012-12-04  9:34   ` Tang Chen
  0 siblings, 0 replies; 40+ messages in thread
From: Tang Chen @ 2012-12-04  9:34 UTC (permalink / raw)
  To: Wen Congyang
  Cc: linux-s390, linux-ia64, Len Brown, linux-acpi, linux-sh, x86,
	linux-kernel, cmetcalf, Jianguo Wu, linux-mm, Yasuaki Ishimatsu,
	paulus, Minchan Kim, KOSAKI Motohiro, David Rientjes, sparclinux,
	Christoph Lameter, linuxppc-dev, Andrew Morton, Jiang Liu

On 11/27/2012 06:00 PM, Wen Congyang wrote:
> From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>
> Currently, __remove_section() for SPARSEMEM_VMEMMAP does nothing. But even
> if we use SPARSEMEM_VMEMMAP, we can still unregister the memory_section.
>
> So this patch adds unregister_memory_section() to __remove_section().
>
> CC: David Rientjes <rientjes@google.com>
> CC: Jiang Liu <liuj97@gmail.com>
> CC: Len Brown <len.brown@intel.com>
> CC: Christoph Lameter <cl@linux.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>

The CONFIG_SPARSEMEM_VMEMMAP version of __remove_section() will be
integrated with the non-VMEMMAP one in [PATCH 08/12], so I think we can
merge this patch into [PATCH 08/12].

Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>

> ---
>   mm/memory_hotplug.c | 13 ++++++++-----
>   1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e741732..171610d 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -279,11 +279,14 @@ static int __meminit __add_section(int nid, struct zone *zone,
>   #ifdef CONFIG_SPARSEMEM_VMEMMAP
>   static int __remove_section(struct zone *zone, struct mem_section *ms)
>   {
> -	/*
> -	 * XXX: Freeing memmap with vmemmap is not implement yet.
> -	 *      This should be removed later.
> -	 */
> -	return -EBUSY;
> +	int ret = -EINVAL;
> +
> +	if (!valid_section(ms))
> +		return ret;
> +
> +	ret = unregister_memory_section(ms);
> +
> +	return ret;
>   }
>   #else
>   static int __remove_section(struct zone *zone, struct mem_section *ms)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v4 08/12] memory-hotplug: remove memmap of sparse-vmemmap
  2012-11-27 10:00 ` [Patch v4 08/12] memory-hotplug: remove memmap " Wen Congyang
  2012-11-28  9:40   ` Jianguo Wu
@ 2012-12-04  9:47   ` Tang Chen
  1 sibling, 0 replies; 40+ messages in thread
From: Tang Chen @ 2012-12-04  9:47 UTC (permalink / raw)
  To: Wen Congyang
  Cc: linux-s390, linux-ia64, Len Brown, linux-acpi, linux-sh, x86,
	linux-kernel, cmetcalf, Jianguo Wu, linux-mm, Yasuaki Ishimatsu,
	paulus, Minchan Kim, KOSAKI Motohiro, David Rientjes, sparclinux,
	Christoph Lameter, linuxppc-dev, Andrew Morton, Jiang Liu

On 11/27/2012 06:00 PM, Wen Congyang wrote:
>   static int __remove_section(struct zone *zone, struct mem_section *ms)
>   {
>   	unsigned long flags;
> @@ -330,9 +317,9 @@ static int __remove_section(struct zone *zone, struct mem_section *ms)
>   	pgdat_resize_lock(pgdat, &flags);
>   	sparse_remove_one_section(zone, ms);
>   	pgdat_resize_unlock(pgdat, &flags);
> -	return 0;
> +
> +	return ret;

I think we don't need to change this line. :)

Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v4 10/12] memory-hotplug: memory_hotplug: clear zone when removing the memory
  2012-11-27 10:00 ` [Patch v4 10/12] memory-hotplug: memory_hotplug: clear zone when removing the memory Wen Congyang
@ 2012-12-04 10:09   ` Tang Chen
  0 siblings, 0 replies; 40+ messages in thread
From: Tang Chen @ 2012-12-04 10:09 UTC (permalink / raw)
  To: Wen Congyang
  Cc: linux-s390, linux-ia64, Len Brown, linux-acpi, linux-sh, x86,
	linux-kernel, cmetcalf, Jianguo Wu, linux-mm, Yasuaki Ishimatsu,
	paulus, Minchan Kim, KOSAKI Motohiro, David Rientjes, sparclinux,
	Christoph Lameter, linuxppc-dev, Andrew Morton, Jiang Liu

On 11/27/2012 06:00 PM, Wen Congyang wrote:
> From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>
> When memory is added, we update the zone's and pgdat's start_pfn and
> spanned_pages in the function __add_zone(). So we should revert them
> when the memory is removed.
>
> This patch adds a new function, __remove_zone(), to do this.
>
> CC: David Rientjes <rientjes@google.com>
> CC: Jiang Liu <liuj97@gmail.com>
> CC: Len Brown <len.brown@intel.com>
> CC: Christoph Lameter <cl@linux.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>

Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
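
The cases the new shrinking code below has to handle, in sketch form
(illustration only):

/*
 *   zone span:  [ sec0 | sec1 | ... | secN ]
 *
 *   remove sec0 -> advance zone_start_pfn to the next valid section
 *   remove secN -> shrink spanned_pages back to the previous valid one
 *   remove secK -> leaves a hole; the span is unchanged unless no valid
 *                  section remains, in which case it is reset to zero
 */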

> ---
>   mm/memory_hotplug.c | 207 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 207 insertions(+)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 7797e91..aa97d56 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -301,10 +301,213 @@ static int __meminit __add_section(int nid, struct zone *zone,
>   	return register_new_memory(nid, __pfn_to_section(phys_start_pfn));
>   }
>
> +/* find the smallest valid pfn in the range [start_pfn, end_pfn) */
> +static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
> +				     unsigned long start_pfn,
> +				     unsigned long end_pfn)
> +{
> +	struct mem_section *ms;
> +
> +	for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) {
> +		ms = __pfn_to_section(start_pfn);
> +
> +		if (unlikely(!valid_section(ms)))
> +			continue;
> +
> +		if (unlikely(pfn_to_nid(start_pfn) != nid))
> +			continue;
> +
> +		if (zone && zone != page_zone(pfn_to_page(start_pfn)))
> +			continue;
> +
> +		return start_pfn;
> +	}
> +
> +	return 0;
> +}
> +
> +/* find the biggest valid pfn in the range [start_pfn, end_pfn). */
> +static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
> +				    unsigned long start_pfn,
> +				    unsigned long end_pfn)
> +{
> +	struct mem_section *ms;
> +	unsigned long pfn;
> +
> +	/* pfn is the end pfn of a memory section. */
> +	pfn = end_pfn - 1;
> +	for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
> +		ms = __pfn_to_section(pfn);
> +
> +		if (unlikely(!valid_section(ms)))
> +			continue;
> +
> +		if (unlikely(pfn_to_nid(pfn) != nid))
> +			continue;
> +
> +		if (zone && zone != page_zone(pfn_to_page(pfn)))
> +			continue;
> +
> +		return pfn;
> +	}
> +
> +	return 0;
> +}
> +
> +static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
> +			     unsigned long end_pfn)
> +{
> +	unsigned long zone_start_pfn =  zone->zone_start_pfn;
> +	unsigned long zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> +	unsigned long pfn;
> +	struct mem_section *ms;
> +	int nid = zone_to_nid(zone);
> +
> +	zone_span_writelock(zone);
> +	if (zone_start_pfn == start_pfn) {
> +		/*
> +		 * If the section is the smallest section in the zone, we need
> +		 * to shrink zone->zone_start_pfn and zone->spanned_pages.
> +		 * In this case, we find the second smallest valid mem_section
> +		 * for shrinking the zone.
> +		 */
> +		pfn = find_smallest_section_pfn(nid, zone, end_pfn,
> +						zone_end_pfn);
> +		if (pfn) {
> +			zone->zone_start_pfn = pfn;
> +			zone->spanned_pages = zone_end_pfn - pfn;
> +		}
> +	} else if (zone_end_pfn == end_pfn) {
> +		/*
> +		 * If the section is the biggest section in the zone, we need
> +		 * to shrink zone->spanned_pages.
> +		 * In this case, we find the second biggest valid mem_section
> +		 * for shrinking the zone.
> +		 */
> +		pfn = find_biggest_section_pfn(nid, zone, zone_start_pfn,
> +					       start_pfn);
> +		if (pfn)
> +			zone->spanned_pages = pfn - zone_start_pfn + 1;
> +	}
> +
> +	/*
> +	 * If the section is neither the biggest nor the smallest mem_section
> +	 * in the zone, it only creates a hole in the zone. In this case we
> +	 * need not change the zone span. But the zone may now contain nothing
> +	 * but holes, so check whether any valid section remains.
> +	 */
> +	pfn = zone_start_pfn;
> +	for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) {
> +		ms = __pfn_to_section(pfn);
> +
> +		if (unlikely(!valid_section(ms)))
> +			continue;
> +
> +		if (page_zone(pfn_to_page(pfn)) != zone)
> +			continue;
> +
> +		 /* If this is the section being removed, continue the loop */
> +		if (start_pfn == pfn)
> +			continue;
> +
> +		/* If we find a valid section, we have nothing to do */
> +		zone_span_writeunlock(zone);
> +		return;
> +	}
> +
> +	/* The zone has no valid section */
> +	zone->zone_start_pfn = 0;
> +	zone->spanned_pages = 0;
> +	zone_span_writeunlock(zone);
> +}
> +
> +static void shrink_pgdat_span(struct pglist_data *pgdat,
> +			      unsigned long start_pfn, unsigned long end_pfn)
> +{
> +	unsigned long pgdat_start_pfn =  pgdat->node_start_pfn;
> +	unsigned long pgdat_end_pfn =
> +		pgdat->node_start_pfn + pgdat->node_spanned_pages;
> +	unsigned long pfn;
> +	struct mem_section *ms;
> +	int nid = pgdat->node_id;
> +
> +	if (pgdat_start_pfn == start_pfn) {
> +		/*
> +		 * If the section is the smallest section in the pgdat, we need
> +		 * to shrink pgdat->node_start_pfn and pgdat->node_spanned_pages.
> +		 * In this case, we find the second smallest valid mem_section
> +		 * for shrinking the pgdat.
> +		 */
> +		pfn = find_smallest_section_pfn(nid, NULL, end_pfn,
> +						pgdat_end_pfn);
> +		if (pfn) {
> +			pgdat->node_start_pfn = pfn;
> +			pgdat->node_spanned_pages = pgdat_end_pfn - pfn;
> +		}
> +	} else if (pgdat_end_pfn == end_pfn) {
> +		/*
> +		 * If the section is the biggest section in the pgdat, we need
> +		 * to shrink pgdat->node_spanned_pages.
> +		 * In this case, we find the second biggest valid mem_section
> +		 * for shrinking the pgdat.
> +		 */
> +		pfn = find_biggest_section_pfn(nid, NULL, pgdat_start_pfn,
> +					       start_pfn);
> +		if (pfn)
> +			pgdat->node_spanned_pages = pfn - pgdat_start_pfn + 1;
> +	}
> +
> +	/*
> +	 * If the section is neither the biggest nor the smallest mem_section
> +	 * in the pgdat, it only creates a hole in the pgdat. In this case we
> +	 * need not change the pgdat span.
> +	 * But the pgdat may now contain nothing but holes, so check whether
> +	 * any valid section remains.
> +	 */
> +	pfn = pgdat_start_pfn;
> +	for (; pfn < pgdat_end_pfn; pfn += PAGES_PER_SECTION) {
> +		ms = __pfn_to_section(pfn);
> +
> +		if (unlikely(!valid_section(ms)))
> +			continue;
> +
> +		if (pfn_to_nid(pfn) != nid)
> +			continue;
> +
> +		 /* If this is the section being removed, continue the loop */
> +		if (start_pfn == pfn)
> +			continue;
> +
> +		/* If we find a valid section, we have nothing to do */
> +		return;
> +	}
> +
> +	/* The pgdat has no valid section */
> +	pgdat->node_start_pfn = 0;
> +	pgdat->node_spanned_pages = 0;
> +}
> +
> +static void __remove_zone(struct zone *zone, unsigned long start_pfn)
> +{
> +	struct pglist_data *pgdat = zone->zone_pgdat;
> +	int nr_pages = PAGES_PER_SECTION;
> +	int zone_type;
> +	unsigned long flags;
> +
> +	zone_type = zone - pgdat->node_zones;
> +
> +	pgdat_resize_lock(zone->zone_pgdat, &flags);
> +	shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
> +	shrink_pgdat_span(pgdat, start_pfn, start_pfn + nr_pages);
> +	pgdat_resize_unlock(zone->zone_pgdat, &flags);
> +}
> +
>   static int __remove_section(struct zone *zone, struct mem_section *ms)
>   {
>   	unsigned long flags;
>   	struct pglist_data *pgdat = zone->zone_pgdat;
> +	unsigned long start_pfn;
> +	int scn_nr;
>   	int ret = -EINVAL;
>
>   	if (!valid_section(ms))
> @@ -314,6 +517,10 @@ static int __remove_section(struct zone *zone, struct mem_section *ms)
>   	if (ret)
>   		return ret;
>
> +	scn_nr = __section_nr(ms);
> +	start_pfn = section_nr_to_pfn(scn_nr);
> +	__remove_zone(zone, start_pfn);
> +
>   	pgdat_resize_lock(pgdat, &flags);
>   	sparse_remove_one_section(zone, ms);
>   	pgdat_resize_unlock(pgdat, &flags);

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v4 11/12] memory-hotplug: remove sysfs file of node
  2012-11-27 10:00 ` [Patch v4 11/12] memory-hotplug: remove sysfs file of node Wen Congyang
@ 2012-12-04 10:10   ` Tang Chen
  0 siblings, 0 replies; 40+ messages in thread
From: Tang Chen @ 2012-12-04 10:10 UTC (permalink / raw)
  To: Wen Congyang
  Cc: linux-s390, linux-ia64, Len Brown, linux-acpi, linux-sh, x86,
	linux-kernel, cmetcalf, Jianguo Wu, linux-mm, Yasuaki Ishimatsu,
	paulus, Minchan Kim, KOSAKI Motohiro, David Rientjes, sparclinux,
	Christoph Lameter, linuxppc-dev, Andrew Morton, Jiang Liu

On 11/27/2012 06:00 PM, Wen Congyang wrote:
> This patch introduces a new function, try_offline_node(), to
> remove the sysfs files of a node when all memory sections of
> that node have been removed. If some memory sections of the
> node remain, this function does nothing.
>
> CC: David Rientjes <rientjes@google.com>
> CC: Jiang Liu <liuj97@gmail.com>
> CC: Len Brown <len.brown@intel.com>
> CC: Christoph Lameter <cl@linux.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> CC: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>

Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>

> ---
>   drivers/acpi/acpi_memhotplug.c |  8 +++++-
>   include/linux/memory_hotplug.h |  2 +-
>   mm/memory_hotplug.c            | 58 ++++++++++++++++++++++++++++++++++++++++--
>   3 files changed, 64 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> index 24c807f..0780f99 100644
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -310,7 +310,9 @@ static int acpi_memory_disable_device(struct acpi_memory_device *mem_device)
>   {
>   	int result;
>   	struct acpi_memory_info *info, *n;
> +	int node;
>
> +	node = acpi_get_node(mem_device->device->handle);
>
>   	/*
>   	 * Ask the VM to offline this memory range.
> @@ -318,7 +320,11 @@ static int acpi_memory_disable_device(struct acpi_memory_device *mem_device)
>   	 */
>   	list_for_each_entry_safe(info, n, &mem_device->res_list, list) {
>   		if (info->enabled) {
> -			result = remove_memory(info->start_addr, info->length);
> +			if (node < 0)
> +				node = memory_add_physaddr_to_nid(
> +					info->start_addr);
> +			result = remove_memory(node, info->start_addr,
> +				info->length);
>   			if (result)
>   				return result;
>   		}
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index d4c4402..7b4cfe6 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -231,7 +231,7 @@ extern int arch_add_memory(int nid, u64 start, u64 size);
>   extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
>   extern int offline_memory_block(struct memory_block *mem);
>   extern bool is_memblock_offlined(struct memory_block *mem);
> -extern int remove_memory(u64 start, u64 size);
> +extern int remove_memory(int node, u64 start, u64 size);
>   extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
>   								int nr_pages);
>   extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index aa97d56..449663e 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -29,6 +29,7 @@
>   #include <linux/suspend.h>
>   #include <linux/mm_inline.h>
>   #include <linux/firmware-map.h>
> +#include <linux/stop_machine.h>
>
>   #include <asm/tlbflush.h>
>
> @@ -1288,7 +1289,58 @@ static int is_memblock_offlined_cb(struct memory_block *mem, void *arg)
>   	return ret;
>   }
>
> -int __ref remove_memory(u64 start, u64 size)
> +static int check_cpu_on_node(void *data)
> +{
> +	struct pglist_data *pgdat = data;
> +	int cpu;
> +
> +	for_each_present_cpu(cpu) {
> +		if (cpu_to_node(cpu) == pgdat->node_id)
> +			/*
> +			 * a cpu on this node hasn't been removed, so we
> +			 * can't offline this node.
> +			 */
> +			return -EBUSY;
> +	}
> +
> +	return 0;
> +}
> +
> +/* offline the node if all memory sections of this node are removed */
> +static void try_offline_node(int nid)
> +{
> +	unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn;
> +	unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages;
> +	unsigned long pfn;
> +
> +	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
> +		unsigned long section_nr = pfn_to_section_nr(pfn);
> +
> +		if (!present_section_nr(section_nr))
> +			continue;
> +
> +		if (pfn_to_nid(pfn) != nid)
> +			continue;
> +
> +		/*
> +		 * some memory sections of this node have not been removed,
> +		 * so we can't offline the node now.
> +		 */
> +		return;
> +	}
> +
> +	if (stop_machine(check_cpu_on_node, NODE_DATA(nid), NULL))
> +		return;
> +
> +	/*
> +	 * all memory/cpu of this node are removed, we can offline this
> +	 * node now.
> +	 */
> +	node_set_offline(nid);
> +	unregister_one_node(nid);
> +}
> +
> +int __ref remove_memory(int nid, u64 start, u64 size)
>   {
>   	unsigned long start_pfn, end_pfn;
>   	int ret = 0;
> @@ -1335,6 +1387,8 @@ repeat:
>
>   	arch_remove_memory(start, size);
>
> +	try_offline_node(nid);
> +
>   	unlock_memory_hotplug();
>
>   	return 0;
> @@ -1344,7 +1398,7 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>   {
>   	return -EINVAL;
>   }
> -int remove_memory(u64 start, u64 size)
> +int remove_memory(int nid, u64 start, u64 size)
>   {
>   	return -EINVAL;
>   }

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v4 12/12] memory-hotplug: free node_data when a node is offlined
  2012-11-27 10:00 ` [Patch v4 12/12] memory-hotplug: free node_data when a node is offlined Wen Congyang
@ 2012-12-04 10:10   ` Tang Chen
  0 siblings, 0 replies; 40+ messages in thread
From: Tang Chen @ 2012-12-04 10:10 UTC (permalink / raw)
  To: Wen Congyang
  Cc: linux-s390, linux-ia64, Len Brown, linux-acpi, linux-sh, x86,
	linux-kernel, cmetcalf, Jianguo Wu, linux-mm, Yasuaki Ishimatsu,
	paulus, Minchan Kim, KOSAKI Motohiro, David Rientjes, sparclinux,
	Christoph Lameter, linuxppc-dev, Andrew Morton, Jiang Liu

On 11/27/2012 06:00 PM, Wen Congyang wrote:
> We call hotadd_new_pgdat() to allocate memory to store node_data. So we
> should free it when removing a node.
>
> CC: David Rientjes <rientjes@google.com>
> CC: Jiang Liu <liuj97@gmail.com>
> CC: Len Brown <len.brown@intel.com>
> CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> CC: Paul Mackerras <paulus@samba.org>
> CC: Christoph Lameter <cl@linux.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> CC: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>

Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
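
For what it's worth, the PageSlab()/PageCompound() test below can be read
as a predicate; a sketch with a made-up helper name, assuming node_data
for hot-added nodes comes from the slab or page allocator while boot-time
node_data comes straight from bootmem:

static bool pgdat_allocated_at_boot(pg_data_t *pgdat)
{
	struct page *page = virt_to_page(pgdat);

	/* bootmem pages are neither slab pages nor compound pages */
	return !PageSlab(page) && !PageCompound(page);
}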

> ---
>   mm/memory_hotplug.c | 20 +++++++++++++++++++-
>   1 file changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 449663e..d1451ab 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1309,9 +1309,12 @@ static int check_cpu_on_node(void *data)
>   /* offline the node if all memory sections of this node are removed */
>   static void try_offline_node(int nid)
>   {
> +	pg_data_t *pgdat = NODE_DATA(nid);
>   	unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn;
> -	unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages;
> +	unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
>   	unsigned long pfn;
> +	struct page *pgdat_page = virt_to_page(pgdat);
> +	int i;
>
>   	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>   		unsigned long section_nr = pfn_to_section_nr(pfn);
> @@ -1338,6 +1341,21 @@ static void try_offline_node(int nid)
>   	 */
>   	node_set_offline(nid);
>   	unregister_one_node(nid);
> +
> +	if (!PageSlab(pgdat_page) && !PageCompound(pgdat_page))
> +		/* node data is allocated from boot memory */
> +		return;
> +
> +	/* free waittable in each zone */
> +	for (i = 0; i < MAX_NR_ZONES; i++) {
> +		struct zone *zone = pgdat->node_zones + i;
> +
> +		if (zone->wait_table)
> +			vfree(zone->wait_table);
> +	}
> +
> +	arch_refresh_nodedata(nid, NULL);
> +	arch_free_nodedata(pgdat);
>   }
>
>   int __ref remove_memory(int nid, u64 start, u64 size)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v4 03/12] memory-hotplug: remove redundant codes
  2012-12-04  9:22   ` Tang Chen
@ 2012-12-04 10:31     ` Tang Chen
  0 siblings, 0 replies; 40+ messages in thread
From: Tang Chen @ 2012-12-04 10:31 UTC (permalink / raw)
  To: Wen Congyang
  Cc: linux-s390, linux-ia64, Len Brown, linux-acpi, linux-sh, x86,
	linux-kernel, cmetcalf, Jianguo Wu, linux-mm, Yasuaki Ishimatsu,
	paulus, Minchan Kim, KOSAKI Motohiro, David Rientjes, sparclinux,
	Christoph Lameter, linuxppc-dev, Andrew Morton, Jiang Liu

On 12/04/2012 05:22 PM, Tang Chen wrote:
> On 11/27/2012 06:00 PM, Wen Congyang wrote:
>> Offlining memory blocks and checking whether memory blocks are offlined
>> are very similar operations. This patch introduces a new function to
>> remove the redundant code.
>>
>> CC: David Rientjes <rientjes@google.com>
>> CC: Jiang Liu <liuj97@gmail.com>
>> CC: Len Brown <len.brown@intel.com>
>> CC: Christoph Lameter <cl@linux.com>
>> Cc: Minchan Kim <minchan.kim@gmail.com>
>> CC: Andrew Morton <akpm@linux-foundation.org>
>> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>> CC: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>
> Can we merge this patch with [PATCH 03/12] ?

Sorry, I think we can merge this patch into [PATCH 02/12].
Thanks. :)

>
> Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v4 08/12] memory-hotplug: remove memmap of sparse-vmemmap
  2012-12-04  9:13         ` Tang Chen
@ 2012-12-04 12:20           ` Jianguo Wu
  2012-12-05  2:07             ` Tang Chen
  0 siblings, 1 reply; 40+ messages in thread
From: Jianguo Wu @ 2012-12-04 12:20 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-s390, linux-ia64, Wen Congyang, linux-acpi, linux-sh,
	Len Brown, x86, linux-kernel, cmetcalf, linux-mm,
	Yasuaki Ishimatsu, paulus, Minchan Kim, KOSAKI Motohiro,
	David Rientjes, sparclinux, Christoph Lameter, linuxppc-dev,
	Andrew Morton, Jiang Liu

Hi Tang,

Thanks for your review and comments. Please see below for my reply.

On 2012/12/4 17:13, Tang Chen wrote:

> Hi Wu,
> 
> Sorry to make noise here. Please see below. :)
> 
> On 12/03/2012 10:23 AM, Jianguo Wu wrote:
>> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
>> Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
>> ---
>>   include/linux/mm.h  |    1 +
>>   mm/sparse-vmemmap.c |  231 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>   mm/sparse.c         |    3 +-
>>   3 files changed, 234 insertions(+), 1 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 5657670..1f26af5 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1642,6 +1642,7 @@ int vmemmap_populate(struct page *start_page, unsigned long pages, int node);
>>   void vmemmap_populate_print_last(void);
>>   void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
>>                     unsigned long size);
>> +void vmemmap_free(struct page *memmap, unsigned long nr_pages);
>>
>>   enum mf_flags {
>>       MF_COUNT_INCREASED = 1 << 0,
>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>> index 1b7e22a..748732d 100644
>> --- a/mm/sparse-vmemmap.c
>> +++ b/mm/sparse-vmemmap.c
>> @@ -29,6 +29,10 @@
>>   #include <asm/pgalloc.h>
>>   #include <asm/pgtable.h>
>>
>> +#ifdef CONFIG_MEMORY_HOTREMOVE
>> +#include <asm/tlbflush.h>
>> +#endif
>> +
>>   /*
>>    * Allocate a block of memory to be used to back the virtual memory map
>>    * or to back the page tables that are used to create the mapping.
>> @@ -224,3 +228,230 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
>>           vmemmap_buf_end = NULL;
>>       }
>>   }
>> +
>> +#ifdef CONFIG_MEMORY_HOTREMOVE
>> +
>> +#define PAGE_INUSE 0xFD
>> +
>> +static void vmemmap_free_pages(struct page *page, int order)
>> +{
>> +    struct zone *zone;
>> +    unsigned long magic;
>> +
>> +    magic = (unsigned long) page->lru.next;
>> +    if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
>> +        put_page_bootmem(page);
>> +
>> +        zone = page_zone(page);
>> +        zone_span_writelock(zone);
>> +        zone->present_pages++;
>> +        zone_span_writeunlock(zone);
>> +        totalram_pages++;
> 
> Seems that we have different ways to handle pages allocated by bootmem
> or by regular allocator. Is the checking way in [PATCH 09/12] available
> here ?
> 
> +    /* bootmem page has reserved flag */
> +    if (PageReserved(page)) {
> ......
> +    }
> 
> If so, I think we can just merge these two functions.

Hmm, the direct mapping table isn't allocated by a bootmem allocator such as memblock, so it can't be freed by put_page_bootmem().
But I will try to merge these two functions.

> 
>> +    } else
>> +        free_pages((unsigned long)page_address(page), order);
>> +}
>> +
>> +static void free_pte_table(pmd_t *pmd)
>> +{
>> +    pte_t *pte, *pte_start;
>> +    int i;
>> +
>> +    pte_start = (pte_t *)pmd_page_vaddr(*pmd);
>> +    for (i = 0; i < PTRS_PER_PTE; i++) {
>> +        pte = pte_start + i;
>> +        if (pte_val(*pte))
>> +            return;
>> +    }
>> +
>> +    /* free a pte table */
>> +    vmemmap_free_pages(pmd_page(*pmd), 0);
>> +    spin_lock(&init_mm.page_table_lock);
>> +    pmd_clear(pmd);
>> +    spin_unlock(&init_mm.page_table_lock);
>> +}
>> +
>> +static void free_pmd_table(pud_t *pud)
>> +{
>> +    pmd_t *pmd, *pmd_start;
>> +    int i;
>> +
>> +    pmd_start = (pmd_t *)pud_page_vaddr(*pud);
>> +    for (i = 0; i < PTRS_PER_PMD; i++) {
>> +        pmd = pmd_start + i;
>> +        if (pmd_val(*pmd))
>> +            return;
>> +    }
>> +
>> +    /* free a pmd table */
>> +    vmemmap_free_pages(pud_page(*pud), 0);
>> +    spin_lock(&init_mm.page_table_lock);
>> +    pud_clear(pud);
>> +    spin_unlock(&init_mm.page_table_lock);
>> +}
>> +
>> +static void free_pud_table(pgd_t *pgd)
>> +{
>> +    pud_t *pud, *pud_start;
>> +    int i;
>> +
>> +    pud_start = (pud_t *)pgd_page_vaddr(*pgd);
>> +    for (i = 0; i < PTRS_PER_PUD; i++) {
>> +        pud = pud_start + i;
>> +        if (pud_val(*pud))
>> +            return;
>> +    }
>> +
>> +    /* free a pud table */
>> +    vmemmap_free_pages(pgd_page(*pgd), 0);
>> +    spin_lock(&init_mm.page_table_lock);
>> +    pgd_clear(pgd);
>> +    spin_unlock(&init_mm.page_table_lock);
>> +}
> 
> All the free_xxx_table() are very similar to the functions in
> [PATCH 09/12]. Could we reuse them anyway ?

yes, we can reuse them.

> 
>> +
>> +static int split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
>> +{
>> +    struct page *page = pmd_page(*(pmd_t *)kpte);
>> +    int i = 0;
>> +    unsigned long magic;
>> +    unsigned long section_nr;
>> +
>> +    __split_large_page(kpte, address, pbase);
> 
> Is this patch going to replace [PATCH 08/12] ?
> 

I wish to replace [PATCH 08/12], but need Congyang and Yasuaki to confirm first. :)

> If so, __split_large_page() was added and exported in [PATCH 09/12],
> then we should move it here, right ?

yes.

and what do you think about moving vmemmap_pud[pmd/pte]_remove() to arch/x86/mm/init_64.c,
to be consistent with vmemmap_populate() ?

I will rework [PATCH 08/12] and [PATCH 09/12] soon.

Thanks,
Jianguo Wu.

> 
> If not, free_map_bootmem() and __kfree_section_memmap() were changed in
> [PATCH 08/12], and we need to handle this.
> 
>> +    __flush_tlb_all();
>> +
>> +    magic = (unsigned long) page->lru.next;
>> +    if (magic == SECTION_INFO) {
>> +        section_nr = pfn_to_section_nr(page_to_pfn(page));
>> +        while (i < PTRS_PER_PMD) {
>> +            page++;
>> +            i++;
>> +            get_page_bootmem(section_nr, page, SECTION_INFO);
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static void vmemmap_pte_remove(pmd_t *pmd, unsigned long addr, unsigned long end)
>> +{
>> +    pte_t *pte;
>> +    unsigned long next;
>> +    void *page_addr;
>> +
>> +    pte = pte_offset_kernel(pmd, addr);
>> +    for (; addr < end; pte++, addr += PAGE_SIZE) {
>> +        next = (addr + PAGE_SIZE) & PAGE_MASK;
>> +        if (next > end)
>> +            next = end;
>> +
>> +        if (pte_none(*pte))
>> +            continue;
>> +        if (IS_ALIGNED(addr, PAGE_SIZE) &&
>> +            IS_ALIGNED(next, PAGE_SIZE)) {
>> +            vmemmap_free_pages(pte_page(*pte), 0);
>> +            spin_lock(&init_mm.page_table_lock);
>> +            pte_clear(&init_mm, addr, pte);
>> +            spin_unlock(&init_mm.page_table_lock);
>> +        } else {
>> +            /*
>> +             * Removed page structs are filled with 0xFD.
>> +             */
>> +            memset((void *)addr, PAGE_INUSE, next - addr);
>> +            page_addr = page_address(pte_page(*pte));
>> +
>> +            if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
>> +                spin_lock(&init_mm.page_table_lock);
>> +                pte_clear(&init_mm, addr, pte);
>> +                spin_unlock(&init_mm.page_table_lock);
>> +            }
>> +        }
>> +    }
>> +
>> +    free_pte_table(pmd);
>> +    __flush_tlb_all();
>> +}
>> +
>> +static void vmemmap_pmd_remove(pud_t *pud, unsigned long addr, unsigned long end)
>> +{
>> +    unsigned long next;
>> +    pmd_t *pmd;
>> +
>> +    pmd = pmd_offset(pud, addr);
>> +    for (; addr < end; addr = next, pmd++) {
>> +        next = pmd_addr_end(addr, end);
>> +        if (pmd_none(*pmd))
>> +            continue;
>> +
>> +        if (cpu_has_pse) {
>> +            unsigned long pte_base;
>> +
>> +            if (IS_ALIGNED(addr, PMD_SIZE) &&
>> +                IS_ALIGNED(next, PMD_SIZE)) {
>> +                vmemmap_free_pages(pmd_page(*pmd),
>> +                           get_order(PMD_SIZE));
>> +                spin_lock(&init_mm.page_table_lock);
>> +                pmd_clear(pmd);
>> +                spin_unlock(&init_mm.page_table_lock);
>> +                continue;
>> +            }
>> +
>> +            /*
>> +             * We use 2M pages, but we need to remove part of one,
>> +             * so split the 2M page into 4K pages.
>> +             */
>> +            pte_base = get_zeroed_page(GFP_ATOMIC | __GFP_NOTRACK);
>> +            if (!pte_base) {
>> +                WARN_ON(1);
>> +                continue;
>> +            }
>> +
>> +            split_large_page((pte_t *)pmd, addr, (pte_t *)pte_base);
>> +            __flush_tlb_all();
>> +
>> +            spin_lock(&init_mm.page_table_lock);
>> +            pmd_populate_kernel(&init_mm, pmd, (pte_t *)pte_base);
>> +            spin_unlock(&init_mm.page_table_lock);
>> +        }
>> +
>> +        vmemmap_pte_remove(pmd, addr, next);
>> +    }
>> +
>> +    free_pmd_table(pud);
>> +    __flush_tlb_all();
>> +}
>> +
>> +static void vmemmap_pud_remove(pgd_t *pgd, unsigned long addr, unsigned long end)
>> +{
>> +    unsigned long next;
>> +    pud_t *pud;
>> +
>> +    pud = pud_offset(pgd, addr);
>> +    for (; addr < end; addr = next, pud++) {
>> +        next = pud_addr_end(addr, end);
>> +        if (pud_none(*pud))
>> +            continue;
>> +
>> +        vmemmap_pmd_remove(pud, addr, next);
>> +    }
>> +
>> +    free_pud_table(pgd);
>> +    __flush_tlb_all();
>> +}
>> +
>> +void vmemmap_free(struct page *memmap, unsigned long nr_pages)
>> +{
>> +    unsigned long addr = (unsigned long)memmap;
>> +    unsigned long end = (unsigned long)(memmap + nr_pages);
>> +    unsigned long next;
>> +
>> +    for (; addr < end; addr = next) {
>> +        pgd_t *pgd = pgd_offset_k(addr);
>> +
>> +        next = pgd_addr_end(addr, end);
>> +        if (!pgd_present(*pgd))
>> +            continue;
>> +
>> +        vmemmap_pud_remove(pgd, addr, next);
>> +        sync_global_pgds(addr, next - 1);
>> +    }
>> +}
>> +#endif
>> diff --git a/mm/sparse.c b/mm/sparse.c
>> index fac95f2..4060229 100644
>> --- a/mm/sparse.c
>> +++ b/mm/sparse.c
>> @@ -615,10 +615,11 @@ static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
>>   }
>>   static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages)
>>   {
>> -    return; /* XXX: Not implemented yet */
>> +    vmemmap_free(memmap, nr_pages);
>>   }
>>   static void free_map_bootmem(struct page *page, unsigned long nr_pages)
> 
> In the latest kernel, this line was:
> static void free_map_bootmem(struct page *memmap, unsigned long nr_pages)
> 
>>   {
>> +    vmemmap_free(page, nr_pages);
>>   }
>>   #else
>>   static struct page *__kmalloc_section_memmap(unsigned long nr_pages)
> 
> 
> .
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v4 08/12] memory-hotplug: remove memmap of sparse-vmemmap
  2012-12-04 12:20           ` Jianguo Wu
@ 2012-12-05  2:07             ` Tang Chen
  2012-12-05  3:23               ` Jianguo Wu
  0 siblings, 1 reply; 40+ messages in thread
From: Tang Chen @ 2012-12-05  2:07 UTC (permalink / raw)
  To: Jianguo Wu
  Cc: linux-s390, linux-ia64, Wen Congyang, linux-acpi, linux-sh,
	Len Brown, x86, linux-kernel, cmetcalf, linux-mm,
	Yasuaki Ishimatsu, paulus, Minchan Kim, KOSAKI Motohiro,
	David Rientjes, sparclinux, Christoph Lameter, linuxppc-dev,
	Andrew Morton, Jiang Liu

Hi Wu,

On 12/04/2012 08:20 PM, Jianguo Wu wrote:
(snip)
>>
>> Seems that we have different ways to handle pages allocated by bootmem
>> or by regular allocator. Is the checking way in [PATCH 09/12] available
>> here ?
>>
>> +    /* bootmem page has reserved flag */
>> +    if (PageReserved(page)) {
>> ......
>> +    }
>>
>> If so, I think we can just merge these two functions.
>
> Hmm, the direct mapping table isn't allocated by a bootmem allocator such as memblock, so it can't be freed by put_page_bootmem().
> But I will try to merge these two functions.
>

Oh, I didn't notice this, thanks. :)

(snip)

>>> +
>>> +    __split_large_page(kpte, address, pbase);
>>
>> Is this patch going to replace [PATCH 08/12] ?
>>
>
> I wish to replace [PATCH 08/12], but need Congyang and Yasuaki to confirm first. :)
>
>> If so, __split_large_page() was added and exported in [PATCH 09/12],
>> then we should move it here, right ?
>
> yes.
>
> and what do you think about moving vmemmap_pud[pmd/pte]_remove() to arch/x86/mm/init_64.c,
> to be consistent with vmemmap_populate() ?

It is a good idea, since pud/pmd/pte-related code could be platform
dependent. I'm also trying to move vmemmap_free() to
arch/x86/mm/init_64.c. I want to have a common interface just
like vmemmap_populate(). :)
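
The split being discussed would look roughly like this; a sketch only,
reusing names from the patch (the placement was not yet settled at this
point in the thread):

/* generic side, e.g. mm/sparse.c: defer to an arch hook */
static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages)
{
	vmemmap_free(memmap, nr_pages);
}

/* arch side, e.g. arch/x86/mm/init_64.c, mirroring vmemmap_populate() */
void vmemmap_free(struct page *memmap, unsigned long nr_pages);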

>
> I will rework [PATCH 08/12] and [PATCH 09/12] soon.

I am rebasing the whole patch set now, and I think I could finish part
of your work too. A new patch-set is coming soon, and your rework is
also welcome. :)

Thanks. :)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v4 08/12] memory-hotplug: remove memmap of sparse-vmemmap
  2012-12-05  2:07             ` Tang Chen
@ 2012-12-05  3:23               ` Jianguo Wu
  0 siblings, 0 replies; 40+ messages in thread
From: Jianguo Wu @ 2012-12-05  3:23 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-s390, linux-ia64, Wen Congyang, linux-acpi, linux-sh,
	Len Brown, x86, linux-kernel, cmetcalf, linux-mm,
	Yasuaki Ishimatsu, paulus, Minchan Kim, KOSAKI Motohiro,
	David Rientjes, sparclinux, Christoph Lameter, linuxppc-dev,
	Andrew Morton, Jiang Liu

Hi Tang,

On 2012/12/5 10:07, Tang Chen wrote:

> Hi Wu,
> 
> On 12/04/2012 08:20 PM, Jianguo Wu wrote:
> (snip)
>>>
>>> Seems that we have different ways to handle pages allocated by bootmem
>>> or by regular allocator. Is the checking way in [PATCH 09/12] available
>>> here ?
>>>
>>> +    /* bootmem page has reserved flag */
>>> +    if (PageReserved(page)) {
>>> ......
>>> +    }
>>>
>>> If so, I think we can just merge these two functions.
>>
>> Hmm, the direct mapping table isn't allocated by a bootmem allocator such as memblock, so it can't be freed by put_page_bootmem().
>> But I will try to merge these two functions.
>>
> 
> Oh, I didn't notice this, thanks. :)
> 
> (snip)
> 
>>>> +
>>>> +    __split_large_page(kpte, address, pbase);
>>>
>>> Is this patch going to replace [PATCH 08/12] ?
>>>
>>
>> I wish to replace [PATCH 08/12], but need Congyang and Yasuaki to confirm first. :)
>>
>>> If so, __split_large_page() was added and exported in [PATCH 09/12],
>>> then we should move it here, right ?
>>
>> yes.
>>
>> and what do you think about moving vmemmap_pud[pmd/pte]_remove() to arch/x86/mm/init_64.c,
>> to be consistent with vmemmap_populate() ?
> 
> It is a good idea, since pud/pmd/pte-related code could be platform
> dependent. I'm also trying to move vmemmap_free() to
> arch/x86/mm/init_64.c. I want to have a common interface just
> like vmemmap_populate(). :)
> 

Great.

>>
>> I will rework [PATCH 08/12] and [PATCH 09/12] soon.
> 
> I am rebasing the whole patch set now, and I think I could finish part
> of your work too. A new patch-set is coming soon, and your rework is
> also welcome. :)
>

Since you are rebasing now, I will wait for your new patch-set. :)

Thanks.
Jianguo Wu

> Thanks. :)
> 
> 
> 
> .
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Patch v4 08/12] memory-hotplug: remove memmap of sparse-vmemmap
  2012-12-03  2:23       ` Jianguo Wu
  2012-12-04  9:13         ` Tang Chen
@ 2012-12-07  1:42         ` Tang Chen
  2012-12-07  2:20           ` Jianguo Wu
  1 sibling, 1 reply; 40+ messages in thread
From: Tang Chen @ 2012-12-07  1:42 UTC (permalink / raw)
  To: Jianguo Wu
  Cc: linux-s390, linux-ia64, Wen Congyang, linux-acpi, linux-sh,
	Len Brown, x86, linux-kernel, cmetcalf, linux-mm,
	Yasuaki Ishimatsu, paulus, Minchan Kim, KOSAKI Motohiro,
	David Rientjes, sparclinux, Christoph Lameter, linuxppc-dev,
	Andrew Morton, Jiang Liu

Hi Wu,

I met some problems when I was digging into the code. It would be
very kind of you to help me with that. :)

If I misunderstood your code, please tell me.
Please see below. :)

On 12/03/2012 10:23 AM, Jianguo Wu wrote:
> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
> Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
> ---
>   include/linux/mm.h  |    1 +
>   mm/sparse-vmemmap.c |  231 +++++++++++++++++++++++++++++++++++++++++++++++++++
>   mm/sparse.c         |    3 +-
>   3 files changed, 234 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5657670..1f26af5 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1642,6 +1642,7 @@ int vmemmap_populate(struct page *start_page, unsigned long pages, int node);
>   void vmemmap_populate_print_last(void);
>   void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
>   				  unsigned long size);
> +void vmemmap_free(struct page *memmap, unsigned long nr_pages);
>
>   enum mf_flags {
>   	MF_COUNT_INCREASED = 1 << 0,
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index 1b7e22a..748732d 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -29,6 +29,10 @@
>   #include <asm/pgalloc.h>
>   #include <asm/pgtable.h>
>
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +#include <asm/tlbflush.h>
> +#endif
> +
>   /*
>    * Allocate a block of memory to be used to back the virtual memory map
>    * or to back the page tables that are used to create the mapping.
> @@ -224,3 +228,230 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
>   		vmemmap_buf_end = NULL;
>   	}
>   }
> +
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +
> +#define PAGE_INUSE 0xFD
> +
> +static void vmemmap_free_pages(struct page *page, int order)
> +{
> +	struct zone *zone;
> +	unsigned long magic;
> +
> +	magic = (unsigned long) page->lru.next;
> +	if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
> +		put_page_bootmem(page);
> +
> +		zone = page_zone(page);
> +		zone_span_writelock(zone);
> +		zone->present_pages++;
> +		zone_span_writeunlock(zone);
> +		totalram_pages++;
> +	} else
> +		free_pages((unsigned long)page_address(page), order);

Here, I think SECTION_INFO and MIX_SECTION_INFO pages are all allocated
by bootmem, so I wrote the function this way.

I'm not sure the order parameter is necessary here. It will always be 0
in your code. Is this OK with you?

static void free_pagetable(struct page *page)
{
         struct zone *zone;
         bool bootmem = false;
         unsigned long magic;

         /* bootmem page has reserved flag */
         if (PageReserved(page)) {
                 __ClearPageReserved(page);
                 bootmem = true;
         }

         magic = (unsigned long) page->lru.next;
         if (magic == SECTION_INFO || magic == MIX_SECTION_INFO)
                 put_page_bootmem(page);
         else
                 __free_page(page);

         /*
          * SECTION_INFO pages and MIX_SECTION_INFO pages
          * are all allocated by bootmem.
          */
         if (bootmem) {
                 zone = page_zone(page);
                 zone_span_writelock(zone);
                 zone->present_pages++;
                 zone_span_writeunlock(zone);
                 totalram_pages++;
         }
}

(snip)

> +
> +static void vmemmap_pte_remove(pmd_t *pmd, unsigned long addr, unsigned long end)
> +{
> +	pte_t *pte;
> +	unsigned long next;
> +	void *page_addr;
> +
> +	pte = pte_offset_kernel(pmd, addr);
> +	for (; addr < end; pte++, addr += PAGE_SIZE) {
> +		next = (addr + PAGE_SIZE) & PAGE_MASK;
> +		if (next > end)
> +			next = end;
> +
> +		if (pte_none(*pte))

Here, you checked xxx_none() in your vmemmap_xxx_remove(), but you used
!xxx_present() in your x86_64 patches. Is it OK if I only check
!xxx_present()?

> +			continue;
> +		if (IS_ALIGNED(addr, PAGE_SIZE) &&
> +		    IS_ALIGNED(next, PAGE_SIZE)) {
> +			vmemmap_free_pages(pte_page(*pte), 0);
> +			spin_lock(&init_mm.page_table_lock);
> +			pte_clear(&init_mm, addr, pte);
> +			spin_unlock(&init_mm.page_table_lock);
> +		} else {
> +			/*
> +			 * Removed page structs are filled with 0xFD.
> +			 */
> +			memset((void *)addr, PAGE_INUSE, next - addr);
> +			page_addr = page_address(pte_page(*pte));
> +
> +			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
> +				spin_lock(&init_mm.page_table_lock);
> +				pte_clear(&init_mm, addr, pte);
> +				spin_unlock(&init_mm.page_table_lock);

Here, since we clear the pte, we should also free the page, right?

> +			}
> +		}
> +	}
> +
> +	free_pte_table(pmd);
> +	__flush_tlb_all();
> +}
> +
> +static void vmemmap_pmd_remove(pud_t *pud, unsigned long addr, unsigned long end)
> +{
> +	unsigned long next;
> +	pmd_t *pmd;
> +
> +	pmd = pmd_offset(pud, addr);
> +	for (; addr < end; addr = next, pmd++) {
> +		next = pmd_addr_end(addr, end);

And by the way, why isn't there a pte_addr_end() in the kernel?
I saw you calculated it like this:

                 next = (addr + PAGE_SIZE) & PAGE_MASK;
                 if (next > end)
                         next = end;

This logic is very similar to {pmd|pud|pgd}_addr_end(). Shall we add a
pte_addr_end() or something? :)
Since no such helper has existed in the kernel for a long time, I suppose
there must be a reason.
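If we did add one, I imagine it would look something like this, modeled
on pmd_addr_end() in include/asm-generic/pgtable.h (just a sketch, and
the name pte_addr_end is made up here):

#define pte_addr_end(addr, end)						\
({	unsigned long __boundary = ((addr) + PAGE_SIZE) & PAGE_MASK;	\
	(__boundary - 1 < (end) - 1) ? __boundary : (end);		\
})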

I merged free_xxx_table() and remove_xxx_table() as common interfaces.

And again, thanks for your patience and your nice explanations. :)

(snip)


* Re: [Patch v4 08/12] memory-hotplug: remove memmap of sparse-vmemmap
  2012-12-07  1:42         ` Tang Chen
@ 2012-12-07  2:20           ` Jianguo Wu
  0 siblings, 0 replies; 40+ messages in thread
From: Jianguo Wu @ 2012-12-07  2:20 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-s390, linux-ia64, Wen Congyang, linux-acpi, linux-sh,
	Len Brown, x86, linux-kernel, cmetcalf, linux-mm,
	Yasuaki Ishimatsu, paulus, Minchan Kim, KOSAKI Motohiro,
	David Rientjes, sparclinux, Christoph Lameter, linuxppc-dev,
	Andrew Morton, Jiang Liu

Hi Tang,

On 2012/12/7 9:42, Tang Chen wrote:

> Hi Wu,
> 
> I ran into some problems while digging into the code. It would be very
> kind of you to help me with them. :)
> 
> If I misunderstood your code, please tell me.
> Please see below. :)
> 
> On 12/03/2012 10:23 AM, Jianguo Wu wrote:
>> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
>> Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
>> ---
>>   include/linux/mm.h  |    1 +
>>   mm/sparse-vmemmap.c |  231 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>   mm/sparse.c         |    3 +-
>>   3 files changed, 234 insertions(+), 1 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 5657670..1f26af5 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1642,6 +1642,7 @@ int vmemmap_populate(struct page *start_page, unsigned long pages, int node);
>>   void vmemmap_populate_print_last(void);
>>   void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
>>                     unsigned long size);
>> +void vmemmap_free(struct page *memmap, unsigned long nr_pages);
>>
>>   enum mf_flags {
>>       MF_COUNT_INCREASED = 1 << 0,
>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>> index 1b7e22a..748732d 100644
>> --- a/mm/sparse-vmemmap.c
>> +++ b/mm/sparse-vmemmap.c
>> @@ -29,6 +29,10 @@
>>   #include <asm/pgalloc.h>
>>   #include <asm/pgtable.h>
>>
>> +#ifdef CONFIG_MEMORY_HOTREMOVE
>> +#include <asm/tlbflush.h>
>> +#endif
>> +
>>   /*
>>    * Allocate a block of memory to be used to back the virtual memory map
>>    * or to back the page tables that are used to create the mapping.
>> @@ -224,3 +228,230 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
>>           vmemmap_buf_end = NULL;
>>       }
>>   }
>> +
>> +#ifdef CONFIG_MEMORY_HOTREMOVE
>> +
>> +#define PAGE_INUSE 0xFD
>> +
>> +static void vmemmap_free_pages(struct page *page, int order)
>> +{
>> +    struct zone *zone;
>> +    unsigned long magic;
>> +
>> +    magic = (unsigned long) page->lru.next;
>> +    if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
>> +        put_page_bootmem(page);
>> +
>> +        zone = page_zone(page);
>> +        zone_span_writelock(zone);
>> +        zone->present_pages++;
>> +        zone_span_writeunlock(zone);
>> +        totalram_pages++;
>> +    } else
>> +        free_pages((unsigned long)page_address(page), order);
> 
> Here, I think SECTION_INFO and MIX_SECTION_INFO pages are all allocated
> from bootmem, so I restructured the function as below.
> 
> I'm not sure the order parameter is necessary here; it is always 0 in
> your code. Is the following OK with you?
> 

The order parameter is necessary in the cpu_has_pse case:
	vmemmap_pmd_remove
		free_pagetable(pmd_page(*pmd), get_order(PMD_SIZE))
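So I think the helper needs to keep the order argument. On top of your
version (quoted below), maybe something roughly like this (untested
sketch; the per-page put_page_bootmem() loop is only my guess at how a
higher-order bootmem block would be released):

static void free_pagetable(struct page *page, int order)
{
	unsigned long magic;
	unsigned long nr_pages = 1UL << order;

	magic = (unsigned long) page->lru.next;
	if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
		/* bootmem pages: drop the bootmem reference page by page */
		while (nr_pages--)
			put_page_bootmem(page++);
		/* bootmem accounting (present_pages/totalram_pages) omitted */
	} else
		free_pages((unsigned long)page_address(page), order);
}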

> static void free_pagetable(struct page *page)
> {
>         struct zone *zone;
>         bool bootmem = false;
>         unsigned long magic;
> 
>         /* bootmem page has reserved flag */
>         if (PageReserved(page)) {
>                 __ClearPageReserved(page);
>                 bootmem = true;
>         }
> 
>         magic = (unsigned long) page->lru.next;
>         if (magic == SECTION_INFO || magic == MIX_SECTION_INFO)
>                 put_page_bootmem(page);
>         else
>                 __free_page(page);
> 
>         /*
>          * SECTION_INFO pages and MIX_SECTION_INFO pages
>          * are all allocated by bootmem.
>          */
>         if (bootmem) {
>                 zone = page_zone(page);
>                 zone_span_writelock(zone);
>                 zone->present_pages++;
>                 zone_span_writeunlock(zone);
>                 totalram_pages++;
>         }
> }
> 
> (snip)
> 
>> +
>> +static void vmemmap_pte_remove(pmd_t *pmd, unsigned long addr, unsigned long end)
>> +{
>> +    pte_t *pte;
>> +    unsigned long next;
>> +    void *page_addr;
>> +
>> +    pte = pte_offset_kernel(pmd, addr);
>> +    for (; addr < end; pte++, addr += PAGE_SIZE) {
>> +        next = (addr + PAGE_SIZE) & PAGE_MASK;
>> +        if (next > end)
>> +            next = end;
>> +
>> +        if (pte_none(*pte))
> 
> Here, you checked xxx_none() in your vmemmap_xxx_remove(), but you used
> !xxx_present() in your x86_64 patches. Is it OK if I only check
> !xxx_present()?

It is OK.
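(On the vmemmap the two checks should behave the same anyway: kernel
entries there are either empty or present, never PROT_NONE or swap
entries, so e.g.

		if (!pte_present(*pte))
			continue;

is equivalent to the pte_none() check quoted above.)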

> 
>> +            continue;
>> +        if (IS_ALIGNED(addr, PAGE_SIZE) &&
>> +            IS_ALIGNED(next, PAGE_SIZE)) {
>> +            vmemmap_free_pages(pte_page(*pte), 0);
>> +            spin_lock(&init_mm.page_table_lock);
>> +            pte_clear(&init_mm, addr, pte);
>> +            spin_unlock(&init_mm.page_table_lock);
>> +        } else {
>> +            /*
>> +             * Removed page structs are filled with 0xFD.
>> +             */
>> +            memset((void *)addr, PAGE_INUSE, next - addr);
>> +            page_addr = page_address(pte_page(*pte));
>> +
>> +            if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
>> +                spin_lock(&init_mm.page_table_lock);
>> +                pte_clear(&init_mm, addr, pte);
>> +                spin_unlock(&init_mm.page_table_lock);
> 
> Here, since we clear the pte, we should also free the page, right?
> 

Right, I forgot to free it here, sorry.
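Something like this should fix it (untested sketch; it frees before the
clear so that pte_page(*pte) is still valid):

			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
				/* the whole page is filled with 0xFD, free it */
				vmemmap_free_pages(pte_page(*pte), 0);
				spin_lock(&init_mm.page_table_lock);
				pte_clear(&init_mm, addr, pte);
				spin_unlock(&init_mm.page_table_lock);
			}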

>> +            }
>> +        }
>> +    }
>> +
>> +    free_pte_table(pmd);
>> +    __flush_tlb_all();
>> +}
>> +
>> +static void vmemmap_pmd_remove(pud_t *pud, unsigned long addr, unsigned long end)
>> +{
>> +    unsigned long next;
>> +    pmd_t *pmd;
>> +
>> +    pmd = pmd_offset(pud, addr);
>> +    for (; addr < end; addr = next, pmd++) {
>> +        next = pmd_addr_end(addr, end);
> 
> And by the way, why isn't there a pte_addr_end() in the kernel?
> I saw you calculated it like this:
> 
>                 next = (addr + PAGE_SIZE) & PAGE_MASK;
>                 if (next > end)
>                         next = end;
> 
> This logic is very similar to {pmd|pud|pgd}_addr_end(). Shall we add a
> pte_addr_end() or something? :)

Maybe just keep it open-coded for now, if no other place needs pte_addr_end()?

> Since no such helper has existed in the kernel for a long time, I suppose
> there must be a reason.

Maybe because the current kernel never needs to handle addresses that
are not PTE_SIZE aligned?

> 
> I merged free_xxx_table() and remove_xxx_table() as common interfaces.

Great!

Thanks for your work. :)

> 
> And again, thanks for your patience and your nice explanations. :)
> 
> (snip)


* Re: [Patch v4 09/12] memory-hotplug: remove page table of x86_64 architecture
  2012-11-27 10:00 ` [Patch v4 09/12] memory-hotplug: remove page table of x86_64 architecture Wen Congyang
@ 2012-12-07  6:43   ` Tang Chen
  2012-12-07  7:06     ` Jianguo Wu
  0 siblings, 1 reply; 40+ messages in thread
From: Tang Chen @ 2012-12-07  6:43 UTC (permalink / raw)
  To: Wen Congyang
  Cc: linux-ia64, linux-sh, linux-mm, paulus, sparclinux,
	Christoph Lameter, linux-s390, x86, linux-acpi,
	Yasuaki Ishimatsu, KOSAKI Motohiro, David Rientjes, Jiang Liu,
	Len Brown, Jiang Liu, cmetcalf, Jianguo Wu, linux-kernel,
	Minchan Kim, Andrew Morton, linuxppc-dev

On 11/27/2012 06:00 PM, Wen Congyang wrote:
> For hot-removing memory, we should remove the page tables that map the
> memory. So the patch searches the page tables covering the removed
> memory and clears them.

(snip)

> +void __meminit
> +kernel_physical_mapping_remove(unsigned long start, unsigned long end)
> +{
> +	unsigned long next;
> +	bool pgd_changed = false;
> +
> +	start = (unsigned long)__va(start);
> +	end = (unsigned long)__va(end);

Hi Wu,

Here, you expect start and end to be physical addresses. But in the
phys_xxx_remove() functions, I think using virtual addresses would be
just fine; functions like pmd_addr_end() and pud_index() only calculate
an offset.

So, would you please tell me whether we have to use physical addresses here?

Thanks. :)

> +
> +	for (; start < end; start = next) {
> +		pgd_t *pgd = pgd_offset_k(start);
> +		pud_t *pud;
> +
> +		next = pgd_addr_end(start, end);
> +
> +		if (!pgd_present(*pgd))
> +			continue;
> +
> +		pud = map_low_page((pud_t *)pgd_page_vaddr(*pgd));
> +		phys_pud_remove(pud, __pa(start), __pa(next));
> +		if (free_pud_table(pud, pgd))
> +			pgd_changed = true;
> +		unmap_low_page(pud);
> +	}
> +
> +	if (pgd_changed)
> +		sync_global_pgds(start, end - 1);
> +
> +	flush_tlb_all();
> +}
> +
>   #ifdef CONFIG_MEMORY_HOTREMOVE
>   int __ref arch_remove_memory(u64 start, u64 size)
>   {
> @@ -692,6 +921,8 @@ int __ref arch_remove_memory(u64 start, u64 size)
>   	ret = __remove_pages(zone, start_pfn, nr_pages);
>   	WARN_ON_ONCE(ret);
>
> +	kernel_physical_mapping_remove(start, start + size);
> +
>   	return ret;
>   }
>   #endif


* Re: [Patch v4 09/12] memory-hotplug: remove page table of x86_64 architecture
  2012-12-07  6:43   ` Tang Chen
@ 2012-12-07  7:06     ` Jianguo Wu
  0 siblings, 0 replies; 40+ messages in thread
From: Jianguo Wu @ 2012-12-07  7:06 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, sparclinux,
	Christoph Lameter, linux-s390, x86, linux-acpi,
	Yasuaki Ishimatsu, KOSAKI Motohiro, David Rientjes, Jiang Liu,
	Len Brown, Jiang Liu, Wen Congyang, cmetcalf, linux-kernel,
	Minchan Kim, Andrew Morton, linuxppc-dev

On 2012/12/7 14:43, Tang Chen wrote:

> On 11/27/2012 06:00 PM, Wen Congyang wrote:
>> For hot-removing memory, we should remove the page tables that map the
>> memory. So the patch searches the page tables covering the removed
>> memory and clears them.
> 
> (snip)
> 
>> +void __meminit
>> +kernel_physical_mapping_remove(unsigned long start, unsigned long end)
>> +{
>> +    unsigned long next;
>> +    bool pgd_changed = false;
>> +
>> +    start = (unsigned long)__va(start);
>> +    end = (unsigned long)__va(end);
> 
> Hi Wu,
> 
> Here, you expect start and end to be physical addresses. But in the
> phys_xxx_remove() functions, I think using virtual addresses would be
> just fine; functions like pmd_addr_end() and pud_index() only calculate
> an offset.
>

Hi Tang,

Virtual addresses would work fine; I used physical addresses in order to
stay consistent with phys_pud[pmd/pte]_init(). So I think we should keep
this as it is.
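Just to illustrate, the all-virtual variant you describe would mainly
drop the __pa()/__va() round-trip, e.g. (untested sketch):

	for (; start < end; start = next) {
		pgd_t *pgd = pgd_offset_k(start);
		pud_t *pud;

		next = pgd_addr_end(start, end);
		if (!pgd_present(*pgd))
			continue;

		pud = map_low_page((pud_t *)pgd_page_vaddr(*pgd));
		/* phys_pud_remove() would then take virtual addresses */
		phys_pud_remove(pud, start, next);
		if (free_pud_table(pud, pgd))
			pgd_changed = true;
		unmap_low_page(pud);
	}

But then phys_pud[pmd/pte]_remove() would no longer mirror their _init()
counterparts, which take physical addresses.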

Thanks,
Jianguo Wu

> So, would you please tell me whether we have to use physical addresses here?
> 
> Thanks. :)
> 
>> +
>> +    for (; start < end; start = next) {
>> +        pgd_t *pgd = pgd_offset_k(start);
>> +        pud_t *pud;
>> +
>> +        next = pgd_addr_end(start, end);
>> +
>> +        if (!pgd_present(*pgd))
>> +            continue;
>> +
>> +        pud = map_low_page((pud_t *)pgd_page_vaddr(*pgd));
>> +        phys_pud_remove(pud, __pa(start), __pa(next));
>> +        if (free_pud_table(pud, pgd))
>> +            pgd_changed = true;
>> +        unmap_low_page(pud);
>> +    }
>> +
>> +    if (pgd_changed)
>> +        sync_global_pgds(start, end - 1);
>> +
>> +    flush_tlb_all();
>> +}
>> +
>>   #ifdef CONFIG_MEMORY_HOTREMOVE
>>   int __ref arch_remove_memory(u64 start, u64 size)
>>   {
>> @@ -692,6 +921,8 @@ int __ref arch_remove_memory(u64 start, u64 size)
>>       ret = __remove_pages(zone, start_pfn, nr_pages);
>>       WARN_ON_ONCE(ret);
>>
>> +    kernel_physical_mapping_remove(start, start + size);
>> +
>>       return ret;
>>   }
>>   #endif
> 


end of thread, other threads:[~2012-12-07  7:06 UTC | newest]

Thread overview: 40+ messages
2012-11-27 10:00 [Patch v4 00/12] memory-hotplug: hot-remove physical memory Wen Congyang
2012-11-27 10:00 ` [Patch v4 01/12] memory-hotplug: try to offline the memory twice to avoid dependence Wen Congyang
2012-12-04  9:17   ` Tang Chen
2012-11-27 10:00 ` [Patch v4 02/12] memory-hotplug: check whether all memory blocks are offlined or not when removing memory Wen Congyang
2012-12-04  9:22   ` Tang Chen
2012-11-27 10:00 ` [Patch v4 03/12] memory-hotplug: remove redundant codes Wen Congyang
2012-12-04  9:22   ` Tang Chen
2012-12-04 10:31     ` Tang Chen
2012-11-27 10:00 ` [Patch v4 04/12] memory-hotplug: remove /sys/firmware/memmap/X sysfs Wen Congyang
2012-11-27 10:00 ` [Patch v4 05/12] memory-hotplug: introduce new function arch_remove_memory() for removing page table depends on architecture Wen Congyang
2012-12-04  9:30   ` Tang Chen
2012-11-27 10:00 ` [Patch v4 06/12] memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP Wen Congyang
2012-12-04  9:34   ` Tang Chen
2012-11-27 10:00 ` [Patch v4 07/12] memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap Wen Congyang
2012-11-27 10:00 ` [Patch v4 08/12] memory-hotplug: remove memmap " Wen Congyang
2012-11-28  9:40   ` Jianguo Wu
2012-11-30  1:45     ` Wen Congyang
2012-11-30  2:47       ` Jianguo Wu
2012-11-30  2:55         ` Yasuaki Ishimatsu
2012-12-03  2:23       ` Jianguo Wu
2012-12-04  9:13         ` Tang Chen
2012-12-04 12:20           ` Jianguo Wu
2012-12-05  2:07             ` Tang Chen
2012-12-05  3:23               ` Jianguo Wu
2012-12-07  1:42         ` Tang Chen
2012-12-07  2:20           ` Jianguo Wu
2012-12-04  9:47   ` Tang Chen
2012-11-27 10:00 ` [Patch v4 09/12] memory-hotplug: remove page table of x86_64 architecture Wen Congyang
2012-12-07  6:43   ` Tang Chen
2012-12-07  7:06     ` Jianguo Wu
2012-11-27 10:00 ` [Patch v4 10/12] memory-hotplug: memory_hotplug: clear zone when removing the memory Wen Congyang
2012-12-04 10:09   ` Tang Chen
2012-11-27 10:00 ` [Patch v4 11/12] memory-hotplug: remove sysfs file of node Wen Congyang
2012-12-04 10:10   ` Tang Chen
2012-11-27 10:00 ` [Patch v4 12/12] memory-hotplug: free node_data when a node is offlined Wen Congyang
2012-12-04 10:10   ` Tang Chen
2012-11-27 19:27 ` [Patch v4 00/12] memory-hotplug: hot-remove physical memory Andrew Morton
2012-11-27 19:38   ` Rafael J. Wysocki
2012-11-28  0:43   ` Yasuaki Ishimatsu
2012-11-30  6:37   ` Tang Chen
