* [PATCH v2 0/5] Memory hotplug support for arm64 - complete patchset v2
@ 2017-11-23 11:13 ` Andrea Reale
  0 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-11-23 11:13 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-kernel, linux-mm, m.bielski, arunks, mark.rutland,
	scott.branden, will.deacon, qiuxishi, catalin.marinas, mhocko,
	realean2

Hi all,

this is a second round of patches to introduce memory hotplug and
hotremove support for arm64. It builds on the work previously published at
[1] and it implements the feedback received in the first round of reviews.

The patchset applies and has been tested on commit bebc6082da0a ("Linux
4.14"). 

Due to a small regression introduced with commit 8135d8926c08
("mm: memory_hotplug: memory hotremove supports thp migration"), you
will need to apply patch [2] first, until the fix is upstreamed.

Comments and feedback are gold.

[1] https://lkml.org/lkml/2017/4/11/536
[2] https://lkml.org/lkml/2017/11/20/902

Changes v1->v2:
- swapper pgtable updated in place on hot add, avoiding unnecessary copy
- stop_machine used to update swapper on hot add, avoiding races
- introduced check on offlining state before hot remove
- new memblock flag used to mark partially unused vmemmap pages, avoiding
  the nasty 0xFD hack used in the prev rev (and in x86 hot remove code)
- proper cleaning sequence for p[um]ds,ptes and related TLB management
- Removed macros that changed hot remove behavior based on number
  of pgtable levels. Now this is hidden in the pgtable traversal macros.
- Check on the corner case where P[UM]Ds would have to be split during
  hot remove: now this is forbidden.
- Minor fixes and refactoring.

Andrea Reale (4):
  mm: memory_hotplug: Remove assumption on memory state before hotremove
  mm: memory_hotplug: memblock to track partially removed vmemmap mem
  mm: memory_hotplug: Add memory hotremove probe device
  mm: memory-hotplug: Add memory hot remove support for arm64

Maciej Bielski (1):
  mm: memory_hotplug: Memory hotplug (add) support for arm64

 arch/arm64/Kconfig             |  15 +
 arch/arm64/configs/defconfig   |   2 +
 arch/arm64/include/asm/mmu.h   |   7 +
 arch/arm64/mm/init.c           | 116 ++++++++
 arch/arm64/mm/mmu.c            | 609 ++++++++++++++++++++++++++++++++++++++++-
 drivers/acpi/acpi_memhotplug.c |   2 +-
 drivers/base/memory.c          |  34 ++-
 include/linux/memblock.h       |  12 +
 include/linux/memory_hotplug.h |   9 +-
 mm/memblock.c                  |  32 +++
 mm/memory_hotplug.c            |  13 +-
 11 files changed, 835 insertions(+), 16 deletions(-)

-- 
2.7.4
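
For readers who want to exercise the hot-add path from userspace, the following is a minimal sketch, not part of the patchset. It assumes the generic sysfs memory hotplug interface (a `probe` file under /sys/devices/system/memory and a per-block `state` file) is available, which in this series depends on ARCH_MEMORY_PROBE from patch 1/5. The helper names `write_attr`, `probe_memory` and `online_block` are hypothetical and exist only for illustration.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Write one line of text to a sysfs-style attribute file.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
static int write_attr(const char *path, const char *text)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fputs(text, f) >= 0;
	/* Buffered data is flushed on close, so a failed store shows up here. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Ask the kernel to hot-add the memory block starting at phys_addr. */
static int probe_memory(const char *probe_path, uint64_t phys_addr)
{
	char buf[32];

	snprintf(buf, sizeof(buf), "0x%" PRIx64 "\n", phys_addr);
	return write_attr(probe_path, buf);
}

/* Bring a previously added block online through its "state" attribute. */
static int online_block(const char *state_path)
{
	return write_attr(state_path, "online\n");
}
```

On a kernel built with this series, one would presumably call `probe_memory()` on /sys/devices/system/memory/probe with a block-aligned physical address, then `online_block()` on the `state` file of the memory block that appears. Both the address and the block number depend on the platform and on the memory block size, so they are left to the caller.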

* [PATCH v2 1/5] mm: memory_hotplug: Memory hotplug (add) support for arm64
  2017-11-23 11:13 ` Andrea Reale
  (?)
@ 2017-11-23 11:13   ` Maciej Bielski
  -1 siblings, 0 replies; 156+ messages in thread
From: Maciej Bielski @ 2017-11-23 11:13 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-kernel, linux-mm, ar, arunks, mark.rutland, scott.branden,
	will.deacon, qiuxishi, catalin.marinas, mhocko, realean2

Introduces memory hotplug functionality (hot-add) for arm64.

Changes v1->v2:
- swapper pgtable updated in place on hot add, avoiding unnecessary copy:
  all changes are additive and non destructive.

- stop_machine used to update swapper on hot add, avoiding races

- checking if pagealloc is under debug to stay coherent with mem_map

Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
---
 arch/arm64/Kconfig           | 12 ++++++
 arch/arm64/configs/defconfig |  1 +
 arch/arm64/include/asm/mmu.h |  3 ++
 arch/arm64/mm/init.c         | 87 ++++++++++++++++++++++++++++++++++++++++++++
 arch/arm64/mm/mmu.c          | 39 ++++++++++++++++++++
 5 files changed, 142 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 0df64a6..c736bba 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -641,6 +641,14 @@ config HOTPLUG_CPU
 	  Say Y here to experiment with turning CPUs off and on.  CPUs
 	  can be controlled through /sys/devices/system/cpu.
 
+config ARCH_HAS_ADD_PAGES
+	def_bool y
+	depends on ARCH_ENABLE_MEMORY_HOTPLUG
+
+config ARCH_ENABLE_MEMORY_HOTPLUG
+	def_bool y
+	depends on !NUMA
+
 # Common NUMA Features
 config NUMA
 	bool "Numa Memory Allocation and Scheduler Support"
@@ -715,6 +723,10 @@ config ARCH_HAS_CACHE_LINE_SIZE
 
 source "mm/Kconfig"
 
+config ARCH_MEMORY_PROBE
+	def_bool y
+	depends on MEMORY_HOTPLUG
+
 config SECCOMP
 	bool "Enable seccomp to safely compute untrusted bytecode"
 	---help---
diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index 34480e9..5fc5656 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -80,6 +80,7 @@ CONFIG_ARM64_VA_BITS_48=y
 CONFIG_SCHED_MC=y
 CONFIG_NUMA=y
 CONFIG_PREEMPT=y
+CONFIG_MEMORY_HOTPLUG=y
 CONFIG_KSM=y
 CONFIG_TRANSPARENT_HUGEPAGE=y
 CONFIG_CMA=y
diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 0d34bf0..2b3fa4d 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -40,5 +40,8 @@ extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
 			       pgprot_t prot, bool page_mappings_only);
 extern void *fixmap_remap_fdt(phys_addr_t dt_phys);
 extern void mark_linear_text_alias_ro(void);
+#ifdef CONFIG_MEMORY_HOTPLUG
+extern void hotplug_paging(phys_addr_t start, phys_addr_t size);
+#endif
 
 #endif
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 5960bef..e96e7d3 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -722,3 +722,90 @@ static int __init register_mem_limit_dumper(void)
 	return 0;
 }
 __initcall(register_mem_limit_dumper);
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+int add_pages(int nid, unsigned long start_pfn,
+		unsigned long nr_pages, bool want_memblock)
+{
+	int ret;
+	u64 start_addr = start_pfn << PAGE_SHIFT;
+	/*
+	 * Mark the first page in the range as unusable. This is needed
+	 * because __add_section (within __add_pages) wants pfn_valid
+	 * of it to be false, and in arm64 pfn_valid is implemented by
+	 * just checking the nomap flag for existing blocks.
+	 *
+	 * A small trick here is that __add_section() requires only
+	 * phys_start_pfn (that is the first pfn of a section) to be
+	 * invalid. Regardless of whether it was assumed (by the function
+	 * author) that all pfns within a section are either all valid
+	 * or all invalid, this lets us avoid looping twice (once here,
+	 * again when memblock_clear_nomap() is called) through all
+	 * pfns of the section and modify only one pfn. Thanks to that,
+	 * further, in __add_zone() only this very first pfn is skipped
+	 * and corresponding page is not flagged reserved. Therefore it
+	 * is enough to correct this setup only for it.
+	 *
+	 * When arch_add_memory() returns, the walk_memory_range() function
+	 * is called with the online_memory_block() callback,
+	 * whose execution finally reaches the memory_block_action()
+	 * function, where also only the first pfn of a memory block is
+	 * checked to be reserved. Above, it was first pfn of a section,
+	 * here it is a block but
+	 * (drivers/base/memory.c):
+	 *     sections_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
+	 * (include/linux/memory.h):
+	 *     #define MIN_MEMORY_BLOCK_SIZE     (1UL << SECTION_SIZE_BITS)
+	 * so we can treat block and section as equivalent
+	 */
+	memblock_mark_nomap(start_addr, 1<<PAGE_SHIFT);
+	ret = __add_pages(nid, start_pfn, nr_pages, want_memblock);
+
+	/*
+	 * Make the pages usable after they have been added.
+	 * This will make pfn_valid return true
+	 */
+	memblock_clear_nomap(start_addr, 1<<PAGE_SHIFT);
+
+	/*
+	 * This is a hack to avoid having to mix arch specific code
+	 * into arch independent code. SetPageReserved is supposed
+	 * to be called by __add_zone (within __add_section, within
+	 * __add_pages). However, when it is called there, it assumes that
+	 * pfn_valid returns true.  For the way pfn_valid is implemented
+	 * in arm64 (a check on the nomap flag), the only way to make
+	 * this evaluate true inside __add_zone is to clear the nomap
+	 * flags of blocks in architecture independent code.
+	 *
+	 * To avoid this, we set the Reserved flag here after we cleared
+	 * the nomap flag in the line above.
+	 */
+	SetPageReserved(pfn_to_page(start_pfn));
+
+	return ret;
+}
+
+int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
+{
+	int ret;
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	unsigned long end_pfn = start_pfn + nr_pages;
+	unsigned long max_sparsemem_pfn = 1UL << (MAX_PHYSMEM_BITS-PAGE_SHIFT);
+
+	if (end_pfn > max_sparsemem_pfn) {
+		pr_err("end_pfn too big\n");
+		return -EINVAL;
+	}
+	hotplug_paging(start, size);
+
+	ret = add_pages(nid, start_pfn, nr_pages, want_memblock);
+
+	if (ret)
+		pr_warn("%s: Problem encountered in __add_pages() ret=%d\n",
+			__func__, ret);
+
+	return ret;
+}
+
+#endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index f1eb15e..d93043d 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -28,6 +28,7 @@
 #include <linux/mman.h>
 #include <linux/nodemask.h>
 #include <linux/memblock.h>
+#include <linux/stop_machine.h>
 #include <linux/fs.h>
 #include <linux/io.h>
 #include <linux/mm.h>
@@ -615,6 +616,44 @@ void __init paging_init(void)
 		      SWAPPER_DIR_SIZE - PAGE_SIZE);
 }
 
+#ifdef CONFIG_MEMORY_HOTPLUG
+
+/*
+ * hotplug_paging() is used by memory hotplug to build new page tables
+ * for hot added memory.
+ */
+
+struct mem_range {
+	phys_addr_t base;
+	phys_addr_t size;
+};
+
+static int __hotplug_paging(void *data)
+{
+	int flags = 0;
+	struct mem_range *section = data;
+
+	if (debug_pagealloc_enabled())
+		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
+
+	__create_pgd_mapping(swapper_pg_dir, section->base,
+			__phys_to_virt(section->base), section->size,
+			PAGE_KERNEL, pgd_pgtable_alloc, flags);
+
+	return 0;
+}
+
+inline void hotplug_paging(phys_addr_t start, phys_addr_t size)
+{
+	struct mem_range section = {
+		.base = start,
+		.size = size,
+	};
+
+	stop_machine(__hotplug_paging, &section, NULL);
+}
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
 /*
  * Check whether a kernel address is valid (derived from arch/x86/).
  */
-- 
2.7.4
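
The range check at the top of arch_add_memory() can be modelled in plain userspace C. This is a sketch under stated assumptions rather than kernel code: the `SKETCH_` constants are illustrative values (4 KiB pages, 48 physical address bits) and the real ones come from the kernel configuration. The `pfn_range` struct and `sketch_range_to_pfns()` are hypothetical names. Returning -EINVAL is an illustrative choice for the out-of-range case.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed values for illustration only; the kernel derives the real ones. */
#define SKETCH_PAGE_SHIFT       12	/* 4 KiB pages */
#define SKETCH_MAX_PHYSMEM_BITS 48	/* sparsemem physical address limit */

struct pfn_range {
	uint64_t start_pfn;
	uint64_t nr_pages;
};

/*
 * Convert a physical (start, size) range into page frame numbers and
 * reject ranges that end beyond the highest pfn sparsemem can describe.
 */
static int sketch_range_to_pfns(uint64_t start, uint64_t size,
				struct pfn_range *out)
{
	uint64_t start_pfn = start >> SKETCH_PAGE_SHIFT;
	uint64_t nr_pages = size >> SKETCH_PAGE_SHIFT;
	uint64_t end_pfn = start_pfn + nr_pages;
	uint64_t max_pfn = UINT64_C(1) <<
			   (SKETCH_MAX_PHYSMEM_BITS - SKETCH_PAGE_SHIFT);

	if (end_pfn > max_pfn)
		return -EINVAL;

	out->start_pfn = start_pfn;
	out->nr_pages = nr_pages;
	return 0;
}
```

With these assumed constants, a 1 GiB range starting at 2 GiB maps to start_pfn 0x80000 and 0x40000 pages, while any range ending above 2^48 bytes is rejected before page tables are touched.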

* [PATCH v2 2/5] mm: memory_hotplug: Remove assumption on memory state before hotremove
  2017-11-23 11:13 ` Andrea Reale
  (?)
@ 2017-11-23 11:14   ` Andrea Reale
  -1 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-11-23 11:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-kernel, linux-mm, m.bielski, arunks, mark.rutland,
	scott.branden, will.deacon, qiuxishi, catalin.marinas, mhocko,
	rafael.j.wysocki, realean2

Commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal")
introduced the assumption that, by the time control reaches
remove_memory(), the corresponding memory has already been offlined.
In that case, the acpi_memhotplug code made sure that the assumption
held.
This assumption, however, does not necessarily hold if offlining
and removal are not done by the same "controller" (for example,
when the memory is first offlined via sysfs).

Remove this assumption from the generic remove_memory() code, which
now returns an error instead of calling BUG(), and move it into the
acpi_memhotplug code. This is a dependency for the software-aided
arm64 offlining and removal process.

Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
Signed-off-by: Maciej Bielski <m.bielski@linux.vnet.ibm.com>
---
 drivers/acpi/acpi_memhotplug.c |  2 +-
 include/linux/memory_hotplug.h |  9 ++++++---
 mm/memory_hotplug.c            | 13 +++++++++----
 3 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 6b0d3ef..b0126a0 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -282,7 +282,7 @@ static void acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
 			nid = memory_add_physaddr_to_nid(info->start_addr);
 
 		acpi_unbind_memory_blocks(info);
-		remove_memory(nid, info->start_addr, info->length);
+		BUG_ON(remove_memory(nid, info->start_addr, info->length));
 		list_del(&info->list);
 		kfree(info);
 	}
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 58e110a..1a9c7b2 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -295,7 +295,7 @@ static inline bool movable_node_is_enabled(void)
 extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
 extern void try_offline_node(int nid);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
-extern void remove_memory(int nid, u64 start, u64 size);
+extern int remove_memory(int nid, u64 start, u64 size);
 
 #else
 static inline bool is_mem_section_removable(unsigned long pfn,
@@ -311,7 +311,10 @@ static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	return -EINVAL;
 }
 
-static inline void remove_memory(int nid, u64 start, u64 size) {}
+static inline int remove_memory(int nid, u64 start, u64 size)
+{
+	return -EINVAL;
+}
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
 extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
@@ -323,7 +326,7 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 		unsigned long nr_pages);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
-extern void remove_memory(int nid, u64 start, u64 size);
+extern int remove_memory(int nid, u64 start, u64 size);
 extern int sparse_add_one_section(struct pglist_data *pgdat, unsigned long start_pfn);
 extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
 		unsigned long map_offset);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d4b5f29..d5f15af 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1892,7 +1892,7 @@ EXPORT_SYMBOL(try_offline_node);
  * and online/offline operations before this call, as required by
  * try_offline_node().
  */
-void __ref remove_memory(int nid, u64 start, u64 size)
+int __ref remove_memory(int nid, u64 start, u64 size)
 {
 	int ret;
 
@@ -1908,18 +1908,23 @@ void __ref remove_memory(int nid, u64 start, u64 size)
 	ret = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
 				check_memblock_offlined_cb);
 	if (ret)
-		BUG();
+		goto end_remove;
+
+	ret = arch_remove_memory(start, size);
+
+	if (ret)
+		goto end_remove;
 
 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");
 	memblock_free(start, size);
 	memblock_remove(start, size);
 
-	arch_remove_memory(start, size);
-
 	try_offline_node(nid);
 
+end_remove:
 	mem_hotplug_done();
+	return ret;
 }
 EXPORT_SYMBOL_GPL(remove_memory);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
-- 
2.7.4
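
The control flow that this patch gives remove_memory() can be modelled
in userspace as below. This is a sketch under stated assumptions, not
the kernel code: struct block_model and remove_block() are hypothetical,
and -EBUSY is an assumed error value (in the patch the value is whatever
walk_memory_range() returns).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state of one memory range. */
struct block_model {
	bool offlined;		/* set by an earlier offline step, e.g. via sysfs */
	bool mapped;		/* arch-level mappings still present */
	bool registered;	/* still present in firmware map / memblock */
	int arch_error;		/* error to inject into the arch-specific step */
	int hotplug_lock;	/* models mem_hotplug_begin()/mem_hotplug_done() */
};

int remove_block(struct block_model *b)
{
	int ret = 0;

	b->hotplug_lock++;		/* models mem_hotplug_begin() */

	if (!b->offlined) {		/* models the offlined check */
		ret = -EBUSY;		/* assumed value, see the note above */
		goto end_remove;
	}

	ret = b->arch_error;		/* models arch_remove_memory() */
	if (ret)
		goto end_remove;
	b->mapped = false;

	b->registered = false;		/* models firmware_map_remove()/memblock_remove() */

end_remove:
	b->hotplug_lock--;		/* models mem_hotplug_done() */
	return ret;
}
```

The single exit label means the lock is dropped on every path, and the
bookkeeping is only torn down after the arch-specific step has
succeeded.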

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH v2 3/5] mm: memory_hotplug: memblock to track partially removed vmemmap mem
  2017-11-23 11:13 ` Andrea Reale
  (?)
@ 2017-11-23 11:14   ` Andrea Reale
  -1 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-11-23 11:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-kernel, linux-mm, m.bielski, arunks, mark.rutland,
	scott.branden, will.deacon, qiuxishi, catalin.marinas, mhocko,
	realean2

When hot-removing memory we need to free vmemmap memory.
However, depending on the memory being removed, it might
not always be possible to free a full vmemmap page / huge-page
because part of it might still be in use.

Commit ae9aae9eda2d ("memory-hotplug: common APIs to support page tables
hot-remove") introduced a workaround for x86
hot-remove, by which partially unused areas are filled with
the 0xFD constant. Full pages are only removed when fully
filled by 0xFDs.

This commit introduces a MEMBLOCK_UNUSED_VMEMMAP memblock flag, with
the goal of using it in place of 0xFDs. For now, this will be used for
the arm64 port of memory hot remove, but the idea is to eventually use
the same mechanism for x86 as well.

Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
---
 include/linux/memblock.h | 12 ++++++++++++
 mm/memblock.c            | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index bae11c7..0daec05 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -26,6 +26,9 @@ enum {
 	MEMBLOCK_HOTPLUG	= 0x1,	/* hotpluggable region */
 	MEMBLOCK_MIRROR		= 0x2,	/* mirrored region */
 	MEMBLOCK_NOMAP		= 0x4,	/* don't add to kernel direct mapping */
+#ifdef CONFIG_MEMORY_HOTREMOVE
+	MEMBLOCK_UNUSED_VMEMMAP	= 0x8,  /* mark unused VMEMMAP blocks */
+#endif
 };
 
 struct memblock_region {
@@ -90,6 +93,10 @@ int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
 int memblock_mark_nomap(phys_addr_t base, phys_addr_t size);
 int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
 ulong choose_memblock_flags(void);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int memblock_mark_unused_vmemmap(phys_addr_t base, phys_addr_t size);
+int memblock_clear_unused_vmemmap(phys_addr_t base, phys_addr_t size);
+#endif
 
 /* Low level functions */
 int memblock_add_range(struct memblock_type *type,
@@ -182,6 +189,11 @@ static inline bool memblock_is_nomap(struct memblock_region *m)
 	return m->flags & MEMBLOCK_NOMAP;
 }
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+bool memblock_is_vmemmap_unused_range(struct memblock_type *mt,
+		phys_addr_t start, phys_addr_t end);
+#endif
+
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
 			    unsigned long  *end_pfn);
diff --git a/mm/memblock.c b/mm/memblock.c
index 9120578..30d5aa4 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -809,6 +809,18 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
 	return memblock_setclr_flag(base, size, 0, MEMBLOCK_NOMAP);
 }
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int __init_memblock memblock_mark_unused_vmemmap(phys_addr_t base,
+		phys_addr_t size)
+{
+	return memblock_setclr_flag(base, size, 1, MEMBLOCK_UNUSED_VMEMMAP);
+}
+int __init_memblock memblock_clear_unused_vmemmap(phys_addr_t base,
+		phys_addr_t size)
+{
+	return memblock_setclr_flag(base, size, 0, MEMBLOCK_UNUSED_VMEMMAP);
+}
+#endif
 /**
  * __next_reserved_mem_region - next function for for_each_reserved_region()
  * @idx: pointer to u64 loop variable
@@ -1696,6 +1708,26 @@ void __init_memblock memblock_trim_memory(phys_addr_t align)
 	}
 }
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+bool __init_memblock memblock_is_vmemmap_unused_range(struct memblock_type *mt,
+		phys_addr_t start, phys_addr_t end)
+{
+	u64 i;
+	struct memblock_region *r;
+
+	i = memblock_search(mt, start);
+	r = &(mt->regions[i]);
+	while (r->base < end) {
+		if (!(r->flags & MEMBLOCK_UNUSED_VMEMMAP))
+			return 0;
+
+		r = &(mt->regions[++i]);
+	}
+
+	return 1;
+}
+#endif
+
 void __init_memblock memblock_set_current_limit(phys_addr_t limit)
 {
 	memblock.current_limit = limit;
-- 
2.7.4
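
The range check introduced by this patch walks consecutive regions and
requires every one that starts before the end address to carry the new
flag. A minimal userspace sketch of the same walk follows, with two
defensive additions that are assumptions of this sketch and not part of
the patch: it stops at the end of the table, and it treats a failed
lookup as "not unused". struct region, struct region_table,
region_search() and range_is_unused_vmemmap() are hypothetical names,
and the lookup is a linear scan for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same bit value as MEMBLOCK_UNUSED_VMEMMAP in the patch. */
enum { REGION_UNUSED_VMEMMAP = 0x8 };

struct region {
	uint64_t base;
	uint64_t size;
	unsigned long flags;
};

struct region_table {
	size_t cnt;
	const struct region *regions;	/* sorted by base, non-overlapping */
};

/* Index of the region containing addr, or -1 if there is none. */
long region_search(const struct region_table *t, uint64_t addr)
{
	for (size_t i = 0; i < t->cnt; i++) {
		const struct region *r = &t->regions[i];

		if (addr >= r->base && addr - r->base < r->size)
			return (long)i;
	}
	return -1;
}

/* True if every region from the one holding start up to end is flagged. */
bool range_is_unused_vmemmap(const struct region_table *t,
			     uint64_t start, uint64_t end)
{
	long first = region_search(t, start);

	if (first < 0)
		return false;

	for (size_t i = (size_t)first;
	     i < t->cnt && t->regions[i].base < end; i++) {
		if (!(t->regions[i].flags & REGION_UNUSED_VMEMMAP))
			return false;
	}
	return true;
}
```

Like the loop in the patch, this only inspects region flags; it does
not detect holes between regions.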

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH v2 4/5] mm: memory_hotplug: Add memory hotremove probe device
  2017-11-23 11:13 ` Andrea Reale
  (?)
@ 2017-11-23 11:14   ` Andrea Reale
  -1 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-11-23 11:14 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-kernel, linux-mm, m.bielski, arunks, mark.rutland,
	scott.branden, will.deacon, qiuxishi, catalin.marinas, mhocko,
	realean2

Add a "remove" sysfs handle that can be used to trigger
memory hot-remove manually, symmetrically to the "probe"
device used for hot-add.

This is useful for architectures that do not rely on
ACPI for memory hot-remove.

Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
---
 drivers/base/memory.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 1d60b58..8ccb67c 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -530,7 +530,36 @@ memory_probe_store(struct device *dev, struct device_attribute *attr,
 }
 
 static DEVICE_ATTR(probe, S_IWUSR, NULL, memory_probe_store);
-#endif
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+static ssize_t
+memory_remove_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	u64 phys_addr;
+	int nid, ret;
+	unsigned long pages_per_block = PAGES_PER_SECTION * sections_per_block;
+
+	ret = kstrtoull(buf, 0, &phys_addr);
+	if (ret)
+		return ret;
+
+	if (phys_addr & ((pages_per_block << PAGE_SHIFT) - 1))
+		return -EINVAL;
+
+	nid = memory_add_physaddr_to_nid(phys_addr);
+	ret = lock_device_hotplug_sysfs();
+	if (ret)
+		return ret;
+
+	remove_memory(nid, phys_addr,
+			 MIN_MEMORY_BLOCK_SIZE * sections_per_block);
+	unlock_device_hotplug();
+	return count;
+}
+static DEVICE_ATTR(remove, S_IWUSR, NULL, memory_remove_store);
+#endif /* CONFIG_MEMORY_HOTREMOVE */
+#endif /* CONFIG_ARCH_MEMORY_PROBE */
 
 #ifdef CONFIG_MEMORY_FAILURE
 /*
@@ -790,6 +819,9 @@ bool is_memblock_offlined(struct memory_block *mem)
 static struct attribute *memory_root_attrs[] = {
 #ifdef CONFIG_ARCH_MEMORY_PROBE
 	&dev_attr_probe.attr,
+#ifdef CONFIG_MEMORY_HOTREMOVE
+	&dev_attr_remove.attr,
+#endif
 #endif
 
 #ifdef CONFIG_MEMORY_FAILURE
-- 
2.7.4
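
The input handling in memory_remove_store() comes down to two steps:
parse a physical address from the written string, then reject it unless
it is aligned to the memory block size (pages_per_block << PAGE_SHIFT in
the patch). A minimal userspace sketch of those two steps follows.
parse_block_addr() is a hypothetical helper, and strtoull() only
approximates kstrtoull(): it tolerates leading whitespace and a sign,
which the kernel helper does not.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse buf as an address and check that it is block-aligned.
 * Returns 0 and stores the address in *out, or a negative errno value.
 */
int parse_block_addr(const char *buf, uint64_t block_size, uint64_t *out)
{
	char *end;
	unsigned long long val;

	/* The mask test below only works for power-of-two block sizes. */
	assert(block_size != 0 && (block_size & (block_size - 1)) == 0);

	errno = 0;
	val = strtoull(buf, &end, 0);	/* base 0: accepts decimal, 0x..., 0... */
	if (end == buf)
		return -EINVAL;		/* no digits at all */
	if (errno == ERANGE)
		return -ERANGE;		/* does not fit in 64 bits */
	if (*end == '\n')
		end++;			/* allow the newline that echo appends */
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage */
	if (val & (block_size - 1))
		return -EINVAL;		/* not aligned to a memory block */

	*out = val;
	return 0;
}
```

With a 1 GiB block size, "0x80000000" is accepted and "0x80001000" is
rejected. Note that the handler in the patch returns count whatever
remove_memory() returns, so a caller writing to the sysfs file cannot
tell from the write alone whether the removal succeeded.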

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH v2 5/5] mm: memory-hotplug: Add memory hot remove support for arm64
  2017-11-23 11:13 ` Andrea Reale
  (?)
@ 2017-11-23 11:15   ` Andrea Reale
  -1 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-11-23 11:15 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-kernel, linux-mm, m.bielski, arunks, mark.rutland,
	scott.branden, will.deacon, qiuxishi, catalin.marinas, mhocko,
	realean2

Implementation of pagetable cleanup routines for arm64 memory hot remove.

How to offline:
 1. Logical Hot remove (offline)
 - # echo offline > /sys/devices/system/memory/memoryXX/state
 2. Physical Hot remove (offline)
 - (if offline is successful)
 - # echo $section_phy_address > /sys/devices/system/memory/remove
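
The two steps above could also be driven from a small user-space helper.
The following is a minimal sketch under some assumptions: write_str() and
offline_and_remove() are hypothetical names, the sysfs root is passed in as
a parameter (normally "/sys/devices/system/memory"), and the caller is
expected to know the block number and the block's physical start address.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Write a string to a file; returns 0 on success, -1 on failure. */
static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fputs(val, f) >= 0;
	/* sysfs store errors may only show up when the buffer is flushed */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/*
 * Step 1: offline the block via memory<block>/state.
 * Step 2: only if step 1 succeeded, write the physical address to "remove".
 */
static int offline_and_remove(const char *root, unsigned int block,
			      uint64_t phys_addr)
{
	char path[256], val[32];

	snprintf(path, sizeof(path), "%s/memory%u/state", root, block);
	if (write_str(path, "offline"))
		return -1;

	snprintf(path, sizeof(path), "%s/remove", root);
	snprintf(val, sizeof(val), "0x%" PRIx64, phys_addr);
	return write_str(path, val);
}
```

The helper stops after step 1 if offlining fails, mirroring the
"(if offline is successful)" condition above.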

Changes v1->v2:
- introduced check on offlining state before hot remove:
  in x86 (and possibly other architectures), offlining of pages and hot
  remove of physical memory happen in a single step, i.e., via an ACPI
  event. In this patchset we are introducing a "remove" sysfs handle
  that triggers the physical hot-remove process after manual offlining.

- new memblock flag used to mark partially unused vmemmap pages, avoiding
  the nasty 0xFD hack used in the prev rev (and in x86 hot remove code):
  the hot remove process needs to take care of freeing vmemmap pages
  and mappings for the memory being removed. Sometimes it might not be
  possible to fully free a vmemmap page (because it is still being used
  for other mappings); in such a case we mark part of that page as unused
  and free it only when it is fully unused. In the previous version, in
  symmetry with the x86 hot remove code, we did this marking by filling
  the unused parts of the page with an arbitrary 0xFD constant. In this
  version, we are using a new memblock flag for the same purpose.

- proper cleaning sequence for p[um]ds,ptes and related TLB management:
  i) clear the page table, ii) flush tlb, iii) free the pagetable page

- Removed macros that changed hot remove behavior based on number
  of pgtable levels. Now this is hidden in the pgtable traversal macros.

- Check on the corner case where P[UM]Ds would have to be split during
  hot remove: now this is forbidden.
  Hot addition and removal is done at SECTION_SIZE_BITS granularity
  (currently 1GB).  The only case when we would have to split a P[UM]D
  is when SECTION_SIZE_BITS is smaller than a P[UM]D mapped area (never
  by default), AND when we are removing some P[UM]D-mapped memory that
  was never hot-added (there since boot).  If the above conditions hold,
  we avoid splitting the P[UM]Ds and, instead, we forbid hot removal.

- Minor fixes and refactoring.
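
The "mark partially unused, free when fully unused" bookkeeping described
in the vmemmap item above can be modelled in user space. The sketch below
is not the memblock implementation from this series: it tracks the unused
parts of a single backing page with a bitmap, and the names (struct vpage,
vpage_mark_unused() and friends), the 4 KiB page size and the 64-byte
tracking granularity are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VPAGE_SIZE	4096u	/* assumed page size */
#define CHUNK_SIZE	64u	/* assumed tracking granularity */
#define NR_CHUNKS	(VPAGE_SIZE / CHUNK_SIZE)	/* 64: fits one u64 */

struct vpage {
	uint64_t unused;	/* bit i set => chunk i is unused */
};

/* Mark the chunks that lie entirely inside [off, off + len) as unused. */
static void vpage_mark_unused(struct vpage *p, size_t off, size_t len)
{
	size_t first = (off + CHUNK_SIZE - 1) / CHUNK_SIZE;
	size_t last = (off + len) / CHUNK_SIZE;	/* exclusive */
	size_t i;

	if (last > NR_CHUNKS)
		last = NR_CHUNKS;
	for (i = first; i < last; i++)
		p->unused |= UINT64_C(1) << i;
}

/* The backing page may be freed only once every chunk is unused. */
static bool vpage_fully_unused(const struct vpage *p)
{
	return p->unused == UINT64_MAX;
}

/* Reset the marks, e.g. when the range is populated again on hot-add. */
static void vpage_clear_unused(struct vpage *p)
{
	p->unused = 0;
}
```

Compared with filling the page with a magic byte, the state lives outside
the page itself, so checking it does not require scanning the page contents.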

Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
---
 arch/arm64/Kconfig           |   3 +
 arch/arm64/configs/defconfig |   1 +
 arch/arm64/include/asm/mmu.h |   4 +
 arch/arm64/mm/init.c         |  29 +++
 arch/arm64/mm/mmu.c          | 572 ++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 601 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c736bba..c362ddf 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -649,6 +649,9 @@ config ARCH_ENABLE_MEMORY_HOTPLUG
 	def_bool y
     depends on !NUMA
 
+config ARCH_ENABLE_MEMORY_HOTREMOVE
+	def_bool y
+
 # Common NUMA Features
 config NUMA
 	bool "Numa Memory Allocation and Scheduler Support"
diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index 5fc5656..cdac3b8 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -81,6 +81,7 @@ CONFIG_SCHED_MC=y
 CONFIG_NUMA=y
 CONFIG_PREEMPT=y
 CONFIG_MEMORY_HOTPLUG=y
+CONFIG_MEMORY_HOTREMOVE=y
 CONFIG_KSM=y
 CONFIG_TRANSPARENT_HUGEPAGE=y
 CONFIG_CMA=y
diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 2b3fa4d..ca11567 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -42,6 +42,10 @@ extern void *fixmap_remap_fdt(phys_addr_t dt_phys);
 extern void mark_linear_text_alias_ro(void);
 #ifdef CONFIG_MEMORY_HOTPLUG
 extern void hotplug_paging(phys_addr_t start, phys_addr_t size);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+extern int remove_pagetable(unsigned long start,
+	unsigned long end, bool linear_map, bool check_split);
+#endif
 #endif
 
 #endif
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index e96e7d3..406b378 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -808,4 +808,33 @@ int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
 	return ret;
 }
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	unsigned long va_start = (unsigned long) __va(start);
+	unsigned long va_end = (unsigned long)__va(start + size);
+	struct page *page = pfn_to_page(start_pfn);
+	struct zone *zone;
+	int ret = 0;
+
+	/*
+	 * Check if mem can be removed without splitting
+	 * PUD/PMD mappings.
+	 */
+	ret = remove_pagetable(va_start, va_end, true, true);
+	if (!ret) {
+		zone = page_zone(page);
+		ret = __remove_pages(zone, start_pfn, nr_pages);
+		WARN_ON_ONCE(ret);
+
+		/* Actually remove the mapping */
+		remove_pagetable(va_start, va_end, true, false);
+	}
+
+	return ret;
+}
+
+#endif /* CONFIG_MEMORY_HOTREMOVE */
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d93043d..e6f8c91 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -25,6 +25,7 @@
 #include <linux/ioport.h>
 #include <linux/kexec.h>
 #include <linux/libfdt.h>
+#include <linux/memremap.h>
 #include <linux/mman.h>
 #include <linux/nodemask.h>
 #include <linux/memblock.h>
@@ -652,12 +653,532 @@ inline void hotplug_paging(phys_addr_t start, phys_addr_t size)
 
 	stop_machine(__hotplug_paging, &section, NULL);
 }
-#endif /* CONFIG_MEMORY_HOTPLUG */
+#ifdef CONFIG_MEMORY_HOTREMOVE
+
+static void  free_pagetable(struct page *page, int order, bool linear_map)
+{
+	unsigned long magic;
+	unsigned int nr_pages = 1 << order;
+	struct vmem_altmap *altmap = to_vmem_altmap((unsigned long) page);
+
+	if (altmap) {
+		vmem_altmap_free(altmap, nr_pages);
+		return;
+	}
+
+	/* bootmem page has reserved flag */
+	if (PageReserved(page)) {
+		__ClearPageReserved(page);
+
+		magic = (unsigned long)page->lru.next;
+		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
+			while (nr_pages--)
+				put_page_bootmem(page++);
+		} else {
+			while (nr_pages--)
+				free_reserved_page(page++);
+		}
+	} else {
+		/*
+		 * Only linear_map pagetable allocation (those allocated via
+		 * hotplug) call the pgtable_page_ctor; vmemmap pgtable
+		 * allocations don't.
+		 */
+		if (linear_map)
+			pgtable_page_dtor(page);
+
+		free_pages((unsigned long)page_address(page), order);
+	}
+}
+
+static void free_pte_table(unsigned long addr, pmd_t *pmd, bool linear_map)
+{
+	pte_t *pte;
+	struct page *page;
+	int i;
+
+	pte = pte_offset_kernel(pmd, 0L);
+	/* Check that there is no valid entry left in the PTE table */
+	for (i = 0; i < PTRS_PER_PTE; i++, pte++) {
+		if (!pte_none(*pte))
+			return;
+	}
+
+	page = pmd_page(*pmd);
+	/*
+	 * This spin lock could only be taken in __pte_alloc_kernel()
+	 * in mm/memory.c and nowhere else (for arm64). Not sure if
+	 * the function above can be called concurrently. In doubt,
+	 * I am leaving it here for now, but it probably can be removed.
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pmd_clear(pmd);
+	spin_unlock(&init_mm.page_table_lock);
+
+	/* Make sure addr is aligned with the first address of the PMD */
+	addr &= PMD_MASK;
+	/*
+	 * Invalidate TLB walk caches to PTE.
+	 * Not sure how the TLB walk caches are indexed, i.e., whether
+	 * just by addr & PMD_MASK or by any address in the range.
+	 * Flushing the whole PMD range to stay on the safe
+	 * side.
+	 */
+	flush_tlb_kernel_range(addr, addr + PMD_SIZE);
+
+	free_pagetable(page, 0, linear_map);
+}
+
+static void free_pmd_table(unsigned long addr, pud_t *pud, bool linear_map)
+{
+	pmd_t *pmd;
+	struct page *page;
+	int i;
+
+	pmd = pmd_offset(pud, 0L);
+	/*
+	 * If PMD is folded onto PUD, cleanup was already performed
+	 * up in the call stack. No more work needs to be done.
+	 */
+	if ((pud_t *) pmd == pud)
+		return;
+
+	/* Check if there is no valid entry in the PMD */
+	for (i = 0; i < PTRS_PER_PMD; i++, pmd++) {
+		if (!pmd_none(*pmd))
+			return;
+	}
+
+	page = pud_page(*pud);
+	/*
+	 * This spin lock could only be taken in __pte_alloc_kernel()
+	 * in mm/memory.c and nowhere else (for arm64). Not sure if
+	 * the function above can be called concurrently. In doubt,
+	 * I am leaving it here for now, but it probably can be removed.
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pud_clear(pud);
+	spin_unlock(&init_mm.page_table_lock);
+
+	/* Make sure addr is aligned with the first address of the PUD */
+	addr &= PUD_MASK;
+	/*
+	 * Invalidate TLB walk caches to PMD.
+	 * Not sure how the TLB walk caches are indexed, i.e., whether
+	 * just by addr & PUD_MASK or by any address in the range.
+	 * Flushing the whole PUD range to stay on the safe
+	 * side.
+	 */
+	flush_tlb_kernel_range(addr, addr + PUD_SIZE);
+
+	free_pagetable(page, 0, linear_map);
+}
+
+static void free_pud_table(unsigned long addr, pgd_t *pgd, bool linear_map)
+{
+	pud_t *pud;
+	struct page *page;
+	int i;
+
+	pud = pud_offset(pgd, 0L);
+	/*
+	 * If PUD is folded onto PGD, cleanup was already performed
+	 * up in the call stack. No more work needs to be done.
+	 */
+	if ((pgd_t *)pud == pgd)
+		return;
+
+	/* Check if there is no valid entry in the PUD */
+	for (i = 0; i < PTRS_PER_PUD; i++, pud++) {
+		if (!pud_none(*pud))
+			return;
+	}
+
+	page = pgd_page(*pgd);
+
+	/*
+	 * This spin lock could only be
+	 * taken in __pte_alloc_kernel() in
+	 * mm/memory.c and nowhere else
+	 * (for arm64). Not sure if the
+	 * function above can be called
+	 * concurrently. In doubt,
+	 * I am leaving it here for now,
+	 * but it probably can be removed.
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pgd_clear(pgd);
+	spin_unlock(&init_mm.page_table_lock);
+
+	/* Make sure addr is aligned with the first address of the PGD entry */
+	addr &= PGDIR_MASK;
+	/*
+	 * Invalidate TLB walk caches to PUD.
+	 *
+	 * Not sure how the TLB walk caches are indexed, i.e., whether
+	 * just by addr & PGDIR_MASK or by any address in the range.
+	 * Flushing the whole PGDIR range to stay on the safe
+	 * side.
+	 */
+	flush_tlb_kernel_range(addr, addr + PGDIR_SIZE);
+
+	free_pagetable(page, 0, linear_map);
+}
+
+static void mark_n_free_pte_vmemmap(pte_t *pte,
+		unsigned long addr, unsigned long size)
+{
+	unsigned long page_offset =  (addr & (~PAGE_MASK));
+	phys_addr_t page_start = pte_val(*pte) & PHYS_MASK & (s32)PAGE_MASK;
+	phys_addr_t pa_start = page_start + page_offset;
+
+	memblock_mark_unused_vmemmap(pa_start, size);
+
+	if (memblock_is_vmemmap_unused_range(&memblock.memory,
+				page_start, page_start + PAGE_SIZE)) {
+
+		free_pagetable(pte_page(*pte), 0, false);
+		memblock_clear_unused_vmemmap(page_start, PAGE_SIZE);
+
+		/*
+		 * This spin lock could only be
+		 * taken in __pte_alloc_kernel() in
+		 * mm/memory.c and nowhere else
+		 * (for arm64). Not sure if the
+		 * function above can be called
+		 * concurrently. In doubt,
+		 * I am leaving it here for now,
+		 * but it probably can be removed.
+		 */
+		spin_lock(&init_mm.page_table_lock);
+		pte_clear(&init_mm, addr, pte);
+		spin_unlock(&init_mm.page_table_lock);
+
+		flush_tlb_kernel_range(addr & PAGE_MASK,
+				(addr + PAGE_SIZE) & PAGE_MASK);
+	}
+}
+
+static void mark_n_free_pmd_vmemmap(pmd_t *pmd,
+		unsigned long addr, unsigned long size)
+{
+	unsigned long sec_offset =  (addr & (~PMD_MASK));
+	phys_addr_t page_start = pmd_page_paddr(*pmd);
+	phys_addr_t pa_start = page_start + sec_offset;
+
+	memblock_mark_unused_vmemmap(pa_start, size);
+
+	if (memblock_is_vmemmap_unused_range(&memblock.memory,
+				page_start, page_start + PMD_SIZE)) {
+
+		free_pagetable(pmd_page(*pmd),
+				get_order(PMD_SIZE), false);
+
+		memblock_clear_unused_vmemmap(page_start, PMD_SIZE);
+		/*
+		 * This spin lock could only be
+		 * taken in __pte_alloc_kernel() in
+		 * mm/memory.c and nowhere else
+		 * (for arm64). Not sure if the
+		 * function above can be called
+		 * concurrently. In doubt,
+		 * I am leaving it here for now,
+		 * but it probably can be removed.
+		 */
+		spin_lock(&init_mm.page_table_lock);
+		pmd_clear(pmd);
+		spin_unlock(&init_mm.page_table_lock);
+
+		flush_tlb_kernel_range(addr & PMD_MASK,
+				(addr + PMD_SIZE) & PMD_MASK);
+	}
+}
+
+static void rm_pte_mapping(pte_t *pte, unsigned long addr,
+		unsigned long next, bool linear_map)
+{
+	/*
+	 * Linear map pages were already freed when offlining.
+	 * We only need to free vmemmap pages.
+	 */
+	if (!linear_map)
+		free_pagetable(pte_page(*pte), 0, false);
+
+	/*
+	 * This spin lock could only be
+	 * taken in __pte_alloc_kernel() in
+	 * mm/memory.c and nowhere else
+	 * (for arm64). Not sure if the
+	 * function above can be called
+	 * concurrently. In doubt,
+	 * I am leaving it here for now,
+	 * but it probably can be removed.
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pte_clear(&init_mm, addr, pte);
+	spin_unlock(&init_mm.page_table_lock);
+
+	flush_tlb_kernel_range(addr, next);
+}
+
+static void rm_pmd_mapping(pmd_t *pmd, unsigned long addr,
+		unsigned long next, bool linear_map)
+{
+	/* Freeing vmemmap pages */
+	if (!linear_map)
+		free_pagetable(pmd_page(*pmd),
+				get_order(PMD_SIZE), false);
+	/*
+	 * This spin lock could only be
+	 * taken in __pte_alloc_kernel() in
+	 * mm/memory.c and nowhere else
+	 * (for arm64). Not sure if the
+	 * function above can be called
+	 * concurrently. In doubt,
+	 * I am leaving it here for now,
+	 * but it probably can be removed.
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pmd_clear(pmd);
+	spin_unlock(&init_mm.page_table_lock);
+
+	flush_tlb_kernel_range(addr, next);
+}
+
+static void rm_pud_mapping(pud_t *pud, unsigned long addr,
+		unsigned long next, bool linear_map)
+{
+	/* We never map vmemmap space on PUDs */
+	BUG_ON(!linear_map);
+	/*
+	 * This spin lock could only be
+	 * taken in __pte_alloc_kernel() in
+	 * mm/memory.c and nowhere else
+	 * (for arm64). Not sure if the
+	 * function above can be called
+	 * concurrently. In doubt,
+	 * I am leaving it here for now,
+	 * but it probably can be removed.
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pud_clear(pud);
+	spin_unlock(&init_mm.page_table_lock);
+
+	flush_tlb_kernel_range(addr, next);
+}
 
 /*
- * Check whether a kernel address is valid (derived from arch/x86/).
+ * Used in hot-remove, cleans up PTE entries from addr to end from the pointed
+ * pte table. If linear_map is true, this is called to remove the tables
+ * for the memory being hot-removed. If false, this is called to clean up the
+ * tables (and the memory) that were used for the vmemmap of memory being
+ * hot-removed.
  */
-int kern_addr_valid(unsigned long addr)
+static void remove_pte_table(pte_t *pte, unsigned long addr,
+	unsigned long end, bool linear_map)
+{
+	unsigned long next;
+
+
+	for (; addr < end; addr = next, pte++) {
+		next = (addr + PAGE_SIZE) & PAGE_MASK;
+		if (next > end)
+			next = end;
+
+		if (!pte_present(*pte))
+			continue;
+
+		if (PAGE_ALIGNED(addr) && PAGE_ALIGNED(next)) {
+			rm_pte_mapping(pte, addr, next, linear_map);
+		} else {
+			unsigned long sz = next - addr;
+			/*
+			 * If we are here, we are freeing vmemmap pages since
+			 * linear_map mapped memory ranges to be freed
+			 * are aligned.
+			 *
+			 * If we are not removing the whole page, it means
+			 * other page structs in this page are being used and
+			 * we cannot remove them. We use memblock to mark these
+			 * unused pieces and we only remove the page when it is
+			 * fully unused.
+			 */
+			mark_n_free_pte_vmemmap(pte, addr, sz);
+		}
+	}
+}
+
+/**
+ * Used in hot-remove, cleans up PMD entries from addr to end from the pointed
+ * pmd table.
+ *
+ * If linear_map is true, this is called to remove the tables for the
+ * memory being hot-removed. If false, this is called to clean up the tables
+ * (and the memory) that were used for the vmemmap of memory being hot-removed.
+ *
+ * If check_split is true, no change is done on the table: the call only
+ * checks whether removing the entries would cause a section mapped PMD
+ * to be split. In such a case, -EBUSY is returned by the method.
+ */
+static int remove_pmd_table(pmd_t *pmd, unsigned long addr,
+	unsigned long end, bool linear_map, bool check_split)
+{
+	int err = 0;
+	unsigned long next;
+	pte_t *pte;
+
+	for (; !err && addr < end; addr = next, pmd++) {
+		next = pmd_addr_end(addr, end);
+
+		if (!pmd_present(*pmd))
+			continue;
+
+		if (pmd_sect(*pmd)) {
+			if (IS_ALIGNED(addr, PMD_SIZE) &&
+					IS_ALIGNED(next, PMD_SIZE)) {
+
+				if (!check_split)
+					rm_pmd_mapping(pmd, addr, next,
+							linear_map);
+
+			} else { /* not aligned to PMD size */
+
+				/*
+				 * This should only occur for vmemmap.
+				 * If it does happen for linear map,
+				 * we do not support splitting PMDs,
+				 * so we return error
+				 */
+				if (linear_map) {
+					pr_warn("Hot-remove failed. Cannot split PMD mapping\n");
+					err = -EBUSY;
+				} else if (!check_split) {
+					unsigned long sz = next - addr;
+					/* Freeing vmemmap pages.*/
+					mark_n_free_pmd_vmemmap(pmd, addr, sz);
+				}
+			}
+		} else { /* ! pmd_sect() */
+
+			BUG_ON(!pmd_table(*pmd));
+			if (!check_split) {
+				pte = pte_offset_map(pmd, addr);
+				remove_pte_table(pte, addr, next, linear_map);
+				free_pte_table(addr, pmd, linear_map);
+			}
+		}
+	}
+
+	return err;
+}
+
+/**
+ * Used in hot-remove, cleans up PUD entries from addr to end from the pointed
+ * pud table.
+ *
+ * If linear_map is true, this is called to remove the tables for the
+ * memory being hot-removed. If false, this is called to clean up the tables
+ * (and the memory) that were used for the vmemmap of memory being hot-removed.
+ *
+ * If check_split is true, no change is done on the table: the call only
+ * checks whether removing the entries would cause a section mapped PUD
+ * to be split. In such a case, -EBUSY is returned by the method.
+ */
+static int remove_pud_table(pud_t *pud, unsigned long addr,
+	unsigned long end, bool linear_map, bool check_split)
+{
+	int err = 0;
+	unsigned long next;
+	pmd_t *pmd;
+
+	for (; !err && addr < end; addr = next, pud++) {
+		next = pud_addr_end(addr, end);
+		if (!pud_present(*pud))
+			continue;
+
+		/*
+		 * If we are using 4K granules, check if we are using
+		 * 1GB section mapping.
+		 */
+		if (pud_sect(*pud)) {
+			if (IS_ALIGNED(addr, PUD_SIZE) &&
+					IS_ALIGNED(next, PUD_SIZE)) {
+
+				if (!check_split)
+					rm_pud_mapping(pud, addr, next,
+							linear_map);
+
+			} else { /* not aligned to PUD size */
+				/*
+				 * As above, we never map vmemmap
+				 * space on PUDs
+				 */
+				BUG_ON(!linear_map);
+				pr_warn("Hot-remove failed. Cannot split PUD mapping\n");
+				err = -EBUSY;
+			}
+		} else { /* !pud_sect() */
+			BUG_ON(!pud_table(*pud));
+
+			pmd = pmd_offset(pud, addr);
+			err = remove_pmd_table(pmd, addr, next,
+					linear_map, check_split);
+			if (!check_split)
+				free_pmd_table(addr, pud, linear_map);
+		}
+	}
+
+	return err;
+}
+
+/**
+ * Used in hot-remove, cleans up kernel page tables from addr to end.
+ *
+ * If linear_map is true, this is called to remove the tables for the
+ * memory being hot-removed. If false, this is called to clean up the tables
+ * (and the memory) that were used for the vmemmap of memory being hot-removed.
+ *
+ * If check_split is true, no change is done on the table: the call only
+ * checks whether removing the entries would cause a section mapped PUD
+ * to be split. In such a case, -EBUSY is returned by the method.
+ */
+int remove_pagetable(unsigned long start, unsigned long end,
+		bool linear_map, bool check_split)
+{
+	int err = 0;
+	unsigned long next;
+	unsigned long addr;
+	pgd_t *pgd;
+	pud_t *pud;
+
+	for (addr = start; addr < end; addr = next) {
+		next = pgd_addr_end(addr, end);
+
+		pgd = pgd_offset_k(addr);
+		if (pgd_none(*pgd))
+			continue;
+
+		pud = pud_offset(pgd, addr);
+		err = remove_pud_table(pud, addr, next,
+				linear_map, check_split);
+		if (err)
+			break;
+
+		if (!check_split)
+			free_pud_table(addr, pgd, linear_map);
+	}
+
+	if (!check_split)
+		flush_tlb_all();
+
+	return err;
+}
+
+
+#endif /* CONFIG_MEMORY_HOTREMOVE */
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
+static unsigned long walk_kern_pgtable(unsigned long addr)
 {
 	pgd_t *pgd;
 	pud_t *pud;
@@ -676,26 +1197,51 @@ int kern_addr_valid(unsigned long addr)
 		return 0;
 
 	if (pud_sect(*pud))
-		return pfn_valid(pud_pfn(*pud));
+		return pud_pfn(*pud);
 
 	pmd = pmd_offset(pud, addr);
 	if (pmd_none(*pmd))
 		return 0;
 
 	if (pmd_sect(*pmd))
-		return pfn_valid(pmd_pfn(*pmd));
+		return pmd_pfn(*pmd);
 
 	pte = pte_offset_kernel(pmd, addr);
 	if (pte_none(*pte))
 		return 0;
 
-	return pfn_valid(pte_pfn(*pte));
+	return pte_pfn(*pte);
+}
+
+/*
+ * Check whether a kernel address is valid (derived from arch/x86/).
+ */
+int kern_addr_valid(unsigned long addr)
+{
+	return pfn_valid(walk_kern_pgtable(addr));
 }
+
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 #if !ARM64_SWAPPER_USES_SECTION_MAPS
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 {
-	return vmemmap_populate_basepages(start, end, node);
+	int err;
+
+	err = vmemmap_populate_basepages(start, end, node);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+	/*
+	 * A bit inefficient (restarting from PGD every time) but saves
+	 * from lots of duplicated code. Also, this is only called
+	 * at hot-add time, which should not be a frequent operation.
+	 */
+	for (; start < end; start += PAGE_SIZE) {
+		unsigned long pfn = walk_kern_pgtable(start);
+		phys_addr_t pa_start = ((phys_addr_t)pfn) << PAGE_SHIFT;
+
+		memblock_clear_unused_vmemmap(pa_start, PAGE_SIZE);
+	}
+#endif
+	return err;
 }
 #else	/* !ARM64_SWAPPER_USES_SECTION_MAPS */
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
@@ -726,8 +1272,15 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 				return -ENOMEM;
 
 			set_pmd(pmd, __pmd(__pa(p) | PROT_SECT_NORMAL));
-		} else
+		} else {
+			unsigned long sec_offset =  (addr & (~PMD_MASK));
+			phys_addr_t pa_start =
+				pmd_page_paddr(*pmd) + sec_offset;
 			vmemmap_verify((pte_t *)pmd, node, addr, next);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+			memblock_clear_unused_vmemmap(pa_start, next - addr);
+#endif
+		}
 	} while (addr = next, addr != end);
 
 	return 0;
@@ -735,6 +1288,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 #endif	/* CONFIG_ARM64_64K_PAGES */
 void vmemmap_free(unsigned long start, unsigned long end)
 {
+#ifdef CONFIG_MEMORY_HOTREMOVE
+	remove_pagetable(start, end, false, false);
+#endif
 }
 #endif	/* CONFIG_SPARSEMEM_VMEMMAP */
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH v2 5/5] mm: memory-hotplug: Add memory hot remove support for arm64
@ 2017-11-23 11:15   ` Andrea Reale
  0 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-11-23 11:15 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-kernel, linux-mm, m.bielski, arunks, mark.rutland,
	scott.branden, will.deacon, qiuxishi, catalin.marinas, mhocko,
	realean2

Implementation of pagetable cleanup routines for arm64 memory hot remove.

How to offline:
 1. Logical Hot remove (offline)
 - # echo offline > /sys/devices/system/memory/memoryXX/state
 2. Physical Hot remove (offline)
 - (if offline is successful)
 - # echo $section_phy_address > /sys/devices/system/memory/remove

Changes v1->v2:
- introduced check on offlining state before hot remove:
  in x86 (and possibly other architectures), offlining of pages and hot
  remove of physical memory happen in a single step, i.e., via an ACPI
  event. In this patchset we are introducing a "remove" sysfs handle
  that triggers the physical hot-remove process after manual offlining.

- new memblock flag used to mark partially unused vmemmap pages, avoiding
  the nasty 0xFD hack used in the prev rev (and in x86 hot remove code):
  the hot remove process needs to take care of freeing vmemmap pages
  and mappings for the memory being removed. Sometimes it might not be
  possible to fully free a vmemmap page (because it is still being used
  for other mappings); in such a case we mark part of that page as unused
  and free it only when it is fully unused. In the previous version, in
  symmetry with the x86 hot remove code, we did this marking by filling
  the unused parts of the page with an arbitrary 0xFD constant. In this
  version, we are using a new memblock flag for the same purpose.

- proper cleaning sequence for p[um]ds,ptes and related TLB management:
  i) clear the page table, ii) flush tlb, iii) free the pagetable page

- Removed macros that changed hot remove behavior based on number
  of pgtable levels. Now this is hidden in the pgtable traversal macros.

- Check on the corner case where P[UM]Ds would have to be split during
  hot remove: now this is forbidden.
  Hot addition and removal is done at SECTION_SIZE_BITS granularity
  (currently 1GB).  The only case when we would have to split a P[UM]D
  is when SECTION_SIZE_BITS is smaller than a P[UM]D mapped area (never
  by default), AND when we are removing some P[UM]D-mapped memory that
  was never hot-added (there since boot).  If the above conditions hold,
  we avoid splitting the P[UM]Ds and, instead, we forbid hot removal.

- Minor fixes and refactoring.

Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
---
 arch/arm64/Kconfig           |   3 +
 arch/arm64/configs/defconfig |   1 +
 arch/arm64/include/asm/mmu.h |   4 +
 arch/arm64/mm/init.c         |  29 +++
 arch/arm64/mm/mmu.c          | 572 ++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 601 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c736bba..c362ddf 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -649,6 +649,9 @@ config ARCH_ENABLE_MEMORY_HOTPLUG
 	def_bool y
     depends on !NUMA
 
+config ARCH_ENABLE_MEMORY_HOTREMOVE
+	def_bool y
+
 # Common NUMA Features
 config NUMA
 	bool "Numa Memory Allocation and Scheduler Support"
diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index 5fc5656..cdac3b8 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -81,6 +81,7 @@ CONFIG_SCHED_MC=y
 CONFIG_NUMA=y
 CONFIG_PREEMPT=y
 CONFIG_MEMORY_HOTPLUG=y
+CONFIG_MEMORY_HOTREMOVE=y
 CONFIG_KSM=y
 CONFIG_TRANSPARENT_HUGEPAGE=y
 CONFIG_CMA=y
diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 2b3fa4d..ca11567 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -42,6 +42,10 @@ extern void *fixmap_remap_fdt(phys_addr_t dt_phys);
 extern void mark_linear_text_alias_ro(void);
 #ifdef CONFIG_MEMORY_HOTPLUG
 extern void hotplug_paging(phys_addr_t start, phys_addr_t size);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+extern int remove_pagetable(unsigned long start,
+	unsigned long end, bool linear_map, bool check_split);
+#endif
 #endif
 
 #endif
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index e96e7d3..406b378 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -808,4 +808,33 @@ int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
 	return ret;
 }
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	unsigned long va_start = (unsigned long) __va(start);
+	unsigned long va_end = (unsigned long)__va(start + size);
+	struct page *page = pfn_to_page(start_pfn);
+	struct zone *zone;
+	int ret = 0;
+
+	/*
+	 * Check if mem can be removed without splitting
+	 * PUD/PMD mappings.
+	 */
+	ret = remove_pagetable(va_start, va_end, true, true);
+	if (!ret) {
+		zone = page_zone(page);
+		ret = __remove_pages(zone, start_pfn, nr_pages);
+		WARN_ON_ONCE(ret);
+
+		/* Actually remove the mapping */
+		remove_pagetable(va_start, va_end, true, false);
+	}
+
+	return ret;
+}
+
+#endif /* CONFIG_MEMORY_HOTREMOVE */
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d93043d..e6f8c91 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -25,6 +25,7 @@
 #include <linux/ioport.h>
 #include <linux/kexec.h>
 #include <linux/libfdt.h>
+#include <linux/memremap.h>
 #include <linux/mman.h>
 #include <linux/nodemask.h>
 #include <linux/memblock.h>
@@ -652,12 +653,532 @@ inline void hotplug_paging(phys_addr_t start, phys_addr_t size)
 
 	stop_machine(__hotplug_paging, &section, NULL);
 }
-#endif /* CONFIG_MEMORY_HOTPLUG */
+#ifdef CONFIG_MEMORY_HOTREMOVE
+
+static void  free_pagetable(struct page *page, int order, bool linear_map)
+{
+	unsigned long magic;
+	unsigned int nr_pages = 1 << order;
+	struct vmem_altmap *altmap = to_vmem_altmap((unsigned long) page);
+
+	if (altmap) {
+		vmem_altmap_free(altmap, nr_pages);
+		return;
+	}
+
+	/* bootmem page has reserved flag */
+	if (PageReserved(page)) {
+		__ClearPageReserved(page);
+
+		magic = (unsigned long)page->lru.next;
+		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
+			while (nr_pages--)
+				put_page_bootmem(page++);
+		} else {
+			while (nr_pages--)
+				free_reserved_page(page++);
+		}
+	} else {
+		/*
+		 * Only linear_map pagetable allocation (those allocated via
+		 * hotplug) call the pgtable_page_ctor; vmemmap pgtable
+		 * allocations don't.
+		 */
+		if (linear_map)
+			pgtable_page_dtor(page);
+
+		free_pages((unsigned long)page_address(page), order);
+	}
+}
+
+static void free_pte_table(unsigned long addr, pmd_t *pmd, bool linear_map)
+{
+	pte_t *pte;
+	struct page *page;
+	int i;
+
+	pte = pte_offset_kernel(pmd, 0L);
+	/* Check that there is no valid entry left in the PTE table */
+	for (i = 0; i < PTRS_PER_PTE; i++, pte++) {
+		if (!pte_none(*pte))
+			return;
+	}
+
+	page = pmd_page(*pmd);
+	/*
+	 * This spin lock could only be taken in __pte_alloc_kernel()
+	 * in mm/memory.c and nowhere else (for arm64). Not sure if
+	 * the function above can be called concurrently. In doubt,
+	 * I am leaving it here for now, but it probably can be removed.
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pmd_clear(pmd);
+	spin_unlock(&init_mm.page_table_lock);
+
+	/* Make sure addr is aligned with the first address of the PMD */
+	addr &= PMD_MASK;
+	/*
+	 * Invalidate TLB walk caches to PTE.
+	 * Not sure how the TLB walk caches are indexed, i.e., whether
+	 * just by addr & PMD_MASK or by any address in the range.
+	 * Flushing the whole PMD range to stay on the safe
+	 * side.
+	 */
+	flush_tlb_kernel_range(addr, addr + PMD_SIZE);
+
+	free_pagetable(page, 0, linear_map);
+}
+
+static void free_pmd_table(unsigned long addr, pud_t *pud, bool linear_map)
+{
+	pmd_t *pmd;
+	struct page *page;
+	int i;
+
+	pmd = pmd_offset(pud, 0L);
+	/*
+	 * If PMD is folded onto PUD, cleanup was already performed
+	 * up in the call stack. No more work needs to be done.
+	 */
+	if ((pud_t *) pmd == pud)
+		return;
+
+	/* Check if there is no valid entry in the PMD */
+	for (i = 0; i < PTRS_PER_PMD; i++, pmd++) {
+		if (!pmd_none(*pmd))
+			return;
+	}
+
+	page = pud_page(*pud);
+	/*
+	 * This spin lock is otherwise taken only in __pte_alloc_kernel()
+	 * in mm/memory.c (for arm64). It is unclear whether that function
+	 * can run concurrently with hot-remove, so the lock is kept here
+	 * for now; it can probably be removed.
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pud_clear(pud);
+	spin_unlock(&init_mm.page_table_lock);
+
+	/* Make sure addr is aligned with the first address of the PUD */
+	addr &= PUD_MASK;
+	/*
+	 * Invalidate TLB walk caches to PMD
+	 * Not sure what is the index of the TLB walk caches.
+	 * i.e., if it is indexed just by addr & PUD_MASK or it can be
+	 * indexed by any address. Flushing everything to stay on the safe
+	 * side.
+	 */
+	flush_tlb_kernel_range(addr, addr + PUD_SIZE);
+
+	free_pagetable(page, 0, linear_map);
+}
+
+static void free_pud_table(unsigned long addr, pgd_t *pgd, bool linear_map)
+{
+	pud_t *pud;
+	struct page *page;
+	int i;
+
+	pud = pud_offset(pgd, 0L);
+	/*
+	 * If PUD is folded onto PGD, cleanup was already performed
+	 * up in the call stack. No more work needs to be done.
+	 */
+	if ((pgd_t *)pud == pgd)
+		return;
+
+	/* Check if there is no valid entry in the PUD */
+	for (i = 0; i < PTRS_PER_PUD; i++, pud++) {
+		if (!pud_none(*pud))
+			return;
+	}
+
+	page = pgd_page(*pgd);
+
+	/*
+	 * This spin lock is otherwise
+	 * taken only in __pte_alloc_kernel()
+	 * in mm/memory.c (for arm64).
+	 * It is unclear whether that
+	 * function can run concurrently
+	 * with hot-remove, so the lock
+	 * is kept here for now; it can
+	 * probably be removed.
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pgd_clear(pgd);
+	spin_unlock(&init_mm.page_table_lock);
+
+	/* Make sure addr is aligned with the first address of the PGD */
+	addr &= PGDIR_MASK;
+	/*
+	 * Invalidate TLB walk caches to PUD
+	 *
+	 * Not sure what is the index of the TLB walk caches.
+	 * i.e., if it is indexed just by addr & PGDIR_MASK or it can be
+	 * indexed by any address. Flushing everything to stay on the safe
+	 * side
+	 */
+	flush_tlb_kernel_range(addr, addr + PGDIR_SIZE);
+
+	free_pagetable(page, 0, linear_map);
+}
+
+static void mark_n_free_pte_vmemmap(pte_t *pte,
+		unsigned long addr, unsigned long size)
+{
+	unsigned long page_offset =  (addr & (~PAGE_MASK));
+	phys_addr_t page_start = pte_val(*pte) & PHYS_MASK & (s32)PAGE_MASK;
+	phys_addr_t pa_start = page_start + page_offset;
+
+	memblock_mark_unused_vmemmap(pa_start, size);
+
+	if (memblock_is_vmemmap_unused_range(&memblock.memory,
+				page_start, page_start + PAGE_SIZE)) {
+
+		free_pagetable(pte_page(*pte), 0, false);
+		memblock_clear_unused_vmemmap(page_start, PAGE_SIZE);
+
+		/*
+		 * This spin lock is otherwise
+		 * taken only in __pte_alloc_kernel()
+		 * in mm/memory.c (for arm64).
+		 * It is unclear whether that
+		 * function can run concurrently
+		 * with hot-remove, so the lock
+		 * is kept here for now; it can
+		 * probably be removed.
+		 */
+		spin_lock(&init_mm.page_table_lock);
+		pte_clear(&init_mm, addr, pte);
+		spin_unlock(&init_mm.page_table_lock);
+
+		flush_tlb_kernel_range(addr & PAGE_MASK,
+				(addr + PAGE_SIZE) & PAGE_MASK);
+	}
+}
+
+static void mark_n_free_pmd_vmemmap(pmd_t *pmd,
+		unsigned long addr, unsigned long size)
+{
+	unsigned long sec_offset =  (addr & (~PMD_MASK));
+	phys_addr_t page_start = pmd_page_paddr(*pmd);
+	phys_addr_t pa_start = page_start + sec_offset;
+
+	memblock_mark_unused_vmemmap(pa_start, size);
+
+	if (memblock_is_vmemmap_unused_range(&memblock.memory,
+				page_start, page_start + PMD_SIZE)) {
+
+		free_pagetable(pmd_page(*pmd),
+				get_order(PMD_SIZE), false);
+
+		memblock_clear_unused_vmemmap(page_start, PMD_SIZE);
+		/*
+		 * This spin lock is otherwise
+		 * taken only in __pte_alloc_kernel()
+		 * in mm/memory.c (for arm64).
+		 * It is unclear whether that
+		 * function can run concurrently
+		 * with hot-remove, so the lock
+		 * is kept here for now; it can
+		 * probably be removed.
+		 */
+		spin_lock(&init_mm.page_table_lock);
+		pmd_clear(pmd);
+		spin_unlock(&init_mm.page_table_lock);
+
+		flush_tlb_kernel_range(addr & PMD_MASK,
+				(addr + PMD_SIZE) & PMD_MASK);
+	}
+}
+
+static void rm_pte_mapping(pte_t *pte, unsigned long addr,
+		unsigned long next, bool linear_map)
+{
+	/*
+	 * Linear map pages were already freed when offlining.
+	 * We only need to free vmemmap pages.
+	 */
+	if (!linear_map)
+		free_pagetable(pte_page(*pte), 0, false);
+
+	/*
+	 * This spin lock is otherwise
+	 * taken only in __pte_alloc_kernel()
+	 * in mm/memory.c (for arm64).
+	 * It is unclear whether that
+	 * function can run concurrently
+	 * with hot-remove, so the lock
+	 * is kept here for now; it can
+	 * probably be removed.
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pte_clear(&init_mm, addr, pte);
+	spin_unlock(&init_mm.page_table_lock);
+
+	flush_tlb_kernel_range(addr, next);
+}
+
+static void rm_pmd_mapping(pmd_t *pmd, unsigned long addr,
+		unsigned long next, bool linear_map)
+{
+	/* Freeing vmemmap pages */
+	if (!linear_map)
+		free_pagetable(pmd_page(*pmd),
+				get_order(PMD_SIZE), false);
+	/*
+	 * This spin lock is otherwise
+	 * taken only in __pte_alloc_kernel()
+	 * in mm/memory.c (for arm64).
+	 * It is unclear whether that
+	 * function can run concurrently
+	 * with hot-remove, so the lock
+	 * is kept here for now; it can
+	 * probably be removed.
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pmd_clear(pmd);
+	spin_unlock(&init_mm.page_table_lock);
+
+	flush_tlb_kernel_range(addr, next);
+}
+
+static void rm_pud_mapping(pud_t *pud, unsigned long addr,
+		unsigned long next, bool linear_map)
+{
+	/* We never map vmemmap space on PUDs */
+	BUG_ON(!linear_map);
+	/*
+	 * This spin lock is otherwise
+	 * taken only in __pte_alloc_kernel()
+	 * in mm/memory.c (for arm64).
+	 * It is unclear whether that
+	 * function can run concurrently
+	 * with hot-remove, so the lock
+	 * is kept here for now; it can
+	 * probably be removed.
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pud_clear(pud);
+	spin_unlock(&init_mm.page_table_lock);
+
+	flush_tlb_kernel_range(addr, next);
+}
 
 /*
- * Check whether a kernel address is valid (derived from arch/x86/).
+ * Used in hot-remove, cleans up PTE entries from addr to end from the pointed
+ * pte table. If linear_map is true, this is called to remove the tables
+ * for the memory being hot-removed. If false, this is called to clean-up the
+ * tables (and the memory) that were used for the vmemmap of memory being
+ * hot-removed.
  */
-int kern_addr_valid(unsigned long addr)
+static void remove_pte_table(pte_t *pte, unsigned long addr,
+	unsigned long end, bool linear_map)
+{
+	unsigned long next;
+
+
+	for (; addr < end; addr = next, pte++) {
+		next = (addr + PAGE_SIZE) & PAGE_MASK;
+		if (next > end)
+			next = end;
+
+		if (!pte_present(*pte))
+			continue;
+
+		if (PAGE_ALIGNED(addr) && PAGE_ALIGNED(next)) {
+			rm_pte_mapping(pte, addr, next, linear_map);
+		} else {
+			unsigned long sz = next - addr;
+			/*
+			 * If we are here, we are freeing vmemmap pages, since
+			 * the linear_map memory ranges to be freed are always
+			 * page aligned.
+			 *
+			 * If we are not removing the whole page, other page
+			 * structs in this page are still in use and we cannot
+			 * remove them. We use memblock to mark the unused
+			 * pieces and we free the page only once it is fully
+			 * unused.
+			 */
+			mark_n_free_pte_vmemmap(pte, addr, sz);
+		}
+	}
+}
+
+/**
+ * Used in hot-remove, cleans up PMD entries from addr to end from the pointed
+ * pmd table.
+ *
+ * If linear_map is true, this is called to remove the tables for the
+ * memory being hot-removed. If false, this is called to clean-up the tables
+ * (and the memory) that were used for the vmemmap of memory being hot-removed.
+ *
+ * If check_split is true, no change is done on the table: the call only
+ * checks whether removing the entries would cause a section mapped PMD
+ * to be split. In such a case, -EBUSY is returned by the method.
+ */
+static int remove_pmd_table(pmd_t *pmd, unsigned long addr,
+	unsigned long end, bool linear_map, bool check_split)
+{
+	int err = 0;
+	unsigned long next;
+	pte_t *pte;
+
+	for (; !err && addr < end; addr = next, pmd++) {
+		next = pmd_addr_end(addr, end);
+
+		if (!pmd_present(*pmd))
+			continue;
+
+		if (pmd_sect(*pmd)) {
+			if (IS_ALIGNED(addr, PMD_SIZE) &&
+					IS_ALIGNED(next, PMD_SIZE)) {
+
+				if (!check_split)
+					rm_pmd_mapping(pmd, addr, next,
+							linear_map);
+
+			} else { /* not aligned to PMD size */
+
+				/*
+				 * This should only occur for vmemmap.
+				 * If it does happen for linear map,
+				 * we do not support splitting PMDs,
+				 * so we return error
+				 */
+				if (linear_map) {
+					pr_warn("Hot-remove failed. Cannot split PMD mapping\n");
+					err = -EBUSY;
+				} else if (!check_split) {
+					unsigned long sz = next - addr;
+					/* Freeing vmemmap pages.*/
+					mark_n_free_pmd_vmemmap(pmd, addr, sz);
+				}
+			}
+		} else { /* ! pmd_sect() */
+
+			BUG_ON(!pmd_table(*pmd));
+			if (!check_split) {
+				pte = pte_offset_map(pmd, addr);
+				remove_pte_table(pte, addr, next, linear_map);
+				free_pte_table(addr, pmd, linear_map);
+			}
+		}
+	}
+
+	return err;
+}
+
+/**
+ * Used in hot-remove, cleans up PUD entries from addr to end from the pointed
+ * pud table.
+ *
+ * If linear_map is true, this is called to remove the tables for the
+ * memory being hot-removed. If false, this is called to clean-up the tables
+ * (and the memory) that were used for the vmemmap of memory being hot-removed.
+ *
+ * If check_split is true, no change is done on the table: the call only
+ * checks whether removing the entries would cause a section mapped PUD
+ * to be split. In such a case, -EBUSY is returned by the method.
+ */
+static int remove_pud_table(pud_t *pud, unsigned long addr,
+	unsigned long end, bool linear_map, bool check_split)
+{
+	int err = 0;
+	unsigned long next;
+	pmd_t *pmd;
+
+	for (; !err && addr < end; addr = next, pud++) {
+		next = pud_addr_end(addr, end);
+		if (!pud_present(*pud))
+			continue;
+
+		/*
+		 * If we are using 4K granules, check if we are using
+		 * 1GB section mapping.
+		 */
+		if (pud_sect(*pud)) {
+			if (IS_ALIGNED(addr, PUD_SIZE) &&
+					IS_ALIGNED(next, PUD_SIZE)) {
+
+				if (!check_split)
+					rm_pud_mapping(pud, addr, next,
+							linear_map);
+
+			} else { /* not aligned to PUD size */
+				/*
+				 * As above, we never map vmemmap
+				 * space on PUDs
+				 */
+				BUG_ON(!linear_map);
+				pr_warn("Hot-remove failed. Cannot split PUD mapping\n");
+				err = -EBUSY;
+			}
+		} else { /* !pud_sect() */
+			BUG_ON(!pud_table(*pud));
+
+			pmd = pmd_offset(pud, addr);
+			err = remove_pmd_table(pmd, addr, next,
+					linear_map, check_split);
+			if (!check_split)
+				free_pmd_table(addr, pud, linear_map);
+		}
+	}
+
+	return err;
+}
+
+/**
+ * Used in hot-remove, cleans up kernel page tables from addr to end.
+ *
+ * If linear_map is true, this is called to remove the tables for the
+ * memory being hot-removed. If false, this is called to clean-up the tables
+ * (and the memory) that were used for the vmemmap of memory being hot-removed.
+ *
+ * If check_split is true, no change is done on the table: the call only
+ * checks whether removing the entries would cause a section mapped PUD
+ * to be split. In such a case, -EBUSY is returned by the method.
+ */
+int remove_pagetable(unsigned long start, unsigned long end,
+		bool linear_map, bool check_split)
+{
+	int err = 0;
+	unsigned long next;
+	unsigned long addr;
+	pgd_t *pgd;
+	pud_t *pud;
+
+	for (addr = start; addr < end; addr = next) {
+		next = pgd_addr_end(addr, end);
+
+		pgd = pgd_offset_k(addr);
+		if (pgd_none(*pgd))
+			continue;
+
+		pud = pud_offset(pgd, addr);
+		err = remove_pud_table(pud, addr, next,
+				linear_map, check_split);
+		if (err)
+			break;
+
+		if (!check_split)
+			free_pud_table(addr, pgd, linear_map);
+	}
+
+	if (!check_split)
+		flush_tlb_all();
+
+	return err;
+}
+
+
+#endif /* CONFIG_MEMORY_HOTREMOVE */
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
+static unsigned long walk_kern_pgtable(unsigned long addr)
 {
 	pgd_t *pgd;
 	pud_t *pud;
@@ -676,26 +1197,51 @@ int kern_addr_valid(unsigned long addr)
 		return 0;
 
 	if (pud_sect(*pud))
-		return pfn_valid(pud_pfn(*pud));
+		return pud_pfn(*pud);
 
 	pmd = pmd_offset(pud, addr);
 	if (pmd_none(*pmd))
 		return 0;
 
 	if (pmd_sect(*pmd))
-		return pfn_valid(pmd_pfn(*pmd));
+		return pmd_pfn(*pmd);
 
 	pte = pte_offset_kernel(pmd, addr);
 	if (pte_none(*pte))
 		return 0;
 
-	return pfn_valid(pte_pfn(*pte));
+	return pte_pfn(*pte);
+}
+
+/*
+ * Check whether a kernel address is valid (derived from arch/x86/).
+ */
+int kern_addr_valid(unsigned long addr)
+{
+	return pfn_valid(walk_kern_pgtable(addr));
 }
+
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 #if !ARM64_SWAPPER_USES_SECTION_MAPS
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 {
-	return vmemmap_populate_basepages(start, end, node);
+	int err;
+
+	err = vmemmap_populate_basepages(start, end, node);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+	/*
+	 * A bit inefficient (restarting from PGD every time) but saves
+	 * from lots of duplicated code. Also, this is only called
+	 * at hot-add time, which should not be a frequent operation.
+	 */
+	for (; start < end; start += PAGE_SIZE) {
+		unsigned long pfn = walk_kern_pgtable(start);
+		phys_addr_t pa_start = ((phys_addr_t)pfn) << PAGE_SHIFT;
+
+		memblock_clear_unused_vmemmap(pa_start, PAGE_SIZE);
+	}
+#endif
+	return err;
 }
 #else	/* !ARM64_SWAPPER_USES_SECTION_MAPS */
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
@@ -726,8 +1272,15 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 				return -ENOMEM;
 
 			set_pmd(pmd, __pmd(__pa(p) | PROT_SECT_NORMAL));
-		} else
+		} else {
+			unsigned long sec_offset =  (addr & (~PMD_MASK));
+			phys_addr_t pa_start =
+				pmd_page_paddr(*pmd) + sec_offset;
 			vmemmap_verify((pte_t *)pmd, node, addr, next);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+			memblock_clear_unused_vmemmap(pa_start, next - addr);
+#endif
+		}
 	} while (addr = next, addr != end);
 
 	return 0;
@@ -735,6 +1288,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 #endif	/* CONFIG_ARM64_64K_PAGES */
 void vmemmap_free(unsigned long start, unsigned long end)
 {
+#ifdef CONFIG_MEMORY_HOTREMOVE
+	remove_pagetable(start, end, false, false);
+#endif
 }
 #endif	/* CONFIG_SPARSEMEM_VMEMMAP */
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH v2 5/5] mm: memory-hotplug: Add memory hot remove support for arm64
@ 2017-11-23 11:15   ` Andrea Reale
  0 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-11-23 11:15 UTC (permalink / raw)
  To: linux-arm-kernel

Implementation of pagetable cleanup routines for arm64 memory hot remove.

How to hot-remove memory:
 1. Logical hot remove (offline)
 - # echo offline > /sys/devices/system/memory/memoryXX/state
 2. Physical hot remove (only if offlining succeeded)
 - # echo $section_phy_address > /sys/devices/system/memory/remove
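
The two sysfs steps above can also be driven from a small userspace
program. The following is only a sketch: the helper names
(mem_state_path, write_str, hot_remove) are made up for illustration,
the block number is assumed to be decimal as in memoryXX, and the
"remove" file is assumed to accept a hexadecimal physical address.

```c
#include <stdio.h>

/* Build the sysfs path of the state file of memory block `block`. */
int mem_state_path(char *buf, size_t len, unsigned int block)
{
	int n = snprintf(buf, len,
			 "/sys/devices/system/memory/memory%u/state", block);

	/* snprintf reports truncation by returning n >= len. */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write a string to a file. Returns 0 on success, -1 on failure. */
int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fputs(val, f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/*
 * Step 1: offline the block. Step 2: only if that succeeded, ask for
 * the physical removal. The address format written to "remove" is an
 * assumption of this sketch.
 */
int hot_remove(unsigned int block, unsigned long long phys_addr)
{
	char path[128];
	char val[32];

	if (mem_state_path(path, sizeof(path), block))
		return -1;
	if (write_str(path, "offline"))
		return -1;

	snprintf(val, sizeof(val), "0x%llx", phys_addr);
	return write_str("/sys/devices/system/memory/remove", val);
}
```

A caller would invoke hot_remove() as root with the block number and
the physical start address of the same section.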

Changes v1->v2:
- introduced check on offlining state before hot remove:
  in x86 (and possibly other architectures), offlining of pages and hot
  remove of physical memory happen in a single step, i.e., via an ACPI
  event. In this patchset we are introducing a "remove" sysfs handle
  that triggers the physical hot-remove process after manual offlining.

- new memblock flag used to mark partially unused vmemmap pages, avoiding
  the nasty 0xFD hack used in the prev rev (and in x86 hot remove code):
  the hot remove process needs to take care of freeing vmemmap pages
  and mappings for the memory being removed. Sometimes it is not
  possible to fully free a vmemmap page (because it is still used for
  other mappings); in such a case we mark part of that page as unused and
  free it only when it is fully unused. In the previous version, in
  symmetry with the x86 hot remove code, we did this marking by filling
  the unused parts of the page with an arbitrary 0xFD constant. In this
  version, we use a new memblock flag for the same purpose.

- proper cleaning sequence for p[um]ds,ptes and related TLB management:
  i) clear the page table, ii) flush tlb, iii) free the pagetable page

- Removed macros that changed hot remove behavior based on number
  of pgtable levels. Now this is hidden in the pgtable traversal macros.

- Check on the corner case where P[UM]Ds would have to be split during
  hot remove: now this is forbidden.
  Hot addition and removal is done at SECTION_SIZE_BITS granularity
  (currently 1GB).  The only case when we would have to split a P[UM]D
  is when SECTION_SIZE_BITS is smaller than a P[UM]D mapped area (never
  by default), AND when we are removing some P[UM]D-mapped memory that
  was never hot-added (there since boot).  If the above conditions hold,
  we avoid splitting the P[UM]Ds and, instead, we forbid hot removal.

- Minor fixes and refactoring.
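
The "no split" rule described in the list above can be illustrated in
userspace C. This is only a sketch, not the kernel code: it assumes the
whole range is covered by block mappings of one size (a power of two),
and the names block_addr_end and check_split are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/*
 * End of the block containing addr, clamped to end. This mirrors the
 * usual p?d_addr_end() idiom; block_size must be a power of two.
 */
uint64_t block_addr_end(uint64_t addr, uint64_t end, uint64_t block_size)
{
	uint64_t next = (addr + block_size) & ~(block_size - 1);

	/* The "- 1" keeps the comparison correct if next wraps to 0. */
	return (next - 1 < end - 1) ? next : end;
}

/*
 * Walk [start, end) one block at a time. A block may only be removed
 * whole, so a step that does not both start and end on a block
 * boundary would need a split, which is refused with -EBUSY.
 */
int check_split(uint64_t start, uint64_t end, uint64_t block_size)
{
	uint64_t mask = block_size - 1;
	uint64_t addr, next;

	for (addr = start; addr < end; addr = next) {
		next = block_addr_end(addr, end, block_size);
		if ((addr & mask) || (next & mask))
			return -EBUSY;
	}
	return 0;
}
```

With 1GB blocks, removing 0x40000000..0x80000000 passes the check,
while removing 0x40000000..0x50000000 is refused.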

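The "free a vmemmap page only once all of it is unused" idea can be
sketched in the same way. Note that this patch records unused ranges
with a memblock flag; the per-page bitmap below is a simplified
stand-in chosen for illustration, and the 4096-byte page and 64-byte
chunk (roughly one struct page) are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE	4096u
#define SKETCH_CHUNK		64u	/* assumed size of one struct page */
#define SKETCH_CHUNKS		(SKETCH_PAGE_SIZE / SKETCH_CHUNK)	/* 64 */

/* One bit per chunk of the vmemmap page; a set bit means "unused". */
struct vmemmap_page_state {
	uint64_t unused;
};

/*
 * Mark [offset, offset + size) of the page as unused and report
 * whether the whole page is now unused, i.e. whether it may be freed.
 * offset and size are assumed to be multiples of SKETCH_CHUNK.
 */
bool mark_unused(struct vmemmap_page_state *st, size_t offset, size_t size)
{
	size_t first = offset / SKETCH_CHUNK;
	size_t last = (offset + size) / SKETCH_CHUNK;	/* exclusive */
	size_t i;

	for (i = first; i < last && i < SKETCH_CHUNKS; i++)
		st->unused |= (uint64_t)1 << i;

	return st->unused == UINT64_MAX;
}
```

Marking the first half of a page returns false (the page is still
partly in use); marking the second half afterwards returns true.
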
Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
---
 arch/arm64/Kconfig           |   3 +
 arch/arm64/configs/defconfig |   1 +
 arch/arm64/include/asm/mmu.h |   4 +
 arch/arm64/mm/init.c         |  29 +++
 arch/arm64/mm/mmu.c          | 572 ++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 601 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c736bba..c362ddf 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -649,6 +649,9 @@ config ARCH_ENABLE_MEMORY_HOTPLUG
 	def_bool y
     depends on !NUMA
 
+config ARCH_ENABLE_MEMORY_HOTREMOVE
+	def_bool y
+
 # Common NUMA Features
 config NUMA
 	bool "Numa Memory Allocation and Scheduler Support"
diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index 5fc5656..cdac3b8 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -81,6 +81,7 @@ CONFIG_SCHED_MC=y
 CONFIG_NUMA=y
 CONFIG_PREEMPT=y
 CONFIG_MEMORY_HOTPLUG=y
+CONFIG_MEMORY_HOTREMOVE=y
 CONFIG_KSM=y
 CONFIG_TRANSPARENT_HUGEPAGE=y
 CONFIG_CMA=y
diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 2b3fa4d..ca11567 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -42,6 +42,10 @@ extern void *fixmap_remap_fdt(phys_addr_t dt_phys);
 extern void mark_linear_text_alias_ro(void);
 #ifdef CONFIG_MEMORY_HOTPLUG
 extern void hotplug_paging(phys_addr_t start, phys_addr_t size);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+extern int remove_pagetable(unsigned long start,
+	unsigned long end, bool linear_map, bool check_split);
+#endif
 #endif
 
 #endif
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index e96e7d3..406b378 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -808,4 +808,33 @@ int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
 	return ret;
 }
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	unsigned long va_start = (unsigned long) __va(start);
+	unsigned long va_end = (unsigned long)__va(start + size);
+	struct page *page = pfn_to_page(start_pfn);
+	struct zone *zone;
+	int ret = 0;
+
+	/*
+	 * Check if mem can be removed without splitting
+	 * PUD/PMD mappings.
+	 */
+	ret = remove_pagetable(va_start, va_end, true, true);
+	if (!ret) {
+		zone = page_zone(page);
+		ret = __remove_pages(zone, start_pfn, nr_pages);
+		WARN_ON_ONCE(ret);
+
+		/* Actually remove the mapping */
+		remove_pagetable(va_start, va_end, true, false);
+	}
+
+	return ret;
+}
+
+#endif /* CONFIG_MEMORY_HOTREMOVE */
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d93043d..e6f8c91 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -25,6 +25,7 @@
 #include <linux/ioport.h>
 #include <linux/kexec.h>
 #include <linux/libfdt.h>
+#include <linux/memremap.h>
 #include <linux/mman.h>
 #include <linux/nodemask.h>
 #include <linux/memblock.h>
@@ -652,12 +653,532 @@ inline void hotplug_paging(phys_addr_t start, phys_addr_t size)
 
 	stop_machine(__hotplug_paging, &section, NULL);
 }
-#endif /* CONFIG_MEMORY_HOTPLUG */
+#ifdef CONFIG_MEMORY_HOTREMOVE
+
+static void  free_pagetable(struct page *page, int order, bool linear_map)
+{
+	unsigned long magic;
+	unsigned int nr_pages = 1 << order;
+	struct vmem_altmap *altmap = to_vmem_altmap((unsigned long) page);
+
+	if (altmap) {
+		vmem_altmap_free(altmap, nr_pages);
+		return;
+	}
+
+	/* bootmem page has reserved flag */
+	if (PageReserved(page)) {
+		__ClearPageReserved(page);
+
+		magic = (unsigned long)page->lru.next;
+		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
+			while (nr_pages--)
+				put_page_bootmem(page++);
+		} else {
+			while (nr_pages--)
+				free_reserved_page(page++);
+		}
+	} else {
+		/*
+		 * Only linear_map pagetable allocation (those allocated via
+		 * hotplug) call the pgtable_page_ctor; vmemmap pgtable
+		 * allocations don't.
+		 */
+		if (linear_map)
+			pgtable_page_dtor(page);
+
+		free_pages((unsigned long)page_address(page), order);
+	}
+}
+
+static void free_pte_table(unsigned long addr, pmd_t *pmd, bool linear_map)
+{
+	pte_t *pte;
+	struct page *page;
+	int i;
+
+	pte =  pte_offset_kernel(pmd, 0L);
+	/* Check that there is no valid entry in the PTE table */
+	for (i = 0; i < PTRS_PER_PTE; i++, pte++) {
+		if (!pte_none(*pte))
+			return;
+	}
+
+	page = pmd_page(*pmd);
+	/*
+	 * This spin lock is otherwise taken only in __pte_alloc_kernel()
+	 * in mm/memory.c (for arm64). It is unclear whether that function
+	 * can run concurrently with hot-remove, so the lock is kept here
+	 * for now; it can probably be removed.
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pmd_clear(pmd);
+	spin_unlock(&init_mm.page_table_lock);
+
+	/* Make sure addr is aligned with first address of the PMD*/
+	addr &= PMD_MASK;
+	/*
+	 * Invalidate TLB walk caches to PTE
+	 * Not sure what is the index of the TLB walk caches.
+	 * i.e., if it is indexed just by addr & PMD_MASK or it can be
+	 * indexed by any address. Flushing everything to stay on the safe
+	 * side.
+	 */
+	flush_tlb_kernel_range(addr, addr + PMD_SIZE);
+
+	free_pagetable(page, 0, linear_map);
+}
+
+static void free_pmd_table(unsigned long addr, pud_t *pud, bool linear_map)
+{
+	pmd_t *pmd;
+	struct page *page;
+	int i;
+
+	pmd = pmd_offset(pud, 0L);
+	/*
+	 * If PMD is folded onto PUD, cleanup was already performed
+	 * up in the call stack. No more work needs to be done.
+	 */
+	if ((pud_t *) pmd == pud)
+		return;
+
+	/* Check if there is no valid entry in the PMD */
+	for (i = 0; i < PTRS_PER_PMD; i++, pmd++) {
+		if (!pmd_none(*pmd))
+			return;
+	}
+
+	page = pud_page(*pud);
+	/*
+	 * This spin lock is otherwise taken only in __pte_alloc_kernel()
+	 * in mm/memory.c (for arm64). It is unclear whether that function
+	 * can run concurrently with hot-remove, so the lock is kept here
+	 * for now; it can probably be removed.
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pud_clear(pud);
+	spin_unlock(&init_mm.page_table_lock);
+
+	/* Make sure addr is aligned with the first address of the PUD */
+	addr &= PUD_MASK;
+	/*
+	 * Invalidate TLB walk caches to PMD
+	 * Not sure what is the index of the TLB walk caches.
+	 * i.e., if it is indexed just by addr & PUD_MASK or it can be
+	 * indexed by any address. Flushing everything to stay on the safe
+	 * side.
+	 */
+	flush_tlb_kernel_range(addr, addr + PUD_SIZE);
+
+	free_pagetable(page, 0, linear_map);
+}
+
+static void free_pud_table(unsigned long addr, pgd_t *pgd, bool linear_map)
+{
+	pud_t *pud;
+	struct page *page;
+	int i;
+
+	pud = pud_offset(pgd, 0L);
+	/*
+	 * If PUD is folded onto PGD, cleanup was already performed
+	 * up in the call stack. No more work needs to be done.
+	 */
+	if ((pgd_t *)pud == pgd)
+		return;
+
+	/* Check if there is no valid entry in the PUD */
+	for (i = 0; i < PTRS_PER_PUD; i++, pud++) {
+		if (!pud_none(*pud))
+			return;
+	}
+
+	page = pgd_page(*pgd);
+
+	/*
+	 * This spin lock is otherwise
+	 * taken only in __pte_alloc_kernel()
+	 * in mm/memory.c (for arm64).
+	 * It is unclear whether that
+	 * function can run concurrently
+	 * with hot-remove, so the lock
+	 * is kept here for now; it can
+	 * probably be removed.
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pgd_clear(pgd);
+	spin_unlock(&init_mm.page_table_lock);
+
+	/* Make sure addr is aligned with the first address of the PGD */
+	addr &= PGDIR_MASK;
+	/*
+	 * Invalidate TLB walk caches to PUD
+	 *
+	 * Not sure what is the index of the TLB walk caches.
+	 * i.e., if it is indexed just by addr & PGDIR_MASK or it can be
+	 * indexed by any address. Flushing everything to stay on the safe
+	 * side
+	 */
+	flush_tlb_kernel_range(addr, addr + PGDIR_SIZE);
+
+	free_pagetable(page, 0, linear_map);
+}
+
+static void mark_n_free_pte_vmemmap(pte_t *pte,
+		unsigned long addr, unsigned long size)
+{
+	unsigned long page_offset =  (addr & (~PAGE_MASK));
+	phys_addr_t page_start = pte_val(*pte) & PHYS_MASK & (s32)PAGE_MASK;
+	phys_addr_t pa_start = page_start + page_offset;
+
+	memblock_mark_unused_vmemmap(pa_start, size);
+
+	if (memblock_is_vmemmap_unused_range(&memblock.memory,
+				page_start, page_start + PAGE_SIZE)) {
+
+		free_pagetable(pte_page(*pte), 0, false);
+		memblock_clear_unused_vmemmap(page_start, PAGE_SIZE);
+
+		/*
+		 * This spin lock is otherwise
+		 * taken only in __pte_alloc_kernel()
+		 * in mm/memory.c (for arm64).
+		 * It is unclear whether that
+		 * function can run concurrently
+		 * with hot-remove, so the lock
+		 * is kept here for now; it can
+		 * probably be removed.
+		 */
+		spin_lock(&init_mm.page_table_lock);
+		pte_clear(&init_mm, addr, pte);
+		spin_unlock(&init_mm.page_table_lock);
+
+		flush_tlb_kernel_range(addr & PAGE_MASK,
+				(addr + PAGE_SIZE) & PAGE_MASK);
+	}
+}
+
+static void mark_n_free_pmd_vmemmap(pmd_t *pmd,
+		unsigned long addr, unsigned long size)
+{
+	unsigned long sec_offset =  (addr & (~PMD_MASK));
+	phys_addr_t page_start = pmd_page_paddr(*pmd);
+	phys_addr_t pa_start = page_start + sec_offset;
+
+	memblock_mark_unused_vmemmap(pa_start, size);
+
+	if (memblock_is_vmemmap_unused_range(&memblock.memory,
+				page_start, page_start + PMD_SIZE)) {
+
+		free_pagetable(pmd_page(*pmd),
+				get_order(PMD_SIZE), false);
+
+		memblock_clear_unused_vmemmap(page_start, PMD_SIZE);
+		/*
+		 * This spin lock is otherwise
+		 * taken only in __pte_alloc_kernel()
+		 * in mm/memory.c (for arm64).
+		 * It is unclear whether that
+		 * function can run concurrently
+		 * with hot-remove, so the lock
+		 * is kept here for now; it can
+		 * probably be removed.
+		 */
+		spin_lock(&init_mm.page_table_lock);
+		pmd_clear(pmd);
+		spin_unlock(&init_mm.page_table_lock);
+
+		flush_tlb_kernel_range(addr & PMD_MASK,
+				(addr + PMD_SIZE) & PMD_MASK);
+	}
+}
+
+static void rm_pte_mapping(pte_t *pte, unsigned long addr,
+		unsigned long next, bool linear_map)
+{
+	/*
+	 * Linear map pages were already freed when offlining.
+	 * We only need to free vmemmap pages.
+	 */
+	if (!linear_map)
+		free_pagetable(pte_page(*pte), 0, false);
+
+	/*
+	 * This spin lock is otherwise
+	 * taken only in __pte_alloc_kernel()
+	 * in mm/memory.c (for arm64).
+	 * It is unclear whether that
+	 * function can run concurrently
+	 * with hot-remove, so the lock
+	 * is kept here for now; it can
+	 * probably be removed.
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pte_clear(&init_mm, addr, pte);
+	spin_unlock(&init_mm.page_table_lock);
+
+	flush_tlb_kernel_range(addr, next);
+}
+
+static void rm_pmd_mapping(pmd_t *pmd, unsigned long addr,
+		unsigned long next, bool linear_map)
+{
+	/* Freeing vmemmap pages */
+	if (!linear_map)
+		free_pagetable(pmd_page(*pmd),
+				get_order(PMD_SIZE), false);
+	/*
+	 * This spin lock is otherwise
+	 * taken only in __pte_alloc_kernel()
+	 * in mm/memory.c (for arm64).
+	 * It is unclear whether that
+	 * function can run concurrently
+	 * with hot-remove, so the lock
+	 * is kept here for now; it can
+	 * probably be removed.
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pmd_clear(pmd);
+	spin_unlock(&init_mm.page_table_lock);
+
+	flush_tlb_kernel_range(addr, next);
+}
+
+static void rm_pud_mapping(pud_t *pud, unsigned long addr,
+		unsigned long next, bool linear_map)
+{
+	/* We never map vmemmap space on PUDs */
+	BUG_ON(!linear_map);
+	/*
+	 * This spin lock is otherwise
+	 * taken only in __pte_alloc_kernel()
+	 * in mm/memory.c (for arm64).
+	 * It is unclear whether that
+	 * function can run concurrently
+	 * with hot-remove, so the lock
+	 * is kept here for now; it can
+	 * probably be removed.
+	 */
+	spin_lock(&init_mm.page_table_lock);
+	pud_clear(pud);
+	spin_unlock(&init_mm.page_table_lock);
+
+	flush_tlb_kernel_range(addr, next);
+}
 
 /*
- * Check whether a kernel address is valid (derived from arch/x86/).
+ * Used in hot-remove, cleans up PTE entries from addr to end from the pointed
+ * pte table. If linear_map is true, this is called to remove the tables
+ * for the memory being hot-removed. If false, this is called to clean-up the
+ * tables (and the memory) that were used for the vmemmap of memory being
+ * hot-removed.
  */
-int kern_addr_valid(unsigned long addr)
+static void remove_pte_table(pte_t *pte, unsigned long addr,
+	unsigned long end, bool linear_map)
+{
+	unsigned long next;
+
+
+	for (; addr < end; addr = next, pte++) {
+		next = (addr + PAGE_SIZE) & PAGE_MASK;
+		if (next > end)
+			next = end;
+
+		if (!pte_present(*pte))
+			continue;
+
+		if (PAGE_ALIGNED(addr) && PAGE_ALIGNED(next)) {
+			rm_pte_mapping(pte, addr, next, linear_map);
+		} else {
+			unsigned long sz = next - addr;
+			/*
+			 * If we are here, we are freeing vmemmap pages, since
+			 * the linear_map memory ranges to be freed are always
+			 * page aligned.
+			 *
+			 * If we are not removing the whole page, other page
+			 * structs in this page are still in use and we cannot
+			 * remove them. We use memblock to mark the unused
+			 * pieces and we free the page only once it is fully
+			 * unused.
+			 */
+			mark_n_free_pte_vmemmap(pte, addr, sz);
+		}
+	}
+}
+
+/**
+ * Used in hot-remove, cleans up PMD entries from addr to end from the pointed
+ * pmd table.
+ *
+ * If linear_map is true, this is called to remove the tables for the
+ * memory being hot-removed. If false, this is called to clean-up the tables
+ * (and the memory) that were used for the vmemmap of memory being hot-removed.
+ *
+ * If check_split is true, no change is done on the table: the call only
+ * checks whether removing the entries would cause a section mapped PMD
+ * to be split. In such a case, -EBUSY is returned by the method.
+ */
+static int remove_pmd_table(pmd_t *pmd, unsigned long addr,
+	unsigned long end, bool linear_map, bool check_split)
+{
+	int err = 0;
+	unsigned long next;
+	pte_t *pte;
+
+	for (; !err && addr < end; addr = next, pmd++) {
+		next = pmd_addr_end(addr, end);
+
+		if (!pmd_present(*pmd))
+			continue;
+
+		if (pmd_sect(*pmd)) {
+			if (IS_ALIGNED(addr, PMD_SIZE) &&
+					IS_ALIGNED(next, PMD_SIZE)) {
+
+				if (!check_split)
+					rm_pmd_mapping(pmd, addr, next,
+							linear_map);
+
+			} else { /* not aligned to PMD size */
+
+				/*
+				 * This should only occur for vmemmap.
+				 * If it does happen for the linear map,
+				 * we do not support splitting PMDs,
+				 * so we return an error.
+				 */
+				if (linear_map) {
+					pr_warn("Hot-remove failed. Cannot split PMD mapping\n");
+					err = -EBUSY;
+				} else if (!check_split) {
+					unsigned long sz = next - addr;
+					/* Freeing vmemmap pages.*/
+					mark_n_free_pmd_vmemmap(pmd, addr, sz);
+				}
+			}
+		} else { /* ! pmd_sect() */
+
+			BUG_ON(!pmd_table(*pmd));
+			if (!check_split) {
+				pte = pte_offset_map(pmd, addr);
+				remove_pte_table(pte, addr, next, linear_map);
+				free_pte_table(addr, pmd, linear_map);
+			}
+		}
+	}
+
+	return err;
+}
+
+/**
+ * Used in hot-remove: cleans up PUD entries from addr to end in the pud
+ * table pointed to by pud.
+ *
+ * If linear_map is true, this is called to remove the tables for the
+ * memory being hot-removed. If false, this is called to clean up the tables
+ * (and the memory) that were used for the vmemmap of memory being hot-removed.
+ *
+ * If check_split is true, no change is done on the table: the call only
+ * checks whether removing the entries would cause a section mapped PUD
+ * to be split. In such a case, -EBUSY is returned by the method.
+ */
+static int remove_pud_table(pud_t *pud, unsigned long addr,
+	unsigned long end, bool linear_map, bool check_split)
+{
+	int err = 0;
+	unsigned long next;
+	pmd_t *pmd;
+
+	for (; !err && addr < end; addr = next, pud++) {
+		next = pud_addr_end(addr, end);
+		if (!pud_present(*pud))
+			continue;
+
+		/*
+		 * If we are using 4K granules, check if we are using
+		 * 1GB section mapping.
+		 */
+		if (pud_sect(*pud)) {
+			if (IS_ALIGNED(addr, PUD_SIZE) &&
+					IS_ALIGNED(next, PUD_SIZE)) {
+
+				if (!check_split)
+					rm_pud_mapping(pud, addr, next,
+							linear_map);
+
+			} else { /* not aligned to PUD size */
+				/*
+				 * As above, we never map vmemmap
+				 * space on PUDs
+				 */
+				BUG_ON(!linear_map);
+				pr_warn("Hot-remove failed. Cannot split PUD mapping\n");
+				err = -EBUSY;
+			}
+		} else { /* !pud_sect() */
+			BUG_ON(!pud_table(*pud));
+
+			pmd = pmd_offset(pud, addr);
+			err = remove_pmd_table(pmd, addr, next,
+					linear_map, check_split);
+			if (!check_split)
+				free_pmd_table(addr, pud, linear_map);
+		}
+	}
+
+	return err;
+}
+
+/**
+ * Used in hot-remove, cleans up kernel page tables from addr to end.
+ *
+ * If linear_map is true, this is called to remove the tables for the
+ * memory being hot-removed. If false, this is called to clean up the tables
+ * (and the memory) that were used for the vmemmap of memory being hot-removed.
+ *
+ * If check_split is true, no change is done on the table: the call only
+ * checks whether removing the entries would cause a section mapped P[UM]D
+ * to be split. In such a case, -EBUSY is returned by the method.
+ */
+int remove_pagetable(unsigned long start, unsigned long end,
+		bool linear_map, bool check_split)
+{
+	int err = 0;
+	unsigned long next;
+	unsigned long addr;
+	pgd_t *pgd;
+	pud_t *pud;
+
+	for (addr = start; addr < end; addr = next) {
+		next = pgd_addr_end(addr, end);
+
+		pgd = pgd_offset_k(addr);
+		if (pgd_none(*pgd))
+			continue;
+
+		pud = pud_offset(pgd, addr);
+		err = remove_pud_table(pud, addr, next,
+				linear_map, check_split);
+		if (err)
+			break;
+
+		if (!check_split)
+			free_pud_table(addr, pgd, linear_map);
+	}
+
+	if (!check_split)
+		flush_tlb_all();
+
+	return err;
+}
+
+
+#endif /* CONFIG_MEMORY_HOTREMOVE */
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
+static unsigned long walk_kern_pgtable(unsigned long addr)
 {
 	pgd_t *pgd;
 	pud_t *pud;
@@ -676,26 +1197,51 @@ int kern_addr_valid(unsigned long addr)
 		return 0;
 
 	if (pud_sect(*pud))
-		return pfn_valid(pud_pfn(*pud));
+		return pud_pfn(*pud);
 
 	pmd = pmd_offset(pud, addr);
 	if (pmd_none(*pmd))
 		return 0;
 
 	if (pmd_sect(*pmd))
-		return pfn_valid(pmd_pfn(*pmd));
+		return pmd_pfn(*pmd);
 
 	pte = pte_offset_kernel(pmd, addr);
 	if (pte_none(*pte))
 		return 0;
 
-	return pfn_valid(pte_pfn(*pte));
+	return pte_pfn(*pte);
+}
+
+/*
+ * Check whether a kernel address is valid (derived from arch/x86/).
+ */
+int kern_addr_valid(unsigned long addr)
+{
+	return pfn_valid(walk_kern_pgtable(addr));
 }
+
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 #if !ARM64_SWAPPER_USES_SECTION_MAPS
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 {
-	return vmemmap_populate_basepages(start, end, node);
+	int err;
+
+	err = vmemmap_populate_basepages(start, end, node);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+	/*
+	 * A bit inefficient (restarting from PGD every time) but saves
+	 * from lots of duplicated code. Also, this is only called
+	 * at hot-add time, which should not be a frequent operation.
+	 */
+	for (; start < end; start += PAGE_SIZE) {
+		unsigned long pfn = walk_kern_pgtable(start);
+		phys_addr_t pa_start = ((phys_addr_t)pfn) << PAGE_SHIFT;
+
+		memblock_clear_unused_vmemmap(pa_start, PAGE_SIZE);
+	}
+#endif
+	return err;
 }
 #else	/* !ARM64_SWAPPER_USES_SECTION_MAPS */
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
@@ -726,8 +1272,15 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 				return -ENOMEM;
 
 			set_pmd(pmd, __pmd(__pa(p) | PROT_SECT_NORMAL));
-		} else
+		} else {
+			unsigned long sec_offset =  (addr & (~PMD_MASK));
+			phys_addr_t pa_start =
+				pmd_page_paddr(*pmd) + sec_offset;
 			vmemmap_verify((pte_t *)pmd, node, addr, next);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+			memblock_clear_unused_vmemmap(pa_start, next - addr);
+#endif
+		}
 	} while (addr = next, addr != end);
 
 	return 0;
@@ -735,6 +1288,9 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 #endif	/* CONFIG_ARM64_64K_PAGES */
 void vmemmap_free(unsigned long start, unsigned long end)
 {
+#ifdef CONFIG_MEMORY_HOTREMOVE
+	remove_pagetable(start, end, false, false);
+#endif
 }
 #endif	/* CONFIG_SPARSEMEM_VMEMMAP */
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 0/5] Memory hotplug support for arm64 - complete patchset v2
  2017-11-23 11:13 ` Andrea Reale
  (?)
@ 2017-11-23 16:02   ` Michal Hocko
  -1 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2017-11-23 16:02 UTC (permalink / raw)
  To: Andrea Reale
  Cc: linux-arm-kernel, linux-kernel, linux-mm, m.bielski, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	catalin.marinas, realean2

On Thu 23-11-17 11:13:35, Andrea Reale wrote:
> Hi all,

Hi,

> 
> this is a second round of patches to introduce memory hotplug and
> hotremove support for arm64. It builds on the work previously published at
> [1] and it implements the feedback received in the first round of reviews.
> 
> The patchset applies and has been tested on commit bebc6082da0a ("Linux
> 4.14"). 
> 
> Due to a small regression introduced with commit 8135d8926c08
> ("mm: memory_hotplug: memory hotremove supports thp migration"), you
> will need to appy patch [2] first, until the fix is not upstreamed.
> 
> Comments and feedback are gold.
> 
> [1] https://lkml.org/lkml/2017/4/11/536
> [2] https://lkml.org/lkml/2017/11/20/902

I will try to have a look, but I do not expect to understand any of the
arm64-specific changes, so I will focus on the generic code. It would help
a _lot_ if the cover letter provided some overview of what has been done
from a higher level POV. What are the arch pieces, and what is the
generic code missing? A quick glance over the patches suggests that the
changelogs for specific patches are modest as well. Could you give us
more information please? Reviewing hundreds of lines of code without
context is a pain.
 
> Changes v1->v2:
> - swapper pgtable updated in place on hot add, avoiding unnecessary copy
> - stop_machine used to updated swapper on hot add, avoiding races
> - introduced check on offlining state before hot remove
> - new memblock flag used to mark partially unused vmemmap pages, avoiding
>   the nasty 0xFD hack used in the prev rev (and in x86 hot remove code)
> - proper cleaning sequence for p[um]ds,ptes and related TLB management
> - Removed macros that changed hot remove behavior based on number
>   of pgtable levels. Now this is hidden in the pgtable traversal macros.
> - Check on the corner case where P[UM]Ds would have to be split during
>   hot remove: now this is forbidden.
> - Minor fixes and refactoring.
> 
> Andrea Reale (4):
>   mm: memory_hotplug: Remove assumption on memory state before hotremove
>   mm: memory_hotplug: memblock to track partially removed vmemmap mem
>   mm: memory_hotplug: Add memory hotremove probe device
>   mm: memory-hotplug: Add memory hot remove support for arm64
> 
> Maciej Bielski (1):
>   mm: memory_hotplug: Memory hotplug (add) support for arm64
> 
>  arch/arm64/Kconfig             |  15 +
>  arch/arm64/configs/defconfig   |   2 +
>  arch/arm64/include/asm/mmu.h   |   7 +
>  arch/arm64/mm/init.c           | 116 ++++++++
>  arch/arm64/mm/mmu.c            | 609 ++++++++++++++++++++++++++++++++++++++++-
>  drivers/acpi/acpi_memhotplug.c |   2 +-
>  drivers/base/memory.c          |  34 ++-
>  include/linux/memblock.h       |  12 +
>  include/linux/memory_hotplug.h |   9 +-
>  mm/memblock.c                  |  32 +++
>  mm/memory_hotplug.c            |  13 +-
>  11 files changed, 835 insertions(+), 16 deletions(-)
> 
> -- 
> 2.7.4
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 0/5] Memory hotplug support for arm64 - complete patchset v2
  2017-11-23 16:02   ` Michal Hocko
  (?)
@ 2017-11-23 17:33     ` Andrea Reale
  -1 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-11-23 17:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-arm-kernel, linux-kernel, linux-mm, m.bielski, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	catalin.marinas, realean2

On Thu 23 Nov 2017, 17:02, Michal Hocko wrote:

Hi Michal,

> I will try to have a look but I do not expect to understand any of arm64
> specific changes so I will focus on the generic code but it would help a
> _lot_ if the cover letter provided some overview of what has been done
> from a higher level POV. What are the arch pieces and what is the
> generic code missing. A quick glance over patches suggests that
> changelogs for specific patches are modest as well. Could you give us
> more information please? Reviewing hundreds lines of code without
> context is a pain.

sorry for the lack of details. I will try to provide a better
overview below. Please feel free to ask for more details
where needed.

Overall, the goal of the patchset is to implement arch_add_memory and
arch_remove_memory for arm64, to support the generic memory_hotplug
framework.

Hot add
-------
Not many surprises here. We implement the arch-specific
arch_add_memory, which builds the kernel page tables via hotplug_paging()
and then calls the arch-specific add_pages(). We need the arch-specific
add_pages() to implement a trick that makes the status of the pages being
added satisfy the assumptions made in the generic __add_pages (see the
code comments).

Hot remove
----------
The code is basically a port of x86_64 hot remove, with several relevant
changes that I am highlighting below.

* Architecture specific code:
- We implement arch_remove_memory() which takes care of i) calling
  the generic __remove_pages and ii) tearing down kernel page tables
  (remove_pagetable()).

- We implement the arch-specific vmemmap_free(), which is called by the
  generic code to free the vmemmap for memory being removed.
  vmemmap_free(), in turn, reuses the code of remove_pagetable() to do
  its job.

- remove_pagetable() (called by the two functions above) removes kernel
  page tables and, in the case of vmemmap, also removes the actual
  vmemmap pages. The function never splits P[UM]D-mapped page
  table entries, and fails if such a split would be required.
  To implement this behavior, we call remove_pagetable() in two passes
  from arch_remove_memory(): the first pass does not alter any of the
  pagetable contents, but only checks whether some P[UM]D split would
  occur; if the first pass succeeds, the second pass does the actual
  removal.
  The case where a P[UM]D would be split should be extremely rare, so
  denying the removal should not be a big deal: hot-add and hot-remove
  operate on memory at section granularity, i.e. 2^SECTION_SIZE_BITS
  bytes, with SECTION_SIZE_BITS hardcoded to 30 (1GB) for arm64 at the
  moment, while PMDs and PUDs map 2MB and 1GB worth of 4K pages,
  respectively.
  For a split to occur, someone would first have to decrease
  SECTION_SIZE_BITS and then ask to remove some p[um]d sub-area that
  was mapped at boot as a full p[um]d.
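
To make the two-pass idea concrete, here is a minimal userspace sketch.
It is not code from the patchset: struct block_entry, remove_blocks(),
remove_range() and the SKETCH_* constants are hypothetical names, and
the flat array only stands in for block-mapped P[UM]D entries. It shows
why the check pass runs first: a request that would split an entry is
refused before anything is modified.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Block sizes with 4K pages, as stated above. */
#define SKETCH_PMD_SIZE		(UINT64_C(1) << 21)	/* 2MB */
#define SKETCH_PUD_SIZE		(UINT64_C(1) << 30)	/* 1GB */
/* Hotplug granularity: 2^SECTION_SIZE_BITS bytes, SECTION_SIZE_BITS = 30. */
#define SKETCH_SECTION_SIZE	(UINT64_C(1) << 30)

/* Hypothetical model of one block-mapped (section-mapped) entry. */
struct block_entry {
	uint64_t base;		/* start address covered by the entry */
	uint64_t size;		/* SKETCH_PMD_SIZE or SKETCH_PUD_SIZE */
	bool present;
};

/*
 * Walk the entries overlapping [start, end). With check_split set,
 * nothing is modified: the walk only reports -EBUSY if some entry is
 * covered partially. Otherwise fully covered entries are cleared.
 */
static int remove_blocks(struct block_entry *tbl, int n, uint64_t start,
			 uint64_t end, bool check_split)
{
	for (int i = 0; i < n; i++) {
		uint64_t lo = tbl[i].base, hi = lo + tbl[i].size;

		if (!tbl[i].present || hi <= start || lo >= end)
			continue;	/* not mapped, or no overlap */
		if (lo < start || hi > end)
			return -EBUSY;	/* would need a split */
		if (!check_split)
			tbl[i].present = false;
	}
	return 0;
}

/* Two passes: check first, modify only if the check succeeded. */
static int remove_range(struct block_entry *tbl, int n, uint64_t start,
			uint64_t end)
{
	int err = remove_blocks(tbl, n, start, end, true);

	return err ? err : remove_blocks(tbl, n, start, end, false);
}
```

With two 1GB entries at 0x80000000 and 0xC0000000, asking to remove
[0x80000000, 0xE0000000) fails with -EBUSY and leaves both entries
mapped, whereas removing exactly one section starting at 0x80000000
clears the first entry only.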

* Generic code
- [SYSFS and x86 ACPI changes]. On x86, hot remove is triggered by ACPI,
  which performs memory offlining and removal in one atomic step. To
  enable memory removal in the absence of ACPI, we add a sysfs `remove`
  handle (/sys/devices/system/memory/remove), symmetric to the
  existing memory probe device (which has existed since the beginning of
  time, with commit 3947be1969a9 ("memory hotplug: sysfs and add/remove
  functions")). To hot-remove a section, one would first offline it
  (echo offline > /sys/devices/system/memory/memoryXX/state) and then
  write the physical address of the section being removed to this new
  remove handle.
  Now, the x86 code assumes that offline and remove are done in one
  single atomic step (ACPI, commit 242831eb15a0 ("Memory hotplug / ACPI:
  Simplify memory removal")). In this spirit, the generic code also
  assumed that when someone called memory_hotplug.c:remove_memory,
  that memory would already have been offlined. If that was not the case,
  it would raise a BUG().
  In our case, offlining and removal are done in separate steps,
  so we remove this assumption and fail the removal if the memory
  was not previously offlined. We also consider the possibility that
  arch_remove_memory itself might fail. As explained above, in some rare
  cases, it actually might in our arm64 implementation.
  While this change serves our implementation, I believe it is also
  justified in general: the assumption that offlining and removal happen
  in one atomic step is not obviously valid for all architectures.
- [Memblock changes]. In the x86 hot-remove implementation (commit
  ae9aae9eda2d ("memory-hotplug: common APIs to support page tables
  hot-remove")), when freeing vmemmap, if a vmemmap page is only
  partially cleared and some of its content is still used, then the
  vmemmap page is obviously not freed.
  Instead, the unused part of that page is memset to the seemingly
  arbitrary 0xFD constant. When all the page content is found to be set
  to 0xFD, the page is freed.
  After some good feedback received on v1 of this patchset, we
  decided to get rid of this 0xFD trick for our arm64 port. Instead, we
  added a memblock flag that we use to mark partially unused vmemmap
  areas (like 0xFD was doing before). We then check memblock rather than
  the content of the page to decide whether we can free it or not.
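
To illustrate the difference from the 0xFD approach, here is a minimal
userspace sketch. It is not the memblock code from the patch: it keeps
the "unused" marks in a side structure (a per-page bitmap) instead of
writing a pattern into the page itself. The struct, the function names
and the 64-byte tracking granule are assumptions made for the example;
the patch tracks ranges through a memblock flag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SZ		4096u
#define SLOT_SZ		64u			/* assumed tracking granule */
#define SLOTS_PER_PAGE	(PAGE_SZ / SLOT_SZ)	/* 64, fits one uint64_t */

/* Side-table entry for one vmemmap page: bit i set => slot i is unused. */
struct vmemmap_page_state {
	uint64_t unused_slots;
};

/*
 * Mark [off, off + len) of the page as unused, without touching the
 * page contents. off and len must be multiples of SLOT_SZ and stay
 * within the page. Returns true when the whole page has become unused
 * and can therefore be freed.
 */
static bool mark_unused(struct vmemmap_page_state *st, size_t off, size_t len)
{
	size_t first = off / SLOT_SZ;
	size_t last = (off + len) / SLOT_SZ;	/* exclusive */

	for (size_t i = first; i < last; i++)
		st->unused_slots |= UINT64_C(1) << i;

	return st->unused_slots == UINT64_MAX;
}

/* Clear the marks again, as needed when the range is populated on hot add. */
static void clear_unused(struct vmemmap_page_state *st, size_t off, size_t len)
{
	size_t first = off / SLOT_SZ;
	size_t last = (off + len) / SLOT_SZ;	/* exclusive */

	for (size_t i = first; i < last; i++)
		st->unused_slots &= ~(UINT64_C(1) << i);
}
```

Marking the first half of a page reports "not yet freeable"; marking
the second half as well reports that the page can be freed. Since the
state lives outside the page, no data pattern in the page can be
mistaken for the marker.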
  
I hope this is a better cover letter. 

Best regards,
Andrea


> > Changes v1->v2:
> > - swapper pgtable updated in place on hot add, avoiding unnecessary copy
> > - stop_machine used to updated swapper on hot add, avoiding races
> > - introduced check on offlining state before hot remove
> > - new memblock flag used to mark partially unused vmemmap pages, avoiding
> >   the nasty 0xFD hack used in the prev rev (and in x86 hot remove code)
> > - proper cleaning sequence for p[um]ds,ptes and related TLB management
> > - Removed macros that changed hot remove behavior based on number
> >   of pgtable levels. Now this is hidden in the pgtable traversal macros.
> > - Check on the corner case where P[UM]Ds would have to be split during
> >   hot remove: now this is forbidden.
> > - Minor fixes and refactoring.
> > 
> > Andrea Reale (4):
> >   mm: memory_hotplug: Remove assumption on memory state before hotremove
> >   mm: memory_hotplug: memblock to track partially removed vmemmap mem
> >   mm: memory_hotplug: Add memory hotremove probe device
> >   mm: memory-hotplug: Add memory hot remove support for arm64
> > 
> > Maciej Bielski (1):
> >   mm: memory_hotplug: Memory hotplug (add) support for arm64
> > 
> >  arch/arm64/Kconfig             |  15 +
> >  arch/arm64/configs/defconfig   |   2 +
> >  arch/arm64/include/asm/mmu.h   |   7 +
> >  arch/arm64/mm/init.c           | 116 ++++++++
> >  arch/arm64/mm/mmu.c            | 609 ++++++++++++++++++++++++++++++++++++++++-
> >  drivers/acpi/acpi_memhotplug.c |   2 +-
> >  drivers/base/memory.c          |  34 ++-
> >  include/linux/memblock.h       |  12 +
> >  include/linux/memory_hotplug.h |   9 +-
> >  mm/memblock.c                  |  32 +++
> >  mm/memory_hotplug.c            |  13 +-
> >  11 files changed, 835 insertions(+), 16 deletions(-)
> > 
> > -- 
> > 2.7.4
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 156+ messages in thread

* [PATCH v2 0/5] Memory hotplug support for arm64 - complete patchset v2
@ 2017-11-23 17:33     ` Andrea Reale
  0 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-11-23 17:33 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu 23 Nov 2017, 17:02, Michal Hocko wrote:

Hi Michal,

> I will try to have a look but I do not expect to understand any of arm64
> specific changes so I will focus on the generic code but it would help a
> _lot_ if the cover letter provided some overview of what has been done
> from a higher level POV. What are the arch pieces and what is the
> generic code missing. A quick glance over patches suggests that
> changelogs for specific patches are modest as well. Could you give us
> more information please? Reviewing hundreds lines of code without
> context is a pain.

sorry for the lack of details. I will try to provide a better
overview in the following. Please, feel free to ask for more details
where needed.

Overall, the goal of the patchset is to implement arch_add_memory and
arch_remove_memory for arm64, to support the generic memory_hotplug
framework.

Hot add
-------
No big surprises here. We implement the arch-specific
arch_add_memory(), which builds the kernel page tables via hotplug_paging()
and then calls the arch-specific add_pages(). We need the arch-specific
add_pages() to implement a trick that makes the status of the pages being
added satisfy the assumptions made in the generic __add_pages() (see
the code comments).
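
To make the add_pages() trick a bit more concrete, here is a small
userspace sketch of it. It is only a model under stated assumptions:
struct region, model_pfn_valid(), model_generic_add() and
model_add_pages() are hypothetical stand-ins and not kernel interfaces.
The only behaviour taken from the patch comments is that pfn_valid on
arm64 depends on the nomap flag, and that the generic code expects the
first pfn of the section to be invalid when it starts.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the memblock state for one hot-added range. */
struct region {
	unsigned long start_pfn;
	unsigned long nr_pages;
	bool first_pfn_nomap;		/* models the nomap flag on the first page */
	bool first_page_reserved;	/* models SetPageReserved() on that page */
	bool section_added;
};

/* Model of arm64 pfn_valid(): in range and not flagged nomap. */
bool model_pfn_valid(const struct region *r, unsigned long pfn)
{
	if (pfn < r->start_pfn || pfn >= r->start_pfn + r->nr_pages)
		return false;
	return !(r->first_pfn_nomap && pfn == r->start_pfn);
}

/* Model of the generic step: refuse a section whose first pfn is valid. */
int model_generic_add(struct region *r)
{
	if (model_pfn_valid(r, r->start_pfn))
		return -EEXIST;
	r->section_added = true;
	return 0;
}

/* Model of the arch-specific wrapper around the generic step. */
int model_add_pages(struct region *r)
{
	int ret;

	assert(r->nr_pages > 0);

	/* 1. Hide the first pfn so that the generic check passes. */
	r->first_pfn_nomap = true;
	ret = model_generic_add(r);

	/* 2. Make the first pfn valid again. */
	r->first_pfn_nomap = false;

	/* 3. The generic code skipped the hidden pfn: reserve its page here. */
	r->first_page_reserved = true;

	return ret;
}
```

In this model, calling model_generic_add() directly on a freshly
registered range fails with -EEXIST, while going through
model_add_pages() succeeds and leaves the first pfn valid and reserved.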

Hot remove
----------
The code is basically a port of x86_64 hot remove, with several relevant
changes that I am highlighting below.

* Architecture specific code:
- We implement arch_remove_memory() which takes care of i) calling
  the generic __remove_pages and ii) tearing down kernel page tables
  (remove_pagetable()).

- We implement the arch specific vmemmap_free(), which is called by the
  generic code to free vmemmap for memory being removed. vmemmap_free(),
  in its turn, reuses the code of remove_pagetable() to do its job.

- remove_pagetable() (called by the two functions above) removes kernel
  page tables and, in the case of vmemmap, also removes the actual
  vmemmap pages. The function never splits P[UM]D-mapped page
  table entries, and fails if such a split would be required.
  To implement this behavior, we call remove_pagetable() in two
  passes from arch_remove_memory(): the first pass does not
  alter any of the pagetable contents, but only checks whether a
  P[UM]D split would occur; if the first pass succeeds, the
  second pass does the actual removal.
  The case where a P[UM]D would be split should be extremely
  rare, so denying the removal should not be a big deal:
  hot-add and hot-remove operate on memory at the granularity of
  sections of 1 << SECTION_SIZE_BITS bytes, and SECTION_SIZE_BITS is
  hardcoded to 30 (1GB) for arm64 at the moment, while PMDs and PUDs
  map 2MB and 1GB worth of 4K pages, respectively.
  For a split to occur, someone would first have to decrease
  SECTION_SIZE_BITS and then ask to remove some p[um]d sub-area that
  was mapped at boot as part of a full p[um]d.
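
The two-pass scheme can be sketched in plain userspace C as follows.
This is a simplified model, not the code in the patch: struct
block_map, model_remove_pagetable() and model_arch_remove_memory() are
hypothetical names, the page tables are reduced to a flat array of
block mappings, and the -EINVAL return value is my own choice for the
sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one block mapping (PUD- or PMD-sized). */
struct block_map {
	uint64_t base;
	uint64_t size;
	bool present;
};

/*
 * Walk the blocks overlapping [start, start + size). With dry_run set,
 * nothing is modified: the walk only reports whether some block would
 * have to be split. Otherwise, fully covered blocks are cleared.
 */
int model_remove_pagetable(struct block_map *maps, size_t n,
			   uint64_t start, uint64_t size, bool dry_run)
{
	uint64_t end = start + size;

	for (size_t i = 0; i < n; i++) {
		uint64_t b_start = maps[i].base;
		uint64_t b_end = b_start + maps[i].size;

		if (!maps[i].present || b_end <= start || b_start >= end)
			continue;	/* no overlap with the range */
		if (b_start < start || b_end > end)
			return -EINVAL;	/* partial overlap: split needed */
		if (!dry_run)
			maps[i].present = false;
	}
	return 0;
}

int model_arch_remove_memory(struct block_map *maps, size_t n,
			     uint64_t start, uint64_t size)
{
	int ret;

	assert(size > 0);

	/* Pass 1: check only, so that a failure leaves everything intact. */
	ret = model_remove_pagetable(maps, n, start, size, true);
	if (ret)
		return ret;

	/* Pass 2: the actual removal, on a range already known to be safe. */
	return model_remove_pagetable(maps, n, start, size, false);
}
```

The point of the dry run is that a single pass could clear some blocks
and only then hit one that needs splitting, leaving the mapping half
removed. With 1GB sections and 1GB blocks, as in the model's intended
use, a section-aligned request always covers whole blocks and the
first pass succeeds.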

* Generic code
- [SYSFS and x86 ACPI changes]. In x86, hot remove is triggered by ACPI,
  which performs memory offlining and removal in one atomic step. To
  enable memory removal in the absence of ACPI, we add a sysfs `remove`
  handle (/sys/devices/system/memory/remove), symmetrical to the
  existing memory probe device (which has existed since the beginning of
  time, with commit 3947be1969a9 ("memory hotplug: sysfs and add/remove
  functions")). To hot-remove a section, one would first offline it
  (echo offline > /sys/devices/system/memory/memoryXX/state) and then
  write to this new remove handle, passing the physical address of the
  section being removed.
  Now, the x86 code assumes that offline and remove are done in one
  single atomic step (ACPI; commit 242831eb15a0 ("Memory hotplug / ACPI:
  Simplify memory removal")). In this spirit, the generic code also
  assumed that when someone called memory_hotplug.c:remove_memory,
  that memory had already been offlined. If that was not the case,
  it would raise a BUG().
  In our case, offlining and removal are done in separate steps,
  so we remove this assumption and fail the removal if the memory
  was not previously offlined. We also consider the possibility that
  arch_remove_memory itself might fail. As explained above, in some rare
  cases, it actually might in our arm64 implementation.
  While this change is needed for our implementation, I also believe
  that, more generally, offlining and removal in one atomic step cannot
  be taken for granted on all architectures.
- [Memblock changes]. In the x86 hot-remove implementation (commit
  ae9aae9eda2d ("memory-hotplug: common APIs to support page tables
  hot-remove")), when freeing
  vmemmap, if a vmemmap page is only partially cleared and some of its
  content is still used, then the vmemmap page is obviously not freed.
  Instead, the unused part of that page is memset to the
  seemingly totally arbitrary 0xFD constant. When all the page content
  is found to be set to 0xFD, then the page is freed.
  After some good feedback received on the v1 of this patchset, we
  decided to get rid of this 0xFD trick for our arm64 port. Instead, we
  added a memblock flag that we use to mark partially unused vmemmap
  areas (like 0xFD was doing before). We then check memblock rather than
  the content of the page to decide whether we can free it or not.
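
The difference between the two bookkeeping styles can be illustrated
with a userspace sketch. Please read it as an analogy only: the
names below (struct page_meta, mark_unused_by_content(),
mark_unused_by_flag() and friends) are hypothetical, and the per-page
bitmap merely stands in for "metadata kept outside the page"; it does
not reflect how memblock regions and flags are actually laid out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_PAGE_SIZE	4096u
#define MODEL_POISON	0xFD
#define MODEL_CHUNK	64u	/* 4096 / 64 = 64 chunks, one bit each */

/* x86-style marking: overwrite the unused part with the marker byte. */
void mark_unused_by_content(unsigned char *page, size_t off, size_t len)
{
	assert(off + len <= MODEL_PAGE_SIZE);
	memset(page + off, MODEL_POISON, len);
}

/* x86-style check: the page can go once every byte carries the marker. */
bool page_unused_by_content(const unsigned char *page)
{
	for (size_t i = 0; i < MODEL_PAGE_SIZE; i++)
		if (page[i] != MODEL_POISON)
			return false;
	return true;
}

/* Side metadata: which chunks of the page are unused, kept outside it. */
struct page_meta {
	uint64_t unused_chunks;
};

/* Flag-style marking: record the unused chunks, leave the page alone. */
void mark_unused_by_flag(struct page_meta *m, size_t off, size_t len)
{
	assert(off % MODEL_CHUNK == 0 && len % MODEL_CHUNK == 0);
	assert(off + len <= MODEL_PAGE_SIZE);
	for (size_t c = off / MODEL_CHUNK; c < (off + len) / MODEL_CHUNK; c++)
		m->unused_chunks |= UINT64_C(1) << c;
}

/* Flag-style check: consult the metadata only, never the page content. */
bool page_unused_by_flag(const struct page_meta *m)
{
	return m->unused_chunks == UINT64_MAX;
}
```

Both variants answer the same question ("is the whole page unused
now?"), but the content-based one has to scan the page and relies on
live data never looking like the marker, whereas the flag-based one
only reads its own bookkeeping.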
  
I hope this is a better cover letter. 

Best regards,
Andrea


> > Changes v1->v2:
> > - swapper pgtable updated in place on hot add, avoiding unnecessary copy
> > - stop_machine used to update swapper on hot add, avoiding races
> > - introduced check on offlining state before hot remove
> > - new memblock flag used to mark partially unused vmemmap pages, avoiding
> >   the nasty 0xFD hack used in the prev rev (and in x86 hot remove code)
> > - proper cleaning sequence for p[um]ds,ptes and related TLB management
> > - Removed macros that changed hot remove behavior based on number
> >   of pgtable levels. Now this is hidden in the pgtable traversal macros.
> > - Check on the corner case where P[UM]Ds would have to be split during
> >   hot remove: now this is forbidden.
> > - Minor fixes and refactoring.
> > 
> > Andrea Reale (4):
> >   mm: memory_hotplug: Remove assumption on memory state before hotremove
> >   mm: memory_hotplug: memblock to track partially removed vmemmap mem
> >   mm: memory_hotplug: Add memory hotremove probe device
> >   mm: memory-hotplug: Add memory hot remove support for arm64
> > 
> > Maciej Bielski (1):
> >   mm: memory_hotplug: Memory hotplug (add) support for arm64
> > 
> >  arch/arm64/Kconfig             |  15 +
> >  arch/arm64/configs/defconfig   |   2 +
> >  arch/arm64/include/asm/mmu.h   |   7 +
> >  arch/arm64/mm/init.c           | 116 ++++++++
> >  arch/arm64/mm/mmu.c            | 609 ++++++++++++++++++++++++++++++++++++++++-
> >  drivers/acpi/acpi_memhotplug.c |   2 +-
> >  drivers/base/memory.c          |  34 ++-
> >  include/linux/memblock.h       |  12 +
> >  include/linux/memory_hotplug.h |   9 +-
> >  mm/memblock.c                  |  32 +++
> >  mm/memory_hotplug.c            |  13 +-
> >  11 files changed, 835 insertions(+), 16 deletions(-)
> > 
> > -- 
> > 2.7.4
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 2/5] mm: memory_hotplug: Remove assumption on memory state before hotremove
  2017-11-23 11:14   ` Andrea Reale
  (?)
@ 2017-11-23 22:18     ` Rafael J. Wysocki
  -1 siblings, 0 replies; 156+ messages in thread
From: Rafael J. Wysocki @ 2017-11-23 22:18 UTC (permalink / raw)
  To: Andrea Reale
  Cc: linux-arm-kernel, linux-kernel, linux-mm, m.bielski, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	catalin.marinas, mhocko, realean2

On 11/23/2017 12:14 PM, Andrea Reale wrote:
> Commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal")
> introduced the assumption that, when control
> reaches remove_memory(), the corresponding memory has already been
> offlined. In that case, acpi_memhotplug was making sure that
> the assumption held.
> This assumption, however, is not necessarily true if offlining
> and removal are not done by the same "controller" (for example,
> when first offlining via sysfs).
>
> Remove this assumption from the generic remove_memory() code
> and move it into the specific acpi_memhotplug code. This is
> a dependency for the software-aided arm64 offlining and removal
> process.
>
> Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> Signed-off-by: Maciej Bielski <m.bielski@linux.vnet.ibm.com>

Please resend this with a CC to linux-acpi.

Thanks!

> ---
>   drivers/acpi/acpi_memhotplug.c |  2 +-
>   include/linux/memory_hotplug.h |  9 ++++++---
>   mm/memory_hotplug.c            | 13 +++++++++----
>   3 files changed, 16 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> index 6b0d3ef..b0126a0 100644
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -282,7 +282,7 @@ static void acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
>   			nid = memory_add_physaddr_to_nid(info->start_addr);
>   
>   		acpi_unbind_memory_blocks(info);
> -		remove_memory(nid, info->start_addr, info->length);
> +		BUG_ON(remove_memory(nid, info->start_addr, info->length));
>   		list_del(&info->list);
>   		kfree(info);
>   	}
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 58e110a..1a9c7b2 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -295,7 +295,7 @@ static inline bool movable_node_is_enabled(void)
>   extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
>   extern void try_offline_node(int nid);
>   extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
> -extern void remove_memory(int nid, u64 start, u64 size);
> +extern int remove_memory(int nid, u64 start, u64 size);
>   
>   #else
>   static inline bool is_mem_section_removable(unsigned long pfn,
> @@ -311,7 +311,10 @@ static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>   	return -EINVAL;
>   }
>   
> -static inline void remove_memory(int nid, u64 start, u64 size) {}
> +static inline int remove_memory(int nid, u64 start, u64 size)
> +{
> +	return -EINVAL;
> +}
>   #endif /* CONFIG_MEMORY_HOTREMOVE */
>   
>   extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
> @@ -323,7 +326,7 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
>   		unsigned long nr_pages);
>   extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
>   extern bool is_memblock_offlined(struct memory_block *mem);
> -extern void remove_memory(int nid, u64 start, u64 size);
> +extern int remove_memory(int nid, u64 start, u64 size);
>   extern int sparse_add_one_section(struct pglist_data *pgdat, unsigned long start_pfn);
>   extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
>   		unsigned long map_offset);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index d4b5f29..d5f15af 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1892,7 +1892,7 @@ EXPORT_SYMBOL(try_offline_node);
>    * and online/offline operations before this call, as required by
>    * try_offline_node().
>    */
> -void __ref remove_memory(int nid, u64 start, u64 size)
> +int __ref remove_memory(int nid, u64 start, u64 size)
>   {
>   	int ret;
>   
> @@ -1908,18 +1908,23 @@ void __ref remove_memory(int nid, u64 start, u64 size)
>   	ret = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
>   				check_memblock_offlined_cb);
>   	if (ret)
> -		BUG();
> +		goto end_remove;
> +
> +	ret = arch_remove_memory(start, size);
> +
> +	if (ret)
> +		goto end_remove;
>   
>   	/* remove memmap entry */
>   	firmware_map_remove(start, start + size, "System RAM");
>   	memblock_free(start, size);
>   	memblock_remove(start, size);
>   
> -	arch_remove_memory(start, size);
> -
>   	try_offline_node(nid);
>   
> +end_remove:
>   	mem_hotplug_done();
> +	return ret;
>   }
>   EXPORT_SYMBOL_GPL(remove_memory);
>   #endif /* CONFIG_MEMORY_HOTREMOVE */

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 1/5] mm: memory_hotplug: Memory hotplug (add) support for arm64
  2017-11-23 11:13   ` Maciej Bielski
  (?)
@ 2017-11-24  5:55     ` Arun KS
  -1 siblings, 0 replies; 156+ messages in thread
From: Arun KS @ 2017-11-24  5:55 UTC (permalink / raw)
  To: Maciej Bielski
  Cc: linux-arm-kernel, linux-kernel, linux-mm, ar, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	Catalin Marinas, mhocko, realean2

On Thu, Nov 23, 2017 at 4:43 PM, Maciej Bielski
<m.bielski@virtualopensystems.com> wrote:
> Introduces memory hotplug functionality (hot-add) for arm64.
>
> Changes v1->v2:
> - swapper pgtable updated in place on hot add, avoiding unnecessary copy:
>   all changes are additive and non destructive.
>
> - stop_machine used to update swapper on hot add, avoiding races
>
> - checking if pagealloc is under debug to stay coherent with mem_map
>
> Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
> Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> ---
>  arch/arm64/Kconfig           | 12 ++++++
>  arch/arm64/configs/defconfig |  1 +
>  arch/arm64/include/asm/mmu.h |  3 ++
>  arch/arm64/mm/init.c         | 87 ++++++++++++++++++++++++++++++++++++++++++++
>  arch/arm64/mm/mmu.c          | 39 ++++++++++++++++++++
>  5 files changed, 142 insertions(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 0df64a6..c736bba 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -641,6 +641,14 @@ config HOTPLUG_CPU
>           Say Y here to experiment with turning CPUs off and on.  CPUs
>           can be controlled through /sys/devices/system/cpu.
>
> +config ARCH_HAS_ADD_PAGES
> +       def_bool y
> +       depends on ARCH_ENABLE_MEMORY_HOTPLUG
> +
> +config ARCH_ENABLE_MEMORY_HOTPLUG
> +       def_bool y
> +    depends on !NUMA
> +
>  # Common NUMA Features
>  config NUMA
>         bool "Numa Memory Allocation and Scheduler Support"
> @@ -715,6 +723,10 @@ config ARCH_HAS_CACHE_LINE_SIZE
>
>  source "mm/Kconfig"
>
> +config ARCH_MEMORY_PROBE
> +       def_bool y
> +       depends on MEMORY_HOTPLUG
> +
>  config SECCOMP
>         bool "Enable seccomp to safely compute untrusted bytecode"
>         ---help---
> diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
> index 34480e9..5fc5656 100644
> --- a/arch/arm64/configs/defconfig
> +++ b/arch/arm64/configs/defconfig
> @@ -80,6 +80,7 @@ CONFIG_ARM64_VA_BITS_48=y
>  CONFIG_SCHED_MC=y
>  CONFIG_NUMA=y
>  CONFIG_PREEMPT=y
> +CONFIG_MEMORY_HOTPLUG=y
>  CONFIG_KSM=y
>  CONFIG_TRANSPARENT_HUGEPAGE=y
>  CONFIG_CMA=y
> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> index 0d34bf0..2b3fa4d 100644
> --- a/arch/arm64/include/asm/mmu.h
> +++ b/arch/arm64/include/asm/mmu.h
> @@ -40,5 +40,8 @@ extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
>                                pgprot_t prot, bool page_mappings_only);
>  extern void *fixmap_remap_fdt(phys_addr_t dt_phys);
>  extern void mark_linear_text_alias_ro(void);
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +extern void hotplug_paging(phys_addr_t start, phys_addr_t size);
> +#endif
>
>  #endif
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 5960bef..e96e7d3 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -722,3 +722,90 @@ static int __init register_mem_limit_dumper(void)
>         return 0;
>  }
>  __initcall(register_mem_limit_dumper);
> +
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +int add_pages(int nid, unsigned long start_pfn,
> +               unsigned long nr_pages, bool want_memblock)
> +{
> +       int ret;
> +       u64 start_addr = start_pfn << PAGE_SHIFT;
> +       /*
> +        * Mark the first page in the range as unusable. This is needed
> +        * because __add_section (within __add_pages) wants pfn_valid
> +        * of it to be false, and in arm64 pfn_valid is implemented by
> +        * just checking the nomap flag for existing blocks.
> +        *
> +        * A small trick here is that __add_section() requires only
> +        * phys_start_pfn (that is the first pfn of a section) to be
> +        * invalid. Regardless of whether it was assumed (by the function
> +        * author) that all pfns within a section are either all valid
> +        * or all invalid, it allows to avoid looping twice (once here,
> +        * second when memblock_clear_nomap() is called) through all
> +        * pfns of the section and modify only one pfn. Thanks to that,
> +        * further, in __add_zone() only this very first pfn is skipped
> +        * and corresponding page is not flagged reserved. Therefore it
> +        * is enough to correct this setup only for it.
> +        *
> +        * When arch_add_memory() returns the walk_memory_range() function
> +        * is called and passed with online_memory_block() callback,
> +        * which execution finally reaches the memory_block_action()
> +        * function, where also only the first pfn of a memory block is
> +        * checked to be reserved. Above, it was first pfn of a section,
> +        * here it is a block but
> +        * (drivers/base/memory.c):
> +        *     sections_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
> +        * (include/linux/memory.h):
> +        *     #define MIN_MEMORY_BLOCK_SIZE     (1UL << SECTION_SIZE_BITS)
> +        * so we can consider block and section equivalently
> +        */
> +       memblock_mark_nomap(start_addr, 1<<PAGE_SHIFT);
> +       ret = __add_pages(nid, start_pfn, nr_pages, want_memblock);
> +
> +       /*
> +        * Make the pages usable after they have been added.
> +        * This will make pfn_valid return true
> +        */
> +       memblock_clear_nomap(start_addr, 1<<PAGE_SHIFT);
> +
> +       /*
> +        * This is a hack to avoid having to mix arch specific code
> +        * into arch independent code. SetPageReserved is supposed
> +        * to be called by __add_zone (within __add_section, within
> +        * __add_pages). However, when it is called there, it assumes that
> +        * pfn_valid returns true.  For the way pfn_valid is implemented
> +        * in arm64 (a check on the nomap flag), the only way to make
> +        * this evaluate true inside __add_zone is to clear the nomap
> +        * flags of blocks in architecture independent code.
> +        *
> +        * To avoid this, we set the Reserved flag here after we cleared
> +        * the nomap flag in the line above.
> +        */
> +       SetPageReserved(pfn_to_page(start_pfn));
> +
> +       return ret;
> +}
> +
> +int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
> +{
> +       int ret;
> +       unsigned long start_pfn = start >> PAGE_SHIFT;
> +       unsigned long nr_pages = size >> PAGE_SHIFT;
> +       unsigned long end_pfn = start_pfn + nr_pages;
> +       unsigned long max_sparsemem_pfn = 1UL << (MAX_PHYSMEM_BITS-PAGE_SHIFT);
> +
> +       if (end_pfn > max_sparsemem_pfn) {
> +               pr_err("end_pfn too big");
> +               return -1;
> +       }
> +       hotplug_paging(start, size);
> +
> +       ret = add_pages(nid, start_pfn, nr_pages, want_memblock);
> +
> +       if (ret)
> +               pr_warn("%s: Problem encountered in __add_pages() ret=%d\n",
> +                       __func__, ret);
> +
> +       return ret;
> +}
> +
> +#endif /* CONFIG_MEMORY_HOTPLUG */
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index f1eb15e..d93043d 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -28,6 +28,7 @@
>  #include <linux/mman.h>
>  #include <linux/nodemask.h>
>  #include <linux/memblock.h>
> +#include <linux/stop_machine.h>
>  #include <linux/fs.h>
>  #include <linux/io.h>
>  #include <linux/mm.h>
> @@ -615,6 +616,44 @@ void __init paging_init(void)
>                       SWAPPER_DIR_SIZE - PAGE_SIZE);
>  }
>
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +
> +/*
> + * hotplug_paging() is used by memory hotplug to build new page tables
> + * for hot added memory.
> + */
> +
> +struct mem_range {
> +       phys_addr_t base;
> +       phys_addr_t size;
> +};
> +
> +static int __hotplug_paging(void *data)
> +{
> +       int flags = 0;
> +       struct mem_range *section = data;
> +
> +       if (debug_pagealloc_enabled())
> +               flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> +
> +       __create_pgd_mapping(swapper_pg_dir, section->base,
> +                       __phys_to_virt(section->base), section->size,
> +                       PAGE_KERNEL, pgd_pgtable_alloc, flags);

Hello Andrea,

__hotplug_paging() runs in stop_machine() context, and
cpu stop callbacks must not sleep.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/stop_machine.c?h=v4.14#n479

__create_pgd_mapping() uses pgd_pgtable_alloc(), which calls
__get_free_page(PGALLOC_GFP).
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/mm/mmu.c?h=v4.14#n342

PGALLOC_GFP includes GFP_KERNEL, which in turn includes __GFP_RECLAIM
(and with it __GFP_DIRECT_RECLAIM):

#define PGALLOC_GFP     (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO)
#define GFP_KERNEL      (__GFP_RECLAIM | __GFP_IO | __GFP_FS)

Now, prepare_alloc_pages() called by __alloc_pages_nodemask checks for

might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/page_alloc.c?h=v4.14#n4150

and then BUG()

I was testing on 4.4 kernel, but cross checked with 4.14 as well.
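
To make the flag reasoning above concrete, here is a small userspace
sketch of the mask arithmetic. It is only a model: the SK_* names and
bit values are made up for illustration (they are not the kernel's
___GFP_* values), and SK_GFP_ATOMIC is a simplified stand-in for an
allocation mask that does not allow direct reclaim.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; the kernel's real ___GFP_* values differ. */
#define SK_GFP_IO              0x01u
#define SK_GFP_FS              0x02u
#define SK_GFP_HIGH            0x04u
#define SK_GFP_DIRECT_RECLAIM  0x08u
#define SK_GFP_KSWAPD_RECLAIM  0x10u

/* Same composition as quoted above: the reclaim mask is both reclaim bits. */
#define SK_GFP_RECLAIM (SK_GFP_DIRECT_RECLAIM | SK_GFP_KSWAPD_RECLAIM)
#define SK_GFP_KERNEL  (SK_GFP_RECLAIM | SK_GFP_IO | SK_GFP_FS)
/* Simplified atomic-style mask: may wake kswapd, never reclaims directly. */
#define SK_GFP_ATOMIC  (SK_GFP_HIGH | SK_GFP_KSWAPD_RECLAIM)

/* Compile-time check of the claim: the kernel-style mask allows direct reclaim. */
static_assert((SK_GFP_KERNEL & SK_GFP_DIRECT_RECLAIM) != 0,
              "SK_GFP_KERNEL must include direct reclaim");

/* Models the condition passed to might_sleep_if() in the allocator. */
static bool sk_allocation_may_sleep(unsigned int gfp_mask)
{
	return (gfp_mask & SK_GFP_DIRECT_RECLAIM) != 0;
}

/* The check only complains when a possibly-sleeping allocation is made
 * from a context that must not sleep (here passed in as a plain flag). */
static bool sk_alloc_would_warn(unsigned int gfp_mask, bool atomic_context)
{
	return atomic_context && sk_allocation_may_sleep(gfp_mask);
}
```

With this model, sk_alloc_would_warn(SK_GFP_KERNEL, true) is true, which
is the situation in the stop_machine callback, while a mask without the
direct-reclaim bit would not trip the check. Whether a different mask is
an acceptable fix for the patch is a separate question that this sketch
does not answer.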

Regards,
Arun


> +
> +       return 0;
> +}
> +
> +inline void hotplug_paging(phys_addr_t start, phys_addr_t size)
> +{
> +       struct mem_range section = {
> +               .base = start,
> +               .size = size,
> +       };
> +
> +       stop_machine(__hotplug_paging, &section, NULL);
> +}
> +#endif /* CONFIG_MEMORY_HOTPLUG */
> +
>  /*
>   * Check whether a kernel address is valid (derived from arch/x86/).
>   */
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 1/5] mm: memory_hotplug: Memory hotplug (add) support for arm64
  2017-11-24  5:55     ` Arun KS
  (?)
@ 2017-11-24  9:42       ` Andrea Reale
  -1 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-11-24  9:42 UTC (permalink / raw)
  To: Arun KS
  Cc: Maciej Bielski, linux-arm-kernel, linux-kernel, linux-mm, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	Catalin Marinas, mhocko, realean2

Hi Arun,


On Fri 24 Nov 2017, 11:25, Arun KS wrote:
> On Thu, Nov 23, 2017 at 4:43 PM, Maciej Bielski
> <m.bielski@virtualopensystems.com> wrote:
>> [ ...]
> > Introduces memory hotplug functionality (hot-add) for arm64.
> > @@ -615,6 +616,44 @@ void __init paging_init(void)
> >                       SWAPPER_DIR_SIZE - PAGE_SIZE);
> >  }
> >
> > +#ifdef CONFIG_MEMORY_HOTPLUG
> > +
> > +/*
> > + * hotplug_paging() is used by memory hotplug to build new page tables
> > + * for hot added memory.
> > + */
> > +
> > +struct mem_range {
> > +       phys_addr_t base;
> > +       phys_addr_t size;
> > +};
> > +
> > +static int __hotplug_paging(void *data)
> > +{
> > +       int flags = 0;
> > +       struct mem_range *section = data;
> > +
> > +       if (debug_pagealloc_enabled())
> > +               flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> > +
> > +       __create_pgd_mapping(swapper_pg_dir, section->base,
> > +                       __phys_to_virt(section->base), section->size,
> > +                       PAGE_KERNEL, pgd_pgtable_alloc, flags);
> 
> Hello Andrea,
> 
> __hotplug_paging runs on stop_machine context.
> cpu stop callbacks must not sleep.
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/stop_machine.c?h=v4.14#n479
> 
> __create_pgd_mapping uses pgd_pgtable_alloc. which does
> __get_free_page(PGALLOC_GFP)
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/mm/mmu.c?h=v4.14#n342
> 
> PGALLOC_GFP has GFP_KERNEL which inturn has __GFP_RECLAIM
> 
> #define PGALLOC_GFP     (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO)
> #define GFP_KERNEL      (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
> 
> Now, prepare_alloc_pages() called by __alloc_pages_nodemask checks for
> 
> might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/page_alloc.c?h=v4.14#n4150
> 
> and then BUG()

Well spotted, thanks for reporting the problem. One possible solution
would be to go back to building the updated page tables on a copy of the
pgdir (as was done in v1 of this patchset) and then replacing swapper
atomically with stop_machine.
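
As a rough userspace analogy of that copy-then-replace idea (not the
actual v1 code), the sketch below does all the allocation while building
a private copy and keeps the publish step down to a single atomic
exchange. All sk_* names are hypothetical, the "table" is a plain array
rather than a page table, and the sketch assumes a single writer with no
concurrent readers of the old table.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

#define SK_NR_ENTRIES 8

struct sk_table {
	unsigned long entry[SK_NR_ENTRIES];
};

/* Stand-in for the live table that everyone reads. */
static _Atomic(struct sk_table *) sk_live_table;

/* Build step: may allocate (and, in a kernel, sleep), so it runs
 * outside the critical section. */
static struct sk_table *sk_build_updated_copy(const struct sk_table *old,
					       size_t idx, unsigned long val)
{
	struct sk_table *copy;

	if (idx >= SK_NR_ENTRIES)
		return NULL;
	copy = malloc(sizeof(*copy));
	if (!copy)
		return NULL;
	if (old)
		memcpy(copy, old, sizeof(*copy));
	else
		memset(copy, 0, sizeof(*copy));
	copy->entry[idx] = val;
	return copy;
}

/* Publish step: one atomic exchange, no allocation. Returns the old table. */
static struct sk_table *sk_publish(struct sk_table *new_table)
{
	assert(new_table != NULL);
	return atomic_exchange(&sk_live_table, new_table);
}

/* Single-writer update; freeing the old table right away is only safe
 * because this sketch assumes nobody is still reading it. */
static int sk_add_mapping(size_t idx, unsigned long val)
{
	struct sk_table *old = atomic_load(&sk_live_table);
	struct sk_table *copy = sk_build_updated_copy(old, idx, val);

	if (!copy)
		return -1;
	free(sk_publish(copy));
	return 0;
}
```

The point of the split is that only sk_publish() would need to run in
the non-sleeping context; the cost is the extra copy that v2 was trying
to avoid.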

Actually, I am not sure that stop_machine is strictly needed if we
modify the swapper pgdir live: for example, in the x86_64
kernel_physical_mapping_init(), atomicity is ensured by spin-locking on
init_mm.page_table_lock.
https://elixir.free-electrons.com/linux/v4.14/source/arch/x86/mm/init_64.c#L684
I'll spend some time investigating what else could be working
concurrently on the swapper pgdir.

Any suggestion or pointer is very welcome.

Thanks,
Andrea

> I was testing on 4.4 kernel, but cross checked with 4.14 as well.
> 
> Regards,
> Arun
> 
> 
> > +
> > +       return 0;
> > +}
> > +
> > +inline void hotplug_paging(phys_addr_t start, phys_addr_t size)
> > +{
> > +       struct mem_range section = {
> > +               .base = start,
> > +               .size = size,
> > +       };
> > +
> > +       stop_machine(__hotplug_paging, &section, NULL);
> > +}
> > +#endif /* CONFIG_MEMORY_HOTPLUG */
> > +
> >  /*
> >   * Check whether a kernel address is valid (derived from arch/x86/).
> >   */
> > --
> > 2.7.4
> >
> 

^ permalink raw reply	[flat|nested] 156+ messages in thread

* [PATCH v2 2/5] mm: memory_hotplug: Remove assumption on memory state before hotremove
  2017-11-23 11:13 ` Andrea Reale
                   ` (7 preceding siblings ...)
  (?)
@ 2017-11-24 10:22 ` Andrea Reale
  -1 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-11-24 10:22 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-kernel, linux-mm, m.bielski, arunks, mark.rutland,
	scott.branden, will.deacon, qiuxishi, catalin.marinas, mhocko,
	rafael.j.wysocki, linux-acpi

Resending the patch adding linux-acpi in CC, as suggested by Rafael.
Everyone else: apologies for the noise.

Commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal")
introduced the assumption that, when control reaches remove_memory(),
the corresponding memory has already been offlined. In that case,
acpi_memhotplug made sure that the assumption held.
This assumption, however, is not necessarily true if offlining
and removal are not done by the same "controller" (for example,
when first offlining via sysfs).

Remove this assumption from the generic remove_memory() code
and move it into the specific acpi_memhotplug code. This is
a dependency for the software-aided arm64 offlining and removal
process.

Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
Signed-off-by: Maciej Bielski <m.bielski@linux.vnet.ibm.com>
---
 drivers/acpi/acpi_memhotplug.c |  2 +-
 include/linux/memory_hotplug.h |  9 ++++++---
 mm/memory_hotplug.c            | 13 +++++++++----
 3 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 6b0d3ef..b0126a0 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -282,7 +282,7 @@ static void acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
 			nid = memory_add_physaddr_to_nid(info->start_addr);
 
 		acpi_unbind_memory_blocks(info);
-		remove_memory(nid, info->start_addr, info->length);
+		BUG_ON(remove_memory(nid, info->start_addr, info->length));
 		list_del(&info->list);
 		kfree(info);
 	}
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 58e110a..1a9c7b2 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -295,7 +295,7 @@ static inline bool movable_node_is_enabled(void)
 extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
 extern void try_offline_node(int nid);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
-extern void remove_memory(int nid, u64 start, u64 size);
+extern int remove_memory(int nid, u64 start, u64 size);
 
 #else
 static inline bool is_mem_section_removable(unsigned long pfn,
@@ -311,7 +311,10 @@ static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	return -EINVAL;
 }
 
-static inline void remove_memory(int nid, u64 start, u64 size) {}
+static inline int remove_memory(int nid, u64 start, u64 size)
+{
+	return -EINVAL;
+}
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
 extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
@@ -323,7 +326,7 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 		unsigned long nr_pages);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
-extern void remove_memory(int nid, u64 start, u64 size);
+extern int remove_memory(int nid, u64 start, u64 size);
 extern int sparse_add_one_section(struct pglist_data *pgdat, unsigned long start_pfn);
 extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
 		unsigned long map_offset);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d4b5f29..d5f15af 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1892,7 +1892,7 @@ EXPORT_SYMBOL(try_offline_node);
  * and online/offline operations before this call, as required by
  * try_offline_node().
  */
-void __ref remove_memory(int nid, u64 start, u64 size)
+int __ref remove_memory(int nid, u64 start, u64 size)
 {
 	int ret;
 
@@ -1908,18 +1908,23 @@ void __ref remove_memory(int nid, u64 start, u64 size)
 	ret = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
 				check_memblock_offlined_cb);
 	if (ret)
-		BUG();
+		goto end_remove;
+
+	ret = arch_remove_memory(start, size);
+
+	if (ret)
+		goto end_remove;
 
 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");
 	memblock_free(start, size);
 	memblock_remove(start, size);
 
-	arch_remove_memory(start, size);
-
 	try_offline_node(nid);
 
+end_remove:
 	mem_hotplug_done();
+	return ret;
 }
 EXPORT_SYMBOL_GPL(remove_memory);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
-- 
2.7.4
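
For readers who want the new contract in isolation, here is a small
userspace model (not kernel code) of a remove_memory() that reports
failure instead of calling BUG(), and of a caller that propagates the
error. struct block_model, remove_memory_model() and
remove_store_model() are hypothetical names, and -EBUSY is only an
illustrative error value: the patch itself returns whatever
walk_memory_range() or arch_remove_memory() returned.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for one memory block. */
struct block_model {
	bool offlined;	/* has the block been offlined already? */
	bool present;	/* is the block still registered? */
};

/*
 * Model of the patched behaviour: if the block is not offlined,
 * report an error and leave the state untouched rather than crash.
 */
static int remove_memory_model(struct block_model *blk)
{
	if (!blk->offlined)
		return -EBUSY;	/* illustrative error code */

	blk->present = false;
	return 0;
}

/*
 * Model of a sysfs-style store handler: return the error to the
 * writer on failure, or the number of bytes consumed on success.
 */
static long remove_store_model(struct block_model *blk, long count)
{
	int ret = remove_memory_model(blk);

	return ret ? ret : count;
}
```

A caller that cannot tolerate failure, such as the ACPI path in this
patch, can still wrap the call in BUG_ON(); the generic code no longer
makes that choice for everyone.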


^ permalink raw reply related	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 4/5] mm: memory_hotplug: Add memory hotremove probe device
  2017-11-23 11:14   ` Andrea Reale
  (?)
@ 2017-11-24 10:35     ` zhong jiang
  -1 siblings, 0 replies; 156+ messages in thread
From: zhong jiang @ 2017-11-24 10:35 UTC (permalink / raw)
  To: Andrea Reale
  Cc: linux-arm-kernel, linux-kernel, linux-mm, m.bielski, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	catalin.marinas, mhocko, realean2

Hi Andrea,

I don't see "memory_add_physaddr_to_nid" in arch/arm64.
Am I missing something?

Thanks
zhongjiang

On 2017/11/23 19:14, Andrea Reale wrote:
> Adding a "remove" sysfs handle that can be used to trigger
> memory hotremove manually, exactly symmetrically with
> what happens with the "probe" device for hot-add.
>
> This is useful for architectures that do not rely on
> ACPI for memory hot-remove.
>
> Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
> ---
>  drivers/base/memory.c | 34 +++++++++++++++++++++++++++++++++-
>  1 file changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 1d60b58..8ccb67c 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -530,7 +530,36 @@ memory_probe_store(struct device *dev, struct device_attribute *attr,
>  }
>  
>  static DEVICE_ATTR(probe, S_IWUSR, NULL, memory_probe_store);
> -#endif
> +
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +static ssize_t
> +memory_remove_store(struct device *dev,
> +		struct device_attribute *attr, const char *buf, size_t count)
> +{
> +	u64 phys_addr;
> +	int nid, ret;
> +	unsigned long pages_per_block = PAGES_PER_SECTION * sections_per_block;
> +
> +	ret = kstrtoull(buf, 0, &phys_addr);
> +	if (ret)
> +		return ret;
> +
> +	if (phys_addr & ((pages_per_block << PAGE_SHIFT) - 1))
> +		return -EINVAL;
> +
> +	nid = memory_add_physaddr_to_nid(phys_addr);
> +	ret = lock_device_hotplug_sysfs();
> +	if (ret)
> +		return ret;
> +
> +	remove_memory(nid, phys_addr,
> +			 MIN_MEMORY_BLOCK_SIZE * sections_per_block);
> +	unlock_device_hotplug();
> +	return count;
> +}
> +static DEVICE_ATTR(remove, S_IWUSR, NULL, memory_remove_store);
> +#endif /* CONFIG_MEMORY_HOTREMOVE */
> +#endif /* CONFIG_ARCH_MEMORY_PROBE */
>  
>  #ifdef CONFIG_MEMORY_FAILURE
>  /*
> @@ -790,6 +819,9 @@ bool is_memblock_offlined(struct memory_block *mem)
>  static struct attribute *memory_root_attrs[] = {
>  #ifdef CONFIG_ARCH_MEMORY_PROBE
>  	&dev_attr_probe.attr,
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +	&dev_attr_remove.attr,
> +#endif
>  #endif
>  
>  #ifdef CONFIG_MEMORY_FAILURE

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 4/5] mm: memory_hotplug: Add memory hotremove probe device
  2017-11-24 10:35     ` zhong jiang
  (?)
@ 2017-11-24 10:44       ` Andrea Reale
  -1 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-11-24 10:44 UTC (permalink / raw)
  To: zhong jiang
  Cc: linux-arm-kernel, linux-kernel, linux-mm, m.bielski, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	catalin.marinas, mhocko, realean2

Hi zhongjiang,

On Fri 24 Nov 2017, 18:35, zhong jiang wrote:
> Hi Andrea,
> 
> I don't see "memory_add_physaddr_to_nid" in arch/arm64.
> Am I missing something?

When !CONFIG_NUMA, it is defined in include/linux/memory_hotplug.h to
return 0. In patch 1/5 of this series we require !NUMA to enable
ARCH_ENABLE_MEMORY_HOTPLUG.

The reason for this simplification is that we would not know how to
choose the correct node for the added memory when NUMA is on.
Any suggestion on that matter is welcome.
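
To illustrate, here is a userspace sketch (not kernel code) of the
!NUMA stub and of the block-alignment check that memory_remove_store()
performs before calling it. The model_ names are hypothetical, and the
page shift and pages-per-section values are assumptions picked for the
example (4K pages, 1 GiB sections), not values taken from this series.

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT	12		/* assumed 4K pages */
#define MODEL_PAGES_PER_SECTION	(1UL << 18)	/* assumed 1 GiB sections */

/* With a single node, every physical address maps to node 0. */
static inline int model_physaddr_to_nid(uint64_t start)
{
	(void)start;
	return 0;
}

/*
 * Mirror of the check
 *   phys_addr & ((pages_per_block << PAGE_SHIFT) - 1)
 * Returns 1 if phys_addr starts on a memory block boundary. The mask
 * trick is only valid when the block size is a power of two.
 */
static int model_block_aligned(uint64_t phys_addr,
			       unsigned long sections_per_block)
{
	uint64_t pages_per_block =
		(uint64_t)MODEL_PAGES_PER_SECTION * sections_per_block;
	uint64_t block_bytes = pages_per_block << MODEL_PAGE_SHIFT;

	return (phys_addr & (block_bytes - 1)) == 0;
}
```

With these assumed sizes, 0x80000000 is accepted for one section per
block, while 0x80100000 is rejected with -EINVAL in the real handler.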

Thanks,
Andrea

> Thanks
> zhongjiang
> 
> On 2017/11/23 19:14, Andrea Reale wrote:
> > Adding a "remove" sysfs handle that can be used to trigger
> > memory hotremove manually, exactly symmetrically with
> > what happens with the "probe" device for hot-add.
> >
> > This is useful for architectures that do not rely on
> > ACPI for memory hot-remove.
> >
> > Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> > Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
> > ---
> >  drivers/base/memory.c | 34 +++++++++++++++++++++++++++++++++-
> >  1 file changed, 33 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> > index 1d60b58..8ccb67c 100644
> > --- a/drivers/base/memory.c
> > +++ b/drivers/base/memory.c
> > @@ -530,7 +530,36 @@ memory_probe_store(struct device *dev, struct device_attribute *attr,
> >  }
> >  
> >  static DEVICE_ATTR(probe, S_IWUSR, NULL, memory_probe_store);
> > -#endif
> > +
> > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > +static ssize_t
> > +memory_remove_store(struct device *dev,
> > +		struct device_attribute *attr, const char *buf, size_t count)
> > +{
> > +	u64 phys_addr;
> > +	int nid, ret;
> > +	unsigned long pages_per_block = PAGES_PER_SECTION * sections_per_block;
> > +
> > +	ret = kstrtoull(buf, 0, &phys_addr);
> > +	if (ret)
> > +		return ret;
> > +
> > +	if (phys_addr & ((pages_per_block << PAGE_SHIFT) - 1))
> > +		return -EINVAL;
> > +
> > +	nid = memory_add_physaddr_to_nid(phys_addr);
> > +	ret = lock_device_hotplug_sysfs();
> > +	if (ret)
> > +		return ret;
> > +
> > +	remove_memory(nid, phys_addr,
> > +			 MIN_MEMORY_BLOCK_SIZE * sections_per_block);
> > +	unlock_device_hotplug();
> > +	return count;
> > +}
> > +static DEVICE_ATTR(remove, S_IWUSR, NULL, memory_remove_store);
> > +#endif /* CONFIG_MEMORY_HOTREMOVE */
> > +#endif /* CONFIG_ARCH_MEMORY_PROBE */
> >  
> >  #ifdef CONFIG_MEMORY_FAILURE
> >  /*
> > @@ -790,6 +819,9 @@ bool is_memblock_offlined(struct memory_block *mem)
> >  static struct attribute *memory_root_attrs[] = {
> >  #ifdef CONFIG_ARCH_MEMORY_PROBE
> >  	&dev_attr_probe.attr,
> > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > +	&dev_attr_remove.attr,
> > +#endif
> >  #endif
> >  
> >  #ifdef CONFIG_MEMORY_FAILURE
> 
> 

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 1/5] mm: memory_hotplug: Memory hotplug (add) support for arm64
  2017-11-24  9:42       ` Andrea Reale
  (?)
@ 2017-11-24 10:53         ` Maciej Bielski
  -1 siblings, 0 replies; 156+ messages in thread
From: Maciej Bielski @ 2017-11-24 10:53 UTC (permalink / raw)
  To: Andrea Reale
  Cc: linux-arm-kernel, linux-kernel, linux-mm, ar, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	Catalin Marinas, mhocko, realean2

On Fri, Nov 24, 2017 at 09:42:33AM +0000, Andrea Reale wrote:
> Hi Arun,
>
>
> On Fri 24 Nov 2017, 11:25, Arun KS wrote:
> > On Thu, Nov 23, 2017 at 4:43 PM, Maciej Bielski
> > <m.bielski@virtualopensystems.com> wrote:
> >> [ ...]
> > > Introduces memory hotplug functionality (hot-add) for arm64.
> > > @@ -615,6 +616,44 @@ void __init paging_init(void)
> > >                       SWAPPER_DIR_SIZE - PAGE_SIZE);
> > >  }
> > >
> > > +#ifdef CONFIG_MEMORY_HOTPLUG
> > > +
> > > +/*
> > > + * hotplug_paging() is used by memory hotplug to build new page tables
> > > + * for hot added memory.
> > > + */
> > > +
> > > +struct mem_range {
> > > +       phys_addr_t base;
> > > +       phys_addr_t size;
> > > +};
> > > +
> > > +static int __hotplug_paging(void *data)
> > > +{
> > > +       int flags = 0;
> > > +       struct mem_range *section = data;
> > > +
> > > +       if (debug_pagealloc_enabled())
> > > +               flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> > > +
> > > +       __create_pgd_mapping(swapper_pg_dir, section->base,
> > > +                       __phys_to_virt(section->base), section->size,
> > > +                       PAGE_KERNEL, pgd_pgtable_alloc, flags);
> >
> > Hello Andrea,
> >
> > __hotplug_paging runs on stop_machine context.
> > cpu stop callbacks must not sleep.
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/stop_machine.c?h=v4.14#n479
> >
> > __create_pgd_mapping uses pgd_pgtable_alloc, which does
> > __get_free_page(PGALLOC_GFP)
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/mm/mmu.c?h=v4.14#n342
> >
> > PGALLOC_GFP has GFP_KERNEL, which in turn has __GFP_RECLAIM
> >
> > #define PGALLOC_GFP     (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO)
> > #define GFP_KERNEL      (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
> >
> > Now, prepare_alloc_pages() called by __alloc_pages_nodemask checks for
> >
> > might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/page_alloc.c?h=v4.14#n4150
> >
> > and then BUG()
>
> Well spotted, thanks for reporting the problem. One possible solution
> would be to revert back to building the updated page tables on a copy
> pgdir (as it was done in v1 of this patchset) and then replacing swapper
> atomically with stop_machine.
>
> Actually, I am not sure if stop_machine is strictly needed,
> if we modify the swapper pgdir live: for example, in x86_64
> kernel_physical_mapping_init, atomicity is ensured by spin-locking on
> init_mm.page_table_lock.
> https://elixir.free-electrons.com/linux/v4.14/source/arch/x86/mm/init_64.c#L684
> I'll spend some time investigating whoever else could be working
> concurrently on the swapper pgdir.
>
> Any suggestion or pointer is very welcome.

Hi Andrea, Arun,

An alternative approach could be to implement pgd_pgtable_alloc_nosleep()
and have hotplug_paging() use it. It could then use different flags,
e.g.:

#define PGALLOC_GFP_NORECLAIM	(__GFP_IO | __GFP_FS | __GFP_NOTRACK | __GFP_ZERO)
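
As a sanity check of the idea, here is a small userspace model (not
kernel code) of the flag arithmetic. The M_ bit values are made up for
the example and do not match include/linux/gfp.h; only the composition
of the masks follows the definitions quoted above, plus the fact that
__GFP_RECLAIM is the OR of the direct-reclaim and kswapd-reclaim bits.
would_hit_might_sleep() is a hypothetical helper that mimics the
might_sleep_if() condition.

```c
#include <stdbool.h>

/* Illustrative bit values only; the real ones are in gfp.h. */
#define M_GFP_IO		0x01u
#define M_GFP_FS		0x02u
#define M_GFP_ZERO		0x04u
#define M_GFP_NOTRACK		0x08u
#define M_GFP_DIRECT_RECLAIM	0x10u
#define M_GFP_KSWAPD_RECLAIM	0x20u

#define M_GFP_RECLAIM	(M_GFP_DIRECT_RECLAIM | M_GFP_KSWAPD_RECLAIM)
#define M_GFP_KERNEL	(M_GFP_RECLAIM | M_GFP_IO | M_GFP_FS)
#define M_PGALLOC_GFP	(M_GFP_KERNEL | M_GFP_NOTRACK | M_GFP_ZERO)

/* The proposed mask: same as above, minus the reclaim bits. */
#define M_PGALLOC_GFP_NORECLAIM \
	(M_GFP_IO | M_GFP_FS | M_GFP_NOTRACK | M_GFP_ZERO)

/* Same condition as might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM). */
static bool would_hit_might_sleep(unsigned int gfp_mask)
{
	return (gfp_mask & M_GFP_DIRECT_RECLAIM) != 0;
}
```

In this model M_PGALLOC_GFP trips the check and
M_PGALLOC_GFP_NORECLAIM does not. The trade-off is that an allocation
that cannot reclaim is more likely to fail, so the caller would need to
handle a NULL page-table page.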

Is this approach inefficient in any way?
Do we like the fact that the memory-attaching thread can go to sleep?

BR,

>
> Thanks,
> Andrea
>
> > I was testing on 4.4 kernel, but cross checked with 4.14 as well.
> >
> > Regards,
> > Arun
> >
> >
> > > +
> > > +       return 0;
> > > +}
> > > +
> > > +inline void hotplug_paging(phys_addr_t start, phys_addr_t size)
> > > +{
> > > +       struct mem_range section = {
> > > +               .base = start,
> > > +               .size = size,
> > > +       };
> > > +
> > > +       stop_machine(__hotplug_paging, &section, NULL);
> > > +}
> > > +#endif /* CONFIG_MEMORY_HOTPLUG */
> > > +
> > >  /*
> > >   * Check whether a kernel address is valid (derived from arch/x86/).
> > >   */
> > > --
> > > 2.7.4
> > >
> >
>

--
Maciej Bielski

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 4/5] mm: memory_hotplug: Add memory hotremove probe device
  2017-11-24 10:44       ` Andrea Reale
  (?)
@ 2017-11-24 12:17         ` zhong jiang
  -1 siblings, 0 replies; 156+ messages in thread
From: zhong jiang @ 2017-11-24 12:17 UTC (permalink / raw)
  To: Andrea Reale
  Cc: linux-arm-kernel, linux-kernel, linux-mm, m.bielski, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	catalin.marinas, mhocko, realean2

Hi, Andrea

Most servers will benefit from NUMA, so it would be best to solve the issue
without special restrictions.

At least we can obtain the NUMA information from the dtb. Therefore, the memory
can be onlined correctly.

Thanks
zhongjiang

On 2017/11/24 18:44, Andrea Reale wrote:
> Hi zhongjiang,
>
> On Fri 24 Nov 2017, 18:35, zhong jiang wrote:
>> Hi, Andrea
>>
>> I don't see "memory_add_physaddr_to_nid" in arch/arm64.
>> Am I missing something?
> When !CONFIG_NUMA it is defined in include/linux/memory_hotplug.h as 0.
> In patch 1/5 of this series we require !NUMA to enable
> ARCH_ENABLE_MEMORY_HOTPLUG.
>
> The reason for this simplification is simply that we would not know how
> to decide the correct node to which to add memory when NUMA is on.
> Any suggestion on that matter is welcome. 
>
> Thanks,
> Andrea
>
>> Thanks
>> zhongjiang
>>
>> On 2017/11/23 19:14, Andrea Reale wrote:
>>> Adding a "remove" sysfs handle that can be used to trigger
>>> memory hotremove manually, exactly symmetrically with
>>> what happens with the "probe" device for hot-add.
>>>
>>> This is useful for architectures that do not rely on
>>> ACPI for memory hot-remove.
>>>
>>> Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
>>> Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
>>> ---
>>>  drivers/base/memory.c | 34 +++++++++++++++++++++++++++++++++-
>>>  1 file changed, 33 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>>> index 1d60b58..8ccb67c 100644
>>> --- a/drivers/base/memory.c
>>> +++ b/drivers/base/memory.c
>>> @@ -530,7 +530,36 @@ memory_probe_store(struct device *dev, struct device_attribute *attr,
>>>  }
>>>  
>>>  static DEVICE_ATTR(probe, S_IWUSR, NULL, memory_probe_store);
>>> -#endif
>>> +
>>> +#ifdef CONFIG_MEMORY_HOTREMOVE
>>> +static ssize_t
>>> +memory_remove_store(struct device *dev,
>>> +		struct device_attribute *attr, const char *buf, size_t count)
>>> +{
>>> +	u64 phys_addr;
>>> +	int nid, ret;
>>> +	unsigned long pages_per_block = PAGES_PER_SECTION * sections_per_block;
>>> +
>>> +	ret = kstrtoull(buf, 0, &phys_addr);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	if (phys_addr & ((pages_per_block << PAGE_SHIFT) - 1))
>>> +		return -EINVAL;
>>> +
>>> +	nid = memory_add_physaddr_to_nid(phys_addr);
>>> +	ret = lock_device_hotplug_sysfs();
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	remove_memory(nid, phys_addr,
>>> +			 MIN_MEMORY_BLOCK_SIZE * sections_per_block);
>>> +	unlock_device_hotplug();
>>> +	return count;
>>> +}
>>> +static DEVICE_ATTR(remove, S_IWUSR, NULL, memory_remove_store);
>>> +#endif /* CONFIG_MEMORY_HOTREMOVE */
>>> +#endif /* CONFIG_ARCH_MEMORY_PROBE */
>>>  
>>>  #ifdef CONFIG_MEMORY_FAILURE
>>>  /*
>>> @@ -790,6 +819,9 @@ bool is_memblock_offlined(struct memory_block *mem)
>>>  static struct attribute *memory_root_attrs[] = {
>>>  #ifdef CONFIG_ARCH_MEMORY_PROBE
>>>  	&dev_attr_probe.attr,
>>> +#ifdef CONFIG_MEMORY_HOTREMOVE
>>> +	&dev_attr_remove.attr,
>>> +#endif
>>>  #endif
>>>  
>>>  #ifdef CONFIG_MEMORY_FAILURE
>>
>
> .
>

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 4/5] mm: memory_hotplug: Add memory hotremove probe device
  2017-11-24 12:17         ` zhong jiang
  (?)
@ 2017-11-24 14:29           ` Andrea Reale
  -1 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-11-24 14:29 UTC (permalink / raw)
  To: zhong jiang
  Cc: linux-arm-kernel, linux-kernel, linux-mm, m.bielski, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	catalin.marinas, mhocko, realean2

Hi zhongjiang,

On Fri 24 Nov 2017, 20:17, zhong jiang wrote:
> Hi, Andrea
> 
> Most servers will benefit from NUMA, so it would be best to solve the issue
> without special restrictions.
> 
> At least we can obtain the NUMA information from the dtb. Therefore, the memory
> can be onlined correctly.

I fully agree it's an important feature that should eventually be there.

But, at least in my understanding, the implementation is not as
straightforward as it looks. If I declare a memory node in the fdt, then,
at boot, the kernel will expect that memory to actually be there to be
used: this is not true if I want to plug my DIMMs only later at runtime.
So I think that declaring the hotpluggable memory in an fdt memory
node might not be feasible without changes.

One idea could be to add a new property to memory nodes, to specify what
memory is potentially hotpluggable. For example, something like:

memory@0 {
  device_type = "memory";
  reg = <0x0 0x0 0x0 0x40000000>;
  hot-add-range = <0x0 0x40000000 0x0 0x40000000>;
  numa-node-id = <0>;
};

memory@10000000000 {
  device_type = "memory";
  reg = <0x100 0x0 0x0 0x40000000>;
  hot-add-range = <0x100 0x40000000 0x0 0x40000000>;
  numa-node-id = <1>;
};

The information in this imaginary "hot-add-range" property would be
ignored at boot and only checked by the hot add process to see to which
NUMA domain some physical memory belongs.
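
As a rough illustration of that lookup, here is a minimal userspace sketch (not
kernel code). The "hot-add-range" property is imaginary, and struct
hot_add_range, hot_add_ranges[] and hot_add_phys_to_nid() are hypothetical
names introduced only for this example. The table holds the two ranges from
the example nodes, assuming two address cells and two size cells, and each
range carries the numa-node-id of its node.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry per hypothetical "hot-add-range": base, size, owning node. */
struct hot_add_range {
	uint64_t base;
	uint64_t size;
	int nid;
};

/* Values decoded from the two example memory nodes. */
static const struct hot_add_range hot_add_ranges[] = {
	{ 0x0000000040000000ULL, 0x40000000ULL, 0 },
	{ 0x0000010040000000ULL, 0x40000000ULL, 1 },
};

/* Return the NUMA node that owns phys_addr, or -1 if no range covers it. */
static int hot_add_phys_to_nid(uint64_t phys_addr)
{
	size_t i;
	size_t n = sizeof(hot_add_ranges) / sizeof(hot_add_ranges[0]);

	for (i = 0; i < n; i++) {
		const struct hot_add_range *r = &hot_add_ranges[i];

		/* Compare the offset, so base + size can never overflow. */
		if (phys_addr >= r->base && phys_addr - r->base < r->size)
			return r->nid;
	}
	return -1;
}
```

With this table, an address inside the first range maps to node 0, one inside
the second range maps to node 1, and anything else (including the memory that
is already present at boot) is reported as -1, which a hot add path could
treat as "not a declared hotpluggable range".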

Of course this is just an example, and my limited knowledge of fdt
doesn't make me the best person to judge what the best approach is.

All this to say: in the absence of a clear and agreed approach, we released
the patch with the !NUMA limitation, so that we can get early feedback,
and also in the hope of kickstarting this discussion on what's the best
approach to support NUMA.

Ideas/suggestions?

Thanks,
Andrea

> 
> Thanks
> zhongjiang
> 
> On 2017/11/24 18:44, Andrea Reale wrote:
> > Hi zhongjiang,
> >
> > On Fri 24 Nov 2017, 18:35, zhong jiang wrote:
> >> Hi, Andrea
> >>
> >> I don't see "memory_add_physaddr_to_nid" in arch/arm64.
> >> Am I missing something?
> > When !CONFIG_NUMA it is defined in include/linux/memory_hotplug.h as 0.
> > In patch 1/5 of this series we require !NUMA to enable
> > ARCH_ENABLE_MEMORY_HOTPLUG.
> >
> > The reason for this simplification is simply that we would not know how
> > to decide the correct node to which to add memory when NUMA is on.
> > Any suggestion on that matter is welcome. 
> >
> > Thanks,
> > Andrea
> >
> >> Thanks
> >> zhongjiang
> >>
> >> On 2017/11/23 19:14, Andrea Reale wrote:
> >>> Adding a "remove" sysfs handle that can be used to trigger
> >>> memory hotremove manually, exactly symmetrically with
> >>> what happens with the "probe" device for hot-add.
> >>>
> >>> This is useful for architectures that do not rely on
> >>> ACPI for memory hot-remove.
> >>>
> >>> Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> >>> Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
> >>> ---
> >>>  drivers/base/memory.c | 34 +++++++++++++++++++++++++++++++++-
> >>>  1 file changed, 33 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> >>> index 1d60b58..8ccb67c 100644
> >>> --- a/drivers/base/memory.c
> >>> +++ b/drivers/base/memory.c
> >>> @@ -530,7 +530,36 @@ memory_probe_store(struct device *dev, struct device_attribute *attr,
> >>>  }
> >>>  
> >>>  static DEVICE_ATTR(probe, S_IWUSR, NULL, memory_probe_store);
> >>> -#endif
> >>> +
> >>> +#ifdef CONFIG_MEMORY_HOTREMOVE
> >>> +static ssize_t
> >>> +memory_remove_store(struct device *dev,
> >>> +		struct device_attribute *attr, const char *buf, size_t count)
> >>> +{
> >>> +	u64 phys_addr;
> >>> +	int nid, ret;
> >>> +	unsigned long pages_per_block = PAGES_PER_SECTION * sections_per_block;
> >>> +
> >>> +	ret = kstrtoull(buf, 0, &phys_addr);
> >>> +	if (ret)
> >>> +		return ret;
> >>> +
> >>> +	if (phys_addr & ((pages_per_block << PAGE_SHIFT) - 1))
> >>> +		return -EINVAL;
> >>> +
> >>> +	nid = memory_add_physaddr_to_nid(phys_addr);
> >>> +	ret = lock_device_hotplug_sysfs();
> >>> +	if (ret)
> >>> +		return ret;
> >>> +
> >>> +	remove_memory(nid, phys_addr,
> >>> +			 MIN_MEMORY_BLOCK_SIZE * sections_per_block);
> >>> +	unlock_device_hotplug();
> >>> +	return count;
> >>> +}
> >>> +static DEVICE_ATTR(remove, S_IWUSR, NULL, memory_remove_store);
> >>> +#endif /* CONFIG_MEMORY_HOTREMOVE */
> >>> +#endif /* CONFIG_ARCH_MEMORY_PROBE */
> >>>  
> >>>  #ifdef CONFIG_MEMORY_FAILURE
> >>>  /*
> >>> @@ -790,6 +819,9 @@ bool is_memblock_offlined(struct memory_block *mem)
> >>>  static struct attribute *memory_root_attrs[] = {
> >>>  #ifdef CONFIG_ARCH_MEMORY_PROBE
> >>>  	&dev_attr_probe.attr,
> >>> +#ifdef CONFIG_MEMORY_HOTREMOVE
> >>> +	&dev_attr_remove.attr,
> >>> +#endif
> >>>  #endif
> >>>  
> >>>  #ifdef CONFIG_MEMORY_FAILURE
> >>
> >
> > .
> >
> 
> 

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 2/5] mm: memory_hotplug: Remove assumption on memory state before hotremove
  2017-11-23 11:14   ` Andrea Reale
  (?)
@ 2017-11-24 14:39     ` Rafael J. Wysocki
  -1 siblings, 0 replies; 156+ messages in thread
From: Rafael J. Wysocki @ 2017-11-24 14:39 UTC (permalink / raw)
  To: Andrea Reale
  Cc: linux-arm-kernel, Linux Kernel Mailing List,
	Linux Memory Management List, m.bielski, arunks, Mark Rutland,
	scott.branden, Will Deacon, qiuxishi, Catalin Marinas,
	Michal Hocko, Rafael Wysocki, ACPI Devel Maling List

On Fri, Nov 24, 2017 at 11:22 AM, Andrea Reale <ar@linux.vnet.ibm.com> wrote:
> Resending the patch adding linux-acpi in CC, as suggested by Rafael.
> Everyone else: apologies for the noise.
>
> Commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal")
> introduced an assumption whereas when control
> reaches remove_memory the corresponding memory has been already
> offlined. In that case, the acpi_memhotplug was making sure that
> the assumption held.
> This assumption, however, is not necessarily true if offlining
> and removal are not done by the same "controller" (for example,
> when first offlining via sysfs).
>
> Removing this assumption for the generic remove_memory code
> and moving it in the specific acpi_memhotplug code. This is
> a dependency for the software-aided arm64 offlining and removal
> process.
>
> Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> Signed-off-by: Maciej Bielski <m.bielski@linux.vnet.ibm.com>
> ---
>  drivers/acpi/acpi_memhotplug.c |  2 +-
>  include/linux/memory_hotplug.h |  9 ++++++---
>  mm/memory_hotplug.c            | 13 +++++++++----
>  3 files changed, 16 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> index 6b0d3ef..b0126a0 100644
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -282,7 +282,7 @@ static void acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
>                         nid = memory_add_physaddr_to_nid(info->start_addr);
>
>                 acpi_unbind_memory_blocks(info);
> -               remove_memory(nid, info->start_addr, info->length);
> +               BUG_ON(remove_memory(nid, info->start_addr, info->length));

Why does this have to be BUG_ON()?  Is it really necessary to kill the
system here?

If it is, please add a comment describing why continuing is not an option here.

>                 list_del(&info->list);
>                 kfree(info);
>         }

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 2/5] mm: memory_hotplug: Remove assumption on memory state before hotremove
  2017-11-24 14:39     ` Rafael J. Wysocki
  (?)
  (?)
@ 2017-11-24 14:49       ` Andrea Reale
  -1 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-11-24 14:49 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-arm-kernel, Linux Kernel Mailing List,
	Linux Memory Management List, m.bielski, arunks, Mark Rutland,
	scott.branden, Will Deacon, qiuxishi, Catalin Marinas,
	Michal Hocko, Rafael Wysocki, ACPI Devel Maling List

Hi Rafael,

On Fri 24 Nov 2017, 15:39, Rafael J. Wysocki wrote:
> On Fri, Nov 24, 2017 at 11:22 AM, Andrea Reale <ar@linux.vnet.ibm.com> wrote:
> > Resending the patch adding linux-acpi in CC, as suggested by Rafael.
> > Everyone else: apologies for the noise.
> >
> > Commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal")
> > introduced an assumption whereas when control
> > reaches remove_memory the corresponding memory has been already
> > offlined. In that case, the acpi_memhotplug was making sure that
> > the assumption held.
> > This assumption, however, is not necessarily true if offlining
> > and removal are not done by the same "controller" (for example,
> > when first offlining via sysfs).
> >
> > Removing this assumption for the generic remove_memory code
> > and moving it in the specific acpi_memhotplug code. This is
> > a dependency for the software-aided arm64 offlining and removal
> > process.
> >
> > Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> > Signed-off-by: Maciej Bielski <m.bielski@linux.vnet.ibm.com>
> > ---
> >  drivers/acpi/acpi_memhotplug.c |  2 +-
> >  include/linux/memory_hotplug.h |  9 ++++++---
> >  mm/memory_hotplug.c            | 13 +++++++++----
> >  3 files changed, 16 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> > index 6b0d3ef..b0126a0 100644
> > --- a/drivers/acpi/acpi_memhotplug.c
> > +++ b/drivers/acpi/acpi_memhotplug.c
> > @@ -282,7 +282,7 @@ static void acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
> >                         nid = memory_add_physaddr_to_nid(info->start_addr);
> >
> >                 acpi_unbind_memory_blocks(info);
> > -               remove_memory(nid, info->start_addr, info->length);
> > +               BUG_ON(remove_memory(nid, info->start_addr, info->length));
> 
> Why does this have to be BUG_ON()?  Is it really necessary to kill the
> system here?

Actually, I hoped you would help me understand that: the BUG() call was
introduced by you in commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify
memory removal"), in memory_hotplug.c:remove_memory().

From reading that commit, my understanding was that you were assuming
that acpi_memory_remove_memory() had already done the job of offlining
the target memory, so there would be a bug if that wasn't the case.

In my case, that assumption did not hold, and I found that it might not
hold for other platforms that do not use ACPI. In fact, the purpose of
this patch is to move this assumption out of the generic hotplug code
and into the ACPI code where it originated.
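
To make the intent concrete, here is a minimal userspace sketch of that
split. It is not the kernel code: struct mem_region and the three function
names are made up for illustration, and abort() merely stands in for
BUG_ON(). It assumes the generic path reports a still-online region with
an error code, which is how the BUG_ON(remove_memory(...)) change above
reads:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a hot-removable region; not a kernel type. */
struct mem_region {
	unsigned long start;
	unsigned long size;
	bool offlined;	/* true once the region has been offlined */
	bool present;	/* false once the region has been removed */
};

/*
 * Generic layer: assumes nothing about the caller. A region that is
 * still online is refused with -EBUSY instead of crashing.
 */
int generic_remove_region(struct mem_region *r)
{
	if (!r->offlined)
		return -EBUSY;
	r->present = false;
	return 0;
}

/*
 * ACPI-like caller: it has offlined the region itself beforehand, so a
 * failure here means its own invariant is broken. abort() is only a
 * userspace stand-in for BUG_ON().
 */
void acpi_like_remove_region(struct mem_region *r)
{
	if (generic_remove_region(r)) {
		fprintf(stderr, "region %#lx still online\n", r->start);
		abort();
	}
}

/*
 * sysfs-like caller: offlining may have been done by someone else, or
 * not at all, so the error is simply passed back.
 */
int sysfs_like_remove_region(struct mem_region *r)
{
	return generic_remove_region(r);
}
```

The generic helper can then be shared by callers that cannot guarantee
the offline step, while the ACPI-style caller keeps its stricter check.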

Thanks,
Andrea

> If it is, please add a comment describing why continuing is not an option here.
> 
> >                 list_del(&info->list);
> >                 kfree(info);
> >         }
> 
> Thanks,
> Rafael

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 2/5] mm: memory_hotplug: Remove assumption on memory state before hotremove
  2017-11-24 14:49       ` Andrea Reale
  (?)
  (?)
@ 2017-11-24 15:43         ` Michal Hocko
  -1 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2017-11-24 15:43 UTC (permalink / raw)
  To: Andrea Reale
  Cc: Rafael J. Wysocki, linux-arm-kernel, Linux Kernel Mailing List,
	Linux Memory Management List, m.bielski, arunks, Mark Rutland,
	scott.branden, Will Deacon, qiuxishi, Catalin Marinas,
	Rafael Wysocki, ACPI Devel Maling List

On Fri 24-11-17 14:49:17, Andrea Reale wrote:
> Hi Rafael,
> 
> On Fri 24 Nov 2017, 15:39, Rafael J. Wysocki wrote:
> > On Fri, Nov 24, 2017 at 11:22 AM, Andrea Reale <ar@linux.vnet.ibm.com> wrote:
> > > Resending the patch adding linux-acpi in CC, as suggested by Rafael.
> > > Everyone else: apologies for the noise.
> > >
> > > Commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal")
> > > introduced an assumption whereas when control
> > > reaches remove_memory the corresponding memory has been already
> > > offlined. In that case, the acpi_memhotplug was making sure that
> > > the assumption held.
> > > This assumption, however, is not necessarily true if offlining
> > > and removal are not done by the same "controller" (for example,
> > > when first offlining via sysfs).
> > >
> > > Removing this assumption for the generic remove_memory code
> > > and moving it in the specific acpi_memhotplug code. This is
> > > a dependency for the software-aided arm64 offlining and removal
> > > process.
> > >
> > > Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> > > Signed-off-by: Maciej Bielski <m.bielski@linux.vnet.ibm.com>
> > > ---
> > >  drivers/acpi/acpi_memhotplug.c |  2 +-
> > >  include/linux/memory_hotplug.h |  9 ++++++---
> > >  mm/memory_hotplug.c            | 13 +++++++++----
> > >  3 files changed, 16 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> > > index 6b0d3ef..b0126a0 100644
> > > --- a/drivers/acpi/acpi_memhotplug.c
> > > +++ b/drivers/acpi/acpi_memhotplug.c
> > > @@ -282,7 +282,7 @@ static void acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
> > >                         nid = memory_add_physaddr_to_nid(info->start_addr);
> > >
> > >                 acpi_unbind_memory_blocks(info);
> > > -               remove_memory(nid, info->start_addr, info->length);
> > > +               BUG_ON(remove_memory(nid, info->start_addr, info->length));
> > 
> > Why does this have to be BUG_ON()?  Is it really necessary to kill the
> > system here?
> 
> Actually, I hoped you would help me understand that: the BUG() call was
> introduced by you in commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify
> memory removal"), in memory_hotplug.c:remove_memory().
> 
> From reading that commit, my understanding was that you were assuming
> that acpi_memory_remove_memory() had already done the job of offlining
> the target memory, so there would be a bug if that wasn't the case.
> 
> In my case, that assumption did not hold, and I found that it might not
> hold for other platforms that do not use ACPI. In fact, the purpose of
> this patch is to move this assumption out of the generic hotplug code
> and into the ACPI code where it originated.

remove_memory failure is basically impossible to handle AFAIR. The
original code to BUG in remove_memory is ugly as hell and we do not want
to spread that out of that function. Instead we really want to get rid
of it.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 2/5] mm: memory_hotplug: Remove assumption on memory state before hotremove
  2017-11-24 15:43         ` Michal Hocko
  (?)
  (?)
@ 2017-11-24 15:54           ` Andrea Reale
  -1 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-11-24 15:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rafael J. Wysocki, linux-arm-kernel, Linux Kernel Mailing List,
	Linux Memory Management List, m.bielski, arunks, Mark Rutland,
	scott.branden, Will Deacon, qiuxishi, Catalin Marinas,
	Rafael Wysocki, ACPI Devel Maling List

On Fri 24 Nov 2017, 16:43, Michal Hocko wrote:
> On Fri 24-11-17 14:49:17, Andrea Reale wrote:
> > Hi Rafael,
> > 
> > On Fri 24 Nov 2017, 15:39, Rafael J. Wysocki wrote:
> > > On Fri, Nov 24, 2017 at 11:22 AM, Andrea Reale <ar@linux.vnet.ibm.com> wrote:
> > > > Resending the patch adding linux-acpi in CC, as suggested by Rafael.
> > > > Everyone else: apologies for the noise.
> > > >
> > > > Commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal")
> > > > introduced an assumption whereas when control
> > > > reaches remove_memory the corresponding memory has been already
> > > > offlined. In that case, the acpi_memhotplug was making sure that
> > > > the assumption held.
> > > > This assumption, however, is not necessarily true if offlining
> > > > and removal are not done by the same "controller" (for example,
> > > > when first offlining via sysfs).
> > > >
> > > > Removing this assumption for the generic remove_memory code
> > > > and moving it in the specific acpi_memhotplug code. This is
> > > > a dependency for the software-aided arm64 offlining and removal
> > > > process.
> > > >
> > > > Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> > > > Signed-off-by: Maciej Bielski <m.bielski@linux.vnet.ibm.com>
> > > > ---
> > > >  drivers/acpi/acpi_memhotplug.c |  2 +-
> > > >  include/linux/memory_hotplug.h |  9 ++++++---
> > > >  mm/memory_hotplug.c            | 13 +++++++++----
> > > >  3 files changed, 16 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> > > > index 6b0d3ef..b0126a0 100644
> > > > --- a/drivers/acpi/acpi_memhotplug.c
> > > > +++ b/drivers/acpi/acpi_memhotplug.c
> > > > @@ -282,7 +282,7 @@ static void acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
> > > >                         nid = memory_add_physaddr_to_nid(info->start_addr);
> > > >
> > > >                 acpi_unbind_memory_blocks(info);
> > > > -               remove_memory(nid, info->start_addr, info->length);
> > > > +               BUG_ON(remove_memory(nid, info->start_addr, info->length));
> > > 
> > > Why does this have to be BUG_ON()?  Is it really necessary to kill the
> > > system here?
> > 
> > Actually, I hoped you would help me understand that: that BUG() call was introduced
> > by yourself in Commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal")
> > in memory_hoptlug.c:remove_memory()). 
> > 
> > Just reading at that commit my understanding was that you were assuming
> > that acpi_memory_remove_memory() have already done the job of offlining
> > the target memory, so there would be a bug if that wasn't the case.
> > 
> > In my case, that assumption did not hold and I found that it might not
> > hold for other platforms that do not use ACPI. In fact, the purpose of
> > this patch is to move this assumption out of the generic hotplug code
> > and move it to ACPI code where it originated. 
> 
> remove_memory failure is basically impossible to handle AFAIR. The
> original code to BUG in remove_memory is ugly as hell and we do not want
> to spread that out of that function. Instead we really want to get rid
> of it.

Today, BUG() is called even in the simple case where remove fails
because the section we are removing is not offline. I cannot see any need to
BUG() in such a case: an error code seems more than sufficient to me.
This is why this patch removes the BUG() call from the generic code
when the "offline" check fails.
It moves it back to the ACPI call, where the assumption
originated. Honestly, I cannot tell if it makes sense to BUG() there:
I have nothing against removing it from ACPI hotplug too, but
I don't know enough to feel free to change the ACPI semantics myself, so I
moved it there to keep the original behavior unchanged for x86 code.

In this arm64 hot-remove port, offline and remove are done in two separate
steps, and it is conceivable that a user erroneously tries to remove a
section that they forgot to offline first: in that case, with the patch,
remove will just report an error without BUGing.
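
To make the proposal concrete, here is a minimal userspace sketch of the
behavior described above. It is not the code from the patch:
struct mock_section, mock_remove_memory() and the choice of -EBUSY are
assumptions made only for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory section; not a kernel structure. */
struct mock_section {
	bool online;	/* still in use by the page allocator */
	bool present;	/* still known to the system */
};

/*
 * Sketch of the proposed behavior: refuse to remove a section that is
 * still online and report the failure to the caller instead of BUG()ing.
 * The error codes are an assumption, not taken from the patch.
 */
static int mock_remove_memory(struct mock_section *sec)
{
	if (!sec || !sec->present)
		return -EINVAL;
	if (sec->online)
		return -EBUSY;	/* caller forgot to offline first */

	sec->present = false;
	return 0;
}
```

With something along these lines, a caller that forgot to offline the
section gets an error back and can retry after offlining it.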

Is my reasoning flawed?

Cheers,
Andrea

> -- 
> Michal Hocko
> SUSE Labs
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 2/5] mm: memory_hotplug: Remove assumption on memory state before hotremove
  2017-11-24 15:54           ` Andrea Reale
  (?)
  (?)
@ 2017-11-24 18:17             ` Michal Hocko
  -1 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2017-11-24 18:17 UTC (permalink / raw)
  To: Andrea Reale
  Cc: Rafael J. Wysocki, linux-arm-kernel, Linux Kernel Mailing List,
	Linux Memory Management List, m.bielski, arunks, Mark Rutland,
	scott.branden, Will Deacon, qiuxishi, Catalin Marinas,
	Rafael Wysocki, ACPI Devel Maling List

On Fri 24-11-17 15:54:59, Andrea Reale wrote:
> On Fri 24 Nov 2017, 16:43, Michal Hocko wrote:
> > On Fri 24-11-17 14:49:17, Andrea Reale wrote:
> > > Hi Rafael,
> > > 
> > > On Fri 24 Nov 2017, 15:39, Rafael J. Wysocki wrote:
> > > > On Fri, Nov 24, 2017 at 11:22 AM, Andrea Reale <ar@linux.vnet.ibm.com> wrote:
> > > > > Resending the patch adding linux-acpi in CC, as suggested by Rafael.
> > > > > Everyone else: apologies for the noise.
> > > > >
> > > > > Commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal")
> > > > > introduced an assumption whereas when control
> > > > > reaches remove_memory the corresponding memory has been already
> > > > > offlined. In that case, the acpi_memhotplug was making sure that
> > > > > the assumption held.
> > > > > This assumption, however, is not necessarily true if offlining
> > > > > and removal are not done by the same "controller" (for example,
> > > > > when first offlining via sysfs).
> > > > >
> > > > > Removing this assumption for the generic remove_memory code
> > > > > and moving it in the specific acpi_memhotplug code. This is
> > > > > a dependency for the software-aided arm64 offlining and removal
> > > > > process.
> > > > >
> > > > > Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> > > > > Signed-off-by: Maciej Bielski <m.bielski@linux.vnet.ibm.com>
> > > > > ---
> > > > >  drivers/acpi/acpi_memhotplug.c |  2 +-
> > > > >  include/linux/memory_hotplug.h |  9 ++++++---
> > > > >  mm/memory_hotplug.c            | 13 +++++++++----
> > > > >  3 files changed, 16 insertions(+), 8 deletions(-)
> > > > >
> > > > > diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> > > > > index 6b0d3ef..b0126a0 100644
> > > > > --- a/drivers/acpi/acpi_memhotplug.c
> > > > > +++ b/drivers/acpi/acpi_memhotplug.c
> > > > > @@ -282,7 +282,7 @@ static void acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
> > > > >                         nid = memory_add_physaddr_to_nid(info->start_addr);
> > > > >
> > > > >                 acpi_unbind_memory_blocks(info);
> > > > > -               remove_memory(nid, info->start_addr, info->length);
> > > > > +               BUG_ON(remove_memory(nid, info->start_addr, info->length));
> > > > 
> > > > Why does this have to be BUG_ON()?  Is it really necessary to kill the
> > > > system here?
> > > 
> > > Actually, I hoped you would help me understand that: that BUG() call was introduced
> > > by yourself in Commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal")
> > > in memory_hoptlug.c:remove_memory()). 
> > > 
> > > Just reading at that commit my understanding was that you were assuming
> > > that acpi_memory_remove_memory() have already done the job of offlining
> > > the target memory, so there would be a bug if that wasn't the case.
> > > 
> > > In my case, that assumption did not hold and I found that it might not
> > > hold for other platforms that do not use ACPI. In fact, the purpose of
> > > this patch is to move this assumption out of the generic hotplug code
> > > and move it to ACPI code where it originated. 
> > 
> > remove_memory failure is basically impossible to handle AFAIR. The
> > original code to BUG in remove_memory is ugly as hell and we do not want
> > to spread that out of that function. Instead we really want to get rid
> > of it.
> 
> Today, BUG() is called even in the simple case where remove fails
> because the section we are removing is not offline.

You cannot hotremove memory which is still online. This is what the caller
should enforce. This is too late to handle the failure. At least for
ACPI.

> I cannot see any need to
> BUG() in such a case: an error code seems more than sufficient to me.

I do not remember the details but AFAIR ACPI is in a deferred (kworker)
context here and cannot simply communicate an error code down the road.
I agree that we should be able to simply return an error but what is the
actual error condition that might happen here?
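
A rough userspace sketch of the two points above: the caller enforces the
offline state before removal, and a deferred handler has no return value
through which to report a failure. All names here (struct mock_block,
struct mock_work, mock_eject_handler() and so on) are hypothetical and are
not real ACPI or mm interfaces.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical memory block state; not a kernel structure. */
struct mock_block {
	bool online;
	bool present;
	bool busy;	/* pretend some pages cannot be migrated */
};

/* Offlining may legitimately fail, and that failure is recoverable. */
static int mock_offline(struct mock_block *b)
{
	if (b->busy)
		return -EBUSY;
	b->online = false;
	return 0;
}

/* Low-level removal: expects the block to be offline already. */
static int mock_remove(struct mock_block *b)
{
	if (b->online)
		return -EBUSY;
	b->present = false;
	return 0;
}

/* Deferred work item: the handler has no return value to carry errors. */
struct mock_work {
	struct mock_block *blk;
	int status;	/* hypothetical: the only place a result can be left */
};

static void mock_eject_handler(struct mock_work *w)
{
	/* The caller enforces the offline state before removal. */
	w->status = mock_offline(w->blk);
	if (w->status) {
		fprintf(stderr, "eject aborted: offline failed (%d)\n",
			w->status);
		return;
	}
	w->status = mock_remove(w->blk);
}
```

In this sketch the only recoverable failure is the offline step; once it
has succeeded, the removal itself is not expected to fail.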

> This is why this patch removes the BUG() call when the "offline" check
> fails from the generic code. 

As I've said we should simply get rid of BUG rather than move it around.

> It moves it back to the ACPI call, where the assumption
> originated. Honestly, I cannot tell if it makes sense to BUG() there:
> I have nothing against removing it from ACPI hotplug too, but
> I don't know enough to feel free to change the ACPI semantics myself, so I
> moved it there to keep the original behavior unchanged for x86 code.

Heh, yeah that is an easier path for sure. I would prefer sorting this
out ;) Not that I would enforce that, though. My concern is that the
previous hotplug development followed this "I do not understand exactly
so I will simply put my code on top of the existing code" mantra and it
ended up in a huge mess.

> In this arm64 hot-remove port, offline and remove are done in two separate
> steps, and it is conceivable that a user erroneously tries to remove a
> section that they forgot to offline first: in that case, with the patch,
> remove will just report an error without BUGing.

As I've said, it is up to the caller to enforce that.

> Is my reasoning flawed?

I wouldn't say flawed but this is a low-level call that should already
happen in a reasonable context.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 2/5] mm: memory_hotplug: Remove assumption on memory state before hotremove
@ 2017-11-24 18:17             ` Michal Hocko
  0 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2017-11-24 18:17 UTC (permalink / raw)
  To: Andrea Reale
  Cc: Rafael J. Wysocki, linux-arm-kernel, Linux Kernel Mailing List,
	Linux Memory Management List, m.bielski, arunks, Mark Rutland,
	scott.branden, Will Deacon, qiuxishi, Catalin Marinas,
	Rafael Wysocki, ACPI Devel Maling List

On Fri 24-11-17 15:54:59, Andrea Reale wrote:
> On Fri 24 Nov 2017, 16:43, Michal Hocko wrote:
> > On Fri 24-11-17 14:49:17, Andrea Reale wrote:
> > > Hi Rafael,
> > > 
> > > On Fri 24 Nov 2017, 15:39, Rafael J. Wysocki wrote:
> > > > On Fri, Nov 24, 2017 at 11:22 AM, Andrea Reale <ar@linux.vnet.ibm.com> wrote:
> > > > > Resending the patch adding linux-acpi in CC, as suggested by Rafael.
> > > > > Everyone else: apologies for the noise.
> > > > >
> > > > > Commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal")
> > > > > introduced an assumption whereas when control
> > > > > reaches remove_memory the corresponding memory has been already
> > > > > offlined. In that case, the acpi_memhotplug was making sure that
> > > > > the assumption held.
> > > > > This assumption, however, is not necessarily true if offlining
> > > > > and removal are not done by the same "controller" (for example,
> > > > > when first offlining via sysfs).
> > > > >
> > > > > Removing this assumption for the generic remove_memory code
> > > > > and moving it in the specific acpi_memhotplug code. This is
> > > > > a dependency for the software-aided arm64 offlining and removal
> > > > > process.
> > > > >
> > > > > Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> > > > > Signed-off-by: Maciej Bielski <m.bielski@linux.vnet.ibm.com>
> > > > > ---
> > > > >  drivers/acpi/acpi_memhotplug.c |  2 +-
> > > > >  include/linux/memory_hotplug.h |  9 ++++++---
> > > > >  mm/memory_hotplug.c            | 13 +++++++++----
> > > > >  3 files changed, 16 insertions(+), 8 deletions(-)
> > > > >
> > > > > diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> > > > > index 6b0d3ef..b0126a0 100644
> > > > > --- a/drivers/acpi/acpi_memhotplug.c
> > > > > +++ b/drivers/acpi/acpi_memhotplug.c
> > > > > @@ -282,7 +282,7 @@ static void acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
> > > > >                         nid = memory_add_physaddr_to_nid(info->start_addr);
> > > > >
> > > > >                 acpi_unbind_memory_blocks(info);
> > > > > -               remove_memory(nid, info->start_addr, info->length);
> > > > > +               BUG_ON(remove_memory(nid, info->start_addr, info->length));
> > > > 
> > > > Why does this have to be BUG_ON()?  Is it really necessary to kill the
> > > > system here?
> > > 
> > > Actually, I hoped you would help me understand that: that BUG() call was introduced
> > > by yourself in commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal"),
> > > in memory_hotplug.c:remove_memory().
> > > 
> > > Just reading that commit, my understanding was that you were assuming
> > > that acpi_memory_remove_memory() has already done the job of offlining
> > > the target memory, so there would be a bug if that wasn't the case.
> > > 
> > > In my case, that assumption did not hold and I found that it might not
> > > hold for other platforms that do not use ACPI. In fact, the purpose of
> > > this patch is to move this assumption out of the generic hotplug code
> > > and move it to ACPI code where it originated. 
> > 
> > remove_memory failure is basically impossible to handle AFAIR. The
> > original code to BUG in remove_memory is ugly as hell and we do not want
> > to spread that out of that function. Instead we really want to get rid
> > of it.
> 
> Today, BUG() is called even in the simple case where remove fails
> because the section we are removing is not offline.

You cannot hotremove memory which is still online. This is what caller
should enforce. This is too late to handle the failure. At least for
ACPI.

> I cannot see any need to
> BUG() in such a case: an error code seems more than sufficient to me.

I do not remember details but AFAIR ACPI is in a deferred (kworker)
context here and cannot simply communicate error code down the road.
I agree that we should be able to simply return an error but what is the
actual error condition that might happen here?

> This is why this patch removes the BUG() call when the "offline" check
> fails from the generic code. 

As I've said we should simply get rid of BUG rather than move it around.

> It moves it back to the ACPI call, where the assumption
> originated. Honestly, I cannot tell if it makes sense to BUG() there:
> I have nothing against removing it from ACPI hotplug too, but
> I don't know enough to feel free to change the acpi semantics myself, so I
> moved it there to keep the original behavior unchanged for x86 code.

Heh, yeah that is an easier path for sure. I would prefer sorting this
out ;) Not that I would enforce that, though. My concern is that the
previous hotplug development followed this "I do not understand exactly
so I will simply put my code on top of existing code" mantra and it ended up
in a huge mess.

> In this arm64 hot-remove port, offline and remove are done in two separate
> steps, and it is conceivable that a user erroneously tries to remove some
> section that he forgot to offline first: in that case, with the patch,
> remove will just report an error without BUGing.

As I've said, it is up to the caller to enforce that.

> Is my reasoning flawed?

I wouldn't say flawed but this is a low-level call that should already
happen in a reasonable context.
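
To make the two positions in this exchange concrete, here is a minimal
user-space sketch, not kernel code. Every name in it (struct mem_block,
model_remove_memory() and the two callers) is hypothetical and only models
the discussion: the low-level call refuses online memory with an error code
instead of BUG(), a caller that can report failure propagates it, and a
deferred-context caller, which has nobody to return an error to, can only
log it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a memory block; not a kernel structure. */
struct mem_block {
	unsigned long start;
	bool online;	/* still in use by the page allocator */
	bool present;	/* still known to the system */
};

/* Low-level removal: reject online memory with an error, no BUG(). */
int model_remove_memory(struct mem_block *blk)
{
	assert(blk != NULL);

	if (blk->online)
		return -EBUSY;

	blk->present = false;
	return 0;
}

/* Caller that can report failure, e.g. a user-triggered request. */
int model_user_remove(struct mem_block *blk)
{
	int ret = model_remove_memory(blk);

	if (ret)
		fprintf(stderr, "remove refused: block at %#lx is online\n",
			blk->start);
	return ret;
}

/* Deferred-context caller: no one to return an error to, so it logs. */
void model_deferred_remove(struct mem_block *blk)
{
	if (model_remove_memory(blk))
		fprintf(stderr, "deferred remove of %#lx failed, block kept\n",
			blk->start);
}
```

In this model a block that is still online is left untouched and the caller
sees -EBUSY; whether the real ACPI path can tolerate such a failure is
exactly the open question above.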
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 1/5] mm: memory_hotplug: Memory hotplug (add) support for arm64
  2017-11-24 10:53         ` Maciej Bielski
  (?)
@ 2017-11-26  6:58           ` Arun KS
  -1 siblings, 0 replies; 156+ messages in thread
From: Arun KS @ 2017-11-26  6:58 UTC (permalink / raw)
  To: Maciej Bielski
  Cc: Andrea Reale, linux-arm-kernel, linux-kernel, linux-mm, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	Catalin Marinas, mhocko, realean2

On Fri, Nov 24, 2017 at 4:23 PM, Maciej Bielski
<m.bielski@virtualopensystems.com> wrote:
> On Fri, Nov 24, 2017 at 09:42:33AM +0000, Andrea Reale wrote:
>> Hi Arun,
>>
>>
>> On Fri 24 Nov 2017, 11:25, Arun KS wrote:
>> > On Thu, Nov 23, 2017 at 4:43 PM, Maciej Bielski
>> > <m.bielski@virtualopensystems.com> wrote:
>> >> [ ...]
>> > > Introduces memory hotplug functionality (hot-add) for arm64.
>> > > @@ -615,6 +616,44 @@ void __init paging_init(void)
>> > >                       SWAPPER_DIR_SIZE - PAGE_SIZE);
>> > >  }
>> > >
>> > > +#ifdef CONFIG_MEMORY_HOTPLUG
>> > > +
>> > > +/*
>> > > + * hotplug_paging() is used by memory hotplug to build new page tables
>> > > + * for hot added memory.
>> > > + */
>> > > +
>> > > +struct mem_range {
>> > > +       phys_addr_t base;
>> > > +       phys_addr_t size;
>> > > +};
>> > > +
>> > > +static int __hotplug_paging(void *data)
>> > > +{
>> > > +       int flags = 0;
>> > > +       struct mem_range *section = data;
>> > > +
>> > > +       if (debug_pagealloc_enabled())
>> > > +               flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>> > > +
>> > > +       __create_pgd_mapping(swapper_pg_dir, section->base,
>> > > +                       __phys_to_virt(section->base), section->size,
>> > > +                       PAGE_KERNEL, pgd_pgtable_alloc, flags);
>> >
>> > Hello Andrea,
>> >
>> > __hotplug_paging runs in stop_machine context.
>> > cpu stop callbacks must not sleep.
>> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/stop_machine.c?h=v4.14#n479
>> >
>> > __create_pgd_mapping uses pgd_pgtable_alloc, which does
>> > __get_free_page(PGALLOC_GFP)
>> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/mm/mmu.c?h=v4.14#n342
>> >
>> > PGALLOC_GFP has GFP_KERNEL, which in turn has __GFP_RECLAIM
>> >
>> > #define PGALLOC_GFP     (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO)
>> > #define GFP_KERNEL      (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
>> >
>> > Now, prepare_alloc_pages() called by __alloc_pages_nodemask checks for
>> >
>> > might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
>> >
>> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/page_alloc.c?h=v4.14#n4150
>> >
>> > and then BUG()
>>
>> Well spotted, thanks for reporting the problem. One possible solution
>> would be to revert back to building the updated page tables on a copy
>> pgdir (as it was done in v1 of this patchset) and then replacing swapper
>> atomically with stop_machine.
>>
>> Actually, I am not sure if stop_machine is strictly needed,
>> if we modify the swapper pgdir live: for example, in x86_64
>> kernel_physical_mapping_init, atomicity is ensured by spin-locking on
>> init_mm.page_table_lock.
>> https://elixir.free-electrons.com/linux/v4.14/source/arch/x86/mm/init_64.c#L684
>> I'll spend some time investigating whoever else could be working
>> concurrently on the swapper pgdir.
>>
>> Any suggestion or pointer is very welcome.
>
> Hi Andrea, Arun,
>
> An alternative approach could be to implement pgd_pgtable_alloc_nosleep() and
> use it from hotplug_paging(). It could then use different flags,
> e.g.:
>
> #define PGALLOC_GFP_NORECLAIM   (__GFP_IO | __GFP_FS | __GFP_NOTRACK | __GFP_ZERO)

This solves the problem with __get_free_page.

But pgd_pgtable_alloc() ->  pgtable_page_ctor() -> ptlock_alloc() and
then kmem_cache_alloc(page_ptl_cachep, GFP_KERNEL)
Same BUG again.
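
The flag arithmetic behind this can be checked with a small user-space
model. It is only a sketch: the M_* bit values are made up for illustration
(the real definitions are in include/linux/gfp.h), and both helper functions
are hypothetical. It assumes, as in v4.14 to my understanding, that
__GFP_RECLAIM is the OR of the direct-reclaim and kswapd-reclaim bits, and
it follows the pgd_pgtable_alloc() -> ptlock_alloc() chain as described
above.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only, not the kernel's real ones. */
#define M_GFP_IO		0x01u
#define M_GFP_FS		0x02u
#define M_GFP_ZERO		0x04u
#define M_GFP_NOTRACK		0x08u
#define M_GFP_DIRECT_RECLAIM	0x10u
#define M_GFP_KSWAPD_RECLAIM	0x20u

#define M_GFP_RECLAIM	(M_GFP_DIRECT_RECLAIM | M_GFP_KSWAPD_RECLAIM)
#define M_GFP_KERNEL	(M_GFP_RECLAIM | M_GFP_IO | M_GFP_FS)
#define M_PGALLOC_GFP	(M_GFP_KERNEL | M_GFP_NOTRACK | M_GFP_ZERO)
#define M_PGALLOC_GFP_NORECLAIM \
	(M_GFP_IO | M_GFP_FS | M_GFP_NOTRACK | M_GFP_ZERO)

/* The proposed mask really does drop the direct-reclaim bit. */
static_assert((M_PGALLOC_GFP_NORECLAIM & M_GFP_DIRECT_RECLAIM) == 0,
	      "NORECLAIM mask must not allow direct reclaim");

/* Same condition the allocator hands to might_sleep_if(). */
bool alloc_may_sleep(unsigned int gfp_mask)
{
	return (gfp_mask & M_GFP_DIRECT_RECLAIM) != 0;
}

/*
 * Model of the whole page-table allocation: the page itself is
 * allocated with page_gfp, but the ptlock allocation still uses
 * GFP_KERNEL regardless of what the caller asked for.
 */
bool pgtable_alloc_may_sleep(unsigned int page_gfp)
{
	return alloc_may_sleep(page_gfp) || alloc_may_sleep(M_GFP_KERNEL);
}
```

With these definitions alloc_may_sleep() is false for the NORECLAIM mask,
while pgtable_alloc_may_sleep() stays true for it, because the second
allocation ignores the caller's flags.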

Regards,
Arun

>
> Is this approach inefficient in any way?
> Do we like the fact that the memory-attaching thread can go to sleep?
>
> BR,
>
>>
>> Thanks,
>> Andrea
>>
>> > I was testing on 4.4 kernel, but cross checked with 4.14 as well.
>> >
>> > Regards,
>> > Arun
>> >
>> >
>> > > +
>> > > +       return 0;
>> > > +}
>> > > +
>> > > +inline void hotplug_paging(phys_addr_t start, phys_addr_t size)
>> > > +{
>> > > +       struct mem_range section = {
>> > > +               .base = start,
>> > > +               .size = size,
>> > > +       };
>> > > +
>> > > +       stop_machine(__hotplug_paging, &section, NULL);
>> > > +}
>> > > +#endif /* CONFIG_MEMORY_HOTPLUG */
>> > > +
>> > >  /*
>> > >   * Check whether a kernel address is valid (derived from arch/x86/).
>> > >   */
>> > > --
>> > > 2.7.4
>> > >
>> >
>>
>
> --
> Maciej Bielski

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 1/5] mm: memory_hotplug: Memory hotplug (add) support for arm64
  2017-11-23 11:13   ` Maciej Bielski
  (?)
@ 2017-11-27 15:19     ` Robin Murphy
  -1 siblings, 0 replies; 156+ messages in thread
From: Robin Murphy @ 2017-11-27 15:19 UTC (permalink / raw)
  To: Maciej Bielski, linux-arm-kernel, ar
  Cc: mark.rutland, realean2, mhocko, scott.branden, catalin.marinas,
	will.deacon, linux-kernel, linux-mm, arunks, qiuxishi

Hi Andrea,

I've also been looking at memory hotplug for arm64, from the perspective 
of enabling ZONE_DEVICE for pmem. May I ask what your use-case for this 
series is? AFAICS the real demand will be coming from server systems, 
which in practice means both ACPI and NUMA, both of which are being 
resoundingly ignored here.

Further review comments inline.

On 23/11/17 11:13, Maciej Bielski wrote:
> Introduces memory hotplug functionality (hot-add) for arm64.
> 
> Changes v1->v2:
> - swapper pgtable updated in place on hot add, avoiding unnecessary copy:
>    all changes are additive and non destructive.
> 
> - stop_machine used to update swapper on hot add, avoiding races
> 
> - checking if pagealloc is under debug to stay coherent with mem_map
> 
> Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
> Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> ---
>   arch/arm64/Kconfig           | 12 ++++++
>   arch/arm64/configs/defconfig |  1 +
>   arch/arm64/include/asm/mmu.h |  3 ++
>   arch/arm64/mm/init.c         | 87 ++++++++++++++++++++++++++++++++++++++++++++
>   arch/arm64/mm/mmu.c          | 39 ++++++++++++++++++++
>   5 files changed, 142 insertions(+)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 0df64a6..c736bba 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -641,6 +641,14 @@ config HOTPLUG_CPU
>   	  Say Y here to experiment with turning CPUs off and on.  CPUs
>   	  can be controlled through /sys/devices/system/cpu.
>   
> +config ARCH_HAS_ADD_PAGES
> +	def_bool y
> +	depends on ARCH_ENABLE_MEMORY_HOTPLUG
> +
> +config ARCH_ENABLE_MEMORY_HOTPLUG
> +	def_bool y
> +    depends on !NUMA

As above, realistically this seems too limiting to be useful.

> +
>   # Common NUMA Features
>   config NUMA
>   	bool "Numa Memory Allocation and Scheduler Support"
> @@ -715,6 +723,10 @@ config ARCH_HAS_CACHE_LINE_SIZE
>   
>   source "mm/Kconfig"
>   
> +config ARCH_MEMORY_PROBE
> +	def_bool y
> +	depends on MEMORY_HOTPLUG

I'm particularly dubious about enabling this by default - it's useful 
for development and testing, yes, but I think it's the kind of feature 
where the onus should be on interested developers to turn it on, rather 
than production configs to have to turn it off.

> +
>   config SECCOMP
>   	bool "Enable seccomp to safely compute untrusted bytecode"
>   	---help---
> diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
> index 34480e9..5fc5656 100644
> --- a/arch/arm64/configs/defconfig
> +++ b/arch/arm64/configs/defconfig
> @@ -80,6 +80,7 @@ CONFIG_ARM64_VA_BITS_48=y
>   CONFIG_SCHED_MC=y
>   CONFIG_NUMA=y
>   CONFIG_PREEMPT=y
> +CONFIG_MEMORY_HOTPLUG=y

Note that this is effectively pointless, given CONFIG_NUMA=y two lines
above (ARCH_ENABLE_MEMORY_HOTPLUG depends on !NUMA)...

>   CONFIG_KSM=y
>   CONFIG_TRANSPARENT_HUGEPAGE=y
>   CONFIG_CMA=y
> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> index 0d34bf0..2b3fa4d 100644
> --- a/arch/arm64/include/asm/mmu.h
> +++ b/arch/arm64/include/asm/mmu.h
> @@ -40,5 +40,8 @@ extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
>   			       pgprot_t prot, bool page_mappings_only);
>   extern void *fixmap_remap_fdt(phys_addr_t dt_phys);
>   extern void mark_linear_text_alias_ro(void);
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +extern void hotplug_paging(phys_addr_t start, phys_addr_t size);

Is there any reason for not just implementing all the hotplug code 
self-contained in mmu.c?

> +#endif
>   
>   #endif
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 5960bef..e96e7d3 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -722,3 +722,90 @@ static int __init register_mem_limit_dumper(void)
>   	return 0;
>   }
>   __initcall(register_mem_limit_dumper);
> +
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +int add_pages(int nid, unsigned long start_pfn,
> +		unsigned long nr_pages, bool want_memblock)
> +{
> +	int ret;
> +	u64 start_addr = start_pfn << PAGE_SHIFT;
> +	/*
> +	 * Mark the first page in the range as unusable. This is needed
> +	 * because __add_section (within __add_pages) wants pfn_valid
> +	 * of it to be false, and in arm64 pfn_valid is implemented by
> +	 * just checking the nomap flag for existing blocks.
> +	 *
> +	 * A small trick here is that __add_section() requires only
> +	 * phys_start_pfn (that is the first pfn of a section) to be
> +	 * invalid. Regardless of whether it was assumed (by the function
> +	 * author) that all pfns within a section are either all valid
> +	 * or all invalid, it allows to avoid looping twice (once here,
> +	 * second when memblock_clear_nomap() is called) through all
> +	 * pfns of the section and modify only one pfn. Thanks to that,
> +	 * further, in __add_zone() only this very first pfn is skipped
> +	 * and corresponding page is not flagged reserved. Therefore it
> +	 * is enough to correct this setup only for it.
> +	 *
> +	 * When arch_add_memory() returns the walk_memory_range() function
> +	 * is called and passed with online_memory_block() callback,
> +	 * which execution finally reaches the memory_block_action()
> +	 * function, where also only the first pfn of a memory block is
> +	 * checked to be reserved. Above, it was first pfn of a section,
> +	 * here it is a block but
> +	 * (drivers/base/memory.c):
> +	 *     sections_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
> +	 * (include/linux/memory.h):
> +	 *     #define MIN_MEMORY_BLOCK_SIZE     (1UL << SECTION_SIZE_BITS)
> +	 * so we can consider block and section equivalently
> +	 */
> +	memblock_mark_nomap(start_addr, 1<<PAGE_SHIFT);
> +	ret = __add_pages(nid, start_pfn, nr_pages, want_memblock);
> +
> +	/*
> +	 * Make the pages usable after they have been added.
> +	 * This will make pfn_valid return true
> +	 */
> +	memblock_clear_nomap(start_addr, 1<<PAGE_SHIFT);
> +
> +	/*
> +	 * This is a hack to avoid having to mix arch specific code
> +	 * into arch independent code. SetPageReserved is supposed
> +	 * to be called by __add_zone (within __add_section, within
> +	 * __add_pages). However, when it is called there, it assumes that
> +	 * pfn_valid returns true.  For the way pfn_valid is implemented
> +	 * in arm64 (a check on the nomap flag), the only way to make
> +	 * this evaluate true inside __add_zone is to clear the nomap
> +	 * flags of blocks in architecture independent code.
> +	 *
> +	 * To avoid this, we set the Reserved flag here after we cleared
> +	 * the nomap flag in the line above.
> +	 */
> +	SetPageReserved(pfn_to_page(start_pfn));

This whole business is utterly horrible. I really think we need to 
revisit why arm64 isn't using the normal sparsemem pfn_valid() 
implementation. If there are callers misusing pfn_valid() where they 
really want page_is_ram() or similar, or missing further 
pfn_valid_within() checks, then it's surely time to fix those at the 
source rather than adding to the Jenga pile of hacks in this area. I've 
started digging into it myself, but don't have any answers yet.
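
To make the objection concrete, here is a minimal userspace sketch (not kernel code) contrasting the two flavours of check. It assumes the hot-added range has already been registered with memblock by the time add_pages() runs; every name and the tiny section size are hypothetical, and the real implementations differ in detail.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGES_PER_SECTION 4UL   /* assumption: tiny sections for the demo */
#define TOY_NR_PFNS           16UL

/* Sparsemem-style state: one "present" bit per section. */
static bool toy_section_present[TOY_NR_PFNS / TOY_PAGES_PER_SECTION];

/* memblock-style state: per-page "is memory" and "nomap" flags. */
static bool toy_is_memory[TOY_NR_PFNS];
static bool toy_nomap[TOY_NR_PFNS];

/* Generic sparsemem flavour: valid iff the containing section exists. */
static bool pfn_valid_sparsemem(unsigned long pfn)
{
	assert(pfn < TOY_NR_PFNS);
	return toy_section_present[pfn / TOY_PAGES_PER_SECTION];
}

/* memblock flavour: valid iff the page is known memory and not nomap. */
static bool pfn_valid_memblock(unsigned long pfn)
{
	assert(pfn < TOY_NR_PFNS);
	return toy_is_memory[pfn] && !toy_nomap[pfn];
}

/* Register a range with the toy memblock, as a hot-add has to do. */
static void toy_memblock_add(unsigned long start_pfn, unsigned long nr)
{
	for (unsigned long pfn = start_pfn; pfn < start_pfn + nr; pfn++) {
		assert(pfn < TOY_NR_PFNS);
		toy_is_memory[pfn] = true;
	}
}
```

In this model, after toy_memblock_add(4, 4) the memblock flavour already reports pfn 4 as valid before any section has been added, which is why the patch has to toggle nomap on the first pfn around __add_pages(). The section-based flavour stays false until the section is marked present, so no such dance is needed.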

> +
> +	return ret;
> +}
> +
> +int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
> +{
> +	int ret;
> +	unsigned long start_pfn = start >> PAGE_SHIFT;
> +	unsigned long nr_pages = size >> PAGE_SHIFT;
> +	unsigned long end_pfn = start_pfn + nr_pages;
> +	unsigned long max_sparsemem_pfn = 1UL << (MAX_PHYSMEM_BITS-PAGE_SHIFT);
> +
> +	if (end_pfn > max_sparsemem_pfn) {
> +		pr_err("end_pfn too big");
> +		return -1;
> +	}
> +	hotplug_paging(start, size);
> +
> +	ret = add_pages(nid, start_pfn, nr_pages, want_memblock);
> +
> +	if (ret)
> +		pr_warn("%s: Problem encountered in __add_pages() ret=%d\n",
> +			__func__, ret);
> +
> +	return ret;
> +}
> +
> +#endif /* CONFIG_MEMORY_HOTPLUG */
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index f1eb15e..d93043d 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -28,6 +28,7 @@
>   #include <linux/mman.h>
>   #include <linux/nodemask.h>
>   #include <linux/memblock.h>
> +#include <linux/stop_machine.h>
>   #include <linux/fs.h>
>   #include <linux/io.h>
>   #include <linux/mm.h>
> @@ -615,6 +616,44 @@ void __init paging_init(void)
>   		      SWAPPER_DIR_SIZE - PAGE_SIZE);
>   }
>   
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +
> +/*
> + * hotplug_paging() is used by memory hotplug to build new page tables
> + * for hot added memory.
> + */
> +
> +struct mem_range {
> +	phys_addr_t base;
> +	phys_addr_t size;
> +};
> +
> +static int __hotplug_paging(void *data)
> +{
> +	int flags = 0;
> +	struct mem_range *section = data;
> +
> +	if (debug_pagealloc_enabled())
> +		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> +
> +	__create_pgd_mapping(swapper_pg_dir, section->base,
> +			__phys_to_virt(section->base), section->size,
> +			PAGE_KERNEL, pgd_pgtable_alloc, flags);
> +
> +	return 0;
> +}
> +
> +inline void hotplug_paging(phys_addr_t start, phys_addr_t size)
> +{
> +	struct mem_range section = {
> +		.base = start,
> +		.size = size,
> +	};
> +
> +	stop_machine(__hotplug_paging, &section, NULL);

Why exactly do we need to swing the stop_machine() hammer here? I 
appreciate that separate hotplug events for adjacent sections could 
potentially affect the same top-level entry in swapper_pg_dir, but those 
should already be serialised by the hotplug lock - who else has cause to 
modify non-leaf entries for the linear map at runtime in a manner which 
might conflict?
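
As a rough illustration of the reasoning, consider a toy userspace model (not kernel code) in which a hot-add only ever fills in empty entries and the caller is assumed to hold the hotplug lock. All names below are hypothetical, and the real __create_pgd_mapping() is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_ENTRIES 4

struct toy_table {
	unsigned long leaf[TOY_ENTRIES];   /* 0 means "not mapped" */
};

static struct toy_table *toy_top[TOY_ENTRIES];  /* stand-in for top-level entries */
static struct toy_table toy_pool[TOY_ENTRIES];  /* backing store for next-level tables */
static bool toy_hotplug_lock_held;              /* models "caller holds the hotplug lock" */

/* Purely additive update: populate empty entries, never rewrite a live one. */
static int toy_map_range(unsigned int top_idx, unsigned int leaf_idx,
			 unsigned long phys)
{
	assert(toy_hotplug_lock_held);

	if (top_idx >= TOY_ENTRIES || leaf_idx >= TOY_ENTRIES || phys == 0)
		return -1;

	/* Install the next-level table only if the top-level entry is empty. */
	if (toy_top[top_idx] == NULL)
		toy_top[top_idx] = &toy_pool[top_idx];

	/* Refuse to overwrite an existing leaf. */
	if (toy_top[top_idx]->leaf[leaf_idx] != 0)
		return -1;

	toy_top[top_idx]->leaf[leaf_idx] = phys;
	return 0;
}
```

In this model, two adds for adjacent ranges that share a top-level entry leave each other's leaves intact as long as they are serialised, which is the property the question above is probing. It says nothing about other writers to the linear map, which is exactly the open point.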

Robin.

> +}
> +#endif /* CONFIG_MEMORY_HOTPLUG */
> +
>   /*
>    * Check whether a kernel address is valid (derived from arch/x86/).
>    */
> 

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 2/5] mm: memory_hotplug: Remove assumption on memory state before hotremove
  2017-11-24 15:54           ` Andrea Reale
  (?)
  (?)
@ 2017-11-27 15:20             ` Robin Murphy
  -1 siblings, 0 replies; 156+ messages in thread
From: Robin Murphy @ 2017-11-27 15:20 UTC (permalink / raw)
  To: Andrea Reale, Michal Hocko
  Cc: Mark Rutland, Rafael Wysocki, m.bielski, ACPI Devel Maling List,
	Rafael J. Wysocki, Catalin Marinas, scott.branden, Will Deacon,
	Linux Kernel Mailing List, Linux Memory Management List, arunks,
	qiuxishi, linux-arm-kernel

On 24/11/17 15:54, Andrea Reale wrote:
> On Fri 24 Nov 2017, 16:43, Michal Hocko wrote:
>> On Fri 24-11-17 14:49:17, Andrea Reale wrote:
>>> Hi Rafael,
>>>
>>> On Fri 24 Nov 2017, 15:39, Rafael J. Wysocki wrote:
>>>> On Fri, Nov 24, 2017 at 11:22 AM, Andrea Reale <ar@linux.vnet.ibm.com> wrote:
>>>>> Resending the patch adding linux-acpi in CC, as suggested by Rafael.
>>>>> Everyone else: apologies for the noise.
>>>>>
>>>>> Commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal")
>>>>> introduced the assumption that, when control reaches
>>>>> remove_memory(), the corresponding memory has already been
>>>>> offlined. In that case, acpi_memhotplug made sure that the
>>>>> assumption held.
>>>>> This assumption, however, is not necessarily true if offlining
>>>>> and removal are not done by the same "controller" (for example,
>>>>> when first offlining via sysfs).
>>>>>
>>>>> Remove this assumption from the generic remove_memory code
>>>>> and move it into the specific acpi_memhotplug code. This is
>>>>> a dependency for the software-aided arm64 offlining and removal
>>>>> process.
>>>>>
>>>>> Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
>>>>> Signed-off-by: Maciej Bielski <m.bielski@linux.vnet.ibm.com>
>>>>> ---
>>>>>   drivers/acpi/acpi_memhotplug.c |  2 +-
>>>>>   include/linux/memory_hotplug.h |  9 ++++++---
>>>>>   mm/memory_hotplug.c            | 13 +++++++++----
>>>>>   3 files changed, 16 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
>>>>> index 6b0d3ef..b0126a0 100644
>>>>> --- a/drivers/acpi/acpi_memhotplug.c
>>>>> +++ b/drivers/acpi/acpi_memhotplug.c
>>>>> @@ -282,7 +282,7 @@ static void acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
>>>>>                          nid = memory_add_physaddr_to_nid(info->start_addr);
>>>>>
>>>>>                  acpi_unbind_memory_blocks(info);
>>>>> -               remove_memory(nid, info->start_addr, info->length);
>>>>> +               BUG_ON(remove_memory(nid, info->start_addr, info->length));
>>>>
>>>> Why does this have to be BUG_ON()?  Is it really necessary to kill the
>>>> system here?
>>>
>>> Actually, I hoped you would help me understand that: that BUG() call was introduced
>>> by yourself in commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal")
>>> (in memory_hotplug.c:remove_memory()).
>>>
>>> Just reading that commit, my understanding was that you were assuming
>>> that acpi_memory_remove_memory() has already done the job of offlining
>>> the target memory, so there would be a bug if that wasn't the case.
>>>
>>> In my case, that assumption did not hold and I found that it might not
>>> hold for other platforms that do not use ACPI. In fact, the purpose of
>>> this patch is to move this assumption out of the generic hotplug code
>>> and move it to ACPI code where it originated.
>>
>> remove_memory failure is basically impossible to handle AFAIR. The
>> original code to BUG in remove_memory is ugly as hell and we do not want
>> to spread that out of that function. Instead we really want to get rid
>> of it.
> 
> Today, BUG() is called even in the simple case where remove fails
> because the section we are removing is not offline. I cannot see any need to
> BUG() in such a case: an error code seems more than sufficient to me.
> This is why this patch removes from the generic code the BUG() call
> made when the "offline" check fails.
> It moves it back to the ACPI call, where the assumption
> originated. Honestly, I cannot tell if it makes sense to BUG() there:
> I have nothing against removing it from ACPI hotplug too, but
> I don't know enough to feel free to change the acpi semantics myself, so I
> moved it there to keep the original behavior unchanged for x86 code.
> 
> In this arm64 hot-remove port, offline and remove are done in two separate
> steps, and it is conceivable that a user erroneously tries to remove some
> section that they forgot to offline first: in that case, with the patch,
> remove will just report an error without BUGing.

The user can already kill the system by misusing the sysfs probe driver; 
should similar theoretical misuse of your sysfs remove driver really 
need to be all that different?

> Is my reasoning flawed?

Furthermore, even if your driver does want to enforce this, I don't see 
why it can't just do the equivalent of memory_subsys_offline() itself 
before even trying to call remove_memory().
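
A minimal userspace sketch of that ordering follows (not kernel code). The block states, the `busy` flag and every function name are hypothetical stand-ins; the sketch only shows a remove path that returns an error instead of BUG()ing, with the driver offlining first.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum toy_block_state { TOY_ONLINE, TOY_OFFLINE, TOY_REMOVED };

struct toy_block {
	enum toy_block_state state;
	int busy;   /* non-zero models pages that cannot be migrated away */
};

/* Stand-in for the offlining step: may fail, and is a no-op if already offline. */
static int toy_offline(struct toy_block *b)
{
	assert(b != NULL);

	if (b->state == TOY_REMOVED)
		return -ENODEV;
	if (b->state == TOY_OFFLINE)
		return 0;
	if (b->busy)
		return -EBUSY;

	b->state = TOY_OFFLINE;
	return 0;
}

/* Generic removal: reports an error, rather than crashing, if still online. */
static int toy_remove(struct toy_block *b)
{
	assert(b != NULL);

	if (b->state == TOY_REMOVED)
		return -ENODEV;
	if (b->state != TOY_OFFLINE)
		return -EBUSY;

	b->state = TOY_REMOVED;
	return 0;
}

/* Driver-side wrapper: offline first, and only then attempt removal. */
static int toy_offline_and_remove(struct toy_block *b)
{
	int ret = toy_offline(b);

	if (ret)
		return ret;
	return toy_remove(b);
}
```

With a wrapper like this, the "forgot to offline first" case never reaches the removal step in an online state, so the generic code's assumption holds without the driver relying on user discipline.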

Robin.

> 
> Cheers,
> Andrea
> 
>> -- 
>> Michal Hocko
>> SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 3/5] mm: memory_hotplug: memblock to track partially removed vmemmap mem
  2017-11-23 11:14   ` Andrea Reale
  (?)
@ 2017-11-27 15:20     ` Robin Murphy
  -1 siblings, 0 replies; 156+ messages in thread
From: Robin Murphy @ 2017-11-27 15:20 UTC (permalink / raw)
  To: Andrea Reale, linux-arm-kernel
  Cc: mark.rutland, realean2, mhocko, m.bielski, scott.branden,
	catalin.marinas, will.deacon, linux-kernel, linux-mm, arunks,
	qiuxishi

On 23/11/17 11:14, Andrea Reale wrote:
> When hot-removing memory we need to free vmemmap memory.

What problems arise if we don't? Is it only for the sake of freeing up 
some pages here and there, or is there something more fundamental?

> However, depending on the memory being removed, it might
> not always be possible to free a full vmemmap page / huge-page
> because part of it might still be used.
> 
> Commit ae9aae9eda2d ("memory-hotplug: common APIs to support page tables
> hot-remove") introduced a workaround for x86
> hot-remove, by which partially unused areas are filled with
> the 0xFD constant. Full pages are only removed when fully
> filled by 0xFDs.
> 
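
For readers unfamiliar with the 0xFD workaround described in the quoted
text, a rough userspace model in C follows. This is a sketch under
assumptions, not the x86 implementation: the 4096-byte page size and the
model_*() and MODEL_* names are made up for illustration. The idea is that
each freed sub-range of a vmemmap page is overwritten with the filler byte,
and the page is released only once every byte holds the filler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE	4096u	/* assumed page size for this sketch */
#define MODEL_FILLER	0xFD	/* filler byte named in the commit message */

/* Mark [off, off + len) of a vmemmap page as no longer used. */
static void model_mark_unused(unsigned char *page, size_t off, size_t len)
{
	assert(page != NULL);
	assert(off <= MODEL_PAGE_SIZE && len <= MODEL_PAGE_SIZE - off);
	memset(page + off, MODEL_FILLER, len);
}

/* The page may be freed only once every byte holds the filler. */
static bool model_page_fully_unused(const unsigned char *page)
{
	size_t i;

	for (i = 0; i < MODEL_PAGE_SIZE; i++)
		if (page[i] != MODEL_FILLER)
			return false;
	return true;
}
```

The model also shows why the approach is fragile: the "unused" state lives in
the data itself, so live contents that happen to equal the filler cannot be
told apart from freed ones.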
> This commit introduces a MEMBLOCK_UNUSED_VMEMMAP memblock flag, with
> the goal of using it in place of 0xFDs. For now, this will be used for
> the arm64 port of memory hot remove, but the idea is to eventually use
> the same mechanism for x86 as well.
> 
> Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
> ---
>   include/linux/memblock.h | 12 ++++++++++++
>   mm/memblock.c            | 32 ++++++++++++++++++++++++++++++++
>   2 files changed, 44 insertions(+)
> 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index bae11c7..0daec05 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -26,6 +26,9 @@ enum {
>   	MEMBLOCK_HOTPLUG	= 0x1,	/* hotpluggable region */
>   	MEMBLOCK_MIRROR		= 0x2,	/* mirrored region */
>   	MEMBLOCK_NOMAP		= 0x4,	/* don't add to kernel direct mapping */
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +	MEMBLOCK_UNUSED_VMEMMAP	= 0x8,  /* Mark VMEMAP blocks as dirty */

I'm not sure I get what "dirty" is supposed to mean in this context. 
Also, this appears to be specific to CONFIG_SPARSEMEM_VMEMMAP, whilst 
only tangentially related to CONFIG_MEMORY_HOTREMOVE, so the 
dependencies look a bit off.

In fact, now that I think about it, why does this need to be in memblock 
at all? If it is specific to sparsemem, shouldn't the section map 
already be enough to tell us what's supposed to be present or not?

Robin.

> +#endif
>   };
>   
>   struct memblock_region {
> @@ -90,6 +93,10 @@ int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
>   int memblock_mark_nomap(phys_addr_t base, phys_addr_t size);
>   int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
>   ulong choose_memblock_flags(void);
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +int memblock_mark_unused_vmemmap(phys_addr_t base, phys_addr_t size);
> +int memblock_clear_unused_vmemmap(phys_addr_t base, phys_addr_t size);
> +#endif
>   
>   /* Low level functions */
>   int memblock_add_range(struct memblock_type *type,
> @@ -182,6 +189,11 @@ static inline bool memblock_is_nomap(struct memblock_region *m)
>   	return m->flags & MEMBLOCK_NOMAP;
>   }
>   
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +bool memblock_is_vmemmap_unused_range(struct memblock_type *mt,
> +		phys_addr_t start, phys_addr_t end);
> +#endif
> +
>   #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>   int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
>   			    unsigned long  *end_pfn);
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 9120578..30d5aa4 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -809,6 +809,18 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
>   	return memblock_setclr_flag(base, size, 0, MEMBLOCK_NOMAP);
>   }
>   
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +int __init_memblock memblock_mark_unused_vmemmap(phys_addr_t base,
> +		phys_addr_t size)
> +{
> +	return memblock_setclr_flag(base, size, 1, MEMBLOCK_UNUSED_VMEMMAP);
> +}
> +int __init_memblock memblock_clear_unused_vmemmap(phys_addr_t base,
> +		phys_addr_t size)
> +{
> +	return memblock_setclr_flag(base, size, 0, MEMBLOCK_UNUSED_VMEMMAP);
> +}
> +#endif
>   /**
>    * __next_reserved_mem_region - next function for for_each_reserved_region()
>    * @idx: pointer to u64 loop variable
> @@ -1696,6 +1708,26 @@ void __init_memblock memblock_trim_memory(phys_addr_t align)
>   	}
>   }
>   
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +bool __init_memblock memblock_is_vmemmap_unused_range(struct memblock_type *mt,
> +		phys_addr_t start, phys_addr_t end)
> +{
> +	u64 i;
> +	struct memblock_region *r;
> +
> +	i = memblock_search(mt, start);
> +	r = &(mt->regions[i]);
> +	while (r->base < end) {
> +		if (!(r->flags & MEMBLOCK_UNUSED_VMEMMAP))
> +			return 0;
> +
> +		r = &(memblock.memory.regions[++i]);
> +	}
> +
> +	return 1;
> +}
> +#endif
> +
>   void __init_memblock memblock_set_current_limit(phys_addr_t limit)
>   {
>   	memblock.current_limit = limit;
> 
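
As an aside on the quoted memblock_is_vmemmap_unused_range(): it takes its
first region from mt->regions but then advances through
memblock.memory.regions, and the loop is not bounded by the region count.
The following is a minimal userspace sketch in C of a bounded variant. The
region_model and region_table_model types and the MODEL_UNUSED_VMEMMAP name
are simplified stand-ins invented for this example, not the kernel
structures, and the policy for ranges that overlap no region at all
(reported as "not unused" here) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_UNUSED_VMEMMAP 0x8u /* mirrors the flag value proposed in the patch */

/* Simplified stand-ins for struct memblock_region / struct memblock_type. */
struct region_model {
	uint64_t base;
	uint64_t size;
	unsigned int flags;
};

struct region_table_model {
	size_t cnt;
	const struct region_model *regions; /* sorted by base, non-overlapping */
};

/*
 * Returns true only if at least one region overlaps [start, end) and every
 * overlapping region carries MODEL_UNUSED_VMEMMAP.
 */
static bool model_range_is_unused(const struct region_table_model *t,
				  uint64_t start, uint64_t end)
{
	bool seen = false;
	size_t i;

	assert(t != NULL);
	for (i = 0; i < t->cnt; i++) {
		const struct region_model *r = &t->regions[i];

		if (r->base + r->size <= start)
			continue;	/* entirely below the range */
		if (r->base >= end)
			break;		/* sorted: nothing further overlaps */
		if (!(r->flags & MODEL_UNUSED_VMEMMAP))
			return false;
		seen = true;
	}
	return seen;
}
```

The walk stays inside the table that was passed in and stops at cnt, so a
range that starts outside every region simply yields false rather than
indexing past the array.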

^ permalink raw reply	[flat|nested] 156+ messages in thread

* [PATCH v2 3/5] mm: memory_hotplug: memblock to track partially removed vmemmap mem
@ 2017-11-27 15:20     ` Robin Murphy
  0 siblings, 0 replies; 156+ messages in thread
From: Robin Murphy @ 2017-11-27 15:20 UTC (permalink / raw)
  To: linux-arm-kernel

On 23/11/17 11:14, Andrea Reale wrote:
> When hot-removing memory we need to free vmemmap memory.

What problems arise if we don't? Is it only for the sake of freeing up 
some pages here and there, or is there something more fundamental?

> However, depending on the memory is being removed, it might
> not be always possible to free a full vmemmap page / huge-page
> because part of it might still be used.
> 
> Commit ae9aae9eda2d ("memory-hotplug: common APIs to support page tables
> hot-remove") introduced a workaround for x86
> hot-remove, by which partially unused areas are filled with
> the 0xFD constant. Full pages are only removed when fully
> filled by 0xFDs.
> 
> This commit introduces a MEMBLOCK_UNUSED_VMEMMAP memblock flag, with
> the goal of using it in place of 0xFDs. For now, this will be used for
> the arm64 port of memory hot remove, but the idea is to eventually use
> the same mechanism for x86 as well.
> 
> Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
> ---
>   include/linux/memblock.h | 12 ++++++++++++
>   mm/memblock.c            | 32 ++++++++++++++++++++++++++++++++
>   2 files changed, 44 insertions(+)
> 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index bae11c7..0daec05 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -26,6 +26,9 @@ enum {
>   	MEMBLOCK_HOTPLUG	= 0x1,	/* hotpluggable region */
>   	MEMBLOCK_MIRROR		= 0x2,	/* mirrored region */
>   	MEMBLOCK_NOMAP		= 0x4,	/* don't add to kernel direct mapping */
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +	MEMBLOCK_UNUSED_VMEMMAP	= 0x8,  /* Mark VMEMAP blocks as dirty */

I'm not sure I get what "dirty" is supposed to mean in this context. 
Also, this appears to be specific to CONFIG_SPARSEMEM_VMEMMAP, whilst 
only tangentially related to CONFIG_MEMORY_HOTREMOVE, so the 
dependencies look a bit off.

In fact, now that I think about it, why does this need to be in memblock 
at all? If it is specific to sparsemem, shouldn't the section map 
already be enough to tell us what's supposed to be present or not?

Robin.

> +#endif
>   };
>   
>   struct memblock_region {
> @@ -90,6 +93,10 @@ int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
>   int memblock_mark_nomap(phys_addr_t base, phys_addr_t size);
>   int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
>   ulong choose_memblock_flags(void);
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +int memblock_mark_unused_vmemmap(phys_addr_t base, phys_addr_t size);
> +int memblock_clear_unused_vmemmap(phys_addr_t base, phys_addr_t size);
> +#endif
>   
>   /* Low level functions */
>   int memblock_add_range(struct memblock_type *type,
> @@ -182,6 +189,11 @@ static inline bool memblock_is_nomap(struct memblock_region *m)
>   	return m->flags & MEMBLOCK_NOMAP;
>   }
>   
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +bool memblock_is_vmemmap_unused_range(struct memblock_type *mt,
> +		phys_addr_t start, phys_addr_t end);
> +#endif
> +
>   #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>   int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
>   			    unsigned long  *end_pfn);
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 9120578..30d5aa4 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -809,6 +809,18 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
>   	return memblock_setclr_flag(base, size, 0, MEMBLOCK_NOMAP);
>   }
>   
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +int __init_memblock memblock_mark_unused_vmemmap(phys_addr_t base,
> +		phys_addr_t size)
> +{
> +	return memblock_setclr_flag(base, size, 1, MEMBLOCK_UNUSED_VMEMMAP);
> +}
> +int __init_memblock memblock_clear_unused_vmemmap(phys_addr_t base,
> +		phys_addr_t size)
> +{
> +	return memblock_setclr_flag(base, size, 0, MEMBLOCK_UNUSED_VMEMMAP);
> +}
> +#endif
>   /**
>    * __next_reserved_mem_region - next function for for_each_reserved_region()
>    * @idx: pointer to u64 loop variable
> @@ -1696,6 +1708,26 @@ void __init_memblock memblock_trim_memory(phys_addr_t align)
>   	}
>   }
>   
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +bool __init_memblock memblock_is_vmemmap_unused_range(struct memblock_type *mt,
> +		phys_addr_t start, phys_addr_t end)
> +{
> +	u64 i;
> +	struct memblock_region *r;
> +
> +	i = memblock_search(mt, start);
> +	r = &(mt->regions[i]);
> +	while (r->base < end) {
> +		if (!(r->flags & MEMBLOCK_UNUSED_VMEMMAP))
> +			return 0;
> +
> +		r = &(mt->regions[++i]);
> +	}
> +
> +	return 1;
> +}
> +#endif
> +
>   void __init_memblock memblock_set_current_limit(phys_addr_t limit)
>   {
>   	memblock.current_limit = limit;
> 
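
For reference, the range check quoted above can be modelled in userspace
roughly as below. This is only a sketch under stated assumptions: the
sketch_ names and the flag value are hypothetical, regions are assumed to be
sorted by base and non-overlapping, and the walk is bounded by the region
count, which the quoted function does not do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_UNUSED_VMEMMAP 0x8u	/* hypothetical flag value */

struct sketch_region {
	uint64_t base;
	uint64_t size;
	unsigned int flags;
};

struct sketch_type {
	size_t cnt;
	const struct sketch_region *regions;	/* sorted, non-overlapping */
};

/*
 * Return true only if at least one region intersects [start, end) and
 * every intersecting region carries SKETCH_UNUSED_VMEMMAP.
 */
static bool sketch_range_all_flagged(const struct sketch_type *t,
				     uint64_t start, uint64_t end)
{
	bool seen = false;
	size_t i;

	assert(start < end);

	for (i = 0; i < t->cnt; i++) {
		const struct sketch_region *r = &t->regions[i];

		if (r->base + r->size <= start)
			continue;	/* entirely below the range */
		if (r->base >= end)
			break;		/* entirely above; regions are sorted */
		if (!(r->flags & SKETCH_UNUSED_VMEMMAP))
			return false;
		seen = true;
	}

	return seen;
}
```

Returning false when nothing intersects the range is a choice made for this
sketch; the quoted patch does not say what should happen in that case.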

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 4/5] mm: memory_hotplug: Add memory hotremove probe device
  2017-11-23 11:14   ` Andrea Reale
  (?)
@ 2017-11-27 15:33     ` Robin Murphy
  -1 siblings, 0 replies; 156+ messages in thread
From: Robin Murphy @ 2017-11-27 15:33 UTC (permalink / raw)
  To: Andrea Reale, linux-arm-kernel
  Cc: mark.rutland, realean2, mhocko, m.bielski, scott.branden,
	catalin.marinas, will.deacon, linux-kernel, linux-mm, arunks,
	qiuxishi

On 23/11/17 11:14, Andrea Reale wrote:
> Adding a "remove" sysfs handle that can be used to trigger
> memory hotremove manually, exactly symmetrically with
> what happens with the "probe" device for hot-add.
> 
> This is useful for architectures that do not rely on
> ACPI for memory hot-remove.

Is there a real-world use-case for this, or is it mostly just a handy 
development feature?

> Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
> ---
>   drivers/base/memory.c | 34 +++++++++++++++++++++++++++++++++-
>   1 file changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 1d60b58..8ccb67c 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -530,7 +530,36 @@ memory_probe_store(struct device *dev, struct device_attribute *attr,
>   }
>   
>   static DEVICE_ATTR(probe, S_IWUSR, NULL, memory_probe_store);
> -#endif
> +
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +static ssize_t
> +memory_remove_store(struct device *dev,
> +		struct device_attribute *attr, const char *buf, size_t count)
> +{
> +	u64 phys_addr;
> +	int nid, ret;
> +	unsigned long pages_per_block = PAGES_PER_SECTION * sections_per_block;
> +
> +	ret = kstrtoull(buf, 0, &phys_addr);
> +	if (ret)
> +		return ret;
> +
> +	if (phys_addr & ((pages_per_block << PAGE_SHIFT) - 1))
> +		return -EINVAL;
> +
> +	nid = memory_add_physaddr_to_nid(phys_addr);

This call looks a bit odd, since you're not doing a memory add. In fact, 
any memory being removed should already be fully known-about, so AFAICS 
it should be simple to get everything you need to know (including 
potentially the online status as mentioned earlier), through 'normal' 
methods, e.g. page_to_nid() or similar.

Robin.
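
As an aside, the block alignment test in the hunk above can be sketched in
userspace as follows. It is an illustration only: sketch_block_aligned() and
SKETCH_SECTION_SIZE_BITS are made-up names, and the 1 GiB section size is an
assumption, not something taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value for illustration only; the real one is per-arch. */
#define SKETCH_SECTION_SIZE_BITS 30	/* 1 GiB sections */

/*
 * Return true if phys_addr sits on a memory block boundary.
 * The block size must be a power of two for the mask test to be valid.
 */
static bool sketch_block_aligned(uint64_t phys_addr,
				 uint64_t sections_per_block)
{
	uint64_t block_size = sections_per_block << SKETCH_SECTION_SIZE_BITS;

	assert(block_size != 0 && (block_size & (block_size - 1)) == 0);

	return (phys_addr & (block_size - 1)) == 0;
}
```

With one section per block, 0x40000000 passes and 0x40001000 does not; with
two sections per block, 0x40000000 is rejected as well.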

> +	ret = lock_device_hotplug_sysfs();
> +	if (ret)
> +		return ret;
> +
> +	remove_memory(nid, phys_addr,
> +			 MIN_MEMORY_BLOCK_SIZE * sections_per_block);
> +	unlock_device_hotplug();
> +	return count;
> +}
> +static DEVICE_ATTR(remove, S_IWUSR, NULL, memory_remove_store);
> +#endif /* CONFIG_MEMORY_HOTREMOVE */
> +#endif /* CONFIG_ARCH_MEMORY_PROBE */
>   
>   #ifdef CONFIG_MEMORY_FAILURE
>   /*
> @@ -790,6 +819,9 @@ bool is_memblock_offlined(struct memory_block *mem)
>   static struct attribute *memory_root_attrs[] = {
>   #ifdef CONFIG_ARCH_MEMORY_PROBE
>   	&dev_attr_probe.attr,
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +	&dev_attr_remove.attr,
> +#endif
>   #endif
>   
>   #ifdef CONFIG_MEMORY_FAILURE
> 

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 1/5] mm: memory_hotplug: Memory hotplug (add) support for arm64
  2017-11-27 15:19     ` Robin Murphy
  (?)
@ 2017-11-27 16:39       ` Maciej Bielski
  -1 siblings, 0 replies; 156+ messages in thread
From: Maciej Bielski @ 2017-11-27 16:39 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Maciej Bielski, linux-arm-kernel, ar, mark.rutland, realean2,
	mhocko, scott.branden, catalin.marinas, will.deacon,
	linux-kernel, linux-mm, arunks, qiuxishi

Hi Robin,

Thank you for your feedback, it's highly appreciated. I have taken the liberty
of adding some comments.

Our primary goal was to get hotplug working even in a basic setup and to
publish the first working results. We then want to improve the code based on
community comments. This is a general answer to the questions about
configuration flags. The working setup is presented as a starting point, and
we do not consider it the best one by any means. The questions about
configuration, IMHO, fall into the category of agreeing on a proper setup
(defaults, dependencies), and we therefore rely strongly on the community's
experience to advise us on how it should look. So, in short, for some of the
"why is this set up in such a way" questions the simple answer is that it
worked as a first approximation. I totally agree that for a server-grade
system it should be different, and thanks a lot for sharing your opinion on
that.

On Mon, Nov 27, 2017 at 03:19:49PM +0000, Robin Murphy wrote:
> Hi Andrea,
> 
> I've also been looking at memory hotplug for arm64, from the perspective of
> enabling ZONE_DEVICE for pmem. May I ask what your use-case for this series
> is? AFAICS the real demand will be coming from server systems, which in
> practice means both ACPI and NUMA, both of which are being resoundingly
> ignored here.
> 

Eventually we aim for aarch64 server systems.

> Further review comments inline.
> 
> On 23/11/17 11:13, Maciej Bielski wrote:
> >Introduces memory hotplug functionality (hot-add) for arm64.
> >
> >Changes v1->v2:
> >- swapper pgtable updated in place on hot add, avoiding unnecessary copy:
> >   all changes are additive and non destructive.
> >
> >- stop_machine used to updated swapper on hot add, avoiding races
> >
> >- checking if pagealloc is under debug to stay coherent with mem_map
> >
> >Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
> >Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> >---
> >  arch/arm64/Kconfig           | 12 ++++++
> >  arch/arm64/configs/defconfig |  1 +
> >  arch/arm64/include/asm/mmu.h |  3 ++
> >  arch/arm64/mm/init.c         | 87 ++++++++++++++++++++++++++++++++++++++++++++
> >  arch/arm64/mm/mmu.c          | 39 ++++++++++++++++++++
> >  5 files changed, 142 insertions(+)
> >
> >diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> >index 0df64a6..c736bba 100644
> >--- a/arch/arm64/Kconfig
> >+++ b/arch/arm64/Kconfig
> >@@ -641,6 +641,14 @@ config HOTPLUG_CPU
> >  	  Say Y here to experiment with turning CPUs off and on.  CPUs
> >  	  can be controlled through /sys/devices/system/cpu.
> >+config ARCH_HAS_ADD_PAGES
> >+	def_bool y
> >+	depends on ARCH_ENABLE_MEMORY_HOTPLUG
> >+
> >+config ARCH_ENABLE_MEMORY_HOTPLUG
> >+	def_bool y
> >+    depends on !NUMA
> 
> As above, realistically this seems too limiting to be useful.
> 
> >+
> >  # Common NUMA Features
> >  config NUMA
> >  	bool "Numa Memory Allocation and Scheduler Support"
> >@@ -715,6 +723,10 @@ config ARCH_HAS_CACHE_LINE_SIZE
> >  source "mm/Kconfig"
> >+config ARCH_MEMORY_PROBE
> >+	def_bool y
> >+	depends on MEMORY_HOTPLUG
> 
> I'm particularly dubious about enabling this by default - it's useful for
> development and testing, yes, but I think it's the kind of feature where the
> onus should be on interested developers to turn it on, rather than
> production configs to have to turn it off.
> 
> >+
> >  config SECCOMP
> >  	bool "Enable seccomp to safely compute untrusted bytecode"
> >  	---help---
> >diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
> >index 34480e9..5fc5656 100644
> >--- a/arch/arm64/configs/defconfig
> >+++ b/arch/arm64/configs/defconfig
> >@@ -80,6 +80,7 @@ CONFIG_ARM64_VA_BITS_48=y
> >  CONFIG_SCHED_MC=y
> >  CONFIG_NUMA=y
> >  CONFIG_PREEMPT=y
> >+CONFIG_MEMORY_HOTPLUG=y
> 
> Note that this is effectively pointless, given two lines above...
> 
> >  CONFIG_KSM=y
> >  CONFIG_TRANSPARENT_HUGEPAGE=y
> >  CONFIG_CMA=y
> >diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> >index 0d34bf0..2b3fa4d 100644
> >--- a/arch/arm64/include/asm/mmu.h
> >+++ b/arch/arm64/include/asm/mmu.h
> >@@ -40,5 +40,8 @@ extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
> >  			       pgprot_t prot, bool page_mappings_only);
> >  extern void *fixmap_remap_fdt(phys_addr_t dt_phys);
> >  extern void mark_linear_text_alias_ro(void);
> >+#ifdef CONFIG_MEMORY_HOTPLUG
> >+extern void hotplug_paging(phys_addr_t start, phys_addr_t size);
> 
> Is there any reason for not just implementing all the hotplug code
> self-contained in mmu.c?
> 

Simply, in the first version we were supposed to build on top of the patch by
Scott Branden, who put a mock implementation of arch_add_memory() in
arch/arm64/mm/init.c; this is why hotplug_paging() needed to be declared
outside mmu.c. Looking quickly at the code now, I agree that it would be
cleaner to put everything in arch/arm64/mm/mmu.c. I will test that.

> >+#endif
> >  #endif
> >diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> >index 5960bef..e96e7d3 100644
> >--- a/arch/arm64/mm/init.c
> >+++ b/arch/arm64/mm/init.c
> >@@ -722,3 +722,90 @@ static int __init register_mem_limit_dumper(void)
> >  	return 0;
> >  }
> >  __initcall(register_mem_limit_dumper);
> >+
> >+#ifdef CONFIG_MEMORY_HOTPLUG
> >+int add_pages(int nid, unsigned long start_pfn,
> >+		unsigned long nr_pages, bool want_memblock)
> >+{
> >+	int ret;
> >+	u64 start_addr = start_pfn << PAGE_SHIFT;
> >+	/*
> >+	 * Mark the first page in the range as unusable. This is needed
> >+	 * because __add_section (within __add_pages) wants pfn_valid
> >+	 * of it to be false, and in arm64 pfn_valid is implemented by
> >+	 * just checking the nomap flag for existing blocks.
> >+	 *
> >+	 * A small trick here is that __add_section() requires only
> >+	 * phys_start_pfn (that is the first pfn of a section) to be
> >+	 * invalid. Regardless of whether it was assumed (by the function
> >+	 * author) that all pfns within a section are either all valid
> >+	 * or all invalid, it allows to avoid looping twice (once here,
> >+	 * second when memblock_clear_nomap() is called) through all
> >+	 * pfns of the section and modify only one pfn. Thanks to that,
> >+	 * further, in __add_zone() only this very first pfn is skipped
> >+	 * and corresponding page is not flagged reserved. Therefore it
> >+	 * is enough to correct this setup only for it.
> >+	 *
> >+	 * When arch_add_memory() returns the walk_memory_range() function
> >+	 * is called and passed with online_memory_block() callback,
> >+	 * which execution finally reaches the memory_block_action()
> >+	 * function, where also only the first pfn of a memory block is
> >+	 * checked to be reserved. Above, it was first pfn of a section,
> >+	 * here it is a block but
> >+	 * (drivers/base/memory.c):
> >+	 *     sections_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
> >+	 * (include/linux/memory.h):
> >+	 *     #define MIN_MEMORY_BLOCK_SIZE     (1UL << SECTION_SIZE_BITS)
> >+	 * so we can consider block and section equivalently
> >+	 */
> >+	memblock_mark_nomap(start_addr, 1<<PAGE_SHIFT);
> >+	ret = __add_pages(nid, start_pfn, nr_pages, want_memblock);
> >+
> >+	/*
> >+	 * Make the pages usable after they have been added.
> >+	 * This will make pfn_valid return true
> >+	 */
> >+	memblock_clear_nomap(start_addr, 1<<PAGE_SHIFT);
> >+
> >+	/*
> >+	 * This is a hack to avoid having to mix arch specific code
> >+	 * into arch independent code. SetPageReserved is supposed
> >+	 * to be called by __add_zone (within __add_section, within
> >+	 * __add_pages). However, when it is called there, it assumes that
> >+	 * pfn_valid returns true.  For the way pfn_valid is implemented
> >+	 * in arm64 (a check on the nomap flag), the only way to make
> >+	 * this evaluate true inside __add_zone is to clear the nomap
> >+	 * flags of blocks in architecture independent code.
> >+	 *
> >+	 * To avoid this, we set the Reserved flag here after we cleared
> >+	 * the nomap flag in the line above.
> >+	 */
> >+	SetPageReserved(pfn_to_page(start_pfn));
> 
> This whole business is utterly horrible. I really think we need to revisit
> why arm64 isn't using the normal sparsemem pfn_valid() implementation. If
> there are callers misusing pfn_valid() where they really want page_is_ram()
> or similar, or missing further pfn_valid_within() checks, then it's surely
> time to fix those at the source rather than adding to the Jenga pile of
> hacks in this area. I've started digging into it myself, but don't have any
> answers yet.
> 

I fully agree, and this is exactly the reaction we hoped for. We just decided to
avoid opening too many fronts at the same time, and we were also not
completely sure what exactly pfn_valid() is supposed to be used for and what
we could potentially break. We are looking forward to your findings here.
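
To make the discussion a bit more concrete, here is a small userspace model of
the behaviour the quoted comments rely on. It assumes, as those comments
describe, that pfn_valid() on arm64 boils down to "the address is covered by a
memblock region that is not flagged nomap". All toy_ names and the flag value
are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_NOMAP 0x4u		/* arbitrary flag value for the model */

struct toy_region {
	uint64_t base;
	uint64_t size;
	unsigned int flags;
};

/*
 * Model of a nomap-based pfn_valid(): a pfn is valid only if its address
 * falls inside some region and that region is not marked TOY_NOMAP.
 */
static bool toy_pfn_valid(const struct toy_region *regs, size_t cnt,
			  uint64_t pfn)
{
	uint64_t addr = pfn << TOY_PAGE_SHIFT;
	size_t i;

	assert(regs != NULL);

	for (i = 0; i < cnt; i++) {
		if (addr >= regs[i].base && addr - regs[i].base < regs[i].size)
			return !(regs[i].flags & TOY_NOMAP);
	}

	return false;
}
```

In this model, marking only the first page of a new section as nomap makes
just that one pfn invalid; clearing the flag afterwards makes it valid again,
which mirrors the mark/clear sequence in the quoted add_pages().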

> >+
> >+	return ret;
> >+}
> >+
> >+int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
> >+{
> >+	int ret;
> >+	unsigned long start_pfn = start >> PAGE_SHIFT;
> >+	unsigned long nr_pages = size >> PAGE_SHIFT;
> >+	unsigned long end_pfn = start_pfn + nr_pages;
> >+	unsigned long max_sparsemem_pfn = 1UL << (MAX_PHYSMEM_BITS-PAGE_SHIFT);
> >+
> >+	if (end_pfn > max_sparsemem_pfn) {
> >+		pr_err("end_pfn too big");
> >+		return -1;
> >+	}
> >+	hotplug_paging(start, size);
> >+
> >+	ret = add_pages(nid, start_pfn, nr_pages, want_memblock);
> >+
> >+	if (ret)
> >+		pr_warn("%s: Problem encountered in __add_pages() ret=%d\n",
> >+			__func__, ret);
> >+
> >+	return ret;
> >+}
> >+
> >+#endif /* CONFIG_MEMORY_HOTPLUG */
> >diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> >index f1eb15e..d93043d 100644
> >--- a/arch/arm64/mm/mmu.c
> >+++ b/arch/arm64/mm/mmu.c
> >@@ -28,6 +28,7 @@
> >  #include <linux/mman.h>
> >  #include <linux/nodemask.h>
> >  #include <linux/memblock.h>
> >+#include <linux/stop_machine.h>
> >  #include <linux/fs.h>
> >  #include <linux/io.h>
> >  #include <linux/mm.h>
> >@@ -615,6 +616,44 @@ void __init paging_init(void)
> >  		      SWAPPER_DIR_SIZE - PAGE_SIZE);
> >  }
> >+#ifdef CONFIG_MEMORY_HOTPLUG
> >+
> >+/*
> >+ * hotplug_paging() is used by memory hotplug to build new page tables
> >+ * for hot added memory.
> >+ */
> >+
> >+struct mem_range {
> >+	phys_addr_t base;
> >+	phys_addr_t size;
> >+};
> >+
> >+static int __hotplug_paging(void *data)
> >+{
> >+	int flags = 0;
> >+	struct mem_range *section = data;
> >+
> >+	if (debug_pagealloc_enabled())
> >+		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> >+
> >+	__create_pgd_mapping(swapper_pg_dir, section->base,
> >+			__phys_to_virt(section->base), section->size,
> >+			PAGE_KERNEL, pgd_pgtable_alloc, flags);
> >+
> >+	return 0;
> >+}
> >+
> >+inline void hotplug_paging(phys_addr_t start, phys_addr_t size)
> >+{
> >+	struct mem_range section = {
> >+		.base = start,
> >+		.size = size,
> >+	};
> >+
> >+	stop_machine(__hotplug_paging, &section, NULL);
> 
> Why exactly do we need to swing the stop_machine() hammer here? I appreciate
> that separate hotplug events for adjacent sections could potentially affect
> the same top-level entry in swapper_pg_dir, but those should already be
> serialised by the hotplug lock - who else has cause to modify non-leaf
> entries for the linear map at runtime in a manner which might conflict?
> 

The reason for this was given by Mark Rutland in the previous spin
(https://lkml.org/lkml/2017/4/11/582); please let us know if you have a
different point of view.


BR,
Maciej Bielski

> Robin.
> 
> >+}
> >+#endif /* CONFIG_MEMORY_HOTPLUG */
> >+
> >  /*
> >   * Check whether a kernel address is valid (derived from arch/x86/).
> >   */
> >

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 1/5] mm: memory_hotplug: Memory hotplug (add) support for arm64
@ 2017-11-27 16:39       ` Maciej Bielski
  0 siblings, 0 replies; 156+ messages in thread
From: Maciej Bielski @ 2017-11-27 16:39 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Maciej Bielski, linux-arm-kernel, ar, mark.rutland, realean2,
	mhocko, scott.branden, catalin.marinas, will.deacon,
	linux-kernel, linux-mm, arunks, qiuxishi

Hi Robin,

Thank you for your feedback, its highly appreciated. I let myself to add some
comments.

Our primary goal was to have hotplug working even in the basic setup and
publish first working results. Then we want to improve the code building on top
of community comments. This is a general answer for questions about
configuration flags. The working setup is presented, a bit as a hint, and we do
not deem it to be ultimately best at all. The questions about configuration,
IMHO, falls into category of making an agreement on a proper setup (defaults,
dependencies) and, therefore, we strongly rely on the community experience to
advise us how it should be. So, shortly, for some questions "why this is setup
in such a way" the simple anser is that it worked as a first approximation.
Then, I totally agree that for a server-grade system it should be different and
thanks a lot for sharing your opinion on that.

On Mon, Nov 27, 2017 at 03:19:49PM +0000, Robin Murphy wrote:
> Hi Andrea,
> 
> I've also been looking at memory hotplug for arm64, from the perspective of
> enabling ZONE_DEVICE for pmem. May I ask what your use-case for this series
> is? AFAICS the real demand will be coming from server systems, which in
> practice means both ACPI and NUMA, both of which are being resoundingly
> ignored here.
> 

Eventually we aim for aarch64 server system.

> Further review comments inline.
> 
> On 23/11/17 11:13, Maciej Bielski wrote:
> >Introduces memory hotplug functionality (hot-add) for arm64.
> >
> >Changes v1->v2:
> >- swapper pgtable updated in place on hot add, avoiding unnecessary copy:
> >   all changes are additive and non destructive.
> >
> >- stop_machine used to updated swapper on hot add, avoiding races
> >
> >- checking if pagealloc is under debug to stay coherent with mem_map
> >
> >Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
> >Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> >---
> >  arch/arm64/Kconfig           | 12 ++++++
> >  arch/arm64/configs/defconfig |  1 +
> >  arch/arm64/include/asm/mmu.h |  3 ++
> >  arch/arm64/mm/init.c         | 87 ++++++++++++++++++++++++++++++++++++++++++++
> >  arch/arm64/mm/mmu.c          | 39 ++++++++++++++++++++
> >  5 files changed, 142 insertions(+)
> >
> >diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> >index 0df64a6..c736bba 100644
> >--- a/arch/arm64/Kconfig
> >+++ b/arch/arm64/Kconfig
> >@@ -641,6 +641,14 @@ config HOTPLUG_CPU
> >  	  Say Y here to experiment with turning CPUs off and on.  CPUs
> >  	  can be controlled through /sys/devices/system/cpu.
> >+config ARCH_HAS_ADD_PAGES
> >+	def_bool y
> >+	depends on ARCH_ENABLE_MEMORY_HOTPLUG
> >+
> >+config ARCH_ENABLE_MEMORY_HOTPLUG
> >+	def_bool y
> >+    depends on !NUMA
> 
> As above, realistically this seems too limiting to be useful.
> 
> >+
> >  # Common NUMA Features
> >  config NUMA
> >  	bool "Numa Memory Allocation and Scheduler Support"
> >@@ -715,6 +723,10 @@ config ARCH_HAS_CACHE_LINE_SIZE
> >  source "mm/Kconfig"
> >+config ARCH_MEMORY_PROBE
> >+	def_bool y
> >+	depends on MEMORY_HOTPLUG
> 
> I'm particularly dubious about enabling this by default - it's useful for
> development and testing, yes, but I think it's the kind of feature where the
> onus should be on interested developers to turn it on, rather than
> production configs to have to turn it off.
> 
> >+
> >  config SECCOMP
> >  	bool "Enable seccomp to safely compute untrusted bytecode"
> >  	---help---
> >diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
> >index 34480e9..5fc5656 100644
> >--- a/arch/arm64/configs/defconfig
> >+++ b/arch/arm64/configs/defconfig
> >@@ -80,6 +80,7 @@ CONFIG_ARM64_VA_BITS_48=y
> >  CONFIG_SCHED_MC=y
> >  CONFIG_NUMA=y
> >  CONFIG_PREEMPT=y
> >+CONFIG_MEMORY_HOTPLUG=y
> 
> Note that this is effectively pointless, given two lines above...
> 
> >  CONFIG_KSM=y
> >  CONFIG_TRANSPARENT_HUGEPAGE=y
> >  CONFIG_CMA=y
> >diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> >index 0d34bf0..2b3fa4d 100644
> >--- a/arch/arm64/include/asm/mmu.h
> >+++ b/arch/arm64/include/asm/mmu.h
> >@@ -40,5 +40,8 @@ extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
> >  			       pgprot_t prot, bool page_mappings_only);
> >  extern void *fixmap_remap_fdt(phys_addr_t dt_phys);
> >  extern void mark_linear_text_alias_ro(void);
> >+#ifdef CONFIG_MEMORY_HOTPLUG
> >+extern void hotplug_paging(phys_addr_t start, phys_addr_t size);
> 
> Is there any reason for not just implementing all the hotplug code
> self-contained in mmu.c?
> 

Simply, in the first version we were supposed to built on top of the patch by
Scott Branden, who put a mock implementation of arch_add_memory() in
arch/arm64/mm/init.c, this is why hotplug_paging() needed to be announced
outside. Quickly looking on the code now I agree that it would be more clean to
put everything in arch/arm64/mm/mmu.c. I will test that.

> >+#endif
> >  #endif
> >diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> >index 5960bef..e96e7d3 100644
> >--- a/arch/arm64/mm/init.c
> >+++ b/arch/arm64/mm/init.c
> >@@ -722,3 +722,90 @@ static int __init register_mem_limit_dumper(void)
> >  	return 0;
> >  }
> >  __initcall(register_mem_limit_dumper);
> >+
> >+#ifdef CONFIG_MEMORY_HOTPLUG
> >+int add_pages(int nid, unsigned long start_pfn,
> >+		unsigned long nr_pages, bool want_memblock)
> >+{
> >+	int ret;
> >+	u64 start_addr = start_pfn << PAGE_SHIFT;
> >+	/*
> >+	 * Mark the first page in the range as unusable. This is needed
> >+	 * because __add_section (within __add_pages) wants pfn_valid
> >+	 * of it to be false, and in arm64 pfn falid is implemented by
> >+	 * just checking at the nomap flag for existing blocks.
> >+	 *
> >+	 * A small trick here is that __add_section() requires only
> >+	 * phys_start_pfn (that is the first pfn of a section) to be
> >+	 * invalid. Regardless of whether it was assumed (by the function
> >+	 * author) that all pfns within a section are either all valid
> >+	 * or all invalid, it allows to avoid looping twice (once here,
> >+	 * second when memblock_clear_nomap() is called) through all
> >+	 * pfns of the section and modify only one pfn. Thanks to that,
> >+	 * further, in __add_zone() only this very first pfn is skipped
> >+	 * and corresponding page is not flagged reserved. Therefore it
> >+	 * is enough to correct this setup only for it.
> >+	 *
> >+	 * When arch_add_memory() returns the walk_memory_range() function
> >+	 * is called and passed with online_memory_block() callback,
> >+	 * which execution finally reaches the memory_block_action()
> >+	 * function, where also only the first pfn of a memory block is
> >+	 * checked to be reserved. Above, it was first pfn of a section,
> >+	 * here it is a block but
> >+	 * (drivers/base/memory.c):
> >+	 *     sections_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
> >+	 * (include/linux/memory.h):
> >+	 *     #define MIN_MEMORY_BLOCK_SIZE     (1UL << SECTION_SIZE_BITS)
> >+	 * so we can consider block and section equivalently
> >+	 */
> >+	memblock_mark_nomap(start_addr, 1<<PAGE_SHIFT);
> >+	ret = __add_pages(nid, start_pfn, nr_pages, want_memblock);
> >+
> >+	/*
> >+	 * Make the pages usable after they have been added.
> >+	 * This will make pfn_valid return true
> >+	 */
> >+	memblock_clear_nomap(start_addr, 1<<PAGE_SHIFT);
> >+
> >+	/*
> >+	 * This is a hack to avoid having to mix arch specific code
> >+	 * into arch independent code. SetPageReserved is supposed
> >+	 * to be called by __add_zone (within __add_section, within
> >+	 * __add_pages). However, when it is called there, it assumes that
> >+	 * pfn_valid returns true.  For the way pfn_valid is implemented
> >+	 * in arm64 (a check on the nomap flag), the only way to make
> >+	 * this evaluate true inside __add_zone is to clear the nomap
> >+	 * flags of blocks in architecture independent code.
> >+	 *
> >+	 * To avoid this, we set the Reserved flag here after we cleared
> >+	 * the nomap flag in the line above.
> >+	 */
> >+	SetPageReserved(pfn_to_page(start_pfn));
> 
> This whole business is utterly horrible. I really think we need to revisit
> why arm64 isn't using the normal sparsemem pfn_valid() implementation. If
> there are callers misusing pfn_valid() where they really want page_is_ram()
> or similar, or missing further pfn_valid_within() checks, then it's surely
> time to fix those at the source rather than adding to the Jenga pile of
> hacks in this area. I've started digging into it myself, but don't have any
> answers yet.
> 

I fully agree and this is the exact reaction we hoped for. We just decided to
avoid opening too many fronts at the same time, also that we were not
completely sure what exactly the pfn_valid() is supposed to serve for and what
we can potentially break. We are looking for your findings here.

> >+
> >+	return ret;
> >+}
> >+
> >+int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
> >+{
> >+	int ret;
> >+	unsigned long start_pfn = start >> PAGE_SHIFT;
> >+	unsigned long nr_pages = size >> PAGE_SHIFT;
> >+	unsigned long end_pfn = start_pfn + nr_pages;
> >+	unsigned long max_sparsemem_pfn = 1UL << (MAX_PHYSMEM_BITS-PAGE_SHIFT);
> >+
> >+	if (end_pfn > max_sparsemem_pfn) {
> >+		pr_err("end_pfn too big");
> >+		return -1;
> >+	}
> >+	hotplug_paging(start, size);
> >+
> >+	ret = add_pages(nid, start_pfn, nr_pages, want_memblock);
> >+
> >+	if (ret)
> >+		pr_warn("%s: Problem encountered in __add_pages() ret=%d\n",
> >+			__func__, ret);
> >+
> >+	return ret;
> >+}
> >+
> >+#endif /* CONFIG_MEMORY_HOTPLUG */
> >diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> >index f1eb15e..d93043d 100644
> >--- a/arch/arm64/mm/mmu.c
> >+++ b/arch/arm64/mm/mmu.c
> >@@ -28,6 +28,7 @@
> >  #include <linux/mman.h>
> >  #include <linux/nodemask.h>
> >  #include <linux/memblock.h>
> >+#include <linux/stop_machine.h>
> >  #include <linux/fs.h>
> >  #include <linux/io.h>
> >  #include <linux/mm.h>
> >@@ -615,6 +616,44 @@ void __init paging_init(void)
> >  		      SWAPPER_DIR_SIZE - PAGE_SIZE);
> >  }
> >+#ifdef CONFIG_MEMORY_HOTPLUG
> >+
> >+/*
> >+ * hotplug_paging() is used by memory hotplug to build new page tables
> >+ * for hot added memory.
> >+ */
> >+
> >+struct mem_range {
> >+	phys_addr_t base;
> >+	phys_addr_t size;
> >+};
> >+
> >+static int __hotplug_paging(void *data)
> >+{
> >+	int flags = 0;
> >+	struct mem_range *section = data;
> >+
> >+	if (debug_pagealloc_enabled())
> >+		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> >+
> >+	__create_pgd_mapping(swapper_pg_dir, section->base,
> >+			__phys_to_virt(section->base), section->size,
> >+			PAGE_KERNEL, pgd_pgtable_alloc, flags);
> >+
> >+	return 0;
> >+}
> >+
> >+inline void hotplug_paging(phys_addr_t start, phys_addr_t size)
> >+{
> >+	struct mem_range section = {
> >+		.base = start,
> >+		.size = size,
> >+	};
> >+
> >+	stop_machine(__hotplug_paging, &section, NULL);
> 
> Why exactly do we need to swing the stop_machine() hammer here? I appreciate
> that separate hotplug events for adjacent sections could potentially affect
> the same top-level entry in swapper_pg_dir, but those should already be
> serialised by the hotplug lock - who else has cause to modify non-leaf
> entries for the linear map at runtime in a manner which might conflict?
> 

The reason for this was given by Mark Rutland in the previous spin
(https://lkml.org/lkml/2017/4/11/582). Please let us know if you have a
different point of view.


BR,
Maciej Bielski

> Robin.
> 
> >+}
> >+#endif /* CONFIG_MEMORY_HOTPLUG */
> >+
> >  /*
> >   * Check whether a kernel address is valid (derived from arch/x86/).
> >   */
> >

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 1/5] mm: memory_hotplug: Memory hotplug (add) support for arm64
  2017-11-27 16:39       ` Maciej Bielski
  (?)
@ 2017-11-27 17:11         ` Andrea Reale
  -1 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-11-27 17:11 UTC (permalink / raw)
  To: Maciej Bielski
  Cc: Robin Murphy, linux-arm-kernel, mark.rutland, realean2, mhocko,
	scott.branden, catalin.marinas, will.deacon, linux-kernel,
	linux-mm, arunks, qiuxishi

On Mon 27 Nov 2017, 17:39, Maciej Bielski wrote:

Hi Robin,

> Hi Robin,
> 
> Thank you for your feedback, it's highly appreciated. I took the liberty of
> adding some comments.
> 
> Our primary goal was to get hotplug working, even in a basic setup, and to
> publish first working results. We then want to improve the code based on
> community comments. This is a general answer to the questions about
> configuration flags. The working setup is presented more as a hint, and we do
> not consider it to be the best possible one. The questions about configuration,
> IMHO, fall into the category of agreeing on a proper setup (defaults,
> dependencies), and we therefore rely strongly on the community's experience to
> advise us on how it should look. So, in short, for some questions of the form
> "why is this set up in such a way", the simple answer is that it worked as a
> first approximation. I totally agree that it should be different for a
> server-grade system, and thanks a lot for sharing your opinion on that.
> 
> On Mon, Nov 27, 2017 at 03:19:49PM +0000, Robin Murphy wrote:
> > Hi Andrea,
> > 
> > I've also been looking at memory hotplug for arm64, from the perspective of
> > enabling ZONE_DEVICE for pmem. May I ask what your use-case for this series
> > is? AFAICS the real demand will be coming from server systems, which in
> > practice means both ACPI and NUMA, both of which are being resoundingly
> > ignored here.
> > 
> 
> Eventually we are aiming for aarch64 server systems.
> 

Adding to what Maciej said: the original motivation and driving factor
for this development effort is this project: http://www.dredbox.eu

In short, we have a software-defined interconnect for disaggregated
memory, where memory can be attached to nodes dynamically, under software
control. On each reconfiguration, we need to hot-add and hot-remove memory
from running kernels. Our current research prototype is based on an
arm64 SoC+FPGA system, hence memory hotplug for arm64.
Since the triggers for hot-add and hot-remove are software, we do not need
ACPI. In our specific case, memory topologies can change dynamically, so
we have rather ad hoc, project-specific NUMA support that, we believe,
does not make sense to discuss for mainlining.
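
As an illustration of a software trigger for hot-add, the sketch below writes
a physical address to the sysfs probe file that CONFIG_ARCH_MEMORY_PROBE
exposes. It is a minimal sketch, not the tooling used in the project: the
helper names write_phys_addr() and probe_memory() are hypothetical, and it
assumes the address is aligned to the memory block size and that the caller
has sufficient privileges.

```c
#include <stdio.h>

/* Probe file; present only when CONFIG_ARCH_MEMORY_PROBE is enabled. */
#define MEMORY_PROBE_PATH "/sys/devices/system/memory/probe"

/*
 * Write a physical address in hex to a sysfs-style file.
 * Returns 0 on success, -1 on failure. (Hypothetical helper.)
 */
static int write_phys_addr(const char *path, unsigned long long phys_addr)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;

	ok = fprintf(f, "0x%llx\n", phys_addr) > 0;

	/* Write errors may only surface when the stream is flushed. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}

/* Ask the kernel to hot-add the memory block starting at phys_addr. */
static int probe_memory(unsigned long long phys_addr)
{
	return write_phys_addr(MEMORY_PROBE_PATH, phys_addr);
}
```

After a successful probe, the new block still has to be onlined, for example
by writing "online" to the state file of the corresponding
/sys/devices/system/memory/memoryN directory.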

> > Further review comments inline.
> > 
> > On 23/11/17 11:13, Maciej Bielski wrote:
> > >Introduces memory hotplug functionality (hot-add) for arm64.
> > >
> > >Changes v1->v2:
> > >- swapper pgtable updated in place on hot add, avoiding unnecessary copy:
> > >   all changes are additive and non destructive.
> > >
> > >- stop_machine used to updated swapper on hot add, avoiding races
> > >
> > >- checking if pagealloc is under debug to stay coherent with mem_map
> > >
> > >Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
> > >Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> > >---
> > >  arch/arm64/Kconfig           | 12 ++++++
> > >  arch/arm64/configs/defconfig |  1 +
> > >  arch/arm64/include/asm/mmu.h |  3 ++
> > >  arch/arm64/mm/init.c         | 87 ++++++++++++++++++++++++++++++++++++++++++++
> > >  arch/arm64/mm/mmu.c          | 39 ++++++++++++++++++++
> > >  5 files changed, 142 insertions(+)
> > >
> > >diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > >index 0df64a6..c736bba 100644
> > >--- a/arch/arm64/Kconfig
> > >+++ b/arch/arm64/Kconfig
> > >@@ -641,6 +641,14 @@ config HOTPLUG_CPU
> > >  	  Say Y here to experiment with turning CPUs off and on.  CPUs
> > >  	  can be controlled through /sys/devices/system/cpu.
> > >+config ARCH_HAS_ADD_PAGES
> > >+	def_bool y
> > >+	depends on ARCH_ENABLE_MEMORY_HOTPLUG
> > >+
> > >+config ARCH_ENABLE_MEMORY_HOTPLUG
> > >+	def_bool y
> > >+    depends on !NUMA
> > 
> > As above, realistically this seems too limiting to be useful.
> > 
> > >+
> > >  # Common NUMA Features
> > >  config NUMA
> > >  	bool "Numa Memory Allocation and Scheduler Support"
> > >@@ -715,6 +723,10 @@ config ARCH_HAS_CACHE_LINE_SIZE
> > >  source "mm/Kconfig"
> > >+config ARCH_MEMORY_PROBE
> > >+	def_bool y
> > >+	depends on MEMORY_HOTPLUG
> > 
> > I'm particularly dubious about enabling this by default - it's useful for
> > development and testing, yes, but I think it's the kind of feature where the
> > onus should be on interested developers to turn it on, rather than
> > production configs to have to turn it off.
> > 
> > >+
> > >  config SECCOMP
> > >  	bool "Enable seccomp to safely compute untrusted bytecode"
> > >  	---help---
> > >diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
> > >index 34480e9..5fc5656 100644
> > >--- a/arch/arm64/configs/defconfig
> > >+++ b/arch/arm64/configs/defconfig
> > >@@ -80,6 +80,7 @@ CONFIG_ARM64_VA_BITS_48=y
> > >  CONFIG_SCHED_MC=y
> > >  CONFIG_NUMA=y
> > >  CONFIG_PREEMPT=y
> > >+CONFIG_MEMORY_HOTPLUG=y
> > 
> > Note that this is effectively pointless, given two lines above...
> > 

Well spotted, thanks :) 

> > >  CONFIG_KSM=y
> > >  CONFIG_TRANSPARENT_HUGEPAGE=y
> > >  CONFIG_CMA=y
> > >diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> > >index 0d34bf0..2b3fa4d 100644
> > >--- a/arch/arm64/include/asm/mmu.h
> > >+++ b/arch/arm64/include/asm/mmu.h
> > >@@ -40,5 +40,8 @@ extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
> > >  			       pgprot_t prot, bool page_mappings_only);
> > >  extern void *fixmap_remap_fdt(phys_addr_t dt_phys);
> > >  extern void mark_linear_text_alias_ro(void);
> > >+#ifdef CONFIG_MEMORY_HOTPLUG
> > >+extern void hotplug_paging(phys_addr_t start, phys_addr_t size);
> > 
> > Is there any reason for not just implementing all the hotplug code
> > self-contained in mmu.c?
> > 
> 
> Simply put, the first version was meant to build on top of the patch by
> Scott Branden, who put a mock implementation of arch_add_memory() in
> arch/arm64/mm/init.c; this is why hotplug_paging() needed to be declared
> outside. Looking at the code again, I agree that it would be cleaner to
> put everything in arch/arm64/mm/mmu.c. I will test that.
> 
> > >+#endif
> > >  #endif
> > >diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > >index 5960bef..e96e7d3 100644
> > >--- a/arch/arm64/mm/init.c
> > >+++ b/arch/arm64/mm/init.c
> > >@@ -722,3 +722,90 @@ static int __init register_mem_limit_dumper(void)
> > >  	return 0;
> > >  }
> > >  __initcall(register_mem_limit_dumper);
> > >+
> > >+#ifdef CONFIG_MEMORY_HOTPLUG
> > >+int add_pages(int nid, unsigned long start_pfn,
> > >+		unsigned long nr_pages, bool want_memblock)
> > >+{
> > >+	int ret;
> > >+	u64 start_addr = start_pfn << PAGE_SHIFT;
> > >+	/*
> > >+	 * Mark the first page in the range as unusable. This is needed
> > >+	 * because __add_section (within __add_pages) wants pfn_valid
> > >+	 * of it to be false, and in arm64 pfn falid is implemented by
> > >+	 * just checking at the nomap flag for existing blocks.
> > >+	 *
> > >+	 * A small trick here is that __add_section() requires only
> > >+	 * phys_start_pfn (that is the first pfn of a section) to be
> > >+	 * invalid. Regardless of whether it was assumed (by the function
> > >+	 * author) that all pfns within a section are either all valid
> > >+	 * or all invalid, it allows to avoid looping twice (once here,
> > >+	 * second when memblock_clear_nomap() is called) through all
> > >+	 * pfns of the section and modify only one pfn. Thanks to that,
> > >+	 * further, in __add_zone() only this very first pfn is skipped
> > >+	 * and corresponding page is not flagged reserved. Therefore it
> > >+	 * is enough to correct this setup only for it.
> > >+	 *
> > >+	 * When arch_add_memory() returns the walk_memory_range() function
> > >+	 * is called and passed with online_memory_block() callback,
> > >+	 * which execution finally reaches the memory_block_action()
> > >+	 * function, where also only the first pfn of a memory block is
> > >+	 * checked to be reserved. Above, it was first pfn of a section,
> > >+	 * here it is a block but
> > >+	 * (drivers/base/memory.c):
> > >+	 *     sections_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
> > >+	 * (include/linux/memory.h):
> > >+	 *     #define MIN_MEMORY_BLOCK_SIZE     (1UL << SECTION_SIZE_BITS)
> > >+	 * so we can consider block and section equivalently
> > >+	 */
> > >+	memblock_mark_nomap(start_addr, 1<<PAGE_SHIFT);
> > >+	ret = __add_pages(nid, start_pfn, nr_pages, want_memblock);
> > >+
> > >+	/*
> > >+	 * Make the pages usable after they have been added.
> > >+	 * This will make pfn_valid return true
> > >+	 */
> > >+	memblock_clear_nomap(start_addr, 1<<PAGE_SHIFT);
> > >+
> > >+	/*
> > >+	 * This is a hack to avoid having to mix arch specific code
> > >+	 * into arch independent code. SetPageReserved is supposed
> > >+	 * to be called by __add_zone (within __add_section, within
> > >+	 * __add_pages). However, when it is called there, it assumes that
> > >+	 * pfn_valid returns true.  For the way pfn_valid is implemented
> > >+	 * in arm64 (a check on the nomap flag), the only way to make
> > >+	 * this evaluate true inside __add_zone is to clear the nomap
> > >+	 * flags of blocks in architecture independent code.
> > >+	 *
> > >+	 * To avoid this, we set the Reserved flag here after we cleared
> > >+	 * the nomap flag in the line above.
> > >+	 */
> > >+	SetPageReserved(pfn_to_page(start_pfn));
> > 
> > This whole business is utterly horrible. I really think we need to revisit
> > why arm64 isn't using the normal sparsemem pfn_valid() implementation. If
> > there are callers misusing pfn_valid() where they really want page_is_ram()
> > or similar, or missing further pfn_valid_within() checks, then it's surely
> > time to fix those at the source rather than adding to the Jenga pile of
> > hacks in this area. I've started digging into it myself, but don't have any
> > answers yet.
> > 
> 
> I fully agree, and this is exactly the reaction we hoped for. We decided to
> avoid opening too many fronts at the same time, and we were also not
> completely sure what exactly pfn_valid() is supposed to be used for and what
> we could potentially break. We are looking forward to your findings here.
> 
> > >+
> > >+	return ret;
> > >+}
> > >+
> > >+int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
> > >+{
> > >+	int ret;
> > >+	unsigned long start_pfn = start >> PAGE_SHIFT;
> > >+	unsigned long nr_pages = size >> PAGE_SHIFT;
> > >+	unsigned long end_pfn = start_pfn + nr_pages;
> > >+	unsigned long max_sparsemem_pfn = 1UL << (MAX_PHYSMEM_BITS-PAGE_SHIFT);
> > >+
> > >+	if (end_pfn > max_sparsemem_pfn) {
> > >+		pr_err("end_pfn too big");
> > >+		return -1;
> > >+	}
> > >+	hotplug_paging(start, size);
> > >+
> > >+	ret = add_pages(nid, start_pfn, nr_pages, want_memblock);
> > >+
> > >+	if (ret)
> > >+		pr_warn("%s: Problem encountered in __add_pages() ret=%d\n",
> > >+			__func__, ret);
> > >+
> > >+	return ret;
> > >+}
> > >+
> > >+#endif /* CONFIG_MEMORY_HOTPLUG */
> > >diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > >index f1eb15e..d93043d 100644
> > >--- a/arch/arm64/mm/mmu.c
> > >+++ b/arch/arm64/mm/mmu.c
> > >@@ -28,6 +28,7 @@
> > >  #include <linux/mman.h>
> > >  #include <linux/nodemask.h>
> > >  #include <linux/memblock.h>
> > >+#include <linux/stop_machine.h>
> > >  #include <linux/fs.h>
> > >  #include <linux/io.h>
> > >  #include <linux/mm.h>
> > >@@ -615,6 +616,44 @@ void __init paging_init(void)
> > >  		      SWAPPER_DIR_SIZE - PAGE_SIZE);
> > >  }
> > >+#ifdef CONFIG_MEMORY_HOTPLUG
> > >+
> > >+/*
> > >+ * hotplug_paging() is used by memory hotplug to build new page tables
> > >+ * for hot added memory.
> > >+ */
> > >+
> > >+struct mem_range {
> > >+	phys_addr_t base;
> > >+	phys_addr_t size;
> > >+};
> > >+
> > >+static int __hotplug_paging(void *data)
> > >+{
> > >+	int flags = 0;
> > >+	struct mem_range *section = data;
> > >+
> > >+	if (debug_pagealloc_enabled())
> > >+		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> > >+
> > >+	__create_pgd_mapping(swapper_pg_dir, section->base,
> > >+			__phys_to_virt(section->base), section->size,
> > >+			PAGE_KERNEL, pgd_pgtable_alloc, flags);
> > >+
> > >+	return 0;
> > >+}
> > >+
> > >+inline void hotplug_paging(phys_addr_t start, phys_addr_t size)
> > >+{
> > >+	struct mem_range section = {
> > >+		.base = start,
> > >+		.size = size,
> > >+	};
> > >+
> > >+	stop_machine(__hotplug_paging, &section, NULL);
> > 
> > Why exactly do we need to swing the stop_machine() hammer here? I appreciate
> > that separate hotplug events for adjacent sections could potentially affect
> > the same top-level entry in swapper_pg_dir, but those should already be
> > serialised by the hotplug lock - who else has cause to modify non-leaf
> > entries for the linear map at runtime in a manner which might conflict?
> > 
> 
> The reason for this was given by Mark Rutland in the previous spin
> (https://lkml.org/lkml/2017/4/11/582). Please let us know if you have a
> different point of view.
> 
> 
> BR,
> Maciej Bielski
> 
> > Robin.
> > 
> > >+}
> > >+#endif /* CONFIG_MEMORY_HOTPLUG */
> > >+
> > >  /*
> > >   * Check whether a kernel address is valid (derived from arch/x86/).
> > >   */
> > >
> 

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 1/5] mm: memory_hotplug: Memory hotplug (add) support for arm64
@ 2017-11-27 17:11         ` Andrea Reale
  0 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-11-27 17:11 UTC (permalink / raw)
  To: Maciej Bielski
  Cc: Robin Murphy, linux-arm-kernel, mark.rutland, realean2, mhocko,
	scott.branden, catalin.marinas, will.deacon, linux-kernel,
	linux-mm, arunks, qiuxishi

On Mon 27 Nov 2017, 17:39, Maciej Bielski wrote:

Hi Robin,

> Hi Robin,
> 
> Thank you for your feedback, it's highly appreciated. I took the liberty of
> adding some comments.
> 
> Our primary goal was to get hotplug working, even in a basic setup, and to
> publish first working results. We then want to improve the code based on
> community comments. This is a general answer to the questions about
> configuration flags. The working setup is presented more as a hint, and we do
> not consider it to be the best possible one. The questions about configuration,
> IMHO, fall into the category of agreeing on a proper setup (defaults,
> dependencies), and we therefore rely strongly on the community's experience to
> advise us on how it should look. So, in short, for some questions of the form
> "why is this set up in such a way", the simple answer is that it worked as a
> first approximation. I totally agree that it should be different for a
> server-grade system, and thanks a lot for sharing your opinion on that.
> 
> On Mon, Nov 27, 2017 at 03:19:49PM +0000, Robin Murphy wrote:
> > Hi Andrea,
> > 
> > I've also been looking at memory hotplug for arm64, from the perspective of
> > enabling ZONE_DEVICE for pmem. May I ask what your use-case for this series
> > is? AFAICS the real demand will be coming from server systems, which in
> > practice means both ACPI and NUMA, both of which are being resoundingly
> > ignored here.
> > 
> 
> Eventually we are aiming for aarch64 server systems.
> 

Adding to what Maciej said: the original motivation and driving factor
for this development effort is this project: http://www.dredbox.eu

In short, we have a software-defined interconnect for disaggregated
memory, where memory can be attached to nodes dynamically, under software
control. On each reconfiguration, we need to hot-add and hot-remove memory
from running kernels. Our current research prototype is based on an
arm64 SoC+FPGA system, hence memory hotplug for arm64.
Since the triggers for hot-add and hot-remove are software, we do not need
ACPI. In our specific case, memory topologies can change dynamically, so
we have rather ad hoc, project-specific NUMA support that, we believe,
does not make sense to discuss for mainlining.

> > Further review comments inline.
> > 
> > On 23/11/17 11:13, Maciej Bielski wrote:
> > >Introduces memory hotplug functionality (hot-add) for arm64.
> > >
> > >Changes v1->v2:
> > >- swapper pgtable updated in place on hot add, avoiding unnecessary copy:
> > >   all changes are additive and non destructive.
> > >
> > >- stop_machine used to updated swapper on hot add, avoiding races
> > >
> > >- checking if pagealloc is under debug to stay coherent with mem_map
> > >
> > >Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
> > >Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> > >---
> > >  arch/arm64/Kconfig           | 12 ++++++
> > >  arch/arm64/configs/defconfig |  1 +
> > >  arch/arm64/include/asm/mmu.h |  3 ++
> > >  arch/arm64/mm/init.c         | 87 ++++++++++++++++++++++++++++++++++++++++++++
> > >  arch/arm64/mm/mmu.c          | 39 ++++++++++++++++++++
> > >  5 files changed, 142 insertions(+)
> > >
> > >diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > >index 0df64a6..c736bba 100644
> > >--- a/arch/arm64/Kconfig
> > >+++ b/arch/arm64/Kconfig
> > >@@ -641,6 +641,14 @@ config HOTPLUG_CPU
> > >  	  Say Y here to experiment with turning CPUs off and on.  CPUs
> > >  	  can be controlled through /sys/devices/system/cpu.
> > >+config ARCH_HAS_ADD_PAGES
> > >+	def_bool y
> > >+	depends on ARCH_ENABLE_MEMORY_HOTPLUG
> > >+
> > >+config ARCH_ENABLE_MEMORY_HOTPLUG
> > >+	def_bool y
> > >+    depends on !NUMA
> > 
> > As above, realistically this seems too limiting to be useful.
> > 
> > >+
> > >  # Common NUMA Features
> > >  config NUMA
> > >  	bool "Numa Memory Allocation and Scheduler Support"
> > >@@ -715,6 +723,10 @@ config ARCH_HAS_CACHE_LINE_SIZE
> > >  source "mm/Kconfig"
> > >+config ARCH_MEMORY_PROBE
> > >+	def_bool y
> > >+	depends on MEMORY_HOTPLUG
> > 
> > I'm particularly dubious about enabling this by default - it's useful for
> > development and testing, yes, but I think it's the kind of feature where the
> > onus should be on interested developers to turn it on, rather than
> > production configs to have to turn it off.
> > 
> > >+
> > >  config SECCOMP
> > >  	bool "Enable seccomp to safely compute untrusted bytecode"
> > >  	---help---
> > >diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
> > >index 34480e9..5fc5656 100644
> > >--- a/arch/arm64/configs/defconfig
> > >+++ b/arch/arm64/configs/defconfig
> > >@@ -80,6 +80,7 @@ CONFIG_ARM64_VA_BITS_48=y
> > >  CONFIG_SCHED_MC=y
> > >  CONFIG_NUMA=y
> > >  CONFIG_PREEMPT=y
> > >+CONFIG_MEMORY_HOTPLUG=y
> > 
> > Note that this is effectively pointless, given two lines above...
> > 

Well spotted, thanks :) 

> > >  CONFIG_KSM=y
> > >  CONFIG_TRANSPARENT_HUGEPAGE=y
> > >  CONFIG_CMA=y
> > >diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> > >index 0d34bf0..2b3fa4d 100644
> > >--- a/arch/arm64/include/asm/mmu.h
> > >+++ b/arch/arm64/include/asm/mmu.h
> > >@@ -40,5 +40,8 @@ extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
> > >  			       pgprot_t prot, bool page_mappings_only);
> > >  extern void *fixmap_remap_fdt(phys_addr_t dt_phys);
> > >  extern void mark_linear_text_alias_ro(void);
> > >+#ifdef CONFIG_MEMORY_HOTPLUG
> > >+extern void hotplug_paging(phys_addr_t start, phys_addr_t size);
> > 
> > Is there any reason for not just implementing all the hotplug code
> > self-contained in mmu.c?
> > 
> 
> Simply, in the first version we were supposed to build on top of the patch by
> Scott Branden, who put a mock implementation of arch_add_memory() in
> arch/arm64/mm/init.c; this is why hotplug_paging() needed to be declared
> outside. Looking quickly at the code now, I agree that it would be cleaner to
> put everything in arch/arm64/mm/mmu.c. I will test that.
> 
> > >+#endif
> > >  #endif
> > >diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > >index 5960bef..e96e7d3 100644
> > >--- a/arch/arm64/mm/init.c
> > >+++ b/arch/arm64/mm/init.c
> > >@@ -722,3 +722,90 @@ static int __init register_mem_limit_dumper(void)
> > >  	return 0;
> > >  }
> > >  __initcall(register_mem_limit_dumper);
> > >+
> > >+#ifdef CONFIG_MEMORY_HOTPLUG
> > >+int add_pages(int nid, unsigned long start_pfn,
> > >+		unsigned long nr_pages, bool want_memblock)
> > >+{
> > >+	int ret;
> > >+	u64 start_addr = start_pfn << PAGE_SHIFT;
> > >+	/*
> > >+	 * Mark the first page in the range as unusable. This is needed
> > >+	 * because __add_section (within __add_pages) wants pfn_valid
> > >+	 * of it to be false, and in arm64 pfn falid is implemented by
> > >+	 * just checking at the nomap flag for existing blocks.
> > >+	 *
> > >+	 * A small trick here is that __add_section() requires only
> > >+	 * phys_start_pfn (that is the first pfn of a section) to be
> > >+	 * invalid. Regardless of whether it was assumed (by the function
> > >+	 * author) that all pfns within a section are either all valid
> > >+	 * or all invalid, it allows to avoid looping twice (once here,
> > >+	 * second when memblock_clear_nomap() is called) through all
> > >+	 * pfns of the section and modify only one pfn. Thanks to that,
> > >+	 * further, in __add_zone() only this very first pfn is skipped
> > >+	 * and corresponding page is not flagged reserved. Therefore it
> > >+	 * is enough to correct this setup only for it.
> > >+	 *
> > >+	 * When arch_add_memory() returns the walk_memory_range() function
> > >+	 * is called and passed with online_memory_block() callback,
> > >+	 * which execution finally reaches the memory_block_action()
> > >+	 * function, where also only the first pfn of a memory block is
> > >+	 * checked to be reserved. Above, it was first pfn of a section,
> > >+	 * here it is a block but
> > >+	 * (drivers/base/memory.c):
> > >+	 *     sections_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
> > >+	 * (include/linux/memory.h):
> > >+	 *     #define MIN_MEMORY_BLOCK_SIZE     (1UL << SECTION_SIZE_BITS)
> > >+	 * so we can consider block and section equivalently
> > >+	 */
> > >+	memblock_mark_nomap(start_addr, 1<<PAGE_SHIFT);
> > >+	ret = __add_pages(nid, start_pfn, nr_pages, want_memblock);
> > >+
> > >+	/*
> > >+	 * Make the pages usable after they have been added.
> > >+	 * This will make pfn_valid return true
> > >+	 */
> > >+	memblock_clear_nomap(start_addr, 1<<PAGE_SHIFT);
> > >+
> > >+	/*
> > >+	 * This is a hack to avoid having to mix arch specific code
> > >+	 * into arch independent code. SetPageReserved is supposed
> > >+	 * to be called by __add_zone (within __add_section, within
> > >+	 * __add_pages). However, when it is called there, it assumes that
> > >+	 * pfn_valid returns true.  For the way pfn_valid is implemented
> > >+	 * in arm64 (a check on the nomap flag), the only way to make
> > >+	 * this evaluate true inside __add_zone is to clear the nomap
> > >+	 * flags of blocks in architecture independent code.
> > >+	 *
> > >+	 * To avoid this, we set the Reserved flag here after we cleared
> > >+	 * the nomap flag in the line above.
> > >+	 */
> > >+	SetPageReserved(pfn_to_page(start_pfn));
> > 
> > This whole business is utterly horrible. I really think we need to revisit
> > why arm64 isn't using the normal sparsemem pfn_valid() implementation. If
> > there are callers misusing pfn_valid() where they really want page_is_ram()
> > or similar, or missing further pfn_valid_within() checks, then it's surely
> > time to fix those at the source rather than adding to the Jenga pile of
> > hacks in this area. I've started digging into it myself, but don't have any
> > answers yet.
> > 
> 
> I fully agree and this is the exact reaction we hoped for. We just decided to
> avoid opening too many fronts at the same time, and we were also not
> completely sure what exactly pfn_valid() is supposed to be used for and what
> we could potentially break. We are looking forward to your findings here.
> 
> > >+
> > >+	return ret;
> > >+}
> > >+
> > >+int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
> > >+{
> > >+	int ret;
> > >+	unsigned long start_pfn = start >> PAGE_SHIFT;
> > >+	unsigned long nr_pages = size >> PAGE_SHIFT;
> > >+	unsigned long end_pfn = start_pfn + nr_pages;
> > >+	unsigned long max_sparsemem_pfn = 1UL << (MAX_PHYSMEM_BITS-PAGE_SHIFT);
> > >+
> > >+	if (end_pfn > max_sparsemem_pfn) {
> > >+		pr_err("end_pfn too big");
> > >+		return -1;
> > >+	}
> > >+	hotplug_paging(start, size);
> > >+
> > >+	ret = add_pages(nid, start_pfn, nr_pages, want_memblock);
> > >+
> > >+	if (ret)
> > >+		pr_warn("%s: Problem encountered in __add_pages() ret=%d\n",
> > >+			__func__, ret);
> > >+
> > >+	return ret;
> > >+}
> > >+
> > >+#endif /* CONFIG_MEMORY_HOTPLUG */
> > >diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > >index f1eb15e..d93043d 100644
> > >--- a/arch/arm64/mm/mmu.c
> > >+++ b/arch/arm64/mm/mmu.c
> > >@@ -28,6 +28,7 @@
> > >  #include <linux/mman.h>
> > >  #include <linux/nodemask.h>
> > >  #include <linux/memblock.h>
> > >+#include <linux/stop_machine.h>
> > >  #include <linux/fs.h>
> > >  #include <linux/io.h>
> > >  #include <linux/mm.h>
> > >@@ -615,6 +616,44 @@ void __init paging_init(void)
> > >  		      SWAPPER_DIR_SIZE - PAGE_SIZE);
> > >  }
> > >+#ifdef CONFIG_MEMORY_HOTPLUG
> > >+
> > >+/*
> > >+ * hotplug_paging() is used by memory hotplug to build new page tables
> > >+ * for hot added memory.
> > >+ */
> > >+
> > >+struct mem_range {
> > >+	phys_addr_t base;
> > >+	phys_addr_t size;
> > >+};
> > >+
> > >+static int __hotplug_paging(void *data)
> > >+{
> > >+	int flags = 0;
> > >+	struct mem_range *section = data;
> > >+
> > >+	if (debug_pagealloc_enabled())
> > >+		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> > >+
> > >+	__create_pgd_mapping(swapper_pg_dir, section->base,
> > >+			__phys_to_virt(section->base), section->size,
> > >+			PAGE_KERNEL, pgd_pgtable_alloc, flags);
> > >+
> > >+	return 0;
> > >+}
> > >+
> > >+inline void hotplug_paging(phys_addr_t start, phys_addr_t size)
> > >+{
> > >+	struct mem_range section = {
> > >+		.base = start,
> > >+		.size = size,
> > >+	};
> > >+
> > >+	stop_machine(__hotplug_paging, &section, NULL);
> > 
> > Why exactly do we need to swing the stop_machine() hammer here? I appreciate
> > that separate hotplug events for adjacent sections could potentially affect
> > the same top-level entry in swapper_pg_dir, but those should already be
> > serialised by the hotplug lock - who else has cause to modify non-leaf
> > entries for the linear map at runtime in a manner which might conflict?
> > 
> 
> The reason for this was mentioned by Mark Rutland in the previous spin
> (https://lkml.org/lkml/2017/4/11/582); please let us know if you have a
> different point of view.
> 
> 
> BR,
> Maciej Bielski
> 
> > Robin.
> > 
> > >+}
> > >+#endif /* CONFIG_MEMORY_HOTPLUG */
> > >+
> > >  /*
> > >   * Check whether a kernel address is valid (derived from arch/x86/).
> > >   */
> > >
> 
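
The range check at the top of arch_add_memory() in the quoted patch can be
tried out in user space. The following is only a minimal sketch, assuming
4 KiB pages and MAX_PHYSMEM_BITS = 48; both values, and the helper name
range_fits_sparsemem(), are assumptions made for illustration and are not
taken from the patch.

```c
#include <stdint.h>

#define PAGE_SHIFT        12  /* assumption: 4 KiB pages */
#define MAX_PHYSMEM_BITS  48  /* assumption: 48-bit physical address space */

/*
 * Hypothetical helper mirroring the bounds check in the quoted
 * arch_add_memory(): returns 1 when the pfn range [start_pfn, end_pfn)
 * lies within what sparsemem can describe, 0 otherwise.
 */
static int range_fits_sparsemem(uint64_t start, uint64_t size)
{
	uint64_t start_pfn = start >> PAGE_SHIFT;
	uint64_t nr_pages = size >> PAGE_SHIFT;
	uint64_t end_pfn = start_pfn + nr_pages;
	uint64_t max_sparsemem_pfn = UINT64_C(1) << (MAX_PHYSMEM_BITS - PAGE_SHIFT);

	/* The patch rejects the range only when end_pfn > max_sparsemem_pfn. */
	return end_pfn <= max_sparsemem_pfn;
}
```

Note that a range ending exactly at the sparsemem limit is accepted, because
end_pfn is exclusive and the patch compares with '>'.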

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 4/5] mm: memory_hotplug: Add memory hotremove probe device
  2017-11-27 15:33     ` Robin Murphy
  (?)
@ 2017-11-27 17:14       ` Andrea Reale
  -1 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-11-27 17:14 UTC (permalink / raw)
  To: Robin Murphy
  Cc: linux-arm-kernel, mark.rutland, realean2, mhocko, m.bielski,
	scott.branden, catalin.marinas, will.deacon, linux-kernel,
	linux-mm, arunks, qiuxishi

Hi Robin,

On Mon 27 Nov 2017, 15:33, Robin Murphy wrote:
> On 23/11/17 11:14, Andrea Reale wrote:
> >Adding a "remove" sysfs handle that can be used to trigger
> >memory hotremove manually, exactly symmetrically with
> >what happens with the "probe" device for hot-add.
> >
> >This is useful for architectures that do not rely on
> >ACPI for memory hot-remove.
> 
> Is there a real-world use-case for this, or is it mostly just a handy
> development feature?
> 
As I was saying in a response to your previous message, in our use
case remove events are triggered by software. Besides our use case,
yes, it is mostly just a handy development feature AFAICT.

> >Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> >Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
> >---
> >  drivers/base/memory.c | 34 +++++++++++++++++++++++++++++++++-
> >  1 file changed, 33 insertions(+), 1 deletion(-)
> >
> >diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> >index 1d60b58..8ccb67c 100644
> >--- a/drivers/base/memory.c
> >+++ b/drivers/base/memory.c
> >@@ -530,7 +530,36 @@ memory_probe_store(struct device *dev, struct device_attribute *attr,
> >  }
> >  static DEVICE_ATTR(probe, S_IWUSR, NULL, memory_probe_store);
> >-#endif
> >+
> >+#ifdef CONFIG_MEMORY_HOTREMOVE
> >+static ssize_t
> >+memory_remove_store(struct device *dev,
> >+		struct device_attribute *attr, const char *buf, size_t count)
> >+{
> >+	u64 phys_addr;
> >+	int nid, ret;
> >+	unsigned long pages_per_block = PAGES_PER_SECTION * sections_per_block;
> >+
> >+	ret = kstrtoull(buf, 0, &phys_addr);
> >+	if (ret)
> >+		return ret;
> >+
> >+	if (phys_addr & ((pages_per_block << PAGE_SHIFT) - 1))
> >+		return -EINVAL;
> >+
> >+	nid = memory_add_physaddr_to_nid(phys_addr);
> 
> This call looks a bit odd, since you're not doing a memory add. In fact, any
> memory being removed should already be fully known-about, so AFAICS it
> should be simple to get everything you need to know (including potentially
> the online status as mentioned earlier), through 'normal' methods, e.g.
> page_to_nid() or similar.

Makes sense. Suggestion noted, thanks.

> Robin.
> 
> >+	ret = lock_device_hotplug_sysfs();
> >+	if (ret)
> >+		return ret;
> >+
> >+	remove_memory(nid, phys_addr,
> >+			 MIN_MEMORY_BLOCK_SIZE * sections_per_block);
> >+	unlock_device_hotplug();
> >+	return count;
> >+}
> >+static DEVICE_ATTR(remove, S_IWUSR, NULL, memory_remove_store);
> >+#endif /* CONFIG_MEMORY_HOTREMOVE */
> >+#endif /* CONFIG_ARCH_MEMORY_PROBE */
> >  #ifdef CONFIG_MEMORY_FAILURE
> >  /*
> >@@ -790,6 +819,9 @@ bool is_memblock_offlined(struct memory_block *mem)
> >  static struct attribute *memory_root_attrs[] = {
> >  #ifdef CONFIG_ARCH_MEMORY_PROBE
> >  	&dev_attr_probe.attr,
> >+#ifdef CONFIG_MEMORY_HOTREMOVE
> >+	&dev_attr_remove.attr,
> >+#endif
> >  #endif
> >  #ifdef CONFIG_MEMORY_FAILURE
> >
> 
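
The block-alignment test in the quoted memory_remove_store() can be checked
in isolation. This is only a sketch, assuming 4 KiB pages and 1 GiB sections
(SECTION_SIZE_BITS = 30); those constants and the helper name block_aligned()
are assumptions for illustration, not part of the patch. The mask trick is
valid only when the block size is a power of two.

```c
#include <stdint.h>

#define PAGE_SHIFT         12  /* assumption: 4 KiB pages */
#define SECTION_SIZE_BITS  30  /* assumption: 1 GiB sections */
#define PAGES_PER_SECTION  (UINT64_C(1) << (SECTION_SIZE_BITS - PAGE_SHIFT))

/*
 * Hypothetical helper mirroring the check in the quoted patch:
 * returns 1 if phys_addr is aligned to the memory block size, 0 otherwise.
 * sections_per_block is assumed to be a power of two.
 */
static int block_aligned(uint64_t phys_addr, uint64_t sections_per_block)
{
	uint64_t pages_per_block = PAGES_PER_SECTION * sections_per_block;
	uint64_t block_bytes = pages_per_block << PAGE_SHIFT;

	/* Any bit set below the block size means the address is misaligned. */
	return (phys_addr & (block_bytes - 1)) == 0;
}
```

With these assumed values and one section per block, 0x40000000 is accepted
and 0x40001000 is rejected, which is the case where the patch returns -EINVAL.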

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 3/5] mm: memory_hotplug: memblock to track partially removed vmemmap mem
  2017-11-27 15:20     ` Robin Murphy
  (?)
@ 2017-11-27 17:38       ` Andrea Reale
  -1 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-11-27 17:38 UTC (permalink / raw)
  To: Robin Murphy
  Cc: linux-arm-kernel, mark.rutland, realean2, mhocko, m.bielski,
	scott.branden, catalin.marinas, will.deacon, linux-kernel,
	linux-mm, arunks, qiuxishi

Hi Robin,

On Mon 27 Nov 2017, 15:20, Robin Murphy wrote:
> On 23/11/17 11:14, Andrea Reale wrote:
> >When hot-removing memory we need to free vmemmap memory.
> 
> What problems arise if we don't? Is it only for the sake of freeing up some
> pages here and there, or is there something more fundamental?
>

It is just for freeing up pages, but IMHO we are talking about a significant
number of pages. For example, assuming 4K pages, describing one hot-added
section of 1GB of new memory takes ~14MB of vmemmap space (if my
back-of-the-envelope math is right). This memory would be leaked if we did
not do the cleanup in hot remove. If sections are hot-removed many times
over the lifetime of a system, the leaked amount can become sizeable.
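
The estimate can be checked with a few lines of arithmetic. This is only an
illustrative sketch: vmemmap_bytes() is a hypothetical helper, and the
56-byte and 64-byte values for sizeof(struct page) are assumptions, since
the real size depends on the kernel configuration. With 4K pages, 1GB is
262144 pages, which gives 14MiB at 56 bytes per struct page and 16MiB at
64 bytes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of vmemmap needed to describe mem_bytes of memory.
 * page_size and struct_page_size are inputs because sizeof(struct page)
 * depends on the kernel configuration (assumed values, not measured ones).
 */
static uint64_t vmemmap_bytes(uint64_t mem_bytes, uint64_t page_size,
			      uint64_t struct_page_size)
{
	/* The model only handles whole pages. */
	assert(page_size != 0 && mem_bytes % page_size == 0);

	/* One struct page per physical page. */
	return (mem_bytes / page_size) * struct_page_size;
}
```

Calling vmemmap_bytes(1ULL << 30, 4096, 56) corresponds to the ~14MB figure
quoted above.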

> >However, depending on which memory is being removed, it might
> >not always be possible to free a full vmemmap page / huge-page
> >because part of it might still be used.
> >
> >Commit ae9aae9eda2d ("memory-hotplug: common APIs to support page tables
> >hot-remove") introduced a workaround for x86
> >hot-remove, by which partially unused areas are filled with
> >the 0xFD constant. Full pages are only removed when fully
> >filled by 0xFDs.
> >
> >This commit introduces a MEMBLOCK_UNUSED_VMEMMAP memblock flag, with
> >the goal of using it in place of 0xFDs. For now, this will be used for
> >the arm64 port of memory hot remove, but the idea is to eventually use
> >the same mechanism for x86 as well.
> >
> >Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> >Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
> >---
> >  include/linux/memblock.h | 12 ++++++++++++
> >  mm/memblock.c            | 32 ++++++++++++++++++++++++++++++++
> >  2 files changed, 44 insertions(+)
> >
> >diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> >index bae11c7..0daec05 100644
> >--- a/include/linux/memblock.h
> >+++ b/include/linux/memblock.h
> >@@ -26,6 +26,9 @@ enum {
> >  	MEMBLOCK_HOTPLUG	= 0x1,	/* hotpluggable region */
> >  	MEMBLOCK_MIRROR		= 0x2,	/* mirrored region */
> >  	MEMBLOCK_NOMAP		= 0x4,	/* don't add to kernel direct mapping */
> >+#ifdef CONFIG_MEMORY_HOTREMOVE
> >+	MEMBLOCK_UNUSED_VMEMMAP	= 0x8,  /* Mark VMEMAP blocks as dirty */
> 
> I'm not sure I get what "dirty" is supposed to mean in this context. Also,
> this appears to be specific to CONFIG_SPARSEMEM_VMEMMAP, whilst only
> tangentially related to CONFIG_MEMORY_HOTREMOVE, so the dependencies look a
> bit off.
> 
> In fact, now that I think about it, why does this need to be in memblock at
> all? If it is specific to sparsemem, shouldn't the section map already be
> enough to tell us what's supposed to be present or not?
> 
> Robin.

The story is: when we are hot-removing one section, we cannot be sure that
the whole block can be removed, for example because we might have used
only a portion of it at hot-add time, and the rest might have been used
by other hot adds we are not aware of.
So when we hot-remove, we mark the page structs of the removed memory,
and we only remove the full page when it is all marked.
This is exactly the same issue described in commit
ae9aae9eda2d ("memory-hotplug: common APIs to support page tables
hot-remove"), which introduced hot-remove for x86.
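
The mark-then-free rule can be modelled in user space as follows. This is
a minimal sketch under stated assumptions, not the code in the patch: the
struct, the helper names and the 8-chunk granularity are all hypothetical,
and a bitmask stands in for whatever marking mechanism is used (0xFD fill
on x86, a memblock flag in this series).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CHUNKS_PER_VMEMMAP_PAGE 8	/* assumed granularity of the model */

/* One vmemmap backing page, split into equally sized chunks. */
struct vmemmap_page_model {
	uint8_t unused_mask;	/* bit i set: chunk i no longer describes memory */
};

/* Mark chunks [first, first + count) as unused after a hot-remove. */
static void mark_unused(struct vmemmap_page_model *p, unsigned int first,
			unsigned int count)
{
	unsigned int i;

	assert(first + count <= CHUNKS_PER_VMEMMAP_PAGE);
	for (i = first; i < first + count; i++)
		p->unused_mask |= (uint8_t)(1u << i);
}

/* The backing page may be freed only once every chunk is marked. */
static bool can_free(const struct vmemmap_page_model *p)
{
	return p->unused_mask == (uint8_t)((1u << CHUNKS_PER_VMEMMAP_PAGE) - 1);
}
```

Removing memory that covers only chunks 0-3 leaves can_free() false; it
becomes true only after the memory described by chunks 4-7 is removed too.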

In that commit, partially unused vmemmap pages were filled with the
0xFD constant. In the previous iteration of this patchset, it was
rightly suggested that marking the pages by writing inside them was
not the best way to achieve the result. That's why we switched to doing
this marking with memblock. It is only used in memory hot remove,
hence the CONFIG_MEMORY_HOTREMOVE dependency.

Right now, I cannot think of how I could use sparsemem to tell: the
only thing I know at the moment of trying to free a vmemmap block is that I
have some physical addresses that might or might not be in use to describe
some pages. I cannot think of any way to know which struct pages could be
occupying this vmemmap block, besides maybe walking all page tables and
checking whether there is some matching mapping.
However, I might be missing something, so suggestions are welcome.

Thanks,
Andrea

> >+#endif
> >  };
> >  struct memblock_region {
> >@@ -90,6 +93,10 @@ int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
> >  int memblock_mark_nomap(phys_addr_t base, phys_addr_t size);
> >  int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
> >  ulong choose_memblock_flags(void);
> >+#ifdef CONFIG_MEMORY_HOTREMOVE
> >+int memblock_mark_unused_vmemmap(phys_addr_t base, phys_addr_t size);
> >+int memblock_clear_unused_vmemmap(phys_addr_t base, phys_addr_t size);
> >+#endif
> >  /* Low level functions */
> >  int memblock_add_range(struct memblock_type *type,
> >@@ -182,6 +189,11 @@ static inline bool memblock_is_nomap(struct memblock_region *m)
> >  	return m->flags & MEMBLOCK_NOMAP;
> >  }
> >+#ifdef CONFIG_MEMORY_HOTREMOVE
> >+bool memblock_is_vmemmap_unused_range(struct memblock_type *mt,
> >+		phys_addr_t start, phys_addr_t end);
> >+#endif
> >+
> >  #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> >  int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
> >  			    unsigned long  *end_pfn);
> >diff --git a/mm/memblock.c b/mm/memblock.c
> >index 9120578..30d5aa4 100644
> >--- a/mm/memblock.c
> >+++ b/mm/memblock.c
> >@@ -809,6 +809,18 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
> >  	return memblock_setclr_flag(base, size, 0, MEMBLOCK_NOMAP);
> >  }
> >+#ifdef CONFIG_MEMORY_HOTREMOVE
> >+int __init_memblock memblock_mark_unused_vmemmap(phys_addr_t base,
> >+		phys_addr_t size)
> >+{
> >+	return memblock_setclr_flag(base, size, 1, MEMBLOCK_UNUSED_VMEMMAP);
> >+}
> >+int __init_memblock memblock_clear_unused_vmemmap(phys_addr_t base,
> >+		phys_addr_t size)
> >+{
> >+	return memblock_setclr_flag(base, size, 0, MEMBLOCK_UNUSED_VMEMMAP);
> >+}
> >+#endif
> >  /**
> >   * __next_reserved_mem_region - next function for for_each_reserved_region()
> >   * @idx: pointer to u64 loop variable
> >@@ -1696,6 +1708,26 @@ void __init_memblock memblock_trim_memory(phys_addr_t align)
> >  	}
> >  }
> >+#ifdef CONFIG_MEMORY_HOTREMOVE
> >+bool __init_memblock memblock_is_vmemmap_unused_range(struct memblock_type *mt,
> >+		phys_addr_t start, phys_addr_t end)
> >+{
> >+	u64 i;
> >+	struct memblock_region *r;
> >+
> >+	i = memblock_search(mt, start);
> >+	r = &(mt->regions[i]);
> >+	while (r->base < end) {
> >+		if (!(r->flags & MEMBLOCK_UNUSED_VMEMMAP))
> >+			return 0;
> >+
> >+		r = &(memblock.memory.regions[++i]);
> >+	}
> >+
> >+	return 1;
> >+}
> >+#endif
> >+
> >  void __init_memblock memblock_set_current_limit(phys_addr_t limit)
> >  {
> >  	memblock.current_limit = limit;
> >
> 
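
For reference, the range check in the quoted memblock_is_vmemmap_unused_range()
can be modelled in user space as below. This is a sketch under assumptions,
not a proposed replacement: the types and helper names are hypothetical
stand-ins for struct memblock_type and memblock_search(). It differs from the
quoted hunk in three ways: it keeps walking the list it was given (the hunk
starts on mt->regions but advances through memblock.memory.regions), it stops
at the end of the array, and it treats a failed search (memblock_search()
returns -1 when no region contains the address) as "not unused".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define REGION_UNUSED_VMEMMAP 0x8u	/* mirrors the flag value in the quoted patch */

struct region {
	uint64_t base;
	uint64_t size;
	unsigned int flags;
};

struct region_list {
	unsigned long cnt;
	const struct region *regions;	/* sorted by base, non-overlapping */
};

/* Index of the region containing addr, or -1 if there is none. */
static long region_search(const struct region_list *l, uint64_t addr)
{
	unsigned long i;

	for (i = 0; i < l->cnt; i++)
		if (addr >= l->regions[i].base &&
		    addr - l->regions[i].base < l->regions[i].size)
			return (long)i;
	return -1;
}

/*
 * True only if every region from the one containing start up to end
 * carries the flag. Like the quoted hunk, holes between regions are
 * not detected.
 */
static bool range_is_unused(const struct region_list *l, uint64_t start,
			    uint64_t end)
{
	long i = region_search(l, start);

	if (i < 0)
		return false;	/* start is not covered by any region */
	for (; (unsigned long)i < l->cnt && l->regions[i].base < end; i++)
		if (!(l->regions[i].flags & REGION_UNUSED_VMEMMAP))
			return false;
	return true;
}
```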

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 2/5] mm: memory_hotplug: Remove assumption on memory state before hotremove
  2017-11-27 15:20             ` Robin Murphy
  (?)
  (?)
@ 2017-11-27 17:44               ` Andrea Reale
  -1 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-11-27 17:44 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Michal Hocko, Mark Rutland, Rafael Wysocki, m.bielski,
	ACPI Devel Maling List, Rafael J. Wysocki, Catalin Marinas,
	scott.branden, Will Deacon, Linux Kernel Mailing List,
	Linux Memory Management List, arunks, qiuxishi, linux-arm-kernel

Hi again,

On Mon 27 Nov 2017, 15:20, Robin Murphy wrote:
> On 24/11/17 15:54, Andrea Reale wrote:
> >On Fri 24 Nov 2017, 16:43, Michal Hocko wrote:
> >>On Fri 24-11-17 14:49:17, Andrea Reale wrote:
> >>>Hi Rafael,
> >>>
> >>>On Fri 24 Nov 2017, 15:39, Rafael J. Wysocki wrote:
> >>>>On Fri, Nov 24, 2017 at 11:22 AM, Andrea Reale <ar@linux.vnet.ibm.com> wrote:
> >>>>>Resending the patch adding linux-acpi in CC, as suggested by Rafael.
> >>>>>Everyone else: apologies for the noise.
> >>>>>
> >>>>>Commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal")
> >>>>>introduced an assumption whereas when control
> >>>>>reaches remove_memory the corresponding memory has been already
> >>>>>offlined. In that case, the acpi_memhotplug was making sure that
> >>>>>the assumption held.
> >>>>>This assumption, however, is not necessarily true if offlining
> >>>>>and removal are not done by the same "controller" (for example,
> >>>>>when first offlining via sysfs).
> >>>>>
> >>>>>Removing this assumption for the generic remove_memory code
> >>>>>and moving it in the specific acpi_memhotplug code. This is
> >>>>>a dependency for the software-aided arm64 offlining and removal
> >>>>>process.
> >>>>>
> >>>>>Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> >>>>>Signed-off-by: Maciej Bielski <m.bielski@linux.vnet.ibm.com>
> >>>>>---
> >>>>>  drivers/acpi/acpi_memhotplug.c |  2 +-
> >>>>>  include/linux/memory_hotplug.h |  9 ++++++---
> >>>>>  mm/memory_hotplug.c            | 13 +++++++++----
> >>>>>  3 files changed, 16 insertions(+), 8 deletions(-)
> >>>>>
> >>>>>diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> >>>>>index 6b0d3ef..b0126a0 100644
> >>>>>--- a/drivers/acpi/acpi_memhotplug.c
> >>>>>+++ b/drivers/acpi/acpi_memhotplug.c
> >>>>>@@ -282,7 +282,7 @@ static void acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
> >>>>>                         nid = memory_add_physaddr_to_nid(info->start_addr);
> >>>>>
> >>>>>                 acpi_unbind_memory_blocks(info);
> >>>>>-               remove_memory(nid, info->start_addr, info->length);
> >>>>>+               BUG_ON(remove_memory(nid, info->start_addr, info->length));
> >>>>
> >>>>Why does this have to be BUG_ON()?  Is it really necessary to kill the
> >>>>system here?
> >>>
> >>>Actually, I hoped you would help me understand that: that BUG() call was introduced
> >>>by yourself in commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal"),
> >>>in memory_hotplug.c:remove_memory().
> >>>
> >>>Just reading that commit, my understanding was that you were assuming
> >>>that acpi_memory_remove_memory() had already done the job of offlining
> >>>the target memory, so there would be a bug if that wasn't the case.
> >>>
> >>>In my case, that assumption did not hold and I found that it might not
> >>>hold for other platforms that do not use ACPI. In fact, the purpose of
> >>>this patch is to move this assumption out of the generic hotplug code
> >>>and move it to ACPI code where it originated.
> >>
> >>remove_memory failure is basically impossible to handle AFAIR. The
> >>original code to BUG in remove_memory is ugly as hell and we do not want
> >>to spread that out of that function. Instead we really want to get rid
> >>of it.
> >
> >Today, BUG() is called even in the simple case where remove fails
> >because the section we are removing is not offline. I cannot see any need to
> >BUG() in such a case: an error code seems more than sufficient to me.
> >This is why this patch removes from the generic code the BUG() call
> >made when the "offline" check fails.
> >It moves it back to the ACPI call, where the assumption
> >originated. Honestly, I cannot tell if it makes sense to BUG() there:
> >I have nothing against removing it from ACPI hotplug too, but
> >I don't know enough to feel free to change the ACPI semantics myself, so I
> >moved it there to keep the original behavior unchanged for x86 code.
> >
> >In this arm64 hot-remove port, offline and remove are done in two separate
> >steps, and it is conceivable that a user erroneously tries to remove some
> >section that they forgot to offline first: in that case, with the patch,
> >remove will just report an error without BUGing.
> 
> The user can already kill the system by misusing the sysfs probe driver;
> should similar theoretical misuse of your sysfs remove driver really need to
> be all that different?
> 
> >Is my reasoning flawed?
> 
> Furthermore, even if your driver does want to enforce this, I don't see why
> it can't just do the equivalent of memory_subsys_offline() itself before
> even trying to call remove_memory().
> 
> Robin.

My whole point is that I do not see any good reason to kill the system
when a hot-remove fails. My guess is that the original assumption was
that, once memory has been successfully offlined, hot remove should
always succeed. Even if we assume that offlining and removal are always
done in one single step (but then why expose a separate sysfs handle
to offline memory without removing it?), I don't see that as a good excuse
to kill the system: no critical kernel state is being compromised
AFAICT, so we can leave the system happily running after a hot remove
that did not succeed.
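
To make the argument concrete, here is a small user-space model of the
behaviour I am arguing for. It is a sketch only: the types and the
remove_block() helper are hypothetical, and the choice of -EBUSY as the
error code is an assumption for illustration, not necessarily what the
patch returns.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum block_state { BLOCK_ONLINE, BLOCK_OFFLINE, BLOCK_REMOVED };

struct mem_block_model {
	enum block_state state;
};

/*
 * Remove a block only if it is already offline. Otherwise report an
 * error and leave the block untouched, so the caller decides how to
 * react instead of the whole system being taken down.
 */
static int remove_block(struct mem_block_model *b)
{
	assert(b != NULL);

	if (b->state != BLOCK_OFFLINE)
		return -EBUSY;	/* assumed error code for "not offlined yet" */

	b->state = BLOCK_REMOVED;
	return 0;
}
```

A caller that really cannot tolerate failure (as the ACPI path assumes
today) can still escalate on a non-zero return, while a sysfs-triggered
remove can simply pass the error back to user space.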

Thanks,
Andrea

> >
> >Cheers,
> >Andrea
> >
> >>-- 
> >>Michal Hocko
> >>SUSE Labs


^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 2/5] mm: memory_hotplug: Remove assumption on memory state before hotremove
@ 2017-11-27 17:44               ` Andrea Reale
  0 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-11-27 17:44 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Michal Hocko, Mark Rutland, Rafael Wysocki, m.bielski,
	ACPI Devel Maling List, Rafael J. Wysocki, Catalin Marinas,
	scott.branden, Will Deacon, Linux Kernel Mailing List,
	Linux Memory Management List, arunks, qiuxishi, linux-arm-kernel

Hi again,

On Mon 27 Nov 2017, 15:20, Robin Murphy wrote:
> On 24/11/17 15:54, Andrea Reale wrote:
> >On Fri 24 Nov 2017, 16:43, Michal Hocko wrote:
> >>On Fri 24-11-17 14:49:17, Andrea Reale wrote:
> >>>Hi Rafael,
> >>>
> >>>On Fri 24 Nov 2017, 15:39, Rafael J. Wysocki wrote:
> >>>>On Fri, Nov 24, 2017 at 11:22 AM, Andrea Reale <ar@linux.vnet.ibm.com> wrote:
> >>>>>Resending the patch adding linux-acpi in CC, as suggested by Rafael.
> >>>>>Everyone else: apologies for the noise.
> >>>>>
> >>>>>Commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal")
> >>>>>introduced an assumption whereas when control
> >>>>>reaches remove_memory the corresponding memory has been already
> >>>>>offlined. In that case, the acpi_memhotplug was making sure that
> >>>>>the assumption held.
> >>>>>This assumption, however, is not necessarily true if offlining
> >>>>>and removal are not done by the same "controller" (for example,
> >>>>>when first offlining via sysfs).
> >>>>>
> >>>>>Removing this assumption for the generic remove_memory code
> >>>>>and moving it in the specific acpi_memhotplug code. This is
> >>>>>a dependency for the software-aided arm64 offlining and removal
> >>>>>process.
> >>>>>
> >>>>>Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> >>>>>Signed-off-by: Maciej Bielski <m.bielski@linux.vnet.ibm.com>
> >>>>>---
> >>>>>  drivers/acpi/acpi_memhotplug.c |  2 +-
> >>>>>  include/linux/memory_hotplug.h |  9 ++++++---
> >>>>>  mm/memory_hotplug.c            | 13 +++++++++----
> >>>>>  3 files changed, 16 insertions(+), 8 deletions(-)
> >>>>>
> >>>>>diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> >>>>>index 6b0d3ef..b0126a0 100644
> >>>>>--- a/drivers/acpi/acpi_memhotplug.c
> >>>>>+++ b/drivers/acpi/acpi_memhotplug.c
> >>>>>@@ -282,7 +282,7 @@ static void acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
> >>>>>                         nid = memory_add_physaddr_to_nid(info->start_addr);
> >>>>>
> >>>>>                 acpi_unbind_memory_blocks(info);
> >>>>>-               remove_memory(nid, info->start_addr, info->length);
> >>>>>+               BUG_ON(remove_memory(nid, info->start_addr, info->length));
> >>>>
> >>>>Why does this have to be BUG_ON()?  Is it really necessary to kill the
> >>>>system here?
> >>>
> >>>Actually, I hoped you would help me understand that: that BUG() call was introduced
> >>>by yourself in commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal"),
> >>>in memory_hotplug.c:remove_memory().
> >>>
> >>>Just reading that commit, my understanding was that you were assuming
> >>>that acpi_memory_remove_memory() had already done the job of offlining
> >>>the target memory, so there would be a bug if that wasn't the case.
> >>>
> >>>In my case, that assumption did not hold and I found that it might not
> >>>hold for other platforms that do not use ACPI. In fact, the purpose of
> >>>this patch is to move this assumption out of the generic hotplug code
> >>>and move it to ACPI code where it originated.
> >>
> >>remove_memory failure is basically impossible to handle AFAIR. The
> >>original code to BUG in remove_memory is ugly as hell and we do not want
> >>to spread that out of that function. Instead we really want to get rid
> >>of it.
> >
> >Today, BUG() is called even in the simple case where remove fails
> >because the section we are removing is not offline. I cannot see any need to
> >BUG() in such a case: an error code seems more than sufficient to me.
> >This is why this patch removes the BUG() call when the "offline" check
> >fails from the generic code.
> >It moves it back to the ACPI call, where the assumption
> >originated. Honestly, I cannot tell if it makes sense to BUG() there:
> >I have nothing against removing it from ACPI hotplug too, but
> >I don't know enough to feel free to change the acpi semantics myself, so I
> >moved it there to keep the original behavior unchanged for x86 code.
> >
> >In this arm64 hot-remove port, offline and remove are done in two separate
> >steps, and it is conceivable that a user erroneously tries to remove a
> >section that he forgot to offline first: in that case, with the patch,
> >remove will just report an error without BUGing.
> 
> The user can already kill the system by misusing the sysfs probe driver;
> should similar theoretical misuse of your sysfs remove driver really need to
> be all that different?
> 
> >Is my reasoning flawed?
> 
> Furthermore, even if your driver does want to enforce this, I don't see why
> it can't just do the equivalent of memory_subsys_offline() itself before
> even trying to call remove_memory().
> 
> Robin.

My whole point is that I do not see any good reason to kill the system
when a hot-remove fails. My guess is that the original assumption is
that, once memory has been successfully offlined, hot remove should
always succeed. Even if we assume that offlining and removal are always
done in one single step (but then why expose a separate sysfs handle
to offline without removing memory?), I don't see that as a good excuse
to kill the system: there is no critical kernel state being compromised
AFAICT, so we can leave the system happily running with a hot remove
that did not succeed.

Thanks,
Andrea
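
To make the argument above concrete, here is a minimal userspace sketch of
the "return an error instead of BUG()" behaviour. It is not kernel code:
struct demo_block, demo_remove_blocks() and the choice of -EBUSY as the
error value are assumptions made purely for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory block; not the kernel's struct memory_block. */
struct demo_block {
	bool offline;	/* true once the block has been offlined */
	bool present;	/* true until the block has been removed */
};

/*
 * Remove blocks[0..n) only if every one of them is already offline.
 * Returns 0 on success. If at least one block is still online, returns
 * -EBUSY and leaves every block untouched, so the caller can offline the
 * range and retry instead of the whole system going down.
 */
int demo_remove_blocks(struct demo_block *blocks, size_t n)
{
	size_t i;

	/* First pass: refuse the whole request if anything is still online. */
	for (i = 0; i < n; i++)
		if (!blocks[i].offline)
			return -EBUSY;

	/* Second pass: the precondition holds, so the removal can proceed. */
	for (i = 0; i < n; i++)
		blocks[i].present = false;

	return 0;
}
```

A caller that forgot to offline one block simply sees a negative return
value and the range still in place; nothing else in the system is affected.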

> >
> >Cheers,
> >Andrea
> >
> >>-- 
> >>Michal Hocko
> >>SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 2/5] mm: memory_hotplug: Remove assumption on memory state before hotremove
  2017-11-23 11:14   ` Andrea Reale
  (?)
@ 2017-11-29  0:49     ` joeyli
  -1 siblings, 0 replies; 156+ messages in thread
From: joeyli @ 2017-11-29  0:49 UTC (permalink / raw)
  To: Andrea Reale
  Cc: linux-arm-kernel, linux-kernel, linux-mm, m.bielski, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	catalin.marinas, mhocko, rafael.j.wysocki, linux-acpi

Hi Andrea, 

On Fri, Nov 24, 2017 at 10:22:35AM +0000, Andrea Reale wrote:
> Resending the patch adding linux-acpi in CC, as suggested by Rafael.
> Everyone else: apologies for the noise.
> 
> Commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal")
> introduced the assumption that, when control
> reaches remove_memory, the corresponding memory has already been
> offlined. In that case, acpi_memhotplug was making sure that
> the assumption held.
> This assumption, however, is not necessarily true if offlining
> and removal are not done by the same "controller" (for example,
> when first offlining via sysfs).
> 
> Remove this assumption from the generic remove_memory code
> and move it into the specific acpi_memhotplug code. This is
> a dependency for the software-aided arm64 offlining and removal
> process.
> 
> Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> Signed-off-by: Maciej Bielski <m.bielski@linux.vnet.ibm.com>
> ---
>  drivers/acpi/acpi_memhotplug.c |  2 +-
>  include/linux/memory_hotplug.h |  9 ++++++---
>  mm/memory_hotplug.c            | 13 +++++++++----
>  3 files changed, 16 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> index 6b0d3ef..b0126a0 100644
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -282,7 +282,7 @@ static void acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
>  			nid = memory_add_physaddr_to_nid(info->start_addr);
>  
>  		acpi_unbind_memory_blocks(info);
> -		remove_memory(nid, info->start_addr, info->length);
> +		BUG_ON(remove_memory(nid, info->start_addr, info->length));
>  		list_del(&info->list);
>  		kfree(info);
>  	}
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 58e110a..1a9c7b2 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -295,7 +295,7 @@ static inline bool movable_node_is_enabled(void)
>  extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
>  extern void try_offline_node(int nid);
>  extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
> -extern void remove_memory(int nid, u64 start, u64 size);
> +extern int remove_memory(int nid, u64 start, u64 size);
>  
>  #else
>  static inline bool is_mem_section_removable(unsigned long pfn,
> @@ -311,7 +311,10 @@ static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  	return -EINVAL;
>  }
>  
> -static inline void remove_memory(int nid, u64 start, u64 size) {}
> +static inline int remove_memory(int nid, u64 start, u64 size)
> +{
> +	return -EINVAL;
> +}
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
>  
>  extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
> @@ -323,7 +326,7 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
>  		unsigned long nr_pages);
>  extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
>  extern bool is_memblock_offlined(struct memory_block *mem);
> -extern void remove_memory(int nid, u64 start, u64 size);
> +extern int remove_memory(int nid, u64 start, u64 size);
>  extern int sparse_add_one_section(struct pglist_data *pgdat, unsigned long start_pfn);
>  extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
>  		unsigned long map_offset);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index d4b5f29..d5f15af 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1892,7 +1892,7 @@ EXPORT_SYMBOL(try_offline_node);
>   * and online/offline operations before this call, as required by
>   * try_offline_node().
>   */
> -void __ref remove_memory(int nid, u64 start, u64 size)
> +int __ref remove_memory(int nid, u64 start, u64 size)
>  {
>  	int ret;
>  
> @@ -1908,18 +1908,23 @@ void __ref remove_memory(int nid, u64 start, u64 size)
>  	ret = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
>  				check_memblock_offlined_cb);
>  	if (ret)
> -		BUG();
> +		goto end_remove;
> +
> +	ret = arch_remove_memory(start, size);
> +
> +	if (ret)
> +		goto end_remove;

The original code triggers BUG() when any memblock is not offlined. Why
does the new logic also include the result of arch_remove_memory()?

But I agree that we don't need BUG(). Returning an error is better.

>  
>  	/* remove memmap entry */
>  	firmware_map_remove(start, start + size, "System RAM");
>  	memblock_free(start, size);
>  	memblock_remove(start, size);
>  
> -	arch_remove_memory(start, size);
> -
>  	try_offline_node(nid);
>  
> +end_remove:
>  	mem_hotplug_done();
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(remove_memory);
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
> -- 
> 2.7.4
> 
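
The control flow of the mm/memory_hotplug.c hunk above can be sketched in
userspace as follows. This is only an illustration of the ordering in the
patch (offline check, then the arch step, then the generic teardown, with
the hotplug lock released on every path). struct demo_state,
demo_remove_memory() and the -EBUSY value are hypothetical, not kernel
interfaces.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical state standing in for the kernel's bookkeeping. */
struct demo_state {
	bool all_offline;	/* outcome of the "is every block offline?" walk */
	int arch_result;	/* what the arch-specific removal step returns */
	bool bookkeeping_gone;	/* set once the generic teardown has run */
	int lock_depth;		/* mimics mem_hotplug_begin()/mem_hotplug_done() */
};

int demo_remove_memory(struct demo_state *s)
{
	int ret;

	s->lock_depth++;		/* analogue of mem_hotplug_begin() */

	/* Step 1: bail out with an error if something is still online. */
	ret = s->all_offline ? 0 : -EBUSY;
	if (ret)
		goto end_remove;

	/* Step 2: the arch step now runs before the generic teardown. */
	ret = s->arch_result;
	if (ret)
		goto end_remove;

	/* Step 3: analogue of the firmware map and memblock cleanup. */
	s->bookkeeping_gone = true;

end_remove:
	s->lock_depth--;		/* analogue of mem_hotplug_done(), on every path */
	return ret;
}
```

With this ordering, a failure in either of the first two steps leaves the
generic bookkeeping in place and is reported to the caller as a plain
error code.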

^ permalink raw reply	[flat|nested] 156+ messages in thread

* [PATCH v2 2/5] mm: memory_hotplug: Remove assumption on memory state before hotremove
@ 2017-11-29  0:49     ` joeyli
  0 siblings, 0 replies; 156+ messages in thread
From: joeyli @ 2017-11-29  0:49 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Andrea, 

On Fri, Nov 24, 2017 at 10:22:35AM +0000, Andrea Reale wrote:
> Resending the patch adding linux-acpi in CC, as suggested by Rafael.
> Everyone else: apologies for the noise.
> 
> Commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal")
> introduced an assumption whereas when control
> reaches remove_memory the corresponding memory has been already
> offlined. In that case, the acpi_memhotplug was making sure that
> the assumption held.
> This assumption, however, is not necessarily true if offlining
> and removal are not done by the same "controller" (for example,
> when first offlining via sysfs).
> 
> Removing this assumption for the generic remove_memory code
> and moving it in the specific acpi_memhotplug code. This is
> a dependency for the software-aided arm64 offlining and removal
> process.
> 
> Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> Signed-off-by: Maciej Bielski <m.bielski@linux.vnet.ibm.com>
> ---
>  drivers/acpi/acpi_memhotplug.c |  2 +-
>  include/linux/memory_hotplug.h |  9 ++++++---
>  mm/memory_hotplug.c            | 13 +++++++++----
>  3 files changed, 16 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> index 6b0d3ef..b0126a0 100644
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -282,7 +282,7 @@ static void acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
>  			nid = memory_add_physaddr_to_nid(info->start_addr);
>  
>  		acpi_unbind_memory_blocks(info);
> -		remove_memory(nid, info->start_addr, info->length);
> +		BUG_ON(remove_memory(nid, info->start_addr, info->length));
>  		list_del(&info->list);
>  		kfree(info);
>  	}
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 58e110a..1a9c7b2 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -295,7 +295,7 @@ static inline bool movable_node_is_enabled(void)
>  extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
>  extern void try_offline_node(int nid);
>  extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
> -extern void remove_memory(int nid, u64 start, u64 size);
> +extern int remove_memory(int nid, u64 start, u64 size);
>  
>  #else
>  static inline bool is_mem_section_removable(unsigned long pfn,
> @@ -311,7 +311,10 @@ static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  	return -EINVAL;
>  }
>  
> -static inline void remove_memory(int nid, u64 start, u64 size) {}
> +static inline int remove_memory(int nid, u64 start, u64 size)
> +{
> +	return -EINVAL;
> +}
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
>  
>  extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
> @@ -323,7 +326,7 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
>  		unsigned long nr_pages);
>  extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
>  extern bool is_memblock_offlined(struct memory_block *mem);
> -extern void remove_memory(int nid, u64 start, u64 size);
> +extern int remove_memory(int nid, u64 start, u64 size);
>  extern int sparse_add_one_section(struct pglist_data *pgdat, unsigned long start_pfn);
>  extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
>  		unsigned long map_offset);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index d4b5f29..d5f15af 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1892,7 +1892,7 @@ EXPORT_SYMBOL(try_offline_node);
>   * and online/offline operations before this call, as required by
>   * try_offline_node().
>   */
> -void __ref remove_memory(int nid, u64 start, u64 size)
> +int __ref remove_memory(int nid, u64 start, u64 size)
>  {
>  	int ret;
>  
> @@ -1908,18 +1908,23 @@ void __ref remove_memory(int nid, u64 start, u64 size)
>  	ret = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
>  				check_memblock_offlined_cb);
>  	if (ret)
> -		BUG();
> +		goto end_remove;
> +
> +	ret = arch_remove_memory(start, size);
> +
> +	if (ret)
> +		goto end_remove;

The original code triggers BUG() when any memblock is not offlined. Why
does the new logic also bail out on the result of arch_remove_memory()?

But I agree that we don't need BUG(). Returning an error is better.

>  
>  	/* remove memmap entry */
>  	firmware_map_remove(start, start + size, "System RAM");
>  	memblock_free(start, size);
>  	memblock_remove(start, size);
>  
> -	arch_remove_memory(start, size);
> -
>  	try_offline_node(nid);
>  
> +end_remove:
>  	mem_hotplug_done();
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(remove_memory);
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 2/5] mm: memory_hotplug: Remove assumption on memory state before hotremove
  2017-11-24 18:17             ` Michal Hocko
  (?)
  (?)
@ 2017-11-29  1:20               ` joeyli
  -1 siblings, 0 replies; 156+ messages in thread
From: joeyli @ 2017-11-29  1:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrea Reale, Rafael J. Wysocki, linux-arm-kernel,
	Linux Kernel Mailing List, Linux Memory Management List,
	m.bielski, arunks, Mark Rutland, scott.branden, Will Deacon,
	qiuxishi, Catalin Marinas, Rafael Wysocki,
	ACPI Devel Maling List

On Fri, Nov 24, 2017 at 07:17:41PM +0100, Michal Hocko wrote:
> On Fri 24-11-17 15:54:59, Andrea Reale wrote:
> > On Fri 24 Nov 2017, 16:43, Michal Hocko wrote:
> > > On Fri 24-11-17 14:49:17, Andrea Reale wrote:
> > > > Hi Rafael,
> > > > 
> > > > On Fri 24 Nov 2017, 15:39, Rafael J. Wysocki wrote:
> > > > > On Fri, Nov 24, 2017 at 11:22 AM, Andrea Reale <ar@linux.vnet.ibm.com> wrote:
> > > > > > Resending the patch adding linux-acpi in CC, as suggested by Rafael.
> > > > > > Everyone else: apologies for the noise.
> > > > > >
> > > > > > Commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal")
> > > > > > introduced an assumption whereas when control
> > > > > > reaches remove_memory the corresponding memory has been already
> > > > > > offlined. In that case, the acpi_memhotplug was making sure that
> > > > > > the assumption held.
> > > > > > This assumption, however, is not necessarily true if offlining
> > > > > > and removal are not done by the same "controller" (for example,
> > > > > > when first offlining via sysfs).
> > > > > >
> > > > > > Removing this assumption for the generic remove_memory code
> > > > > > and moving it in the specific acpi_memhotplug code. This is
> > > > > > a dependency for the software-aided arm64 offlining and removal
> > > > > > process.
> > > > > >
> > > > > > Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> > > > > > Signed-off-by: Maciej Bielski <m.bielski@linux.vnet.ibm.com>
> > > > > > ---
> > > > > >  drivers/acpi/acpi_memhotplug.c |  2 +-
> > > > > >  include/linux/memory_hotplug.h |  9 ++++++---
> > > > > >  mm/memory_hotplug.c            | 13 +++++++++----
> > > > > >  3 files changed, 16 insertions(+), 8 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> > > > > > index 6b0d3ef..b0126a0 100644
> > > > > > --- a/drivers/acpi/acpi_memhotplug.c
> > > > > > +++ b/drivers/acpi/acpi_memhotplug.c
> > > > > > @@ -282,7 +282,7 @@ static void acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
> > > > > >                         nid = memory_add_physaddr_to_nid(info->start_addr);
> > > > > >
> > > > > >                 acpi_unbind_memory_blocks(info);
> > > > > > -               remove_memory(nid, info->start_addr, info->length);
> > > > > > +               BUG_ON(remove_memory(nid, info->start_addr, info->length));
> > > > > 
> > > > > Why does this have to be BUG_ON()?  Is it really necessary to kill the
> > > > > system here?
> > > > 
> > > > Actually, I hoped you would help me understand that: that BUG() call was introduced
> > > > by yourself in Commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal")
> > > > in memory_hotplug.c:remove_memory(). 
> > > > 
> > > > Just reading that commit, my understanding was that you were assuming
> > > > that acpi_memory_remove_memory() has already done the job of offlining
> > > > the target memory, so there would be a bug if that wasn't the case.
> > > > 
> > > > In my case, that assumption did not hold and I found that it might not
> > > > hold for other platforms that do not use ACPI. In fact, the purpose of
> > > > this patch is to move this assumption out of the generic hotplug code
> > > > and move it to ACPI code where it originated. 
> > > 
> > > remove_memory failure is basically impossible to handle AFAIR. The
> > > original code to BUG in remove_memory is ugly as hell and we do not want
> > > to spread that out of that function. Instead we really want to get rid
> > > of it.
> > 
> > Today, BUG() is called even in the simple case where remove fails
> > because the section we are removing is not offline.
> 
> You cannot hotremove memory which is still online. This is what caller
> should enforce. This is too late to handle the failure. At least for
> ACPI.
>

The logic in acpi_scan_hot_remove() calls memory_subsys_offline(). If
memory_subsys_offline() does not return an error, ACPI assumes that all
devices have been offlined by the subsystem (the memory subsystem in this
case).

The system then moves to the remove stage, where ACPI calls
acpi_memory_device_remove().
Here
 
> > I cannot see any need to
> > BUG() in such a case: an error code seems more than sufficient to me.
> 
> I do not remember details but AFAIR ACPI is in a deferred (kworker)
> context here and cannot simply communicate error code down the road.
> I agree that we should be able to simply return an error but what is the
> actual error condition that might happen here?
>

Currently acpi_bus_trim() does not handle any returned error. If the
subsystem returns an error, ACPI can only interrupt the hot-remove process.

> > This is why this patch removes the BUG() call when the "offline" check
> > fails from the generic code. 
> 
> As I've said we should simply get rid of BUG rather than move it around.
>

As I remember, the original BUG() helped us find a bug where the offline
state of the memory block device was out of sync with the memory state.
Something like:
	mem->dev.offline != (mem->state == MEM_OFFLINE)

So the BUG() is useful for catching state-sync bugs between the device
object and the subsystem object.
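
To make the condition above concrete, here is a minimal user-space sketch
(not kernel code). The two structs are simplified stand-ins for struct
device and struct memory_block, and memblock_state_in_sync() and
check_memblock_sync() are hypothetical names used only for illustration.
It shows how such a mismatch could be reported through an error return
rather than BUG():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-ins for the kernel types; only the fields used in the
 * expression above are modelled, and the enum values are an assumption. */
enum mem_state { MEM_ONLINE, MEM_GOING_OFFLINE, MEM_OFFLINE };

struct device {
	bool offline;
};

struct memory_block {
	enum mem_state state;
	struct device dev;
};

/* Hypothetical helper: true when the device flag agrees with the state. */
static bool memblock_state_in_sync(const struct memory_block *mem)
{
	assert(mem != NULL);
	return mem->dev.offline == (mem->state == MEM_OFFLINE);
}

/* Hypothetical check: log the mismatch and return an error, no BUG(). */
static int check_memblock_sync(const struct memory_block *mem)
{
	if (!memblock_state_in_sync(mem)) {
		fprintf(stderr, "memory block: dev.offline=%d but state=%d\n",
			mem->dev.offline, mem->state);
		return -1;
	}
	return 0;
}
```

With something like this the inconsistency would still be visible in the
log, but the caller gets to decide what to do about it.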

Thanks
Joey Lee

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 2/5] mm: memory_hotplug: Remove assumption on memory state before hotremove
  2017-11-29  0:49     ` joeyli
  (?)
@ 2017-11-29  1:52       ` joeyli
  -1 siblings, 0 replies; 156+ messages in thread
From: joeyli @ 2017-11-29  1:52 UTC (permalink / raw)
  To: Andrea Reale
  Cc: linux-arm-kernel, linux-kernel, linux-mm, m.bielski, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	catalin.marinas, mhocko, rafael.j.wysocki, linux-acpi

On Wed, Nov 29, 2017 at 08:49:13AM +0800, joeyli wrote:
> Hi Andrea, 
> 
> On Fri, Nov 24, 2017 at 10:22:35AM +0000, Andrea Reale wrote:
> > Resending the patch adding linux-acpi in CC, as suggested by Rafael.
> > Everyone else: apologies for the noise.
> > 
> > Commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal")
> > introduced an assumption whereas when control
> > reaches remove_memory the corresponding memory has been already
> > offlined. In that case, the acpi_memhotplug was making sure that
> > the assumption held.
> > This assumption, however, is not necessarily true if offlining
> > and removal are not done by the same "controller" (for example,
> > when first offlining via sysfs).
> > 
> > Removing this assumption for the generic remove_memory code
> > and moving it in the specific acpi_memhotplug code. This is
> > a dependency for the software-aided arm64 offlining and removal
> > process.
> > 
> > Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> > Signed-off-by: Maciej Bielski <m.bielski@linux.vnet.ibm.com>
> > ---
> >  drivers/acpi/acpi_memhotplug.c |  2 +-
> >  include/linux/memory_hotplug.h |  9 ++++++---
> >  mm/memory_hotplug.c            | 13 +++++++++----
> >  3 files changed, 16 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> > index 6b0d3ef..b0126a0 100644
> > --- a/drivers/acpi/acpi_memhotplug.c
> > +++ b/drivers/acpi/acpi_memhotplug.c
> > @@ -282,7 +282,7 @@ static void acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
> >  			nid = memory_add_physaddr_to_nid(info->start_addr);
> >  
> >  		acpi_unbind_memory_blocks(info);
> > -		remove_memory(nid, info->start_addr, info->length);
> > +		BUG_ON(remove_memory(nid, info->start_addr, info->length));
> >  		list_del(&info->list);
> >  		kfree(info);
> >  	}
> > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> > index 58e110a..1a9c7b2 100644
> > --- a/include/linux/memory_hotplug.h
> > +++ b/include/linux/memory_hotplug.h
> > @@ -295,7 +295,7 @@ static inline bool movable_node_is_enabled(void)
> >  extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
> >  extern void try_offline_node(int nid);
> >  extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
> > -extern void remove_memory(int nid, u64 start, u64 size);
> > +extern int remove_memory(int nid, u64 start, u64 size);
> >  
> >  #else
> >  static inline bool is_mem_section_removable(unsigned long pfn,
> > @@ -311,7 +311,10 @@ static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> >  	return -EINVAL;
> >  }
> >  
> > -static inline void remove_memory(int nid, u64 start, u64 size) {}
> > +static inline int remove_memory(int nid, u64 start, u64 size)
> > +{
> > +	return -EINVAL;
> > +}
> >  #endif /* CONFIG_MEMORY_HOTREMOVE */
> >  
> >  extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
> > @@ -323,7 +326,7 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
> >  		unsigned long nr_pages);
> >  extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
> >  extern bool is_memblock_offlined(struct memory_block *mem);
> > -extern void remove_memory(int nid, u64 start, u64 size);
> > +extern int remove_memory(int nid, u64 start, u64 size);
> >  extern int sparse_add_one_section(struct pglist_data *pgdat, unsigned long start_pfn);
> >  extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
> >  		unsigned long map_offset);
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index d4b5f29..d5f15af 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1892,7 +1892,7 @@ EXPORT_SYMBOL(try_offline_node);
> >   * and online/offline operations before this call, as required by
> >   * try_offline_node().
> >   */
> > -void __ref remove_memory(int nid, u64 start, u64 size)
> > +int __ref remove_memory(int nid, u64 start, u64 size)
> >  {
> >  	int ret;
> >  
> > @@ -1908,18 +1908,23 @@ void __ref remove_memory(int nid, u64 start, u64 size)
> >  	ret = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
> >  				check_memblock_offlined_cb);
> >  	if (ret)
> > -		BUG();
> > +		goto end_remove;
> > +
> > +	ret = arch_remove_memory(start, size);

The result of arch_remove_memory() should not be included in the BUG()
condition.

> > +
> > +	if (ret)
> > +		goto end_remove;
> 
> The original code triggers BUG() when any memblock is not offlined. Why
> the new logic includes the result of arch_remove_memory()?
> 
> But I agreed the we don't need BUG(). Returning a error is better.

Actually, I missed one thing.

The BUG() has caught an issue where the offline state was out of sync
between the memory_block and its device object, like:
        mem->dev.offline != (mem->state == MEM_OFFLINE)

So the BUG() is useful for catching state issues in the memory subsystem.
But I understand your concern about the two-step offline/remove from
userland.

Maybe we should move the BUG() somewhere else rather than just remove it.
Or, if we think the BUG() is too drastic, we should at least print an error
message, and ACPI should check the return value from the subsystem so that
it can interrupt the memory-hotplug process.
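
A rough user-space sketch of that second option (not kernel code):
mock_remove_memory() is only a stand-in for remove_memory(), and struct
mem_range, remove_ranges() and the choice of -EBUSY are assumptions made
for illustration. The caller logs the failure and stops at the first error
instead of calling BUG_ON():

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical description of one memory range to be removed. */
struct mem_range {
	unsigned long long start;
	unsigned long long length;
	int offlined;	/* non-zero once the range has been offlined */
};

/* Stand-in for remove_memory(): refuses ranges that are still online.
 * Returning -EBUSY is an assumption of this sketch. */
static int mock_remove_memory(const struct mem_range *r)
{
	assert(r != NULL);
	return r->offlined ? 0 : -EBUSY;
}

/* Hypothetical caller: print an error and interrupt the removal on the
 * first failure, propagating the error code to its own caller. */
static int remove_ranges(const struct mem_range *ranges, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int ret = mock_remove_memory(&ranges[i]);

		if (ret) {
			fprintf(stderr, "hot-remove of start=%#llx len=%#llx failed: %d\n",
				ranges[i].start, ranges[i].length, ret);
			return ret;
		}
	}
	return 0;
}
```

Ranges removed before the failing one stay removed in this sketch; whether
that is acceptable for the real ACPI path is a separate question.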

Thanks a lot!
Joey Lee 

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 2/5] mm: memory_hotplug: Remove assumption on memory state before hotremove
@ 2017-11-29  1:52       ` joeyli
  0 siblings, 0 replies; 156+ messages in thread
From: joeyli @ 2017-11-29  1:52 UTC (permalink / raw)
  To: Andrea Reale
  Cc: linux-arm-kernel, linux-kernel, linux-mm, m.bielski, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	catalin.marinas, mhocko, rafael.j.wysocki, linux-acpi

On Wed, Nov 29, 2017 at 08:49:13AM +0800, joeyli wrote:
> Hi Andrea, 
> 
> On Fri, Nov 24, 2017 at 10:22:35AM +0000, Andrea Reale wrote:
> > Resending the patch adding linux-acpi in CC, as suggested by Rafael.
> > Everyone else: apologies for the noise.
> > 
> > Commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal")
> > introduced an assumption whereas when control
> > reaches remove_memory the corresponding memory has been already
> > offlined. In that case, the acpi_memhotplug was making sure that
> > the assumption held.
> > This assumption, however, is not necessarily true if offlining
> > and removal are not done by the same "controller" (for example,
> > when first offlining via sysfs).
> > 
> > Removing this assumption for the generic remove_memory code
> > and moving it in the specific acpi_memhotplug code. This is
> > a dependency for the software-aided arm64 offlining and removal
> > process.
> > 
> > Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> > Signed-off-by: Maciej Bielski <m.bielski@linux.vnet.ibm.com>
> > ---
> >  drivers/acpi/acpi_memhotplug.c |  2 +-
> >  include/linux/memory_hotplug.h |  9 ++++++---
> >  mm/memory_hotplug.c            | 13 +++++++++----
> >  3 files changed, 16 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> > index 6b0d3ef..b0126a0 100644
> > --- a/drivers/acpi/acpi_memhotplug.c
> > +++ b/drivers/acpi/acpi_memhotplug.c
> > @@ -282,7 +282,7 @@ static void acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
> >  			nid = memory_add_physaddr_to_nid(info->start_addr);
> >  
> >  		acpi_unbind_memory_blocks(info);
> > -		remove_memory(nid, info->start_addr, info->length);
> > +		BUG_ON(remove_memory(nid, info->start_addr, info->length));
> >  		list_del(&info->list);
> >  		kfree(info);
> >  	}
> > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> > index 58e110a..1a9c7b2 100644
> > --- a/include/linux/memory_hotplug.h
> > +++ b/include/linux/memory_hotplug.h
> > @@ -295,7 +295,7 @@ static inline bool movable_node_is_enabled(void)
> >  extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
> >  extern void try_offline_node(int nid);
> >  extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
> > -extern void remove_memory(int nid, u64 start, u64 size);
> > +extern int remove_memory(int nid, u64 start, u64 size);
> >  
> >  #else
> >  static inline bool is_mem_section_removable(unsigned long pfn,
> > @@ -311,7 +311,10 @@ static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> >  	return -EINVAL;
> >  }
> >  
> > -static inline void remove_memory(int nid, u64 start, u64 size) {}
> > +static inline int remove_memory(int nid, u64 start, u64 size)
> > +{
> > +	return -EINVAL;
> > +}
> >  #endif /* CONFIG_MEMORY_HOTREMOVE */
> >  
> >  extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
> > @@ -323,7 +326,7 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
> >  		unsigned long nr_pages);
> >  extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
> >  extern bool is_memblock_offlined(struct memory_block *mem);
> > -extern void remove_memory(int nid, u64 start, u64 size);
> > +extern int remove_memory(int nid, u64 start, u64 size);
> >  extern int sparse_add_one_section(struct pglist_data *pgdat, unsigned long start_pfn);
> >  extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
> >  		unsigned long map_offset);
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index d4b5f29..d5f15af 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1892,7 +1892,7 @@ EXPORT_SYMBOL(try_offline_node);
> >   * and online/offline operations before this call, as required by
> >   * try_offline_node().
> >   */
> > -void __ref remove_memory(int nid, u64 start, u64 size)
> > +int __ref remove_memory(int nid, u64 start, u64 size)
> >  {
> >  	int ret;
> >  
> > @@ -1908,18 +1908,23 @@ void __ref remove_memory(int nid, u64 start, u64 size)
> >  	ret = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
> >  				check_memblock_offlined_cb);
> >  	if (ret)
> > -		BUG();
> > +		goto end_remove;
> > +
> > +	ret = arch_remove_memory(start, size);

The result of arch_remove_memory() should not be covered by the BUG().

> > +
> > +	if (ret)
> > +		goto end_remove;
> 
> The original code triggers BUG() when any memblock is not offlined. Why
> does the new logic include the result of arch_remove_memory()?
> 
> But I agree that we don't need BUG(). Returning an error is better.

Actually, I missed one thing.

The BUG() has caught an issue where the offline state was out of sync between
the memory_block and its device object, like:
        mem->dev.offline != (mem->state == MEM_OFFLINE)

So the BUG() is useful for catching state issues in the memory subsystem. But I
understand your concern about the two-step offline/remove from userland.

Maybe we should move the BUG() somewhere else rather than just remove it. Or, if
we think the BUG() is too severe, we should at least print an error
message, and ACPI should check the return value from the subsystem and
interrupt the memory-hotplug process.
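
To make the suggestion concrete, here is a rough userspace sketch (not
kernel code, and not what the patch does). It checks the same state
invariant, prints an error instead of crashing, and hands an error code
back to the caller. All the *_model names are made up for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace stand-ins for the kernel objects discussed above. */
enum mem_state_model { MEM_ONLINE, MEM_OFFLINE };

struct memory_block_model {
	enum mem_state_model state;	/* state tracked by the memory subsystem */
	bool dev_offline;		/* state tracked by the device object */
};

/*
 * Return 0 if every block is offline and both views agree, -EINVAL if the
 * two views are out of sync, and -EBUSY if a block is still online.
 * The mismatch is reported instead of crashing the machine.
 */
int check_blocks_offlined_model(const struct memory_block_model *blocks,
				size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		const struct memory_block_model *mem = &blocks[i];

		if (mem->dev_offline != (mem->state == MEM_OFFLINE)) {
			fprintf(stderr, "block %zu: device/subsystem state mismatch\n", i);
			return -EINVAL;
		}
		if (mem->state != MEM_OFFLINE)
			return -EBUSY;
	}
	return 0;
}

/* Hypothetical caller: interrupt the removal and report, rather than BUG(). */
int remove_memory_model(const struct memory_block_model *blocks, size_t nr)
{
	int ret = check_blocks_offlined_model(blocks, nr);

	if (ret) {
		fprintf(stderr, "hot-remove interrupted: error %d\n", ret);
		return ret;
	}
	/* The actual removal would happen here. */
	return 0;
}
```

With something along these lines the state-sync problem would still be
visible in the log, and the caller (ACPI in our case) could decide what
to do with the error.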

Thanks a lot!
Joey Lee 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 2/5] mm: memory_hotplug: Remove assumption on memory state before hotremove
  2017-11-29  1:20               ` joeyli
  (?)
  (?)
@ 2017-11-30  9:47                 ` Michal Hocko
  -1 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2017-11-30  9:47 UTC (permalink / raw)
  To: joeyli
  Cc: Andrea Reale, Rafael J. Wysocki, linux-arm-kernel,
	Linux Kernel Mailing List, Linux Memory Management List,
	m.bielski, arunks, Mark Rutland, scott.branden, Will Deacon,
	qiuxishi, Catalin Marinas, Rafael Wysocki,
	ACPI Devel Maling List

On Wed 29-11-17 09:20:40, Joey Lee wrote:
> On Fri, Nov 24, 2017 at 07:17:41PM +0100, Michal Hocko wrote:
[...]
> > You cannot hotremove memory which is still online. This is what caller
> > should enforce. This is too late to handle the failure. At least for
> > ACPI.
> >
> 
> The logic in acpi_scan_hot_remove() calls memory_subsys_offline(). If
> memory_subsys_offline() returns no error, then ACPI
> assumes all devices have been offlined by the subsystem (the memory subsystem in this case).

Yes, that is what I meant by calling it the caller's responsibility.

> Then the system moves to the remove stage and ACPI calls acpi_memory_device_remove().
> Here
>  
> > > I cannot see any need to
> > > BUG() in such a case: an error code seems more than sufficient to me.
> > 
> > I do not remember the details but AFAIR ACPI is in a deferred (kworker)
> > context here and cannot simply communicate error code down the road.
> > I agree that we should be able to simply return an error but what is the
> > actual error condition that might happen here?
> >
> 
> Currently acpi_bus_trim() does not handle any returned error. If the subsystem
> returns an error, ACPI can only interrupt the hot-remove process.
> 
> > > This is why this patch removes the BUG() call when the "offline" check
> > > fails from the generic code. 
> > 
> > As I've said we should simply get rid of BUG rather than move it around.
> >
> 
> As I remember, the original BUG() helped us find a bug where the
> offline state was out of sync between the memblock device and the memory state.
> Something like:
> 	mem->dev.offline != (mem->state == MEM_OFFLINE)
> 
> So the BUG() is useful for catching state-sync bugs between the device object
> and the subsystem object.

BUG is a fatal condition in many contexts, and therefore not an
appropriate way to handle errors.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 4/5] mm: memory_hotplug: Add memory hotremove probe device
  2017-11-23 11:14   ` Andrea Reale
  (?)
@ 2017-11-30 14:49     ` Michal Hocko
  -1 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2017-11-30 14:49 UTC (permalink / raw)
  To: Andrea Reale
  Cc: linux-arm-kernel, linux-kernel, linux-mm, m.bielski, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	catalin.marinas, realean2

On Thu 23-11-17 11:14:52, Andrea Reale wrote:
> Adding a "remove" sysfs handle that can be used to trigger
> memory hotremove manually, exactly symmetrically with
> what happens with the "probe" device for hot-add.
> 
> This is useful for architectures that do not rely on
> ACPI for memory hot-remove.

As already said elsewhere, this really has to check the online status of
the range and fail if some of it is still online.
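
Roughly what I have in mind, as a userspace sketch only (not kernel
code). The block size, the block_online_model table and the function
name are all hypothetical and only there to show the shape of the check:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK_SIZE_MODEL (128ULL << 20)	/* hypothetical 128 MiB memory block */
#define NR_BLOCKS_MODEL  4

/* Hypothetical table: true means the block is still online. */
bool block_online_model[NR_BLOCKS_MODEL] = { true, false, false, true };

/*
 * Parse a physical address, require memory block alignment, and refuse
 * to remove a block that is still online.
 * Returns 0 on success or a negative errno value.
 */
int memory_remove_store_model(const char *buf)
{
	char *end;
	uint64_t phys_addr, idx;

	errno = 0;
	phys_addr = strtoull(buf, &end, 0);
	if (errno || end == buf)
		return -EINVAL;		/* not a number */

	if (phys_addr & (BLOCK_SIZE_MODEL - 1))
		return -EINVAL;		/* not aligned to a memory block */

	idx = phys_addr / BLOCK_SIZE_MODEL;
	if (idx >= NR_BLOCKS_MODEL)
		return -EINVAL;		/* outside the modelled range */

	if (block_online_model[idx])
		return -EBUSY;		/* still online: fail, do not remove */

	/* A real handler would call remove_memory() here and return its result. */
	return 0;
}
```

The point is that the write fails with an error visible to userspace
instead of silently returning count.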

> Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
> ---
>  drivers/base/memory.c | 34 +++++++++++++++++++++++++++++++++-
>  1 file changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 1d60b58..8ccb67c 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -530,7 +530,36 @@ memory_probe_store(struct device *dev, struct device_attribute *attr,
>  }
>  
>  static DEVICE_ATTR(probe, S_IWUSR, NULL, memory_probe_store);
> -#endif
> +
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +static ssize_t
> +memory_remove_store(struct device *dev,
> +		struct device_attribute *attr, const char *buf, size_t count)
> +{
> +	u64 phys_addr;
> +	int nid, ret;
> +	unsigned long pages_per_block = PAGES_PER_SECTION * sections_per_block;
> +
> +	ret = kstrtoull(buf, 0, &phys_addr);
> +	if (ret)
> +		return ret;
> +
> +	if (phys_addr & ((pages_per_block << PAGE_SHIFT) - 1))
> +		return -EINVAL;
> +
> +	nid = memory_add_physaddr_to_nid(phys_addr);
> +	ret = lock_device_hotplug_sysfs();
> +	if (ret)
> +		return ret;
> +
> +	remove_memory(nid, phys_addr,
> +			 MIN_MEMORY_BLOCK_SIZE * sections_per_block);
> +	unlock_device_hotplug();
> +	return count;
> +}
> +static DEVICE_ATTR(remove, S_IWUSR, NULL, memory_remove_store);
> +#endif /* CONFIG_MEMORY_HOTREMOVE */
> +#endif /* CONFIG_ARCH_MEMORY_PROBE */
>  
>  #ifdef CONFIG_MEMORY_FAILURE
>  /*
> @@ -790,6 +819,9 @@ bool is_memblock_offlined(struct memory_block *mem)
>  static struct attribute *memory_root_attrs[] = {
>  #ifdef CONFIG_ARCH_MEMORY_PROBE
>  	&dev_attr_probe.attr,
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +	&dev_attr_remove.attr,
> +#endif
>  #endif
>  
>  #ifdef CONFIG_MEMORY_FAILURE
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 3/5] mm: memory_hotplug: memblock to track partially removed vmemmap mem
  2017-11-23 11:14   ` Andrea Reale
  (?)
@ 2017-11-30 14:51     ` Michal Hocko
  -1 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2017-11-30 14:51 UTC (permalink / raw)
  To: Andrea Reale
  Cc: linux-arm-kernel, linux-kernel, linux-mm, m.bielski, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	catalin.marinas, realean2

On Thu 23-11-17 11:14:38, Andrea Reale wrote:
> When hot-removing memory we need to free vmemmap memory.
> However, depending on which memory is being removed, it might
> not always be possible to free a full vmemmap page / huge-page
> because part of it might still be used.
> 
> Commit ae9aae9eda2d ("memory-hotplug: common APIs to support page tables
> hot-remove") introduced a workaround for x86
> hot-remove, by which partially unused areas are filled with
> the 0xFD constant. Full pages are only removed when fully
> filled by 0xFDs.
> 
> This commit introduces a MEMBLOCK_UNUSED_VMEMMAP memblock flag, with
> the goal of using it in place of 0xFDs. For now, this will be used for
> the arm64 port of memory hot remove, but the idea is to eventually use
> the same mechanism for x86 as well.

Why can't you use the same approach as x86 has? Have a look at
vmemmap_free et al.
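
For reference, the idea behind the 0xFD marker described in the
changelog can be sketched in userspace like this. It is a simplified
model and not the x86 code itself; the *_model names and the 4 KiB page
size are assumptions made for the example:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE_MODEL  4096u	/* assumed page size for the sketch */
#define PAGE_INUSE_MODEL 0xFD	/* marker byte mentioned in the changelog */

/* Return true only if every byte of the page carries the marker. */
bool page_fully_marked_model(const unsigned char *page)
{
	for (size_t i = 0; i < PAGE_SIZE_MODEL; i++)
		if (page[i] != PAGE_INUSE_MODEL)
			return false;
	return true;
}

/*
 * Mark [off, off + len) of a vmemmap page as no longer used.
 * Returns true when the whole page is marked and could therefore be
 * freed, false when part of it is still in use or the range is invalid.
 */
bool mark_unused_and_test_model(unsigned char *page, size_t off, size_t len)
{
	if (off > PAGE_SIZE_MODEL || len > PAGE_SIZE_MODEL - off)
		return false;

	memset(page + off, PAGE_INUSE_MODEL, len);
	return page_fully_marked_model(page);
}
```

No extra flag or side structure is needed: the page content itself
records which parts are unused.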
 
> Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
> ---
>  include/linux/memblock.h | 12 ++++++++++++
>  mm/memblock.c            | 32 ++++++++++++++++++++++++++++++++
>  2 files changed, 44 insertions(+)
> 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index bae11c7..0daec05 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -26,6 +26,9 @@ enum {
>  	MEMBLOCK_HOTPLUG	= 0x1,	/* hotpluggable region */
>  	MEMBLOCK_MIRROR		= 0x2,	/* mirrored region */
>  	MEMBLOCK_NOMAP		= 0x4,	/* don't add to kernel direct mapping */
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +	MEMBLOCK_UNUSED_VMEMMAP	= 0x8,  /* Mark VMEMAP blocks as dirty */
> +#endif
>  };
>  
>  struct memblock_region {
> @@ -90,6 +93,10 @@ int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
>  int memblock_mark_nomap(phys_addr_t base, phys_addr_t size);
>  int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
>  ulong choose_memblock_flags(void);
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +int memblock_mark_unused_vmemmap(phys_addr_t base, phys_addr_t size);
> +int memblock_clear_unused_vmemmap(phys_addr_t base, phys_addr_t size);
> +#endif
>  
>  /* Low level functions */
>  int memblock_add_range(struct memblock_type *type,
> @@ -182,6 +189,11 @@ static inline bool memblock_is_nomap(struct memblock_region *m)
>  	return m->flags & MEMBLOCK_NOMAP;
>  }
>  
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +bool memblock_is_vmemmap_unused_range(struct memblock_type *mt,
> +		phys_addr_t start, phys_addr_t end);
> +#endif
> +
>  #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>  int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
>  			    unsigned long  *end_pfn);
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 9120578..30d5aa4 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -809,6 +809,18 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
>  	return memblock_setclr_flag(base, size, 0, MEMBLOCK_NOMAP);
>  }
>  
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +int __init_memblock memblock_mark_unused_vmemmap(phys_addr_t base,
> +		phys_addr_t size)
> +{
> +	return memblock_setclr_flag(base, size, 1, MEMBLOCK_UNUSED_VMEMMAP);
> +}
> +int __init_memblock memblock_clear_unused_vmemmap(phys_addr_t base,
> +		phys_addr_t size)
> +{
> +	return memblock_setclr_flag(base, size, 0, MEMBLOCK_UNUSED_VMEMMAP);
> +}
> +#endif
>  /**
>   * __next_reserved_mem_region - next function for for_each_reserved_region()
>   * @idx: pointer to u64 loop variable
> @@ -1696,6 +1708,26 @@ void __init_memblock memblock_trim_memory(phys_addr_t align)
>  	}
>  }
>  
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +bool __init_memblock memblock_is_vmemmap_unused_range(struct memblock_type *mt,
> +		phys_addr_t start, phys_addr_t end)
> +{
> +	u64 i;
> +	struct memblock_region *r;
> +
> +	i = memblock_search(mt, start);
> +	r = &(mt->regions[i]);
> +	while (r->base < end) {
> +		if (!(r->flags & MEMBLOCK_UNUSED_VMEMMAP))
> +			return 0;
> +
> +		r = &(memblock.memory.regions[++i]);
> +	}
> +
> +	return 1;
> +}
> +#endif
> +
>  void __init_memblock memblock_set_current_limit(phys_addr_t limit)
>  {
>  	memblock.current_limit = limit;
> -- 
> 2.7.4
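
The range check quoted above can be modelled outside the kernel. What
follows is a minimal userspace sketch, not the memblock implementation:
struct region, struct region_table, REGION_UNUSED_VMEMMAP and
range_is_unused_vmemmap() are hypothetical stand-ins for struct
memblock_region, struct memblock_type, MEMBLOCK_UNUSED_VMEMMAP and
memblock_is_vmemmap_unused_range(). It assumes the intended behaviour is
that every region overlapping [start, end) must carry the flag, that the
walk stays inside the table that was passed in and stops at its last
entry, and that a range overlapping no region at all counts as "not
unused".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit, mirroring MEMBLOCK_UNUSED_VMEMMAP in the patch. */
#define REGION_UNUSED_VMEMMAP 0x8UL

struct region {
	uint64_t base;
	uint64_t size;
	unsigned long flags;
};

struct region_table {
	size_t cnt;
	const struct region *regions;	/* sorted by base, non-overlapping */
};

/*
 * Return true only if at least one region overlaps [start, end) and
 * every overlapping region has REGION_UNUSED_VMEMMAP set.
 */
static bool range_is_unused_vmemmap(const struct region_table *t,
				    uint64_t start, uint64_t end)
{
	bool seen = false;
	size_t i;

	assert(start < end);

	for (i = 0; i < t->cnt; i++) {
		const struct region *r = &t->regions[i];

		if (r->base >= end)
			break;		/* sorted: nothing further can overlap */
		if (r->base + r->size <= start)
			continue;	/* region lies entirely below the range */
		if (!(r->flags & REGION_UNUSED_VMEMMAP))
			return false;
		seen = true;
	}

	return seen;
}
```

With regions at 0x1000 and 0x2000 flagged and one at 0x3000 unflagged,
the range [0x1000, 0x3000) passes the check and [0x1000, 0x3800) does not.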

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 0/5] Memory hotplug support for arm64 - complete patchset v2
  2017-11-23 17:33     ` Andrea Reale
  (?)
@ 2017-11-30 14:57       ` Michal Hocko
  -1 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2017-11-30 14:57 UTC (permalink / raw)
  To: Andrea Reale
  Cc: linux-arm-kernel, linux-kernel, linux-mm, m.bielski, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	catalin.marinas, realean2

On Thu 23-11-17 17:33:31, Andrea Reale wrote:
> On Thu 23 Nov 2017, 17:02, Michal Hocko wrote:
> 
> Hi Michal,
> 
> > I will try to have a look but I do not expect to understand any of arm64
> > specific changes so I will focus on the generic code but it would help a
> > _lot_ if the cover letter provided some overview of what has been done
> > from a higher level POV. What are the arch pieces and what is the
> > generic code missing. A quick glance over patches suggests that
> > changelogs for specific patches are modest as well. Could you give us
> > more information please? Reviewing hundreds of lines of code without
> > context is a pain.
> 
> sorry for the lack of details. I will try to provide a better
> overview in the following. Please, feel free to ask for more details
> where needed.
> 
> Overall, the goal of the patchset is to implement arch_add_memory and
> arch_remove_memory for arm64, to support the generic memory_hotplug
> framework.
> 
> Hot add
> -------
> Not many surprises here. We implement the arch-specific
> arch_add_memory, which builds the kernel page tables via hotplug_paging()
> and then calls the arch-specific add_pages(). We need the arch-specific
> add_pages() to implement a trick that makes the status of the pages being
> added satisfy the assumptions made in the generic __add_pages (see the
> code comments).

Actually, I would like to see exactly this explained. The arch support for
hotplug should basically be only about arch_add_memory and add_pages,
resp. arch_remove_memory and __remove_pages. Nothing much more, really.
The core hotplug code should take care of the rest. Ideally you should
not be forced to touch the generic code at all. If you are, then this
should be called out explicitly.
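
The split described here can be pictured with a toy userspace model. It
is only a sketch under stated assumptions: model_state,
model_core_add_pages(), model_core_remove_pages(),
model_arch_add_memory() and model_arch_remove_memory() are hypothetical
names, and the two bool arrays stand in for the arch-owned linear mapping
and the core-owned page bookkeeping. The add ordering (map first, then
hand over to the core) follows the description quoted above; the remove
ordering (core teardown first, then unmap) is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_NR_PAGES 64u

struct model_state {
	bool mapped[MODEL_NR_PAGES];	/* arch side: page is in the linear map */
	bool tracked[MODEL_NR_PAGES];	/* core side: page has bookkeeping */
};

/* Core side: knows nothing about page tables. */
static int model_core_add_pages(struct model_state *s, unsigned int start,
				unsigned int nr)
{
	unsigned int i;

	assert(start + nr <= MODEL_NR_PAGES);
	for (i = start; i < start + nr; i++)
		s->tracked[i] = true;
	return 0;
}

static void model_core_remove_pages(struct model_state *s, unsigned int start,
				    unsigned int nr)
{
	unsigned int i;

	assert(start + nr <= MODEL_NR_PAGES);
	for (i = start; i < start + nr; i++)
		s->tracked[i] = false;
}

static bool model_range_ok(unsigned int start, unsigned int nr)
{
	return nr > 0 && start < MODEL_NR_PAGES && nr <= MODEL_NR_PAGES - start;
}

/* Arch side: only the mapping, then delegate to the core. */
static int model_arch_add_memory(struct model_state *s, unsigned int start,
				 unsigned int nr)
{
	unsigned int i;

	if (!model_range_ok(start, nr))
		return -EINVAL;
	for (i = start; i < start + nr; i++)
		s->mapped[i] = true;
	return model_core_add_pages(s, start, nr);
}

static int model_arch_remove_memory(struct model_state *s, unsigned int start,
				    unsigned int nr)
{
	unsigned int i;

	if (!model_range_ok(start, nr))
		return -EINVAL;
	model_core_remove_pages(s, start, nr);
	for (i = start; i < start + nr; i++)
		s->mapped[i] = false;
	return 0;
}
```

In this model the arch hooks are the only place that touches the mapped
array, which is the property the paragraph above asks for.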
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 2/5] mm: memory_hotplug: Remove assumption on memory state before hotremove
  2017-11-29  1:52       ` joeyli
  (?)
@ 2017-12-04 11:28         ` Andrea Reale
  -1 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-12-04 11:28 UTC (permalink / raw)
  To: joeyli
  Cc: linux-arm-kernel, linux-kernel, linux-mm, m.bielski, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	catalin.marinas, mhocko, rafael.j.wysocki, linux-acpi

Hi Joey,

thanks for your comments. Responses inline:

On Wed 29 Nov 2017, 09:52, joeyli wrote:
> On Wed, Nov 29, 2017 at 08:49:13AM +0800, joeyli wrote:
> > Hi Andrea, 
> > 
> > On Fri, Nov 24, 2017 at 10:22:35AM +0000, Andrea Reale wrote:
> > > Resending the patch adding linux-acpi in CC, as suggested by Rafael.
> > > Everyone else: apologies for the noise.
> > > 
> > > Commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal")
> > > introduced the assumption that, by the time control reaches
> > > remove_memory, the corresponding memory has already been
> > > offlined. In that case, acpi_memhotplug was making sure that
> > > the assumption held.
> > > This assumption, however, is not necessarily true if offlining
> > > and removal are not done by the same "controller" (for example,
> > > when first offlining via sysfs).
> > > 
> > > Remove this assumption from the generic remove_memory code
> > > and move it into the specific acpi_memhotplug code. This is
> > > a dependency for the software-aided arm64 offlining and removal
> > > process.
> > > 
> > > Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> > > Signed-off-by: Maciej Bielski <m.bielski@linux.vnet.ibm.com>
> > > ---
> > >  drivers/acpi/acpi_memhotplug.c |  2 +-
> > >  include/linux/memory_hotplug.h |  9 ++++++---
> > >  mm/memory_hotplug.c            | 13 +++++++++----
> > >  3 files changed, 16 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> > > index 6b0d3ef..b0126a0 100644
> > > --- a/drivers/acpi/acpi_memhotplug.c
> > > +++ b/drivers/acpi/acpi_memhotplug.c
> > > @@ -282,7 +282,7 @@ static void acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
> > >  			nid = memory_add_physaddr_to_nid(info->start_addr);
> > >  
> > >  		acpi_unbind_memory_blocks(info);
> > > -		remove_memory(nid, info->start_addr, info->length);
> > > +		BUG_ON(remove_memory(nid, info->start_addr, info->length));
> > >  		list_del(&info->list);
> > >  		kfree(info);
> > >  	}
> > > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> > > index 58e110a..1a9c7b2 100644
> > > --- a/include/linux/memory_hotplug.h
> > > +++ b/include/linux/memory_hotplug.h
> > > @@ -295,7 +295,7 @@ static inline bool movable_node_is_enabled(void)
> > >  extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
> > >  extern void try_offline_node(int nid);
> > >  extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
> > > -extern void remove_memory(int nid, u64 start, u64 size);
> > > +extern int remove_memory(int nid, u64 start, u64 size);
> > >  
> > >  #else
> > >  static inline bool is_mem_section_removable(unsigned long pfn,
> > > @@ -311,7 +311,10 @@ static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> > >  	return -EINVAL;
> > >  }
> > >  
> > > -static inline void remove_memory(int nid, u64 start, u64 size) {}
> > > +static inline int remove_memory(int nid, u64 start, u64 size)
> > > +{
> > > +	return -EINVAL;
> > > +}
> > >  #endif /* CONFIG_MEMORY_HOTREMOVE */
> > >  
> > >  extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
> > > @@ -323,7 +326,7 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
> > >  		unsigned long nr_pages);
> > >  extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
> > >  extern bool is_memblock_offlined(struct memory_block *mem);
> > > -extern void remove_memory(int nid, u64 start, u64 size);
> > > +extern int remove_memory(int nid, u64 start, u64 size);
> > >  extern int sparse_add_one_section(struct pglist_data *pgdat, unsigned long start_pfn);
> > >  extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
> > >  		unsigned long map_offset);
> > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > > index d4b5f29..d5f15af 100644
> > > --- a/mm/memory_hotplug.c
> > > +++ b/mm/memory_hotplug.c
> > > @@ -1892,7 +1892,7 @@ EXPORT_SYMBOL(try_offline_node);
> > >   * and online/offline operations before this call, as required by
> > >   * try_offline_node().
> > >   */
> > > -void __ref remove_memory(int nid, u64 start, u64 size)
> > > +int __ref remove_memory(int nid, u64 start, u64 size)
> > >  {
> > >  	int ret;
> > >  
> > > @@ -1908,18 +1908,23 @@ void __ref remove_memory(int nid, u64 start, u64 size)
> > >  	ret = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
> > >  				check_memblock_offlined_cb);
> > >  	if (ret)
> > > -		BUG();
> > > +		goto end_remove;
> > > +
> > > +	ret = arch_remove_memory(start, size);
> 
> The BUG() should not cover the result of arch_remove_memory().

arch_remove_memory might also fail in some cases. In the arm64
implementation in this patchset, for example, it might fail in the
(very rare) case where we would have to split a P[UM]D-mapped section for
removal (we do not support that; see the email thread here:
https://lkml.org/lkml/2017/11/23/456).


> > > +
> > > +	if (ret)
> > > +		goto end_remove;
> > 
> > > The original code triggers BUG() when any memblock is not offlined. Why
> > > does the new logic include the result of arch_remove_memory()?
> > > 
> > > But I agree that we don't need BUG(). Returning an error is better.
> > 
> > Actually, I missed one thing.
> > 
> > The BUG() has caught an issue where the offline state was out of sync
> > between the memory_block and the device object, like:
> >         mem->dev.offline != (mem->state == MEM_OFFLINE)
> > 
> > So the BUG() is useful for catching state issues in the memory subsystem.
> > But I understand your concern about the two-step offline/remove from
> > userland.
> > 
> > Maybe we should move the BUG() somewhere else rather than just remove it.
> > Or, if we think that the BUG() is too harsh, we should at least print an
> > error message, and ACPI should check the return value from the subsystem
> > to interrupt the memory-hotplug process.

In this patchset, BUG() is moved to acpi_memory_remove_memory(),
the caller of remove_memory(). However, I agree with Michal that
we should not BUG() here but rather halt the hotremove process and print
some errors.
Is there any state in ACPI that should be undone in case of hotremove
errors, or can we just stop the process "halfway"?
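
One way to picture halting instead of calling BUG() is the following
userspace sketch. It is not the ACPI or memory-hotplug code: hr_range,
hr_arch_remove(), hr_remove_memory() and hr_acpi_remove() are
hypothetical names, the -EBUSY and -EINVAL values are assumptions chosen
for illustration, and the released flag stands in for the
list_del()/kfree() bookkeeping, which is only dropped once removal has
succeeded. Whether ACPI needs to undo more state than that is exactly the
open question above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct hr_range {
	unsigned long long start;
	unsigned long long size;
	bool offlined;		/* every block in the range is offline */
	bool needs_split;	/* removal would split a block mapping */
};

/* Arch step: refuse ranges that would require splitting a mapping. */
static int hr_arch_remove(const struct hr_range *r)
{
	return r->needs_split ? -EINVAL : 0;
}

/* Generic step: report failures to the caller instead of crashing. */
static int hr_remove_memory(const struct hr_range *r)
{
	assert(r->size > 0);

	if (!r->offlined)
		return -EBUSY;
	return hr_arch_remove(r);
}

/* Caller: log the error, stop, and keep the bookkeeping on failure. */
static int hr_acpi_remove(const struct hr_range *r, bool *released)
{
	int ret = hr_remove_memory(r);

	if (ret) {
		fprintf(stderr, "hot-remove of [%#llx, %#llx) failed: %d\n",
			r->start, r->start + r->size, ret);
		return ret;
	}

	*released = true;	/* safe to drop the entry only now */
	return 0;
}
```

A range that is still online or that would need a mapping split leaves
released untouched, so the caller can retry or report the failure.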

> Thanks a lot!
> Joey Lee 

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 156+ messages in thread

* [PATCH v2 2/5] mm: memory_hotplug: Remove assumption on memory state before hotremove
@ 2017-12-04 11:28         ` Andrea Reale
  0 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-12-04 11:28 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Joey,

thanks for your comments. Responses inline:

On Wed 29 Nov 2017, 09:52, joeyli wrote:
> On Wed, Nov 29, 2017 at 08:49:13AM +0800, joeyli wrote:
> > Hi Andrea, 
> > 
> > On Fri, Nov 24, 2017 at 10:22:35AM +0000, Andrea Reale wrote:
> > > Resending the patch adding linux-acpi in CC, as suggested by Rafael.
> > > Everyone else: apologies for the noise.
> > > 
> > > Commit 242831eb15a0 ("Memory hotplug / ACPI: Simplify memory removal")
> > > introduced an assumption whereas when control
> > > reaches remove_memory the corresponding memory has been already
> > > offlined. In that case, the acpi_memhotplug was making sure that
> > > the assumption held.
> > > This assumption, however, is not necessarily true if offlining
> > > and removal are not done by the same "controller" (for example,
> > > when first offlining via sysfs).
> > > 
> > > Removing this assumption for the generic remove_memory code
> > > and moving it in the specific acpi_memhotplug code. This is
> > > a dependency for the software-aided arm64 offlining and removal
> > > process.
> > > 
> > > Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> > > Signed-off-by: Maciej Bielski <m.bielski@linux.vnet.ibm.com>
> > > ---
> > >  drivers/acpi/acpi_memhotplug.c |  2 +-
> > >  include/linux/memory_hotplug.h |  9 ++++++---
> > >  mm/memory_hotplug.c            | 13 +++++++++----
> > >  3 files changed, 16 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> > > index 6b0d3ef..b0126a0 100644
> > > --- a/drivers/acpi/acpi_memhotplug.c
> > > +++ b/drivers/acpi/acpi_memhotplug.c
> > > @@ -282,7 +282,7 @@ static void acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
> > >  			nid = memory_add_physaddr_to_nid(info->start_addr);
> > >  
> > >  		acpi_unbind_memory_blocks(info);
> > > -		remove_memory(nid, info->start_addr, info->length);
> > > +		BUG_ON(remove_memory(nid, info->start_addr, info->length));
> > >  		list_del(&info->list);
> > >  		kfree(info);
> > >  	}
> > > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> > > index 58e110a..1a9c7b2 100644
> > > --- a/include/linux/memory_hotplug.h
> > > +++ b/include/linux/memory_hotplug.h
> > > @@ -295,7 +295,7 @@ static inline bool movable_node_is_enabled(void)
> > >  extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
> > >  extern void try_offline_node(int nid);
> > >  extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
> > > -extern void remove_memory(int nid, u64 start, u64 size);
> > > +extern int remove_memory(int nid, u64 start, u64 size);
> > >  
> > >  #else
> > >  static inline bool is_mem_section_removable(unsigned long pfn,
> > > @@ -311,7 +311,10 @@ static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> > >  	return -EINVAL;
> > >  }
> > >  
> > > -static inline void remove_memory(int nid, u64 start, u64 size) {}
> > > +static inline int remove_memory(int nid, u64 start, u64 size)
> > > +{
> > > +	return -EINVAL;
> > > +}
> > >  #endif /* CONFIG_MEMORY_HOTREMOVE */
> > >  
> > >  extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
> > > @@ -323,7 +326,7 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
> > >  		unsigned long nr_pages);
> > >  extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
> > >  extern bool is_memblock_offlined(struct memory_block *mem);
> > > -extern void remove_memory(int nid, u64 start, u64 size);
> > > +extern int remove_memory(int nid, u64 start, u64 size);
> > >  extern int sparse_add_one_section(struct pglist_data *pgdat, unsigned long start_pfn);
> > >  extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
> > >  		unsigned long map_offset);
> > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > > index d4b5f29..d5f15af 100644
> > > --- a/mm/memory_hotplug.c
> > > +++ b/mm/memory_hotplug.c
> > > @@ -1892,7 +1892,7 @@ EXPORT_SYMBOL(try_offline_node);
> > >   * and online/offline operations before this call, as required by
> > >   * try_offline_node().
> > >   */
> > > -void __ref remove_memory(int nid, u64 start, u64 size)
> > > +int __ref remove_memory(int nid, u64 start, u64 size)
> > >  {
> > >  	int ret;
> > >  
> > > @@ -1908,18 +1908,23 @@ void __ref remove_memory(int nid, u64 start, u64 size)
> > >  	ret = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
> > >  				check_memblock_offlined_cb);
> > >  	if (ret)
> > > -		BUG();
> > > +		goto end_remove;
> > > +
> > > +	ret = arch_remove_memory(start, size);
> 
> Should not include arch_remove_memory() to BUG().

arch_remove_memory might also fail in some cases. In the arm64
implementation of this patchset, for example, it might fail in the
(very rare) case when we would have to split a P[UM]D mapped section for
removal (and we do not support that - see email thread here:
https://lkml.org/lkml/2017/11/23/456).


> > > +
> > > +	if (ret)
> > > +		goto end_remove;
> > 
> > The original code triggers BUG() when any memblock is not offlined. Why
> > the new logic includes the result of arch_remove_memory()?
> > 
> > But I agreed the we don't need BUG(). Returning a error is better.
> 
> Actually, I lost one thing.
> 
> The BUG() have caught a issue about the offline state doesn't sync between
> memory_block and device object. like:
>         mem->dev.offline != (mem->state == MEM_OFFLINE)
> 
> So, the BUG() is useful to capture state issue in memory subsystem. But, I
> understood your concern about the two steps offline/remove from userland. 
> 
> Maybe we should move the BUG() to somewhere but not just remove it. Or if
> we think that the BUG() is too intense, at least we should print out a error
> message, and ACPI should checks the return value from subsystem to
> interrupt memory-hotplug process.

In this patchset, BUG() is moved to acpi_memory_remove_memory(),
the caller of arch_remove_memory(). However, I agree with Michal, that
we should not BUG() here but rather halt the hotremove process and print
some errors. 
Is there any state in ACPI that should be undone in case of hotremove
errors or we can just stop the process "halfway"?

> Thanks a lot!
> Joey Lee 

Thanks,
Andrea

> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 0/5] Memory hotplug support for arm64 - complete patchset v2
  2017-11-30 14:57       ` Michal Hocko
  (?)
@ 2017-12-04 11:34         ` Andrea Reale
  -1 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-12-04 11:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-arm-kernel, linux-kernel, linux-mm, m.bielski, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	catalin.marinas, realean2

Hi Michal,

On Thu 30 Nov 2017, 15:57, Michal Hocko wrote:
> On Thu 23-11-17 17:33:31, Andrea Reale wrote:
> > On Thu 23 Nov 2017, 17:02, Michal Hocko wrote:
> > 
> > Hi Michal,
> > 
> > > I will try to have a look but I do not expect to understand any of arm64
> > > specific changes so I will focus on the generic code but it would help a
> > > _lot_ if the cover letter provided some overview of what has been done
> > > from a higher level POV. What are the arch pieces and what is the
> > > generic code missing? A quick glance over patches suggests that
> > > changelogs for specific patches are modest as well. Could you give us
> > > more information please? Reviewing hundreds of lines of code without
> > > context is a pain.
> > 
> > sorry for the lack of details. I will try to provide a better
> > overview in the following. Please, feel free to ask for more details
> > where needed.
> > 
> > Overall, the goal of the patchset is to implement arch_add_memory and
> > arch_remove_memory for arm64, to support the generic memory_hotplug
> > framework. 
> > 
> > Hot add
> > -------
> > Not so many surprises here. We implement the arch specific
> > arch_add_memory, which builds the kernel page tables via hotplug_paging()
> > and then calls arch specific add_pages(). We need the arch specific
> > add_pages() to implement a trick that makes the status of pages being
> > added accepted by the assumptions made in the generic __add_pages. (See
> > code comments).
> 
> Actually I would like to see exactly this explained. The arch support of
> the hotplug should be basically only about arch_add_memory and add_pages
> resp. arch_remove_memory and __remove_pages. Nothing much more, really.
> The core hotplug code should take care of the rest. Ideally you
> shouldn't be really forced to touch the generic code. If so, then this
> should be called out explicitly.

As for hot add, there are no changes to the core hotplug code
whatsoever; just arch_add_memory and add_pages.

As for hot remove, there are two changes to generic code, as
described in the second part of https://lkml.org/lkml/2017/11/23/456.
The first is the removal of the BUG() call from remove_memory and
moving it to ACPI code: I think we agree that calling BUG() from
remove_memory is undesirable. I have to develop a better
understanding of how to get rid of it from ACPI as well.

The second is the memblock change for vmemmap removal.
I'll try to discuss this change in more detail in a follow-up email.

Thanks,
Andrea

> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 3/5] mm: memory_hotplug: memblock to track partially removed vmemmap mem
  2017-11-30 14:51     ` Michal Hocko
  (?)
@ 2017-12-04 11:49       ` Andrea Reale
  -1 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-12-04 11:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-arm-kernel, linux-kernel, linux-mm, m.bielski, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	catalin.marinas, realean2

On Thu 30 Nov 2017, 15:51, Michal Hocko wrote:
> On Thu 23-11-17 11:14:38, Andrea Reale wrote:
> > When hot-removing memory we need to free vmemmap memory.
> > However, depending on which memory is being removed, it might
> > not always be possible to free a full vmemmap page / huge-page
> > because part of it might still be used.
> > 
> > Commit ae9aae9eda2d ("memory-hotplug: common APIs to support page tables
> > hot-remove") introduced a workaround for x86
> > hot-remove, by which partially unused areas are filled with
> > the 0xFD constant. Full pages are only removed when fully
> > filled by 0xFDs.
> > 
> > This commit introduces a MEMBLOCK_UNUSED_VMEMMAP memblock flag, with
> > the goal of using it in place of 0xFDs. For now, this will be used for
> > the arm64 port of memory hot remove, but the idea is to eventually use
> > the same mechanism for x86 as well.
> 
> Why can't you use the same approach as x86 does? Have a look at
> vmemmap_free et al.
> 

This arm64 hot-remove version (including vmemmap_free) is indeed an
almost 1-to-1 port of the x86 approach. 

If you look at the first version of the patchset we submitted a while 
ago (https://lkml.org/lkml/2017/4/11/540), we were initially using the
x86 approach of filling unused page structs with 0xFDs. Commenting on
that, Mark suggested (and, indeed, I agree with him) that relying on a
magic constant for marking some portions of physical memory was quite
ugly. That is why we have used memblock for the purpose in this revised
patchset.
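
For readers unfamiliar with the 0xFD scheme, it can be modelled in user
space roughly as below. This is only an illustrative sketch, not the x86
code: FILL_BYTE, mark_unused() and page_fully_unused() are hypothetical
names, and a small byte buffer stands in for a vmemmap page.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define FILL_BYTE 0xFD	/* magic constant marking bytes as no longer used */

/* Mark len bytes starting at offset off within the page as unused. */
static void mark_unused(unsigned char *page, size_t off, size_t len)
{
	memset(page + off, FILL_BYTE, len);
}

/* The page may be freed only once every byte carries the fill value. */
static bool page_fully_unused(const unsigned char *page, size_t page_size)
{
	size_t i;

	for (i = 0; i < page_size; i++)
		if (page[i] != FILL_BYTE)
			return false;
	return true;
}
```

The weakness is visible even in this model: live data that happens to
consist only of the fill byte cannot be told apart from unused space.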

If you have a different view and any concrete suggestion on how to
improve this, it is very welcome.
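
As a starting point for discussion, the flag-based check can be sketched
in user space as follows. This is not the code from the patch: struct
region, REGION_UNUSED_VMEMMAP and range_is_unused() are hypothetical
names, and the sketch assumes regions are sorted by base address and do
not overlap. Unlike a plain walk over the region array, it bounds the
index and treats gaps between regions as "not unused".

```c
#include <stdbool.h>
#include <stddef.h>

#define REGION_UNUSED_VMEMMAP 0x8u	/* illustrative flag value */

/* Hypothetical stand-in for a memblock region. */
struct region {
	unsigned long long base;
	unsigned long long size;
	unsigned int flags;
};

/*
 * Returns true only if [start, end) is completely covered by regions
 * that all carry REGION_UNUSED_VMEMMAP. Regions must be sorted by base
 * and must not overlap.
 */
static bool range_is_unused(const struct region *r, size_t cnt,
			    unsigned long long start, unsigned long long end)
{
	unsigned long long pos = start;
	size_t i;

	if (start >= end)
		return false;

	for (i = 0; i < cnt && pos < end; i++) {
		unsigned long long rend = r[i].base + r[i].size;

		if (rend <= pos)
			continue;	/* region lies entirely before pos */
		if (r[i].base > pos)
			return false;	/* gap: part of the range is untracked */
		if (!(r[i].flags & REGION_UNUSED_VMEMMAP))
			return false;	/* this part is still in use */
		pos = rend;
	}
	return pos >= end;
}
```

A caller would free the backing page only when range_is_unused() returns
true for the whole page-sized range.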

> > Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> > Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
> > ---
> >  include/linux/memblock.h | 12 ++++++++++++
> >  mm/memblock.c            | 32 ++++++++++++++++++++++++++++++++
> >  2 files changed, 44 insertions(+)
> > 
> > diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> > index bae11c7..0daec05 100644
> > --- a/include/linux/memblock.h
> > +++ b/include/linux/memblock.h
> > @@ -26,6 +26,9 @@ enum {
> >  	MEMBLOCK_HOTPLUG	= 0x1,	/* hotpluggable region */
> >  	MEMBLOCK_MIRROR		= 0x2,	/* mirrored region */
> >  	MEMBLOCK_NOMAP		= 0x4,	/* don't add to kernel direct mapping */
> > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > +	MEMBLOCK_UNUSED_VMEMMAP	= 0x8,  /* Mark VMEMAP blocks as dirty */
> > +#endif
> >  };
> >  
> >  struct memblock_region {
> > @@ -90,6 +93,10 @@ int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
> >  int memblock_mark_nomap(phys_addr_t base, phys_addr_t size);
> >  int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
> >  ulong choose_memblock_flags(void);
> > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > +int memblock_mark_unused_vmemmap(phys_addr_t base, phys_addr_t size);
> > +int memblock_clear_unused_vmemmap(phys_addr_t base, phys_addr_t size);
> > +#endif
> >  
> >  /* Low level functions */
> >  int memblock_add_range(struct memblock_type *type,
> > @@ -182,6 +189,11 @@ static inline bool memblock_is_nomap(struct memblock_region *m)
> >  	return m->flags & MEMBLOCK_NOMAP;
> >  }
> >  
> > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > +bool memblock_is_vmemmap_unused_range(struct memblock_type *mt,
> > +		phys_addr_t start, phys_addr_t end);
> > +#endif
> > +
> >  #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> >  int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
> >  			    unsigned long  *end_pfn);
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index 9120578..30d5aa4 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -809,6 +809,18 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
> >  	return memblock_setclr_flag(base, size, 0, MEMBLOCK_NOMAP);
> >  }
> >  
> > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > +int __init_memblock memblock_mark_unused_vmemmap(phys_addr_t base,
> > +		phys_addr_t size)
> > +{
> > +	return memblock_setclr_flag(base, size, 1, MEMBLOCK_UNUSED_VMEMMAP);
> > +}
> > +int __init_memblock memblock_clear_unused_vmemmap(phys_addr_t base,
> > +		phys_addr_t size)
> > +{
> > +	return memblock_setclr_flag(base, size, 0, MEMBLOCK_UNUSED_VMEMMAP);
> > +}
> > +#endif
> >  /**
> >   * __next_reserved_mem_region - next function for for_each_reserved_region()
> >   * @idx: pointer to u64 loop variable
> > @@ -1696,6 +1708,26 @@ void __init_memblock memblock_trim_memory(phys_addr_t align)
> >  	}
> >  }
> >  
> > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > +bool __init_memblock memblock_is_vmemmap_unused_range(struct memblock_type *mt,
> > +		phys_addr_t start, phys_addr_t end)
> > +{
> > +	u64 i;
> > +	struct memblock_region *r;
> > +
> > +	i = memblock_search(mt, start);
> > +	r = &(mt->regions[i]);
> > +	while (r->base < end) {
> > +		if (!(r->flags & MEMBLOCK_UNUSED_VMEMMAP))
> > +			return 0;
> > +
> > +		r = &(memblock.memory.regions[++i]);
> > +	}
> > +
> > +	return 1;
> > +}
> > +#endif
> > +
> >  void __init_memblock memblock_set_current_limit(phys_addr_t limit)
> >  {
> >  	memblock.current_limit = limit;
> > -- 
> > 2.7.4
> > 

Thanks,
Andrea

> 
> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 156+ messages in thread

* [PATCH v2 3/5] mm: memory_hotplug: memblock to track partially removed vmemmap mem
@ 2017-12-04 11:49       ` Andrea Reale
  0 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-12-04 11:49 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu 30 Nov 2017, 15:51, Michal Hocko wrote:
> On Thu 23-11-17 11:14:38, Andrea Reale wrote:
> > When hot-removing memory we need to free vmemmap memory.
> > However, depending on the memory is being removed, it might
> > not be always possible to free a full vmemmap page / huge-page
> > because part of it might still be used.
> > 
> > Commit ae9aae9eda2d ("memory-hotplug: common APIs to support page tables
> > hot-remove") introduced a workaround for x86
> > hot-remove, by which partially unused areas are filled with
> > the 0xFD constant. Full pages are only removed when fully
> > filled by 0xFDs.
> > 
> > This commit introduces a MEMBLOCK_UNUSED_VMEMMAP memblock flag, with
> > the goal of using it in place of 0xFDs. For now, this will be used for
> > the arm64 port of memory hot remove, but the idea is to eventually use
> > the same mechanism for x86 as well.
> 
> Why cannot you use the same approach as x86 have? Have a look at the
> vmemmap_free at al.
> 

This arm64 hot-remove version (including vmemmap_free) is indeed an
almost 1-to-1 port of the x86 approach. 

If you look at the first version of the patchset we submitted a while 
ago (https://lkml.org/lkml/2017/4/11/540), we were initially using the
x86 approach of filling unsued page structs with 0xFDs. Commenting on
that, Mark suggested (and, indeed, I agree with him) that relying on a
magic constant for marking some portions of physical memory was quite
ugly. That is why we have used memblock for the purpose in this revised
patchset.

If you have a different view and any concrete suggestion on how to
improve this, it is definitely very well welcome. 

> > Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> > Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
> > ---
> >  include/linux/memblock.h | 12 ++++++++++++
> >  mm/memblock.c            | 32 ++++++++++++++++++++++++++++++++
> >  2 files changed, 44 insertions(+)
> > 
> > diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> > index bae11c7..0daec05 100644
> > --- a/include/linux/memblock.h
> > +++ b/include/linux/memblock.h
> > @@ -26,6 +26,9 @@ enum {
> >  	MEMBLOCK_HOTPLUG	= 0x1,	/* hotpluggable region */
> >  	MEMBLOCK_MIRROR		= 0x2,	/* mirrored region */
> >  	MEMBLOCK_NOMAP		= 0x4,	/* don't add to kernel direct mapping */
> > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > +	MEMBLOCK_UNUSED_VMEMMAP	= 0x8,  /* Mark VMEMAP blocks as dirty */
> > +#endif
> >  };
> >  
> >  struct memblock_region {
> > @@ -90,6 +93,10 @@ int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
> >  int memblock_mark_nomap(phys_addr_t base, phys_addr_t size);
> >  int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
> >  ulong choose_memblock_flags(void);
> > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > +int memblock_mark_unused_vmemmap(phys_addr_t base, phys_addr_t size);
> > +int memblock_clear_unused_vmemmap(phys_addr_t base, phys_addr_t size);
> > +#endif
> >  
> >  /* Low level functions */
> >  int memblock_add_range(struct memblock_type *type,
> > @@ -182,6 +189,11 @@ static inline bool memblock_is_nomap(struct memblock_region *m)
> >  	return m->flags & MEMBLOCK_NOMAP;
> >  }
> >  
> > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > +bool memblock_is_vmemmap_unused_range(struct memblock_type *mt,
> > +		phys_addr_t start, phys_addr_t end);
> > +#endif
> > +
> >  #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> >  int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
> >  			    unsigned long  *end_pfn);
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index 9120578..30d5aa4 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -809,6 +809,18 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
> >  	return memblock_setclr_flag(base, size, 0, MEMBLOCK_NOMAP);
> >  }
> >  
> > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > +int __init_memblock memblock_mark_unused_vmemmap(phys_addr_t base,
> > +		phys_addr_t size)
> > +{
> > +	return memblock_setclr_flag(base, size, 1, MEMBLOCK_UNUSED_VMEMMAP);
> > +}
> > +int __init_memblock memblock_clear_unused_vmemmap(phys_addr_t base,
> > +		phys_addr_t size)
> > +{
> > +	return memblock_setclr_flag(base, size, 0, MEMBLOCK_UNUSED_VMEMMAP);
> > +}
> > +#endif
> >  /**
> >   * __next_reserved_mem_region - next function for for_each_reserved_region()
> >   * @idx: pointer to u64 loop variable
> > @@ -1696,6 +1708,26 @@ void __init_memblock memblock_trim_memory(phys_addr_t align)
> >  	}
> >  }
> >  
> > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > +bool __init_memblock memblock_is_vmemmap_unused_range(struct memblock_type *mt,
> > +		phys_addr_t start, phys_addr_t end)
> > +{
> > +	u64 i;
> > +	struct memblock_region *r;
> > +
> > +	i = memblock_search(mt, start);
> > +	r = &(mt->regions[i]);
> > +	while (r->base < end) {
> > +		if (!(r->flags & MEMBLOCK_UNUSED_VMEMMAP))
> > +			return 0;
> > +
> > +		r = &(memblock.memory.regions[++i]);
> > +	}
> > +
> > +	return 1;
> > +}
> > +#endif
> > +
> >  void __init_memblock memblock_set_current_limit(phys_addr_t limit)
> >  {
> >  	memblock.current_limit = limit;
> > -- 
> > 2.7.4
> > 

Thanks,
Andrea

> 
> -- 
> Michal Hocko
> SUSE Labs
> 
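
For readers following the quoted mm/memblock.c hunk, the range check can be
modelled in userspace roughly as below. This is only a sketch under
assumptions: the sk_* names and the simplified structs are hypothetical
stand-ins, not the kernel's memblock types. Unlike the quoted loop, which
starts from mt but then advances through memblock.memory.regions and has no
bound on the index, the sketch stays on the table it was given, stops at the
last region, and handles a failed search.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the memblock structures (hypothetical names). */
#define SK_UNUSED_VMEMMAP 0x8u

struct sk_region {
	uint64_t base;
	uint64_t size;
	unsigned int flags;
};

struct sk_type {
	size_t cnt;
	const struct sk_region *regions;	/* sorted by base, non-overlapping */
};

/* Binary search: index of the region containing addr, or -1 if none does. */
static long sk_search(const struct sk_type *t, uint64_t addr)
{
	size_t lo = 0, hi = t->cnt;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const struct sk_region *r = &t->regions[mid];

		if (addr < r->base)
			hi = mid;
		else if (addr >= r->base + r->size)
			lo = mid + 1;
		else
			return (long)mid;
	}
	return -1;
}

/*
 * True if start lies inside a region and every region from that one on
 * whose base is below end carries the flag. Holes between regions are
 * not detected in this sketch.
 */
static bool sk_is_unused_range(const struct sk_type *t,
			       uint64_t start, uint64_t end)
{
	long i = sk_search(t, start);

	if (i < 0)
		return false;	/* start is not covered by any region */

	for (; (size_t)i < t->cnt && t->regions[i].base < end; i++) {
		if (!(t->regions[i].flags & SK_UNUSED_VMEMMAP))
			return false;
	}
	return true;
}
```

Whether holes inside the range should count as "unused" is a policy question
the sketch deliberately leaves open.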

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 4/5] mm: memory_hotplug: Add memory hotremove probe device
  2017-11-30 14:49     ` Michal Hocko
  (?)
@ 2017-12-04 11:51       ` Andrea Reale
  -1 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-12-04 11:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-arm-kernel, linux-kernel, linux-mm, m.bielski, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	catalin.marinas, realean2

On Thu 30 Nov 2017, 15:49, Michal Hocko wrote:
> On Thu 23-11-17 11:14:52, Andrea Reale wrote:
> > Adding a "remove" sysfs handle that can be used to trigger
> > memory hotremove manually, exactly symmetrically with
> > what happens with the "probe" device for hot-add.
> > 
> > This is useful for architectures that do not rely on
> > ACPI for memory hot-remove.
> 
> As already said elsewhere, this really has to check the online status of
> the range and fail if some of it is still online.
> 

This is actually still done in remove_memory() (patch 2/5) with
walk_memory_range(). We just return an error rather than calling BUG().

Or are you referring to something else?


> > Signed-off-by: Andrea Reale <ar@linux.vnet.ibm.com>
> > Signed-off-by: Maciej Bielski <m.bielski@virtualopensystems.com>
> > ---
> >  drivers/base/memory.c | 34 +++++++++++++++++++++++++++++++++-
> >  1 file changed, 33 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> > index 1d60b58..8ccb67c 100644
> > --- a/drivers/base/memory.c
> > +++ b/drivers/base/memory.c
> > @@ -530,7 +530,36 @@ memory_probe_store(struct device *dev, struct device_attribute *attr,
> >  }
> >  
> >  static DEVICE_ATTR(probe, S_IWUSR, NULL, memory_probe_store);
> > -#endif
> > +
> > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > +static ssize_t
> > +memory_remove_store(struct device *dev,
> > +		struct device_attribute *attr, const char *buf, size_t count)
> > +{
> > +	u64 phys_addr;
> > +	int nid, ret;
> > +	unsigned long pages_per_block = PAGES_PER_SECTION * sections_per_block;
> > +
> > +	ret = kstrtoull(buf, 0, &phys_addr);
> > +	if (ret)
> > +		return ret;
> > +
> > +	if (phys_addr & ((pages_per_block << PAGE_SHIFT) - 1))
> > +		return -EINVAL;
> > +
> > +	nid = memory_add_physaddr_to_nid(phys_addr);
> > +	ret = lock_device_hotplug_sysfs();
> > +	if (ret)
> > +		return ret;
> > +
> > +	remove_memory(nid, phys_addr,
> > +			 MIN_MEMORY_BLOCK_SIZE * sections_per_block);
> > +	unlock_device_hotplug();
> > +	return count;
> > +}
> > +static DEVICE_ATTR(remove, S_IWUSR, NULL, memory_remove_store);
> > +#endif /* CONFIG_MEMORY_HOTREMOVE */
> > +#endif /* CONFIG_ARCH_MEMORY_PROBE */
> >  
> >  #ifdef CONFIG_MEMORY_FAILURE
> >  /*
> > @@ -790,6 +819,9 @@ bool is_memblock_offlined(struct memory_block *mem)
> >  static struct attribute *memory_root_attrs[] = {
> >  #ifdef CONFIG_ARCH_MEMORY_PROBE
> >  	&dev_attr_probe.attr,
> > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > +	&dev_attr_remove.attr,
> > +#endif
> >  #endif
> >  
> >  #ifdef CONFIG_MEMORY_FAILURE
> > -- 
> > 2.7.4

Thanks,
Andrea

> 
> -- 
> Michal Hocko
> SUSE Labs
> 
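
The alignment test in the quoted memory_remove_store() can be worked through
with concrete numbers. The sketch below is a userspace model only; the page
shift and section size are assumed values chosen for illustration (the real
ones depend on the architecture and kernel configuration), and the sk_* names
are hypothetical. The mask trick is valid only when the block size is a power
of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only; the real ones depend on the kernel config. */
#define SK_PAGE_SHIFT		12u		/* 4 KiB pages (assumed) */
#define SK_PAGES_PER_SECTION	(1ull << 18)	/* 1 GiB sections (assumed) */

/* Block size in bytes for a given number of sections per block. */
static uint64_t sk_block_bytes(uint64_t sections_per_block)
{
	return (SK_PAGES_PER_SECTION * sections_per_block) << SK_PAGE_SHIFT;
}

/*
 * Same shape as the handler's check: all address bits below the block
 * size must be zero. Assumes the block size is a power of two.
 */
static bool sk_block_aligned(uint64_t phys_addr, uint64_t sections_per_block)
{
	return (phys_addr & (sk_block_bytes(sections_per_block) - 1)) == 0;
}
```

With these assumed values a one-section block is 1 GiB, so 0x80000000 passes
the check while 0x80001000 would be rejected with -EINVAL.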

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 3/5] mm: memory_hotplug: memblock to track partially removed vmemmap mem
  2017-12-04 11:49       ` Andrea Reale
  (?)
@ 2017-12-04 12:32         ` Michal Hocko
  -1 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2017-12-04 12:32 UTC (permalink / raw)
  To: Andrea Reale
  Cc: linux-arm-kernel, linux-kernel, linux-mm, m.bielski, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	catalin.marinas, realean2

On Mon 04-12-17 11:49:09, Andrea Reale wrote:
> On Thu 30 Nov 2017, 15:51, Michal Hocko wrote:
> > On Thu 23-11-17 11:14:38, Andrea Reale wrote:
> > > When hot-removing memory we need to free vmemmap memory.
> > > However, depending on which memory is being removed, it might
> > > not always be possible to free a full vmemmap page / huge-page
> > > because part of it might still be used.
> > > 
> > > Commit ae9aae9eda2d ("memory-hotplug: common APIs to support page tables
> > > hot-remove") introduced a workaround for x86
> > > hot-remove, by which partially unused areas are filled with
> > > the 0xFD constant. Full pages are only removed when fully
> > > filled by 0xFDs.
> > > 
> > > This commit introduces a MEMBLOCK_UNUSED_VMEMMAP memblock flag, with
> > > the goal of using it in place of 0xFDs. For now, this will be used for
> > > the arm64 port of memory hot remove, but the idea is to eventually use
> > > the same mechanism for x86 as well.
> > 
> > Why can't you use the same approach as x86 does? Have a look at
> > vmemmap_free et al.
> > 
> 
> This arm64 hot-remove version (including vmemmap_free) is indeed an
> almost 1-to-1 port of the x86 approach. 
> 
> If you look at the first version of the patchset we submitted a while 
> ago (https://lkml.org/lkml/2017/4/11/540), we were initially using the
> x86 approach of filling unused page structs with 0xFDs. Commenting on
> that, Mark suggested (and, indeed, I agree with him) that relying on a
> magic constant for marking some portions of physical memory was quite
> ugly. That is why we have used memblock for the purpose in this revised
> patchset.
> 
> If you have a different view and any concrete suggestion on how to
> improve this, it is definitely very welcome.

I would really prefer if those architectures shared the code (and
concept) as much as possible. It is really a PITA to wrap your head
around each architecture for reasons which are not inherent to that
specific architecture. If you find the way x86 is implemented ugly,
then all right, but making arm64 special just as a matter of taste is
far from ideal IMHO.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 4/5] mm: memory_hotplug: Add memory hotremove probe device
  2017-12-04 11:51       ` Andrea Reale
  (?)
@ 2017-12-04 12:33         ` Michal Hocko
  -1 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2017-12-04 12:33 UTC (permalink / raw)
  To: Andrea Reale
  Cc: linux-arm-kernel, linux-kernel, linux-mm, m.bielski, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	catalin.marinas, realean2

On Mon 04-12-17 11:51:29, Andrea Reale wrote:
> On Thu 30 Nov 2017, 15:49, Michal Hocko wrote:
> > On Thu 23-11-17 11:14:52, Andrea Reale wrote:
> > > Adding a "remove" sysfs handle that can be used to trigger
> > > memory hotremove manually, exactly symmetrically with
> > > what happens with the "probe" device for hot-add.
> > > 
> > > This is useful for architectures that do not rely on
> > > ACPI for memory hot-remove.
> > 
> > As already said elsewhere, this really has to check the online status of
> > the range and fail if some of it is still online.
> > 
> 
> This is actually still done in remove_memory() (patch 2/5) with
> walk_memory_range(). We just return an error rather than calling BUG().
> 
> Or are you referring to something else?

But you are not returning that error to the caller, are you?

[...]
> > > +	nid = memory_add_physaddr_to_nid(phys_addr);
> > > +	ret = lock_device_hotplug_sysfs();
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	remove_memory(nid, phys_addr,
> > > +			 MIN_MEMORY_BLOCK_SIZE * sections_per_block);
> > > +	unlock_device_hotplug();
> > > +	return count;

-- 
Michal Hocko
SUSE Labs
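
The point about the discarded return value can be illustrated with a small
userspace model. This is a sketch under assumptions, not the patch itself:
sk_remove_memory() and the lock helpers are hypothetical stand-ins, the
-EBUSY code is an assumed choice (the thread does not say which error patch
2/5 returns), and it presumes remove_memory() returns an int as described in
the reply above. The only thing it shows is the store-handler shape in which
the result of the removal, rather than count, is returned on failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the hotplug lock; the depth is for testing. */
static int sk_lock_depth;

static int sk_lock_hotplug(void)
{
	sk_lock_depth++;
	return 0;
}

static void sk_unlock_hotplug(void)
{
	sk_lock_depth--;
}

/* Hypothetical remove_memory() model: fails while the range is online. */
static bool sk_range_online;

static int sk_remove_memory(uint64_t start, uint64_t size)
{
	(void)start;
	(void)size;
	return sk_range_online ? -EBUSY : 0;	/* -EBUSY is an assumption */
}

/* Store-handler shape: count on success, negative errno on failure. */
static long sk_remove_store(uint64_t phys_addr, uint64_t block_bytes,
			    size_t count)
{
	int ret = sk_lock_hotplug();

	if (ret)
		return ret;

	ret = sk_remove_memory(phys_addr, block_bytes);
	sk_unlock_hotplug();	/* always drop the lock before returning */

	return ret ? ret : (long)count;
}
```

With this shape a write to the "remove" file fails visibly in userspace when
the range is still online, instead of reporting success.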

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 3/5] mm: memory_hotplug: memblock to track partially removed vmemmap mem
  2017-12-04 12:32         ` Michal Hocko
  (?)
@ 2017-12-04 12:42           ` Andrea Reale
  -1 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-12-04 12:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-arm-kernel, linux-kernel, linux-mm, m.bielski, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	catalin.marinas, realean2

On Mon  4 Dec 2017, 13:32, Michal Hocko wrote:
> On Mon 04-12-17 11:49:09, Andrea Reale wrote:
> > On Thu 30 Nov 2017, 15:51, Michal Hocko wrote:
> > > On Thu 23-11-17 11:14:38, Andrea Reale wrote:
> > > > When hot-removing memory we need to free vmemmap memory.
> > > > However, depending on which memory is being removed, it might
> > > > not always be possible to free a full vmemmap page / huge-page
> > > > because part of it might still be used.
> > > > 
> > > > Commit ae9aae9eda2d ("memory-hotplug: common APIs to support page tables
> > > > hot-remove") introduced a workaround for x86
> > > > hot-remove, by which partially unused areas are filled with
> > > > the 0xFD constant. Full pages are only removed when fully
> > > > filled by 0xFDs.
> > > > 
> > > > This commit introduces a MEMBLOCK_UNUSED_VMEMMAP memblock flag, with
> > > > the goal of using it in place of 0xFDs. For now, this will be used for
> > > > the arm64 port of memory hot remove, but the idea is to eventually use
> > > > the same mechanism for x86 as well.
> > > 
> > > Why can't you use the same approach as x86 does? Have a look at
> > > vmemmap_free et al.
> > > 
> > 
> > This arm64 hot-remove version (including vmemmap_free) is indeed an
> > almost 1-to-1 port of the x86 approach. 
> > 
> > If you look at the first version of the patchset we submitted a while 
> > ago (https://lkml.org/lkml/2017/4/11/540), we were initially using the
> > x86 approach of filling unused page structs with 0xFDs. Commenting on
> > that, Mark suggested (and, indeed, I agree with him) that relying on a
> > magic constant for marking some portions of physical memory was quite
> > ugly. That is why we have used memblock for the purpose in this revised
> > patchset.
> > 
> > If you have a different view and any concrete suggestion on how to
> > improve this, it is definitely very welcome.
> 
> I would really prefer if those architectures shared the code (and
> concept) as much as possible. It is really a PITA to wrap your head
> around each architecture for reasons which are not inherent to that
> specific architecture. If you find the way x86 is implemented ugly,
> then all right, but making arm64 special just as a matter of taste is
> far from ideal IMHO.

The plan is indeed to use this memblock flag in x86 hot-remove as well,
in place of the 0xFDs. The change is quite straightforward, and we could
push it in the next patchset release. Our rationale was to use it first in
the new architecture and then, once proven stable, port it back to x86.

However, I am not, in principle, against pushing it right now.

Thanks,
Andrea

> -- 
> Michal Hocko
> SUSE Labs
> 
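
For context, the 0xFD convention discussed in this subthread can be modelled
in a few lines of userspace C. This is a sketch under assumptions, not the
x86 code: the page size is an assumed value and the sk_* helpers are
hypothetical. It shows the idea as the quoted commit message describes it:
the unused part of a vmemmap page is filled with the marker byte, and the
page is released only once every byte carries the marker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SK_PAGE_SIZE	4096u	/* assumed page size for the sketch */
#define SK_UNUSED_FILL	0xFD	/* marker byte named in the commit message */

/* Mark [off, off + len) of a vmemmap page as no longer in use. */
static void sk_mark_unused(unsigned char *page, size_t off, size_t len)
{
	assert(off <= SK_PAGE_SIZE && len <= SK_PAGE_SIZE - off);
	memset(page + off, SK_UNUSED_FILL, len);
}

/* The page may be freed only once every byte carries the marker. */
static bool sk_page_fully_unused(const unsigned char *page)
{
	for (size_t i = 0; i < SK_PAGE_SIZE; i++) {
		if (page[i] != SK_UNUSED_FILL)
			return false;
	}
	return true;
}
```

The state lives in the page contents themselves, which is the reliance on a
magic constant that the memblock flag is meant to replace.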

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 3/5] mm: memory_hotplug: memblock to track partially removed vmemmap mem
@ 2017-12-04 12:42           ` Andrea Reale
  0 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-12-04 12:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-arm-kernel, linux-kernel, linux-mm, m.bielski, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	catalin.marinas, realean2

On Mon  4 Dec 2017, 13:32, Michal Hocko wrote:
> On Mon 04-12-17 11:49:09, Andrea Reale wrote:
> > On Thu 30 Nov 2017, 15:51, Michal Hocko wrote:
> > > On Thu 23-11-17 11:14:38, Andrea Reale wrote:
> > > > When hot-removing memory we need to free vmemmap memory.
> > > > However, depending on the memory is being removed, it might
> > > > not be always possible to free a full vmemmap page / huge-page
> > > > because part of it might still be used.
> > > > 
> > > > Commit ae9aae9eda2d ("memory-hotplug: common APIs to support page tables
> > > > hot-remove") introduced a workaround for x86
> > > > hot-remove, by which partially unused areas are filled with
> > > > the 0xFD constant. Full pages are only removed when fully
> > > > filled by 0xFDs.
> > > > 
> > > > This commit introduces a MEMBLOCK_UNUSED_VMEMMAP memblock flag, with
> > > > the goal of using it in place of 0xFDs. For now, this will be used for
> > > > the arm64 port of memory hot remove, but the idea is to eventually use
> > > > the same mechanism for x86 as well.
> > > 
> > > Why cannot you use the same approach as x86 have? Have a look at the
> > > vmemmap_free at al.
> > > 
> > 
> > This arm64 hot-remove version (including vmemmap_free) is indeed an
> > almost 1-to-1 port of the x86 approach. 
> > 
> > If you look at the first version of the patchset we submitted a while 
> > ago (https://lkml.org/lkml/2017/4/11/540), we were initially using the
> > x86 approach of filling unsued page structs with 0xFDs. Commenting on
> > that, Mark suggested (and, indeed, I agree with him) that relying on a
> > magic constant for marking some portions of physical memory was quite
> > ugly. That is why we have used memblock for the purpose in this revised
> > patchset.
> > 
> > If you have a different view and any concrete suggestion on how to
> > improve this, it is definitely very well welcome. 
> 
> I would really prefer if those archictectues shared the code (and
> concept) as much as possible. It is really a PITA to wrap your head
> around each architectures for reasons which are not inherent to that
> specific architecture. If you find the way how x86 is implemented ugly,
> then all right, but making arm64 special just for the matter of taste is
> far from ideal IMHO.

The plan is indeed to use this memblock flag in x86 hot remove as well,
in place of the 0xFDs. The change is quite straightforward and we could
push it in a next patchset release. Our rationale was to first use it in
the new architecture and then, once proven stable, back port it to x86.

However, I am not in principle against of pushing it right now.

Thanks,
Andrea

> -- 
> Michal Hocko
> SUSE Labs
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 156+ messages in thread

* [PATCH v2 3/5] mm: memory_hotplug: memblock to track partially removed vmemmap mem
@ 2017-12-04 12:42           ` Andrea Reale
  0 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-12-04 12:42 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon  4 Dec 2017, 13:32, Michal Hocko wrote:
> On Mon 04-12-17 11:49:09, Andrea Reale wrote:
> > On Thu 30 Nov 2017, 15:51, Michal Hocko wrote:
> > > On Thu 23-11-17 11:14:38, Andrea Reale wrote:
> > > > When hot-removing memory we need to free vmemmap memory.
> > > > However, depending on the memory is being removed, it might
> > > > not be always possible to free a full vmemmap page / huge-page
> > > > because part of it might still be used.
> > > > 
> > > > Commit ae9aae9eda2d ("memory-hotplug: common APIs to support page tables
> > > > hot-remove") introduced a workaround for x86
> > > > hot-remove, by which partially unused areas are filled with
> > > > the 0xFD constant. Full pages are only removed when fully
> > > > filled by 0xFDs.
> > > > 
> > > > This commit introduces a MEMBLOCK_UNUSED_VMEMMAP memblock flag, with
> > > > the goal of using it in place of 0xFDs. For now, this will be used for
> > > > the arm64 port of memory hot remove, but the idea is to eventually use
> > > > the same mechanism for x86 as well.
> > > 
> > > Why cannot you use the same approach as x86 have? Have a look at the
> > > vmemmap_free at al.
> > > 
> > 
> > This arm64 hot-remove version (including vmemmap_free) is indeed an
> > almost 1-to-1 port of the x86 approach. 
> > 
> > If you look at the first version of the patchset we submitted a while 
> > ago (https://lkml.org/lkml/2017/4/11/540), we were initially using the
> > x86 approach of filling unsued page structs with 0xFDs. Commenting on
> > that, Mark suggested (and, indeed, I agree with him) that relying on a
> > magic constant for marking some portions of physical memory was quite
> > ugly. That is why we have used memblock for the purpose in this revised
> > patchset.
> > 
> > If you have a different view and any concrete suggestion on how to
> > improve this, it is definitely very welcome.
> 
> I would really prefer if those architectures shared the code (and
> concept) as much as possible. It is really a PITA to wrap your head
> around each architecture for reasons which are not inherent to that
> specific architecture. If you find the way x86 is implemented ugly,
> then all right, but making arm64 special just as a matter of taste is
> far from ideal IMHO.

The plan is indeed to use this memblock flag in x86 hot remove as well,
in place of the 0xFDs. The change is quite straightforward and we could
push it in the next patchset release. Our rationale was to first use it in
the new architecture and then, once proven stable, backport it to x86.

However, I am not in principle against pushing it right now.
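
For readers following the 0xFD discussion, the marker scheme described in
the quoted commit message can be illustrated in user space. This is only a
sketch under stated assumptions: the 4 KiB page size and every sketch_* name
are hypothetical, and it is not the x86 or arm64 kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE  4096u	/* assumed page size, for illustration only */
#define SKETCH_PAGE_INUSE 0xFD	/* the magic filler byte discussed above */

/* Mark [offset, offset + len) of a vmemmap page as no longer used. */
static void sketch_mark_unused(unsigned char *page, size_t offset, size_t len)
{
	assert(offset + len <= SKETCH_PAGE_SIZE);
	memset(page + offset, SKETCH_PAGE_INUSE, len);
}

/* The page may be freed only once every byte carries the marker. */
static bool sketch_page_fully_unused(const unsigned char *page)
{
	for (size_t i = 0; i < SKETCH_PAGE_SIZE; i++)
		if (page[i] != SKETCH_PAGE_INUSE)
			return false;
	return true;
}
```

With a memblock flag, the same "is this page still partially used?"
bookkeeping would presumably live in region metadata rather than in the
page contents themselves.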

Thanks,
Andrea

> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 4/5] mm: memory_hotplug: Add memory hotremove probe device
  2017-12-04 12:33         ` Michal Hocko
  (?)
@ 2017-12-04 12:44           ` Andrea Reale
  -1 siblings, 0 replies; 156+ messages in thread
From: Andrea Reale @ 2017-12-04 12:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-arm-kernel, linux-kernel, linux-mm, m.bielski, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	catalin.marinas, realean2

On Mon  4 Dec 2017, 13:33, Michal Hocko wrote:
> On Mon 04-12-17 11:51:29, Andrea Reale wrote:
> > On Thu 30 Nov 2017, 15:49, Michal Hocko wrote:
> > > On Thu 23-11-17 11:14:52, Andrea Reale wrote:
> > > > Adding a "remove" sysfs handle that can be used to trigger
> > > > memory hotremove manually, exactly symmetrically with
> > > > what happens with the "probe" device for hot-add.
> > > > 
> > > > This is useful for architectures that do not rely on
> > > > ACPI for memory hot-remove.
> > > 
> > > As already said elsewhere, this really has to check the online status of
> > > the range and fail if some of it is still online.
> > > 
> > 
> > This is actually still done in remove_memory() (patch 2/5) with
> > walk_memory_range. We just return an error rather than BUGing().
> > 
> > Or are you referring to something else?
> 
> But you are not returning that error to the caller, are you?
> 
> [...]

Oh, I see your point. Yes, indeed we should have returned it. Thanks for
catching the issue.

> > > > +	nid = memory_add_physaddr_to_nid(phys_addr);
> > > > +	ret = lock_device_hotplug_sysfs();
> > > > +	if (ret)
> > > > +		return ret;
> > > > +
> > > > +	remove_memory(nid, phys_addr,
> > > > +			 MIN_MEMORY_BLOCK_SIZE * sections_per_block);
> > > > +	unlock_device_hotplug();
> > > > +	return count;
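
A minimal sketch of how the store handler could hand the error back to the
writer, assuming remove_memory() returns an int as patch 2/5 intends. The
sketch_* stubs and the -EBUSY value are hypothetical stand-ins so that the
snippet compiles in user space; they are not the kernel helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel helpers named in the quoted hunk. */
static int sketch_block_online;	/* non-zero: the range is still online */
static int sketch_lock_depth;	/* tracks lock/unlock pairing */

static int sketch_lock_device_hotplug_sysfs(void)
{
	sketch_lock_depth++;
	return 0;
}

static void sketch_unlock_device_hotplug(void)
{
	assert(sketch_lock_depth > 0);
	sketch_lock_depth--;
}

/* Assumed to return an error instead of BUG()ing when memory is online. */
static int sketch_remove_memory(int nid, uint64_t start, uint64_t size)
{
	(void)nid; (void)start; (void)size;
	return sketch_block_online ? -EBUSY : 0;
}

/* Store handler: report the failure to the writer instead of dropping it. */
static long sketch_remove_store(uint64_t phys_addr, uint64_t size, size_t count)
{
	int ret = sketch_lock_device_hotplug_sysfs();

	if (ret)
		return ret;

	ret = sketch_remove_memory(0, phys_addr, size);
	sketch_unlock_device_hotplug();

	/* Only claim the bytes were consumed when the removal succeeded. */
	return ret ? ret : (long)count;
}
```

The lock is released on both paths before the result is returned, so a
failed removal does not leave the hotplug lock held.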

Thanks,
Andrea
> 
> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 3/5] mm: memory_hotplug: memblock to track partially removed vmemmap mem
  2017-12-04 12:42           ` Andrea Reale
  (?)
@ 2017-12-04 12:48             ` Michal Hocko
  -1 siblings, 0 replies; 156+ messages in thread
From: Michal Hocko @ 2017-12-04 12:48 UTC (permalink / raw)
  To: Andrea Reale
  Cc: linux-arm-kernel, linux-kernel, linux-mm, m.bielski, arunks,
	mark.rutland, scott.branden, will.deacon, qiuxishi,
	catalin.marinas, realean2

On Mon 04-12-17 12:42:31, Andrea Reale wrote:
> On Mon  4 Dec 2017, 13:32, Michal Hocko wrote:
> > On Mon 04-12-17 11:49:09, Andrea Reale wrote:
> > > On Thu 30 Nov 2017, 15:51, Michal Hocko wrote:
> > > > On Thu 23-11-17 11:14:38, Andrea Reale wrote:
> > > > > When hot-removing memory we need to free vmemmap memory.
> > > > > However, depending on the memory being removed, it might
> > > > > not always be possible to free a full vmemmap page / huge-page
> > > > > because part of it might still be used.
> > > > > 
> > > > > Commit ae9aae9eda2d ("memory-hotplug: common APIs to support page tables
> > > > > hot-remove") introduced a workaround for x86
> > > > > hot-remove, by which partially unused areas are filled with
> > > > > the 0xFD constant. Full pages are only removed when fully
> > > > > filled by 0xFDs.
> > > > > 
> > > > > This commit introduces a MEMBLOCK_UNUSED_VMEMMAP memblock flag, with
> > > > > the goal of using it in place of 0xFDs. For now, this will be used for
> > > > > the arm64 port of memory hot remove, but the idea is to eventually use
> > > > > the same mechanism for x86 as well.
> > > > 
> > > > Why can't you use the same approach as x86 has? Have a look at
> > > > vmemmap_free et al.
> > > > 
> > > 
> > > This arm64 hot-remove version (including vmemmap_free) is indeed an
> > > almost 1-to-1 port of the x86 approach. 
> > > 
> > > If you look at the first version of the patchset we submitted a while 
> > > ago (https://lkml.org/lkml/2017/4/11/540), we were initially using the
> > > x86 approach of filling unused page structs with 0xFDs. Commenting on
> > > that, Mark suggested (and, indeed, I agree with him) that relying on a
> > > magic constant for marking some portions of physical memory was quite
> > > ugly. That is why we have used memblock for the purpose in this revised
> > > patchset.
> > > 
> > > If you have a different view and any concrete suggestion on how to
> > > improve this, it is definitely very welcome.
> > 
> > I would really prefer if those architectures shared the code (and
> > concept) as much as possible. It is really a PITA to wrap your head
> > around each architecture for reasons which are not inherent to that
> > specific architecture. If you find the way x86 is implemented ugly,
> > then all right, but making arm64 special just as a matter of taste is
> > far from ideal IMHO.
> 
> The plan is indeed to use this memblock flag in x86 hot remove as well,
> in place of the 0xFDs. The change is quite straightforward and we could
> push it in the next patchset release. Our rationale was to first use it in
> the new architecture and then, once proven stable, backport it to x86.
> 
> However, I am not in principle against pushing it right now.

So please start with a simpler (cleanup) patch for x86. It will make
life so much easier.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 2/5] mm: memory_hotplug: Remove assumption on memory state before hotremove
  2017-12-04 11:28         ` Andrea Reale
  (?)
@ 2017-12-04 14:05           ` Rafael J. Wysocki
  -1 siblings, 0 replies; 156+ messages in thread
From: Rafael J. Wysocki @ 2017-12-04 14:05 UTC (permalink / raw)
  To: Andrea Reale
  Cc: joeyli, linux-arm-kernel, Linux Kernel Mailing List,
	Linux Memory Management List, m.bielski, arunks, Mark Rutland,
	scott.branden, Will Deacon, qiuxishi, Catalin Marinas,
	Michal Hocko, Rafael Wysocki, ACPI Devel Maling List

On Mon, Dec 4, 2017 at 12:28 PM, Andrea Reale <ar@linux.vnet.ibm.com> wrote:
> Hi Joey,
>
> and thanks for your comments. Response inline:
>

[cut]

>>
>> So, the BUG() is useful to capture state issues in the memory subsystem. But I
>> understand your concern about the two-step offline/remove from userland.
>>
>> Maybe we should move the BUG() somewhere else rather than just remove it. Or, if
>> we think that the BUG() is too intense, at least we should print an error
>> message, and ACPI should check the return value from the subsystem to
>> interrupt the memory-hotplug process.
>
> In this patchset, BUG() is moved to acpi_memory_remove_memory(),
> the caller of arch_remove_memory(). However, I agree with Michal, that
> we should not BUG() here but rather halt the hotremove process and print
> some errors.
> Is there any state in ACPI that should be undone in case of hotremove
> errors, or can we just stop the process "halfway"?

I have to recall a couple of things before answering this question, so
that may take some time.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH v2 4/5] mm: memory_hotplug: Add memory hotremove probe device
  2017-11-24 14:29           ` Andrea Reale
  (?)
@ 2017-12-04 17:50             ` Reza Arbab
  -1 siblings, 0 replies; 156+ messages in thread
From: Reza Arbab @ 2017-12-04 17:50 UTC (permalink / raw)
  To: Andrea Reale
  Cc: zhong jiang, linux-arm-kernel, linux-kernel, linux-mm, m.bielski,
	arunks, mark.rutland, scott.branden, will.deacon, qiuxishi,
	catalin.marinas, mhocko, realean2

On Fri, Nov 24, 2017 at 02:29:48PM +0000, Andrea Reale wrote:
>But, at least in my understanding, the implementation is not as
>straightforward as it looks. If I declare a memory node in the fdt, then,
>at boot, the kernel will expect that memory to actually be there to be
>used: this is not true if I want to plug my dimms only later at runtime.
>So I think that declaring the hotpluggable memory in an fdt memory
>node might not be feasible without changes.

On the power arch, we do this today using "linux,usable-memory".

memory@10000000000 {
  device_type = "memory";
  reg = <0x100 0x0 0x0 0x80000000>;
  linux,usable-memory = <0x100 0x0 0x0 0x40000000>;
  :
}

The reg range defines the node, but at boot, memblocks are only
created for the linux,usable-memory range. The rest can be hotplugged 
later. YMMV, because this depends on your arch's implementation of 
memory_add_physaddr_to_nid().
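
As a worked example of the arithmetic above, the range left for later
hotplug is the part of reg that linux,usable-memory does not cover. The
sketch assumes two address cells and two size cells, as the values in the
node suggest, and that both ranges start at the same base; the sketch_*
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One (address, size) pair, each encoded as two 32-bit cells (assumed). */
struct sketch_range {
	uint64_t base;
	uint64_t size;
};

/* Combine <hi lo hi lo> cells into a 64-bit base and a 64-bit size. */
static struct sketch_range sketch_decode(const uint32_t cells[4])
{
	struct sketch_range r = {
		.base = ((uint64_t)cells[0] << 32) | cells[1],
		.size = ((uint64_t)cells[2] << 32) | cells[3],
	};
	return r;
}

/* Part of "reg" not covered by "linux,usable-memory" (shared base assumed). */
static struct sketch_range sketch_hotplug_remainder(struct sketch_range reg,
						    struct sketch_range usable)
{
	assert(usable.base == reg.base && usable.size <= reg.size);

	struct sketch_range rest = {
		.base = reg.base + usable.size,
		.size = reg.size - usable.size,
	};
	return rest;
}
```

For the values in the node above this gives base 0x10040000000 and size
0x40000000 (1 GiB).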

>One idea could be to add a new property to memory nodes, to specify
>what memory is potentially hotpluggable.

Somewhat related, there is already a "hotpluggable" property.

memory@10040000000 {
  device_type = "memory";
  reg = <0x100 0x40000000 0x0 0x40000000>;
  hotpluggable;
  :
}

This is subtly different from the earlier example. This memory IS 
present at boot. The hotpluggable property ensures that it resides in 
ZONE_MOVABLE so it can potentially be removed.

-- 
Reza Arbab

^ permalink raw reply	[flat|nested] 156+ messages in thread

end of thread, other threads:[~2017-12-04 17:50 UTC | newest]

Thread overview: 156+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-23 11:13 [PATCH v2 0/5] Memory hotplug support for arm64 - complete patchset v2 Andrea Reale
2017-11-23 11:13 ` [PATCH v2 1/5] mm: memory_hotplug: Memory hotplug (add) support for arm64 Maciej Bielski
2017-11-24  5:55   ` Arun KS
2017-11-24  9:42     ` Andrea Reale
2017-11-24 10:53       ` Maciej Bielski
2017-11-26  6:58         ` Arun KS
2017-11-27 15:19   ` Robin Murphy
2017-11-27 16:39     ` Maciej Bielski
2017-11-27 17:11       ` Andrea Reale
2017-11-23 11:14 ` [PATCH v2 2/5] mm: memory_hotplug: Remove assumption on memory state before hotremove Andrea Reale
2017-11-23 22:18   ` Rafael J. Wysocki
2017-11-24 14:39   ` Rafael J. Wysocki
2017-11-24 14:49     ` Andrea Reale
2017-11-24 15:43       ` Michal Hocko
2017-11-24 15:54         ` Andrea Reale
2017-11-24 18:17           ` Michal Hocko
2017-11-29  1:20             ` joeyli
2017-11-30  9:47               ` Michal Hocko
2017-11-27 15:20           ` Robin Murphy
2017-11-27 17:44             ` Andrea Reale
2017-11-29  0:49   ` joeyli
2017-11-29  1:52     ` joeyli
2017-12-04 11:28       ` Andrea Reale
2017-12-04 14:05         ` Rafael J. Wysocki
2017-11-23 11:14 ` [PATCH v2 3/5] mm: memory_hotplug: memblock to track partially removed vmemmap mem Andrea Reale
2017-11-27 15:20   ` Robin Murphy
2017-11-27 17:38     ` Andrea Reale
2017-11-30 14:51   ` Michal Hocko
2017-12-04 11:49     ` Andrea Reale
2017-12-04 12:32       ` Michal Hocko
2017-12-04 12:42         ` Andrea Reale
2017-12-04 12:48           ` Michal Hocko
2017-11-23 11:14 ` [PATCH v2 4/5] mm: memory_hotplug: Add memory hotremove probe device Andrea Reale
2017-11-24 10:35   ` zhong jiang
2017-11-24 10:44     ` Andrea Reale
2017-11-24 12:17       ` zhong jiang
2017-11-24 14:29         ` Andrea Reale
2017-12-04 17:50           ` Reza Arbab
2017-11-27 15:33   ` Robin Murphy
2017-11-27 17:14     ` Andrea Reale
2017-11-30 14:49   ` Michal Hocko
2017-12-04 11:51     ` Andrea Reale
2017-12-04 12:33       ` Michal Hocko
2017-12-04 12:44         ` Andrea Reale
2017-11-23 11:15 ` [PATCH v2 5/5] mm: memory-hotplug: Add memory hot remove support for arm64 Andrea Reale
2017-11-23 16:02 ` [PATCH v2 0/5] Memory hotplug support for arm64 - complete patchset v2 Michal Hocko
2017-11-23 17:33   ` Andrea Reale
2017-11-30 14:57     ` Michal Hocko
2017-12-04 11:34       ` Andrea Reale
2017-11-24 10:22 ` [PATCH v2 2/5] mm: memory_hotplug: Remove assumption on memory state before hotremove Andrea Reale
