* [PATCH 0/6] arm64/mm: Enable memory hot remove and ZONE_DEVICE
@ 2019-04-03  4:30 Anshuman Khandual
  2019-04-03  4:30 ` [PATCH 1/6] arm64/mm: Enable sysfs based memory hot add interface Anshuman Khandual
                   ` (7 more replies)
  0 siblings, 8 replies; 43+ messages in thread
From: Anshuman Khandual @ 2019-04-03  4:30 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-mm, akpm, will.deacon,
	catalin.marinas
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, logang,
	pasha.tatashin, david, cai

This series enables memory hot remove on arm64, fixes a memblock removal
ordering problem in the generic __remove_memory(), and enables the sysfs
memory probe interface on arm64. It also enables ZONE_DEVICE with struct
vmem_altmap support.

Testing:

Tested hot remove on arm64 with the 4K, 16K and 64K page size configs
across all possible VA_BITS and PGTABLE_LEVELS combinations. Tested
ZONE_DEVICE with ARM64_4K_PAGES through a dummy driver.

Build tested on non-arm64 platforms. I would appreciate it if folks could
test the arch_remove_memory() re-ordering in __remove_memory() on other
platforms.

Dependency:

The V5 series in the thread (https://lkml.org/lkml/2019/2/14/1096) will
make the kernel linear mapping lose its pgtable_page_ctor() init. When
that happens, the proposed functions free_pte|pmd|pud_table() in
[PATCH 2/6] will have to stop calling pgtable_page_dtor(), as sketched
below.
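
That is, the direct-map destructor calls in those helpers, of the form
(snippet taken from [PATCH 2/6]):

	if (direct)
		pgtable_page_dtor(pmd_page(*pmd));

would simply be dropped.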

Anshuman Khandual (5):
  arm64/mm: Enable sysfs based memory hot add interface
  arm64/mm: Enable memory hot remove
  arm64/mm: Enable struct page allocation from device memory
  mm/hotplug: Reorder arch_remove_memory() call in __remove_memory()
  arm64/mm: Enable ZONE_DEVICE

Robin Murphy (1):
  mm/memremap: Rename and consolidate SECTION_SIZE

 arch/arm64/Kconfig               |  13 +++
 arch/arm64/include/asm/pgtable.h |  14 +++
 arch/arm64/mm/mmu.c              | 242 ++++++++++++++++++++++++++++++++++++++-
 include/linux/mmzone.h           |   1 +
 kernel/memremap.c                |  10 +-
 mm/hmm.c                         |   2 -
 mm/memory_hotplug.c              |   3 +-
 7 files changed, 271 insertions(+), 14 deletions(-)

-- 
2.7.4



* [PATCH 1/6] arm64/mm: Enable sysfs based memory hot add interface
  2019-04-03  4:30 [PATCH 0/6] arm64/mm: Enable memory hot remove and ZONE_DEVICE Anshuman Khandual
@ 2019-04-03  4:30 ` Anshuman Khandual
  2019-04-03  8:20   ` David Hildenbrand
  2019-04-03  4:30 ` [PATCH 2/6] arm64/mm: Enable memory hot remove Anshuman Khandual
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 43+ messages in thread
From: Anshuman Khandual @ 2019-04-03  4:30 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-mm, akpm, will.deacon,
	catalin.marinas
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, logang,
	pasha.tatashin, david, cai

The sysfs memory probe interface (/sys/devices/system/memory/probe) accepts
the starting physical address of an entire memory block to be hot added
into the kernel. This is in addition to the existing ACPI based interface.
This patch just enables it with the required config CONFIG_ARCH_MEMORY_PROBE.
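
For example, assuming a 1GB memory block size, a block starting at physical
address 0x680000000 (both values purely illustrative) can be hot added with:

	# echo 0x680000000 > /sys/devices/system/memory/probe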

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/Kconfig | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7e34b9e..a2418fb 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -266,6 +266,15 @@ config HAVE_GENERIC_GUP
 config ARCH_ENABLE_MEMORY_HOTPLUG
 	def_bool y
 
+config ARCH_MEMORY_PROBE
+	bool "Enable /sys/devices/system/memory/probe interface"
+	depends on MEMORY_HOTPLUG
+	help
+	  This option enables a sysfs /sys/devices/system/memory/probe
+	  interface for testing. See Documentation/memory-hotplug.txt
+	  for more information. If you are unsure how to answer this
+	  question, answer N.
+
 config SMP
 	def_bool y
 
-- 
2.7.4



* [PATCH 2/6] arm64/mm: Enable memory hot remove
  2019-04-03  4:30 [PATCH 0/6] arm64/mm: Enable memory hot remove and ZONE_DEVICE Anshuman Khandual
  2019-04-03  4:30 ` [PATCH 1/6] arm64/mm: Enable sysfs based memory hot add interface Anshuman Khandual
@ 2019-04-03  4:30 ` Anshuman Khandual
  2019-04-03 12:37   ` Robin Murphy
  2019-04-03 17:32   ` Logan Gunthorpe
  2019-04-03  4:30 ` [PATCH 3/6] arm64/mm: Enable struct page allocation from device memory Anshuman Khandual
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 43+ messages in thread
From: Anshuman Khandual @ 2019-04-03  4:30 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-mm, akpm, will.deacon,
	catalin.marinas
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, logang,
	pasha.tatashin, david, cai

From an arch perspective, memory removal involves tearing down two different
kernel based mappings, i.e. vmemmap and linear, while releasing the related
page table pages allocated for the physical memory range being removed.

Define a common kernel page table tear down helper remove_pagetable() which
can be used to unmap a given kernel virtual address range. In effect it can
tear down both vmemmap and kernel linear mappings. This new helper is called
from both vmemmap_free() and __remove_pgd_mapping() during memory removal.
The argument 'direct' identifies kernel linear mappings, as outlined below.
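
In outline, the two call paths into the helper in this patch are:

	vmemmap_free()          -> remove_pagetable(start, end, false)
	__remove_pgd_mapping()  -> remove_pagetable(start, start + size, true)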

Page table pages for vmemmap mappings are allocated through sparsemem
helper functions like vmemmap_alloc_block(), which do not cycle the pages
through pgtable_page_ctor(). Hence the removal path skips the corresponding
destructor pgtable_page_dtor() for them.

While here, update arch_add_memory() to handle __add_pages() failures by
just unmapping the recently added kernel linear mapping. Now enable memory
hot remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.

This implementation is overall inspired by the kernel page table tear down
procedure on the x86 architecture.

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/Kconfig               |   3 +
 arch/arm64/include/asm/pgtable.h |  14 +++
 arch/arm64/mm/mmu.c              | 227 ++++++++++++++++++++++++++++++++++++++-
 3 files changed, 241 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a2418fb..db3e625 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -266,6 +266,9 @@ config HAVE_GENERIC_GUP
 config ARCH_ENABLE_MEMORY_HOTPLUG
 	def_bool y
 
+config ARCH_ENABLE_MEMORY_HOTREMOVE
+	def_bool y
+
 config ARCH_MEMORY_PROBE
 	bool "Enable /sys/devices/system/memory/probe interface"
 	depends on MEMORY_HOTPLUG
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index de70c1e..858098e 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -355,6 +355,18 @@ static inline int pmd_protnone(pmd_t pmd)
 }
 #endif
 
+#if (CONFIG_PGTABLE_LEVELS > 2)
+#define pmd_large(pmd)	(pmd_val(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT))
+#else
+#define pmd_large(pmd) 0
+#endif
+
+#if (CONFIG_PGTABLE_LEVELS > 3)
+#define pud_large(pud)	(pud_val(pud) && !(pud_val(pud) & PUD_TABLE_BIT))
+#else
+#define pud_large(pud) 0
+#endif
+
 /*
  * THP definitions.
  */
@@ -555,6 +567,7 @@ static inline phys_addr_t pud_page_paddr(pud_t pud)
 
 #else
 
+#define pmd_index(addr) 0
 #define pud_page_paddr(pud)	({ BUILD_BUG(); 0; })
 
 /* Match pmd_offset folding in <asm/generic/pgtable-nopmd.h> */
@@ -612,6 +625,7 @@ static inline phys_addr_t pgd_page_paddr(pgd_t pgd)
 
 #else
 
+#define pud_index(addr)	0
 #define pgd_page_paddr(pgd)	({ BUILD_BUG(); 0;})
 
 /* Match pud_offset folding in <asm/generic/pgtable-nopud.h> */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index e97f018..ae0777b 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -714,6 +714,198 @@ int kern_addr_valid(unsigned long addr)
 
 	return pfn_valid(pte_pfn(pte));
 }
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void __meminit free_pagetable(struct page *page, int order)
+{
+	unsigned long magic;
+	unsigned int nr_pages = 1 << order;
+
+	if (PageReserved(page)) {
+		__ClearPageReserved(page);
+
+		magic = (unsigned long)page->freelist;
+		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
+			while (nr_pages--)
+				put_page_bootmem(page++);
+		} else
+			while (nr_pages--)
+				free_reserved_page(page++);
+	} else
+		free_pages((unsigned long)page_address(page), order);
+}
+
+#if (CONFIG_PGTABLE_LEVELS > 2)
+static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd, bool direct)
+{
+	pte_t *pte;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PTE; i++) {
+		pte = pte_start + i;
+		if (!pte_none(*pte))
+			return;
+	}
+
+	if (direct)
+		pgtable_page_dtor(pmd_page(*pmd));
+	free_pagetable(pmd_page(*pmd), 0);
+	spin_lock(&init_mm.page_table_lock);
+	pmd_clear(pmd);
+	spin_unlock(&init_mm.page_table_lock);
+}
+#else
+static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd, bool direct)
+{
+}
+#endif
+
+#if (CONFIG_PGTABLE_LEVELS > 3)
+static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud, bool direct)
+{
+	pmd_t *pmd;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		pmd = pmd_start + i;
+		if (!pmd_none(*pmd))
+			return;
+	}
+
+	if (direct)
+		pgtable_page_dtor(pud_page(*pud));
+	free_pagetable(pud_page(*pud), 0);
+	spin_lock(&init_mm.page_table_lock);
+	pud_clear(pud);
+	spin_unlock(&init_mm.page_table_lock);
+}
+
+static void __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd, bool direct)
+{
+	pud_t *pud;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		pud = pud_start + i;
+		if (!pud_none(*pud))
+			return;
+	}
+
+	if (direct)
+		pgtable_page_dtor(pgd_page(*pgd));
+	free_pagetable(pgd_page(*pgd), 0);
+	spin_lock(&init_mm.page_table_lock);
+	pgd_clear(pgd);
+	spin_unlock(&init_mm.page_table_lock);
+}
+#else
+static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud, bool direct)
+{
+}
+
+static void __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd, bool direct)
+{
+}
+#endif
+
+static void __meminit
+remove_pte_table(pte_t *pte_start, unsigned long addr,
+			unsigned long end, bool direct)
+{
+	pte_t *pte;
+
+	pte = pte_start + pte_index(addr);
+	for (; addr < end; addr += PAGE_SIZE, pte++) {
+		if (!pte_present(*pte))
+			continue;
+
+		if (!direct)
+			free_pagetable(pte_page(*pte), 0);
+		spin_lock(&init_mm.page_table_lock);
+		pte_clear(&init_mm, addr, pte);
+		spin_unlock(&init_mm.page_table_lock);
+	}
+}
+
+static void __meminit
+remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
+			unsigned long end, bool direct)
+{
+	unsigned long next;
+	pte_t *pte_base;
+	pmd_t *pmd;
+
+	pmd = pmd_start + pmd_index(addr);
+	for (; addr < end; addr = next, pmd++) {
+		next = pmd_addr_end(addr, end);
+		if (!pmd_present(*pmd))
+			continue;
+
+		if (pmd_large(*pmd)) {
+			if (!direct)
+				free_pagetable(pmd_page(*pmd),
+						get_order(PMD_SIZE));
+			spin_lock(&init_mm.page_table_lock);
+			pmd_clear(pmd);
+			spin_unlock(&init_mm.page_table_lock);
+			continue;
+		}
+		pte_base = pte_offset_kernel(pmd, 0UL);
+		remove_pte_table(pte_base, addr, next, direct);
+		free_pte_table(pte_base, pmd, direct);
+	}
+}
+
+static void __meminit
+remove_pud_table(pud_t *pud_start, unsigned long addr,
+			unsigned long end, bool direct)
+{
+	unsigned long next;
+	pmd_t *pmd_base;
+	pud_t *pud;
+
+	pud = pud_start + pud_index(addr);
+	for (; addr < end; addr = next, pud++) {
+		next = pud_addr_end(addr, end);
+		if (!pud_present(*pud))
+			continue;
+
+		if (pud_large(*pud)) {
+			if (!direct)
+				free_pagetable(pud_page(*pud),
+						get_order(PUD_SIZE));
+			spin_lock(&init_mm.page_table_lock);
+			pud_clear(pud);
+			spin_unlock(&init_mm.page_table_lock);
+			continue;
+		}
+		pmd_base = pmd_offset(pud, 0UL);
+		remove_pmd_table(pmd_base, addr, next, direct);
+		free_pmd_table(pmd_base, pud, direct);
+	}
+}
+
+static void __meminit
+remove_pagetable(unsigned long start, unsigned long end, bool direct)
+{
+	unsigned long addr, next;
+	pud_t *pud_base;
+	pgd_t *pgd;
+
+	for (addr = start; addr < end; addr = next) {
+		next = pgd_addr_end(addr, end);
+		pgd = pgd_offset_k(addr);
+		if (!pgd_present(*pgd))
+			continue;
+
+		pud_base = pud_offset(pgd, 0UL);
+		remove_pud_table(pud_base, addr, next, direct);
+		free_pud_table(pud_base, pgd, direct);
+	}
+	flush_tlb_kernel_range(start, end);
+}
+#endif
+
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 #if !ARM64_SWAPPER_USES_SECTION_MAPS
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
@@ -758,9 +950,12 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	return 0;
 }
 #endif	/* CONFIG_ARM64_64K_PAGES */
-void vmemmap_free(unsigned long start, unsigned long end,
+void __ref vmemmap_free(unsigned long start, unsigned long end,
 		struct vmem_altmap *altmap)
 {
+#ifdef CONFIG_MEMORY_HOTPLUG
+	remove_pagetable(start, end, false);
+#endif
 }
 #endif	/* CONFIG_SPARSEMEM_VMEMMAP */
 
@@ -1046,10 +1241,16 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
+static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
+{
+	WARN_ON(pgdir != init_mm.pgd);
+	remove_pagetable(start, start + size, true);
+}
+
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
 		    bool want_memblock)
 {
-	int flags = 0;
+	int flags = 0, ret = 0;
 
 	if (rodata_full || debug_pagealloc_enabled())
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
@@ -1057,7 +1258,27 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
 	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
 			     size, PAGE_KERNEL, pgd_pgtable_alloc, flags);
 
-	return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
+	ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
 			   altmap, want_memblock);
+	if (ret)
+		__remove_pgd_mapping(swapper_pg_dir,
+					__phys_to_virt(start), size);
+	return ret;
 }
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct zone *zone = page_zone(pfn_to_page(start_pfn));
+	int ret;
+
+	ret = __remove_pages(zone, start_pfn, nr_pages, altmap);
+	if (!ret)
+		__remove_pgd_mapping(swapper_pg_dir,
+					__phys_to_virt(start), size);
+	return ret;
+}
+#endif
 #endif
-- 
2.7.4



* [PATCH 3/6] arm64/mm: Enable struct page allocation from device memory
  2019-04-03  4:30 [PATCH 0/6] arm64/mm: Enable memory hot remove and ZONE_DEVICE Anshuman Khandual
  2019-04-03  4:30 ` [PATCH 1/6] arm64/mm: Enable sysfs based memory hot add interface Anshuman Khandual
  2019-04-03  4:30 ` [PATCH 2/6] arm64/mm: Enable memory hot remove Anshuman Khandual
@ 2019-04-03  4:30 ` Anshuman Khandual
  2019-04-03  4:30 ` [PATCH 4/6] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory() Anshuman Khandual
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 43+ messages in thread
From: Anshuman Khandual @ 2019-04-03  4:30 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-mm, akpm, will.deacon,
	catalin.marinas
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, logang,
	pasha.tatashin, david, cai

ZONE_DEVICE based device memory like persistent memory would typically be
larger than the available system RAM and can run into terabytes. Allocating
struct pages from system RAM for such a vast range of device memory would
reduce the amount of system RAM available for other purposes. struct
vmem_altmap provides a mechanism to reserve a range of the device memory
itself for its own struct pages.

On arm64 platforms this enables vmemmap_populate() & vmemmap_free(), which
create & destroy the struct page mapping, to accommodate a given instance
of struct vmem_altmap, as sketched below.
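
As an aside (not part of this patch), a rough sketch of how a driver might
request altmap backed struct pages, using the v5.1-era dev_pagemap and
vmem_altmap fields; the function name, the 64MB memmap carve-out and the
omitted pgmap->type/page callbacks are illustrative assumptions, not taken
from this series:

static void *dummy_remap_device(struct device *dev, struct resource *res,
				struct percpu_ref *ref,
				void (*kill)(struct percpu_ref *),
				struct dev_pagemap *pgmap)
{
	/* Let the first 64MB of the device range back its own memmap */
	struct vmem_altmap __altmap = {
		.base_pfn = PHYS_PFN(res->start),
		.free	  = PHYS_PFN(SZ_64M),
	};

	pgmap->res = *res;
	pgmap->ref = ref;
	pgmap->kill = kill;	/* mandatory: shuts down @ref on release */
	memcpy(&pgmap->altmap, &__altmap, sizeof(__altmap));
	pgmap->altmap_valid = true;

	/*
	 * vmemmap_populate() can now allocate the range's struct pages
	 * via altmap_alloc_block_buf() from the device memory itself.
	 */
	return devm_memremap_pages(dev, pgmap);
}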

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/mm/mmu.c | 41 +++++++++++++++++++++++++++--------------
 1 file changed, 27 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index ae0777b..4b25b75 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -735,6 +735,15 @@ static void __meminit free_pagetable(struct page *page, int order)
 		free_pages((unsigned long)page_address(page), order);
 }
 
+static void __meminit free_huge_pagetable(struct page *page, int order,
+						struct vmem_altmap *altmap)
+{
+	if (altmap)
+		vmem_altmap_free(altmap, (1UL << order));
+	else
+		free_pagetable(page, order);
+}
+
 #if (CONFIG_PGTABLE_LEVELS > 2)
 static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd, bool direct)
 {
@@ -828,8 +837,8 @@ remove_pte_table(pte_t *pte_start, unsigned long addr,
 }
 
 static void __meminit
-remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
-			unsigned long end, bool direct)
+remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
+			bool direct, struct vmem_altmap *altmap)
 {
 	unsigned long next;
 	pte_t *pte_base;
@@ -843,8 +852,8 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
 
 		if (pmd_large(*pmd)) {
 			if (!direct)
-				free_pagetable(pmd_page(*pmd),
-						get_order(PMD_SIZE));
+				free_huge_pagetable(pmd_page(*pmd),
+						get_order(PMD_SIZE), altmap);
 			spin_lock(&init_mm.page_table_lock);
 			pmd_clear(pmd);
 			spin_unlock(&init_mm.page_table_lock);
@@ -857,8 +866,8 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
 }
 
 static void __meminit
-remove_pud_table(pud_t *pud_start, unsigned long addr,
-			unsigned long end, bool direct)
+remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
+			bool direct, struct vmem_altmap *altmap)
 {
 	unsigned long next;
 	pmd_t *pmd_base;
@@ -872,21 +881,22 @@ remove_pud_table(pud_t *pud_start, unsigned long addr,
 
 		if (pud_large(*pud)) {
 			if (!direct)
-				free_pagetable(pud_page(*pud),
-						get_order(PUD_SIZE));
+				free_huge_pagetable(pud_page(*pud),
+						get_order(PUD_SIZE), altmap);
 			spin_lock(&init_mm.page_table_lock);
 			pud_clear(pud);
 			spin_unlock(&init_mm.page_table_lock);
 			continue;
 		}
 		pmd_base = pmd_offset(pud, 0UL);
-		remove_pmd_table(pmd_base, addr, next, direct);
+		remove_pmd_table(pmd_base, addr, next, direct, altmap);
 		free_pmd_table(pmd_base, pud, direct);
 	}
 }
 
 static void __meminit
-remove_pagetable(unsigned long start, unsigned long end, bool direct)
+remove_pagetable(unsigned long start, unsigned long end,
+			bool direct, struct vmem_altmap *altmap)
 {
 	unsigned long addr, next;
 	pud_t *pud_base;
@@ -899,7 +909,7 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
 			continue;
 
 		pud_base = pud_offset(pgd, 0UL);
-		remove_pud_table(pud_base, addr, next, direct);
+		remove_pud_table(pud_base, addr, next, direct, altmap);
 		free_pud_table(pud_base, pgd, direct);
 	}
 	flush_tlb_kernel_range(start, end);
@@ -938,7 +948,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 		if (pmd_none(READ_ONCE(*pmdp))) {
 			void *p = NULL;
 
-			p = vmemmap_alloc_block_buf(PMD_SIZE, node);
+			if (altmap)
+				p = altmap_alloc_block_buf(PMD_SIZE, altmap);
+			else
+				p = vmemmap_alloc_block_buf(PMD_SIZE, node);
 			if (!p)
 				return -ENOMEM;
 
@@ -954,7 +967,7 @@ void __ref vmemmap_free(unsigned long start, unsigned long end,
 		struct vmem_altmap *altmap)
 {
 #ifdef CONFIG_MEMORY_HOTPLUG
-	remove_pagetable(start, end, false);
+	remove_pagetable(start, end, false, altmap);
 #endif
 }
 #endif	/* CONFIG_SPARSEMEM_VMEMMAP */
@@ -1244,7 +1257,7 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
 static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
 {
 	WARN_ON(pgdir != init_mm.pgd);
-	remove_pagetable(start, start + size, true);
+	remove_pagetable(start, start + size, true, NULL);
 }
 
 int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
-- 
2.7.4



* [PATCH 4/6] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory()
  2019-04-03  4:30 [PATCH 0/6] arm64/mm: Enable memory hot remove and ZONE_DEVICE Anshuman Khandual
                   ` (2 preceding siblings ...)
  2019-04-03  4:30 ` [PATCH 3/6] arm64/mm: Enable struct page allocation from device memory Anshuman Khandual
@ 2019-04-03  4:30 ` Anshuman Khandual
  2019-04-03  8:45   ` Oscar Salvador
                     ` (2 more replies)
  2019-04-03  4:30 ` [PATCH 5/6] mm/memremap: Rename and consolidate SECTION_SIZE Anshuman Khandual
                   ` (3 subsequent siblings)
  7 siblings, 3 replies; 43+ messages in thread
From: Anshuman Khandual @ 2019-04-03  4:30 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-mm, akpm, will.deacon,
	catalin.marinas
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, logang,
	pasha.tatashin, david, cai

Memory hot remove uses get_nid_for_pfn() while tearing down the linked sysfs
entries between a memory block and its node. It first checks pfn validity
with pfn_valid_within() before fetching the nid. With CONFIG_HOLES_IN_ZONE
enabled (as it is on arm64), pfn_valid_within() calls pfn_valid().

On arm64, pfn_valid() is an arch specific implementation (selected through
CONFIG_HAVE_ARCH_PFN_VALID) which scans all mapped memblock regions with
memblock_is_map_memory(). This creates a problem in the memory hot remove
path, which has already removed the given memory range from memblock with
memblock_[remove|free]() before arriving at unregister_mem_sect_under_nodes().
Hence get_nid_for_pfn() returns -1, skipping the subsequent
sysfs_remove_link() calls and leaving the node <-> memory block sysfs
entries as is. A subsequent memory add operation then hits BUG_ON() because
of the existing sysfs entries.

[   62.007176] NUMA: Unknown node for memory at 0x680000000, assuming node 0
[   62.052517] ------------[ cut here ]------------
[   62.053211] kernel BUG at mm/memory_hotplug.c:1143!
[   62.053868] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[   62.054589] Modules linked in:
[   62.054999] CPU: 19 PID: 3275 Comm: bash Not tainted 5.1.0-rc2-00004-g28cea40b2683 #41
[   62.056274] Hardware name: linux,dummy-virt (DT)
[   62.057166] pstate: 40400005 (nZcv daif +PAN -UAO)
[   62.058083] pc : add_memory_resource+0x1cc/0x1d8
[   62.058961] lr : add_memory_resource+0x10c/0x1d8
[   62.059842] sp : ffff0000168b3ce0
[   62.060477] x29: ffff0000168b3ce0 x28: ffff8005db546c00
[   62.061501] x27: 0000000000000000 x26: 0000000000000000
[   62.062509] x25: ffff0000111ef000 x24: ffff0000111ef5d0
[   62.063520] x23: 0000000000000000 x22: 00000006bfffffff
[   62.064540] x21: 00000000ffffffef x20: 00000000006c0000
[   62.065558] x19: 0000000000680000 x18: 0000000000000024
[   62.066566] x17: 0000000000000000 x16: 0000000000000000
[   62.067579] x15: ffffffffffffffff x14: ffff8005e412e890
[   62.068588] x13: ffff8005d6b105d8 x12: 0000000000000000
[   62.069610] x11: ffff8005d6b10490 x10: 0000000000000040
[   62.070615] x9 : ffff8005e412e898 x8 : ffff8005e412e890
[   62.071631] x7 : ffff8005d6b105d8 x6 : ffff8005db546c00
[   62.072640] x5 : 0000000000000001 x4 : 0000000000000002
[   62.073654] x3 : ffff8005d7049480 x2 : 0000000000000002
[   62.074666] x1 : 0000000000000003 x0 : 00000000ffffffef
[   62.075685] Process bash (pid: 3275, stack limit = 0x00000000d754280f)
[   62.076930] Call trace:
[   62.077411]  add_memory_resource+0x1cc/0x1d8
[   62.078227]  __add_memory+0x70/0xa8
[   62.078901]  probe_store+0xa4/0xc8
[   62.079561]  dev_attr_store+0x18/0x28
[   62.080270]  sysfs_kf_write+0x40/0x58
[   62.080992]  kernfs_fop_write+0xcc/0x1d8
[   62.081744]  __vfs_write+0x18/0x40
[   62.082400]  vfs_write+0xa4/0x1b0
[   62.083037]  ksys_write+0x5c/0xc0
[   62.083681]  __arm64_sys_write+0x18/0x20
[   62.084432]  el0_svc_handler+0x88/0x100
[   62.085177]  el0_svc+0x8/0xc
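
For reference, the arm64 pfn_valid() in question looks roughly like this
(arch/arm64/mm/init.c, under CONFIG_HAVE_ARCH_PFN_VALID; quoted from memory,
so treat it as a sketch):

int pfn_valid(unsigned long pfn)
{
	phys_addr_t addr = pfn << PAGE_SHIFT;

	if ((addr >> PAGE_SHIFT) != pfn)
		return 0;
	return memblock_is_map_memory(addr);
}

Once memblock_remove() has dropped a range, memblock_is_map_memory() and
hence pfn_valid() return 0 for pfns in that range, even though their sysfs
state still needs tearing down.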

Re-ordering arch_remove_memory() ahead of memblock_[free|remove]() solves
the problem on arm64, as pfn_valid() then behaves correctly and returns
true while the memblock regions for the address range still exist.
arch_remove_memory() removes the applicable memory sections from the zone
with __remove_pages() and tears down the kernel linear mapping. Removing
the memblock regions afterwards is consistent.

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 mm/memory_hotplug.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0082d69..71d0d79 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1872,11 +1872,10 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
 
 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");
+	arch_remove_memory(nid, start, size, NULL);
 	memblock_free(start, size);
 	memblock_remove(start, size);
 
-	arch_remove_memory(nid, start, size, NULL);
-
 	try_offline_node(nid);
 
 	mem_hotplug_done();
-- 
2.7.4



* [PATCH 5/6] mm/memremap: Rename and consolidate SECTION_SIZE
  2019-04-03  4:30 [PATCH 0/6] arm64/mm: Enable memory hot remove and ZONE_DEVICE Anshuman Khandual
                   ` (3 preceding siblings ...)
  2019-04-03  4:30 ` [PATCH 4/6] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory() Anshuman Khandual
@ 2019-04-03  4:30 ` Anshuman Khandual
  2019-04-03  9:26   ` Michal Hocko
  2019-04-03  9:30   ` David Hildenbrand
  2019-04-03  4:30 ` [PATCH 6/6] arm64/mm: Enable ZONE_DEVICE Anshuman Khandual
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 43+ messages in thread
From: Anshuman Khandual @ 2019-04-03  4:30 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-mm, akpm, will.deacon,
	catalin.marinas
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, logang,
	pasha.tatashin, david, cai

From: Robin Murphy <robin.murphy@arm.com>

Enabling ZONE_DEVICE (through ARCH_HAS_ZONE_DEVICE) for arm64 reveals that
memremap's internal helpers for sparsemem sections conflict with arm64's
definitions for hugepages which inherit the name of "sections" from earlier
versions of the ARM architecture.

Disambiguate memremap by propagating sparsemem's PA_ prefix, to clarify
that these values are in terms of addresses rather than PFNs (and
because it's a heck of a lot easier than changing all the arch code).
SECTION_MASK is unused, so it can just go. While here, consolidate the
single open-coded instance of PA_SECTION_SIZE from mm/hmm.c as well.
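
For context, the conflicting arm64 "section" definitions (from
arch/arm64/include/asm/pgtable-hwdef.h, reproduced roughly) refer to
PMD-sized block mappings rather than sparsemem sections:

#define SECTION_SHIFT		PMD_SHIFT
#define SECTION_SIZE		(_AC(1, UL) << SECTION_SHIFT)
#define SECTION_MASK		(~(SECTION_SIZE-1))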

[anshuman: Consolidated mm/hmm.c instance and updated the commit message]

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 include/linux/mmzone.h |  1 +
 kernel/memremap.c      | 10 ++++------
 mm/hmm.c               |  2 --
 3 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fba7741..ed7dd27 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1081,6 +1081,7 @@ static inline unsigned long early_pfn_to_nid(unsigned long pfn)
  * PFN_SECTION_SHIFT		pfn to/from section number
  */
 #define PA_SECTION_SHIFT	(SECTION_SIZE_BITS)
+#define PA_SECTION_SIZE		(1UL << PA_SECTION_SHIFT)
 #define PFN_SECTION_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)
 
 #define NR_MEM_SECTIONS		(1UL << SECTIONS_SHIFT)
diff --git a/kernel/memremap.c b/kernel/memremap.c
index a856cb5..dda1367 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -14,8 +14,6 @@
 #include <linux/hmm.h>
 
 static DEFINE_XARRAY(pgmap_array);
-#define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
-#define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
 
 #if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
 vm_fault_t device_private_entry_fault(struct vm_area_struct *vma,
@@ -98,8 +96,8 @@ static void devm_memremap_pages_release(void *data)
 		put_page(pfn_to_page(pfn));
 
 	/* pages are dead and unused, undo the arch mapping */
-	align_start = res->start & ~(SECTION_SIZE - 1);
-	align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
+	align_start = res->start & ~(PA_SECTION_SIZE - 1);
+	align_size = ALIGN(res->start + resource_size(res), PA_SECTION_SIZE)
 		- align_start;
 
 	nid = page_to_nid(pfn_to_page(align_start >> PAGE_SHIFT));
@@ -154,8 +152,8 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 	if (!pgmap->ref || !pgmap->kill)
 		return ERR_PTR(-EINVAL);
 
-	align_start = res->start & ~(SECTION_SIZE - 1);
-	align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
+	align_start = res->start & ~(PA_SECTION_SIZE - 1);
+	align_size = ALIGN(res->start + resource_size(res), PA_SECTION_SIZE)
 		- align_start;
 	align_end = align_start + align_size - 1;
 
diff --git a/mm/hmm.c b/mm/hmm.c
index fe1cd87..ef9e4e6 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -33,8 +33,6 @@
 #include <linux/mmu_notifier.h>
 #include <linux/memory_hotplug.h>
 
-#define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
-
 #if IS_ENABLED(CONFIG_HMM_MIRROR)
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
 
-- 
2.7.4



* [PATCH 6/6] arm64/mm: Enable ZONE_DEVICE
  2019-04-03  4:30 [PATCH 0/6] arm64/mm: Enable memory hot remove and ZONE_DEVICE Anshuman Khandual
                   ` (4 preceding siblings ...)
  2019-04-03  4:30 ` [PATCH 5/6] mm/memremap: Rename and consolidate SECTION_SIZE Anshuman Khandual
@ 2019-04-03  4:30 ` Anshuman Khandual
  2019-04-03 13:58   ` Robin Murphy
  2019-04-03 18:08 ` [PATCH 0/6] arm64/mm: Enable memory hot remove and ZONE_DEVICE Dan Williams
  2019-04-04  9:46 ` [RFC 1/2] mm/vmemmap: Enable vmem_altmap based base page mapping for vmemmap Anshuman Khandual
  7 siblings, 1 reply; 43+ messages in thread
From: Anshuman Khandual @ 2019-04-03  4:30 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-mm, akpm, will.deacon,
	catalin.marinas
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, logang,
	pasha.tatashin, david, cai

The arch implementations of the functions which create or destroy the
vmemmap mapping (vmemmap_populate, vmemmap_free) can now understand and
allocate from inside a device memory range through a driver provided
vmem_altmap structure, which fulfils all the requirements to enable
ZONE_DEVICE on the platform. Hence just enable ZONE_DEVICE by subscribing
to ARCH_HAS_ZONE_DEVICE. This is only applicable for ARM64_4K_PAGES
(ARM64_SWAPPER_USES_SECTION_MAPS), which creates vmemmap section mappings
and can utilize the vmem_altmap structure.

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index db3e625..b5d8cf5 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -31,6 +31,7 @@ config ARM64
 	select ARCH_HAS_SYSCALL_WRAPPER
 	select ARCH_HAS_TEARDOWN_DMA_OPS if IOMMU_SUPPORT
 	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
+	select ARCH_HAS_ZONE_DEVICE if ARM64_4K_PAGES
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select ARCH_INLINE_READ_LOCK if !PREEMPT
 	select ARCH_INLINE_READ_LOCK_BH if !PREEMPT
-- 
2.7.4



* Re: [PATCH 1/6] arm64/mm: Enable sysfs based memory hot add interface
  2019-04-03  4:30 ` [PATCH 1/6] arm64/mm: Enable sysfs based memory hot add interface Anshuman Khandual
@ 2019-04-03  8:20   ` David Hildenbrand
  2019-04-03 13:12     ` Robin Murphy
  2019-04-04  5:25     ` Anshuman Khandual
  0 siblings, 2 replies; 43+ messages in thread
From: David Hildenbrand @ 2019-04-03  8:20 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-arm-kernel, linux-mm,
	akpm, will.deacon, catalin.marinas
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, logang,
	pasha.tatashin, cai

On 03.04.19 06:30, Anshuman Khandual wrote:
> Sysfs memory probe interface (/sys/devices/system/memory/probe) can accept
> starting physical address of an entire memory block to be hot added into
> the kernel. This is in addition to the existing ACPI based interface. This
> just enables it with the required config CONFIG_ARCH_MEMORY_PROBE.
> 

We recently discussed that the similar interface for removal should
rather be moved to a debug/test module.

I wonder if we should try to do the same for the sysfs probing
interface, and rather try to get rid of it than open the door to more
users.

> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---
>  arch/arm64/Kconfig | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 7e34b9e..a2418fb 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -266,6 +266,15 @@ config HAVE_GENERIC_GUP
>  config ARCH_ENABLE_MEMORY_HOTPLUG
>  	def_bool y
>  
> +config ARCH_MEMORY_PROBE
> +	bool "Enable /sys/devices/system/memory/probe interface"
> +	depends on MEMORY_HOTPLUG
> +	help
> +	  This option enables a sysfs /sys/devices/system/memory/probe
> +	  interface for testing. See Documentation/memory-hotplug.txt
> +	  for more information. If you are unsure how to answer this
> +	  question, answer N.
> +
>  config SMP
>  	def_bool y
>  
> 


-- 

Thanks,

David / dhildenb


* Re: [PATCH 4/6] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory()
  2019-04-03  4:30 ` [PATCH 4/6] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory() Anshuman Khandual
@ 2019-04-03  8:45   ` Oscar Salvador
  2019-04-03  9:17   ` Michal Hocko
  2019-04-03  9:30   ` David Hildenbrand
  2 siblings, 0 replies; 43+ messages in thread
From: Oscar Salvador @ 2019-04-03  8:45 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, linux-arm-kernel, linux-mm, akpm, will.deacon,
	catalin.marinas, mhocko, mgorman, james.morse, mark.rutland,
	robin.murphy, cpandya, arunks, dan.j.williams, logang,
	pasha.tatashin, david, cai

On Wed, Apr 03, 2019 at 10:00:04AM +0530, Anshuman Khandual wrote:
> Re-ordering arch_remove_memory() with memblock_[free|remove] solves the
> problem on arm64 as pfn_valid() behaves correctly and returns positive
> as memblock for the address range still exists. arch_remove_memory()
> removes applicable memory sections from zone with __remove_pages() and
> tears down kernel linear mapping. Removing memblock regions afterwards
> is consistent.
> 
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>

Yes, I cannot see a way of those two colliding with each other, and
testing on my box did not raise anything spooky either.

Reviewed-by: Oscar Salvador <osalvador@suse.de>

> ---
>  mm/memory_hotplug.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 0082d69..71d0d79 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1872,11 +1872,10 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
>  
>  	/* remove memmap entry */
>  	firmware_map_remove(start, start + size, "System RAM");
> +	arch_remove_memory(nid, start, size, NULL);
>  	memblock_free(start, size);
>  	memblock_remove(start, size);
>  
> -	arch_remove_memory(nid, start, size, NULL);
> -
>  	try_offline_node(nid);
>  
>  	mem_hotplug_done();
> -- 
> 2.7.4
> 

-- 
Oscar Salvador
SUSE L3


* Re: [PATCH 4/6] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory()
  2019-04-03  4:30 ` [PATCH 4/6] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory() Anshuman Khandual
  2019-04-03  8:45   ` Oscar Salvador
@ 2019-04-03  9:17   ` Michal Hocko
  2019-04-04  8:32     ` Anshuman Khandual
  2019-04-03  9:30   ` David Hildenbrand
  2 siblings, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2019-04-03  9:17 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, linux-arm-kernel, linux-mm, akpm, will.deacon,
	catalin.marinas, mgorman, james.morse, mark.rutland,
	robin.murphy, cpandya, arunks, dan.j.williams, osalvador, logang,
	pasha.tatashin, david, cai

On Wed 03-04-19 10:00:04, Anshuman Khandual wrote:
> Memory hot remove uses get_nid_for_pfn() while tearing down linked sysfs
> entries between memory block and node. It first checks pfn validity with
> pfn_valid_within() before fetching nid. With CONFIG_HOLES_IN_ZONE config
> (arm64 has this enabled) pfn_valid_within() calls pfn_valid().
> 
> pfn_valid() is an arch implementation on arm64 (CONFIG_HAVE_ARCH_PFN_VALID)
> which scans all mapped memblock regions with memblock_is_map_memory(). This
> creates a problem in memory hot remove path which has already removed given
> memory range from memory block with memblock_[remove|free] before arriving
> at unregister_mem_sect_under_nodes(). Hence get_nid_for_pfn() returns -1
> skipping subsequent sysfs_remove_link() calls leaving node <-> memory block
> sysfs entries as is. Subsequent memory add operation hits BUG_ON() because
> of existing sysfs entries.
> 
> [   62.007176] NUMA: Unknown node for memory at 0x680000000, assuming node 0
> [   62.052517] ------------[ cut here ]------------
> [   62.053211] kernel BUG at mm/memory_hotplug.c:1143!
> [   62.053868] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
> [   62.054589] Modules linked in:
> [   62.054999] CPU: 19 PID: 3275 Comm: bash Not tainted 5.1.0-rc2-00004-g28cea40b2683 #41
> [   62.056274] Hardware name: linux,dummy-virt (DT)
> [   62.057166] pstate: 40400005 (nZcv daif +PAN -UAO)
> [   62.058083] pc : add_memory_resource+0x1cc/0x1d8
> [   62.058961] lr : add_memory_resource+0x10c/0x1d8
> [   62.059842] sp : ffff0000168b3ce0
> [   62.060477] x29: ffff0000168b3ce0 x28: ffff8005db546c00
> [   62.061501] x27: 0000000000000000 x26: 0000000000000000
> [   62.062509] x25: ffff0000111ef000 x24: ffff0000111ef5d0
> [   62.063520] x23: 0000000000000000 x22: 00000006bfffffff
> [   62.064540] x21: 00000000ffffffef x20: 00000000006c0000
> [   62.065558] x19: 0000000000680000 x18: 0000000000000024
> [   62.066566] x17: 0000000000000000 x16: 0000000000000000
> [   62.067579] x15: ffffffffffffffff x14: ffff8005e412e890
> [   62.068588] x13: ffff8005d6b105d8 x12: 0000000000000000
> [   62.069610] x11: ffff8005d6b10490 x10: 0000000000000040
> [   62.070615] x9 : ffff8005e412e898 x8 : ffff8005e412e890
> [   62.071631] x7 : ffff8005d6b105d8 x6 : ffff8005db546c00
> [   62.072640] x5 : 0000000000000001 x4 : 0000000000000002
> [   62.073654] x3 : ffff8005d7049480 x2 : 0000000000000002
> [   62.074666] x1 : 0000000000000003 x0 : 00000000ffffffef
> [   62.075685] Process bash (pid: 3275, stack limit = 0x00000000d754280f)
> [   62.076930] Call trace:
> [   62.077411]  add_memory_resource+0x1cc/0x1d8
> [   62.078227]  __add_memory+0x70/0xa8
> [   62.078901]  probe_store+0xa4/0xc8
> [   62.079561]  dev_attr_store+0x18/0x28
> [   62.080270]  sysfs_kf_write+0x40/0x58
> [   62.080992]  kernfs_fop_write+0xcc/0x1d8
> [   62.081744]  __vfs_write+0x18/0x40
> [   62.082400]  vfs_write+0xa4/0x1b0
> [   62.083037]  ksys_write+0x5c/0xc0
> [   62.083681]  __arm64_sys_write+0x18/0x20
> [   62.084432]  el0_svc_handler+0x88/0x100
> [   62.085177]  el0_svc+0x8/0xc
> 
> Re-ordering arch_remove_memory() with memblock_[free|remove] solves the
> problem on arm64 as pfn_valid() behaves correctly and returns positive
> as memblock for the address range still exists. arch_remove_memory()
> removes applicable memory sections from zone with __remove_pages() and
> tears down kernel linear mapping. Removing memblock regions afterwards
> is consistent.

consistent with what? Anyway, I believe you wanted to mention that this
is safe because there is no other memblock (bootmem) allocator user that
late. So nobody is going to allocate from the removed range just to blow
up later. Also nobody should be using the bootmem allocated range, else
we wouldn't allow removing it. So the reordering is indeed safe.
 
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>

With a changelog updated to explain why this is safe
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memory_hotplug.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 0082d69..71d0d79 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1872,11 +1872,10 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
>  
>  	/* remove memmap entry */
>  	firmware_map_remove(start, start + size, "System RAM");
> +	arch_remove_memory(nid, start, size, NULL);
>  	memblock_free(start, size);
>  	memblock_remove(start, size);
>  
> -	arch_remove_memory(nid, start, size, NULL);
> -
>  	try_offline_node(nid);
>  
>  	mem_hotplug_done();
> -- 
> 2.7.4
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 5/6] mm/memremap: Rename and consolidate SECTION_SIZE
  2019-04-03  4:30 ` [PATCH 5/6] mm/memremap: Rename and consolidate SECTION_SIZE Anshuman Khandual
@ 2019-04-03  9:26   ` Michal Hocko
  2019-04-03  9:30   ` David Hildenbrand
  1 sibling, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2019-04-03  9:26 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, linux-arm-kernel, linux-mm, akpm, will.deacon,
	catalin.marinas, mgorman, james.morse, mark.rutland,
	robin.murphy, cpandya, arunks, dan.j.williams, osalvador, logang,
	pasha.tatashin, david, cai

On Wed 03-04-19 10:00:05, Anshuman Khandual wrote:
> From: Robin Murphy <robin.murphy@arm.com>
> 
> Enabling ZONE_DEVICE (through ARCH_HAS_ZONE_DEVICE) for arm64 reveals that
> memremap's internal helpers for sparsemem sections conflict with arm64's
> definitions for hugepages which inherit the name of "sections" from earlier
> versions of the ARM architecture.
> 
> Disambiguate memremap by propagating sparsemem's PA_ prefix, to clarify
> that these values are in terms of addresses rather than PFNs (and
> because it's a heck of a lot easier than changing all the arch code).
> SECTION_MASK is unused, so it can just go. While here consolidate single
> instance of PA_SECTION_SIZE from mm/hmm.c as well.
> 
> [anshuman: Consolidated mm/hmm.c instance and updated the commit message]

Agreed. memremap shouldn't have redefined SECTION_SIZE in the first place.
It just adds confusion.

> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/mmzone.h |  1 +
>  kernel/memremap.c      | 10 ++++------
>  mm/hmm.c               |  2 --
>  3 files changed, 5 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index fba7741..ed7dd27 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1081,6 +1081,7 @@ static inline unsigned long early_pfn_to_nid(unsigned long pfn)
>   * PFN_SECTION_SHIFT		pfn to/from section number
>   */
>  #define PA_SECTION_SHIFT	(SECTION_SIZE_BITS)
> +#define PA_SECTION_SIZE		(1UL << PA_SECTION_SHIFT)
>  #define PFN_SECTION_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)
>  
>  #define NR_MEM_SECTIONS		(1UL << SECTIONS_SHIFT)
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index a856cb5..dda1367 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -14,8 +14,6 @@
>  #include <linux/hmm.h>
>  
>  static DEFINE_XARRAY(pgmap_array);
> -#define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
> -#define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
>  
>  #if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
>  vm_fault_t device_private_entry_fault(struct vm_area_struct *vma,
> @@ -98,8 +96,8 @@ static void devm_memremap_pages_release(void *data)
>  		put_page(pfn_to_page(pfn));
>  
>  	/* pages are dead and unused, undo the arch mapping */
> -	align_start = res->start & ~(SECTION_SIZE - 1);
> -	align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
> +	align_start = res->start & ~(PA_SECTION_SIZE - 1);
> +	align_size = ALIGN(res->start + resource_size(res), PA_SECTION_SIZE)
>  		- align_start;
>  
>  	nid = page_to_nid(pfn_to_page(align_start >> PAGE_SHIFT));
> @@ -154,8 +152,8 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
>  	if (!pgmap->ref || !pgmap->kill)
>  		return ERR_PTR(-EINVAL);
>  
> -	align_start = res->start & ~(SECTION_SIZE - 1);
> -	align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
> +	align_start = res->start & ~(PA_SECTION_SIZE - 1);
> +	align_size = ALIGN(res->start + resource_size(res), PA_SECTION_SIZE)
>  		- align_start;
>  	align_end = align_start + align_size - 1;
>  
> diff --git a/mm/hmm.c b/mm/hmm.c
> index fe1cd87..ef9e4e6 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -33,8 +33,6 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/memory_hotplug.h>
>  
> -#define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
> -
>  #if IS_ENABLED(CONFIG_HMM_MIRROR)
>  static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
>  
> -- 
> 2.7.4
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 5/6] mm/memremap: Rename and consolidate SECTION_SIZE
  2019-04-03  4:30 ` [PATCH 5/6] mm/memremap: Rename and consolidate SECTION_SIZE Anshuman Khandual
  2019-04-03  9:26   ` Michal Hocko
@ 2019-04-03  9:30   ` David Hildenbrand
  1 sibling, 0 replies; 43+ messages in thread
From: David Hildenbrand @ 2019-04-03  9:30 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-arm-kernel, linux-mm,
	akpm, will.deacon, catalin.marinas
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, logang,
	pasha.tatashin, cai

On 03.04.19 06:30, Anshuman Khandual wrote:
> From: Robin Murphy <robin.murphy@arm.com>
> 
> Enabling ZONE_DEVICE (through ARCH_HAS_ZONE_DEVICE) for arm64 reveals that
> memremap's internal helpers for sparsemem sections conflict with arm64's
> definitions for hugepages which inherit the name of "sections" from earlier
> versions of the ARM architecture.
> 
> Disambiguate memremap by propagating sparsemem's PA_ prefix, to clarify
> that these values are in terms of addresses rather than PFNs (and
> because it's a heck of a lot easier than changing all the arch code).
> SECTION_MASK is unused, so it can just go. While here consolidate single
> instance of PA_SECTION_SIZE from mm/hmm.c as well.
> 
> [anshuman: Consolidated mm/hmm.c instance and updated the commit message]
> 
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---
>  include/linux/mmzone.h |  1 +
>  kernel/memremap.c      | 10 ++++------
>  mm/hmm.c               |  2 --
>  3 files changed, 5 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index fba7741..ed7dd27 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1081,6 +1081,7 @@ static inline unsigned long early_pfn_to_nid(unsigned long pfn)
>   * PFN_SECTION_SHIFT		pfn to/from section number
>   */
>  #define PA_SECTION_SHIFT	(SECTION_SIZE_BITS)
> +#define PA_SECTION_SIZE		(1UL << PA_SECTION_SHIFT)
>  #define PFN_SECTION_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)
>  
>  #define NR_MEM_SECTIONS		(1UL << SECTIONS_SHIFT)
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index a856cb5..dda1367 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -14,8 +14,6 @@
>  #include <linux/hmm.h>
>  
>  static DEFINE_XARRAY(pgmap_array);
> -#define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
> -#define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
>  
>  #if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
>  vm_fault_t device_private_entry_fault(struct vm_area_struct *vma,
> @@ -98,8 +96,8 @@ static void devm_memremap_pages_release(void *data)
>  		put_page(pfn_to_page(pfn));
>  
>  	/* pages are dead and unused, undo the arch mapping */
> -	align_start = res->start & ~(SECTION_SIZE - 1);
> -	align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
> +	align_start = res->start & ~(PA_SECTION_SIZE - 1);
> +	align_size = ALIGN(res->start + resource_size(res), PA_SECTION_SIZE)
>  		- align_start;
>  
>  	nid = page_to_nid(pfn_to_page(align_start >> PAGE_SHIFT));
> @@ -154,8 +152,8 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
>  	if (!pgmap->ref || !pgmap->kill)
>  		return ERR_PTR(-EINVAL);
>  
> -	align_start = res->start & ~(SECTION_SIZE - 1);
> -	align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
> +	align_start = res->start & ~(PA_SECTION_SIZE - 1);
> +	align_size = ALIGN(res->start + resource_size(res), PA_SECTION_SIZE)
>  		- align_start;
>  	align_end = align_start + align_size - 1;
>  
> diff --git a/mm/hmm.c b/mm/hmm.c
> index fe1cd87..ef9e4e6 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -33,8 +33,6 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/memory_hotplug.h>
>  
> -#define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
> -
>  #if IS_ENABLED(CONFIG_HMM_MIRROR)
>  static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
>  
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 

Thanks,

David / dhildenb


* Re: [PATCH 4/6] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory()
  2019-04-03  4:30 ` [PATCH 4/6] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory() Anshuman Khandual
  2019-04-03  8:45   ` Oscar Salvador
  2019-04-03  9:17   ` Michal Hocko
@ 2019-04-03  9:30   ` David Hildenbrand
  2 siblings, 0 replies; 43+ messages in thread
From: David Hildenbrand @ 2019-04-03  9:30 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-arm-kernel, linux-mm,
	akpm, will.deacon, catalin.marinas
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, logang,
	pasha.tatashin, cai

On 03.04.19 06:30, Anshuman Khandual wrote:
> Memory hot remove uses get_nid_for_pfn() while tearing down linked sysfs
> entries between memory block and node. It first checks pfn validity with
> pfn_valid_within() before fetching nid. With CONFIG_HOLES_IN_ZONE config
> (arm64 has this enabled) pfn_valid_within() calls pfn_valid().
> 
> pfn_valid() is an arch implementation on arm64 (CONFIG_HAVE_ARCH_PFN_VALID)
> which scans all mapped memblock regions with memblock_is_map_memory(). This
> creates a problem in memory hot remove path which has already removed given
> memory range from memory block with memblock_[remove|free] before arriving
> at unregister_mem_sect_under_nodes(). Hence get_nid_for_pfn() returns -1
> skipping subsequent sysfs_remove_link() calls leaving node <-> memory block
> sysfs entries as is. Subsequent memory add operation hits BUG_ON() because
> of existing sysfs entries.
> 
> [   62.007176] NUMA: Unknown node for memory at 0x680000000, assuming node 0
> [   62.052517] ------------[ cut here ]------------
> [   62.053211] kernel BUG at mm/memory_hotplug.c:1143!
> [   62.053868] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
> [   62.054589] Modules linked in:
> [   62.054999] CPU: 19 PID: 3275 Comm: bash Not tainted 5.1.0-rc2-00004-g28cea40b2683 #41
> [   62.056274] Hardware name: linux,dummy-virt (DT)
> [   62.057166] pstate: 40400005 (nZcv daif +PAN -UAO)
> [   62.058083] pc : add_memory_resource+0x1cc/0x1d8
> [   62.058961] lr : add_memory_resource+0x10c/0x1d8
> [   62.059842] sp : ffff0000168b3ce0
> [   62.060477] x29: ffff0000168b3ce0 x28: ffff8005db546c00
> [   62.061501] x27: 0000000000000000 x26: 0000000000000000
> [   62.062509] x25: ffff0000111ef000 x24: ffff0000111ef5d0
> [   62.063520] x23: 0000000000000000 x22: 00000006bfffffff
> [   62.064540] x21: 00000000ffffffef x20: 00000000006c0000
> [   62.065558] x19: 0000000000680000 x18: 0000000000000024
> [   62.066566] x17: 0000000000000000 x16: 0000000000000000
> [   62.067579] x15: ffffffffffffffff x14: ffff8005e412e890
> [   62.068588] x13: ffff8005d6b105d8 x12: 0000000000000000
> [   62.069610] x11: ffff8005d6b10490 x10: 0000000000000040
> [   62.070615] x9 : ffff8005e412e898 x8 : ffff8005e412e890
> [   62.071631] x7 : ffff8005d6b105d8 x6 : ffff8005db546c00
> [   62.072640] x5 : 0000000000000001 x4 : 0000000000000002
> [   62.073654] x3 : ffff8005d7049480 x2 : 0000000000000002
> [   62.074666] x1 : 0000000000000003 x0 : 00000000ffffffef
> [   62.075685] Process bash (pid: 3275, stack limit = 0x00000000d754280f)
> [   62.076930] Call trace:
> [   62.077411]  add_memory_resource+0x1cc/0x1d8
> [   62.078227]  __add_memory+0x70/0xa8
> [   62.078901]  probe_store+0xa4/0xc8
> [   62.079561]  dev_attr_store+0x18/0x28
> [   62.080270]  sysfs_kf_write+0x40/0x58
> [   62.080992]  kernfs_fop_write+0xcc/0x1d8
> [   62.081744]  __vfs_write+0x18/0x40
> [   62.082400]  vfs_write+0xa4/0x1b0
> [   62.083037]  ksys_write+0x5c/0xc0
> [   62.083681]  __arm64_sys_write+0x18/0x20
> [   62.084432]  el0_svc_handler+0x88/0x100
> [   62.085177]  el0_svc+0x8/0xc
> 
> Re-ordering arch_remove_memory() with memblock_[free|remove] solves the
> problem on arm64 as pfn_valid() behaves correctly and returns positive
> as memblock for the address range still exists. arch_remove_memory()
> removes applicable memory sections from zone with __remove_pages() and
> tears down kernel linear mapping. Removing memblock regions afterwards
> is consistent.
> 
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---
>  mm/memory_hotplug.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 0082d69..71d0d79 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1872,11 +1872,10 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
>  
>  	/* remove memmap entry */
>  	firmware_map_remove(start, start + size, "System RAM");
> +	arch_remove_memory(nid, start, size, NULL);
>  	memblock_free(start, size);
>  	memblock_remove(start, size);
>  
> -	arch_remove_memory(nid, start, size, NULL);
> -
>  	try_offline_node(nid);
>  
>  	mem_hotplug_done();
> 

As discussed, with tweaked wording

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 

Thanks,

David / dhildenb


* Re: [PATCH 2/6] arm64/mm: Enable memory hot remove
  2019-04-03  4:30 ` [PATCH 2/6] arm64/mm: Enable memory hot remove Anshuman Khandual
@ 2019-04-03 12:37   ` Robin Murphy
  2019-04-03 13:15     ` Steven Price
  2019-04-04  5:39     ` Anshuman Khandual
  2019-04-03 17:32   ` Logan Gunthorpe
  1 sibling, 2 replies; 43+ messages in thread
From: Robin Murphy @ 2019-04-03 12:37 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-arm-kernel, linux-mm,
	akpm, will.deacon, catalin.marinas
  Cc: mhocko, mgorman, james.morse, mark.rutland, cpandya, arunks,
	dan.j.williams, osalvador, logang, pasha.tatashin, david, cai,
	Steven Price

[ +Steve ]

Hi Anshuman,

On 03/04/2019 05:30, Anshuman Khandual wrote:
> Memory removal from an arch perspective involves tearing down two different
> kernel based mappings i.e vmemmap and linear while releasing related page
> table pages allocated for the physical memory range to be removed.
> 
> Define a common kernel page table tear down helper remove_pagetable() which
> can be used to unmap given kernel virtual address range. In effect it can
> tear down both vmemmap and kernel linear mappings. This new helper is called
> from both vmemmap_free() and __remove_pgd_mapping() during memory removal.
> The argument 'direct' here identifies kernel linear mappings.
> 
> Vmemmap mappings page table pages are allocated through sparse mem helper
> functions like vmemmap_alloc_block() which does not cycle the pages through
> pgtable_page_ctor() constructs. Hence while removing it skips corresponding
> destructor construct pgtable_page_dtor().
> 
> While here update arch_add_memory() to handle __add_pages() failures by
> just unmapping recently added kernel linear mapping. Now enable memory hot
> remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.
> 
> This implementation is overall inspired from kernel page table tear down
> procedure on X86 architecture.

A bit of a nit, but since this depends on at least patch #4 to work 
properly, it would be good to reorder the series appropriately.

> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---
>   arch/arm64/Kconfig               |   3 +
>   arch/arm64/include/asm/pgtable.h |  14 +++
>   arch/arm64/mm/mmu.c              | 227 ++++++++++++++++++++++++++++++++++++++-
>   3 files changed, 241 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index a2418fb..db3e625 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -266,6 +266,9 @@ config HAVE_GENERIC_GUP
>   config ARCH_ENABLE_MEMORY_HOTPLUG
>   	def_bool y
>   
> +config ARCH_ENABLE_MEMORY_HOTREMOVE
> +	def_bool y
> +
>   config ARCH_MEMORY_PROBE
>   	bool "Enable /sys/devices/system/memory/probe interface"
>   	depends on MEMORY_HOTPLUG
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index de70c1e..858098e 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -355,6 +355,18 @@ static inline int pmd_protnone(pmd_t pmd)
>   }
>   #endif
>   
> +#if (CONFIG_PGTABLE_LEVELS > 2)
> +#define pmd_large(pmd)	(pmd_val(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT))
> +#else
> +#define pmd_large(pmd) 0
> +#endif
> +
> +#if (CONFIG_PGTABLE_LEVELS > 3)
> +#define pud_large(pud)	(pud_val(pud) && !(pud_val(pud) & PUD_TABLE_BIT))
> +#else
> +#define pud_large(pud) 0
> +#endif

These seem rather different from the versions that Steve is proposing in 
the generic pagewalk series - can you reach an agreement on which 
implementation is preferred?

> +
>   /*
>    * THP definitions.
>    */
> @@ -555,6 +567,7 @@ static inline phys_addr_t pud_page_paddr(pud_t pud)
>   
>   #else
>   
> +#define pmd_index(addr) 0
>   #define pud_page_paddr(pud)	({ BUILD_BUG(); 0; })
>   
>   /* Match pmd_offset folding in <asm/generic/pgtable-nopmd.h> */
> @@ -612,6 +625,7 @@ static inline phys_addr_t pgd_page_paddr(pgd_t pgd)
>   
>   #else
>   
> +#define pud_index(addr)	0
>   #define pgd_page_paddr(pgd)	({ BUILD_BUG(); 0;})
>   
>   /* Match pud_offset folding in <asm/generic/pgtable-nopud.h> */
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index e97f018..ae0777b 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -714,6 +714,198 @@ int kern_addr_valid(unsigned long addr)
>   
>   	return pfn_valid(pte_pfn(pte));
>   }
> +
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +static void __meminit free_pagetable(struct page *page, int order)

Do these need to be __meminit? AFAICS it's effectively redundant with 
the containing #ifdef, and removal feels like it's inherently a 
later-than-init thing anyway.

> +{
> +	unsigned long magic;
> +	unsigned int nr_pages = 1 << order;
> +
> +	if (PageReserved(page)) {
> +		__ClearPageReserved(page);
> +
> +		magic = (unsigned long)page->freelist;
> +		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
> +			while (nr_pages--)
> +				put_page_bootmem(page++);
> +		} else
> +			while (nr_pages--)
> +				free_reserved_page(page++);
> +	} else
> +		free_pages((unsigned long)page_address(page), order);
> +}
> +
> +#if (CONFIG_PGTABLE_LEVELS > 2)
> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd, bool direct)
> +{
> +	pte_t *pte;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PTE; i++) {
> +		pte = pte_start + i;
> +		if (!pte_none(*pte))
> +			return;
> +	}
> +
> +	if (direct)
> +		pgtable_page_dtor(pmd_page(*pmd));
> +	free_pagetable(pmd_page(*pmd), 0);
> +	spin_lock(&init_mm.page_table_lock);
> +	pmd_clear(pmd);
> +	spin_unlock(&init_mm.page_table_lock);
> +}
> +#else
> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd, bool direct)
> +{
> +}
> +#endif
> +
> +#if (CONFIG_PGTABLE_LEVELS > 3)
> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud, bool direct)
> +{
> +	pmd_t *pmd;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PMD; i++) {
> +		pmd = pmd_start + i;
> +		if (!pmd_none(*pmd))
> +			return;
> +	}
> +
> +	if (direct)
> +		pgtable_page_dtor(pud_page(*pud));
> +	free_pagetable(pud_page(*pud), 0);
> +	spin_lock(&init_mm.page_table_lock);
> +	pud_clear(pud);
> +	spin_unlock(&init_mm.page_table_lock);
> +}
> +
> +static void __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd, bool direct)
> +{
> +	pud_t *pud;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PUD; i++) {
> +		pud = pud_start + i;
> +		if (!pud_none(*pud))
> +			return;
> +	}
> +
> +	if (direct)
> +		pgtable_page_dtor(pgd_page(*pgd));
> +	free_pagetable(pgd_page(*pgd), 0);
> +	spin_lock(&init_mm.page_table_lock);
> +	pgd_clear(pgd);
> +	spin_unlock(&init_mm.page_table_lock);
> +}
> +#else
> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud, bool direct)
> +{
> +}
> +
> +static void __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd, bool direct)
> +{
> +}
> +#endif
> +
> +static void __meminit
> +remove_pte_table(pte_t *pte_start, unsigned long addr,
> +			unsigned long end, bool direct)
> +{
> +	pte_t *pte;
> +
> +	pte = pte_start + pte_index(addr);
> +	for (; addr < end; addr += PAGE_SIZE, pte++) {
> +		if (!pte_present(*pte))
> +			continue;
> +
> +		if (!direct)
> +			free_pagetable(pte_page(*pte), 0);
> +		spin_lock(&init_mm.page_table_lock);
> +		pte_clear(&init_mm, addr, pte);
> +		spin_unlock(&init_mm.page_table_lock);
> +	}
> +}
> +
> +static void __meminit
> +remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
> +			unsigned long end, bool direct)
> +{
> +	unsigned long next;
> +	pte_t *pte_base;
> +	pmd_t *pmd;
> +
> +	pmd = pmd_start + pmd_index(addr);
> +	for (; addr < end; addr = next, pmd++) {
> +		next = pmd_addr_end(addr, end);
> +		if (!pmd_present(*pmd))
> +			continue;
> +
> +		if (pmd_large(*pmd)) {
> +			if (!direct)
> +				free_pagetable(pmd_page(*pmd),
> +						get_order(PMD_SIZE));
> +			spin_lock(&init_mm.page_table_lock);
> +			pmd_clear(pmd);
> +			spin_unlock(&init_mm.page_table_lock);
> +			continue;
> +		}
> +		pte_base = pte_offset_kernel(pmd, 0UL);
> +		remove_pte_table(pte_base, addr, next, direct);
> +		free_pte_table(pte_base, pmd, direct);
> +	}
> +}
> +
> +static void __meminit
> +remove_pud_table(pud_t *pud_start, unsigned long addr,
> +			unsigned long end, bool direct)
> +{
> +	unsigned long next;
> +	pmd_t *pmd_base;
> +	pud_t *pud;
> +
> +	pud = pud_start + pud_index(addr);
> +	for (; addr < end; addr = next, pud++) {
> +		next = pud_addr_end(addr, end);
> +		if (!pud_present(*pud))
> +			continue;
> +
> +		if (pud_large(*pud)) {
> +			if (!direct)
> +				free_pagetable(pud_page(*pud),
> +						get_order(PUD_SIZE));
> +			spin_lock(&init_mm.page_table_lock);
> +			pud_clear(pud);
> +			spin_unlock(&init_mm.page_table_lock);
> +			continue;
> +		}
> +		pmd_base = pmd_offset(pud, 0UL);
> +		remove_pmd_table(pmd_base, addr, next, direct);
> +		free_pmd_table(pmd_base, pud, direct);
> +	}
> +}
> +
> +static void __meminit
> +remove_pagetable(unsigned long start, unsigned long end, bool direct)
> +{
> +	unsigned long addr, next;
> +	pud_t *pud_base;
> +	pgd_t *pgd;
> +
> +	for (addr = start; addr < end; addr = next) {
> +		next = pgd_addr_end(addr, end);
> +		pgd = pgd_offset_k(addr);
> +		if (!pgd_present(*pgd))
> +			continue;
> +
> +		pud_base = pud_offset(pgd, 0UL);
> +		remove_pud_table(pud_base, addr, next, direct);
> +		free_pud_table(pud_base, pgd, direct);
> +	}
> +	flush_tlb_kernel_range(start, end);
> +}
> +#endif
> +
>   #ifdef CONFIG_SPARSEMEM_VMEMMAP
>   #if !ARM64_SWAPPER_USES_SECTION_MAPS
>   int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
> @@ -758,9 +950,12 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>   	return 0;
>   }
>   #endif	/* CONFIG_ARM64_64K_PAGES */
> -void vmemmap_free(unsigned long start, unsigned long end,
> +void __ref vmemmap_free(unsigned long start, unsigned long end,

Why is the __ref needed? Presumably it's avoidable by addressing the 
__meminit thing above.

>   		struct vmem_altmap *altmap)
>   {
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +	remove_pagetable(start, end, false);
> +#endif
>   }
>   #endif	/* CONFIG_SPARSEMEM_VMEMMAP */
>   
> @@ -1046,10 +1241,16 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
>   }
>   
>   #ifdef CONFIG_MEMORY_HOTPLUG
> +static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
> +{
> +	WARN_ON(pgdir != init_mm.pgd);
> +	remove_pagetable(start, start + size, true);
> +}
> +
>   int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
>   		    bool want_memblock)
>   {
> -	int flags = 0;
> +	int flags = 0, ret = 0;

Initialising ret here is unnecessary.

Robin.

>   
>   	if (rodata_full || debug_pagealloc_enabled())
>   		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> @@ -1057,7 +1258,27 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
>   	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
>   			     size, PAGE_KERNEL, pgd_pgtable_alloc, flags);
>   
> -	return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
> +	ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
>   			   altmap, want_memblock);
> +	if (ret)
> +		__remove_pgd_mapping(swapper_pg_dir,
> +					__phys_to_virt(start), size);
> +	return ret;
>   }
> +
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap)
> +{
> +	unsigned long start_pfn = start >> PAGE_SHIFT;
> +	unsigned long nr_pages = size >> PAGE_SHIFT;
> +	struct zone *zone = page_zone(pfn_to_page(start_pfn));
> +	int ret;
> +
> +	ret = __remove_pages(zone, start_pfn, nr_pages, altmap);
> +	if (!ret)
> +		__remove_pgd_mapping(swapper_pg_dir,
> +					__phys_to_virt(start), size);
> +	return ret;
> +}
> +#endif
>   #endif
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/6] arm64/mm: Enable sysfs based memory hot add interface
  2019-04-03  8:20   ` David Hildenbrand
@ 2019-04-03 13:12     ` Robin Murphy
  2019-04-04  5:21       ` Anshuman Khandual
  2019-04-04  5:25     ` Anshuman Khandual
  1 sibling, 1 reply; 43+ messages in thread
From: Robin Murphy @ 2019-04-03 13:12 UTC (permalink / raw)
  To: David Hildenbrand, Anshuman Khandual, linux-kernel,
	linux-arm-kernel, linux-mm, akpm, will.deacon, catalin.marinas
  Cc: mhocko, mgorman, james.morse, mark.rutland, cpandya, arunks,
	dan.j.williams, osalvador, logang, cai

On 03/04/2019 09:20, David Hildenbrand wrote:
> On 03.04.19 06:30, Anshuman Khandual wrote:
>> Sysfs memory probe interface (/sys/devices/system/memory/probe) can accept
>> starting physical address of an entire memory block to be hot added into
>> the kernel. This is in addition to the existing ACPI based interface. This
>> just enables it with the required config CONFIG_ARCH_MEMORY_PROBE.
>>
> 
> We recently discussed that the similar interface for removal should
> rather be moved to a debug/test module
> 
> I wonder if we should try to do the same for the sysfs probing
> interface. Rather try to get rid of it than open the doors for more users.

Agreed - if this option even exists in a released kernel, there's a risk 
that distros will turn it on for the sake of it, and at that point arm64 
is stuck carrying the same ABI baggage as well.

If users turn up in future with a desperate and unavoidable need for the 
legacy half-an-API on arm64, we can always reconsider adding it at that 
point. It was very much deliberate that my original hot-add support did 
not include a patch like this one.

Robin.

>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>> ---
>>   arch/arm64/Kconfig | 9 +++++++++
>>   1 file changed, 9 insertions(+)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 7e34b9e..a2418fb 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -266,6 +266,15 @@ config HAVE_GENERIC_GUP
>>   config ARCH_ENABLE_MEMORY_HOTPLUG
>>   	def_bool y
>>   
>> +config ARCH_MEMORY_PROBE
>> +	bool "Enable /sys/devices/system/memory/probe interface"
>> +	depends on MEMORY_HOTPLUG
>> +	help
>> +	  This option enables a sysfs /sys/devices/system/memory/probe
>> +	  interface for testing. See Documentation/memory-hotplug.txt
>> +	  for more information. If you are unsure how to answer this
>> +	  question, answer N.
>> +
>>   config SMP
>>   	def_bool y
>>   
>>
> 
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/6] arm64/mm: Enable memory hot remove
  2019-04-03 12:37   ` Robin Murphy
@ 2019-04-03 13:15     ` Steven Price
  2019-04-04  6:51       ` Anshuman Khandual
  2019-04-04  5:39     ` Anshuman Khandual
  1 sibling, 1 reply; 43+ messages in thread
From: Steven Price @ 2019-04-03 13:15 UTC (permalink / raw)
  To: Robin Murphy, Anshuman Khandual, linux-kernel, linux-arm-kernel,
	linux-mm, akpm, will.deacon, catalin.marinas
  Cc: mark.rutland, mhocko, david, logang, cai, pasha.tatashin,
	james.morse, cpandya, arunks, dan.j.williams, mgorman, osalvador

On 03/04/2019 13:37, Robin Murphy wrote:
> [ +Steve ]
> 
> Hi Anshuman,
> 
> On 03/04/2019 05:30, Anshuman Khandual wrote:

<snip>

>> diff --git a/arch/arm64/include/asm/pgtable.h
>> b/arch/arm64/include/asm/pgtable.h
>> index de70c1e..858098e 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -355,6 +355,18 @@ static inline int pmd_protnone(pmd_t pmd)
>>   }
>>   #endif
>>   +#if (CONFIG_PGTABLE_LEVELS > 2)
>> +#define pmd_large(pmd)    (pmd_val(pmd) && !(pmd_val(pmd) &
>> PMD_TABLE_BIT))
>> +#else
>> +#define pmd_large(pmd) 0
>> +#endif
>> +
>> +#if (CONFIG_PGTABLE_LEVELS > 3)
>> +#define pud_large(pud)    (pud_val(pud) && !(pud_val(pud) &
>> PUD_TABLE_BIT))
>> +#else
>> +#define pud_large(pud) 0
>> +#endif
> 
> These seem rather different from the versions that Steve is proposing in
> the generic pagewalk series - can you reach an agreement on which
> implementation is preferred?

Indeed this doesn't match the version in my series, although it is quite
similar.

My desire is that p?d_large represents the hardware architectural
definition of large page/huge page/section (pick your naming). Although
now I look more closely this is actually broken in my series (I'll fix
that up and send a new version shortly) - p?d_sect() is similarly
conditional.

Is there a good reason not to use the existing p?d_sect() macros
available on arm64?

I'm also surprised by the CONFIG_PGTABLE_LEVELS conditions as they don't
match the existing conditions for p?d_sect(). Might be worth double
checking it actually does what you expect.
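
For reference, the existing helpers boil down to a type field check, with
pud_sect() hard-wired to 0 for 64K pages or fewer than three translation
levels (paraphrasing arch/arm64/include/asm/pgtable.h):

    /* a block (section) entry has the table bit clear in its type field */
    #define pmd_sect(pmd)	((pmd_val(pmd) & PMD_TYPE_MASK) == PMD_TYPE_SECT)
    #define pud_sect(pud)	((pud_val(pud) & PUD_TYPE_MASK) == PUD_TYPE_SECT)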

Steve

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/6] arm64/mm: Enable ZONE_DEVICE
  2019-04-03  4:30 ` [PATCH 6/6] arm64/mm: Enable ZONE_DEVICE Anshuman Khandual
@ 2019-04-03 13:58   ` Robin Murphy
  2019-04-03 16:07     ` Jerome Glisse
  2019-04-04  4:42     ` Anshuman Khandual
  0 siblings, 2 replies; 43+ messages in thread
From: Robin Murphy @ 2019-04-03 13:58 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-arm-kernel, linux-mm,
	akpm, will.deacon, catalin.marinas
  Cc: mhocko, mgorman, james.morse, mark.rutland, cpandya, arunks,
	dan.j.williams, osalvador, logang, david, cai, dan.j.williams,
	jglisse

[ +Dan, Jerome ]

On 03/04/2019 05:30, Anshuman Khandual wrote:
> Arch implementation for the functions which create or destroy vmemmap
> mappings (vmemmap_populate, vmemmap_free) can comprehend and allocate from
> inside a device memory range through a driver provided vmem_altmap
> structure, which fulfils all requirements to enable ZONE_DEVICE on the
> platform. Hence just

ZONE_DEVICE is about more than just altmap support, no?

> enable ZONE_DEVICE by subscribing to ARCH_HAS_ZONE_DEVICE. But this is
> applicable only for ARM64_4K_PAGES (ARM64_SWAPPER_USES_SECTION_MAPS), which
> creates vmemmap section mappings and utilizes the vmem_altmap structure.

What prevents it from working with other page sizes? One of the foremost 
use-cases for our 52-bit VA/PA support is to enable mapping large 
quantities of persistent memory, so we really do need this for 64K pages 
too. FWIW, it appears not to be an issue for PowerPC.

> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---
>   arch/arm64/Kconfig | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index db3e625..b5d8cf5 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -31,6 +31,7 @@ config ARM64
>   	select ARCH_HAS_SYSCALL_WRAPPER
>   	select ARCH_HAS_TEARDOWN_DMA_OPS if IOMMU_SUPPORT
>   	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
> +	select ARCH_HAS_ZONE_DEVICE if ARM64_4K_PAGES

IIRC certain configurations (HMM?) don't even build if you just turn 
this on alone (although of course things may have changed elsewhere in 
the meantime) - crucially, though, from previous discussions[1] it seems 
fundamentally unsafe, since I don't think we can guarantee that nobody 
will touch the corners of ZONE_DEVICE that also require pte_devmap in 
order not to go subtly wrong. I did get as far as cooking up some 
patches to sort that out [2][3] which I never got round to posting for 
their own sake, so please consider picking those up as part of this series.

Robin.

>   	select ARCH_HAVE_NMI_SAFE_CMPXCHG
>   	select ARCH_INLINE_READ_LOCK if !PREEMPT
>   	select ARCH_INLINE_READ_LOCK_BH if !PREEMPT
> 


[1] 
https://lore.kernel.org/linux-mm/CAA9_cmfA9GS+1M1aSyv1ty5jKY3iho3CERhnRAruWJW3PfmpgA@mail.gmail.com/#t
[2] 
http://linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=61816b833afdb56b49c2e58f5289ae18809e5d67
[3] 
http://linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=a5a16560eb1becf9a1d4cc0d03d6b5e76da4f4e1
(apologies to anyone if the linux-arm.org server is being flaky as usual 
and requires a few tries to respond properly)

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/6] arm64/mm: Enable ZONE_DEVICE
  2019-04-03 13:58   ` Robin Murphy
@ 2019-04-03 16:07     ` Jerome Glisse
  2019-04-04  5:03       ` Anshuman Khandual
  2019-04-04  4:42     ` Anshuman Khandual
  1 sibling, 1 reply; 43+ messages in thread
From: Jerome Glisse @ 2019-04-03 16:07 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Anshuman Khandual, linux-kernel, linux-arm-kernel, linux-mm,
	akpm, will.deacon, catalin.marinas, mhocko, mgorman, james.morse,
	mark.rutland, cpandya, arunks, dan.j.williams, osalvador, logang,
	david, cai

On Wed, Apr 03, 2019 at 02:58:28PM +0100, Robin Murphy wrote:
> [ +Dan, Jerome ]
> 
> On 03/04/2019 05:30, Anshuman Khandual wrote:
> > Arch implementation for the functions which create or destroy vmemmap
> > mappings (vmemmap_populate, vmemmap_free) can comprehend and allocate from
> > inside a device memory range through a driver provided vmem_altmap
> > structure, which fulfils all requirements to enable ZONE_DEVICE on the
> > platform. Hence just
> 
> ZONE_DEVICE is about more than just altmap support, no?
> 
> > enable ZONE_DEVICE by subscribing to ARCH_HAS_ZONE_DEVICE. But this is
> > applicable only for ARM64_4K_PAGES (ARM64_SWAPPER_USES_SECTION_MAPS), which
> > creates vmemmap section mappings and utilizes the vmem_altmap structure.
> 
> What prevents it from working with other page sizes? One of the foremost
> use-cases for our 52-bit VA/PA support is to enable mapping large quantities
> of persistent memory, so we really do need this for 64K pages too. FWIW, it
> appears not to be an issue for PowerPC.
> 
> > Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> > ---
> >   arch/arm64/Kconfig | 1 +
> >   1 file changed, 1 insertion(+)
> > 
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index db3e625..b5d8cf5 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -31,6 +31,7 @@ config ARM64
> >   	select ARCH_HAS_SYSCALL_WRAPPER
> >   	select ARCH_HAS_TEARDOWN_DMA_OPS if IOMMU_SUPPORT
> >   	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
> > +	select ARCH_HAS_ZONE_DEVICE if ARM64_4K_PAGES
> 
> IIRC certain configurations (HMM?) don't even build if you just turn this on
> alone (although of course things may have changed elsewhere in the meantime)
> - crucially, though, from previous discussions[1] it seems fundamentally
> unsafe, since I don't think we can guarantee that nobody will touch the
> corners of ZONE_DEVICE that also require pte_devmap in order not to go
> subtly wrong. I did get as far as cooking up some patches to sort that out
> [2][3] which I never got round to posting for their own sake, so please
> consider picking those up as part of this series.

Correct, _do not_ enable ZONE_DEVICE without support for pte_devmap detection.
If you want only some feature of ZONE_DEVICE: while DAX does require
pte_devmap, HMM device private does not. So you would first have to split
ZONE_DEVICE into more sub-feature kconfig options.
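
Purely as an illustration (these symbol names are invented, nothing like
them exists today), such a split could look like:

    config ZONE_DEVICE_DRIVER_MANAGED
    	bool "Device memory with struct pages, no userspace mappings"
    	depends on ZONE_DEVICE

    config ZONE_DEVICE_DEVMAP
    	bool "Device memory mappable into user page tables"
    	depends on ZONE_DEVICE
    	# hypothetical: would require the arch to provide pte_devmap()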

What is the end use case you are looking for? Persistent memory?

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/6] arm64/mm: Enable memory hot remove
  2019-04-03  4:30 ` [PATCH 2/6] arm64/mm: Enable memory hot remove Anshuman Khandual
  2019-04-03 12:37   ` Robin Murphy
@ 2019-04-03 17:32   ` Logan Gunthorpe
  2019-04-03 17:57     ` Robin Murphy
  2019-04-04  7:07     ` Anshuman Khandual
  1 sibling, 2 replies; 43+ messages in thread
From: Logan Gunthorpe @ 2019-04-03 17:32 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-arm-kernel, linux-mm,
	akpm, will.deacon, catalin.marinas
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, pasha.tatashin,
	david, cai, Stephen Bates



On 2019-04-02 10:30 p.m., Anshuman Khandual wrote:
> Memory removal from an arch perspective involves tearing down two different
> kernel based mappings i.e. vmemmap and linear, while releasing the related
> page table pages allocated for the physical memory range to be removed.
> 
> Define a common kernel page table tear down helper remove_pagetable() which
> can be used to unmap a given kernel virtual address range. In effect it can
> tear down both vmemmap and kernel linear mappings. This new helper is called
> from both vmemmap_free() and __remove_pgd_mapping() during memory removal.
> The argument 'direct' here identifies kernel linear mappings.
> 
> Page table pages for vmemmap mappings are allocated through sparse mem
> helper functions like vmemmap_alloc_block() which do not cycle the pages
> through pgtable_page_ctor(). Hence while removing, the corresponding
> destructor pgtable_page_dtor() is skipped.
> 
> While here, update arch_add_memory() to handle __add_pages() failures by
> just unmapping the recently added kernel linear mapping. Now enable memory
> hot remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.
> 
> This implementation is overall inspired by the kernel page table tear down
> procedure on the x86 architecture.

I've been working on very similar things for RISC-V. In fact, I'm
currently working on a very similar stripped-down version of
remove_pagetable(). (Though I'm fairly certain I've done a bunch of
stuff wrong.)

Would it be possible to move this work into common code that can be used
by all arches? Seems like, to start, we should be able to support both
arm64 and RISC-V... and maybe even x86 too.

I'd be happy to help integrate and test such functions in RISC-V.

Thanks,

Logan



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/6] arm64/mm: Enable memory hot remove
  2019-04-03 17:32   ` Logan Gunthorpe
@ 2019-04-03 17:57     ` Robin Murphy
  2019-04-04  8:23       ` Anshuman Khandual
  2019-04-04  7:07     ` Anshuman Khandual
  1 sibling, 1 reply; 43+ messages in thread
From: Robin Murphy @ 2019-04-03 17:57 UTC (permalink / raw)
  To: Logan Gunthorpe, Anshuman Khandual, linux-kernel,
	linux-arm-kernel, linux-mm, akpm, will.deacon, catalin.marinas
  Cc: mark.rutland, mhocko, david, cai, pasha.tatashin, Stephen Bates,
	james.morse, cpandya, arunks, dan.j.williams, mgorman, osalvador

On 03/04/2019 18:32, Logan Gunthorpe wrote:
> 
> 
> On 2019-04-02 10:30 p.m., Anshuman Khandual wrote:
>> Memory removal from an arch perspective involves tearing down two different
>> kernel based mappings i.e. vmemmap and linear, while releasing the related
>> page table pages allocated for the physical memory range to be removed.
>>
>> Define a common kernel page table tear down helper remove_pagetable() which
>> can be used to unmap a given kernel virtual address range. In effect it can
>> tear down both vmemmap and kernel linear mappings. This new helper is called
>> from both vmemmap_free() and __remove_pgd_mapping() during memory removal.
>> The argument 'direct' here identifies kernel linear mappings.
>>
>> Page table pages for vmemmap mappings are allocated through sparse mem
>> helper functions like vmemmap_alloc_block() which do not cycle the pages
>> through pgtable_page_ctor(). Hence while removing, the corresponding
>> destructor pgtable_page_dtor() is skipped.
>>
>> While here, update arch_add_memory() to handle __add_pages() failures by
>> just unmapping the recently added kernel linear mapping. Now enable memory
>> hot remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.
>>
>> This implementation is overall inspired by the kernel page table tear down
>> procedure on the x86 architecture.
> 
> I've been working on very similar things for RISC-V. In fact, I'm
> currently working on a very similar stripped-down version of
> remove_pagetable(). (Though I'm fairly certain I've done a bunch of
> stuff wrong.)
> 
> Would it be possible to move this work into common code that can be used
> by all arches? Seems like, to start, we should be able to support both
> arm64 and RISC-V... and maybe even x86 too.
> 
> I'd be happy to help integrate and test such functions in RISC-V.

Indeed, I had hoped we might be able to piggyback off generic code for 
this anyway, given that we have generic pagetable code which knows how 
to free process pagetables, and kernel pagetables are also pagetables.

I did actually hack up such a patch[1], and other than 
p?d_none_or_clear_bad() being loud it does actually appear to function 
OK in terms of withstanding repeated add/remove cycles and not crashing, 
but all the pagetable accounting and other stuff I don't really know 
about mean it's probably not viable without a lot more core work.

Robin.

[1] 
http://linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=75934a2c4f737ad9f26903861108d5b0658e86bb

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/6] arm64/mm: Enable memory hot remove and ZONE_DEVICE
  2019-04-03  4:30 [PATCH 0/6] arm64/mm: Enable memory hot remove and ZONE_DEVICE Anshuman Khandual
                   ` (5 preceding siblings ...)
  2019-04-03  4:30 ` [PATCH 6/6] arm64/mm: Enable ZONE_DEVICE Anshuman Khandual
@ 2019-04-03 18:08 ` Dan Williams
  2019-04-04 13:11   ` Anshuman Khandual
  2019-04-04  9:46 ` [RFC 1/2] mm/vmemmap: Enable vmem_altmap based base page mapping for vmemmap Anshuman Khandual
  7 siblings, 1 reply; 43+ messages in thread
From: Dan Williams @ 2019-04-03 18:08 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Linux Kernel Mailing List, linux-arm-kernel, Linux MM,
	Andrew Morton, Will Deacon, Catalin Marinas, Michal Hocko,
	Mel Gorman, james.morse, Mark Rutland, Robin Murphy, cpandya,
	arunks, osalvador, Logan Gunthorpe, Pavel Tatashin,
	David Hildenbrand, cai

On Tue, Apr 2, 2019 at 9:30 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
> This series enables memory hot remove on arm64, fixes a memblock removal
> ordering problem in generic __remove_memory(), enables sysfs memory probe
> interface on arm64. It also enables ZONE_DEVICE with struct vmem_altmap
> support.
>
> Testing:
>
> Tested hot remove on arm64 for all 4K, 16K, 64K page config options with
> all possible VA_BITS and PGTABLE_LEVELS combinations. Tested ZONE_DEVICE
> with ARM64_4K_PAGES through a dummy driver.
>
> Build tested on non arm64 platforms. I will appreciate if folks can test
> arch_remove_memory() re-ordering in __remove_memory() on other platforms.
>
> Dependency:
>
> V5 series in the thread (https://lkml.org/lkml/2019/2/14/1096) will make
> kernel linear mapping loose pgtable_page_ctor() init. When this happens
> the proposed functions free_pte|pmd|pud_table() in [PATCH 2/6] will have
> to stop calling pgtable_page_dtor().

Hi Anshuman,

I'd be interested to integrate this with the sub-section hotplug
support [1]. Otherwise the padding implementation in libnvdimm can't
be removed unless all ZONE_DEVICE capable archs also agree on the
minimum arch_add_memory() granularity. I'd prefer not to special case
which archs support which granularity, but it unfortunately
complicates what you're trying to achieve.

I think at a minimum we, mm hotplug co-travellers, need to come to a
consensus on whether sub-section support is viable for v5.2 and / or a
pre-requisite for new arch-ZONE_DEVICE implementations.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/6] arm64/mm: Enable ZONE_DEVICE
  2019-04-03 13:58   ` Robin Murphy
  2019-04-03 16:07     ` Jerome Glisse
@ 2019-04-04  4:42     ` Anshuman Khandual
  2019-04-04  5:04       ` Dan Williams
  1 sibling, 1 reply; 43+ messages in thread
From: Anshuman Khandual @ 2019-04-04  4:42 UTC (permalink / raw)
  To: Robin Murphy, linux-kernel, linux-arm-kernel, linux-mm, akpm,
	will.deacon, catalin.marinas
  Cc: mhocko, mgorman, james.morse, mark.rutland, cpandya, arunks,
	dan.j.williams, osalvador, logang, david, cai, jglisse



On 04/03/2019 07:28 PM, Robin Murphy wrote:
> [ +Dan, Jerome ]
> 
> On 03/04/2019 05:30, Anshuman Khandual wrote:
>> Arch implementation for the functions which create or destroy vmemmap
>> mappings (vmemmap_populate, vmemmap_free) can comprehend and allocate from
>> inside a device memory range through a driver provided vmem_altmap
>> structure, which fulfils all requirements to enable ZONE_DEVICE on the
>> platform. Hence just
> 
> ZONE_DEVICE is about more than just altmap support, no?

Hot plugging the memory into a dev->numa_node's ZONE_DEVICE and initializing
the struct pages for it is a standalone and self-contained use case. The
driver may just want to manage the memory itself, but with struct pages kept
either in RAM or in the device memory range through struct vmem_altmap. The
driver may not choose to opt for HMM, FS DAX or P2PDMA (the existing use
cases of ZONE_DEVICE), where it would have to map these pages into a user
page table, which would necessitate support for pte|pmd|pud_devmap.

Though I am still working towards getting HMM, FS DAX and P2PDMA enabled on
arm64, IMHO ZONE_DEVICE is self-contained and can be evaluated on its own.
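
For reference, this is the structure a driver supplies so that the memmap
itself can be carved out of the device memory range (paraphrasing struct
vmem_altmap in include/linux/memremap.h):

    struct vmem_altmap {
    	const unsigned long base_pfn;	/* first pfn of the device range */
    	const unsigned long reserve;	/* pages to skip before the memmap */
    	unsigned long free;		/* pages available to back the memmap */
    	unsigned long align;
    	unsigned long alloc;		/* pages already handed out */
    };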

> 
>> enable ZONE_DEVICE by subscribing to ARCH_HAS_ZONE_DEVICE. But this is
>> applicable only for ARM64_4K_PAGES (ARM64_SWAPPER_USES_SECTION_MAPS), which
>> creates vmemmap section mappings and utilizes the vmem_altmap structure.
> 
> What prevents it from working with other page sizes? One of the foremost use-cases for our 52-bit VA/PA support is to enable mapping large quantities of persistent memory, so we really do need this for 64K pages too. FWIW, it appears not to be an issue for PowerPC.


On !ARM64_4K_PAGES, vmemmap_populate() calls vmemmap_populate_basepages(),
which does not support struct vmem_altmap right now. I was originally
planning to send the vmemmap_populate_basepages() enablement patches
separately but will post them here for review.
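
Roughly, the direction (a sketch of the intent only, not the actual patches)
is to thread the altmap through the base page walker and use
altmap_alloc_block_buf() for the leaf pages when an altmap is provided:

    /* hypothetical signature; mainline takes no altmap argument here */
    int __meminit vmemmap_populate_basepages(unsigned long start,
    					 unsigned long end, int node,
    					 struct vmem_altmap *altmap);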

> 
>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>> ---
>>   arch/arm64/Kconfig | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index db3e625..b5d8cf5 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -31,6 +31,7 @@ config ARM64
>>       select ARCH_HAS_SYSCALL_WRAPPER
>>       select ARCH_HAS_TEARDOWN_DMA_OPS if IOMMU_SUPPORT
>>       select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
>> +    select ARCH_HAS_ZONE_DEVICE if ARM64_4K_PAGES
> 
> IIRC certain configurations (HMM?) don't even build if you just turn this on alone (although of course things may have changed elsewhere in the meantime) - crucially, though, from previous discussions[1] it seems fundamentally unsafe, since I don't think we can guarantee that nobody will touch the corners of ZONE_DEVICE that also require pte_devmap in order not to go subtly wrong. I did get as far as cooking up some patches to sort that out [2][3] which I never got round to posting for their own sake, so please consider picking those up as part of this series.

In the previous discussion mentioned here [1] it sort of indicates that we
cannot have a viable (ARCH_HAS_ZONE_DEVICE=y but !__HAVE_ARCH_PTE_DEVMAP)
combination. I don't understand why! The driver can just hotplug the range
into ZONE_DEVICE, manage the memory itself without mapping it to a user page
table ever. IIUC ZONE_DEVICE should not need user mapped device PFN support.
All the corner case problems discussed previously come in once this new
'device PFN' memory, which is now in ZONE_DEVICE, gets mapped into a user
page table.

> 
> Robin.
> 
>>       select ARCH_HAVE_NMI_SAFE_CMPXCHG
>>       select ARCH_INLINE_READ_LOCK if !PREEMPT
>>       select ARCH_INLINE_READ_LOCK_BH if !PREEMPT
>>
> 
> 
> [1] https://lore.kernel.org/linux-mm/CAA9_cmfA9GS+1M1aSyv1ty5jKY3iho3CERhnRAruWJW3PfmpgA@mail.gmail.com/#t
> [2] http://linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=61816b833afdb56b49c2e58f5289ae18809e5d67
> [3] http://linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=a5a16560eb1becf9a1d4cc0d03d6b5e76da4f4e1
> (apologies to anyone if the linux-arm.org server is being flaky as usual and requires a few tries to respond properly)

I have not evaluated pte_devmap(). Will consider [3] when enabling it. But I
still don't understand why ZONE_DEVICE cannot be enabled and used from a
driver which never requires user mapping or pte|pmd|pud_devmap() support.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/6] arm64/mm: Enable ZONE_DEVICE
  2019-04-03 16:07     ` Jerome Glisse
@ 2019-04-04  5:03       ` Anshuman Khandual
  0 siblings, 0 replies; 43+ messages in thread
From: Anshuman Khandual @ 2019-04-04  5:03 UTC (permalink / raw)
  To: Jerome Glisse, Robin Murphy
  Cc: linux-kernel, linux-arm-kernel, linux-mm, akpm, will.deacon,
	catalin.marinas, mhocko, mgorman, james.morse, mark.rutland,
	cpandya, arunks, dan.j.williams, osalvador, logang, david, cai



On 04/03/2019 09:37 PM, Jerome Glisse wrote:
> On Wed, Apr 03, 2019 at 02:58:28PM +0100, Robin Murphy wrote:
>> [ +Dan, Jerome ]
>>
>> On 03/04/2019 05:30, Anshuman Khandual wrote:
>>> Arch implementation for the functions which create or destroy vmemmap
>>> mappings (vmemmap_populate, vmemmap_free) can comprehend and allocate from
>>> inside a device memory range through a driver provided vmem_altmap
>>> structure, which fulfils all requirements to enable ZONE_DEVICE on the
>>> platform. Hence just
>>
>> ZONE_DEVICE is about more than just altmap support, no?
>>
>>> enable ZONE_DEVICE by subscribing to ARCH_HAS_ZONE_DEVICE. But this is
>>> applicable only for ARM64_4K_PAGES (ARM64_SWAPPER_USES_SECTION_MAPS), which
>>> creates vmemmap section mappings and utilizes the vmem_altmap structure.
>>
>> What prevents it from working with other page sizes? One of the foremost
>> use-cases for our 52-bit VA/PA support is to enable mapping large quantities
>> of persistent memory, so we really do need this for 64K pages too. FWIW, it
>> appears not to be an issue for PowerPC.
>>
>>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>>> ---
>>>   arch/arm64/Kconfig | 1 +
>>>   1 file changed, 1 insertion(+)
>>>
>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>> index db3e625..b5d8cf5 100644
>>> --- a/arch/arm64/Kconfig
>>> +++ b/arch/arm64/Kconfig
>>> @@ -31,6 +31,7 @@ config ARM64
>>>   	select ARCH_HAS_SYSCALL_WRAPPER
>>>   	select ARCH_HAS_TEARDOWN_DMA_OPS if IOMMU_SUPPORT
>>>   	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
>>> +	select ARCH_HAS_ZONE_DEVICE if ARM64_4K_PAGES
>>
>> IIRC certain configurations (HMM?) don't even build if you just turn this on
>> alone (although of course things may have changed elsewhere in the meantime)
>> - crucially, though, from previous discussions[1] it seems fundamentally
>> unsafe, since I don't think we can guarantee that nobody will touch the
>> corners of ZONE_DEVICE that also require pte_devmap in order not to go
>> subtly wrong. I did get as far as cooking up some patches to sort that out
>> [2][3] which I never got round to posting for their own sake, so please
>> consider picking those up as part of this series.
> 
> Correct, _do not_ enable ZONE_DEVICE without support for pte_devmap detection.

Is driver managed ZONE_DEVICE memory which never maps into a user page table
not a valid use case for ZONE_DEVICE? Also what about MEMORY_DEVICE_PRIVATE?
That can never be mapped into a user page table. A driver can still manage
this non-coherent memory through its struct pages (which will be allocated
inside RAM).

> If you want only some feature of ZONE_DEVICE: while DAX does require
> pte_devmap, HMM device private does not. So you would first have to split
> ZONE_DEVICE into more sub-feature kconfig options.

CONFIG_ZONE_DEVICE does not do that already! All it says is that a device
memory range can be plugged into ZONE_DEVICE either as PRIVATE (non-coherent)
or PUBLIC/PCI_P2PDMA (coherent) memory, without mandating anything about how
this memory will be further used.
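
The existing sub-types are already spelled out per pagemap (paraphrasing
enum memory_type in include/linux/memremap.h):

    enum memory_type {
    	MEMORY_DEVICE_PRIVATE = 1,	/* CPU-unaddressable, e.g. HMM */
    	MEMORY_DEVICE_PUBLIC,		/* CPU-addressable, coherent */
    	MEMORY_DEVICE_FS_DAX,
    	MEMORY_DEVICE_PCI_P2PDMA,
    };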

> 
> What is the end use case you are looking for? Persistent memory?

Persistent memory is one of the primary use cases.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/6] arm64/mm: Enable ZONE_DEVICE
  2019-04-04  4:42     ` Anshuman Khandual
@ 2019-04-04  5:04       ` Dan Williams
  2019-04-04  9:46         ` Robin Murphy
  0 siblings, 1 reply; 43+ messages in thread
From: Dan Williams @ 2019-04-04  5:04 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Robin Murphy, Linux Kernel Mailing List, linux-arm-kernel,
	Linux MM, Andrew Morton, Will Deacon, Catalin Marinas,
	Michal Hocko, Mel Gorman, james.morse, Mark Rutland, cpandya,
	arunks, osalvador, Logan Gunthorpe, David Hildenbrand, cai,
	Jérôme Glisse

On Wed, Apr 3, 2019 at 9:42 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
>
>
> On 04/03/2019 07:28 PM, Robin Murphy wrote:
> > [ +Dan, Jerome ]
> >
> > On 03/04/2019 05:30, Anshuman Khandual wrote:
> >> Arch implementation for the functions which create or destroy vmemmap
> >> mappings (vmemmap_populate, vmemmap_free) can comprehend and allocate from
> >> inside a device memory range through a driver provided vmem_altmap
> >> structure, which fulfils all requirements to enable ZONE_DEVICE on the
> >> platform. Hence just
> >
> > ZONE_DEVICE is about more than just altmap support, no?
>
> Hot plugging the memory into a dev->numa_node's ZONE_DEVICE and initializing
> the struct pages for it is a standalone and self-contained use case. The
> driver may just want to manage the memory itself, but with struct pages kept
> either in RAM or in the device memory range through struct vmem_altmap. The
> driver may not choose to opt for HMM, FS DAX or P2PDMA (the existing use
> cases of ZONE_DEVICE), where it would have to map these pages into a user
> page table, which would necessitate support for pte|pmd|pud_devmap.

What's left for ZONE_DEVICE if none of the above cases are used?

> Though I am still working towards getting HMM, FS DAX and P2PDMA enabled on
> arm64, IMHO ZONE_DEVICE is self-contained and can be evaluated on its own.

I'm not convinced. What's the specific use case?

>
> >
> >> enable ZONE_DEVICE by subscribing to ARCH_HAS_ZONE_DEVICE. But this is
> >> applicable only for ARM64_4K_PAGES (ARM64_SWAPPER_USES_SECTION_MAPS), which
> >> creates vmemmap section mappings and utilizes the vmem_altmap structure.
> >
> > What prevents it from working with other page sizes? One of the foremost use-cases for our 52-bit VA/PA support is to enable mapping large quantities of persistent memory, so we really do need this for 64K pages too. FWIW, it appears not to be an issue for PowerPC.
>
>
> On !ARM64_4K_PAGES, vmemmap_populate() calls vmemmap_populate_basepages(),
> which does not support struct vmem_altmap right now. I was originally
> planning to send the vmemmap_populate_basepages() enablement patches
> separately but will post them here for review.
>
> >
> >> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> >> ---
> >>   arch/arm64/Kconfig | 1 +
> >>   1 file changed, 1 insertion(+)
> >>
> >> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> >> index db3e625..b5d8cf5 100644
> >> --- a/arch/arm64/Kconfig
> >> +++ b/arch/arm64/Kconfig
> >> @@ -31,6 +31,7 @@ config ARM64
> >>       select ARCH_HAS_SYSCALL_WRAPPER
> >>       select ARCH_HAS_TEARDOWN_DMA_OPS if IOMMU_SUPPORT
> >>       select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
> >> +    select ARCH_HAS_ZONE_DEVICE if ARM64_4K_PAGES
> >
> > IIRC certain configurations (HMM?) don't even build if you just turn this on alone (although of course things may have changed elsewhere in the meantime) - crucially, though, from previous discussions[1] it seems fundamentally unsafe, since I don't think we can guarantee that nobody will touch the corners of ZONE_DEVICE that also require pte_devmap in order not to go subtly wrong. I did get as far as cooking up some patches to sort that out [2][3] which I never got round to posting for their own sake, so please consider picking those up as part of this series.
>
> In the previous discussion mentioned here [1] it sort of indicates that we
> cannot have a viable (ARCH_HAS_ZONE_DEVICE=y but !__HAVE_ARCH_PTE_DEVMAP)
> combination. I don't understand why!

Because ZONE_DEVICE was specifically invented to solve get_user_pages() for DAX.
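
For example, the gup fast path will only pin a devmap page if it can also
take a reference on the hosting dev_pagemap, and pte_devmap() is what gates
that (paraphrasing gup_pte_range() in mm/gup.c):

	if (pte_devmap(pte)) {
		pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
		if (unlikely(!pgmap)) {
			undo_dev_pagemap(nr, nr_start, pages);
			goto pte_unmap;
		}
	}

Without __HAVE_ARCH_PTE_DEVMAP that check compiles away, and this path has
no way to tell a device page apart from ordinary memory.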

> The driver can just hotplug the range into ZONE_DEVICE, manage the memory
> itself without mapping it to a user page table ever.

Then why do you even need 'struct page' objects?

> IIUC ZONE_DEVICE should not need user mapped device PFN support.

No, you don't understand correctly, or I don't understand how you plan
to use ZONE_DEVICE outside its intended use case.

> All the corner case problems discussed previously come in once this new
> 'device PFN' memory, which is now in ZONE_DEVICE, gets mapped into a user
> page table.
>
> >
> > Robin.
> >
> >>       select ARCH_HAVE_NMI_SAFE_CMPXCHG
> >>       select ARCH_INLINE_READ_LOCK if !PREEMPT
> >>       select ARCH_INLINE_READ_LOCK_BH if !PREEMPT
> >>
> >
> >
> > [1] https://lore.kernel.org/linux-mm/CAA9_cmfA9GS+1M1aSyv1ty5jKY3iho3CERhnRAruWJW3PfmpgA@mail.gmail.com/#t
> > [2] http://linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=61816b833afdb56b49c2e58f5289ae18809e5d67
> > [3] http://linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=a5a16560eb1becf9a1d4cc0d03d6b5e76da4f4e1
> > (apologies to anyone if the linux-arm.org server is being flaky as usual and requires a few tries to respond properly)
>
> I have not evaluated pte_devmap(). Will consider [3] when enabling it. But I
> still don't understand why ZONE_DEVICE cannot be enabled and used from a
> driver which never requires user mapping or pte|pmd|pud_devmap() support.

Because there are mm paths that make assumptions about ZONE_DEVICE
that your use case might violate.
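
put_page() is one example: refcounts on devmap managed pages get diverted to
the pagemap's release machinery (paraphrasing include/linux/mm.h):

	static inline void put_page(struct page *page)
	{
		page = compound_head(page);
		/* devmap managed pages are released via their pgmap */
		if (put_devmap_managed_page(page))
			return;
		if (put_page_testzero(page))
			__put_page(page);
	}

so ZONE_DEVICE pages see different behavior in core mm whether your driver
asked for it or not.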

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/6] arm64/mm: Enable sysfs based memory hot add interface
  2019-04-03 13:12     ` Robin Murphy
@ 2019-04-04  5:21       ` Anshuman Khandual
  0 siblings, 0 replies; 43+ messages in thread
From: Anshuman Khandual @ 2019-04-04  5:21 UTC (permalink / raw)
  To: Robin Murphy, David Hildenbrand, linux-kernel, linux-arm-kernel,
	linux-mm, akpm, will.deacon, catalin.marinas
  Cc: mhocko, mgorman, james.morse, mark.rutland, cpandya, arunks,
	dan.j.williams, osalvador, logang, cai



On 04/03/2019 06:42 PM, Robin Murphy wrote:
> On 03/04/2019 09:20, David Hildenbrand wrote:
>> On 03.04.19 06:30, Anshuman Khandual wrote:
>>> Sysfs memory probe interface (/sys/devices/system/memory/probe) can accept
>>> starting physical address of an entire memory block to be hot added into
>>> the kernel. This is in addition to the existing ACPI based interface. This
>>> just enables it with the required config CONFIG_ARCH_MEMORY_PROBE.
>>>
>>
>> We recently discussed that the similar interface for removal should
>> rather be moved to a debug/test module
>>
>> I wonder if we should try to do the same for the sysfs probing
>> interface. Rather try to get rid of it than open the doors for more users.
> 
> Agreed - if this option even exists in a released kernel, there's a risk that distros will turn it on for the sake of it, and at that point arm64 is stuck carrying the same ABI baggage as well.

True, but only if we really don't like that interface.

> 
> If users turn up in future with a desperate and unavoidable need for the legacy half-an-API on arm64, we can always reconsider adding it at that point. It was very much deliberate that my original hot-add support did not include a patch like this one.

Sure. Will drop this one next time around.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/6] arm64/mm: Enable sysfs based memory hot add interface
  2019-04-03  8:20   ` David Hildenbrand
  2019-04-03 13:12     ` Robin Murphy
@ 2019-04-04  5:25     ` Anshuman Khandual
  2019-04-04  8:49       ` David Hildenbrand
  1 sibling, 1 reply; 43+ messages in thread
From: Anshuman Khandual @ 2019-04-04  5:25 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel, linux-arm-kernel, linux-mm,
	akpm, will.deacon, catalin.marinas
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, logang,
	pasha.tatashin, cai



On 04/03/2019 01:50 PM, David Hildenbrand wrote:
> On 03.04.19 06:30, Anshuman Khandual wrote:
>> Sysfs memory probe interface (/sys/devices/system/memory/probe) can accept
>> starting physical address of an entire memory block to be hot added into
>> the kernel. This is in addition to the existing ACPI based interface. This
>> just enables it with the required config CONFIG_ARCH_MEMORY_PROBE.
>>
> We recently discussed that the similar interface for removal should
> rather be moved to a debug/test module.

Can we maintain such a debug/test module in mainline and enable it when
required? Or we could have both the add and remove interfaces under
/sys/kernel/debug/ just for testing purposes.

> 
> I wonder if we should try to do the same for the sysfs probing
> interface. Rather try to get rid of it than open the doors for more users.
> 

I understand your concern. Will drop this patch.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/6] arm64/mm: Enable memory hot remove
  2019-04-03 12:37   ` Robin Murphy
  2019-04-03 13:15     ` Steven Price
@ 2019-04-04  5:39     ` Anshuman Khandual
  2019-04-04 11:58       ` Oscar Salvador
  1 sibling, 1 reply; 43+ messages in thread
From: Anshuman Khandual @ 2019-04-04  5:39 UTC (permalink / raw)
  To: Robin Murphy, linux-kernel, linux-arm-kernel, linux-mm, akpm,
	will.deacon, catalin.marinas
  Cc: mhocko, mgorman, james.morse, mark.rutland, cpandya, arunks,
	dan.j.williams, osalvador, logang, pasha.tatashin, david, cai,
	Steven Price



On 04/03/2019 06:07 PM, Robin Murphy wrote:
> [ +Steve ]
> 
> Hi Anshuman,
> 
> On 03/04/2019 05:30, Anshuman Khandual wrote:
>> Memory removal from an arch perspective involves tearing down two different
>> kernel based mappings i.e. vmemmap and linear, while releasing the related
>> page table pages allocated for the physical memory range to be removed.
>>
>> Define a common kernel page table tear down helper remove_pagetable() which
>> can be used to unmap a given kernel virtual address range. In effect it can
>> tear down both vmemmap and kernel linear mappings. This new helper is called
>> from both vmemmap_free() and __remove_pgd_mapping() during memory removal.
>> The argument 'direct' here identifies kernel linear mappings.
>>
>> Page table pages for vmemmap mappings are allocated through sparse mem
>> helper functions like vmemmap_alloc_block() which do not cycle the pages
>> through pgtable_page_ctor(). Hence while removing, the corresponding
>> destructor pgtable_page_dtor() is skipped.
>>
>> While here, update arch_add_memory() to handle __add_pages() failures by
>> just unmapping the recently added kernel linear mapping. Now enable memory
>> hot remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.
>>
>> This implementation is overall inspired by the kernel page table tear down
>> procedure on the x86 architecture.
> 
> A bit of a nit, but since this depends on at least patch #4 to work properly, it would be good to reorder the series appropriately.

Sure, will move the generic changes up front.

>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>> ---
>>   arch/arm64/Kconfig               |   3 +
>>   arch/arm64/include/asm/pgtable.h |  14 +++
>>   arch/arm64/mm/mmu.c              | 227 ++++++++++++++++++++++++++++++++++++++-
>>   3 files changed, 241 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index a2418fb..db3e625 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -266,6 +266,9 @@ config HAVE_GENERIC_GUP
>>   config ARCH_ENABLE_MEMORY_HOTPLUG
>>       def_bool y
>>   +config ARCH_ENABLE_MEMORY_HOTREMOVE
>> +    def_bool y
>> +
>>   config ARCH_MEMORY_PROBE
>>       bool "Enable /sys/devices/system/memory/probe interface"
>>       depends on MEMORY_HOTPLUG
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index de70c1e..858098e 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -355,6 +355,18 @@ static inline int pmd_protnone(pmd_t pmd)
>>   }
>>   #endif
>>   +#if (CONFIG_PGTABLE_LEVELS > 2)
>> +#define pmd_large(pmd)    (pmd_val(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT))
>> +#else
>> +#define pmd_large(pmd) 0
>> +#endif
>> +
>> +#if (CONFIG_PGTABLE_LEVELS > 3)
>> +#define pud_large(pud)    (pud_val(pud) && !(pud_val(pud) & PUD_TABLE_BIT))
>> +#else
>> +#define pud_large(pud) 0
>> +#endif
> 
> These seem rather different from the versions that Steve is proposing in the generic pagewalk series - can you reach an agreement on which implementation is preferred?

Sure, will take a look.

> 
>> +
>>   /*
>>    * THP definitions.
>>    */
>> @@ -555,6 +567,7 @@ static inline phys_addr_t pud_page_paddr(pud_t pud)
>>     #else
>>   +#define pmd_index(addr) 0
>>   #define pud_page_paddr(pud)    ({ BUILD_BUG(); 0; })
>>     /* Match pmd_offset folding in <asm/generic/pgtable-nopmd.h> */
>> @@ -612,6 +625,7 @@ static inline phys_addr_t pgd_page_paddr(pgd_t pgd)
>>     #else
>>   +#define pud_index(addr)    0
>>   #define pgd_page_paddr(pgd)    ({ BUILD_BUG(); 0;})
>>     /* Match pud_offset folding in <asm/generic/pgtable-nopud.h> */
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index e97f018..ae0777b 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -714,6 +714,198 @@ int kern_addr_valid(unsigned long addr)
>>         return pfn_valid(pte_pfn(pte));
>>   }
>> +
>> +#ifdef CONFIG_MEMORY_HOTPLUG
>> +static void __meminit free_pagetable(struct page *page, int order)
> 
> Do these need to be __meminit? AFAICS it's effectively redundant with the containing #ifdef, and removal feels like it's inherently a later-than-init thing anyway.

I was a bit confused here, but even x86 does exactly the same. All these
functions are still labeled __meminit and all are wrapped under
CONFIG_MEMORY_HOTPLUG. Is there any concern with having __meminit here?

> 
>> +{
>> +    unsigned long magic;
>> +    unsigned int nr_pages = 1 << order;
>> +
>> +    if (PageReserved(page)) {
>> +        __ClearPageReserved(page);
>> +
>> +        magic = (unsigned long)page->freelist;
>> +        if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
>> +            while (nr_pages--)
>> +                put_page_bootmem(page++);
>> +        } else
>> +            while (nr_pages--)
>> +                free_reserved_page(page++);
>> +    } else
>> +        free_pages((unsigned long)page_address(page), order);
>> +}
>> +
>> +#if (CONFIG_PGTABLE_LEVELS > 2)
>> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd, bool direct)
>> +{
>> +    pte_t *pte;
>> +    int i;
>> +
>> +    for (i = 0; i < PTRS_PER_PTE; i++) {
>> +        pte = pte_start + i;
>> +        if (!pte_none(*pte))
>> +            return;
>> +    }
>> +
>> +    if (direct)
>> +        pgtable_page_dtor(pmd_page(*pmd));
>> +    free_pagetable(pmd_page(*pmd), 0);
>> +    spin_lock(&init_mm.page_table_lock);
>> +    pmd_clear(pmd);
>> +    spin_unlock(&init_mm.page_table_lock);
>> +}
>> +#else
>> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd, bool direct)
>> +{
>> +}
>> +#endif
>> +
>> +#if (CONFIG_PGTABLE_LEVELS > 3)
>> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud, bool direct)
>> +{
>> +    pmd_t *pmd;
>> +    int i;
>> +
>> +    for (i = 0; i < PTRS_PER_PMD; i++) {
>> +        pmd = pmd_start + i;
>> +        if (!pmd_none(*pmd))
>> +            return;
>> +    }
>> +
>> +    if (direct)
>> +        pgtable_page_dtor(pud_page(*pud));
>> +    free_pagetable(pud_page(*pud), 0);
>> +    spin_lock(&init_mm.page_table_lock);
>> +    pud_clear(pud);
>> +    spin_unlock(&init_mm.page_table_lock);
>> +}
>> +
>> +static void __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd, bool direct)
>> +{
>> +    pud_t *pud;
>> +    int i;
>> +
>> +    for (i = 0; i < PTRS_PER_PUD; i++) {
>> +        pud = pud_start + i;
>> +        if (!pud_none(*pud))
>> +            return;
>> +    }
>> +
>> +    if (direct)
>> +        pgtable_page_dtor(pgd_page(*pgd));
>> +    free_pagetable(pgd_page(*pgd), 0);
>> +    spin_lock(&init_mm.page_table_lock);
>> +    pgd_clear(pgd);
>> +    spin_unlock(&init_mm.page_table_lock);
>> +}
>> +#else
>> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud, bool direct)
>> +{
>> +}
>> +
>> +static void __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd, bool direct)
>> +{
>> +}
>> +#endif
>> +
>> +static void __meminit
>> +remove_pte_table(pte_t *pte_start, unsigned long addr,
>> +            unsigned long end, bool direct)
>> +{
>> +    pte_t *pte;
>> +
>> +    pte = pte_start + pte_index(addr);
>> +    for (; addr < end; addr += PAGE_SIZE, pte++) {
>> +        if (!pte_present(*pte))
>> +            continue;
>> +
>> +        if (!direct)
>> +            free_pagetable(pte_page(*pte), 0);
>> +        spin_lock(&init_mm.page_table_lock);
>> +        pte_clear(&init_mm, addr, pte);
>> +        spin_unlock(&init_mm.page_table_lock);
>> +    }
>> +}
>> +
>> +static void __meminit
>> +remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
>> +            unsigned long end, bool direct)
>> +{
>> +    unsigned long next;
>> +    pte_t *pte_base;
>> +    pmd_t *pmd;
>> +
>> +    pmd = pmd_start + pmd_index(addr);
>> +    for (; addr < end; addr = next, pmd++) {
>> +        next = pmd_addr_end(addr, end);
>> +        if (!pmd_present(*pmd))
>> +            continue;
>> +
>> +        if (pmd_large(*pmd)) {
>> +            if (!direct)
>> +                free_pagetable(pmd_page(*pmd),
>> +                        get_order(PMD_SIZE));
>> +            spin_lock(&init_mm.page_table_lock);
>> +            pmd_clear(pmd);
>> +            spin_unlock(&init_mm.page_table_lock);
>> +            continue;
>> +        }
>> +        pte_base = pte_offset_kernel(pmd, 0UL);
>> +        remove_pte_table(pte_base, addr, next, direct);
>> +        free_pte_table(pte_base, pmd, direct);
>> +    }
>> +}
>> +
>> +static void __meminit
>> +remove_pud_table(pud_t *pud_start, unsigned long addr,
>> +            unsigned long end, bool direct)
>> +{
>> +    unsigned long next;
>> +    pmd_t *pmd_base;
>> +    pud_t *pud;
>> +
>> +    pud = pud_start + pud_index(addr);
>> +    for (; addr < end; addr = next, pud++) {
>> +        next = pud_addr_end(addr, end);
>> +        if (!pud_present(*pud))
>> +            continue;
>> +
>> +        if (pud_large(*pud)) {
>> +            if (!direct)
>> +                free_pagetable(pud_page(*pud),
>> +                        get_order(PUD_SIZE));
>> +            spin_lock(&init_mm.page_table_lock);
>> +            pud_clear(pud);
>> +            spin_unlock(&init_mm.page_table_lock);
>> +            continue;
>> +        }
>> +        pmd_base = pmd_offset(pud, 0UL);
>> +        remove_pmd_table(pmd_base, addr, next, direct);
>> +        free_pmd_table(pmd_base, pud, direct);
>> +    }
>> +}
>> +
>> +static void __meminit
>> +remove_pagetable(unsigned long start, unsigned long end, bool direct)
>> +{
>> +    unsigned long addr, next;
>> +    pud_t *pud_base;
>> +    pgd_t *pgd;
>> +
>> +    for (addr = start; addr < end; addr = next) {
>> +        next = pgd_addr_end(addr, end);
>> +        pgd = pgd_offset_k(addr);
>> +        if (!pgd_present(*pgd))
>> +            continue;
>> +
>> +        pud_base = pud_offset(pgd, 0UL);
>> +        remove_pud_table(pud_base, addr, next, direct);
>> +        free_pud_table(pud_base, pgd, direct);
>> +    }
>> +    flush_tlb_kernel_range(start, end);
>> +}
>> +#endif
>> +
>>   #ifdef CONFIG_SPARSEMEM_VMEMMAP
>>   #if !ARM64_SWAPPER_USES_SECTION_MAPS
>>   int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>> @@ -758,9 +950,12 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>       return 0;
>>   }
>>   #endif    /* CONFIG_ARM64_64K_PAGES */
>> -void vmemmap_free(unsigned long start, unsigned long end,
>> +void __ref vmemmap_free(unsigned long start, unsigned long end,
> 
> Why is the __ref needed? Presumably it's avoidable by addressing the __meminit thing above.

Right.

> 
>>           struct vmem_altmap *altmap)
>>   {
>> +#ifdef CONFIG_MEMORY_HOTPLUG
>> +    remove_pagetable(start, end, false);
>> +#endif
>>   }
>>   #endif    /* CONFIG_SPARSEMEM_VMEMMAP */
>>   @@ -1046,10 +1241,16 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
>>   }
>>
>>   #ifdef CONFIG_MEMORY_HOTPLUG
>> +static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
>> +{
>> +    WARN_ON(pgdir != init_mm.pgd);
>> +    remove_pagetable(start, start + size, true);
>> +}
>> +
>>   int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
>>               bool want_memblock)
>>   {
>> -    int flags = 0;
>> +    int flags = 0, ret = 0;
> 
> Initialising ret here is unnecessary.

Sure. Will change.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/6] arm64/mm: Enable memory hot remove
  2019-04-03 13:15     ` Steven Price
@ 2019-04-04  6:51       ` Anshuman Khandual
  0 siblings, 0 replies; 43+ messages in thread
From: Anshuman Khandual @ 2019-04-04  6:51 UTC (permalink / raw)
  To: Steven Price, Robin Murphy, linux-kernel, linux-arm-kernel,
	linux-mm, akpm, will.deacon, catalin.marinas
  Cc: mark.rutland, mhocko, david, logang, cai, pasha.tatashin,
	james.morse, cpandya, arunks, dan.j.williams, mgorman, osalvador



On 04/03/2019 06:45 PM, Steven Price wrote:
> On 03/04/2019 13:37, Robin Murphy wrote:
>> [ +Steve ]
>>
>> Hi Anshuman,

Hi Steve,

>>
>> On 03/04/2019 05:30, Anshuman Khandual wrote:
> 
> <snip>
> 
>>> diff --git a/arch/arm64/include/asm/pgtable.h
>>> b/arch/arm64/include/asm/pgtable.h
>>> index de70c1e..858098e 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -355,6 +355,18 @@ static inline int pmd_protnone(pmd_t pmd)
>>>   }
>>>   #endif
>>>   +#if (CONFIG_PGTABLE_LEVELS > 2)
>>> +#define pmd_large(pmd)    (pmd_val(pmd) && !(pmd_val(pmd) &
>>> PMD_TABLE_BIT))
>>> +#else
>>> +#define pmd_large(pmd) 0
>>> +#endif
>>> +
>>> +#if (CONFIG_PGTABLE_LEVELS > 3)
>>> +#define pud_large(pud)    (pud_val(pud) && !(pud_val(pud) &
>>> PUD_TABLE_BIT))
>>> +#else
>>> +#define pud_large(pud) 0
>>> +#endif
>>
>> These seem rather different from the versions that Steve is proposing in
>> the generic pagewalk series - can you reach an agreement on which
>> implementation is preferred?
> 
> Indeed this doesn't match the version in my series although is quite
> similar.
> 
> My desire is that p?d_large represents the hardware architectural
> definition of large page/huge page/section (pick your naming). Although
> now I look more closely this is actually broken in my series (I'll fix
> that up and send a new version shortly) - p?d_sect() is similarly
> conditional.
> 
> Is there a good reason not to use the existing p?d_sect() macros
> available on arm64?

Nothing specific. I have now tried using pud|pmd_sect(), which looks good on
multiple configs for 4K/16K/64K. I will migrate pmd|pud_large() to the more
arch-specific pmd|pud_sect(), which would also help stay clear of your
series.

> 
> I'm also surprised by the CONFIG_PGTABLE_LEVEL conditions as they don't
> match the existing conditions for p?d_sect(). Might be worth double
> checking it actually does what you expect.

Right, they are a bit different. I will surely check. But if pmd|pud_sect()
works out okay I will probably go with it, as it has been there for some time.
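
For reference, the existing arm64 checks being discussed look roughly like
this (paraphrased from arch/arm64/include/asm/pgtable.h of that era; the
exact config guards vary):

	#define pmd_sect(pmd)	((pmd_val(pmd) & PMD_TYPE_MASK) == \
				 PMD_TYPE_SECT)

	#if defined(CONFIG_ARM64_64K_PAGES) || CONFIG_PGTABLE_LEVELS < 3
	#define pud_sect(pud)	(0)	/* no section maps at PUD level */
	#else
	#define pud_sect(pud)	((pud_val(pud) & PUD_TYPE_MASK) == \
				 PUD_TYPE_SECT)
	#endif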

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/6] arm64/mm: Enable memory hot remove
  2019-04-03 17:32   ` Logan Gunthorpe
  2019-04-03 17:57     ` Robin Murphy
@ 2019-04-04  7:07     ` Anshuman Khandual
  2019-04-04  9:16       ` Steven Price
  1 sibling, 1 reply; 43+ messages in thread
From: Anshuman Khandual @ 2019-04-04  7:07 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-arm-kernel, linux-mm, akpm,
	will.deacon, catalin.marinas
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, pasha.tatashin,
	david, cai, Stephen Bates



On 04/03/2019 11:02 PM, Logan Gunthorpe wrote:
> 
> 
> On 2019-04-02 10:30 p.m., Anshuman Khandual wrote:
>> Memory removal from an arch perspective involves tearing down two different
>> kernel-based mappings, i.e. vmemmap and linear, while releasing the related
>> page table pages allocated for the physical memory range being removed.
>>
>> Define a common kernel page table tear-down helper remove_pagetable() which
>> can be used to unmap a given kernel virtual address range. In effect it can
>> tear down both vmemmap and kernel linear mappings. This new helper is called
>> from both vmemmap_free() and ___remove_pgd_mapping() during memory removal.
>> The argument 'direct' here identifies kernel linear mappings.
>>
>> Page table pages for vmemmap mappings are allocated through sparse mem
>> helper functions like vmemmap_alloc_block(), which do not cycle the pages
>> through pgtable_page_ctor(). Hence the removal path skips the corresponding
>> destructor, pgtable_page_dtor(), for them.
>>
>> While here, update arch_add_memory() to handle __add_pages() failures by
>> just unmapping the recently added kernel linear mapping. Now enable memory
>> hot remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.
>>
>> This implementation is overall inspired by the kernel page table tear-down
>> procedure on the x86 architecture.
> 
> I've been working on very similar things for RISC-V. In fact, I'm
> currently in progress on a very similar stripped down version of
> remove_pagetable(). (Though I'm fairly certain I've done a bunch of
> stuff wrong.)
> 
> Would it be possible to move this work into common code that can be used
> by all arches? Seems like, to start, we should be able to support both
> arm64 and RISC-V... and maybe even x86 too.
> 
> I'd be happy to help integrate and test such functions in RISC-V.

Sure, that will be great. The only impediment is pgtable_page_ctor() for the
kernel linear mapping. This series is based on current arm64, where linear
mapping pgtable pages go through the pgtable_page_ctor() init sequence, but
that might be changing soon. If RISC-V does not have pgtable_page_ctor() init
for the linear mapping, and there is no other arch-specific stuff later on, we
can try to consolidate remove_pagetable() at least for both architectures.

Then I am wondering whether I can transition pud|pmd_large() to pud|pmd_sect().

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/6] arm64/mm: Enable memory hot remove
  2019-04-03 17:57     ` Robin Murphy
@ 2019-04-04  8:23       ` Anshuman Khandual
  0 siblings, 0 replies; 43+ messages in thread
From: Anshuman Khandual @ 2019-04-04  8:23 UTC (permalink / raw)
  To: Robin Murphy, Logan Gunthorpe, linux-kernel, linux-arm-kernel,
	linux-mm, akpm, will.deacon, catalin.marinas
  Cc: mark.rutland, mhocko, david, cai, pasha.tatashin, Stephen Bates,
	james.morse, cpandya, arunks, dan.j.williams, mgorman, osalvador



On 04/03/2019 11:27 PM, Robin Murphy wrote:
> On 03/04/2019 18:32, Logan Gunthorpe wrote:
>>
>>
>> On 2019-04-02 10:30 p.m., Anshuman Khandual wrote:
>>> Memory removal from an arch perspective involves tearing down two different
>>> kernel-based mappings, i.e. vmemmap and linear, while releasing the related
>>> page table pages allocated for the physical memory range being removed.
>>>
>>> Define a common kernel page table tear-down helper remove_pagetable() which
>>> can be used to unmap a given kernel virtual address range. In effect it can
>>> tear down both vmemmap and kernel linear mappings. This new helper is called
>>> from both vmemmap_free() and ___remove_pgd_mapping() during memory removal.
>>> The argument 'direct' here identifies kernel linear mappings.
>>>
>>> Page table pages for vmemmap mappings are allocated through sparse mem
>>> helper functions like vmemmap_alloc_block(), which do not cycle the pages
>>> through pgtable_page_ctor(). Hence the removal path skips the corresponding
>>> destructor, pgtable_page_dtor(), for them.
>>>
>>> While here, update arch_add_memory() to handle __add_pages() failures by
>>> just unmapping the recently added kernel linear mapping. Now enable memory
>>> hot remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.
>>>
>>> This implementation is overall inspired by the kernel page table tear-down
>>> procedure on the x86 architecture.
>>
>> I've been working on very similar things for RISC-V. In fact, I'm
>> currently in progress on a very similar stripped down version of
>> remove_pagetable(). (Though I'm fairly certain I've done a bunch of
>> stuff wrong.)
>>
>> Would it be possible to move this work into common code that can be used
>> by all arches? Seems like, to start, we should be able to support both
>> arm64 and RISC-V... and maybe even x86 too.
>>
>> I'd be happy to help integrate and test such functions in RISC-V.

I am more inclined towards consolidating remove_pagetable() across platforms
like arm64 and RISC-V (and probably others). But there are clear distinctions
between the user and kernel page table tear-down processes; a possible
consolidated signature is sketched after the list below.

> 
> Indeed, I had hoped we might be able to piggyback off generic code for this anyway,
> given that we have generic pagetable code which knows how to free process pagetables,
> and kernel pagetables are also pagetables.

But there are differences. To list some:

* Freeing mapped and pagetable pages

	- Memory hot remove deals with both vmemmap and linear mappings
	- Selectively call pgtable_page_dtor() for linear mappings (arch specific)
	- Not actually freeing PTE|PMD|PUD mapped pages for linear mappings
	- Freeing mapped pages for vmemmap mappings

* TLB shootdown

	- User page table process uses mmu_gather mechanism for TLB flush
	- Kernel page table tear down can do with fewer TLB flush invocations
		- Don't have to care about flush deferral etc.

* THP and HugeTLB

	- Kernel page table tear down procedure does not have to understand THP or HugeTLB
	- Though it has to understand possible arch specific special block mappings

		- Specific kernel linear mappings on arm64
			- PUD|PMD|CONT_PMD|CONT_PTE large page mappings

		- Specific vmemmap mappings on arm64
			- PMD large or PTE mappings

	- User page table tear down procedure needs to understand THP and HugeTLB

* Page table locking

	- Kernel procedure locks init_mm.page_table_lock while clearing an individual entry
	- Kernel procedure does not have to worry about mmap_sem

* ZONE_DEVICE struct vmem_altmap

	- Kernel page table tear down procedure needs to accommodate 'struct vmem_altmap' when
	vmemmap mappings are created with pages allocated from 'struct vmem_altmap' (ZONE_DEVICE)
	rather than buddy allocator or memblock.
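
Given all that, a consolidated helper would probably end up carrying both
distinctions in its signature, something like this purely hypothetical
sketch (mirroring how this series and the follow-up RFC thread the
arguments through):

	void remove_pagetable(unsigned long start, unsigned long end,
			      bool direct, struct vmem_altmap *altmap);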

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/6] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory()
  2019-04-03  9:17   ` Michal Hocko
@ 2019-04-04  8:32     ` Anshuman Khandual
  0 siblings, 0 replies; 43+ messages in thread
From: Anshuman Khandual @ 2019-04-04  8:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-arm-kernel, linux-mm, akpm, will.deacon,
	catalin.marinas, mgorman, james.morse, mark.rutland,
	robin.murphy, cpandya, arunks, dan.j.williams, osalvador, logang,
	pasha.tatashin, david, cai



On 04/03/2019 02:47 PM, Michal Hocko wrote:
> On Wed 03-04-19 10:00:04, Anshuman Khandual wrote:
>> Memory hot remove uses get_nid_for_pfn() while tearing down linked sysfs
>> entries between memory block and node. It first checks pfn validity with
>> pfn_valid_within() before fetching nid. With CONFIG_HOLES_IN_ZONE config
>> (arm64 has this enabled) pfn_valid_within() calls pfn_valid().
>>
>> pfn_valid() is an arch implementation on arm64 (CONFIG_HAVE_ARCH_PFN_VALID)
>> which scans all mapped memblock regions with memblock_is_map_memory(). This
>> creates a problem in memory hot remove path which has already removed given
>> memory range from memory block with memblock_[remove|free] before arriving
>> at unregister_mem_sect_under_nodes(). Hence get_nid_for_pfn() returns -1
>> skipping subsequent sysfs_remove_link() calls leaving node <-> memory block
>> sysfs entries as is. Subsequent memory add operation hits BUG_ON() because
>> of existing sysfs entries.
>>
>> [   62.007176] NUMA: Unknown node for memory at 0x680000000, assuming node 0
>> [   62.052517] ------------[ cut here ]------------
>> [   62.053211] kernel BUG at mm/memory_hotplug.c:1143!
>> [   62.053868] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
>> [   62.054589] Modules linked in:
>> [   62.054999] CPU: 19 PID: 3275 Comm: bash Not tainted 5.1.0-rc2-00004-g28cea40b2683 #41
>> [   62.056274] Hardware name: linux,dummy-virt (DT)
>> [   62.057166] pstate: 40400005 (nZcv daif +PAN -UAO)
>> [   62.058083] pc : add_memory_resource+0x1cc/0x1d8
>> [   62.058961] lr : add_memory_resource+0x10c/0x1d8
>> [   62.059842] sp : ffff0000168b3ce0
>> [   62.060477] x29: ffff0000168b3ce0 x28: ffff8005db546c00
>> [   62.061501] x27: 0000000000000000 x26: 0000000000000000
>> [   62.062509] x25: ffff0000111ef000 x24: ffff0000111ef5d0
>> [   62.063520] x23: 0000000000000000 x22: 00000006bfffffff
>> [   62.064540] x21: 00000000ffffffef x20: 00000000006c0000
>> [   62.065558] x19: 0000000000680000 x18: 0000000000000024
>> [   62.066566] x17: 0000000000000000 x16: 0000000000000000
>> [   62.067579] x15: ffffffffffffffff x14: ffff8005e412e890
>> [   62.068588] x13: ffff8005d6b105d8 x12: 0000000000000000
>> [   62.069610] x11: ffff8005d6b10490 x10: 0000000000000040
>> [   62.070615] x9 : ffff8005e412e898 x8 : ffff8005e412e890
>> [   62.071631] x7 : ffff8005d6b105d8 x6 : ffff8005db546c00
>> [   62.072640] x5 : 0000000000000001 x4 : 0000000000000002
>> [   62.073654] x3 : ffff8005d7049480 x2 : 0000000000000002
>> [   62.074666] x1 : 0000000000000003 x0 : 00000000ffffffef
>> [   62.075685] Process bash (pid: 3275, stack limit = 0x00000000d754280f)
>> [   62.076930] Call trace:
>> [   62.077411]  add_memory_resource+0x1cc/0x1d8
>> [   62.078227]  __add_memory+0x70/0xa8
>> [   62.078901]  probe_store+0xa4/0xc8
>> [   62.079561]  dev_attr_store+0x18/0x28
>> [   62.080270]  sysfs_kf_write+0x40/0x58
>> [   62.080992]  kernfs_fop_write+0xcc/0x1d8
>> [   62.081744]  __vfs_write+0x18/0x40
>> [   62.082400]  vfs_write+0xa4/0x1b0
>> [   62.083037]  ksys_write+0x5c/0xc0
>> [   62.083681]  __arm64_sys_write+0x18/0x20
>> [   62.084432]  el0_svc_handler+0x88/0x100
>> [   62.085177]  el0_svc+0x8/0xc
>>
>> Re-ordering arch_remove_memory() with memblock_[free|remove] solves the
>> problem on arm64 as pfn_valid() behaves correctly and returns positive
>> as memblock for the address range still exists. arch_remove_memory()
>> removes applicable memory sections from zone with __remove_pages() and
>> tears down kernel linear mapping. Removing memblock regions afterwards
>> is consistent.
> 
> consistent with what? Anyway, I believe you wanted to mention that this
> is safe because there is no other memblock (bootmem) allocator user that

Yes, that is what I intended, but I did not express it very well here.

> late. So nobody is going to allocate from the removed range just to blow
> up later. Also nobody should be using the bootmem allocated range else
> we wouldn't allow to remove it. So reordering is indeed safe.

Looks better.

>  
>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> 
> With a changelog updated to explain why this is safe

Sure will change it. Thanks for the commit message suggestion.
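
For anyone following along, the reordering being discussed boils down to
moving the arch_remove_memory() call ahead of the memblock teardown in
__remove_memory(), roughly like this (a sketch against the v5.1-era
prototypes; the exact surrounding code differs):

	/* before: memblock state is already gone when the arch code
	 * (and hence pfn_valid() on arm64) still needs it */
	memblock_free(start, size);
	memblock_remove(start, size);
	arch_remove_memory(nid, start, size, NULL);

	/* after: tear down sections and mappings first, then drop
	 * the memblock regions */
	arch_remove_memory(nid, start, size, NULL);
	memblock_free(start, size);
	memblock_remove(start, size);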

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/6] arm64/mm: Enable sysfs based memory hot add interface
  2019-04-04  5:25     ` Anshuman Khandual
@ 2019-04-04  8:49       ` David Hildenbrand
  0 siblings, 0 replies; 43+ messages in thread
From: David Hildenbrand @ 2019-04-04  8:49 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-arm-kernel, linux-mm,
	akpm, will.deacon, catalin.marinas
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, logang,
	pasha.tatashin, cai

On 04.04.19 07:25, Anshuman Khandual wrote:
> 
> 
> On 04/03/2019 01:50 PM, David Hildenbrand wrote:
>> On 03.04.19 06:30, Anshuman Khandual wrote:
>>> Sysfs memory probe interface (/sys/devices/system/memory/probe) can accept
>>> starting physical address of an entire memory block to be hot added into
>>> the kernel. This is in addition to the existing ACPI based interface. This
>>> just enables it with the required config CONFIG_ARCH_MEMORY_PROBE.
>>>
>> We recently discussed that the similar interface for removal should
>> rather be moved to a debug/test module.
> 
Can we maintain such a debug/test module in mainline and enable it when
required? Or we could have both add and remove interfaces at
/sys/kernel/debug/ just for testing purposes.

I assume we could put such a module into mainline. It could open up an
interface in debugfs.
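
Something minimal along these lines would probably do; this is a
hypothetical sketch only, with illustrative names, and it assumes
add_memory(), memory_add_physaddr_to_nid() and memory_block_size_bytes()
are usable from a module:

	#include <linux/module.h>
	#include <linux/debugfs.h>
	#include <linux/memory.h>
	#include <linux/memory_hotplug.h>

	static struct dentry *memhp_dir;

	static int memhp_probe_set(void *data, u64 addr)
	{
		/* hot add one memory block at the given physical address */
		return add_memory(memory_add_physaddr_to_nid(addr), addr,
				  memory_block_size_bytes());
	}
	DEFINE_DEBUGFS_ATTRIBUTE(memhp_probe_fops, NULL, memhp_probe_set,
				 "0x%llx\n");

	static int __init memhp_debug_init(void)
	{
		memhp_dir = debugfs_create_dir("memhp_test", NULL);
		debugfs_create_file("probe", 0200, memhp_dir, NULL,
				    &memhp_probe_fops);
		return 0;
	}
	module_init(memhp_debug_init);

	static void __exit memhp_debug_exit(void)
	{
		debugfs_remove_recursive(memhp_dir);
	}
	module_exit(memhp_debug_exit);

	MODULE_LICENSE("GPL");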

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/6] arm64/mm: Enable memory hot remove
  2019-04-04  7:07     ` Anshuman Khandual
@ 2019-04-04  9:16       ` Steven Price
  0 siblings, 0 replies; 43+ messages in thread
From: Steven Price @ 2019-04-04  9:16 UTC (permalink / raw)
  To: Anshuman Khandual, Logan Gunthorpe, linux-kernel,
	linux-arm-kernel, linux-mm, akpm, will.deacon, catalin.marinas
  Cc: mark.rutland, mhocko, david, robin.murphy, cai, pasha.tatashin,
	Stephen Bates, james.morse, cpandya, arunks, dan.j.williams,
	mgorman, osalvador

On 04/04/2019 08:07, Anshuman Khandual wrote:
> 
> 
> On 04/03/2019 11:02 PM, Logan Gunthorpe wrote:
>>
>>
>> On 2019-04-02 10:30 p.m., Anshuman Khandual wrote:
>>> Memory removal from an arch perspective involves tearing down two different
>>> kernel-based mappings, i.e. vmemmap and linear, while releasing the related
>>> page table pages allocated for the physical memory range being removed.
>>>
>>> Define a common kernel page table tear-down helper remove_pagetable() which
>>> can be used to unmap a given kernel virtual address range. In effect it can
>>> tear down both vmemmap and kernel linear mappings. This new helper is called
>>> from both vmemmap_free() and ___remove_pgd_mapping() during memory removal.
>>> The argument 'direct' here identifies kernel linear mappings.
>>>
>>> Page table pages for vmemmap mappings are allocated through sparse mem
>>> helper functions like vmemmap_alloc_block(), which do not cycle the pages
>>> through pgtable_page_ctor(). Hence the removal path skips the corresponding
>>> destructor, pgtable_page_dtor(), for them.
>>>
>>> While here, update arch_add_memory() to handle __add_pages() failures by
>>> just unmapping the recently added kernel linear mapping. Now enable memory
>>> hot remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.
>>>
>>> This implementation is overall inspired by the kernel page table tear-down
>>> procedure on the x86 architecture.
>>
>> I've been working on very similar things for RISC-V. In fact, I'm
>> currently in progress on a very similar stripped down version of
>> remove_pagetable(). (Though I'm fairly certain I've done a bunch of
>> stuff wrong.)
>>
>> Would it be possible to move this work into common code that can be used
>> by all arches? Seems like, to start, we should be able to support both
>> arm64 and RISC-V... and maybe even x86 too.
>>
>> I'd be happy to help integrate and test such functions in RISC-V.
> 
> Sure, that will be great. The only impediment is pgtable_page_ctor() for the
> kernel linear mapping. This series is based on current arm64, where linear
> mapping pgtable pages go through the pgtable_page_ctor() init sequence, but
> that might be changing soon. If RISC-V does not have pgtable_page_ctor() init
> for the linear mapping, and there is no other arch-specific stuff later on, we
> can try to consolidate remove_pagetable() at least for both architectures.
> 
> Then I am wondering whether I can transition pud|pmd_large() to pud|pmd_sect().

The first 10 patches of my generic page walk series [1] add p?d_large() as a
common feature, so it is probably best to stick with p?d_large() if this is
going to be common, and to base on top of those patches.

[1]
https://lore.kernel.org/lkml/20190403141627.11664-1-steven.price@arm.com/T/
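
For arches that do not provide their own, such a generic series would
presumably supply trivial fallbacks along these lines (a sketch, not the
actual patches):

	#ifndef pmd_large
	#define pmd_large(pmd)	0
	#endif
	#ifndef pud_large
	#define pud_large(pud)	0
	#endif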

Steve

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC 1/2] mm/vmemmap: Enable vmem_altmap based base page mapping for vmemmap
  2019-04-03  4:30 [PATCH 0/6] arm64/mm: Enable memory hot remove and ZONE_DEVICE Anshuman Khandual
                   ` (6 preceding siblings ...)
  2019-04-03 18:08 ` [PATCH 0/6] arm64/mm: Enable memory hot remove and ZONE_DEVICE Dan Williams
@ 2019-04-04  9:46 ` Anshuman Khandual
  2019-04-04  9:46   ` [RFC 2/2] arm64/mm: Enable ZONE_DEVICE for all page configs Anshuman Khandual
  7 siblings, 1 reply; 43+ messages in thread
From: Anshuman Khandual @ 2019-04-04  9:46 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-mm, akpm, will.deacon,
	catalin.marinas
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, logang, david, cai

vmemmap_populate_basepages() is used for vmemmap mappings across platforms.
On arm64 it is used for the ARM64_16K_PAGES and ARM64_64K_PAGES configs. When
applicable, enable its allocation from a device memory range through struct
vmem_altmap. Individual archs should enable this when appropriate. Hence keep
it disabled here to continue with the existing semantics.

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/mm/mmu.c      |  2 +-
 arch/ia64/mm/discontig.c |  2 +-
 arch/x86/mm/init_64.c    |  4 ++--
 include/linux/mm.h       |  5 +++--
 mm/sparse-vmemmap.c      | 14 ++++++++++----
 5 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 4b25b7544763..2859aa89cc4a 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -921,7 +921,7 @@ remove_pagetable(unsigned long start, unsigned long end,
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 		struct vmem_altmap *altmap)
 {
-	return vmemmap_populate_basepages(start, end, node);
+	return vmemmap_populate_basepages(start, end, node, NULL);
 }
 #else	/* !ARM64_SWAPPER_USES_SECTION_MAPS */
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index 05490dd073e6..faefd7ec991f 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -660,7 +660,7 @@ void arch_refresh_nodedata(int update_node, pg_data_t *update_pgdat)
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 		struct vmem_altmap *altmap)
 {
-	return vmemmap_populate_basepages(start, end, node);
+	return vmemmap_populate_basepages(start, end, node, NULL);
 }
 
 void vmemmap_free(unsigned long start, unsigned long end,
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index bccff68e3267..e7e05d1b8bcf 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1450,7 +1450,7 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 			vmemmap_verify((pte_t *)pmd, node, addr, next);
 			continue;
 		}
-		if (vmemmap_populate_basepages(addr, next, node))
+		if (vmemmap_populate_basepages(addr, next, node, NULL))
 			return -ENOMEM;
 	}
 	return 0;
@@ -1468,7 +1468,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 				__func__);
 		err = -ENOMEM;
 	} else
-		err = vmemmap_populate_basepages(start, end, node);
+		err = vmemmap_populate_basepages(start, end, node, NULL);
 	if (!err)
 		sync_global_pgds(start, end - 1);
 	return err;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 76769749b5a5..a62e9ff24af3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2672,14 +2672,15 @@ pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
 p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
 pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
-pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node);
+pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
+					struct vmem_altmap *altmap);
 void *vmemmap_alloc_block(unsigned long size, int node);
 struct vmem_altmap;
 void *vmemmap_alloc_block_buf(unsigned long size, int node);
 void *altmap_alloc_block_buf(unsigned long size, struct vmem_altmap *altmap);
 void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
 int vmemmap_populate_basepages(unsigned long start, unsigned long end,
-			       int node);
+			       int node, struct vmem_altmap *altmap);
 int vmemmap_populate(unsigned long start, unsigned long end, int node,
 		struct vmem_altmap *altmap);
 void vmemmap_populate_print_last(void);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 7fec05796796..81a0960b5cd4 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -140,12 +140,18 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
 			start, end - 1);
 }
 
-pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node)
+pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
+				struct vmem_altmap *altmap)
 {
 	pte_t *pte = pte_offset_kernel(pmd, addr);
 	if (pte_none(*pte)) {
 		pte_t entry;
-		void *p = vmemmap_alloc_block_buf(PAGE_SIZE, node);
+		void *p;
+
+		if (altmap)
+			p = altmap_alloc_block_buf(PAGE_SIZE, altmap);
+		else
+			p = vmemmap_alloc_block_buf(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
@@ -214,7 +220,7 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 }
 
 int __meminit vmemmap_populate_basepages(unsigned long start,
-					 unsigned long end, int node)
+			unsigned long end, int node, struct vmem_altmap *altmap)
 {
 	unsigned long addr = start;
 	pgd_t *pgd;
@@ -236,7 +242,7 @@ int __meminit vmemmap_populate_basepages(unsigned long start,
 		pmd = vmemmap_pmd_populate(pud, addr, node);
 		if (!pmd)
 			return -ENOMEM;
-		pte = vmemmap_pte_populate(pmd, addr, node);
+		pte = vmemmap_pte_populate(pmd, addr, node, altmap);
 		if (!pte)
 			return -ENOMEM;
 		vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 2/2] arm64/mm: Enable ZONE_DEVICE for all page configs
  2019-04-04  9:46 ` [RFC 1/2] mm/vmemmap: Enable vmem_altmap based base page mapping for vmemmap Anshuman Khandual
@ 2019-04-04  9:46   ` Anshuman Khandual
  0 siblings, 0 replies; 43+ messages in thread
From: Anshuman Khandual @ 2019-04-04  9:46 UTC (permalink / raw)
  To: linux-kernel, linux-arm-kernel, linux-mm, akpm, will.deacon,
	catalin.marinas
  Cc: mhocko, mgorman, james.morse, mark.rutland, robin.murphy,
	cpandya, arunks, dan.j.williams, osalvador, logang, david, cai

Now that vmemmap_populate_basepages() supports struct vmem_altmap based
allocations, ZONE_DEVICE can be functional across all page size configs.
vmemmap_populate_basepages() now takes the actual struct vmem_altmap for
allocation, and remove_pagetable() should accommodate such new PTE-level
vmemmap mappings. Just remove the ARCH_HAS_ZONE_DEVICE dependency from
ARM64_4K_PAGES.
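
For context, free_huge_pagetable() used below comes from [PATCH 3/6] of the
original series; a minimal sketch of what such a helper does, assuming
vmem_altmap_free() semantics, would be:

	static void free_huge_pagetable(struct page *page, int order,
			struct vmem_altmap *altmap)
	{
		/* pages allocated from the altmap go back to its
		 * reservation; everything else to the normal pool */
		if (altmap)
			vmem_altmap_free(altmap, (1UL << order));
		else
			free_pagetable(page, order);
	}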

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/Kconfig  |  2 +-
 arch/arm64/mm/mmu.c | 10 +++++-----
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index b5d8cf57e220..4a37a33a4fe5 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -31,7 +31,7 @@ config ARM64
 	select ARCH_HAS_SYSCALL_WRAPPER
 	select ARCH_HAS_TEARDOWN_DMA_OPS if IOMMU_SUPPORT
 	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
-	select ARCH_HAS_ZONE_DEVICE if ARM64_4K_PAGES
+	select ARCH_HAS_ZONE_DEVICE
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select ARCH_INLINE_READ_LOCK if !PREEMPT
 	select ARCH_INLINE_READ_LOCK_BH if !PREEMPT
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 2859aa89cc4a..509ed7e547a3 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -818,8 +818,8 @@ static void __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd, bool direct)
 #endif
 
 static void __meminit
-remove_pte_table(pte_t *pte_start, unsigned long addr,
-			unsigned long end, bool direct)
+remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
+		bool direct, struct vmem_altmap *altmap)
 {
 	pte_t *pte;
 
@@ -829,7 +829,7 @@ remove_pte_table(pte_t *pte_start, unsigned long addr,
 			continue;
 
 		if (!direct)
-			free_pagetable(pte_page(*pte), 0);
+			free_huge_pagetable(pte_page(*pte), 0, altmap);
 		spin_lock(&init_mm.page_table_lock);
 		pte_clear(&init_mm, addr, pte);
 		spin_unlock(&init_mm.page_table_lock);
@@ -860,7 +860,7 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
 			continue;
 		}
 		pte_base = pte_offset_kernel(pmd, 0UL);
-		remove_pte_table(pte_base, addr, next, direct);
+		remove_pte_table(pte_base, addr, next, direct, altmap);
 		free_pte_table(pte_base, pmd, direct);
 	}
 }
@@ -921,7 +921,7 @@ remove_pagetable(unsigned long start, unsigned long end,
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 		struct vmem_altmap *altmap)
 {
-	return vmemmap_populate_basepages(start, end, node, NULL);
+	return vmemmap_populate_basepages(start, end, node, altmap);
 }
 #else	/* !ARM64_SWAPPER_USES_SECTION_MAPS */
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/6] arm64/mm: Enable ZONE_DEVICE
  2019-04-04  5:04       ` Dan Williams
@ 2019-04-04  9:46         ` Robin Murphy
  2019-04-07 22:11           ` Dan Williams
  0 siblings, 1 reply; 43+ messages in thread
From: Robin Murphy @ 2019-04-04  9:46 UTC (permalink / raw)
  To: Dan Williams, Anshuman Khandual
  Cc: Linux Kernel Mailing List, linux-arm-kernel, Linux MM,
	Andrew Morton, Will Deacon, Catalin Marinas, Michal Hocko,
	Mel Gorman, james.morse, Mark Rutland, cpandya, arunks,
	osalvador, Logan Gunthorpe, David Hildenbrand, cai,
	Jérôme Glisse

On 04/04/2019 06:04, Dan Williams wrote:
> On Wed, Apr 3, 2019 at 9:42 PM Anshuman Khandual
> <anshuman.khandual@arm.com> wrote:
>>
>>
>>
>> On 04/03/2019 07:28 PM, Robin Murphy wrote:
>>> [ +Dan, Jerome ]
>>>
>>> On 03/04/2019 05:30, Anshuman Khandual wrote:
>>>> Arch implementation for functions which create or destroy vmemmap mapping
>>>> (vmemmap_populate, vmemmap_free) can comprehend and allocate from inside
>>>> device memory range through driver provided vmem_altmap structure which
>>>> fulfils all requirements to enable ZONE_DEVICE on the platform. Hence just
>>>
>>> ZONE_DEVICE is about more than just altmap support, no?
>>
>> Hot plugging the memory into a dev->numa_node's ZONE_DEVICE and initializing the
>> struct pages for it has stand alone and self contained use case. The driver could
>> just want to manage the memory itself but with struct pages either in the RAM or
>> in the device memory range through struct vmem_altmap. The driver may not choose
>> to opt for HMM, FS DAX, P2PDMA (use cases of ZONE_DEVICE) where it may have to
>> map these pages into any user pagetable which would necessitate support for
>> pte|pmd|pud_devmap.
> 
> What's left for ZONE_DEVICE if none of the above cases are used?
> 
>> Though I am still working towards getting HMM, FS DAX, P2PDMA enabled on arm64,
>> IMHO ZONE_DEVICE is self contained and can be evaluated in itself.
> 
> I'm not convinced. What's the specific use case.

The fundamental "roadmap" reason we've been doing this is to enable 
further NVDIMM/pmem development (libpmem/Qemu/etc.) on arm64. The fact 
that ZONE_DEVICE immediately opens the door to the various other stuff 
that the CCIX folks have interest in is a definite bonus, so it would 
certainly be preferable to get arm64 on par with the current state of 
things rather than try to subdivide the scope further.

I started working on this from the ZONE_DEVICE end, but got bogged down 
in trying to replace my copied-from-s390 dummy hot-remove implementation 
with something proper. Anshuman has stepped in to help with hot-remove 
(since we also have cloud folks wanting that for its own sake), so is 
effectively coming at the problem from the opposite direction, and I'll 
be the first to admit that we've not managed the greatest job of meeting 
in the middle and coordinating our upstream story; sorry about that :)

Let me freshen up my devmap patches and post them properly, since that 
discussion doesn't have to happen in the context of hot-remove; they're 
effectively just parallel dependencies for ZONE_DEVICE.

Robin.

>>
>>>
>>>> enable ZONE_DEVICE by subscribing to ARCH_HAS_ZONE_DEVICE. But this is only
>>>> applicable for ARM64_4K_PAGES (ARM64_SWAPPER_USES_SECTION_MAPS) only which
>>>> creates vmemmap section mappings and utilize vmem_altmap structure.
>>>
>>> What prevents it from working with other page sizes? One of the foremost use-cases for our 52-bit VA/PA support is to enable mapping large quantities of persistent memory, so we really do need this for 64K pages too. FWIW, it appears not to be an issue for PowerPC.
>>
>>
>> On !AR64_4K_PAGES vmemmap_populate() calls vmemmap_populate_basepages() which
>> does not support struct vmem_altmap right now. Originally was planning to send
>> the vmemmap_populate_basepages() enablement patches separately but will post it
>> here for review.
>>
>>>
>>>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>>>> ---
>>>>    arch/arm64/Kconfig | 1 +
>>>>    1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>>> index db3e625..b5d8cf5 100644
>>>> --- a/arch/arm64/Kconfig
>>>> +++ b/arch/arm64/Kconfig
>>>> @@ -31,6 +31,7 @@ config ARM64
>>>>        select ARCH_HAS_SYSCALL_WRAPPER
>>>>        select ARCH_HAS_TEARDOWN_DMA_OPS if IOMMU_SUPPORT
>>>>        select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
>>>> +    select ARCH_HAS_ZONE_DEVICE if ARM64_4K_PAGES
>>>
>>> IIRC certain configurations (HMM?) don't even build if you just turn this on alone (although of course things may have changed elsewhere in the meantime) - crucially, though, from previous discussions[1] it seems fundamentally unsafe, since I don't think we can guarantee that nobody will touch the corners of ZONE_DEVICE that also require pte_devmap in order not to go subtly wrong. I did get as far as cooking up some patches to sort that out [2][3] which I never got round to posting for their own sake, so please consider picking those up as part of this series.
>>
>> In the previous discussion mentioned here [1] it is sort of indicated that we
>> cannot have a viable (ARCH_HAS_ZONE_DEVICE=y but !__HAVE_ARCH_PTE_DEVMAP). I
>> don't understand why!
> 
> Because ZONE_DEVICE was specifically invented to solve get_user_pages() for DAX.
> 
>> The driver can just hotplug the range into ZONE_DEVICE,
>> manage the memory itself without ever mapping it into a user page table.
> 
> Then why do you even need 'struct page' objects?
> 
>> IIUC
>> ZONE_DEVICE must not need user mapped device PFN support.
> 
> No, you don't understand correctly, or I don't understand how you plan
> to use ZONE_DEVICE outside its intended use case.
> 
>> All the corner case
>> problems discussed previously come in once this new 'device PFN' memory, which
>> is now in ZONE_DEVICE, gets mapped into a user page table.
>>
>>>
>>> Robin.
>>>
>>>>        select ARCH_HAVE_NMI_SAFE_CMPXCHG
>>>>        select ARCH_INLINE_READ_LOCK if !PREEMPT
>>>>        select ARCH_INLINE_READ_LOCK_BH if !PREEMPT
>>>>
>>>
>>>
>>> [1] https://lore.kernel.org/linux-mm/CAA9_cmfA9GS+1M1aSyv1ty5jKY3iho3CERhnRAruWJW3PfmpgA@mail.gmail.com/#t
>>> [2] http://linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=61816b833afdb56b49c2e58f5289ae18809e5d67
>>> [3] http://linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=a5a16560eb1becf9a1d4cc0d03d6b5e76da4f4e1
>>> (apologies to anyone if the linux-arm.org server is being flaky as usual and requires a few tries to respond properly)
>>
>> I have not evaluated pte_devmap(). Will consider [3] when enabling it. But
>> I still don't understand why ZONE_DEVICE cannot be enabled and used from a
>> driver which never requires user mapping or pte|pmd|pud_devmap() support.
> 
> Because there are mm paths that make assumptions about ZONE_DEVICE
> that your use case might violate.
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/6] arm64/mm: Enable memory hot remove
  2019-04-04  5:39     ` Anshuman Khandual
@ 2019-04-04 11:58       ` Oscar Salvador
  2019-04-04 13:03         ` Anshuman Khandual
  0 siblings, 1 reply; 43+ messages in thread
From: Oscar Salvador @ 2019-04-04 11:58 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Robin Murphy, linux-kernel, linux-arm-kernel, linux-mm, akpm,
	will.deacon, catalin.marinas, mhocko, mgorman, james.morse,
	mark.rutland, cpandya, arunks, dan.j.williams, logang,
	pasha.tatashin, david, cai, Steven Price

On Thu, Apr 04, 2019 at 11:09:22AM +0530, Anshuman Khandual wrote:
> > Do these need to be __meminit? AFAICS it's effectively redundant with the containing #ifdef, and removal feels like it's inherently a later-than-init thing anyway.
> 
> I was a bit confused here, but even x86 does exactly the same. All these
> functions are still labeled __meminit and all wrapped under
> CONFIG_MEMORY_HOTPLUG. Is there any concern with having __meminit here?

We do not really need it as long as the code is within #ifdef
CONFIG_MEMORY_HOTPLUG. __meminit is used when functions that are needed for
hotplug have to stay around after init.

/* Used for MEMORY_HOTPLUG */
#define __meminit        __section(.meminit.text) __cold notrace \
                                                  __latent_entropy

#if defined(CONFIG_MEMORY_HOTPLUG)
#define MEM_KEEP(sec)    *(.mem##sec)
#define MEM_DISCARD(sec)
#else
#define MEM_KEEP(sec)
#define MEM_DISCARD(sec) *(.mem##sec)
#endif

So it is kind of redundant to have both.
I will clean it up when reposting [1] and [2].

[1] https://patchwork.kernel.org/patch/10875019/
[2] https://patchwork.kernel.org/patch/10875021/

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/6] arm64/mm: Enable memory hot remove
  2019-04-04 11:58       ` Oscar Salvador
@ 2019-04-04 13:03         ` Anshuman Khandual
  2019-04-04 15:19           ` Oscar Salvador
  0 siblings, 1 reply; 43+ messages in thread
From: Anshuman Khandual @ 2019-04-04 13:03 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Robin Murphy, linux-kernel, linux-arm-kernel, linux-mm, akpm,
	will.deacon, catalin.marinas, mhocko, mgorman, james.morse,
	mark.rutland, cpandya, arunks, dan.j.williams, logang,
	pasha.tatashin, david, cai, Steven Price



On 04/04/2019 05:28 PM, Oscar Salvador wrote:
> On Thu, Apr 04, 2019 at 11:09:22AM +0530, Anshuman Khandual wrote:
>>> Do these need to be __meminit? AFAICS it's effectively redundant with the containing #ifdef, and removal feels like it's inherently a later-than-init thing anyway.
>>
>> I was a bit confused here, but even x86 does exactly the same. All these
>> functions are still labeled __meminit and all wrapped under
>> CONFIG_MEMORY_HOTPLUG. Is there any concern with having __meminit here?
> 
> We do not really need it as long as the code is within #ifdef
> CONFIG_MEMORY_HOTPLUG. __meminit is used when functions that are needed for
> hotplug have to stay around after init.

Makes sense.

> 
> /* Used for MEMORY_HOTPLUG */
> #define __meminit        __section(.meminit.text) __cold notrace \
>                                                   __latent_entropy
> 
> #if defined(CONFIG_MEMORY_HOTPLUG)
> #define MEM_KEEP(sec)    *(.mem##sec)
> #define MEM_DISCARD(sec)
> #else
> #define MEM_KEEP(sec)
> #define MEM_DISCARD(sec) *(.mem##sec)
> #endif
> 
> So it is kind of redundant to have both.
> I will clean it up when reposting [1] and [2].
> 
> [1] https://patchwork.kernel.org/patch/10875019/
> [2] https://patchwork.kernel.org/patch/10875021/
> 

Sure. Will remove them from the proposed functions next time around.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/6] arm64/mm: Enable memory hot remove and ZONE_DEVICE
  2019-04-03 18:08 ` [PATCH 0/6] arm64/mm: Enable memory hot remove and ZONE_DEVICE Dan Williams
@ 2019-04-04 13:11   ` Anshuman Khandual
  0 siblings, 0 replies; 43+ messages in thread
From: Anshuman Khandual @ 2019-04-04 13:11 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux Kernel Mailing List, linux-arm-kernel, Linux MM,
	Andrew Morton, Will Deacon, Catalin Marinas, Michal Hocko,
	Mel Gorman, james.morse, Mark Rutland, Robin Murphy, cpandya,
	arunks, osalvador, Logan Gunthorpe, Pavel Tatashin,
	David Hildenbrand, cai



On 04/03/2019 11:38 PM, Dan Williams wrote:
> On Tue, Apr 2, 2019 at 9:30 PM Anshuman Khandual
> <anshuman.khandual@arm.com> wrote:
>>
>> This series enables memory hot remove on arm64, fixes a memblock removal
>> ordering problem in generic __remove_memory(), enables sysfs memory probe
>> interface on arm64. It also enables ZONE_DEVICE with struct vmem_altmap
>> support.
>>
>> Testing:
>>
>> Tested hot remove on arm64 for all 4K, 16K, 64K page config options with
>> all possible VA_BITS and PGTABLE_LEVELS combinations. Tested ZONE_DEVICE
>> with ARM64_4K_PAGES through a dummy driver.
>>
>> Build tested on non arm64 platforms. I will appreciate if folks can test
>> arch_remove_memory() re-ordering in __remove_memory() on other platforms.
>>
>> Dependency:
>>
>> V5 series in the thread (https://lkml.org/lkml/2019/2/14/1096) will make
>> kernel linear mapping loose pgtable_page_ctor() init. When this happens
>> the proposed functions free_pte|pmd|pud_table() in [PATCH 2/6] will have
>> to stop calling pgtable_page_dtor().
> 
> Hi Anshuman,

Hello Dan,

> 
> I'd be interested to integrate this with the sub-section hotplug
> support [1]. Otherwise the padding implementation in libnvdimm can't
> be removed unless all ZONE_DEVICE capable archs also agree on the
> minimum arch_add_memory() granularity. I'd prefer not to special case
> which archs support which granularity, but it unfortunately
> complicates what you're trying to achieve.

Sorry, I have not been following your series on sub-section hotplug support,
hence I might not have the full context here. Could you please give some more
details on what exactly the problem might be?

> 
> I think at a minimum we, mm hotplug co-travellers, need to come to a
> consensus on whether sub-section support is viable for v5.2 and / or a
> pre-requisite for new arch-ZONE_DEVICE implementations

I would need to go through the sub-section hotplug series first to understand
the prerequisite. Do we need to support sub-section hotplug before being able
to enable ZONE_DEVICE?

- Anshuman

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/6] arm64/mm: Enable memory hot remove
  2019-04-04 13:03         ` Anshuman Khandual
@ 2019-04-04 15:19           ` Oscar Salvador
  0 siblings, 0 replies; 43+ messages in thread
From: Oscar Salvador @ 2019-04-04 15:19 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Robin Murphy, linux-kernel, linux-arm-kernel, linux-mm, akpm,
	will.deacon, catalin.marinas, mhocko, mgorman, james.morse,
	mark.rutland, cpandya, arunks, dan.j.williams, logang,
	pasha.tatashin, david, cai, Steven Price

On Thu, Apr 04, 2019 at 06:33:09PM +0530, Anshuman Khandual wrote:
> Sure. Will remove them from the proposed functions next time around.

We just need to make sure that the function does not call, directly or
indirectly, another __meminit function; then it is safe to remove the
annotation.

E.g.:

sparse_add_one_section() is wrapped in #ifdef CONFIG_MEMORY_HOTPLUG, but the
__meminit must stay because it calls sparse_index_init(),
sparse_init_one_section() and sparse_mem_map_populate(), all three marked as
__meminit because they are also used outside the hotplug scope, during early
boot.
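
In other words, as an illustrative sketch (prototypes approximate):

	/* __meminit is redundant here: this only exists when
	 * CONFIG_MEMORY_HOTPLUG=y, and then .meminit.text is kept anyway */
	#ifdef CONFIG_MEMORY_HOTPLUG
	static void free_pte_table(pte_t *pte_start, pmd_t *pmd, bool direct)
	{
		/* ... */
	}
	#endif

	/* __meminit must stay here: also used during early boot, and with
	 * CONFIG_MEMORY_HOTPLUG=n it can then be discarded after init */
	struct page * __meminit
	sparse_mem_map_populate(unsigned long pnum, int nid,
				struct vmem_altmap *altmap);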

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/6] arm64/mm: Enable ZONE_DEVICE
  2019-04-04  9:46         ` Robin Murphy
@ 2019-04-07 22:11           ` Dan Williams
  2019-04-08  4:03             ` Ira Weiny
  0 siblings, 1 reply; 43+ messages in thread
From: Dan Williams @ 2019-04-07 22:11 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Anshuman Khandual, Linux Kernel Mailing List, linux-arm-kernel,
	Linux MM, Andrew Morton, Will Deacon, Catalin Marinas,
	Michal Hocko, Mel Gorman, james.morse, Mark Rutland, cpandya,
	arunks, osalvador, Logan Gunthorpe, David Hildenbrand, cai,
	Jérôme Glisse, Weiny, Ira

On Thu, Apr 4, 2019 at 2:47 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 04/04/2019 06:04, Dan Williams wrote:
> > On Wed, Apr 3, 2019 at 9:42 PM Anshuman Khandual
> > <anshuman.khandual@arm.com> wrote:
> >>
> >>
> >>
> >> On 04/03/2019 07:28 PM, Robin Murphy wrote:
> >>> [ +Dan, Jerome ]
> >>>
> >>> On 03/04/2019 05:30, Anshuman Khandual wrote:
> >>>> Arch implementation for functions which create or destroy vmemmap mapping
> >>>> (vmemmap_populate, vmemmap_free) can comprehend and allocate from inside
> >>>> device memory range through driver provided vmem_altmap structure which
> >>>> fulfils all requirements to enable ZONE_DEVICE on the platform. Hence just
> >>>
> >>> ZONE_DEVICE is about more than just altmap support, no?
> >>
> >> Hot plugging the memory into a dev->numa_node's ZONE_DEVICE and initializing the
> >> struct pages for it has stand alone and self contained use case. The driver could
> >> just want to manage the memory itself but with struct pages either in the RAM or
> >> in the device memory range through struct vmem_altmap. The driver may not choose
> >> to opt for HMM, FS DAX, P2PDMA (use cases of ZONE_DEVICE) where it may have to
> >> map these pages into any user pagetable which would necessitate support for
> >> pte|pmd|pud_devmap.
> >
> > What's left for ZONE_DEVICE if none of the above cases are used?
> >
> >> Though I am still working towards getting HMM, FS DAX, P2PDMA enabled on arm64,
> >> IMHO ZONE_DEVICE is self contained and can be evaluated in itself.
> >
> > I'm not convinced. What's the specific use case.
>
> The fundamental "roadmap" reason we've been doing this is to enable
> further NVDIMM/pmem development (libpmem/Qemu/etc.) on arm64. The fact
> that ZONE_DEVICE immediately opens the door to the various other stuff
> that the CCIX folks have interest in is a definite bonus, so it would
> certainly be preferable to get arm64 on par with the current state of
> things rather than try to subdivide the scope further.
>
> I started working on this from the ZONE_DEVICE end, but got bogged down
> in trying to replace my copied-from-s390 dummy hot-remove implementation
> with something proper. Anshuman has stepped in to help with hot-remove
> (since we also have cloud folks wanting that for its own sake), so is
> effectively coming at the problem from the opposite direction, and I'll
> be the first to admit that we've not managed the greatest job of meeting
> in the middle and coordinating our upstream story; sorry about that :)
>
> Let me freshen up my devmap patches and post them properly, since that
> discussion doesn't have to happen in the context of hot-remove; they're
> effectively just parallel dependencies for ZONE_DEVICE.

Sounds good. It's also worth noting that Ira's recent patches for
supporting get_user_pages_fast() for "longterm" pins rely on
PTE_DEVMAP to determine when fast-GUP is safe to proceed, or whether
it needs to fall back to slow-GUP. So it really is the case that
"devmap" support is an assumption for ZONE_DEVICE.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/6] arm64/mm: Enable ZONE_DEVICE
  2019-04-07 22:11           ` Dan Williams
@ 2019-04-08  4:03             ` Ira Weiny
  2019-04-08  6:03               ` Anshuman Khandual
  0 siblings, 1 reply; 43+ messages in thread
From: Ira Weiny @ 2019-04-08  4:03 UTC (permalink / raw)
  To: Dan Williams
  Cc: Robin Murphy, Anshuman Khandual, Linux Kernel Mailing List,
	linux-arm-kernel, Linux MM, Andrew Morton, Will Deacon,
	Catalin Marinas, Michal Hocko, Mel Gorman, james.morse,
	Mark Rutland, cpandya, arunks, osalvador, Logan Gunthorpe,
	David Hildenbrand, cai, Jérôme Glisse

On Sun, Apr 07, 2019 at 03:11:00PM -0700, Dan Williams wrote:
> On Thu, Apr 4, 2019 at 2:47 AM Robin Murphy <robin.murphy@arm.com> wrote:
> >
> > On 04/04/2019 06:04, Dan Williams wrote:
> > > On Wed, Apr 3, 2019 at 9:42 PM Anshuman Khandual
> > > <anshuman.khandual@arm.com> wrote:
> > >>
> > >>
> > >>
> > >> On 04/03/2019 07:28 PM, Robin Murphy wrote:
> > >>> [ +Dan, Jerome ]
> > >>>
> > >>> On 03/04/2019 05:30, Anshuman Khandual wrote:
> > >>>> Arch implementation for functions which create or destroy vmemmap mapping
> > >>>> (vmemmap_populate, vmemmap_free) can comprehend and allocate from inside
> > >>>> device memory range through driver provided vmem_altmap structure which
> > >>>> fulfils all requirements to enable ZONE_DEVICE on the platform. Hence just
> > >>>
> > >>> ZONE_DEVICE is about more than just altmap support, no?
> > >>
> > >> Hot plugging the memory into a dev->numa_node's ZONE_DEVICE and initializing the
> > >> struct pages for it has stand alone and self contained use case. The driver could
> > >> just want to manage the memory itself but with struct pages either in the RAM or
> > >> in the device memory range through struct vmem_altmap. The driver may not choose
> > >> to opt for HMM, FS DAX, P2PDMA (use cases of ZONE_DEVICE) where it may have to
> > >> map these pages into any user pagetable which would necessitate support for
> > >> pte|pmd|pud_devmap.
> > >
> > > What's left for ZONE_DEVICE if none of the above cases are used?
> > >
> > >> Though I am still working towards getting HMM, FS DAX, P2PDMA enabled on arm64,
> > >> IMHO ZONE_DEVICE is self contained and can be evaluated in itself.
> > >
> > > I'm not convinced. What's the specific use case.
> >
> > The fundamental "roadmap" reason we've been doing this is to enable
> > further NVDIMM/pmem development (libpmem/Qemu/etc.) on arm64. The fact
> > that ZONE_DEVICE immediately opens the door to the various other stuff
> > that the CCIX folks have interest in is a definite bonus, so it would
> > certainly be preferable to get arm64 on par with the current state of
> > things rather than try to subdivide the scope further.
> >
> > I started working on this from the ZONE_DEVICE end, but got bogged down
> > in trying to replace my copied-from-s390 dummy hot-remove implementation
> > with something proper. Anshuman has stepped in to help with hot-remove
> > (since we also have cloud folks wanting that for its own sake), so is
> > effectively coming at the problem from the opposite direction, and I'll
> > be the first to admit that we've not managed the greatest job of meeting
> > in the middle and coordinating our upstream story; sorry about that :)
> >
> > Let me freshen up my devmap patches and post them properly, since that
> > discussion doesn't have to happen in the context of hot-remove; they're
> > effectively just parallel dependencies for ZONE_DEVICE.
> 
> Sounds good. It's also worth noting that Ira's recent patches for
> supporting get_user_pages_fast() for "longterm" pins rely on
> PTE_DEVMAP to determine when fast-GUP is safe to proceed, or whether
> it needs to fall back to slow-GUP. So it really is the case that
> "devmap" support is an assumption for ZONE_DEVICE.

Could you cc me on the patches when you post?

Thanks,
Ira


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/6] arm64/mm: Enable ZONE_DEVICE
  2019-04-08  4:03             ` Ira Weiny
@ 2019-04-08  6:03               ` Anshuman Khandual
  0 siblings, 0 replies; 43+ messages in thread
From: Anshuman Khandual @ 2019-04-08  6:03 UTC (permalink / raw)
  To: Ira Weiny, Dan Williams
  Cc: Robin Murphy, Linux Kernel Mailing List, linux-arm-kernel,
	Linux MM, Andrew Morton, Will Deacon, Catalin Marinas,
	Michal Hocko, Mel Gorman, james.morse, Mark Rutland, cpandya,
	arunks, osalvador, Logan Gunthorpe, David Hildenbrand, cai,
	Jérôme Glisse



On 04/08/2019 09:33 AM, Ira Weiny wrote:
> On Sun, Apr 07, 2019 at 03:11:00PM -0700, Dan Williams wrote:
>> On Thu, Apr 4, 2019 at 2:47 AM Robin Murphy <robin.murphy@arm.com> wrote:
>>>
>>> On 04/04/2019 06:04, Dan Williams wrote:
>>>> On Wed, Apr 3, 2019 at 9:42 PM Anshuman Khandual
>>>> <anshuman.khandual@arm.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 04/03/2019 07:28 PM, Robin Murphy wrote:
>>>>>> [ +Dan, Jerome ]
>>>>>>
>>>>>> On 03/04/2019 05:30, Anshuman Khandual wrote:
>>>>>>> Arch implementation for functions which create or destroy vmemmap mapping
>>>>>>> (vmemmap_populate, vmemmap_free) can comprehend and allocate from inside
>>>>>>> device memory range through driver provided vmem_altmap structure which
>>>>>>> fulfils all requirements to enable ZONE_DEVICE on the platform. Hence just
>>>>>>
>>>>>> ZONE_DEVICE is about more than just altmap support, no?
>>>>>
>>>>> Hot plugging the memory into a dev->numa_node's ZONE_DEVICE and initializing the
>>>>> struct pages for it has stand alone and self contained use case. The driver could
>>>>> just want to manage the memory itself but with struct pages either in the RAM or
>>>>> in the device memory range through struct vmem_altmap. The driver may not choose
>>>>> to opt for HMM, FS DAX, P2PDMA (use cases of ZONE_DEVICE) where it may have to
>>>>> map these pages into any user pagetable which would necessitate support for
>>>>> pte|pmd|pud_devmap.
>>>>
>>>> What's left for ZONE_DEVICE if none of the above cases are used?
>>>>
>>>>> Though I am still working towards getting HMM, FS DAX, P2PDMA enabled on arm64,
>>>>> IMHO ZONE_DEVICE is self contained and can be evaluated in itself.
>>>>
>>>> I'm not convinced. What's the specific use case.
>>>
>>> The fundamental "roadmap" reason we've been doing this is to enable
>>> further NVDIMM/pmem development (libpmem/Qemu/etc.) on arm64. The fact
>>> that ZONE_DEVICE immediately opens the door to the various other stuff
>>> that the CCIX folks have interest in is a definite bonus, so it would
>>> certainly be preferable to get arm64 on par with the current state of
>>> things rather than try to subdivide the scope further.
>>>
>>> I started working on this from the ZONE_DEVICE end, but got bogged down
>>> in trying to replace my copied-from-s390 dummy hot-remove implementation
>>> with something proper. Anshuman has stepped in to help with hot-remove
>>> (since we also have cloud folks wanting that for its own sake), so is
>>> effectively coming at the problem from the opposite direction, and I'll
>>> be the first to admit that we've not managed the greatest job of meeting
>>> in the middle and coordinating our upstream story; sorry about that :)
>>>
>>> Let me freshen up my devmap patches and post them properly, since that
>>> discussion doesn't have to happen in the context of hot-remove; they're
>>> effectively just parallel dependencies for ZONE_DEVICE.
>>
>> Sounds good. It's also worth noting that Ira's recent patches for
>> supporting get_user_pages_fast() for "longterm" pins rely on
>> PTE_DEVMAP to determine when fast-GUP is safe to proceed, or whether
>> it needs to fall back to slow-GUP. So it really is the case that
>> "devmap" support is an assumption for ZONE_DEVICE.
> 
> Could you cc me on the patches when you post?

Sure will do.

^ permalink raw reply	[flat|nested] 43+ messages in thread

Thread overview: 43+ messages
2019-04-03  4:30 [PATCH 0/6] arm64/mm: Enable memory hot remove and ZONE_DEVICE Anshuman Khandual
2019-04-03  4:30 ` [PATCH 1/6] arm64/mm: Enable sysfs based memory hot add interface Anshuman Khandual
2019-04-03  8:20   ` David Hildenbrand
2019-04-03 13:12     ` Robin Murphy
2019-04-04  5:21       ` Anshuman Khandual
2019-04-04  5:25     ` Anshuman Khandual
2019-04-04  8:49       ` David Hildenbrand
2019-04-03  4:30 ` [PATCH 2/6] arm64/mm: Enable memory hot remove Anshuman Khandual
2019-04-03 12:37   ` Robin Murphy
2019-04-03 13:15     ` Steven Price
2019-04-04  6:51       ` Anshuman Khandual
2019-04-04  5:39     ` Anshuman Khandual
2019-04-04 11:58       ` Oscar Salvador
2019-04-04 13:03         ` Anshuman Khandual
2019-04-04 15:19           ` Oscar Salvador
2019-04-03 17:32   ` Logan Gunthorpe
2019-04-03 17:57     ` Robin Murphy
2019-04-04  8:23       ` Anshuman Khandual
2019-04-04  7:07     ` Anshuman Khandual
2019-04-04  9:16       ` Steven Price
2019-04-03  4:30 ` [PATCH 3/6] arm64/mm: Enable struct page allocation from device memory Anshuman Khandual
2019-04-03  4:30 ` [PATCH 4/6] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory() Anshuman Khandual
2019-04-03  8:45   ` Oscar Salvador
2019-04-03  9:17   ` Michal Hocko
2019-04-04  8:32     ` Anshuman Khandual
2019-04-03  9:30   ` David Hildenbrand
2019-04-03  4:30 ` [PATCH 5/6] mm/memremap: Rename and consolidate SECTION_SIZE Anshuman Khandual
2019-04-03  9:26   ` Michal Hocko
2019-04-03  9:30   ` David Hildenbrand
2019-04-03  4:30 ` [PATCH 6/6] arm64/mm: Enable ZONE_DEVICE Anshuman Khandual
2019-04-03 13:58   ` Robin Murphy
2019-04-03 16:07     ` Jerome Glisse
2019-04-04  5:03       ` Anshuman Khandual
2019-04-04  4:42     ` Anshuman Khandual
2019-04-04  5:04       ` Dan Williams
2019-04-04  9:46         ` Robin Murphy
2019-04-07 22:11           ` Dan Williams
2019-04-08  4:03             ` Ira Weiny
2019-04-08  6:03               ` Anshuman Khandual
2019-04-03 18:08 ` [PATCH 0/6] arm64/mm: Enable memory hot remove and ZONE_DEVICE Dan Williams
2019-04-04 13:11   ` Anshuman Khandual
2019-04-04  9:46 ` [RFC 1/2] mm/vmemmap: Enable vmem_altmap based base page mapping for vmemmap Anshuman Khandual
2019-04-04  9:46   ` [RFC 2/2] arm64/mm: Enable ZONE_DEVICE for all page configs Anshuman Khandual
