* [PATCH V12 0/2] arm64/mm: Enable memory hot remove
@ 2020-01-16  6:45 Anshuman Khandual
  2020-01-16  6:45 ` [PATCH V12 1/2] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump Anshuman Khandual
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Anshuman Khandual @ 2020-01-16  6:45 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas, will
  Cc: mark.rutland, david, cai, logang, cpandya, arunks,
	dan.j.williams, mgorman, osalvador, ard.biesheuvel, steve.capper,
	broonie, valentin.schneider, Robin.Murphy, steven.price,
	suzuki.poulose, ira.weiny, Anshuman Khandual

This series enables memory hot remove functionality on the arm64 platform. It
is based on Linux 5.5-rc6 and particularly deals with a problem that arises
when an attempt is made to remove boot memory.

On the arm64 platform, it is essential to ensure that boot time discovered
memory cannot be hot-removed, so that:

1. FW data structures used across kexec are idempotent
   e.g. the EFI memory map.

2. linear map or vmemmap would not have to be dynamically split, and can
   map boot memory at a large granularity

3. Avoid penalizing paths that have to walk page tables, where we can be
   certain that the memory is not hot-removable

This problem was discussed extensively during the V10 review, which can be
found here (https://lkml.org/lkml/2019/10/11/233). This series now adds a
memory hotplug notifier to prevent boot memory offlining and thus hot remove.
It also fixes a potential race condition between dumping kernel page table
entries and a concurrent memory hot remove operation; a condensed sketch of
that fix is shown below.
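
The race fix in patch 1/2 simply brackets the kernel page table walk with the
memory hotplug lock. Condensed from that patch (reproduced here only as a
quick reference; the full change appears later in the series):

static int ptdump_show(struct seq_file *m, void *v)
{
	struct ptdump_info *info = m->private;

	/* Block concurrent memory hot remove while walking the tables */
	get_online_mems();
	ptdump_walk_pgd(m, info);
	put_online_mems();
	return 0;
}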

Concurrent vmalloc() and hot-remove conflict:

As pointed out earlier in the V5 thread [2], there can be a conflict between
a concurrent vmalloc() and a memory hot-remove operation. The problem is
caused by inadequate locking in vmalloc(), which protects the installation of
a page table page but not the page table walk or the leaf entry modification.

free_empty_tables() and its child functions now take into account a maximum
possible range, used as a floor/ceiling boundary, on which they operate. This
makes sure that no page table page is freed unless it is fully within the
maximum possible range as decided by the caller; see the annotated check below.
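
The check enforcing this is pgtable_range_aligned() from patch 2/2, reproduced
here with explanatory comments added (the comments are mine, the code is as in
the patch):

static bool pgtable_range_aligned(unsigned long start, unsigned long end,
				  unsigned long floor, unsigned long ceiling,
				  unsigned long mask)
{
	/* Round start down to the span covered by one page table page. */
	start &= mask;
	if (start < floor)
		return false;

	/* Round ceiling down as well; 0 is treated as "unbounded". */
	if (ceiling) {
		ceiling &= mask;
		if (!ceiling)
			return false;
	}

	/* [start, end) must not extend past the (aligned) ceiling. */
	if (end - 1 > ceiling - 1)
		return false;
	return true;
}

__remove_pgd_mapping() calls free_empty_tables() with (PAGE_OFFSET, PAGE_END)
and vmemmap_free() calls it with (VMEMMAP_START, VMEMMAP_END) as the
floor/ceiling pair, so a page table page shared with a neighbouring region
(e.g. vmalloc) is never freed.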

Testing:

Memory hot remove has been tested on arm64 for 4K, 16K and 64K page size
configurations with all possible CONFIG_ARM64_VA_BITS and CONFIG_PGTABLE_LEVELS
combinations.

Changes in V12:

- Dropped all changes introduced earlier in V11
- Added a memory hotplug notifier to prevent boot memory offlining per David

Changes in V11: (https://lkml.org/lkml/2020/1/9/1159)

- Bifurcated check_hotplug_memory_range() and carved out check_hotremove_memory_range()
- Introduced arch_memory_removable() call back while validating hot remove range
- Introduced memblock flag MEMBLOCK_BOOT in order to track boot memory at runtime
- Marked all boot memory ranges on arm64 with MEMBLOCK_BOOT flag while processing FDT
- Overrode arch_memory_removable() on arm64 to reject boot memory removal requests
- Added a WARN_ON() in arch_remove_memory() when it receives a boot memory removal request
- Added arch_memory_removable() related updates in the commit message for core hot remove

Changes in V10: (https://lkml.org/lkml/2019/10/11/233)

- Perform just a single TLBI for PMD or PUD block mappings per Catalin
- Added comment in free_empty_pte_table() while validating PTE level clears per Catalin
- Added comments in free_empty_pxx_table() while checking for non-clear entries per Catalin

Changes in V9: (https://lkml.org/lkml/2019/10/9/131)

- Dropped ACK tags from Steve and David as this series has changed since
- Dropped WARN(!page) in free_hotplug_page_range() per Matthew Wilcox
- Replaced pxx_page() with virt_to_page() in free_pxx_table() per Catalin
- Dropped page and call virt_to_page() in free_hotplug_pgtable_page()
- Replaced sparse_vmap with free_mapped per Catalin
- Dropped ternary operators in all unmap_hotplug_pxx_range() per Catalin
- Collapsed all free_pxx_table() into free_empty_pxx_table() per Catalin

Changes in V8: (https://lkml.org/lkml/2019/9/23/22)

- Dropped the first patch (memblock_[free|remove] reorder) from the series which
  is no longer needed for arm64 hot-remove enablement and was posted separately
  as (https://patchwork.kernel.org/patch/11146361/)
- Dropped vmalloc-vmemmap detection and subsequent skipping of free_empty_tables()
- Changed free_empty_[pxx]_tables() functions which now accept a possible maximum
  floor/ceiling address range on which they operate. Also changed free_pxx_table()
  functions to check against the required alignment as well as the maximum
  floor/ceiling range as another prerequisite before freeing the page table page.
- Dropped remove_pagetable(); instead its constituent functions are called directly

Changes in V7: (https://lkml.org/lkml/2019/9/3/326)

- vmalloc_vmemmap_overlap gets evaluated early during boot for a given config
- free_empty_tables() gets conditionally called based on vmalloc_vmemmap_overlap

Changes in V6: (https://lkml.org/lkml/2019/7/15/36)

- Implemented most of the suggestions from Mark Rutland
- Added <linux/memory_hotplug.h> in ptdump
- remove_pagetable() now has two distinct passes over the kernel page table
- First pass unmap_hotplug_range() removes leaf level entries at all levels
- Second pass free_empty_tables() removes empty page table pages
- Kernel page table lock has been dropped completely
- vmemmap_free() does not call free_empty_tables() to avoid conflict with vmalloc()
- All address range scanning is converted to do {} while() loops
- Added 'unsigned long end' in __remove_pgd_mapping()
- Dropped the starting pointer argument from free_[pte|pmd|pud]_table(); callers
  need not provide it anymore
- Fetching pxxp[i] in free_[pte|pmd|pud]_table() is wrapped around in READ_ONCE()
- free_[pte|pmd|pud]_table() now computes starting pointer inside the function
- Fixed TLB handling while freeing huge page section mappings at PMD or PUD level
- Added WARN_ON(!page) in free_hotplug_page_range()
- Added WARN_ON(![pmd|pud]_table(pmd|pud)) when there is no section mapping

- [PATCH 1/3] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()
- Requested earlier to be merged separately (https://patchwork.kernel.org/patch/10986599/)
- s/__remove_memory/try_remove_memory in the subject line
- s/arch_remove_memory/memblock_[free|remove] in the subject line
- A small change in the commit message as the re-order now happens for the memblock
  remove functions, not for arch_remove_memory()

Changes in V5: (https://lkml.org/lkml/2019/5/29/218)

- Reached agreement [1] on using memory_hotplug_lock for arm64 ptdump
- Commit 7ba36eccb3f8 ("arm64/mm: Inhibit huge-vmap with ptdump") was already merged
- Dropped the above patch from this series
- Fixed indentation problem in arch_[add|remove]_memory() as per David
- Collected all new Acked-by tags
 
Changes in V4: (https://lkml.org/lkml/2019/5/20/19)

- Implemented most of the suggestions from Mark Rutland
- Interchanged patch [PATCH 2/4] <---> [PATCH 3/4] and updated commit message
- Moved CONFIG_PGTABLE_LEVELS inside free_[pud|pmd]_table()
- Used READ_ONCE() in missing instances while accessing page table entries
- s/p???_present()/p???_none() for checking valid kernel page table entries
- WARN_ON() when an entry is !p???_none() and !p???_present() at the same time
- Updated memory hot-remove commit message with additional details as suggested
- Rebased the series on 5.2-rc1 with hotplug changes from David and Michal Hocko
- Collected all new Acked-by tags

Changes in V3: (https://lkml.org/lkml/2019/5/14/197)
 
- Implemented most of the suggestions from Mark Rutland for remove_pagetable()
- Fixed applicable PGTABLE_LEVEL wrappers around pgtable page freeing functions
- Replaced 'direct' with 'sparse_vmap' in remove_pagetable() with inverted polarity
- Changed pointer names ('p' at end) and removed tmp from iterations
- Perform intermediate TLB invalidation while clearing pgtable entries
- Dropped flush_tlb_kernel_range() in remove_pagetable()
- Added flush_tlb_kernel_range() in remove_pte_table() instead
- Renamed page freeing functions for pgtable page and mapped pages
- Used page range size instead of order while freeing mapped or pgtable pages
- Removed all PageReserved() handling while freeing mapped or pgtable pages
- Replaced XXX_index() with XXX_offset() while walking the kernel page table
- Used READ_ONCE() while fetching individual pgtable entries
- Took the overall init_mm.page_table_lock instead of taking it only while changing an entry
- Dropped previously added [pmd|pud]_index() which are not required anymore
- Added a new patch to protect kernel page table race condition for ptdump
- Added a new patch from Mark Rutland to prevent huge-vmap with ptdump

Changes in V2: (https://lkml.org/lkml/2019/4/14/5)

- Added all received review and ack tags
- Split the series from ZONE_DEVICE enablement for better review
- Moved memblock re-order patch to the front as per Robin Murphy
- Updated commit message on memblock re-order patch per Michal Hocko
- Dropped [pmd|pud]_large() definitions
- Used existing [pmd|pud]_sect() instead of earlier [pmd|pud]_large()
- Removed __meminit and __ref tags as per Oscar Salvador
- Dropped unnecessary 'ret' init in arch_add_memory() per Robin Murphy
- Skipped calling into pgtable_page_dtor() for linear mapping page table
  pages and updated all relevant functions

Changes in V1: (https://lkml.org/lkml/2019/4/3/28)

References:

[1] https://lkml.org/lkml/2019/5/28/584
[2] https://lkml.org/lkml/2019/6/11/709

Anshuman Khandual (2):
  arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  arm64/mm: Enable memory hot remove

 arch/arm64/Kconfig              |   3 +
 arch/arm64/include/asm/memory.h |   1 +
 arch/arm64/mm/mmu.c             | 342 ++++++++++++++++++++++++++++++++++++++--
 arch/arm64/mm/ptdump_debugfs.c  |   4 +
 4 files changed, 341 insertions(+), 9 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH V12 1/2] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  2020-01-16  6:45 [PATCH V12 0/2] arm64/mm: Enable memory hot remove Anshuman Khandual
@ 2020-01-16  6:45 ` Anshuman Khandual
  2020-01-21 12:02   ` Catalin Marinas
  2020-01-16  6:45 ` [PATCH V12 2/2] arm64/mm: Enable memory hot remove Anshuman Khandual
  2020-01-21 15:18 ` [PATCH V12 0/2] " Will Deacon
  2 siblings, 1 reply; 7+ messages in thread
From: Anshuman Khandual @ 2020-01-16  6:45 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas, will
  Cc: mark.rutland, david, cai, logang, cpandya, arunks,
	dan.j.williams, mgorman, osalvador, ard.biesheuvel, steve.capper,
	broonie, valentin.schneider, Robin.Murphy, steven.price,
	suzuki.poulose, ira.weiny, Anshuman Khandual

The arm64 page table dump code can race with concurrent modification of the
kernel page tables. When leaf entries are modified concurrently, the dump
code may log stale or inconsistent information for a VA range, but this is
otherwise not harmful.

When intermediate levels of table are freed, the dump code will continue to
use memory which has been freed and potentially reallocated for another
purpose. In such cases, the dump code may dereference bogus addresses,
leading to a number of potential problems.

Intermediate levels of table may be freed during memory hot-remove,
which will be enabled by a subsequent patch. To avoid racing with
this, take the memory hotplug lock when walking the kernel page table.

Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/mm/ptdump_debugfs.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
index 064163f..b5eebc8 100644
--- a/arch/arm64/mm/ptdump_debugfs.c
+++ b/arch/arm64/mm/ptdump_debugfs.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/debugfs.h>
+#include <linux/memory_hotplug.h>
 #include <linux/seq_file.h>
 
 #include <asm/ptdump.h>
@@ -7,7 +8,10 @@
 static int ptdump_show(struct seq_file *m, void *v)
 {
 	struct ptdump_info *info = m->private;
+
+	get_online_mems();
 	ptdump_walk_pgd(m, info);
+	put_online_mems();
 	return 0;
 }
 DEFINE_SHOW_ATTRIBUTE(ptdump);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH V12 2/2] arm64/mm: Enable memory hot remove
  2020-01-16  6:45 [PATCH V12 0/2] arm64/mm: Enable memory hot remove Anshuman Khandual
  2020-01-16  6:45 ` [PATCH V12 1/2] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump Anshuman Khandual
@ 2020-01-16  6:45 ` Anshuman Khandual
  2020-01-21 12:02   ` Catalin Marinas
  2020-01-21 15:18 ` [PATCH V12 0/2] " Will Deacon
  2 siblings, 1 reply; 7+ messages in thread
From: Anshuman Khandual @ 2020-01-16  6:45 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas, will
  Cc: mark.rutland, david, cai, logang, cpandya, arunks,
	dan.j.williams, mgorman, osalvador, ard.biesheuvel, steve.capper,
	broonie, valentin.schneider, Robin.Murphy, steven.price,
	suzuki.poulose, ira.weiny, Anshuman Khandual

The arch code for hot-remove must tear down portions of the linear map and
vmemmap corresponding to memory being removed. In both cases the page
tables mapping these regions must be freed, and when sparse vmemmap is in
use the memory backing the vmemmap must also be freed.

This patch adds unmap_hotplug_range() and free_empty_tables() helpers which
can be used to tear down either region, and calls them from vmemmap_free() and
__remove_pgd_mapping(). The free_mapped argument determines whether the
backing memory will be freed.

It makes two distinct passes over the kernel page table. In the first pass,
unmap_hotplug_range() unmaps each mapped leaf entry, invalidates the
applicable TLB entries and frees the backing memory if required (vmemmap). In
the second pass, free_empty_tables() looks for empty page table sections
whose page table page can be unmapped, TLB invalidated and freed.

While freeing intermediate level page table pages, bail out if any of their
entries are still valid. This can happen with a partially filled kernel page
table, either from a previously failed memory hot add attempt or while
removing an address range which does not span the entire page table page
range.

The vmemmap region may share levels of table with the vmalloc region.
There can be a conflict between hot remove freeing page table pages and
a concurrent vmalloc() walking the kernel page table. This conflict cannot
be solved simply by taking the init_mm ptl because of the existing locking
scheme in vmalloc(). So free_empty_tables() implements a floor and ceiling
method, borrowed from the user page table tear down in free_pgd_range(),
which skips freeing a page table page if the intermediate address range is
not aligned or the floor/ceiling limits do not own the entire page table
page.

Boot memory on arm64 cannot be removed. Hence this registers a new memory
hotplug notifier which prevents boot memory offlining and its removal.

While here, update arch_add_memory() to handle __add_pages() failures by
unmapping the recently added kernel linear mapping. Now enable memory hot
remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.

This implementation is overall inspired by the kernel page table tear down
procedure on the x86 architecture and the user page table tear down method.

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
---
 arch/arm64/Kconfig              |   3 +
 arch/arm64/include/asm/memory.h |   1 +
 arch/arm64/mm/mmu.c             | 342 ++++++++++++++++++++++++++++++++++++++--
 3 files changed, 337 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index e688dfa..08d0f0cb 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -278,6 +278,9 @@ config ZONE_DMA32
 config ARCH_ENABLE_MEMORY_HOTPLUG
 	def_bool y
 
+config ARCH_ENABLE_MEMORY_HOTREMOVE
+	def_bool y
+
 config SMP
 	def_bool y
 
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index a4f9ca5..580e245 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -54,6 +54,7 @@
 #define MODULES_VADDR		(BPF_JIT_REGION_END)
 #define MODULES_VSIZE		(SZ_128M)
 #define VMEMMAP_START		(-VMEMMAP_SIZE - SZ_2M)
+#define VMEMMAP_END		(VMEMMAP_START + VMEMMAP_SIZE)
 #define PCI_IO_END		(VMEMMAP_START - SZ_2M)
 #define PCI_IO_START		(PCI_IO_END - PCI_IO_SIZE)
 #define FIXADDR_TOP		(PCI_IO_START - SZ_2M)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 40797cb..8dcafe1 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -17,6 +17,7 @@
 #include <linux/mman.h>
 #include <linux/nodemask.h>
 #include <linux/memblock.h>
+#include <linux/memory.h>
 #include <linux/fs.h>
 #include <linux/io.h>
 #include <linux/mm.h>
@@ -724,6 +725,275 @@ int kern_addr_valid(unsigned long addr)
 
 	return pfn_valid(pte_pfn(pte));
 }
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void free_hotplug_page_range(struct page *page, size_t size)
+{
+	WARN_ON(PageReserved(page));
+	free_pages((unsigned long)page_address(page), get_order(size));
+}
+
+static void free_hotplug_pgtable_page(struct page *page)
+{
+	free_hotplug_page_range(page, PAGE_SIZE);
+}
+
+static bool pgtable_range_aligned(unsigned long start, unsigned long end,
+				  unsigned long floor, unsigned long ceiling,
+				  unsigned long mask)
+{
+	start &= mask;
+	if (start < floor)
+		return false;
+
+	if (ceiling) {
+		ceiling &= mask;
+		if (!ceiling)
+			return false;
+	}
+
+	if (end - 1 > ceiling - 1)
+		return false;
+	return true;
+}
+
+static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
+				    unsigned long end, bool free_mapped)
+{
+	pte_t *ptep, pte;
+
+	do {
+		ptep = pte_offset_kernel(pmdp, addr);
+		pte = READ_ONCE(*ptep);
+		if (pte_none(pte))
+			continue;
+
+		WARN_ON(!pte_present(pte));
+		pte_clear(&init_mm, addr, ptep);
+		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+		if (free_mapped)
+			free_hotplug_page_range(pte_page(pte), PAGE_SIZE);
+	} while (addr += PAGE_SIZE, addr < end);
+}
+
+static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
+				    unsigned long end, bool free_mapped)
+{
+	unsigned long next;
+	pmd_t *pmdp, pmd;
+
+	do {
+		next = pmd_addr_end(addr, end);
+		pmdp = pmd_offset(pudp, addr);
+		pmd = READ_ONCE(*pmdp);
+		if (pmd_none(pmd))
+			continue;
+
+		WARN_ON(!pmd_present(pmd));
+		if (pmd_sect(pmd)) {
+			pmd_clear(pmdp);
+
+			/*
+			 * One TLBI should be sufficient here as the PMD_SIZE
+			 * range is mapped with a single block entry.
+			 */
+			flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+			if (free_mapped)
+				free_hotplug_page_range(pmd_page(pmd),
+							PMD_SIZE);
+			continue;
+		}
+		WARN_ON(!pmd_table(pmd));
+		unmap_hotplug_pte_range(pmdp, addr, next, free_mapped);
+	} while (addr = next, addr < end);
+}
+
+static void unmap_hotplug_pud_range(pgd_t *pgdp, unsigned long addr,
+				    unsigned long end, bool free_mapped)
+{
+	unsigned long next;
+	pud_t *pudp, pud;
+
+	do {
+		next = pud_addr_end(addr, end);
+		pudp = pud_offset(pgdp, addr);
+		pud = READ_ONCE(*pudp);
+		if (pud_none(pud))
+			continue;
+
+		WARN_ON(!pud_present(pud));
+		if (pud_sect(pud)) {
+			pud_clear(pudp);
+
+			/*
+			 * One TLBI should be sufficient here as the PUD_SIZE
+			 * range is mapped with a single block entry.
+			 */
+			flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+			if (free_mapped)
+				free_hotplug_page_range(pud_page(pud),
+							PUD_SIZE);
+			continue;
+		}
+		WARN_ON(!pud_table(pud));
+		unmap_hotplug_pmd_range(pudp, addr, next, free_mapped);
+	} while (addr = next, addr < end);
+}
+
+static void unmap_hotplug_range(unsigned long addr, unsigned long end,
+				bool free_mapped)
+{
+	unsigned long next;
+	pgd_t *pgdp, pgd;
+
+	do {
+		next = pgd_addr_end(addr, end);
+		pgdp = pgd_offset_k(addr);
+		pgd = READ_ONCE(*pgdp);
+		if (pgd_none(pgd))
+			continue;
+
+		WARN_ON(!pgd_present(pgd));
+		unmap_hotplug_pud_range(pgdp, addr, next, free_mapped);
+	} while (addr = next, addr < end);
+}
+
+static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
+				 unsigned long end, unsigned long floor,
+				 unsigned long ceiling)
+{
+	pte_t *ptep, pte;
+	unsigned long i, start = addr;
+
+	do {
+		ptep = pte_offset_kernel(pmdp, addr);
+		pte = READ_ONCE(*ptep);
+
+		/*
+		 * This is just a sanity check here which verifies that
+		 * pte clearing has been done by earlier unmap loops.
+		 */
+		WARN_ON(!pte_none(pte));
+	} while (addr += PAGE_SIZE, addr < end);
+
+	if (!pgtable_range_aligned(start, end, floor, ceiling, PMD_MASK))
+		return;
+
+	/*
+	 * Check whether we can free the pte page if the rest of the
+	 * entries are empty. Overlap with other regions have been
+	 * handled by the floor/ceiling check.
+	 */
+	ptep = pte_offset_kernel(pmdp, 0UL);
+	for (i = 0; i < PTRS_PER_PTE; i++) {
+		if (!pte_none(READ_ONCE(ptep[i])))
+			return;
+	}
+
+	pmd_clear(pmdp);
+	__flush_tlb_kernel_pgtable(start);
+	free_hotplug_pgtable_page(virt_to_page(ptep));
+}
+
+static void free_empty_pmd_table(pud_t *pudp, unsigned long addr,
+				 unsigned long end, unsigned long floor,
+				 unsigned long ceiling)
+{
+	pmd_t *pmdp, pmd;
+	unsigned long i, next, start = addr;
+
+	do {
+		next = pmd_addr_end(addr, end);
+		pmdp = pmd_offset(pudp, addr);
+		pmd = READ_ONCE(*pmdp);
+		if (pmd_none(pmd))
+			continue;
+
+		WARN_ON(!pmd_present(pmd) || !pmd_table(pmd) || pmd_sect(pmd));
+		free_empty_pte_table(pmdp, addr, next, floor, ceiling);
+	} while (addr = next, addr < end);
+
+	if (CONFIG_PGTABLE_LEVELS <= 2)
+		return;
+
+	if (!pgtable_range_aligned(start, end, floor, ceiling, PUD_MASK))
+		return;
+
+	/*
+	 * Check whether we can free the pmd page if the rest of the
+	 * entries are empty. Overlap with other regions have been
+	 * handled by the floor/ceiling check.
+	 */
+	pmdp = pmd_offset(pudp, 0UL);
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		if (!pmd_none(READ_ONCE(pmdp[i])))
+			return;
+	}
+
+	pud_clear(pudp);
+	__flush_tlb_kernel_pgtable(start);
+	free_hotplug_pgtable_page(virt_to_page(pmdp));
+}
+
+static void free_empty_pud_table(pgd_t *pgdp, unsigned long addr,
+				 unsigned long end, unsigned long floor,
+				 unsigned long ceiling)
+{
+	pud_t *pudp, pud;
+	unsigned long i, next, start = addr;
+
+	do {
+		next = pud_addr_end(addr, end);
+		pudp = pud_offset(pgdp, addr);
+		pud = READ_ONCE(*pudp);
+		if (pud_none(pud))
+			continue;
+
+		WARN_ON(!pud_present(pud) || !pud_table(pud) || pud_sect(pud));
+		free_empty_pmd_table(pudp, addr, next, floor, ceiling);
+	} while (addr = next, addr < end);
+
+	if (CONFIG_PGTABLE_LEVELS <= 3)
+		return;
+
+	if (!pgtable_range_aligned(start, end, floor, ceiling, PGDIR_MASK))
+		return;
+
+	/*
+	 * Check whether we can free the pud page if the rest of the
+	 * entries are empty. Overlap with other regions have been
+	 * handled by the floor/ceiling check.
+	 */
+	pudp = pud_offset(pgdp, 0UL);
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		if (!pud_none(READ_ONCE(pudp[i])))
+			return;
+	}
+
+	pgd_clear(pgdp);
+	__flush_tlb_kernel_pgtable(start);
+	free_hotplug_pgtable_page(virt_to_page(pudp));
+}
+
+static void free_empty_tables(unsigned long addr, unsigned long end,
+			      unsigned long floor, unsigned long ceiling)
+{
+	unsigned long next;
+	pgd_t *pgdp, pgd;
+
+	do {
+		next = pgd_addr_end(addr, end);
+		pgdp = pgd_offset_k(addr);
+		pgd = READ_ONCE(*pgdp);
+		if (pgd_none(pgd))
+			continue;
+
+		WARN_ON(!pgd_present(pgd));
+		free_empty_pud_table(pgdp, addr, next, floor, ceiling);
+	} while (addr = next, addr < end);
+}
+#endif
+
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 #if !ARM64_SWAPPER_USES_SECTION_MAPS
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
@@ -771,6 +1041,12 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 void vmemmap_free(unsigned long start, unsigned long end,
 		struct vmem_altmap *altmap)
 {
+#ifdef CONFIG_MEMORY_HOTPLUG
+	WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
+
+	unmap_hotplug_range(start, end, true);
+	free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END);
+#endif
 }
 #endif	/* CONFIG_SPARSEMEM_VMEMMAP */
 
@@ -1049,10 +1325,21 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
+static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
+{
+	unsigned long end = start + size;
+
+	WARN_ON(pgdir != init_mm.pgd);
+	WARN_ON((start < PAGE_OFFSET) || (end > PAGE_END));
+
+	unmap_hotplug_range(start, end, false);
+	free_empty_tables(start, end, PAGE_OFFSET, PAGE_END);
+}
+
 int arch_add_memory(int nid, u64 start, u64 size,
 			struct mhp_restrictions *restrictions)
 {
-	int flags = 0;
+	int ret, flags = 0;
 
 	if (rodata_full || debug_pagealloc_enabled())
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
@@ -1062,22 +1349,59 @@ int arch_add_memory(int nid, u64 start, u64 size,
 
 	memblock_clear_nomap(start, size);
 
-	return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
+	ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
 			   restrictions);
+	if (ret)
+		__remove_pgd_mapping(swapper_pg_dir,
+				     __phys_to_virt(start), size);
+	return ret;
 }
+
 void arch_remove_memory(int nid, u64 start, u64 size,
 			struct vmem_altmap *altmap)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
-	/*
-	 * FIXME: Cleanup page tables (also in arch_add_memory() in case
-	 * adding fails). Until then, this function should only be used
-	 * during memory hotplug (adding memory), not for memory
-	 * unplug. ARCH_ENABLE_MEMORY_HOTREMOVE must not be
-	 * unlocked yet.
-	 */
 	__remove_pages(start_pfn, nr_pages, altmap);
+	__remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size);
+}
+
+/*
+ * This memory hotplug notifier helps prevent boot memory from being
+ * inadvertently removed as it blocks pfn range offlining process in
+ * __offline_pages(). Hence this prevents both offlining as well as
+ * removal process for boot memory which is initially always online.
+ * In future if and when boot memory could be removed, this notifier
+ * should be dropped and free_hotplug_page_range() should handle any
+ * reserved pages allocated during boot.
+ */
+static int prevent_bootmem_remove_notifier(struct notifier_block *nb,
+					   unsigned long action, void *data)
+{
+	struct mem_section *ms;
+	struct memory_notify *arg = data;
+	unsigned long end_pfn = arg->start_pfn + arg->nr_pages;
+	unsigned long pfn = arg->start_pfn;
+
+	if (action != MEM_GOING_OFFLINE)
+		return NOTIFY_OK;
+
+	for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+		ms = __pfn_to_section(pfn);
+		if (early_section(ms))
+			return NOTIFY_BAD;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block prevent_bootmem_remove_nb = {
+	.notifier_call = prevent_bootmem_remove_notifier,
+};
+
+static int __init prevent_bootmem_remove_init(void)
+{
+	return register_memory_notifier(&prevent_bootmem_remove_nb);
 }
+device_initcall(prevent_bootmem_remove_init);
 #endif
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH V12 2/2] arm64/mm: Enable memory hot remove
  2020-01-16  6:45 ` [PATCH V12 2/2] arm64/mm: Enable memory hot remove Anshuman Khandual
@ 2020-01-21 12:02   ` Catalin Marinas
  0 siblings, 0 replies; 7+ messages in thread
From: Catalin Marinas @ 2020-01-21 12:02 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-mm, linux-kernel, linux-arm-kernel, akpm, will,
	Mark Rutland, david, cai, logang, cpandya, arunks,
	dan.j.williams, mgorman, osalvador, Ard Biesheuvel, Steve Capper,
	broonie, Valentin Schneider, Robin Murphy, Steven Price,
	Suzuki Poulose, ira.weiny

On Thu, Jan 16, 2020 at 12:15:35PM +0530, Anshuman Khandual wrote:
> The arch code for hot-remove must tear down portions of the linear map and
> vmemmap corresponding to memory being removed. In both cases the page
> tables mapping these regions must be freed, and when sparse vmemmap is in
> use the memory backing the vmemmap must also be freed.
>
> This patch adds unmap_hotplug_range() and free_empty_tables() helpers which
> can be used to tear down either region, and calls them from vmemmap_free() and
> __remove_pgd_mapping(). The free_mapped argument determines whether the
> backing memory will be freed.
>
> It makes two distinct passes over the kernel page table. In the first pass,
> unmap_hotplug_range() unmaps each mapped leaf entry, invalidates the
> applicable TLB entries and frees the backing memory if required (vmemmap). In
> the second pass, free_empty_tables() looks for empty page table sections
> whose page table page can be unmapped, TLB invalidated and freed.
>
> While freeing intermediate level page table pages, bail out if any of their
> entries are still valid. This can happen with a partially filled kernel page
> table, either from a previously failed memory hot add attempt or while
> removing an address range which does not span the entire page table page
> range.
>
> The vmemmap region may share levels of table with the vmalloc region.
> There can be a conflict between hot remove freeing page table pages and
> a concurrent vmalloc() walking the kernel page table. This conflict cannot
> be solved simply by taking the init_mm ptl because of the existing locking
> scheme in vmalloc(). So free_empty_tables() implements a floor and ceiling
> method, borrowed from the user page table tear down in free_pgd_range(),
> which skips freeing a page table page if the intermediate address range is
> not aligned or the floor/ceiling limits do not own the entire page table
> page.
>
> Boot memory on arm64 cannot be removed. Hence this registers a new memory
> hotplug notifier which prevents boot memory offlining and its removal.
>
> While here, update arch_add_memory() to handle __add_pages() failures by
> unmapping the recently added kernel linear mapping. Now enable memory hot
> remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.
>
> This implementation is overall inspired by the kernel page table tear down
> procedure on the x86 architecture and the user page table tear down method.
>
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

With the memory notifier added, my reviewed-by still stands.

--
Catalin

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH V12 1/2] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  2020-01-16  6:45 ` [PATCH V12 1/2] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump Anshuman Khandual
@ 2020-01-21 12:02   ` Catalin Marinas
  0 siblings, 0 replies; 7+ messages in thread
From: Catalin Marinas @ 2020-01-21 12:02 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-mm, linux-kernel, linux-arm-kernel, akpm, will,
	Mark Rutland, david, cai, logang, cpandya, arunks,
	dan.j.williams, mgorman, osalvador, Ard Biesheuvel, Steve Capper,
	broonie, Valentin Schneider, Robin Murphy, Steven Price,
	Suzuki Poulose, ira.weiny

On Thu, Jan 16, 2020 at 12:15:34PM +0530, Anshuman Khandual wrote:
> The arm64 page table dump code can race with concurrent modification of the
> kernel page tables. When leaf entries are modified concurrently, the dump
> code may log stale or inconsistent information for a VA range, but this is
> otherwise not harmful.
>
> When intermediate levels of table are freed, the dump code will continue to
> use memory which has been freed and potentially reallocated for another
> purpose. In such cases, the dump code may dereference bogus addresses,
> leading to a number of potential problems.
>
> Intermediate levels of table may be freed during memory hot-remove,
> which will be enabled by a subsequent patch. To avoid racing with
> this, take the memory hotplug lock when walking the kernel page table.
>
> Acked-by: David Hildenbrand <david@redhat.com>
> Acked-by: Mark Rutland <mark.rutland@arm.com>
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>

Acked-by: Catalin Marinas <catalin.marinas@arm.com>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH V12 0/2] arm64/mm: Enable memory hot remove
  2020-01-16  6:45 [PATCH V12 0/2] arm64/mm: Enable memory hot remove Anshuman Khandual
  2020-01-16  6:45 ` [PATCH V12 1/2] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump Anshuman Khandual
  2020-01-16  6:45 ` [PATCH V12 2/2] arm64/mm: Enable memory hot remove Anshuman Khandual
@ 2020-01-21 15:18 ` Will Deacon
  2020-01-22  3:35   ` Anshuman Khandual
  2 siblings, 1 reply; 7+ messages in thread
From: Will Deacon @ 2020-01-21 15:18 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	mark.rutland, david, cai, logang, cpandya, arunks,
	dan.j.williams, mgorman, osalvador, ard.biesheuvel, steve.capper,
	broonie, valentin.schneider, Robin.Murphy, steven.price,
	suzuki.poulose, ira.weiny

On Thu, Jan 16, 2020 at 12:15:33PM +0530, Anshuman Khandual wrote:
> This series enables memory hot remove functionality on the arm64 platform. It
> is based on Linux 5.5-rc6 and particularly deals with a problem that arises
> when an attempt is made to remove boot memory.

Unfortunately, this results in a conflict with mainline since the arm64
-next branches are based on -rc3 and there was a fix merged after that
(feee6b298916 ("mm/memory_hotplug: shrink zones when offlining memory"))
which changes the type of __remove_pages().

So I think I'll leave this for 5.7.

Will

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH V12 0/2] arm64/mm: Enable memory hot remove
  2020-01-21 15:18 ` [PATCH V12 0/2] " Will Deacon
@ 2020-01-22  3:35   ` Anshuman Khandual
  0 siblings, 0 replies; 7+ messages in thread
From: Anshuman Khandual @ 2020-01-22  3:35 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	mark.rutland, david, cai, logang, cpandya, arunks,
	dan.j.williams, mgorman, osalvador, ard.biesheuvel, steve.capper,
	broonie, valentin.schneider, Robin.Murphy, steven.price,
	suzuki.poulose, ira.weiny



On 01/21/2020 08:48 PM, Will Deacon wrote:
> On Thu, Jan 16, 2020 at 12:15:33PM +0530, Anshuman Khandual wrote:
>> This series enables memory hot remove functionality on the arm64 platform. It
>> is based on Linux 5.5-rc6 and particularly deals with a problem that arises
>> when an attempt is made to remove boot memory.

Hello Will,

> 
> Unfortunately, this results in a conflict with mainline since the arm64
> -next branches are based on -rc3 and there was a fix merged after that
> (feee6b298916 ("mm/memory_hotplug: shrink zones when offlining memory"))
> which changes the type of __remove_pages().

Right, that fix went in last couple of weeks.

> 
> So I think I'll leave this for 5.7.

Just wondering if there is any chance this can still make it into 5.6,
or is it already too late?

> 
> Will
> 

- Anshuman

^ permalink raw reply	[flat|nested] 7+ messages in thread
