linux-kernel.vger.kernel.org archive mirror
* [PATCH V6 RESEND 0/3] arm64/mm: Enable memory hot remove
@ 2019-07-15  6:17 Anshuman Khandual
  2019-07-15  6:17 ` [PATCH V6 RESEND 1/3] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory() Anshuman Khandual
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Anshuman Khandual @ 2019-07-15  6:17 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon
  Cc: mark.rutland, mhocko, ira.weiny, david, cai, logang, james.morse,
	cpandya, arunks, dan.j.williams, mgorman, osalvador,
	ard.biesheuvel, steve.capper

This series enables memory hot remove on arm64 after fixing a memblock
removal ordering problem in generic try_remove_memory() and a possible
arm64 platform specific kernel page table race condition. This series
is based on linux-next (next-20190712).

Concurrent vmalloc() and hot-remove conflict:

As pointed out earlier on the v5 thread [2], there can be a potential conflict
between a concurrent vmalloc() and a memory hot-remove operation. This can be
solved, or at least avoided, in several possible ways. The problem here is
caused by inadequate locking in vmalloc() which protects installation of a
page table page but not the walk or the leaf entry modification.

Option 1: Making locking in vmalloc() adequate

Current locking scheme protects installation of page table pages but not the
page table walk or leaf entry creation which can conflict with hot-remove.
This scheme is sufficient for now as vmalloc() works on mutually exclusive
ranges which can proceed concurrently only if their shared page table pages
can be created while inside the lock. This achieves a performance improvement
which would be compromised if the entire vmalloc() operation (even with some
optimization) had to be completed under a lock.

Option 2: Making sure hot-remove does not happen during vmalloc()

Take mem_hotplug_lock in read mode through [get|put]_online_mems() constructs
for the entire duration of vmalloc(). It protects from concurrent memory hot
remove operation and does not add any significant overhead to other concurrent
vmalloc() threads. It solves the problem in the right way, though it requires
extending the usage of mem_hotplug_lock into generic MM.

Option 3: Memory hot-remove does not free (conflicting) page table pages

Do not free page table pages (if any) for vmemmap mappings after unmapping
their virtual range. The only downside here is that some page table pages might
remain empty and unused until next memory hot-add operation of the same memory
range.

Option 4: Don't let vmalloc and vmemmap share intermediate page table pages

The conflict does not arise if the vmalloc and vmemmap ranges do not share kernel
page table pages to start with. If such placement can be ensured in the platform's
kernel virtual address layout, this problem can be successfully avoided.

There are two generic solutions (Options 1 and 2) and two platform specific
solutions (Options 3 and 4). This series goes with Option 3, which requires
minimal changes while remaining self-contained inside the hot-remove path.

Testing:

Memory hot remove has been tested on arm64 for the 4K, 16K and 64K page size
configs with all possible CONFIG_ARM64_VA_BITS and CONFIG_PGTABLE_LEVELS
combinations. It is only build tested on non-arm64 platforms.

Changes in V6:

- Implemented most of the suggestions from Mark Rutland
- Added <linux/memory_hotplug.h> in ptdump
- remove_pagetable() now has two distinct passes over the kernel page table
- First pass unmap_hotplug_range() removes leaf level entries at all levels
- Second pass free_empty_tables() removes empty page table pages
- Kernel page table lock has been dropped completely
- vmemmap_free() does not call free_empty_tables() to avoid conflict with vmalloc()
- All address range scanning loops are converted to do {} while() form
- Added 'unsigned long end' in __remove_pgd_mapping()
- Callers need not provide starting pointer argument to free_[pte|pmd|pud]_table() 
- Drop the starting pointer argument from free_[pte|pmd|pud]_table() functions
- Fetching pxxp[i] in free_[pte|pmd|pud]_table() is wrapped around in READ_ONCE()
- free_[pte|pmd|pud]_table() now computes starting pointer inside the function
- Fixed TLB handling while freeing huge page section mappings at PMD or PUD level
- Added WARN_ON(!page) in free_hotplug_page_range()
- Added WARN_ON(![pmd|pud]_table(pmd|pud)) when there is no section mapping

- [PATCH 1/3] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()
- Requested earlier for a separate merge (https://patchwork.kernel.org/patch/10986599/)
- s/__remove_memory/try_remove_memory in the subject line
- s/arch_remove_memory/memblock_[free|remove] in the subject line
- A small change in the commit message as re-order happens now for memblock remove
  functions not for arch_remove_memory()

Changes in V5: (https://lkml.org/lkml/2019/5/29/218)

- Have some agreement [1] over using memory_hotplug_lock for arm64 ptdump
- Change 7ba36eccb3f8 ("arm64/mm: Inhibit huge-vmap with ptdump") already merged
- Dropped the above patch from this series
- Fixed indentation problem in arch_[add|remove]_memory() as per David
- Collected all new Acked-by tags
 
Changes in V4: (https://lkml.org/lkml/2019/5/20/19)

- Implemented most of the suggestions from Mark Rutland
- Interchanged patch [PATCH 2/4] <---> [PATCH 3/4] and updated commit message
- Moved CONFIG_PGTABLE_LEVELS inside free_[pud|pmd]_table()
- Used READ_ONCE() in missing instances while accessing page table entries
- s/p???_present()/p???_none() for checking valid kernel page table entries
- WARN_ON() when an entry is !p???_none() and !p???_present() at the same time
- Updated memory hot-remove commit message with additional details as suggested
- Rebased the series on 5.2-rc1 with hotplug changes from David and Michal Hocko
- Collected all new Acked-by tags

Changes in V3: (https://lkml.org/lkml/2019/5/14/197)
 
- Implemented most of the suggestions from Mark Rutland for remove_pagetable()
- Fixed applicable PGTABLE_LEVEL wrappers around pgtable page freeing functions
- Replaced 'direct' with 'sparse_vmap' in remove_pagetable() with inverted polarity
- Changed pointer names ('p' at end) and removed tmp from iterations
- Perform intermediate TLB invalidation while clearing pgtable entries
- Dropped flush_tlb_kernel_range() in remove_pagetable()
- Added flush_tlb_kernel_range() in remove_pte_table() instead
- Renamed page freeing functions for pgtable page and mapped pages
- Used page range size instead of order while freeing mapped or pgtable pages
- Removed all PageReserved() handling while freeing mapped or pgtable pages
- Replaced XXX_index() with XXX_offset() while walking the kernel page table
- Used READ_ONCE() while fetching individual pgtable entries
- Took init_mm.page_table_lock for the whole operation instead of just while changing an entry
- Dropped previously added [pmd|pud]_index() which are not required anymore
- Added a new patch to protect kernel page table race condition for ptdump
- Added a new patch from Mark Rutland to prevent huge-vmap with ptdump

Changes in V2: (https://lkml.org/lkml/2019/4/14/5)

- Added all received review and ack tags
- Split the series from ZONE_DEVICE enablement for better review
- Moved memblock re-order patch to the front as per Robin Murphy
- Updated commit message on memblock re-order patch per Michal Hocko
- Dropped [pmd|pud]_large() definitions
- Used existing [pmd|pud]_sect() instead of earlier [pmd|pud]_large()
- Removed __meminit and __ref tags as per Oscar Salvador
- Dropped unnecessary 'ret' init in arch_add_memory() per Robin Murphy
- Skipped calling into pgtable_page_dtor() for linear mapping page table
  pages and updated all relevant functions

Changes in V1: (https://lkml.org/lkml/2019/4/3/28)

References:

[1] https://lkml.org/lkml/2019/5/28/584
[2] https://lkml.org/lkml/2019/6/11/709

Anshuman Khandual (3):
  mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()
  arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  arm64/mm: Enable memory hot remove

 arch/arm64/Kconfig             |   3 +
 arch/arm64/mm/mmu.c            | 290 +++++++++++++++++++++++++++++++++++++++--
 arch/arm64/mm/ptdump_debugfs.c |   4 +
 mm/memory_hotplug.c            |   4 +-
 4 files changed, 290 insertions(+), 11 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH V6 RESEND 1/3] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()
  2019-07-15  6:17 [PATCH V6 RESEND 0/3] arm64/mm: Enable memory hot remove Anshuman Khandual
@ 2019-07-15  6:17 ` Anshuman Khandual
  2019-07-15  6:17 ` [PATCH V6 RESEND 2/3] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump Anshuman Khandual
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Anshuman Khandual @ 2019-07-15  6:17 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon
  Cc: mark.rutland, mhocko, ira.weiny, david, cai, logang, james.morse,
	cpandya, arunks, dan.j.williams, mgorman, osalvador,
	ard.biesheuvel, steve.capper

Memory hot remove uses get_nid_for_pfn() while tearing down the linked sysfs
entries between a memory block and a node. It first checks pfn validity with
pfn_valid_within() before fetching the nid. With CONFIG_HOLES_IN_ZONE enabled
(as it is on arm64), pfn_valid_within() calls pfn_valid().

pfn_valid() is an arch implementation on arm64 (CONFIG_HAVE_ARCH_PFN_VALID)
which scans all mapped memblock regions with memblock_is_map_memory(). This
creates a problem in the memory hot remove path, which has already removed the
given memory range from the memory block with memblock_[remove|free]() before
arriving at unregister_mem_sect_under_nodes(). Hence get_nid_for_pfn() returns
-1, skipping the subsequent sysfs_remove_link() calls and leaving the node <->
memory block sysfs entries in place. A subsequent memory add operation then
hits a BUG_ON() because of the stale sysfs entries.

[   62.007176] NUMA: Unknown node for memory at 0x680000000, assuming node 0
[   62.052517] ------------[ cut here ]------------
[   62.053211] kernel BUG at mm/memory_hotplug.c:1143!
[   62.053868] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[   62.054589] Modules linked in:
[   62.054999] CPU: 19 PID: 3275 Comm: bash Not tainted 5.1.0-rc2-00004-g28cea40b2683 #41
[   62.056274] Hardware name: linux,dummy-virt (DT)
[   62.057166] pstate: 40400005 (nZcv daif +PAN -UAO)
[   62.058083] pc : add_memory_resource+0x1cc/0x1d8
[   62.058961] lr : add_memory_resource+0x10c/0x1d8
[   62.059842] sp : ffff0000168b3ce0
[   62.060477] x29: ffff0000168b3ce0 x28: ffff8005db546c00
[   62.061501] x27: 0000000000000000 x26: 0000000000000000
[   62.062509] x25: ffff0000111ef000 x24: ffff0000111ef5d0
[   62.063520] x23: 0000000000000000 x22: 00000006bfffffff
[   62.064540] x21: 00000000ffffffef x20: 00000000006c0000
[   62.065558] x19: 0000000000680000 x18: 0000000000000024
[   62.066566] x17: 0000000000000000 x16: 0000000000000000
[   62.067579] x15: ffffffffffffffff x14: ffff8005e412e890
[   62.068588] x13: ffff8005d6b105d8 x12: 0000000000000000
[   62.069610] x11: ffff8005d6b10490 x10: 0000000000000040
[   62.070615] x9 : ffff8005e412e898 x8 : ffff8005e412e890
[   62.071631] x7 : ffff8005d6b105d8 x6 : ffff8005db546c00
[   62.072640] x5 : 0000000000000001 x4 : 0000000000000002
[   62.073654] x3 : ffff8005d7049480 x2 : 0000000000000002
[   62.074666] x1 : 0000000000000003 x0 : 00000000ffffffef
[   62.075685] Process bash (pid: 3275, stack limit = 0x00000000d754280f)
[   62.076930] Call trace:
[   62.077411]  add_memory_resource+0x1cc/0x1d8
[   62.078227]  __add_memory+0x70/0xa8
[   62.078901]  probe_store+0xa4/0xc8
[   62.079561]  dev_attr_store+0x18/0x28
[   62.080270]  sysfs_kf_write+0x40/0x58
[   62.080992]  kernfs_fop_write+0xcc/0x1d8
[   62.081744]  __vfs_write+0x18/0x40
[   62.082400]  vfs_write+0xa4/0x1b0
[   62.083037]  ksys_write+0x5c/0xc0
[   62.083681]  __arm64_sys_write+0x18/0x20
[   62.084432]  el0_svc_handler+0x88/0x100
[   62.085177]  el0_svc+0x8/0xc

Re-ordering memblock_[free|remove]() after arch_remove_memory() solves the
problem on arm64, as pfn_valid() then behaves correctly and returns true
because the memblock region for the address range still exists.
arch_remove_memory() removes the applicable memory sections from the zone with
__remove_pages() and tears down the kernel linear mapping. Removing the
memblock regions afterwards is safe because there is no other memblock
(bootmem) allocator user that late, so nobody is going to allocate from the
removed range just to blow up later. Also, nobody should be using the bootmem
allocated range, else we wouldn't allow removing it. So the reordering is
indeed safe.

Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 mm/memory_hotplug.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b9ba5b8..1635a89 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1772,13 +1772,13 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
 
 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");
-	memblock_free(start, size);
-	memblock_remove(start, size);
 
 	/* remove memory block devices before removing memory */
 	remove_memory_block_devices(start, size);
 
 	arch_remove_memory(nid, start, size, NULL);
+	memblock_free(start, size);
+	memblock_remove(start, size);
 	__release_memory_resource(start, size);
 
 	try_offline_node(nid);
-- 
2.7.4



* [PATCH V6 RESEND 2/3] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  2019-07-15  6:17 [PATCH V6 RESEND 0/3] arm64/mm: Enable memory hot remove Anshuman Khandual
  2019-07-15  6:17 ` [PATCH V6 RESEND 1/3] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory() Anshuman Khandual
@ 2019-07-15  6:17 ` Anshuman Khandual
  2019-07-15  6:17 ` [PATCH V6 RESEND 3/3] arm64/mm: Enable memory hot remove Anshuman Khandual
  2019-07-23 10:56 ` [PATCH V6 RESEND 0/3] " Mark Rutland
  3 siblings, 0 replies; 8+ messages in thread
From: Anshuman Khandual @ 2019-07-15  6:17 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon
  Cc: mark.rutland, mhocko, ira.weiny, david, cai, logang, james.morse,
	cpandya, arunks, dan.j.williams, mgorman, osalvador,
	ard.biesheuvel, steve.capper

The arm64 page table dump code can race with concurrent modification of the
kernel page tables. When leaf entries are modified concurrently, the dump
code may log stale or inconsistent information for a VA range, but this is
otherwise not harmful.

When intermediate levels of table are freed, the dump code will continue to
use memory which has been freed and potentially reallocated for another
purpose. In such cases, the dump code may dereference bogus addresses,
leading to a number of potential problems.

Intermediate levels of table may be freed during memory hot-remove,
which will be enabled by a subsequent patch. To avoid racing with
this, take the memory hotplug lock when walking the kernel page table.

Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/mm/ptdump_debugfs.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
index 064163f..b5eebc8 100644
--- a/arch/arm64/mm/ptdump_debugfs.c
+++ b/arch/arm64/mm/ptdump_debugfs.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/debugfs.h>
+#include <linux/memory_hotplug.h>
 #include <linux/seq_file.h>
 
 #include <asm/ptdump.h>
@@ -7,7 +8,10 @@
 static int ptdump_show(struct seq_file *m, void *v)
 {
 	struct ptdump_info *info = m->private;
+
+	get_online_mems();
 	ptdump_walk_pgd(m, info);
+	put_online_mems();
 	return 0;
 }
 DEFINE_SHOW_ATTRIBUTE(ptdump);
-- 
2.7.4



* [PATCH V6 RESEND 3/3] arm64/mm: Enable memory hot remove
  2019-07-15  6:17 [PATCH V6 RESEND 0/3] arm64/mm: Enable memory hot remove Anshuman Khandual
  2019-07-15  6:17 ` [PATCH V6 RESEND 1/3] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory() Anshuman Khandual
  2019-07-15  6:17 ` [PATCH V6 RESEND 2/3] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump Anshuman Khandual
@ 2019-07-15  6:17 ` Anshuman Khandual
  2019-07-23 10:56 ` [PATCH V6 RESEND 0/3] " Mark Rutland
  3 siblings, 0 replies; 8+ messages in thread
From: Anshuman Khandual @ 2019-07-15  6:17 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon
  Cc: mark.rutland, mhocko, ira.weiny, david, cai, logang, james.morse,
	cpandya, arunks, dan.j.williams, mgorman, osalvador,
	ard.biesheuvel, steve.capper

The arch code for hot-remove must tear down portions of the linear map and
vmemmap corresponding to memory being removed. In both cases the page
tables mapping these regions must be freed, and when sparse vmemmap is in
use the memory backing the vmemmap must also be freed.

This patch adds a new remove_pagetable() helper which can be used to tear
down either region, and calls it from vmemmap_free() and
__remove_pgd_mapping(). The sparse_vmap argument determines whether the
backing memory will be freed.

remove_pagetable() makes two distinct passes over the kernel page table.
In the first pass it unmaps, invalidates applicable TLB cache and frees
backing memory if required (vmemmap) for each mapped leaf entry. In the
second pass it looks for empty page table sections whose page table page
can be unmapped, TLB invalidated and freed.

While freeing intermediate level page table pages, bail out if any of their
entries are still valid. This can happen for a partially filled kernel page
table, either from a previously failed memory hot add attempt or while
removing an address range which does not span the entire page table page
range.

The vmemmap region may share levels of table with the vmalloc region.
There can be conflicts between hot remove freeing page table pages and a
concurrent vmalloc() walking the kernel page table. This conflict cannot be
solved simply by taking the init_mm ptl because of the existing locking scheme
in vmalloc(). Hence, unlike the linear mapping, skip freeing page table pages
while tearing down the vmemmap mapping.

While here, update arch_add_memory() to handle __add_pages() failures by
unmapping the recently added kernel linear mapping. Now enable memory hot
remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.

This implementation is broadly inspired by the kernel page table tear down
procedure on the x86 architecture.

Acked-by: Steve Capper <steve.capper@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/Kconfig  |   3 +
 arch/arm64/mm/mmu.c | 290 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 284 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7442edb..b94daec 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -273,6 +273,9 @@ config ZONE_DMA32
 config ARCH_ENABLE_MEMORY_HOTPLUG
 	def_bool y
 
+config ARCH_ENABLE_MEMORY_HOTREMOVE
+	def_bool y
+
 config SMP
 	def_bool y
 
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 750a69d..282a4b2 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -722,6 +722,250 @@ int kern_addr_valid(unsigned long addr)
 
 	return pfn_valid(pte_pfn(pte));
 }
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void free_hotplug_page_range(struct page *page, size_t size)
+{
+	WARN_ON(!page || PageReserved(page));
+	free_pages((unsigned long)page_address(page), get_order(size));
+}
+
+static void free_hotplug_pgtable_page(struct page *page)
+{
+	free_hotplug_page_range(page, PAGE_SIZE);
+}
+
+static void free_pte_table(pmd_t *pmdp, unsigned long addr)
+{
+	struct page *page;
+	pte_t *ptep;
+	int i;
+
+	ptep = pte_offset_kernel(pmdp, 0UL);
+	for (i = 0; i < PTRS_PER_PTE; i++) {
+		if (!pte_none(READ_ONCE(ptep[i])))
+			return;
+	}
+
+	page = pmd_page(READ_ONCE(*pmdp));
+	pmd_clear(pmdp);
+	__flush_tlb_kernel_pgtable(addr);
+	free_hotplug_pgtable_page(page);
+}
+
+static void free_pmd_table(pud_t *pudp, unsigned long addr)
+{
+	struct page *page;
+	pmd_t *pmdp;
+	int i;
+
+	if (CONFIG_PGTABLE_LEVELS <= 2)
+		return;
+
+	pmdp = pmd_offset(pudp, 0UL);
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		if (!pmd_none(READ_ONCE(pmdp[i])))
+			return;
+	}
+
+	page = pud_page(READ_ONCE(*pudp));
+	pud_clear(pudp);
+	__flush_tlb_kernel_pgtable(addr);
+	free_hotplug_pgtable_page(page);
+}
+
+static void free_pud_table(pgd_t *pgdp, unsigned long addr)
+{
+	struct page *page;
+	pud_t *pudp;
+	int i;
+
+	if (CONFIG_PGTABLE_LEVELS <= 3)
+		return;
+
+	pudp = pud_offset(pgdp, 0UL);
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		if (!pud_none(READ_ONCE(pudp[i])))
+			return;
+	}
+
+	page = pgd_page(READ_ONCE(*pgdp));
+	pgd_clear(pgdp);
+	__flush_tlb_kernel_pgtable(addr);
+	free_hotplug_pgtable_page(page);
+}
+
+static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
+				    unsigned long end, bool sparse_vmap)
+{
+	struct page *page;
+	pte_t *ptep, pte;
+
+	do {
+		ptep = pte_offset_kernel(pmdp, addr);
+		pte = READ_ONCE(*ptep);
+		if (pte_none(pte))
+			continue;
+
+		WARN_ON(!pte_present(pte));
+		page = sparse_vmap ? pte_page(pte) : NULL;
+		pte_clear(&init_mm, addr, ptep);
+		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+		if (sparse_vmap)
+			free_hotplug_page_range(page, PAGE_SIZE);
+	} while (addr += PAGE_SIZE, addr < end);
+}
+
+static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
+				    unsigned long end, bool sparse_vmap)
+{
+	unsigned long next;
+	struct page *page;
+	pmd_t *pmdp, pmd;
+
+	do {
+		next = pmd_addr_end(addr, end);
+		pmdp = pmd_offset(pudp, addr);
+		pmd = READ_ONCE(*pmdp);
+		if (pmd_none(pmd))
+			continue;
+
+		WARN_ON(!pmd_present(pmd));
+		if (pmd_sect(pmd)) {
+			page = sparse_vmap ? pmd_page(pmd) : NULL;
+			pmd_clear(pmdp);
+			flush_tlb_kernel_range(addr, next);
+			if (sparse_vmap)
+				free_hotplug_page_range(page, PMD_SIZE);
+			continue;
+		}
+		WARN_ON(!pmd_table(pmd));
+		unmap_hotplug_pte_range(pmdp, addr, next, sparse_vmap);
+	} while (addr = next, addr < end);
+}
+
+static void unmap_hotplug_pud_range(pgd_t *pgdp, unsigned long addr,
+				    unsigned long end, bool sparse_vmap)
+{
+	unsigned long next;
+	struct page *page;
+	pud_t *pudp, pud;
+
+	do {
+		next = pud_addr_end(addr, end);
+		pudp = pud_offset(pgdp, addr);
+		pud = READ_ONCE(*pudp);
+		if (pud_none(pud))
+			continue;
+
+		WARN_ON(!pud_present(pud));
+		if (pud_sect(pud)) {
+			page = sparse_vmap ? pud_page(pud) : NULL;
+			pud_clear(pudp);
+			flush_tlb_kernel_range(addr, next);
+			if (sparse_vmap)
+				free_hotplug_page_range(page, PUD_SIZE);
+			continue;
+		}
+		WARN_ON(!pud_table(pud));
+		unmap_hotplug_pmd_range(pudp, addr, next, sparse_vmap);
+	} while (addr = next, addr < end);
+}
+
+static void unmap_hotplug_range(unsigned long addr, unsigned long end,
+				bool sparse_vmap)
+{
+	unsigned long next;
+	pgd_t *pgdp, pgd;
+
+	do {
+		next = pgd_addr_end(addr, end);
+		pgdp = pgd_offset_k(addr);
+		pgd = READ_ONCE(*pgdp);
+		if (pgd_none(pgd))
+			continue;
+
+		WARN_ON(!pgd_present(pgd));
+		unmap_hotplug_pud_range(pgdp, addr, next, sparse_vmap);
+	} while (addr = next, addr < end);
+}
+
+static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
+				 unsigned long end)
+{
+	pte_t *ptep, pte;
+
+	do {
+		ptep = pte_offset_kernel(pmdp, addr);
+		pte = READ_ONCE(*ptep);
+		WARN_ON(!pte_none(pte));
+	} while (addr += PAGE_SIZE, addr < end);
+}
+
+static void free_empty_pmd_table(pud_t *pudp, unsigned long addr,
+				 unsigned long end)
+{
+	unsigned long next;
+	pmd_t *pmdp, pmd;
+
+	do {
+		next = pmd_addr_end(addr, end);
+		pmdp = pmd_offset(pudp, addr);
+		pmd = READ_ONCE(*pmdp);
+		if (pmd_none(pmd))
+			continue;
+
+		WARN_ON(!pmd_present(pmd) || !pmd_table(pmd) || pmd_sect(pmd));
+		free_empty_pte_table(pmdp, addr, next);
+		free_pte_table(pmdp, addr);
+	} while (addr = next, addr < end);
+}
+
+static void free_empty_pud_table(pgd_t *pgdp, unsigned long addr,
+				 unsigned long end)
+{
+	unsigned long next;
+	pud_t *pudp, pud;
+
+	do {
+		next = pud_addr_end(addr, end);
+		pudp = pud_offset(pgdp, addr);
+		pud = READ_ONCE(*pudp);
+		if (pud_none(pud))
+			continue;
+
+		WARN_ON(!pud_present(pud) || !pud_table(pud) || pud_sect(pud));
+		free_empty_pmd_table(pudp, addr, next);
+		free_pmd_table(pudp, addr);
+	} while (addr = next, addr < end);
+}
+
+static void free_empty_tables(unsigned long addr, unsigned long end)
+{
+	unsigned long next;
+	pgd_t *pgdp, pgd;
+
+	do {
+		next = pgd_addr_end(addr, end);
+		pgdp = pgd_offset_k(addr);
+		pgd = READ_ONCE(*pgdp);
+		if (pgd_none(pgd))
+			continue;
+
+		WARN_ON(!pgd_present(pgd));
+		free_empty_pud_table(pgdp, addr, next);
+		free_pud_table(pgdp, addr);
+	} while (addr = next, addr < end);
+}
+
+static void remove_pagetable(unsigned long start, unsigned long end,
+			     bool sparse_vmap)
+{
+	unmap_hotplug_range(start, end, sparse_vmap);
+	free_empty_tables(start, end);
+}
+#endif
+
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 #if !ARM64_SWAPPER_USES_SECTION_MAPS
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
@@ -769,6 +1013,27 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 void vmemmap_free(unsigned long start, unsigned long end,
 		struct vmem_altmap *altmap)
 {
+#ifdef CONFIG_MEMORY_HOTPLUG
+	/*
+	 * FIXME: We should have called remove_pagetable(start, end, true).
+	 * vmemmap and vmalloc virtual range might share intermediate kernel
+	 * page table entries. Removing vmemmap range page table pages here
+	 * can potentially conflict with a concurrent vmalloc() allocation.
+	 *
+	 * This is primarily because vmalloc() does not take init_mm ptl for
+	 * the entire page table walk and its modification. Instead it just
+	 * takes the lock while allocating and installing page table pages
+	 * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table
+	 * entry via memory hot remove can cause vmalloc() kernel page table
+	 * walk pointers to be invalid on the fly which can cause corruption
+	 * or worse, a crash.
+	 *
+	 * To avoid this problem, let's not free empty page table pages for
+	 * given vmemmap range being hot-removed. Just unmap and free the
+	 * range instead.
+	 */
+	unmap_hotplug_range(start, end, true);
+#endif
 }
 #endif	/* CONFIG_SPARSEMEM_VMEMMAP */
 
@@ -1060,10 +1325,18 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
+static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
+{
+	unsigned long end = start + size;
+
+	WARN_ON(pgdir != init_mm.pgd);
+	remove_pagetable(start, end, false);
+}
+
 int arch_add_memory(int nid, u64 start, u64 size,
 			struct mhp_restrictions *restrictions)
 {
-	int flags = 0;
+	int ret, flags = 0;
 
 	if (rodata_full || debug_pagealloc_enabled())
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
@@ -1071,9 +1344,14 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
 			     size, PAGE_KERNEL, __pgd_pgtable_alloc, flags);
 
-	return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
+	ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
 			   restrictions);
+	if (ret)
+		__remove_pgd_mapping(swapper_pg_dir,
+				     __phys_to_virt(start), size);
+	return ret;
 }
+
 void arch_remove_memory(int nid, u64 start, u64 size,
 			struct vmem_altmap *altmap)
 {
@@ -1081,14 +1359,8 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct zone *zone;
 
-	/*
-	 * FIXME: Cleanup page tables (also in arch_add_memory() in case
-	 * adding fails). Until then, this function should only be used
-	 * during memory hotplug (adding memory), not for memory
-	 * unplug. ARCH_ENABLE_MEMORY_HOTREMOVE must not be
-	 * unlocked yet.
-	 */
 	zone = page_zone(pfn_to_page(start_pfn));
 	__remove_pages(zone, start_pfn, nr_pages, altmap);
+	__remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size);
 }
 #endif
-- 
2.7.4



* Re: [PATCH V6 RESEND 0/3] arm64/mm: Enable memory hot remove
  2019-07-15  6:17 [PATCH V6 RESEND 0/3] arm64/mm: Enable memory hot remove Anshuman Khandual
                   ` (2 preceding siblings ...)
  2019-07-15  6:17 ` [PATCH V6 RESEND 3/3] arm64/mm: Enable memory hot remove Anshuman Khandual
@ 2019-07-23 10:56 ` Mark Rutland
  2019-07-24  6:58   ` Anshuman Khandual
  3 siblings, 1 reply; 8+ messages in thread
From: Mark Rutland @ 2019-07-23 10:56 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon, mhocko, ira.weiny, david, cai, logang, james.morse,
	cpandya, arunks, dan.j.williams, mgorman, osalvador,
	ard.biesheuvel, steve.capper

Hi Anshuman,

On Mon, Jul 15, 2019 at 11:47:47AM +0530, Anshuman Khandual wrote:
> This series enables memory hot remove on arm64 after fixing a memblock
> removal ordering problem in generic try_remove_memory() and a possible
> arm64 platform specific kernel page table race condition. This series
> is based on linux-next (next-20190712).
> 
> Concurrent vmalloc() and hot-remove conflict:
> 
> As pointed out earlier on the v5 thread [2], there can be a potential conflict
> between a concurrent vmalloc() and a memory hot-remove operation. This can be
> solved, or at least avoided, in several possible ways. The problem here is
> caused by inadequate locking in vmalloc() which protects installation of a
> page table page but not the walk or the leaf entry modification.
> 
> Option 1: Making locking in vmalloc() adequate
> 
> Current locking scheme protects installation of page table pages but not the
> page table walk or leaf entry creation which can conflict with hot-remove.
> This scheme is sufficient for now as vmalloc() works on mutually exclusive
> ranges which can proceed concurrently only if their shared page table pages
> can be created while inside the lock. This achieves a performance improvement
> which would be compromised if the entire vmalloc() operation (even with some
> optimization) had to be completed under a lock.
> 
> Option 2: Making sure hot-remove does not happen during vmalloc()
> 
> Take mem_hotplug_lock in read mode through [get|put]_online_mems() constructs
> for the entire duration of vmalloc(). It protects from concurrent memory hot
> remove operation and does not add any significant overhead to other concurrent
> vmalloc() threads. It solves the problem in the right way, though it requires
> extending the usage of mem_hotplug_lock into generic MM.
> 
> Option 3: Memory hot-remove does not free (conflicting) page table pages
> 
> Do not free page table pages (if any) for vmemmap mappings after unmapping
> their virtual range. The only downside here is that some page table pages might
> remain empty and unused until next memory hot-add operation of the same memory
> range.
> 
> Option 4: Don't let vmalloc and vmemmap share intermediate page table pages
> 
> The conflict does not arise if the vmalloc and vmemmap ranges do not share kernel
> page table pages to start with. If such placement can be ensured in the platform's
> kernel virtual address layout, this problem can be successfully avoided.
> 
> There are two generic solutions (Option 1 and 2) and two platform specific
> solutions (Options 2 and 3). This series has decided to go with (Option 3)
> which requires minimum changes while self-contained inside the functionality.

... while also leaking memory, right?

In my view, option 2 or 4 would have been preferable. Were there
specific technical reasons to not go down either of those routes? I'm
not sure that minimizing changes is the right route given that this same
problem presumably applies to other architectures, which will need to be
fixed.

Do we know why we aren't seeing issues on other architectures? e.g. is
the issue possible but rare (and hence not reported), or masked by
something else (e.g. the layout of the kernel VA space)?

I'd like to solve the underlying issue before we start adding new
functionality.

> Testing:
> 
> Memory hot remove has been tested on arm64 for 4K, 16K, 64K page config
> options with all possible CONFIG_ARM64_VA_BITS and CONFIG_PGTABLE_LEVELS
> combinations. Its only build tested on non-arm64 platforms.

Could you please share how you've tested this?

Having instructions so that I could reproduce this locally would be very
helpful.

Thanks,
Mark.


* Re: [PATCH V6 RESEND 0/3] arm64/mm: Enable memory hot remove
  2019-07-23 10:56 ` [PATCH V6 RESEND 0/3] " Mark Rutland
@ 2019-07-24  6:58   ` Anshuman Khandual
  2019-07-25 13:51     ` Mark Rutland
  0 siblings, 1 reply; 8+ messages in thread
From: Anshuman Khandual @ 2019-07-24  6:58 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon, mhocko, ira.weiny, david, cai, logang, james.morse,
	cpandya, arunks, dan.j.williams, mgorman, osalvador,
	ard.biesheuvel, steve.capper


On 07/23/2019 04:26 PM, Mark Rutland wrote:
> Hi Anshuman,

Hello Mark,

> 
> On Mon, Jul 15, 2019 at 11:47:47AM +0530, Anshuman Khandual wrote:
>> This series enables memory hot remove on arm64 after fixing a memblock
>> removal ordering problem in generic try_remove_memory() and a possible
>> arm64 platform specific kernel page table race condition. This series
>> is based on linux-next (next-20190712).
>>
>> Concurrent vmalloc() and hot-remove conflict:
>>
>> As pointed out earlier on the v5 thread [2] there can be potential conflict
>> between concurrent vmalloc() and memory hot-remove operation. This can be
>> solved or at least avoided with some possible methods. The problem here is
>> caused by inadequate locking in vmalloc() which protects installation of a
>> page table page but not the walk or the leaf entry modification.
>>
>> Option 1: Making locking in vmalloc() adequate
>>
>> Current locking scheme protects installation of page table pages but not the
>> page table walk or leaf entry creation which can conflict with hot-remove.
>> This scheme is sufficient for now as vmalloc() works on mutually exclusive
>> ranges which can proceed concurrently only if their shared page table pages
>> can be created while inside the lock. It achieves performance improvement
>> which will be compromised if entire vmalloc() operation (even if with some
>> optimization) has to be completed under a lock.
>>
>> Option 2: Making sure hot-remove does not happen during vmalloc()
>>
>> Take mem_hotplug_lock in read mode through [get|put]_online_mems() constructs
>> for the entire duration of vmalloc(). It protects from concurrent memory hot
>> remove operation and does not add any significant overhead to other concurrent
>> vmalloc() threads. It solves the problem in right way unless we do not want to
>> extend the usage of mem_hotplug_lock in generic MM.
>>
>> Option 3: Memory hot-remove does not free (conflicting) page table pages
>>
>> Don't not free page table pages (if any) for vmemmap mappings after unmapping
>> it's virtual range. The only downside here is that some page table pages might
>> remain empty and unused until next memory hot-add operation of the same memory
>> range.
>>
>> Option 4: Dont let vmalloc and vmemmap share intermediate page table pages
>>
>> The conflict does not arise if vmalloc and vmemap range do not share kernel
>> page table pages to start with. If such placement can be ensured in platform
>> kernel virtual address layout, this problem can be successfully avoided.
>>
>> There are two generic solutions (Option 1 and 2) and two platform specific
>> solutions (Options 2 and 3). This series has decided to go with (Option 3)

s/Option 2 and 3/Option 3 and 4/

>> which requires minimum changes while self-contained inside the functionality.
> 
> ... while also leaking memory, right?

This is not a memory leak. In the worst case, an empty page table page that
could have been freed after the vmemmap mapping spanning part of its kernel
virtual range was taken down just remains attached to the higher level page
table entry. This empty page table page is completely reusable for future
vmalloc() allocations or for the vmemmap mapping of newly hot added memory
in an overlapping virtual range. It is just an empty data structure sticking
around which could (probably would) be reused later. This problem will not
scale and get worse, because it affects the kernel page table rather than
user process page tables which could get multiplied. It is a small price we
are paying to remain safe from potential collisions between vmalloc() and
memory hot remove on the kernel page table. IMHO that is fair enough.

> 
> In my view, option 2 or 4 would have been preferable. Were there

I would say option 2 is the ideal solution where we make sure that each vmalloc()
instance is protected against concurrent memory hot remove through a read side lock
via [get|put]_online_mems().
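
Just to illustrate the semantics Option 2 relies on, here is a toy userspace
model of the mem_hotplug_lock read/write behaviour (a Python sketch, not
kernel code; all names here are illustrative):

```python
import threading

class MemHotplugLock:
    """Toy reader-writer model of mem_hotplug_lock (illustrative only)."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def get_online_mems(self):
        # Read side: what each vmalloc() would take under Option 2.
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def put_online_mems(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def lock_write(self):
        # Write side: held for the whole memory hot-remove operation.
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def unlock_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

lock = MemHotplugLock()
events = []

def vmalloc_like(n):
    lock.get_online_mems()      # page table pages cannot vanish while held
    events.append(("vmalloc", n))
    lock.put_online_mems()

def hot_remove():
    lock.lock_write()           # waits until no vmalloc() is in flight
    events.append(("hot_remove", None))
    lock.unlock_write()

threads = [threading.Thread(target=vmalloc_like, args=(i,)) for i in range(4)]
threads.append(threading.Thread(target=hot_remove))
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Under this model a hot-remove simply waits until no vmalloc()-side reader is
in flight, while concurrent readers never block each other, which is why the
read side should stay cheap for vmalloc() itself.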

Option 4 is very much platform specific and each platform has to make sure that it
remains compliant all the time, which is not ideal. It is also a workaround which
avoids the problem and does not really fix it.

> specific technical reasons to not go down either of those routes? I'm

Option 2 will require wider agreement as it involves vmalloc(), a very critical
hot path which can affect many workloads. IMHO Option 4 is neither optimal nor
does it solve the problem correctly. Like this approach it just avoids the
problem, but unlike this approach it touches another code area.

> not sure that minimizing changes is the right route given that this same
> problem presumably applies to other architectures, which will need to be
> fixed.

Yes, this needs to be fixed, but we can get there one step at a time. The vmemmap
tear down process can start freeing empty page table pages once this gets solved.
But why should that prevent the entire memory hot remove functionality from being
available?

> 
> Do we know why we aren't seeing issues on other architectures? e.g. is
> the issue possible but rare (and hence not reported), or masked by
> something else (e.g. the layout of the kernel VA space)?

I would believe so but we can only get more insights from respective architecture folks.

> 
> I'd like to solve the underlying issue before we start adding new
> functionality.

The entire memory hot-remove functionality from the platform perspective has four
primary functions.

1. Tear down linear mapping
2. Tear down vmemmap mapping
3. Free empty kernel page table pages after tearing down linear mapping
4. Free empty kernel page table pages after tearing down vmemmap mapping

This particular issue mentioned before prevents just the last function (4), which
in the worst case will retain some empty page table pages that erstwhile held
vmemmap mappings in the kernel page table, but otherwise the series provides
complete memory hot remove functionality.

Why should all these remaining memory hot-remove functions be prevented from being
available for use? The remaining set of functions (1-3) does not create any side
effects or introduce any new bugs. Also, function (4) is not tightly coupled with
the rest of the functions (1-3) and will anyway be unlocked independently when this
particular issue gets resolved. The point I am trying to make here is that they are
not tightly coupled, and perceiving them that way blocks the remaining memory
hot-remove functionality from being available.

> 
>> Testing:
>>
>> Memory hot remove has been tested on arm64 for 4K, 16K, 64K page config
>> options with all possible CONFIG_ARM64_VA_BITS and CONFIG_PGTABLE_LEVELS
>> combinations. Its only build tested on non-arm64 platforms.
> 
> Could you please share how you've tested this?
> 
> Having instructions so that I could reproduce this locally would be very
> helpful.

Please find the series rebased on v5.3-rc1 along with test patches which
enable sysfs interfaces for memory hot add and remove used for testing.

git://linux-arm.org/linux-anshuman.git (memory_hotremove/v6_resend_v5.3-rc1)

Sample Testing Procedure:

echo offline > /sys/devices/system/memory/auto_online_blocks
echo 0x680000000 > /sys/devices/system/memory/probe
echo online_movable > /sys/devices/system/memory/memory26/state
echo 0x680000000 > /sys/devices/system/memory/unprobe

Writing into unprobe triggers offlining first, followed by actual memory removal.

NOTE:

This assumes that 0x680000000 is a valid memory block starting physical address
and that memory26 gets created because of the preceding memory hot addition. Please
use appropriate values based on your local setup. Let me know how it goes and
whether I can provide more information.

- Anshuman


* Re: [PATCH V6 RESEND 0/3] arm64/mm: Enable memory hot remove
  2019-07-24  6:58   ` Anshuman Khandual
@ 2019-07-25 13:51     ` Mark Rutland
  2019-07-30  9:35       ` Anshuman Khandual
  0 siblings, 1 reply; 8+ messages in thread
From: Mark Rutland @ 2019-07-25 13:51 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon, mhocko, ira.weiny, david, cai, logang, james.morse,
	cpandya, arunks, dan.j.williams, mgorman, osalvador,
	ard.biesheuvel, steve.capper

On Wed, Jul 24, 2019 at 12:28:50PM +0530, Anshuman Khandual wrote:
> On 07/23/2019 04:26 PM, Mark Rutland wrote:
> > On Mon, Jul 15, 2019 at 11:47:47AM +0530, Anshuman Khandual wrote:
> >> This series enables memory hot remove on arm64 after fixing a memblock
> >> removal ordering problem in generic try_remove_memory() and a possible
> >> arm64 platform specific kernel page table race condition. This series
> >> is based on linux-next (next-20190712).
> >>
> >> Concurrent vmalloc() and hot-remove conflict:
> >>
> >> As pointed out earlier on the v5 thread [2] there can be potential conflict
> >> between concurrent vmalloc() and memory hot-remove operation. This can be
> >> solved or at least avoided with some possible methods. The problem here is
> >> caused by inadequate locking in vmalloc() which protects installation of a
> >> page table page but not the walk or the leaf entry modification.
> >>
> >> Option 1: Making locking in vmalloc() adequate
> >>
> >> Current locking scheme protects installation of page table pages but not the
> >> page table walk or leaf entry creation which can conflict with hot-remove.
> >> This scheme is sufficient for now as vmalloc() works on mutually exclusive
> >> ranges which can proceed concurrently only if their shared page table pages
> >> can be created while inside the lock. It achieves performance improvement
> >> which will be compromised if entire vmalloc() operation (even if with some
> >> optimization) has to be completed under a lock.
> >>
> >> Option 2: Making sure hot-remove does not happen during vmalloc()
> >>
> >> Take mem_hotplug_lock in read mode through [get|put]_online_mems() constructs
> >> for the entire duration of vmalloc(). It protects from concurrent memory hot
> >> remove operation and does not add any significant overhead to other concurrent
> >> vmalloc() threads. It solves the problem in right way unless we do not want to
> >> extend the usage of mem_hotplug_lock in generic MM.
> >>
> >> Option 3: Memory hot-remove does not free (conflicting) page table pages
> >>
> >> Don't not free page table pages (if any) for vmemmap mappings after unmapping
> >> it's virtual range. The only downside here is that some page table pages might
> >> remain empty and unused until next memory hot-add operation of the same memory
> >> range.
> >>
> >> Option 4: Dont let vmalloc and vmemmap share intermediate page table pages
> >>
> >> The conflict does not arise if vmalloc and vmemap range do not share kernel
> >> page table pages to start with. If such placement can be ensured in platform
> >> kernel virtual address layout, this problem can be successfully avoided.
> >>
> >> There are two generic solutions (Option 1 and 2) and two platform specific
> >> solutions (Options 2 and 3). This series has decided to go with (Option 3)
> 
> s/Option 2 and 3/Option 3 and 4/
> 
> >> which requires minimum changes while self-contained inside the functionality.
> > 
> > ... while also leaking memory, right?
> 
> This is not a memory leak. In the worst case, an empty page table page that
> could have been freed after the vmemmap mapping spanning part of its kernel
> virtual range was taken down just remains attached to the higher level page
> table entry. This empty page table page is completely reusable for future
> vmalloc() allocations or for the vmemmap mapping of newly hot added memory
> in an overlapping virtual range. It is just an empty data structure sticking
> around which could (probably would) be reused later. This problem will not
> scale and get worse, because it affects the kernel page table rather than
> user process page tables which could get multiplied. It is a small price we
> are paying to remain safe from potential collisions between vmalloc() and
> memory hot remove on the kernel page table. IMHO that is fair enough.

I appreciate that we can reuse the memory if we hotplug the same
physical range.

Regardless, I think it's important to note that this approach leaves
that memory around. Could you please quantify how much memory this
would be? i.e. for a 4K 48-bit VA kernel, how many pages would be left
over for a 1GiB region of memory?

> > In my view, option 2 or 4 would have been preferable. Were there
> 
> I would say option 2 is the ideal solution where we make sure that each vmalloc()
> instance is protected against concurrent memory hot remove through a read side lock
> via [get|put]_online_mems().

I agree that this would be simple to reason about. However, even taking
a read lock could significantly change the performance of operations in
the vmalloc space, so that would need to be quantified. Additionally,
hotplug operations would stall all vmalloc space operations, which is
unfortunate.

> Option 4 is very much platform specific and each platform has to make sure that it
> remains compliant all the time, which is not ideal. It is also a workaround which
> avoids the problem and does not really fix it.

I understand that you don't like this solution.

I think it should be simple to verify that the layout is safe via
BUILD_BUG_ON() checking the regions we care about don't overlap, so I
don't buy that it's all that difficult to ensure going forward if it's
naturally the case today.

> > specific technical reasons to not go down either of those routes? I'm
> 
> Option 2 will require wider agreement as it involves vmalloc(), a very critical
> hot path which can affect many workloads.

I agree that this would need to be quantified.

> IMHO Option 4 is neither optimal nor does it solve the problem
> correctly. Like this approach it just avoids the problem, but unlike
> this approach it touches another code area.

I disagree that option 4 wouldn't be correct; it's just avoiding the
issue at a different level.

> > not sure that minimizing changes is the right route given that this same
> > problem presumably applies to other architectures, which will need to be
> > fixed.
> 
> Yes, this needs to be fixed, but we can get there one step at a time. The vmemmap
> tear down process can start freeing empty page table pages once this gets solved.
> But why should that prevent the entire memory hot remove functionality from being
> available?

My experience has been that people rarely go back to solve the edge
cases once the feature they care about has been merged, and we're left
with more edge cases...

I think we at least need to have a clear idea that we can fix the
problem before we punt it on as later cleanup. Especially given that
this seems like it is an existing problem affecting other architectures.

> > Do we know why we aren't seeing issues on other architectures? e.g. is
> > the issue possible but rare (and hence not reported), or masked by
> > something else (e.g. the layout of the kernel VA space)?
> 
> I would believe so but we can only get more insights from respective architecture folks.

Could you please investigate, e.g. have a look at how this works on x86?

You should be able to figure out if the VA ranges overlap, and I suspect
that if there is a problem you can deliberately trigger it within a
QEMU VM.

> >> Testing:
> >>
> >> Memory hot remove has been tested on arm64 for 4K, 16K, 64K page config
> >> options with all possible CONFIG_ARM64_VA_BITS and CONFIG_PGTABLE_LEVELS
> >> combinations. Its only build tested on non-arm64 platforms.
> > 
> > Could you please share how you've tested this?
> > 
> > Having instructions so that I could reproduce this locally would be very
> > helpful.
> 
> Please find the series rebased on v5.3-rc1 along with test patches which
> enable sysfs interfaces for memory hot add and remove used for testing.
> 
> git://linux-arm.org/linux-anshuman.git (memory_hotremove/v6_resend_v5.3-rc1)
> 
> Sample Testing Procedure:
> 
> echo offline > /sys/devices/system/memory/auto_online_blocks
> echo 0x680000000 > /sys/devices/system/memory/probe
> echo online_movable > /sys/devices/system/memory/memory26/state
> echo 0x680000000 > /sys/devices/system/memory/unprobe
> 
> Writing into unprobe triggers offlining first, followed by actual memory removal.
> 
> NOTE:
> 
> This assumes that 0x680000000 is a valid memory block starting physical address
> and that memory26 gets created because of the preceding memory hot addition. Please
> use appropriate values based on your local setup. Let me know how it goes and
> whether I can provide more information.

Thanks for these notes; they're very helpful!

Mark.


* Re: [PATCH V6 RESEND 0/3] arm64/mm: Enable memory hot remove
  2019-07-25 13:51     ` Mark Rutland
@ 2019-07-30  9:35       ` Anshuman Khandual
  0 siblings, 0 replies; 8+ messages in thread
From: Anshuman Khandual @ 2019-07-30  9:35 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas,
	will.deacon, mhocko, ira.weiny, david, cai, logang, james.morse,
	cpandya, arunks, dan.j.williams, mgorman, osalvador,
	ard.biesheuvel, steve.capper

On 07/25/2019 07:21 PM, Mark Rutland wrote:
> On Wed, Jul 24, 2019 at 12:28:50PM +0530, Anshuman Khandual wrote:
>> On 07/23/2019 04:26 PM, Mark Rutland wrote:
>>> On Mon, Jul 15, 2019 at 11:47:47AM +0530, Anshuman Khandual wrote:
>>>> This series enables memory hot remove on arm64 after fixing a memblock
>>>> removal ordering problem in generic try_remove_memory() and a possible
>>>> arm64 platform specific kernel page table race condition. This series
>>>> is based on linux-next (next-20190712).
>>>>
>>>> Concurrent vmalloc() and hot-remove conflict:
>>>>
>>>> As pointed out earlier on the v5 thread [2] there can be potential conflict
>>>> between concurrent vmalloc() and memory hot-remove operation. This can be
>>>> solved or at least avoided with some possible methods. The problem here is
>>>> caused by inadequate locking in vmalloc() which protects installation of a
>>>> page table page but not the walk or the leaf entry modification.
>>>>
>>>> Option 1: Making locking in vmalloc() adequate
>>>>
>>>> Current locking scheme protects installation of page table pages but not the
>>>> page table walk or leaf entry creation which can conflict with hot-remove.
>>>> This scheme is sufficient for now as vmalloc() works on mutually exclusive
>>>> ranges which can proceed concurrently only if their shared page table pages
>>>> can be created while inside the lock. It achieves performance improvement
>>>> which will be compromised if entire vmalloc() operation (even if with some
>>>> optimization) has to be completed under a lock.
>>>>
>>>> Option 2: Making sure hot-remove does not happen during vmalloc()
>>>>
>>>> Take mem_hotplug_lock in read mode through [get|put]_online_mems() constructs
>>>> for the entire duration of vmalloc(). It protects from concurrent memory hot
>>>> remove operation and does not add any significant overhead to other concurrent
>>>> vmalloc() threads. It solves the problem in right way unless we do not want to
>>>> extend the usage of mem_hotplug_lock in generic MM.
>>>>
>>>> Option 3: Memory hot-remove does not free (conflicting) page table pages
>>>>
>>>> Don't not free page table pages (if any) for vmemmap mappings after unmapping
>>>> it's virtual range. The only downside here is that some page table pages might
>>>> remain empty and unused until next memory hot-add operation of the same memory
>>>> range.
>>>>
>>>> Option 4: Dont let vmalloc and vmemmap share intermediate page table pages
>>>>
>>>> The conflict does not arise if vmalloc and vmemap range do not share kernel
>>>> page table pages to start with. If such placement can be ensured in platform
>>>> kernel virtual address layout, this problem can be successfully avoided.
>>>>
>>>> There are two generic solutions (Option 1 and 2) and two platform specific
>>>> solutions (Options 2 and 3). This series has decided to go with (Option 3)
>>
>> s/Option 2 and 3/Option 3 and 4/
>>
>>>> which requires minimum changes while self-contained inside the functionality.
>>>
>>> ... while also leaking memory, right?
>>
>> This is not a memory leak. In the worst case, an empty page table page that
>> could have been freed after the vmemmap mapping spanning part of its kernel
>> virtual range was taken down just remains attached to the higher level page
>> table entry. This empty page table page is completely reusable for future
>> vmalloc() allocations or for the vmemmap mapping of newly hot added memory
>> in an overlapping virtual range. It is just an empty data structure sticking
>> around which could (probably would) be reused later. This problem will not
>> scale and get worse, because it affects the kernel page table rather than
>> user process page tables which could get multiplied. It is a small price we
>> are paying to remain safe from potential collisions between vmalloc() and
>> memory hot remove on the kernel page table. IMHO that is fair enough.
> 
> I appreciate that we can reuse the memory if we hotplug the same
> physical range.
> 
> Regardless, I think it's important to note that this approach leaves
> that memory around. Could you please quantify how much memory this
> would be? i.e. for a 4K 48-bit VA kernel, how many pages would be left
> over for a 1GiB region of memory?

I evaluated this for all possible configs with a single 1GB memory region
being hot removed.

1. 4K  page size - 39 bit VA - 3 LEVEL PGTABLE

	Potential pte page [freed: 0 skipped: 0]
	Potential pmd page [freed: 0 skipped: 1]

2. 4K  page size - 48 bit VA - 4 LEVEL PGTABLE

	Potential pte page [freed: 0 skipped: 0]
	Potential pmd page [freed: 0 skipped: 1]
	Potential pud page [freed: 0 skipped: 1]

3. 16K page size - 47 bit VA - 3 LEVEL PGTABLE

	Potential pte page [freed: 0 skipped: 1]
	Potential pmd page [freed: 0 skipped: 1]

4. 16K page size - 48 bit VA - 4 LEVEL PGTABLE

	Potential pte page [freed: 0 skipped: 1]
	Potential pmd page [freed: 0 skipped: 1]
	Potential pud page [freed: 0 skipped: 1]

5. 64K page size - 42 bit VA - 2 LEVEL PGTABLE

	Potential pte page [freed: 0 skipped: 1]

6. 64K page size - 48 bit VA - 3 LEVEL PGTABLE

	Potential pte page [freed: 0 skipped: 1]
	Potential pmd page [freed: 0 skipped: 1]

7. 64K page size - 52 bit VA - 3 LEVEL PGTABLE

	Potential pte page [freed: 0 skipped: 1]
	Potential pmd page [freed: 0 skipped: 1]

This is based on how many pages would have been freed or skipped if
free_empty_tables() had been called from vmemmap_free() during the
memory hot remove process.

Freed:   pgtable pages (pte|pmd|pud) that actually got freed
Skipped: pgtable pages (pte|pmd|pud) that did not get freed due
	 to other valid and present entries

This indicates that no potentially freeable page table pages are left
around for a 1GB memory region. It might be because this particular
memory region in the experiment lands at a particular place in the
kernel page table based on its starting address. Regardless, it at least
indicates that the page table pages left around during hot remove (if any)
would be minimal.
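
As a rough cross-check, the worst-case page table overhead for the vmemmap of
a 1GB region can be worked out with simple arithmetic. A sketch assuming 4K
pages, 48 bit VA and a 64 byte struct page (on arm64 the vmemmap may be mapped
with PMD blocks, which would eliminate the PTE level entirely, and an
unaligned range could need one extra table per level):

```python
PAGE_SIZE   = 4096                 # 4K base pages (assumed config)
STRUCT_PAGE = 64                   # assumed sizeof(struct page)
PTRS        = PAGE_SIZE // 8       # 512 entries per table at every level

region        = 1 << 30                              # 1GB being hot removed
vmemmap_bytes = (region // PAGE_SIZE) * STRUCT_PAGE  # 16MB of vmemmap range

pte_span = PTRS * PAGE_SIZE        # one PTE table maps 2MB of VA
pmd_span = PTRS * pte_span         # one PMD table maps 1GB of VA
pud_span = PTRS * pmd_span         # one PUD table maps 512GB of VA

pte_tables = -(-vmemmap_bytes // pte_span)   # ceiling division -> 8
pmd_tables = -(-vmemmap_bytes // pmd_span)   # -> 1
pud_tables = -(-vmemmap_bytes // pud_span)   # -> 1

worst_case_bytes = (pte_tables + pmd_tables + pud_tables) * PAGE_SIZE
```

So even in the worst case only about ten 4K pages (40KB) per removed 1GB
region are at stake, which matches the observation above that any retained
pages would be minimal.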

> 
>>> In my view, option 2 or 4 would have been preferable. Were there
>>
>> I would say option 2 is the ideal solution where we make sure that each vmalloc()
>> instance is protected against concurrent memory hot remove through a read side lock
>> via [get|put]_online_mems().
> 
> I agree that this would be simple to reason about. However, even taking
> a read lock could significantly change the performance of operations in
> the vmalloc space, so that would need to be quantified. Additionally,
> hotplug operations would stall all vmalloc space operations, which is
> unfortunate.
> 
>> Option 4 is very much platform specific and each platform has to make sure that it
>> remains compliant all the time, which is not ideal. It is also a workaround which
>> avoids the problem and does not really fix it.
> 
> I understand that you don't like this solution.
> 
> I think it should be simple to verify that the layout is safe via
> BUILD_BUG_ON() checking the regions we care about don't overlap, so I
> don't buy that it's all that difficult to ensure going forward if it's
> naturally the case today.

The concern was that platforms need to ensure this all the time, not that it
is difficult to do so. I did evaluate this for all possible config options.

1. 4K  page size - 39 bit VA - 3 LEVEL PGTABLE

vmalloc [start ffffff8010000000 end ffffffbebfff0000 size 3eafff0000]
vmemmap [start ffffffbf00000000 end ffffffc000000000 size 100000000]
pgtable [pgdir 40000000 pud 40000000  pmd 200000]
spacing [40010000]

2. 4K  page size - 48 bit VA - 4 LEVEL PGTABLE

vmalloc [start ffff000010000000 end ffff7dffbfff0000 size 7dffafff0000]
vmemmap [start ffff7e0000000000 end ffff800000000000 size 20000000000]
pgtable [pgdir 8000000000 pud 40000000  pmd 200000]
spacing [40010000]

3. 16K page size - 47 bit VA - 3 LEVEL PGTABLE

vmalloc [start ffff800010000000 end ffffbfafffff0000 size 3fafefff0000]
vmemmap [start ffffbfc000000000 end ffffc00000000000 size 4000000000]
pgtable [pgdir 1000000000 pud 1000000000  pmd 2000000]
spacing [1000010000] 

4. 16K page size - 48 bit VA - 4 LEVEL PGTABLE

vmalloc [start ffff000010000000 end ffff7f6fffff0000 size 7f6fefff0000]
vmemmap [start ffff7f8000000000 end ffff800000000000 size 8000000000]
pgtable [pgdir 800000000000 pud 1000000000  pmd 2000000]
spacing [1000010000] PGD overlapping

5. 64K page size - 42 bit VA - 2 LEVEL PGTABLE

vmalloc [start fffffc0010000000 end fffffdff5fff0000 size 1ff4fff0000]
vmemmap [start fffffdff80000000 end fffffe0000000000 size 80000000]
pgtable [pgdir 20000000 pud 20000000  pmd 20000000]
spacing [20010000]

6. 64K page size - 48 bit VA - 3 LEVEL PGTABLE

vmalloc [start ffff000010000000 end ffff7bdfffff0000 size 7bdfefff0000]
vmemmap [start ffff7fe000000000 end ffff800000000000 size 2000000000]
pgtable [pgdir 40000000000 pud 40000000000  pmd 20000000]
spacing [40000010000] 

7. 64K page size - 52 bit VA - 3 LEVEL PGTABLE

vmalloc [start ffff000010000000 end ffff7bdfffff0000 size 7bdfefff0000]
vmemmap [start ffff7fe000000000 end ffff800000000000 size 2000000000]
pgtable [pgdir 40000000000 pud 40000000000  pmd 20000000]
spacing [40000010000] 

So except in just one case (16K page size - 48 bit VA - 4 LEVEL PGTABLE), vmalloc
and vmemmap do not seem to share any intermediate page table entry. Even in
that case they share just a PGD entry. So calling free_empty_tables() during
memory hot-remove might not be as risky as it appeared before, and just to be
sure we can have a BUILD_BUG_ON() when memory hot remove is enabled which will
make sure that the (vmalloc - vmemmap) spacing is adequate.

When a PGD entry (i.e. a PUD page) is being shared, it can still potentially collide
with a concurrent vmalloc(). Should we skip calling free_empty_tables() just for that
particular case? Any thoughts?

spacing: (vmalloc - vmemmap)
pgdir:	 PGDIR_SIZE
pud:	 PUD_SIZE
pmd:	 PMD_SIZE
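
The overlap observation can also be checked mechanically from the addresses
listed above (a sketch; the shifts are derived from the pgdir sizes shown,
0x8000000000 = 2^39 and 0x800000000000 = 2^47, and this is essentially the
runtime analogue of the BUILD_BUG_ON() mentioned above):

```python
def shares_top_level(vmalloc_end, vmemmap_start, pgdir_shift):
    """True if both addresses fall under the same PGD entry."""
    return (vmalloc_end >> pgdir_shift) == (vmemmap_start >> pgdir_shift)

# Config 2: 4K pages, 48 bit VA, 4 levels (pgdir 0x8000000000 -> shift 39)
config2 = shares_top_level(0xffff7dffbfff0000, 0xffff7e0000000000, 39)

# Config 4: 16K pages, 48 bit VA, 4 levels (pgdir 0x800000000000 -> shift 47)
config4 = shares_top_level(0xffff7f6fffff0000, 0xffff7f8000000000, 47)
```

This confirms the listing: config 2 keeps vmalloc and vmemmap under distinct
PGD entries, while config 4 is the single case where they share one.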

> 
>>> specific technical reasons to not go down either of those routes? I'm
>>
>> Option 2 will require wider agreement as it involves vmalloc(), a very critical
>> hot path which can affect many workloads.
> 
> I agree that this would need to be quantified.
> 
>> IMHO Option 4 is neither optimal nor does it solve the problem
>> correctly. Like this approach it just avoids the problem, but unlike
>> this approach it touches another code area.
> 
> I disagree that option 4 wouldn't be correct; it's just avoiding the
> issue at a different level.
> 
>>> not sure that minimizing changes is the right route given that this same
>>> problem presumably applies to other architectures, which will need to be
>>> fixed.
>>
>> Yes, this needs to be fixed, but we can get there one step at a time. The vmemmap
>> tear down process can start freeing empty page table pages once this gets solved.
>> But why should that prevent the entire memory hot remove functionality from being
>> available?
> 
> My experience has been that people rarely go back to solve the edge
> cases once the feature they care about has been merged, and we're left
> with more edge cases...
> 
> I think we at least need to have a clear idea that we can fix the
> problem before we punt it on as later cleanup. Especially given that
> this seems like it is an existing problem affecting other architectures.
> 
>>> Do we know why we aren't seeing issues on other architectures? e.g. is
>>> the issue possible but rare (and hence not reported), or masked by
>>> something else (e.g. the layout of the kernel VA space)?
>>
>> I would believe so but we can only get more insights from respective architecture folks.
> 
> Could you please investigate, e.g. have a look at how this works on x86?
> 
> You should be able to figure out if the VA ranges overlap, and I suspect
> that if there is a problem you can deliberately trigger it within a
> QEMU VM.

Even on x86 the vmemmap-vmalloc separation seems adequate. IIUC x86 does
not have other config combinations which need to be verified. Please
let me know if that is not correct.

vmalloc [start ffffb91080000000 end ffffd9107fffffff size 1fffffffffff]
vmemmap [start ffffe9c3c0000000 end ffffe9efc0000000 size 2c00000000]
pgtable [pgdir 8000000000 pud 40000000  pmd 200000]
spacing [10b340000001] 

So x86 might never have faced this problem? It does not define VMEMMAP_SIZE,
hence I computed these with a hack.


end of thread, other threads:[~2019-07-30  9:35 UTC | newest]

2019-07-15  6:17 [PATCH V6 RESEND 0/3] arm64/mm: Enable memory hot remove Anshuman Khandual
2019-07-15  6:17 ` [PATCH V6 RESEND 1/3] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory() Anshuman Khandual
2019-07-15  6:17 ` [PATCH V6 RESEND 2/3] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump Anshuman Khandual
2019-07-15  6:17 ` [PATCH V6 RESEND 3/3] arm64/mm: Enable memory hot remove Anshuman Khandual
2019-07-23 10:56 ` [PATCH V6 RESEND 0/3] " Mark Rutland
2019-07-24  6:58   ` Anshuman Khandual
2019-07-25 13:51     ` Mark Rutland
2019-07-30  9:35       ` Anshuman Khandual
