Linux-ARM-Kernel Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH V7 0/3] arm64/mm: Enable memory hot remove
@ 2019-09-03  9:45 Anshuman Khandual
  2019-09-03  9:45 ` [PATCH V7 1/3] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory() Anshuman Khandual
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Anshuman Khandual @ 2019-09-03  9:45 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas, will
  Cc: mark.rutland, mhocko, david, ira.weiny, steve.capper, mgorman,
	steven.price, broonie, cai, ard.biesheuvel, cpandya, arunks,
	dan.j.williams, Robin.Murphy, logang, valentin.schneider,
	suzuki.poulose, osalvador

This series enables memory hot remove on arm64 after fixing a memblock
removal ordering problem in generic try_remove_memory() and a possible
arm64 platform specific kernel page table race condition. This series
is based on the following current arm64 tree.

git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git (for-next/core)

Concurrent vmalloc() and hot-remove conflict:

As pointed out earlier on the v5 thread [2] there can be potential conflict
between concurrent vmalloc() and memory hot-remove operation. The problem here
is caused by inadequate locking in vmalloc() which protects installation of a
page table page but not the walk or the leaf entry modification.

Lets not free page table pages if any for vmemmap mappings after unmapping
it's virtual range if vmalloc and vmemap range overlaps. The only downside
here is that some page table pages might remain empty and unused until next
memory hot-add operation for the same memory range. But as mentioned earlier
those will be minimal.

Only the following two configurations have vmalloc and vmemmap overlap where
free_empty_tables() function is not getting called.

1. 4K page size with 48 bit VA space
2. 16K page size with 48 bit VA space

Testing:

Memory hot remove has been tested on arm64 for 4K, 16K, 64K page config
options with all possible CONFIG_ARM64_VA_BITS and CONFIG_PGTABLE_LEVELS
combinations. It is only build tested on non-arm64 platforms.

Changes in V7:

- vmalloc_vmemmap_overlap gets evaluated early during boot for a given config
- free_empty_tables() gets conditionally called based on vmalloc_vmemmap_overlap

Changes in V6: (https://lkml.org/lkml/2019/7/15/36)

- Implemented most of the suggestions from Mark Rutland
- Added <linux/memory_hotplug.h> in ptdump
- remove_pagetable() now has two distinct passes over the kernel page table
- First pass unmap_hotplug_range() removes leaf level entries at all level
- Second pass free_empty_tables() removes empty page table pages
- Kernel page table lock has been dropped completely
- vmemmap_free() does not call freee_empty_tables() to avoid conflict with vmalloc()
- All address range scanning are converted to do {} while() loop
- Added 'unsigned long end' in __remove_pgd_mapping()
- Callers need not provide starting pointer argument to free_[pte|pmd|pud]_table() 
- Drop the starting pointer argument from free_[pte|pmd|pud]_table() functions
- Fetching pxxp[i] in free_[pte|pmd|pud]_table() is wrapped around in READ_ONCE()
- free_[pte|pmd|pud]_table() now computes starting pointer inside the function
- Fixed TLB handling while freeing huge page section mappings at PMD or PUD level
- Added WARN_ON(!page) in free_hotplug_page_range()
- Added WARN_ON(![pm|pud]_table(pud|pmd)) when there is no section mapping

- [PATCH 1/3] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()
- Request earlier for separate merger (https://patchwork.kernel.org/patch/10986599/)
- s/__remove_memory/try_remove_memory in the subject line
- s/arch_remove_memory/memblock_[free|remove] in the subject line
- A small change in the commit message as re-order happens now for memblock remove
  functions not for arch_remove_memory()

Changes in V5: (https://lkml.org/lkml/2019/5/29/218)

- Have some agreement [1] over using memory_hotplug_lock for arm64 ptdump
- Change 7ba36eccb3f8 ("arm64/mm: Inhibit huge-vmap with ptdump") already merged
- Dropped the above patch from this series
- Fixed indentation problem in arch_[add|remove]_memory() as per David
- Collected all new Acked-by tags
 
Changes in V4: (https://lkml.org/lkml/2019/5/20/19)

- Implemented most of the suggestions from Mark Rutland
- Interchanged patch [PATCH 2/4] <---> [PATCH 3/4] and updated commit message
- Moved CONFIG_PGTABLE_LEVELS inside free_[pud|pmd]_table()
- Used READ_ONCE() in missing instances while accessing page table entries
- s/p???_present()/p???_none() for checking valid kernel page table entries
- WARN_ON() when an entry is !p???_none() and !p???_present() at the same time
- Updated memory hot-remove commit message with additional details as suggested
- Rebased the series on 5.2-rc1 with hotplug changes from David and Michal Hocko
- Collected all new Acked-by tags

Changes in V3: (https://lkml.org/lkml/2019/5/14/197)
 
- Implemented most of the suggestions from Mark Rutland for remove_pagetable()
- Fixed applicable PGTABLE_LEVEL wrappers around pgtable page freeing functions
- Replaced 'direct' with 'sparse_vmap' in remove_pagetable() with inverted polarity
- Changed pointer names ('p' at end) and removed tmp from iterations
- Perform intermediate TLB invalidation while clearing pgtable entries
- Dropped flush_tlb_kernel_range() in remove_pagetable()
- Added flush_tlb_kernel_range() in remove_pte_table() instead
- Renamed page freeing functions for pgtable page and mapped pages
- Used page range size instead of order while freeing mapped or pgtable pages
- Removed all PageReserved() handling while freeing mapped or pgtable pages
- Replaced XXX_index() with XXX_offset() while walking the kernel page table
- Used READ_ONCE() while fetching individual pgtable entries
- Taken overall init_mm.page_table_lock instead of just while changing an entry
- Dropped previously added [pmd|pud]_index() which are not required anymore
- Added a new patch to protect kernel page table race condition for ptdump
- Added a new patch from Mark Rutland to prevent huge-vmap with ptdump

Changes in V2: (https://lkml.org/lkml/2019/4/14/5)

- Added all received review and ack tags
- Split the series from ZONE_DEVICE enablement for better review
- Moved memblock re-order patch to the front as per Robin Murphy
- Updated commit message on memblock re-order patch per Michal Hocko
- Dropped [pmd|pud]_large() definitions
- Used existing [pmd|pud]_sect() instead of earlier [pmd|pud]_large()
- Removed __meminit and __ref tags as per Oscar Salvador
- Dropped unnecessary 'ret' init in arch_add_memory() per Robin Murphy
- Skipped calling into pgtable_page_dtor() for linear mapping page table
  pages and updated all relevant functions

Changes in V1: (https://lkml.org/lkml/2019/4/3/28)

References:

[1] https://lkml.org/lkml/2019/5/28/584
[2] https://lkml.org/lkml/2019/6/11/709

Anshuman Khandual (3):
  mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()
  arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  arm64/mm: Enable memory hot remove

 arch/arm64/Kconfig              |   3 +
 arch/arm64/include/asm/memory.h |   1 +
 arch/arm64/mm/mmu.c             | 338 +++++++++++++++++++++++++++++++-
 arch/arm64/mm/ptdump_debugfs.c  |   4 +
 mm/memory_hotplug.c             |   4 +-
 5 files changed, 339 insertions(+), 11 deletions(-)

-- 
2.20.1


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH V7 1/3] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()
  2019-09-03  9:45 [PATCH V7 0/3] arm64/mm: Enable memory hot remove Anshuman Khandual
@ 2019-09-03  9:45 ` Anshuman Khandual
  2019-09-04  8:16   ` David Hildenbrand
  2019-09-16  1:44   ` Balbir Singh
  2019-09-03  9:45 ` [PATCH V7 2/3] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump Anshuman Khandual
  2019-09-03  9:45 ` [PATCH V7 3/3] arm64/mm: Enable memory hot remove Anshuman Khandual
  2 siblings, 2 replies; 19+ messages in thread
From: Anshuman Khandual @ 2019-09-03  9:45 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas, will
  Cc: mark.rutland, mhocko, david, ira.weiny, steve.capper, mgorman,
	steven.price, broonie, cai, ard.biesheuvel, cpandya, arunks,
	dan.j.williams, Robin.Murphy, logang, valentin.schneider,
	suzuki.poulose, osalvador

Memory hot remove uses get_nid_for_pfn() while tearing down linked sysfs
entries between memory block and node. It first checks pfn validity with
pfn_valid_within() before fetching nid. With CONFIG_HOLES_IN_ZONE config
(arm64 has this enabled) pfn_valid_within() calls pfn_valid().

pfn_valid() is an arch implementation on arm64 (CONFIG_HAVE_ARCH_PFN_VALID)
which scans all mapped memblock regions with memblock_is_map_memory(). This
creates a problem in memory hot remove path which has already removed given
memory range from memory block with memblock_[remove|free] before arriving
at unregister_mem_sect_under_nodes(). Hence get_nid_for_pfn() returns -1
skipping subsequent sysfs_remove_link() calls leaving node <-> memory block
sysfs entries as is. Subsequent memory add operation hits BUG_ON() because
of existing sysfs entries.

[   62.007176] NUMA: Unknown node for memory at 0x680000000, assuming node 0
[   62.052517] ------------[ cut here ]------------
[   62.053211] kernel BUG at mm/memory_hotplug.c:1143!
[   62.053868] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[   62.054589] Modules linked in:
[   62.054999] CPU: 19 PID: 3275 Comm: bash Not tainted 5.1.0-rc2-00004-g28cea40b2683 #41
[   62.056274] Hardware name: linux,dummy-virt (DT)
[   62.057166] pstate: 40400005 (nZcv daif +PAN -UAO)
[   62.058083] pc : add_memory_resource+0x1cc/0x1d8
[   62.058961] lr : add_memory_resource+0x10c/0x1d8
[   62.059842] sp : ffff0000168b3ce0
[   62.060477] x29: ffff0000168b3ce0 x28: ffff8005db546c00
[   62.061501] x27: 0000000000000000 x26: 0000000000000000
[   62.062509] x25: ffff0000111ef000 x24: ffff0000111ef5d0
[   62.063520] x23: 0000000000000000 x22: 00000006bfffffff
[   62.064540] x21: 00000000ffffffef x20: 00000000006c0000
[   62.065558] x19: 0000000000680000 x18: 0000000000000024
[   62.066566] x17: 0000000000000000 x16: 0000000000000000
[   62.067579] x15: ffffffffffffffff x14: ffff8005e412e890
[   62.068588] x13: ffff8005d6b105d8 x12: 0000000000000000
[   62.069610] x11: ffff8005d6b10490 x10: 0000000000000040
[   62.070615] x9 : ffff8005e412e898 x8 : ffff8005e412e890
[   62.071631] x7 : ffff8005d6b105d8 x6 : ffff8005db546c00
[   62.072640] x5 : 0000000000000001 x4 : 0000000000000002
[   62.073654] x3 : ffff8005d7049480 x2 : 0000000000000002
[   62.074666] x1 : 0000000000000003 x0 : 00000000ffffffef
[   62.075685] Process bash (pid: 3275, stack limit = 0x00000000d754280f)
[   62.076930] Call trace:
[   62.077411]  add_memory_resource+0x1cc/0x1d8
[   62.078227]  __add_memory+0x70/0xa8
[   62.078901]  probe_store+0xa4/0xc8
[   62.079561]  dev_attr_store+0x18/0x28
[   62.080270]  sysfs_kf_write+0x40/0x58
[   62.080992]  kernfs_fop_write+0xcc/0x1d8
[   62.081744]  __vfs_write+0x18/0x40
[   62.082400]  vfs_write+0xa4/0x1b0
[   62.083037]  ksys_write+0x5c/0xc0
[   62.083681]  __arm64_sys_write+0x18/0x20
[   62.084432]  el0_svc_handler+0x88/0x100
[   62.085177]  el0_svc+0x8/0xc

Re-ordering memblock_[free|remove]() with arch_remove_memory() solves the
problem on arm64 as pfn_valid() behaves correctly and returns positive
as memblock for the address range still exists. arch_remove_memory()
removes applicable memory sections from zone with __remove_pages() and
tears down kernel linear mapping. Removing memblock regions afterwards
is safe because there is no other memblock (bootmem) allocator user that
late. So nobody is going to allocate from the removed range just to blow
up later. Also nobody should be using the bootmem allocated range else
we wouldn't allow to remove it. So reordering is indeed safe.

Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 mm/memory_hotplug.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c73f09913165..355c466e0621 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1770,13 +1770,13 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
 
 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");
-	memblock_free(start, size);
-	memblock_remove(start, size);
 
 	/* remove memory block devices before removing memory */
 	remove_memory_block_devices(start, size);
 
 	arch_remove_memory(nid, start, size, NULL);
+	memblock_free(start, size);
+	memblock_remove(start, size);
 	__release_memory_resource(start, size);
 
 	try_offline_node(nid);
-- 
2.20.1


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH V7 2/3] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  2019-09-03  9:45 [PATCH V7 0/3] arm64/mm: Enable memory hot remove Anshuman Khandual
  2019-09-03  9:45 ` [PATCH V7 1/3] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory() Anshuman Khandual
@ 2019-09-03  9:45 ` Anshuman Khandual
  2019-09-15  2:35   ` Balbir Singh
  2019-09-03  9:45 ` [PATCH V7 3/3] arm64/mm: Enable memory hot remove Anshuman Khandual
  2 siblings, 1 reply; 19+ messages in thread
From: Anshuman Khandual @ 2019-09-03  9:45 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas, will
  Cc: mark.rutland, mhocko, david, ira.weiny, steve.capper, mgorman,
	steven.price, broonie, cai, ard.biesheuvel, cpandya, arunks,
	dan.j.williams, Robin.Murphy, logang, valentin.schneider,
	suzuki.poulose, osalvador

The arm64 page table dump code can race with concurrent modification of the
kernel page tables. When a leaf entries are modified concurrently, the dump
code may log stale or inconsistent information for a VA range, but this is
otherwise not harmful.

When intermediate levels of table are freed, the dump code will continue to
use memory which has been freed and potentially reallocated for another
purpose. In such cases, the dump code may dereference bogus addresses,
leading to a number of potential problems.

Intermediate levels of table may by freed during memory hot-remove,
which will be enabled by a subsequent patch. To avoid racing with
this, take the memory hotplug lock when walking the kernel page table.

Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/mm/ptdump_debugfs.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
index 064163f25592..b5eebc8c4924 100644
--- a/arch/arm64/mm/ptdump_debugfs.c
+++ b/arch/arm64/mm/ptdump_debugfs.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/debugfs.h>
+#include <linux/memory_hotplug.h>
 #include <linux/seq_file.h>
 
 #include <asm/ptdump.h>
@@ -7,7 +8,10 @@
 static int ptdump_show(struct seq_file *m, void *v)
 {
 	struct ptdump_info *info = m->private;
+
+	get_online_mems();
 	ptdump_walk_pgd(m, info);
+	put_online_mems();
 	return 0;
 }
 DEFINE_SHOW_ATTRIBUTE(ptdump);
-- 
2.20.1


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH V7 3/3] arm64/mm: Enable memory hot remove
  2019-09-03  9:45 [PATCH V7 0/3] arm64/mm: Enable memory hot remove Anshuman Khandual
  2019-09-03  9:45 ` [PATCH V7 1/3] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory() Anshuman Khandual
  2019-09-03  9:45 ` [PATCH V7 2/3] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump Anshuman Khandual
@ 2019-09-03  9:45 ` Anshuman Khandual
  2019-09-10 16:17   ` Catalin Marinas
  2019-09-12 20:15   ` Catalin Marinas
  2 siblings, 2 replies; 19+ messages in thread
From: Anshuman Khandual @ 2019-09-03  9:45 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas, will
  Cc: mark.rutland, mhocko, david, ira.weiny, steve.capper, mgorman,
	steven.price, broonie, cai, ard.biesheuvel, cpandya, arunks,
	dan.j.williams, Robin.Murphy, logang, valentin.schneider,
	suzuki.poulose, osalvador

The arch code for hot-remove must tear down portions of the linear map and
vmemmap corresponding to memory being removed. In both cases the page
tables mapping these regions must be freed, and when sparse vmemmap is in
use the memory backing the vmemmap must also be freed.

This patch adds a new remove_pagetable() helper which can be used to tear
down either region, and calls it from vmemmap_free() and
___remove_pgd_mapping(). The sparse_vmap argument determines whether the
backing memory will be freed.

remove_pagetable() makes two distinct passes over the kernel page table.
In the first pass it unmaps, invalidates applicable TLB cache and frees
backing memory if required (vmemmap) for each mapped leaf entry. In the
second pass it looks for empty page table sections whose page table page
can be unmapped, TLB invalidated and freed.

While freeing intermediate level page table pages bail out if any of its
entries are still valid. This can happen for partially filled kernel page
table either from a previously attempted failed memory hot add or while
removing an address range which does not span the entire page table page
range.

The vmemmap region may share levels of table with the vmalloc region.
There can be conflicts between hot remove freeing page table pages with
a concurrent vmalloc() walking the kernel page table. This conflict can
not just be solved by taking the init_mm ptl because of existing locking
scheme in vmalloc(). Hence unlike linear mapping, skip freeing page table
pages while tearing down vmemmap mapping when vmalloc and vmemmap ranges
overlap.

While here update arch_add_memory() to handle __add_pages() failures by
just unmapping recently added kernel linear mapping. Now enable memory hot
remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.

This implementation is overall inspired from kernel page table tear down
procedure on X86 architecture.

Acked-by: Steve Capper <steve.capper@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/Kconfig              |   3 +
 arch/arm64/include/asm/memory.h |   1 +
 arch/arm64/mm/mmu.c             | 338 +++++++++++++++++++++++++++++++-
 3 files changed, 333 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6481964b6425..0079820af308 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -274,6 +274,9 @@ config ZONE_DMA32
 config ARCH_ENABLE_MEMORY_HOTPLUG
 	def_bool y
 
+config ARCH_ENABLE_MEMORY_HOTREMOVE
+	def_bool y
+
 config SMP
 	def_bool y
 
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index b61b50bf68b1..615dcd08acfa 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -54,6 +54,7 @@
 #define MODULES_VADDR		(BPF_JIT_REGION_END)
 #define MODULES_VSIZE		(SZ_128M)
 #define VMEMMAP_START		(-VMEMMAP_SIZE - SZ_2M)
+#define VMEMMAP_END		(VMEMMAP_START + VMEMMAP_SIZE)
 #define PCI_IO_END		(VMEMMAP_START - SZ_2M)
 #define PCI_IO_START		(PCI_IO_END - PCI_IO_SIZE)
 #define FIXADDR_TOP		(PCI_IO_START - SZ_2M)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index ee1bf416368d..980d761a7327 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -60,6 +60,14 @@ static pud_t bm_pud[PTRS_PER_PUD] __page_aligned_bss __maybe_unused;
 
 static DEFINE_SPINLOCK(swapper_pgdir_lock);
 
+/*
+ * This represents if vmalloc and vmemmap address range overlap with
+ * each other on an intermediate level kernel page table entry which
+ * in turn helps in deciding whether empty kernel page table pages
+ * if any can be freed during memory hotplug operation.
+ */
+static bool vmalloc_vmemmap_overlap;
+
 void set_swapper_pgd(pgd_t *pgdp, pgd_t pgd)
 {
 	pgd_t *fixmap_pgdp;
@@ -723,6 +731,250 @@ int kern_addr_valid(unsigned long addr)
 
 	return pfn_valid(pte_pfn(pte));
 }
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void free_hotplug_page_range(struct page *page, size_t size)
+{
+	WARN_ON(!page || PageReserved(page));
+	free_pages((unsigned long)page_address(page), get_order(size));
+}
+
+static void free_hotplug_pgtable_page(struct page *page)
+{
+	free_hotplug_page_range(page, PAGE_SIZE);
+}
+
+static void free_pte_table(pmd_t *pmdp, unsigned long addr)
+{
+	struct page *page;
+	pte_t *ptep;
+	int i;
+
+	ptep = pte_offset_kernel(pmdp, 0UL);
+	for (i = 0; i < PTRS_PER_PTE; i++) {
+		if (!pte_none(READ_ONCE(ptep[i])))
+			return;
+	}
+
+	page = pmd_page(READ_ONCE(*pmdp));
+	pmd_clear(pmdp);
+	__flush_tlb_kernel_pgtable(addr);
+	free_hotplug_pgtable_page(page);
+}
+
+static void free_pmd_table(pud_t *pudp, unsigned long addr)
+{
+	struct page *page;
+	pmd_t *pmdp;
+	int i;
+
+	if (CONFIG_PGTABLE_LEVELS <= 2)
+		return;
+
+	pmdp = pmd_offset(pudp, 0UL);
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		if (!pmd_none(READ_ONCE(pmdp[i])))
+			return;
+	}
+
+	page = pud_page(READ_ONCE(*pudp));
+	pud_clear(pudp);
+	__flush_tlb_kernel_pgtable(addr);
+	free_hotplug_pgtable_page(page);
+}
+
+static void free_pud_table(pgd_t *pgdp, unsigned long addr)
+{
+	struct page *page;
+	pud_t *pudp;
+	int i;
+
+	if (CONFIG_PGTABLE_LEVELS <= 3)
+		return;
+
+	pudp = pud_offset(pgdp, 0UL);
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		if (!pud_none(READ_ONCE(pudp[i])))
+			return;
+	}
+
+	page = pgd_page(READ_ONCE(*pgdp));
+	pgd_clear(pgdp);
+	__flush_tlb_kernel_pgtable(addr);
+	free_hotplug_pgtable_page(page);
+}
+
+static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
+				    unsigned long end, bool sparse_vmap)
+{
+	struct page *page;
+	pte_t *ptep, pte;
+
+	do {
+		ptep = pte_offset_kernel(pmdp, addr);
+		pte = READ_ONCE(*ptep);
+		if (pte_none(pte))
+			continue;
+
+		WARN_ON(!pte_present(pte));
+		page = sparse_vmap ? pte_page(pte) : NULL;
+		pte_clear(&init_mm, addr, ptep);
+		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+		if (sparse_vmap)
+			free_hotplug_page_range(page, PAGE_SIZE);
+	} while (addr += PAGE_SIZE, addr < end);
+}
+
+static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
+				    unsigned long end, bool sparse_vmap)
+{
+	unsigned long next;
+	struct page *page;
+	pmd_t *pmdp, pmd;
+
+	do {
+		next = pmd_addr_end(addr, end);
+		pmdp = pmd_offset(pudp, addr);
+		pmd = READ_ONCE(*pmdp);
+		if (pmd_none(pmd))
+			continue;
+
+		WARN_ON(!pmd_present(pmd));
+		if (pmd_sect(pmd)) {
+			page = sparse_vmap ? pmd_page(pmd) : NULL;
+			pmd_clear(pmdp);
+			flush_tlb_kernel_range(addr, next);
+			if (sparse_vmap)
+				free_hotplug_page_range(page, PMD_SIZE);
+			continue;
+		}
+		WARN_ON(!pmd_table(pmd));
+		unmap_hotplug_pte_range(pmdp, addr, next, sparse_vmap);
+	} while (addr = next, addr < end);
+}
+
+static void unmap_hotplug_pud_range(pgd_t *pgdp, unsigned long addr,
+				    unsigned long end, bool sparse_vmap)
+{
+	unsigned long next;
+	struct page *page;
+	pud_t *pudp, pud;
+
+	do {
+		next = pud_addr_end(addr, end);
+		pudp = pud_offset(pgdp, addr);
+		pud = READ_ONCE(*pudp);
+		if (pud_none(pud))
+			continue;
+
+		WARN_ON(!pud_present(pud));
+		if (pud_sect(pud)) {
+			page = sparse_vmap ? pud_page(pud) : NULL;
+			pud_clear(pudp);
+			flush_tlb_kernel_range(addr, next);
+			if (sparse_vmap)
+				free_hotplug_page_range(page, PUD_SIZE);
+			continue;
+		}
+		WARN_ON(!pud_table(pud));
+		unmap_hotplug_pmd_range(pudp, addr, next, sparse_vmap);
+	} while (addr = next, addr < end);
+}
+
+static void unmap_hotplug_range(unsigned long addr, unsigned long end,
+				bool sparse_vmap)
+{
+	unsigned long next;
+	pgd_t *pgdp, pgd;
+
+	do {
+		next = pgd_addr_end(addr, end);
+		pgdp = pgd_offset_k(addr);
+		pgd = READ_ONCE(*pgdp);
+		if (pgd_none(pgd))
+			continue;
+
+		WARN_ON(!pgd_present(pgd));
+		unmap_hotplug_pud_range(pgdp, addr, next, sparse_vmap);
+	} while (addr = next, addr < end);
+}
+
+static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
+				 unsigned long end)
+{
+	pte_t *ptep, pte;
+
+	do {
+		ptep = pte_offset_kernel(pmdp, addr);
+		pte = READ_ONCE(*ptep);
+		WARN_ON(!pte_none(pte));
+	} while (addr += PAGE_SIZE, addr < end);
+}
+
+static void free_empty_pmd_table(pud_t *pudp, unsigned long addr,
+				 unsigned long end)
+{
+	unsigned long next;
+	pmd_t *pmdp, pmd;
+
+	do {
+		next = pmd_addr_end(addr, end);
+		pmdp = pmd_offset(pudp, addr);
+		pmd = READ_ONCE(*pmdp);
+		if (pmd_none(pmd))
+			continue;
+
+		WARN_ON(!pmd_present(pmd) || !pmd_table(pmd) || pmd_sect(pmd));
+		free_empty_pte_table(pmdp, addr, next);
+		free_pte_table(pmdp, addr);
+	} while (addr = next, addr < end);
+}
+
+static void free_empty_pud_table(pgd_t *pgdp, unsigned long addr,
+				 unsigned long end)
+{
+	unsigned long next;
+	pud_t *pudp, pud;
+
+	do {
+		next = pud_addr_end(addr, end);
+		pudp = pud_offset(pgdp, addr);
+		pud = READ_ONCE(*pudp);
+		if (pud_none(pud))
+			continue;
+
+		WARN_ON(!pud_present(pud) || !pud_table(pud) || pud_sect(pud));
+		free_empty_pmd_table(pudp, addr, next);
+		free_pmd_table(pudp, addr);
+	} while (addr = next, addr < end);
+}
+
+static void free_empty_tables(unsigned long addr, unsigned long end)
+{
+	unsigned long next;
+	pgd_t *pgdp, pgd;
+
+	do {
+		next = pgd_addr_end(addr, end);
+		pgdp = pgd_offset_k(addr);
+		pgd = READ_ONCE(*pgdp);
+		if (pgd_none(pgd))
+			continue;
+
+		WARN_ON(!pgd_present(pgd));
+		free_empty_pud_table(pgdp, addr, next);
+		free_pud_table(pgdp, addr);
+	} while (addr = next, addr < end);
+}
+
+static void remove_pagetable(unsigned long start, unsigned long end,
+			     bool sparse_vmap)
+{
+	unmap_hotplug_range(start, end, sparse_vmap);
+	free_empty_tables(start, end);
+}
+#endif
+
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 #if !ARM64_SWAPPER_USES_SECTION_MAPS
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
@@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 void vmemmap_free(unsigned long start, unsigned long end,
 		struct vmem_altmap *altmap)
 {
+#ifdef CONFIG_MEMORY_HOTPLUG
+	/*
+	 * FIXME: We should have called remove_pagetable(start, end, true).
+	 * vmemmap and vmalloc virtual range might share intermediate kernel
+	 * page table entries. Removing vmemmap range page table pages here
+	 * can potentially conflict with a concurrent vmalloc() allocation.
+	 *
+	 * This is primarily because vmalloc() does not take init_mm ptl for
+	 * the entire page table walk and it's modification. Instead it just
+	 * takes the lock while allocating and installing page table pages
+	 * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table
+	 * entry via memory hot remove can cause vmalloc() kernel page table
+	 * walk pointers to be invalid on the fly which can cause corruption
+	 * or worst, a crash.
+	 *
+	 * So free_empty_tables() gets called where vmalloc and vmemmap range
+	 * do not overlap at any intermediate level kernel page table entry.
+	 */
+	unmap_hotplug_range(start, end, true);
+	if (!vmalloc_vmemmap_overlap)
+		free_empty_tables(start, end);
+#endif
 }
 #endif	/* CONFIG_SPARSEMEM_VMEMMAP */
 
@@ -1048,10 +1322,18 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
+static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
+{
+	unsigned long end = start + size;
+
+	WARN_ON(pgdir != init_mm.pgd);
+	remove_pagetable(start, end, false);
+}
+
 int arch_add_memory(int nid, u64 start, u64 size,
 			struct mhp_restrictions *restrictions)
 {
-	int flags = 0;
+	int ret, flags = 0;
 
 	if (rodata_full || debug_pagealloc_enabled())
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
@@ -1059,9 +1341,14 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
 			     size, PAGE_KERNEL, __pgd_pgtable_alloc, flags);
 
-	return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
+	ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
 			   restrictions);
+	if (ret)
+		__remove_pgd_mapping(swapper_pg_dir,
+				     __phys_to_virt(start), size);
+	return ret;
 }
+
 void arch_remove_memory(int nid, u64 start, u64 size,
 			struct vmem_altmap *altmap)
 {
@@ -1069,14 +1356,47 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct zone *zone;
 
-	/*
-	 * FIXME: Cleanup page tables (also in arch_add_memory() in case
-	 * adding fails). Until then, this function should only be used
-	 * during memory hotplug (adding memory), not for memory
-	 * unplug. ARCH_ENABLE_MEMORY_HOTREMOVE must not be
-	 * unlocked yet.
-	 */
 	zone = page_zone(pfn_to_page(start_pfn));
 	__remove_pages(zone, start_pfn, nr_pages, altmap);
+	__remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size);
+}
+
+static int __init find_vmalloc_vmemmap_overlap(void)
+{
+	unsigned long gap_start, gap_end;
+
+	/*
+	 * vmalloc and vmemmap address ranges should be linearly
+	 * increasing and non-overlapping before the gap between
+	 * them can be measured.
+	 */
+	BUILD_BUG_ON(VMALLOC_END <= VMALLOC_START);
+	BUILD_BUG_ON(VMEMMAP_END <= VMEMMAP_START);
+	BUILD_BUG_ON((VMEMMAP_START >= VMALLOC_START)
+			&& (VMEMMAP_START <= VMALLOC_END));
+	BUILD_BUG_ON((VMEMMAP_END >= VMALLOC_START)
+			&& (VMEMMAP_END <= VMALLOC_END));
+
+	/*
+	 * vmalloc and vmemmap can only be non-overlapping disjoint
+	 * ranges. So [gap_start..gap_end] is determined which will
+	 * represent the gap between the above mentioned ranges.
+	 */
+	if (VMEMMAP_START > VMALLOC_END) {
+		gap_start = VMALLOC_END;
+		gap_end = VMEMMAP_START;
+	} else {
+		gap_start = VMEMMAP_END;
+		gap_end = VMALLOC_START;
+	}
+
+	/*
+	 * Race condition during memory hot-remove exists when the
+	 * gap edges share same PGD entry in kernel page table.
+	 */
+	if ((gap_start & PGDIR_MASK) == (gap_end & PGDIR_MASK))
+		vmalloc_vmemmap_overlap = true;
+	return 0;
 }
+early_initcall(find_vmalloc_vmemmap_overlap);
 #endif
-- 
2.20.1


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH V7 1/3] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()
  2019-09-03  9:45 ` [PATCH V7 1/3] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory() Anshuman Khandual
@ 2019-09-04  8:16   ` David Hildenbrand
  2019-09-05  4:27     ` Anshuman Khandual
  2019-09-16  1:44   ` Balbir Singh
  1 sibling, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2019-09-04  8:16 UTC (permalink / raw)
  To: Anshuman Khandual, linux-mm, linux-kernel, linux-arm-kernel,
	akpm, catalin.marinas, will
  Cc: mark.rutland, mhocko, steve.capper, ira.weiny, suzuki.poulose,
	mgorman, steven.price, broonie, cai, ard.biesheuvel, cpandya,
	arunks, dan.j.williams, Robin.Murphy, logang, valentin.schneider,
	osalvador

On 03.09.19 11:45, Anshuman Khandual wrote:
> Memory hot remove uses get_nid_for_pfn() while tearing down linked sysfs
> entries between memory block and node. It first checks pfn validity with
> pfn_valid_within() before fetching nid. With CONFIG_HOLES_IN_ZONE config
> (arm64 has this enabled) pfn_valid_within() calls pfn_valid().
> 
> pfn_valid() is an arch implementation on arm64 (CONFIG_HAVE_ARCH_PFN_VALID)
> which scans all mapped memblock regions with memblock_is_map_memory(). This
> creates a problem in memory hot remove path which has already removed given
> memory range from memory block with memblock_[remove|free] before arriving
> at unregister_mem_sect_under_nodes(). Hence get_nid_for_pfn() returns -1
> skipping subsequent sysfs_remove_link() calls leaving node <-> memory block
> sysfs entries as is. Subsequent memory add operation hits BUG_ON() because
> of existing sysfs entries.

Since

commit 60bb462fc7adb06ebee3beb5a4af6c7e6182e248
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Aug 28 13:57:15 2019 +1000

    drivers/base/node.c: simplify unregister_memory_block_under_nodes()

that problem should be gone. There is no get_nid_for_pfn() call anymore.

So this patch should no longer be necessary - but as I said during
earlier versions of this patch, the re-ordering might still make sense
for consistency (removing stuff in the reverse order they were added).
You'll have to rephrase the description then.

> 
> [   62.007176] NUMA: Unknown node for memory at 0x680000000, assuming node 0
> [   62.052517] ------------[ cut here ]------------
> [   62.053211] kernel BUG at mm/memory_hotplug.c:1143!
> [   62.053868] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
> [   62.054589] Modules linked in:
> [   62.054999] CPU: 19 PID: 3275 Comm: bash Not tainted 5.1.0-rc2-00004-g28cea40b2683 #41
> [   62.056274] Hardware name: linux,dummy-virt (DT)
> [   62.057166] pstate: 40400005 (nZcv daif +PAN -UAO)
> [   62.058083] pc : add_memory_resource+0x1cc/0x1d8
> [   62.058961] lr : add_memory_resource+0x10c/0x1d8
> [   62.059842] sp : ffff0000168b3ce0
> [   62.060477] x29: ffff0000168b3ce0 x28: ffff8005db546c00
> [   62.061501] x27: 0000000000000000 x26: 0000000000000000
> [   62.062509] x25: ffff0000111ef000 x24: ffff0000111ef5d0
> [   62.063520] x23: 0000000000000000 x22: 00000006bfffffff
> [   62.064540] x21: 00000000ffffffef x20: 00000000006c0000
> [   62.065558] x19: 0000000000680000 x18: 0000000000000024
> [   62.066566] x17: 0000000000000000 x16: 0000000000000000
> [   62.067579] x15: ffffffffffffffff x14: ffff8005e412e890
> [   62.068588] x13: ffff8005d6b105d8 x12: 0000000000000000
> [   62.069610] x11: ffff8005d6b10490 x10: 0000000000000040
> [   62.070615] x9 : ffff8005e412e898 x8 : ffff8005e412e890
> [   62.071631] x7 : ffff8005d6b105d8 x6 : ffff8005db546c00
> [   62.072640] x5 : 0000000000000001 x4 : 0000000000000002
> [   62.073654] x3 : ffff8005d7049480 x2 : 0000000000000002
> [   62.074666] x1 : 0000000000000003 x0 : 00000000ffffffef
> [   62.075685] Process bash (pid: 3275, stack limit = 0x00000000d754280f)
> [   62.076930] Call trace:
> [   62.077411]  add_memory_resource+0x1cc/0x1d8
> [   62.078227]  __add_memory+0x70/0xa8
> [   62.078901]  probe_store+0xa4/0xc8
> [   62.079561]  dev_attr_store+0x18/0x28
> [   62.080270]  sysfs_kf_write+0x40/0x58
> [   62.080992]  kernfs_fop_write+0xcc/0x1d8
> [   62.081744]  __vfs_write+0x18/0x40
> [   62.082400]  vfs_write+0xa4/0x1b0
> [   62.083037]  ksys_write+0x5c/0xc0
> [   62.083681]  __arm64_sys_write+0x18/0x20
> [   62.084432]  el0_svc_handler+0x88/0x100
> [   62.085177]  el0_svc+0x8/0xc
> 
> Re-ordering memblock_[free|remove]() with arch_remove_memory() solves the
> problem on arm64 as pfn_valid() behaves correctly and returns positive
> as memblock for the address range still exists. arch_remove_memory()
> removes applicable memory sections from zone with __remove_pages() and
> tears down kernel linear mapping. Removing memblock regions afterwards
> is safe because there is no other memblock (bootmem) allocator user that
> late. So nobody is going to allocate from the removed range just to blow
> up later. Also nobody should be using the bootmem allocated range else
> we wouldn't allow to remove it. So reordering is indeed safe.
> 
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Acked-by: Mark Rutland <mark.rutland@arm.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---
>  mm/memory_hotplug.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index c73f09913165..355c466e0621 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1770,13 +1770,13 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
>  
>  	/* remove memmap entry */
>  	firmware_map_remove(start, start + size, "System RAM");
> -	memblock_free(start, size);
> -	memblock_remove(start, size);
>  
>  	/* remove memory block devices before removing memory */
>  	remove_memory_block_devices(start, size);
>  
>  	arch_remove_memory(nid, start, size, NULL);
> +	memblock_free(start, size);
> +	memblock_remove(start, size);
>  	__release_memory_resource(start, size);
>  
>  	try_offline_node(nid);
> 

-- 

Thanks,

David / dhildenb

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH V7 1/3] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()
  2019-09-04  8:16   ` David Hildenbrand
@ 2019-09-05  4:27     ` Anshuman Khandual
  0 siblings, 0 replies; 19+ messages in thread
From: Anshuman Khandual @ 2019-09-05  4:27 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm, linux-kernel, linux-arm-kernel,
	akpm, catalin.marinas, will
  Cc: mark.rutland, mhocko, steve.capper, ira.weiny, suzuki.poulose,
	mgorman, steven.price, broonie, cai, ard.biesheuvel, cpandya,
	arunks, dan.j.williams, Robin.Murphy, logang, valentin.schneider,
	osalvador



On 09/04/2019 01:46 PM, David Hildenbrand wrote:
> On 03.09.19 11:45, Anshuman Khandual wrote:
>> Memory hot remove uses get_nid_for_pfn() while tearing down linked sysfs
>> entries between memory block and node. It first checks pfn validity with
>> pfn_valid_within() before fetching nid. With CONFIG_HOLES_IN_ZONE config
>> (arm64 has this enabled) pfn_valid_within() calls pfn_valid().
>>
>> pfn_valid() is an arch implementation on arm64 (CONFIG_HAVE_ARCH_PFN_VALID)
>> which scans all mapped memblock regions with memblock_is_map_memory(). This
>> creates a problem in memory hot remove path which has already removed given
>> memory range from memory block with memblock_[remove|free] before arriving
>> at unregister_mem_sect_under_nodes(). Hence get_nid_for_pfn() returns -1
>> skipping subsequent sysfs_remove_link() calls leaving node <-> memory block
>> sysfs entries as is. Subsequent memory add operation hits BUG_ON() because
>> of existing sysfs entries.
> Since
> 
> commit 60bb462fc7adb06ebee3beb5a4af6c7e6182e248
> Author: David Hildenbrand <david@redhat.com>
> Date:   Wed Aug 28 13:57:15 2019 +1000
> 
>     drivers/base/node.c: simplify unregister_memory_block_under_nodes()
> 
> that problem should be gone. There is no get_nid_for_pfn() call anymore.

Yes, the problem is gone. The above commit is still not present on arm64
tree against which this series was rebased and tested while posting.

> 
> So this patch should no longer be necessary - but as I said during
> earlier versions of this patch, the re-ordering might still make sense
> for consistency (removing stuff in the reverse order they were added).
> You'll have to rephrase the description then.

Sure will reword the commit message on these lines.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH V7 3/3] arm64/mm: Enable memory hot remove
  2019-09-03  9:45 ` [PATCH V7 3/3] arm64/mm: Enable memory hot remove Anshuman Khandual
@ 2019-09-10 16:17   ` Catalin Marinas
  2019-09-11 10:31     ` David Hildenbrand
  2019-09-12  4:28     ` Anshuman Khandual
  2019-09-12 20:15   ` Catalin Marinas
  1 sibling, 2 replies; 19+ messages in thread
From: Catalin Marinas @ 2019-09-10 16:17 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: mark.rutland, mhocko, david, linux-mm, arunks, cpandya,
	ira.weiny, will, steven.price, valentin.schneider,
	suzuki.poulose, Robin.Murphy, broonie, cai, ard.biesheuvel,
	dan.j.williams, linux-arm-kernel, osalvador, steve.capper,
	logang, linux-kernel, akpm, mgorman

On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote:
> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>  void vmemmap_free(unsigned long start, unsigned long end,
>  		struct vmem_altmap *altmap)
>  {
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +	/*
> +	 * FIXME: We should have called remove_pagetable(start, end, true).
> +	 * vmemmap and vmalloc virtual range might share intermediate kernel
> +	 * page table entries. Removing vmemmap range page table pages here
> +	 * can potentially conflict with a concurrent vmalloc() allocation.
> +	 *
> +	 * This is primarily because vmalloc() does not take init_mm ptl for
> +	 * the entire page table walk and it's modification. Instead it just
> +	 * takes the lock while allocating and installing page table pages
> +	 * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table
> +	 * entry via memory hot remove can cause vmalloc() kernel page table
> +	 * walk pointers to be invalid on the fly which can cause corruption
> +	 * or worst, a crash.
> +	 *
> +	 * So free_empty_tables() gets called where vmalloc and vmemmap range
> +	 * do not overlap at any intermediate level kernel page table entry.
> +	 */
> +	unmap_hotplug_range(start, end, true);
> +	if (!vmalloc_vmemmap_overlap)
> +		free_empty_tables(start, end);
> +#endif
>  }
>  #endif	/* CONFIG_SPARSEMEM_VMEMMAP */

I wonder whether we could simply ignore the vmemmap freeing altogether,
just leave it around and not unmap it. This way, we could call
unmap_kernel_range() for removing the linear map and we save some code.

For the linear map, I think we use just above 2MB of tables for 1GB of
memory mapped (worst case with 4KB pages we need 512 pte pages). For
vmemmap we'd use slightly above 2MB for a 64GB hotplugged memory. Do we
expect such memory to be re-plugged again in the same range? If we do,
then I shouldn't even bother with removing the vmmemmap.

I don't fully understand the use-case for memory hotremove, so any
additional info would be useful to make a decision here.

-- 
Catalin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH V7 3/3] arm64/mm: Enable memory hot remove
  2019-09-10 16:17   ` Catalin Marinas
@ 2019-09-11 10:31     ` David Hildenbrand
  2019-09-12  4:28     ` Anshuman Khandual
  1 sibling, 0 replies; 19+ messages in thread
From: David Hildenbrand @ 2019-09-11 10:31 UTC (permalink / raw)
  To: Catalin Marinas, Anshuman Khandual
  Cc: mark.rutland, mhocko, linux-mm, arunks, cpandya, ira.weiny, will,
	steven.price, valentin.schneider, suzuki.poulose, Robin.Murphy,
	broonie, cai, ard.biesheuvel, dan.j.williams, linux-arm-kernel,
	osalvador, steve.capper, logang, linux-kernel, akpm, mgorman

On 10.09.19 18:17, Catalin Marinas wrote:
> On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote:
>> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>  void vmemmap_free(unsigned long start, unsigned long end,
>>  		struct vmem_altmap *altmap)
>>  {
>> +#ifdef CONFIG_MEMORY_HOTPLUG
>> +	/*
>> +	 * FIXME: We should have called remove_pagetable(start, end, true).
>> +	 * vmemmap and vmalloc virtual range might share intermediate kernel
>> +	 * page table entries. Removing vmemmap range page table pages here
>> +	 * can potentially conflict with a concurrent vmalloc() allocation.
>> +	 *
>> +	 * This is primarily because vmalloc() does not take init_mm ptl for
>> +	 * the entire page table walk and it's modification. Instead it just
>> +	 * takes the lock while allocating and installing page table pages
>> +	 * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table
>> +	 * entry via memory hot remove can cause vmalloc() kernel page table
>> +	 * walk pointers to be invalid on the fly which can cause corruption
>> +	 * or worst, a crash.
>> +	 *
>> +	 * So free_empty_tables() gets called where vmalloc and vmemmap range
>> +	 * do not overlap at any intermediate level kernel page table entry.
>> +	 */
>> +	unmap_hotplug_range(start, end, true);
>> +	if (!vmalloc_vmemmap_overlap)
>> +		free_empty_tables(start, end);
>> +#endif
>>  }
>>  #endif	/* CONFIG_SPARSEMEM_VMEMMAP */
> 
> I wonder whether we could simply ignore the vmemmap freeing altogether,
> just leave it around and not unmap it. This way, we could call
> unmap_kernel_range() for removing the linear map and we save some code.
> 
> For the linear map, I think we use just above 2MB of tables for 1GB of
> memory mapped (worst case with 4KB pages we need 512 pte pages). For
> vmemmap we'd use slightly above 2MB for a 64GB hotplugged memory. Do we
> expect such memory to be re-plugged again in the same range? If we do,
> then I shouldn't even bother with removing the vmmemmap.
> 

FWIW, I think we should do it cleanly.

> I don't fully understand the use-case for memory hotremove, so any
> additional info would be useful to make a decision here.
> 

Especially in virtual environment, hotremove will be relevant. For
physical environments - I have no idea how important that is for ARM.

-- 

Thanks,

David / dhildenb

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH V7 3/3] arm64/mm: Enable memory hot remove
  2019-09-10 16:17   ` Catalin Marinas
  2019-09-11 10:31     ` David Hildenbrand
@ 2019-09-12  4:28     ` Anshuman Khandual
  2019-09-12  8:37       ` Anshuman Khandual
  1 sibling, 1 reply; 19+ messages in thread
From: Anshuman Khandual @ 2019-09-12  4:28 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: mark.rutland, mhocko, david, linux-mm, arunks, cpandya,
	ira.weiny, will, steven.price, valentin.schneider,
	suzuki.poulose, Robin.Murphy, broonie, cai, ard.biesheuvel,
	dan.j.williams, linux-arm-kernel, osalvador, steve.capper,
	logang, linux-kernel, akpm, mgorman



On 09/10/2019 09:47 PM, Catalin Marinas wrote:
> On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote:
>> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>  void vmemmap_free(unsigned long start, unsigned long end,
>>  		struct vmem_altmap *altmap)
>>  {
>> +#ifdef CONFIG_MEMORY_HOTPLUG
>> +	/*
>> +	 * FIXME: We should have called remove_pagetable(start, end, true).
>> +	 * vmemmap and vmalloc virtual range might share intermediate kernel
>> +	 * page table entries. Removing vmemmap range page table pages here
>> +	 * can potentially conflict with a concurrent vmalloc() allocation.
>> +	 *
>> +	 * This is primarily because vmalloc() does not take init_mm ptl for
>> +	 * the entire page table walk and it's modification. Instead it just
>> +	 * takes the lock while allocating and installing page table pages
>> +	 * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table
>> +	 * entry via memory hot remove can cause vmalloc() kernel page table
>> +	 * walk pointers to be invalid on the fly which can cause corruption
>> +	 * or worst, a crash.
>> +	 *
>> +	 * So free_empty_tables() gets called where vmalloc and vmemmap range
>> +	 * do not overlap at any intermediate level kernel page table entry.
>> +	 */
>> +	unmap_hotplug_range(start, end, true);
>> +	if (!vmalloc_vmemmap_overlap)
>> +		free_empty_tables(start, end);
>> +#endif
>>  }
>>  #endif	/* CONFIG_SPARSEMEM_VMEMMAP */

Hello Catalin,

> 
> I wonder whether we could simply ignore the vmemmap freeing altogether,
> just leave it around and not unmap it. This way, we could call

This would have been an option (even if we just ignore for a moment that
it might not be the cleanest possible method) if present memory hot remove
scenarios involved just system RAM of comparable sizes.

But with persistent memory which will be plugged in as ZONE_DEVICE might
ask for a vmem_atlamp based vmemmap mapping where the backing memory comes
from the persistent memory range itself not from existing system RAM. IIRC
altmap support was originally added because the amount persistent memory on
a system might be order of magnitude higher than that of regular system RAM.
During normal memory hot add (without altmap) would have caused great deal
of consumption from system RAM just for persistent memory range's vmemmap
mapping. In order to avoid such a scenario altmap was created to allocate
vmemmap mapping backing memory from the device memory range itself.

In such cases vmemmap must be unmapped and it's backing memory freed up for
the complete removal of persistent memory which originally requested for
altmap based vmemmap backing.

Just as a reference, the upcoming series which enables altmap support on
arm64 tries to allocate vmemmap mapping backing memory from the device range
itself during memory hot add and free them up during memory hot remove. Those
methods will not be possible if memory hot-remove does not really free up
vmemmap backing storage.

https://patchwork.kernel.org/project/linux-mm/list/?series=139299

> unmap_kernel_range() for removing the linear map and we save some code.
> 
> For the linear map, I think we use just above 2MB of tables for 1GB of
> memory mapped (worst case with 4KB pages we need 512 pte pages). For
> vmemmap we'd use slightly above 2MB for a 64GB hotplugged memory. Do we

You are right, the amount of memory required for kernel page table pages
are dependent on mapping page size and size of the range to be mapped. But
as explained below there might be hot remove situations where these ranges
will remain unused for ever after hot remove. There are chances that some
these pages (even empty) might remain unused for good.

> expect such memory to be re-plugged again in the same range? If we do,
> then I shouldn't even bother with removing the vmmemmap.
> 
> I don't fully understand the use-case for memory hotremove, so any
> additional info would be useful to make a decision here.

Sure, these are some of the scenarios I could recollect.

Physical Environment:

A. Physical DIMM replacement

Platform detects memory errors and initiates a DIMM replacement.

- Hot remove selected DIMM with errors
- Hot add a new DIMM in it's place on the same slot

In normal circumstances, the new DIMM will require the same linear
and vmemmap mapping. In such cases hot-remove could just unmap
linear mapping, leave everything else and be done with it. Though
I am not sure whether its a good idea to leave aside accessible
struct pages which correspond to non-present pfns.

B. Physical DIMM movement

Platform can detect errors on a DIMM slot itself and initiates a
DIMM movement into a different empty slot

- Hot remove selected memory DIMM from defective slot
- Hot add same memory DIMM into a different available empty slot

Physical address range for the DIMM has now changed, it will require
different linear and vmemmap mapping than what it had originally.
Hence during hot remove we should not only unmap linear and vmemmap
mapping but also free up all associated resources as this physical
memory range is never going to be available again because the slot
has gone bad permanently.

C. Physical DIMM hot-remove

Platform just initiates hot-remove of a DIMM and reduces available
memory as instructed by the administrator.

- Hot remove a selected DIMM

This memory might never come back again or comes back on a different
slot. Without that certainty, its is always better to unmap both linear
and vmemmap mappings, free up all associated resources.

D. Changing NUMA affinity

After performance analysis, administrator through the platform initiates
a DIMM hot-remove from a given node and a DIMM hot-add to another node
to achieve better NUMA affinity.

- Hot remove a selected DIMM from node N0
- Hot add selected DIMM to another node N1

Here both linear and vmemmap ranges will change after the movement and
there is uncertainty regarding whether the now empty physical range on
node N0 will ever get populated again. Without that certainty, its is
always better to unmap both linear and vmemmap mapping, free up all
associated resources.

Virtual Environment:

1. Memory hot-remove can just be initiated by the admin from the host in
order to reduce total physical memory entitlement of a guest which will
reflect any changing hosting contracts etc. The memory might never come
back again and in such cases hot-remove should be as clean freeing all
associated resources.

2. Memory hot-remove on the guest can be initiated from the host after
detecting memory errors on the backing physical DIMM. Memory hot-remove
on the guest will be followed by memory hot-remove on the host itself.
Replacement DIMM can be on the same slot taking over the same physical
address range from host as before but guest might get back it's memory
either on the same range previously or on some other guest physical
range.

3. Changing NUMA binding for a guest on the host might require guest
PFN realignment with respect to guest nodes as well.

Persistent Memory:

As mentioned previously, persistent memory has special vmemmap mapping
requirements through vmem_altmap which would need freeing up backing
memory from it's own range, for it to be completely removed.

Device memory (FPGA cards, GPU cards, Network cards etc):

In future, some of these coherent device memory might be plugged into
ZONE_DEVICE and managed through drivers. They might be attached to the
system via upcoming interfaces like CCIX. The managing drivers might
need to offline the device memory range in order to service some high
priority error, re-init and plug it back on a different physical range
due to existing CCIX link errors or some other constraints.

The point I am trying to make here is that there are many such possible
combinations of events with respect to memory hot-remove in both physical
and virtual environment for system RAM, persistent memory and other coherent
device memory. Leaving aside kernel page table pages or even struct pages
for unavailable (possibly forever) physical range might problematic. IMHO
it is better to do this as much cleanly as possible.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH V7 3/3] arm64/mm: Enable memory hot remove
  2019-09-12  4:28     ` Anshuman Khandual
@ 2019-09-12  8:37       ` Anshuman Khandual
  0 siblings, 0 replies; 19+ messages in thread
From: Anshuman Khandual @ 2019-09-12  8:37 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: mark.rutland, mhocko, david, linux-mm, arunks, cpandya,
	ira.weiny, will, steven.price, valentin.schneider,
	suzuki.poulose, Robin.Murphy, broonie, cai, ard.biesheuvel,
	dan.j.williams, linux-arm-kernel, osalvador, steve.capper,
	logang, linux-kernel, akpm, mgorman



On 09/12/2019 09:58 AM, Anshuman Khandual wrote:
> 
> On 09/10/2019 09:47 PM, Catalin Marinas wrote:
>> On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote:
>>> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>>  void vmemmap_free(unsigned long start, unsigned long end,
>>>  		struct vmem_altmap *altmap)
>>>  {
>>> +#ifdef CONFIG_MEMORY_HOTPLUG
>>> +	/*
>>> +	 * FIXME: We should have called remove_pagetable(start, end, true).
>>> +	 * vmemmap and vmalloc virtual range might share intermediate kernel
>>> +	 * page table entries. Removing vmemmap range page table pages here
>>> +	 * can potentially conflict with a concurrent vmalloc() allocation.
>>> +	 *
>>> +	 * This is primarily because vmalloc() does not take init_mm ptl for
>>> +	 * the entire page table walk and it's modification. Instead it just
>>> +	 * takes the lock while allocating and installing page table pages
>>> +	 * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table
>>> +	 * entry via memory hot remove can cause vmalloc() kernel page table
>>> +	 * walk pointers to be invalid on the fly which can cause corruption
>>> +	 * or worst, a crash.
>>> +	 *
>>> +	 * So free_empty_tables() gets called where vmalloc and vmemmap range
>>> +	 * do not overlap at any intermediate level kernel page table entry.
>>> +	 */
>>> +	unmap_hotplug_range(start, end, true);
>>> +	if (!vmalloc_vmemmap_overlap)
>>> +		free_empty_tables(start, end);
>>> +#endif
>>>  }
>>>  #endif	/* CONFIG_SPARSEMEM_VMEMMAP */
> Hello Catalin,
> 
>> I wonder whether we could simply ignore the vmemmap freeing altogether,
>> just leave it around and not unmap it. This way, we could call
> This would have been an option (even if we just ignore for a moment that
> it might not be the cleanest possible method) if present memory hot remove
> scenarios involved just system RAM of comparable sizes.
> 
> But with persistent memory which will be plugged in as ZONE_DEVICE might
> ask for a vmem_atlamp based vmemmap mapping where the backing memory comes
> from the persistent memory range itself not from existing system RAM. IIRC
> altmap support was originally added because the amount persistent memory on
> a system might be order of magnitude higher than that of regular system RAM.
> During normal memory hot add (without altmap) would have caused great deal
> of consumption from system RAM just for persistent memory range's vmemmap
> mapping. In order to avoid such a scenario altmap was created to allocate
> vmemmap mapping backing memory from the device memory range itself.
> 
> In such cases vmemmap must be unmapped and it's backing memory freed up for
> the complete removal of persistent memory which originally requested for
> altmap based vmemmap backing.
> 
> Just as a reference, the upcoming series which enables altmap support on
> arm64 tries to allocate vmemmap mapping backing memory from the device range
> itself during memory hot add and free them up during memory hot remove. Those
> methods will not be possible if memory hot-remove does not really free up
> vmemmap backing storage.
> 
> https://patchwork.kernel.org/project/linux-mm/list/?series=139299
> 

Just to add in here. There is an ongoing work which will enable allocating
memory from the hot-add range itself even for normal system RAM. So this
might not be specific to ZONE_DEVICE based device/persistent memory alone
for a long time.

https://lore.kernel.org/lkml/20190725160207.19579-1-osalvador@suse.de/

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH V7 3/3] arm64/mm: Enable memory hot remove
  2019-09-03  9:45 ` [PATCH V7 3/3] arm64/mm: Enable memory hot remove Anshuman Khandual
  2019-09-10 16:17   ` Catalin Marinas
@ 2019-09-12 20:15   ` Catalin Marinas
  2019-09-13  5:58     ` Anshuman Khandual
  1 sibling, 1 reply; 19+ messages in thread
From: Catalin Marinas @ 2019-09-12 20:15 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: mark.rutland, mhocko, david, linux-mm, arunks, cpandya,
	ira.weiny, will, steven.price, valentin.schneider,
	suzuki.poulose, Robin.Murphy, broonie, cai, ard.biesheuvel,
	dan.j.williams, linux-arm-kernel, osalvador, steve.capper,
	logang, linux-kernel, akpm, mgorman

Hi Anshuman,

Thanks for the details on the need for removing the page tables and
vmemmap backing. Some comments on the code below.

On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote:
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -60,6 +60,14 @@ static pud_t bm_pud[PTRS_PER_PUD] __page_aligned_bss __maybe_unused;
>  
>  static DEFINE_SPINLOCK(swapper_pgdir_lock);
>  
> +/*
> + * This represents if vmalloc and vmemmap address range overlap with
> + * each other on an intermediate level kernel page table entry which
> + * in turn helps in deciding whether empty kernel page table pages
> + * if any can be freed during memory hotplug operation.
> + */
> +static bool vmalloc_vmemmap_overlap;

I'd say just move the static find_vmalloc_vmemmap_overlap() function
here, the compiler should be sufficiently smart enough to figure out
that it's just a build-time constant.

> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>  void vmemmap_free(unsigned long start, unsigned long end,
>  		struct vmem_altmap *altmap)
>  {
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +	/*
> +	 * FIXME: We should have called remove_pagetable(start, end, true).
> +	 * vmemmap and vmalloc virtual range might share intermediate kernel
> +	 * page table entries. Removing vmemmap range page table pages here
> +	 * can potentially conflict with a concurrent vmalloc() allocation.
> +	 *
> +	 * This is primarily because vmalloc() does not take init_mm ptl for
> +	 * the entire page table walk and it's modification. Instead it just
> +	 * takes the lock while allocating and installing page table pages
> +	 * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table
> +	 * entry via memory hot remove can cause vmalloc() kernel page table
> +	 * walk pointers to be invalid on the fly which can cause corruption
> +	 * or worst, a crash.
> +	 *
> +	 * So free_empty_tables() gets called where vmalloc and vmemmap range
> +	 * do not overlap at any intermediate level kernel page table entry.
> +	 */
> +	unmap_hotplug_range(start, end, true);
> +	if (!vmalloc_vmemmap_overlap)
> +		free_empty_tables(start, end);
> +#endif
>  }

So, I see the risk with overlapping and I guess for some kernel
configurations (PAGE_SIZE == 64K) we may not be able to avoid it. If we
can, that's great, otherwise could we rewrite the above functions to
handle floor and ceiling similar to free_pgd_range()? (I wonder how this
function works if you called it on init_mm and kernel address range). By
having the vmemmap start/end information it avoids freeing partially
filled page table pages.

Another question: could we do the page table and the actual vmemmap
pages freeing in a single pass (sorry if this has been discussed
before)?

> @@ -1048,10 +1322,18 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> +static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
> +{
> +	unsigned long end = start + size;
> +
> +	WARN_ON(pgdir != init_mm.pgd);
> +	remove_pagetable(start, end, false);
> +}

I think the point I've made previously still stands: you only call
remove_pagetable() with sparse_vmap == false in this patch. Just remove
the extra argument and call unmap_hotplug_range() with sparse_vmap ==
false directly in remove_pagetable().

-- 
Catalin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH V7 3/3] arm64/mm: Enable memory hot remove
  2019-09-12 20:15   ` Catalin Marinas
@ 2019-09-13  5:58     ` Anshuman Khandual
  2019-09-13 10:09       ` Catalin Marinas
  0 siblings, 1 reply; 19+ messages in thread
From: Anshuman Khandual @ 2019-09-13  5:58 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: mark.rutland, mhocko, david, linux-mm, arunks, cpandya,
	ira.weiny, will, steven.price, valentin.schneider,
	suzuki.poulose, Robin.Murphy, broonie, cai, ard.biesheuvel,
	dan.j.williams, linux-arm-kernel, osalvador, steve.capper,
	logang, linux-kernel, akpm, mgorman

On 09/13/2019 01:45 AM, Catalin Marinas wrote:
> Hi Anshuman,
> 
> Thanks for the details on the need for removing the page tables and
> vmemmap backing. Some comments on the code below.
> 
> On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote:
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -60,6 +60,14 @@ static pud_t bm_pud[PTRS_PER_PUD] __page_aligned_bss __maybe_unused;
>>  
>>  static DEFINE_SPINLOCK(swapper_pgdir_lock);
>>  
>> +/*
>> + * This represents if vmalloc and vmemmap address range overlap with
>> + * each other on an intermediate level kernel page table entry which
>> + * in turn helps in deciding whether empty kernel page table pages
>> + * if any can be freed during memory hotplug operation.
>> + */
>> +static bool vmalloc_vmemmap_overlap;
> 
> I'd say just move the static find_vmalloc_vmemmap_overlap() function
> here, the compiler should be sufficiently smart enough to figure out
> that it's just a build-time constant.

Sure, will do.

> 
>> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>  void vmemmap_free(unsigned long start, unsigned long end,
>>  		struct vmem_altmap *altmap)
>>  {
>> +#ifdef CONFIG_MEMORY_HOTPLUG
>> +	/*
>> +	 * FIXME: We should have called remove_pagetable(start, end, true).
>> +	 * vmemmap and vmalloc virtual range might share intermediate kernel
>> +	 * page table entries. Removing vmemmap range page table pages here
>> +	 * can potentially conflict with a concurrent vmalloc() allocation.
>> +	 *
>> +	 * This is primarily because vmalloc() does not take init_mm ptl for
>> +	 * the entire page table walk and it's modification. Instead it just
>> +	 * takes the lock while allocating and installing page table pages
>> +	 * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table
>> +	 * entry via memory hot remove can cause vmalloc() kernel page table
>> +	 * walk pointers to be invalid on the fly which can cause corruption
>> +	 * or worst, a crash.
>> +	 *
>> +	 * So free_empty_tables() gets called where vmalloc and vmemmap range
>> +	 * do not overlap at any intermediate level kernel page table entry.
>> +	 */
>> +	unmap_hotplug_range(start, end, true);
>> +	if (!vmalloc_vmemmap_overlap)
>> +		free_empty_tables(start, end);
>> +#endif
>>  }
> 
> So, I see the risk with overlapping and I guess for some kernel
> configurations (PAGE_SIZE == 64K) we may not be able to avoid it. If we

Did not see 64K config options to have overlap, do you suspect they might ?
After the 52 bit KVA series has been merged, following configurations have
the vmalloc-vmemmap range overlap problem.

- 4K  page size with 48 bit VA space
- 16K page size with 48 bit VA space

> can, that's great, otherwise could we rewrite the above functions to
> handle floor and ceiling similar to free_pgd_range()? (I wonder how this
> function works if you called it on init_mm and kernel address range). By

Hmm, never tried that. Are you wondering if this can be used directly ?
There are two distinct elements which make it very specific to user page
tables, mmu_gather based TLB tracking and mm->pgtable_bytes accounting
with mm_dec_nr_pxx().

> having the vmemmap start/end information it avoids freeing partially
> filled page table pages.

Did you mean page table pages which can partially overlap with vmalloc ?

The problem (race) is not because of the inability to deal with partially
filled table. We can handle that correctly as explained below [1]. The
problem is with inadequate kernel page table locking during vmalloc()
which might be accessing intermediate kernel page table pointers which is
being freed with free_empty_tables() concurrently. Hence we cannot free
any page table page which can ever have entries from vmalloc() range.

Though not completely sure, whether I really understood the suggestion above
with respect to the floor-ceiling mechanism as in free_pgd_range(). Are you
suggesting that we should only attempt to free up those vmemmap range page
table pages which *definitely* could never overlap with vmalloc by working
on a modified (i.e cut down with floor-ceiling while avoiding vmalloc range
at each level) vmemmap range instead ? This can be one restrictive version of
the function free_empty_tables() called in case there is an overlap. So we
will maintain two versions for free_empty_tables(). Please correct me if any
the above assumptions or understanding is wrong.

But yes, with this we should be able to free up some possible empty page
table pages which were being left out in the current proposal when overlap
happens.

[1] Skipping partially filled page tables

All free_pXX_table() functions take care in avoiding freeing partially filled
page table pages whether they represent or ever represented linear or vmemmap
or vmalloc mapping in init_mm. They go over each individual entry in a given
page table making sure that each of them checks as pXX_none() before freeing
the entire page table page.

Though walking is restricted by the address range in question.

free_empty_tables(start, end)
	free_empty_pud_table(pgdp, addr, next);
		free_empty_pmd_table(pudp, addr, next);
			free_empty_pte_table(pmdp, addr, next);

Page table pages being examined here on the way while freeing might contain
entries which once represented address beyond vmemmap range in removal. But
thats a good thing IMHO. It can accommodate vmemmap tear down from a previous
hot remove for an adjacent range which might not have been freed last time.

pudp = pud_offset(pgdp, 0UL);
pmdp = pmd_offset(pudp, 0UL);
ptep = pte_offset_kernel(pmdp, 0UL);

pxx_none() makes sure that in such cases freeing of the page table page is
skipped. But yes, even though it is more thorough, it might attempt to free
page table pages which might contains entries not belonging to the range
being removed.

> 
> Another question: could we do the page table and the actual vmemmap
> pages freeing in a single pass (sorry if this has been discussed
> before)?

We could and some initial versions (till V5) of the series had that in fact.
Initially Mark Rutland had suggested to do this in two passes. Some extracts
from the previous discussion.

https://lkml.org/lkml/2019/5/30/1159

-----------------------
Looking at this some more, I don't think this is quite right, and tI
think that structure of the free_*() and remove_*() functions makes this
unnecessarily hard to follow. We should aim for this to be obviously
correct.

The x86 code is the best template to follow here. As mentioned
previously, I'm fairly certain it's not entirely correct (e.g. due to
missing TLB maintenance), and we've already diverged a fair amount in
fixing up obvious issues, so we shouldn't aim to mirror it.

I think that the structure of unmap_region() is closer to what we want
here -- do one pass to unmap leaf entries (and freeing the associated
memory if unmapping the vmemmap), then do a second pass cleaning up any
empty tables.
----------------------

Apart from the fact that two passes over the page table is cleaner and gives
us more granular and modular infrastructure to use for later purposes, it is
also a necessity in dealing with vmalloc-vmemmap overlap. free_empty_tables()
which is the second pass, can be skipped cleanly when overlap is detected.

> 
>> @@ -1048,10 +1322,18 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
>>  }
>>  
>>  #ifdef CONFIG_MEMORY_HOTPLUG
>> +static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
>> +{
>> +	unsigned long end = start + size;
>> +
>> +	WARN_ON(pgdir != init_mm.pgd);
>> +	remove_pagetable(start, end, false);
>> +}
> 
> I think the point I've made previously still stands: you only call
> remove_pagetable() with sparse_vmap == false in this patch. Just remove
> the extra argument and call unmap_hotplug_range() with sparse_vmap ==
> false directly in remove_pagetable().

Sure, will do. The original function signature was left unchanged in the hope
that at a later point in time it can be called with "sparse_vmap == true" as
mentioned by the comment in vmemmap_free(). Will change the comment as well.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH V7 3/3] arm64/mm: Enable memory hot remove
  2019-09-13  5:58     ` Anshuman Khandual
@ 2019-09-13 10:09       ` Catalin Marinas
  2019-09-17  4:36         ` Anshuman Khandual
  0 siblings, 1 reply; 19+ messages in thread
From: Catalin Marinas @ 2019-09-13 10:09 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: mark.rutland, mhocko, david, linux-mm, arunks, cpandya,
	ira.weiny, will, steven.price, valentin.schneider,
	suzuki.poulose, Robin.Murphy, broonie, cai, ard.biesheuvel,
	dan.j.williams, linux-arm-kernel, osalvador, steve.capper,
	logang, linux-kernel, akpm, mgorman

On Fri, Sep 13, 2019 at 11:28:01AM +0530, Anshuman Khandual wrote:
> On 09/13/2019 01:45 AM, Catalin Marinas wrote:
> > On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote:
> >> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
> >>  void vmemmap_free(unsigned long start, unsigned long end,
> >>  		struct vmem_altmap *altmap)
> >>  {
> >> +#ifdef CONFIG_MEMORY_HOTPLUG
> >> +	/*
> >> +	 * FIXME: We should have called remove_pagetable(start, end, true).
> >> +	 * vmemmap and vmalloc virtual range might share intermediate kernel
> >> +	 * page table entries. Removing vmemmap range page table pages here
> >> +	 * can potentially conflict with a concurrent vmalloc() allocation.
> >> +	 *
> >> +	 * This is primarily because vmalloc() does not take init_mm ptl for
> >> +	 * the entire page table walk and it's modification. Instead it just
> >> +	 * takes the lock while allocating and installing page table pages
> >> +	 * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table
> >> +	 * entry via memory hot remove can cause vmalloc() kernel page table
> >> +	 * walk pointers to be invalid on the fly which can cause corruption
> >> +	 * or worst, a crash.
> >> +	 *
> >> +	 * So free_empty_tables() gets called where vmalloc and vmemmap range
> >> +	 * do not overlap at any intermediate level kernel page table entry.
> >> +	 */
> >> +	unmap_hotplug_range(start, end, true);
> >> +	if (!vmalloc_vmemmap_overlap)
> >> +		free_empty_tables(start, end);
> >> +#endif
> >>  }
> > 
> > So, I see the risk with overlapping and I guess for some kernel
> > configurations (PAGE_SIZE == 64K) we may not be able to avoid it. If we
> 
> Did not see 64K config options to have overlap, do you suspect they might ?
> After the 52 bit KVA series has been merged, following configurations have
> the vmalloc-vmemmap range overlap problem.
> 
> - 4K  page size with 48 bit VA space
> - 16K page size with 48 bit VA space

OK. I haven't checked, so it was just a guess that 64K has this problem
since the pgd entry coverage is fairly large.

> > can, that's great, otherwise could we rewrite the above functions to
> > handle floor and ceiling similar to free_pgd_range()? (I wonder how this
> > function works if you called it on init_mm and kernel address range). By
> 
> Hmm, never tried that. Are you wondering if this can be used directly ?
> There are two distinct elements which make it very specific to user page
> tables, mmu_gather based TLB tracking and mm->pgtable_bytes accounting
> with mm_dec_nr_pxx().

Ah, I missed the mm_dec_nr_*(). So I don't think it would work directly.
We could, however, use the same approach for kernel page tables.

> > having the vmemmap start/end information it avoids freeing partially
> > filled page table pages.
> 
> Did you mean page table pages which can partially overlap with vmalloc ?

Overlapping with the vmalloc range, not necessarily with a mapped
vmalloc area.

> The problem (race) is not because of the inability to deal with partially
> filled table. We can handle that correctly as explained below [1]. The
> problem is with inadequate kernel page table locking during vmalloc()
> which might be accessing intermediate kernel page table pointers which is
> being freed with free_empty_tables() concurrently. Hence we cannot free
> any page table page which can ever have entries from vmalloc() range.

The way you deal with the partially filled table in this patch is to
avoid freeing if there is a non-empty entry (!p*d_none()). This is what
causes the race with vmalloc. If you simply avoid freeing a pmd page,
for example, if the range floor/ceiling is not aligned to PUD_SIZE,
irrespective of whether the other entries are empty or not, you
shouldn't have this problem. You do free the pte page if the range is
aligned to PMD_SIZE but in this case it wouldn't overlap with the
vmalloc space. That's how free_pgd_range() works.

We may have some pgtable pages not freed at both ends of the range
(maximum 6 in total) but I don't really see this an issue. They could be
reused if something else gets mapped in that range.

> Though not completely sure, whether I really understood the suggestion above
> with respect to the floor-ceiling mechanism as in free_pgd_range(). Are you
> suggesting that we should only attempt to free up those vmemmap range page
> table pages which *definitely* could never overlap with vmalloc by working
> on a modified (i.e cut down with floor-ceiling while avoiding vmalloc range
> at each level) vmemmap range instead ?

You can ignore the overlap check altogether, only free the page tables
with floor/ceiling set to the start/size passed to arch_remove_memory()
and vmemmap_free().

> This can be one restrictive version of the function
> free_empty_tables() called in case there is an overlap. So we will
> maintain two versions for free_empty_tables(). Please correct me if
> any the above assumptions or understanding is wrong.

I'd rather have a single version of free_empty_tables(). As I said
above, the only downside is that a partially filled pgtable page would
not be freed even though the other entries are empty.

> But yes, with this we should be able to free up some possible empty page
> table pages which were being left out in the current proposal when overlap
> happens.
> 
> [1] Skipping partially filled page tables
> 
> All free_pXX_table() functions take care in avoiding freeing partially filled
> page table pages whether they represent or ever represented linear or vmemmap
> or vmalloc mapping in init_mm. They go over each individual entry in a given
> page table making sure that each of them checks as pXX_none() before freeing
> the entire page table page.

Yes but that's what's causing the race with a vmalloc trying to create
such entries.

> > Another question: could we do the page table and the actual vmemmap
> > pages freeing in a single pass (sorry if this has been discussed
> > before)?
> 
> We could and some initial versions (till V5) of the series had that in fact.
> Initially Mark Rutland had suggested to do this in two passes. Some extracts
> from the previous discussion.
> 
> https://lkml.org/lkml/2019/5/30/1159
> 
> -----------------------
> Looking at this some more, I don't think this is quite right, and tI
> think that structure of the free_*() and remove_*() functions makes this
> unnecessarily hard to follow. We should aim for this to be obviously
> correct.
> 
> The x86 code is the best template to follow here. As mentioned
> previously, I'm fairly certain it's not entirely correct (e.g. due to
> missing TLB maintenance), and we've already diverged a fair amount in
> fixing up obvious issues, so we shouldn't aim to mirror it.
> 
> I think that the structure of unmap_region() is closer to what we want
> here -- do one pass to unmap leaf entries (and freeing the associated
> memory if unmapping the vmemmap), then do a second pass cleaning up any
> empty tables.
> ----------------------
> 
> Apart from the fact that two passes over the page table is cleaner and gives
> us more granular and modular infrastructure to use for later purposes, it is
> also a necessity in dealing with vmalloc-vmemmap overlap. free_empty_tables()
> which is the second pass, can be skipped cleanly when overlap is detected.

I'm fine with two passes for unmap and pgtable free for the time being
and if they look fairly similar in a feature version, we can think of
merging them. But for now, stick with two passes. The unmapping one in
this patchset I think seems fine (though I haven't looked in detail).

There is also a race with ptdump that I haven't looked into.

-- 
Catalin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH V7 2/3] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  2019-09-03  9:45 ` [PATCH V7 2/3] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump Anshuman Khandual
@ 2019-09-15  2:35   ` Balbir Singh
  2019-09-18  9:12     ` Anshuman Khandual
  0 siblings, 1 reply; 19+ messages in thread
From: Balbir Singh @ 2019-09-15  2:35 UTC (permalink / raw)
  To: Anshuman Khandual, linux-mm, linux-kernel, linux-arm-kernel,
	akpm, catalin.marinas, will
  Cc: mark.rutland, mhocko, david, ira.weiny, steve.capper, mgorman,
	steven.price, broonie, cai, ard.biesheuvel, cpandya, arunks,
	dan.j.williams, Robin.Murphy, logang, valentin.schneider,
	suzuki.poulose, osalvador



On 3/9/19 7:45 pm, Anshuman Khandual wrote:
> The arm64 page table dump code can race with concurrent modification of the
> kernel page tables. When a leaf entries are modified concurrently, the dump
> code may log stale or inconsistent information for a VA range, but this is
> otherwise not harmful.
> 
> When intermediate levels of table are freed, the dump code will continue to
> use memory which has been freed and potentially reallocated for another
> purpose. In such cases, the dump code may dereference bogus addresses,
> leading to a number of potential problems.
> 
> Intermediate levels of table may by freed during memory hot-remove,
> which will be enabled by a subsequent patch. To avoid racing with
> this, take the memory hotplug lock when walking the kernel page table.
> 
> Acked-by: David Hildenbrand <david@redhat.com>
> Acked-by: Mark Rutland <mark.rutland@arm.com>
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---
>  arch/arm64/mm/ptdump_debugfs.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
> index 064163f25592..b5eebc8c4924 100644
> --- a/arch/arm64/mm/ptdump_debugfs.c
> +++ b/arch/arm64/mm/ptdump_debugfs.c
> @@ -1,5 +1,6 @@
>  // SPDX-License-Identifier: GPL-2.0
>  #include <linux/debugfs.h>
> +#include <linux/memory_hotplug.h>
>  #include <linux/seq_file.h>
>  
>  #include <asm/ptdump.h>
> @@ -7,7 +8,10 @@
>  static int ptdump_show(struct seq_file *m, void *v)
>  {
>  	struct ptdump_info *info = m->private;
> +
> +	get_online_mems();
>  	ptdump_walk_pgd(m, info);
> +	put_online_mems();

Looks sane, BTW, checking other arches they might have the same race.
Is there anything special about the arch?

Acked-by: Balbir Singh <bsingharora@gmail.com>


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH V7 1/3] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()
  2019-09-03  9:45 ` [PATCH V7 1/3] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory() Anshuman Khandual
  2019-09-04  8:16   ` David Hildenbrand
@ 2019-09-16  1:44   ` Balbir Singh
  2019-09-18  9:28     ` Anshuman Khandual
  1 sibling, 1 reply; 19+ messages in thread
From: Balbir Singh @ 2019-09-16  1:44 UTC (permalink / raw)
  To: Anshuman Khandual, linux-mm, linux-kernel, linux-arm-kernel,
	akpm, catalin.marinas, will
  Cc: mark.rutland, mhocko, david, ira.weiny, steve.capper, mgorman,
	steven.price, broonie, cai, ard.biesheuvel, cpandya, arunks,
	dan.j.williams, Robin.Murphy, logang, valentin.schneider,
	suzuki.poulose, osalvador



On 3/9/19 7:45 pm, Anshuman Khandual wrote:
> Memory hot remove uses get_nid_for_pfn() while tearing down linked sysfs

I could not find this path in the code, the only called for get_nid_for_pfn()
was register_mem_sect_under_node() when the system is under boot.

> entries between memory block and node. It first checks pfn validity with
> pfn_valid_within() before fetching nid. With CONFIG_HOLES_IN_ZONE config
> (arm64 has this enabled) pfn_valid_within() calls pfn_valid().
> 
> pfn_valid() is an arch implementation on arm64 (CONFIG_HAVE_ARCH_PFN_VALID)
> which scans all mapped memblock regions with memblock_is_map_memory(). This
> creates a problem in memory hot remove path which has already removed given
> memory range from memory block with memblock_[remove|free] before arriving
> at unregister_mem_sect_under_nodes(). Hence get_nid_for_pfn() returns -1
> skipping subsequent sysfs_remove_link() calls leaving node <-> memory block
> sysfs entries as is. Subsequent memory add operation hits BUG_ON() because
> of existing sysfs entries.
> 
> [   62.007176] NUMA: Unknown node for memory at 0x680000000, assuming node 0
> [   62.052517] ------------[ cut here ]------------

This seems like arm64 is not ready for probe_store() via drivers/base/memory.c/ode.c

> [   62.053211] kernel BUG at mm/memory_hotplug.c:1143!



> [   62.053868] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
> [   62.054589] Modules linked in:
> [   62.054999] CPU: 19 PID: 3275 Comm: bash Not tainted 5.1.0-rc2-00004-g28cea40b2683 #41
> [   62.056274] Hardware name: linux,dummy-virt (DT)
> [   62.057166] pstate: 40400005 (nZcv daif +PAN -UAO)
> [   62.058083] pc : add_memory_resource+0x1cc/0x1d8
> [   62.058961] lr : add_memory_resource+0x10c/0x1d8
> [   62.059842] sp : ffff0000168b3ce0
> [   62.060477] x29: ffff0000168b3ce0 x28: ffff8005db546c00
> [   62.061501] x27: 0000000000000000 x26: 0000000000000000
> [   62.062509] x25: ffff0000111ef000 x24: ffff0000111ef5d0
> [   62.063520] x23: 0000000000000000 x22: 00000006bfffffff
> [   62.064540] x21: 00000000ffffffef x20: 00000000006c0000
> [   62.065558] x19: 0000000000680000 x18: 0000000000000024
> [   62.066566] x17: 0000000000000000 x16: 0000000000000000
> [   62.067579] x15: ffffffffffffffff x14: ffff8005e412e890
> [   62.068588] x13: ffff8005d6b105d8 x12: 0000000000000000
> [   62.069610] x11: ffff8005d6b10490 x10: 0000000000000040
> [   62.070615] x9 : ffff8005e412e898 x8 : ffff8005e412e890
> [   62.071631] x7 : ffff8005d6b105d8 x6 : ffff8005db546c00
> [   62.072640] x5 : 0000000000000001 x4 : 0000000000000002
> [   62.073654] x3 : ffff8005d7049480 x2 : 0000000000000002
> [   62.074666] x1 : 0000000000000003 x0 : 00000000ffffffef
> [   62.075685] Process bash (pid: 3275, stack limit = 0x00000000d754280f)
> [   62.076930] Call trace:
> [   62.077411]  add_memory_resource+0x1cc/0x1d8
> [   62.078227]  __add_memory+0x70/0xa8
> [   62.078901]  probe_store+0xa4/0xc8
> [   62.079561]  dev_attr_store+0x18/0x28
> [   62.080270]  sysfs_kf_write+0x40/0x58
> [   62.080992]  kernfs_fop_write+0xcc/0x1d8
> [   62.081744]  __vfs_write+0x18/0x40
> [   62.082400]  vfs_write+0xa4/0x1b0
> [   62.083037]  ksys_write+0x5c/0xc0
> [   62.083681]  __arm64_sys_write+0x18/0x20
> [   62.084432]  el0_svc_handler+0x88/0x100
> [   62.085177]  el0_svc+0x8/0xc
> 
> Re-ordering memblock_[free|remove]() with arch_remove_memory() solves the
> problem on arm64 as pfn_valid() behaves correctly and returns positive
> as memblock for the address range still exists. arch_remove_memory()
> removes applicable memory sections from zone with __remove_pages() and
> tears down kernel linear mapping. Removing memblock regions afterwards
> is safe because there is no other memblock (bootmem) allocator user that
> late. So nobody is going to allocate from the removed range just to blow
> up later. Also nobody should be using the bootmem allocated range else
> we wouldn't allow to remove it. So reordering is indeed safe.
> 
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Acked-by: Mark Rutland <mark.rutland@arm.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---

Honestly, the issue is not clear from the changelog, largely
because I can't find the use of get_nid_for_pfn()  being used
in memory hotunplug. I can see why using pfn_valid() after
memblock_free/remove is bad on the architecture.

I think the checks to pfn_valid() can be avoided from the
remove paths if we did the following

memblock_isolate_regions()
for each isolate_region {
	memblock_free
	memblock_remove
	arch_memory_remove

	# ensure that __remove_memory can avoid calling pfn_valid
}

Having said that, your patch is easier and if your assumption
about not using the memblocks is valid (after arch_memory_remove())
then might be the least resistant way forward

Balbir Singh.


>  mm/memory_hotplug.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index c73f09913165..355c466e0621 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1770,13 +1770,13 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
>  
>  	/* remove memmap entry */
>  	firmware_map_remove(start, start + size, "System RAM");
> -	memblock_free(start, size);
> -	memblock_remove(start, size);
>  
>  	/* remove memory block devices before removing memory */
>  	remove_memory_block_devices(start, size);
>  
>  	arch_remove_memory(nid, start, size, NULL);
> +	memblock_free(start, size);
> +	memblock_remove(start, size);
>  	__release_memory_resource(start, size);
>  
>  	try_offline_node(nid);
> 

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH V7 3/3] arm64/mm: Enable memory hot remove
  2019-09-13 10:09       ` Catalin Marinas
@ 2019-09-17  4:36         ` Anshuman Khandual
  2019-09-17 15:08           ` Catalin Marinas
  0 siblings, 1 reply; 19+ messages in thread
From: Anshuman Khandual @ 2019-09-17  4:36 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: mark.rutland, mhocko, david, linux-mm, arunks, cpandya,
	ira.weiny, will, steven.price, valentin.schneider,
	suzuki.poulose, Robin.Murphy, broonie, cai, ard.biesheuvel,
	dan.j.williams, linux-arm-kernel, osalvador, steve.capper,
	logang, linux-kernel, akpm, mgorman



On 09/13/2019 03:39 PM, Catalin Marinas wrote:
> On Fri, Sep 13, 2019 at 11:28:01AM +0530, Anshuman Khandual wrote:
>> On 09/13/2019 01:45 AM, Catalin Marinas wrote:
>>> On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote:
>>>> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>>>  void vmemmap_free(unsigned long start, unsigned long end,
>>>>  		struct vmem_altmap *altmap)
>>>>  {
>>>> +#ifdef CONFIG_MEMORY_HOTPLUG
>>>> +	/*
>>>> +	 * FIXME: We should have called remove_pagetable(start, end, true).
>>>> +	 * vmemmap and vmalloc virtual range might share intermediate kernel
>>>> +	 * page table entries. Removing vmemmap range page table pages here
>>>> +	 * can potentially conflict with a concurrent vmalloc() allocation.
>>>> +	 *
>>>> +	 * This is primarily because vmalloc() does not take init_mm ptl for
>>>> +	 * the entire page table walk and it's modification. Instead it just
>>>> +	 * takes the lock while allocating and installing page table pages
>>>> +	 * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table
>>>> +	 * entry via memory hot remove can cause vmalloc() kernel page table
>>>> +	 * walk pointers to be invalid on the fly which can cause corruption
>>>> +	 * or worst, a crash.
>>>> +	 *
>>>> +	 * So free_empty_tables() gets called where vmalloc and vmemmap range
>>>> +	 * do not overlap at any intermediate level kernel page table entry.
>>>> +	 */
>>>> +	unmap_hotplug_range(start, end, true);
>>>> +	if (!vmalloc_vmemmap_overlap)
>>>> +		free_empty_tables(start, end);
>>>> +#endif
>>>>  }
>>>
>>> So, I see the risk with overlapping and I guess for some kernel
>>> configurations (PAGE_SIZE == 64K) we may not be able to avoid it. If we
>>
>> Did not see 64K config options to have overlap, do you suspect they might ?
>> After the 52 bit KVA series has been merged, following configurations have
>> the vmalloc-vmemmap range overlap problem.
>>
>> - 4K  page size with 48 bit VA space
>> - 16K page size with 48 bit VA space
> 
> OK. I haven't checked, so it was just a guess that 64K has this problem
> since the pgd entry coverage is fairly large.
> 
>>> can, that's great, otherwise could we rewrite the above functions to
>>> handle floor and ceiling similar to free_pgd_range()? (I wonder how this
>>> function works if you called it on init_mm and kernel address range). By
>>
>> Hmm, never tried that. Are you wondering if this can be used directly ?
>> There are two distinct elements which make it very specific to user page
>> tables, mmu_gather based TLB tracking and mm->pgtable_bytes accounting
>> with mm_dec_nr_pxx().
> 
> Ah, I missed the mm_dec_nr_*(). So I don't think it would work directly.
> We could, however, use the same approach for kernel page tables.

Right.

> 
>>> having the vmemmap start/end information it avoids freeing partially
>>> filled page table pages.
>>
>> Did you mean page table pages which can partially overlap with vmalloc ?
> 
> Overlapping with the vmalloc range, not necessarily with a mapped
> vmalloc area.
> 
>> The problem (race) is not because of the inability to deal with partially
>> filled table. We can handle that correctly as explained below [1]. The
>> problem is with inadequate kernel page table locking during vmalloc()
>> which might be accessing intermediate kernel page table pointers which is
>> being freed with free_empty_tables() concurrently. Hence we cannot free
>> any page table page which can ever have entries from vmalloc() range.
> 
> The way you deal with the partially filled table in this patch is to
> avoid freeing if there is a non-empty entry (!p*d_none()). This is what
> causes the race with vmalloc. If you simply avoid freeing a pmd page,
> for example, if the range floor/ceiling is not aligned to PUD_SIZE,
> irrespective of whether the other entries are empty or not, you
> shouldn't have this problem. You do free the pte page if the range is

Right, the floor/ceiling alignment check should abort the process before
scanning for non-empty entries.

> aligned to PMD_SIZE but in this case it wouldn't overlap with the
> vmalloc space. That's how free_pgd_range() works.

Like free_pgd_range(), page table pages can be freed at lower levels when
they have the right alignment wrt floor/ceiling. I will change all existing
free_pxx_table() functions to accommodate floor/ceiling alignment checks to
achieve this.

> 
> We may have some pgtable pages not freed at both ends of the range
> (maximum 6 in total) but I don't really see this an issue. They could be
> reused if something else gets mapped in that range.

I assume that the number 6 for maximum page possibility came from

(floor edge + ceiling edge) * (PTE table + PMD table + PUD table)


> 
>> Though not completely sure, whether I really understood the suggestion above
>> with respect to the floor-ceiling mechanism as in free_pgd_range(). Are you
>> suggesting that we should only attempt to free up those vmemmap range page
>> table pages which *definitely* could never overlap with vmalloc by working
>> on a modified (i.e cut down with floor-ceiling while avoiding vmalloc range
>> at each level) vmemmap range instead ?
> 
> You can ignore the overlap check altogether, only free the page tables
> with floor/ceiling set to the start/size passed to arch_remove_memory()
> and vmemmap_free().

Wondering if it will be better to use [VMEMMAP_START - VMEMMAP_END] and
[PAGE_OFFSET - PAGE_END] as floor/ceiling respectively with vmemmap_free()
and arch_remove_memory(). Not only it is safe to free all page table pages
which span over these maximum possible mapping range but also it reduces
the risk for alignment related wastage.

> 
>> This can be one restrictive version of the function
>> free_empty_tables() called in case there is an overlap. So we will
>> maintain two versions for free_empty_tables(). Please correct me if
>> any the above assumptions or understanding is wrong.
> 
> I'd rather have a single version of free_empty_tables(). As I said
> above, the only downside is that a partially filled pgtable page would
> not be freed even though the other entries are empty.

Sure. Also practically the limitation will be applicable only for vmemmap
mapping but not for linear mappings where the chances of overlap might be
negligible as it covers half kernel virtual address space.

> 
>> But yes, with this we should be able to free up some possible empty page
>> table pages which were being left out in the current proposal when overlap
>> happens.
>>
>> [1] Skipping partially filled page tables
>>
>> All free_pXX_table() functions take care in avoiding freeing partially filled
>> page table pages whether they represent or ever represented linear or vmemmap
>> or vmalloc mapping in init_mm. They go over each individual entry in a given
>> page table making sure that each of them checks as pXX_none() before freeing
>> the entire page table page.
> 
> Yes but that's what's causing the race with a vmalloc trying to create
> such entries.

free_pxx_table() needs to check for both floor-ceiling alignment before
making sure that the page table page is completely empty before freeing.

> 
>>> Another question: could we do the page table and the actual vmemmap
>>> pages freeing in a single pass (sorry if this has been discussed
>>> before)?
>>
>> We could and some initial versions (till V5) of the series had that in fact.
>> Initially Mark Rutland had suggested to do this in two passes. Some extracts
>> from the previous discussion.
>>
>> https://lkml.org/lkml/2019/5/30/1159
>>
>> -----------------------
>> Looking at this some more, I don't think this is quite right, and tI
>> think that structure of the free_*() and remove_*() functions makes this
>> unnecessarily hard to follow. We should aim for this to be obviously
>> correct.
>>
>> The x86 code is the best template to follow here. As mentioned
>> previously, I'm fairly certain it's not entirely correct (e.g. due to
>> missing TLB maintenance), and we've already diverged a fair amount in
>> fixing up obvious issues, so we shouldn't aim to mirror it.
>>
>> I think that the structure of unmap_region() is closer to what we want
>> here -- do one pass to unmap leaf entries (and freeing the associated
>> memory if unmapping the vmemmap), then do a second pass cleaning up any
>> empty tables.
>> ----------------------
>>
>> Apart from the fact that two passes over the page table is cleaner and gives
>> us more granular and modular infrastructure to use for later purposes, it is
>> also a necessity in dealing with vmalloc-vmemmap overlap. free_empty_tables()
>> which is the second pass, can be skipped cleanly when overlap is detected.
> 
> I'm fine with two passes for unmap and pgtable free for the time being
> and if they look fairly similar in a feature version, we can think of
> merging them. But for now, stick with two passes. The unmapping one in
> this patchset I think seems fine (though I haven't looked in detail).

Sure.

> 
> There is also a race with ptdump that I haven't looked into.

The second patch in the series deals with that.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH V7 3/3] arm64/mm: Enable memory hot remove
  2019-09-17  4:36         ` Anshuman Khandual
@ 2019-09-17 15:08           ` Catalin Marinas
  0 siblings, 0 replies; 19+ messages in thread
From: Catalin Marinas @ 2019-09-17 15:08 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: mark.rutland, mhocko, david, linux-mm, arunks, cpandya,
	ira.weiny, will, steven.price, valentin.schneider,
	suzuki.poulose, Robin.Murphy, broonie, cai, ard.biesheuvel,
	dan.j.williams, linux-arm-kernel, osalvador, steve.capper,
	logang, linux-kernel, akpm, mgorman

On Tue, Sep 17, 2019 at 10:06:11AM +0530, Anshuman Khandual wrote:
> On 09/13/2019 03:39 PM, Catalin Marinas wrote:
> > On Fri, Sep 13, 2019 at 11:28:01AM +0530, Anshuman Khandual wrote:
> >> The problem (race) is not because of the inability to deal with partially
> >> filled table. We can handle that correctly as explained below [1]. The
> >> problem is with inadequate kernel page table locking during vmalloc()
> >> which might be accessing intermediate kernel page table pointers which is
> >> being freed with free_empty_tables() concurrently. Hence we cannot free
> >> any page table page which can ever have entries from vmalloc() range.
> > 
> > The way you deal with the partially filled table in this patch is to
> > avoid freeing if there is a non-empty entry (!p*d_none()). This is what
> > causes the race with vmalloc. If you simply avoid freeing a pmd page,
> > for example, if the range floor/ceiling is not aligned to PUD_SIZE,
> > irrespective of whether the other entries are empty or not, you
> > shouldn't have this problem. You do free the pte page if the range is
[...]
> > We may have some pgtable pages not freed at both ends of the range
> > (maximum 6 in total) but I don't really see this an issue. They could be
> > reused if something else gets mapped in that range.
> 
> I assume that the number 6 for maximum page possibility came from
> 
> (floor edge + ceiling edge) * (PTE table + PMD table + PUD table)

Yes.

> >> Though not completely sure, whether I really understood the suggestion above
> >> with respect to the floor-ceiling mechanism as in free_pgd_range(). Are you
> >> suggesting that we should only attempt to free up those vmemmap range page
> >> table pages which *definitely* could never overlap with vmalloc by working
> >> on a modified (i.e cut down with floor-ceiling while avoiding vmalloc range
> >> at each level) vmemmap range instead ?
> > 
> > You can ignore the overlap check altogether, only free the page tables
> > with floor/ceiling set to the start/size passed to arch_remove_memory()
> > and vmemmap_free().
> 
> Wondering if it will be better to use [VMEMMAP_START - VMEMMAP_END] and
> [PAGE_OFFSET - PAGE_END] as floor/ceiling respectively with vmemmap_free()
> and arch_remove_memory(). Not only it is safe to free all page table pages
> which span over these maximum possible mapping range but also it reduces
> the risk for alignment related wastage.

That's indeed better. You pass the floor/ceiling as the enclosing range
and start/end as the actual range to unmap is. We avoid the potential
"leak" around the edges when falling within the floor/ceiling range (I
think that's close to what free_pgd_range() does).

> >> This can be one restrictive version of the function
> >> free_empty_tables() called in case there is an overlap. So we will
> >> maintain two versions for free_empty_tables(). Please correct me if
> >> any the above assumptions or understanding is wrong.
> > 
> > I'd rather have a single version of free_empty_tables(). As I said
> > above, the only downside is that a partially filled pgtable page would
> > not be freed even though the other entries are empty.
> 
> Sure. Also practically the limitation will be applicable only for vmemmap
> mapping but not for linear mappings where the chances of overlap might be
> negligible as it covers half kernel virtual address space.

If you have a common set of functions, it doesn't heart to pass the
correct floor/ceiling in both cases.

-- 
Catalin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH V7 2/3] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
  2019-09-15  2:35   ` Balbir Singh
@ 2019-09-18  9:12     ` Anshuman Khandual
  0 siblings, 0 replies; 19+ messages in thread
From: Anshuman Khandual @ 2019-09-18  9:12 UTC (permalink / raw)
  To: Balbir Singh, linux-mm, linux-kernel, linux-arm-kernel, akpm,
	catalin.marinas, will
  Cc: mark.rutland, mhocko, david, ira.weiny, steve.capper, mgorman,
	steven.price, broonie, cai, ard.biesheuvel, cpandya, arunks,
	dan.j.williams, Robin.Murphy, logang, valentin.schneider,
	suzuki.poulose, osalvador



On 09/15/2019 08:05 AM, Balbir Singh wrote:
> 
> 
> On 3/9/19 7:45 pm, Anshuman Khandual wrote:
>> The arm64 page table dump code can race with concurrent modification of the
>> kernel page tables. When a leaf entries are modified concurrently, the dump
>> code may log stale or inconsistent information for a VA range, but this is
>> otherwise not harmful.
>>
>> When intermediate levels of table are freed, the dump code will continue to
>> use memory which has been freed and potentially reallocated for another
>> purpose. In such cases, the dump code may dereference bogus addresses,
>> leading to a number of potential problems.
>>
>> Intermediate levels of table may by freed during memory hot-remove,
>> which will be enabled by a subsequent patch. To avoid racing with
>> this, take the memory hotplug lock when walking the kernel page table.
>>
>> Acked-by: David Hildenbrand <david@redhat.com>
>> Acked-by: Mark Rutland <mark.rutland@arm.com>
>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>> ---
>>  arch/arm64/mm/ptdump_debugfs.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
>> index 064163f25592..b5eebc8c4924 100644
>> --- a/arch/arm64/mm/ptdump_debugfs.c
>> +++ b/arch/arm64/mm/ptdump_debugfs.c
>> @@ -1,5 +1,6 @@
>>  // SPDX-License-Identifier: GPL-2.0
>>  #include <linux/debugfs.h>
>> +#include <linux/memory_hotplug.h>
>>  #include <linux/seq_file.h>
>>  
>>  #include <asm/ptdump.h>
>> @@ -7,7 +8,10 @@
>>  static int ptdump_show(struct seq_file *m, void *v)
>>  {
>>  	struct ptdump_info *info = m->private;
>> +
>> +	get_online_mems();
>>  	ptdump_walk_pgd(m, info);
>> +	put_online_mems();
> 
> Looks sane, BTW, checking other arches they might have the same race.

The problem can be present on other architectures which can dump kernel page
table during memory hot-remove operation where it actually frees up page table
pages. If there is no freeing involved the race condition here could cause
inconsistent or garbage information capture for a given VA range. Same is true
even for concurrent vmalloc() operations as well. But removal of page tables
pages can make it worse. Freeing page table pages during hot-remove is a platform
decision, so would be adding these locks while walking kernel page table during
ptdump.

> Is there anything special about the arch?

AFAICS, no.

> 
> Acked-by: Balbir Singh <bsingharora@gmail.com>
> 
> 

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH V7 1/3] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory()
  2019-09-16  1:44   ` Balbir Singh
@ 2019-09-18  9:28     ` Anshuman Khandual
  0 siblings, 0 replies; 19+ messages in thread
From: Anshuman Khandual @ 2019-09-18  9:28 UTC (permalink / raw)
  To: Balbir Singh, linux-mm, linux-kernel, linux-arm-kernel, akpm,
	catalin.marinas, will
  Cc: mark.rutland, mhocko, david, ira.weiny, steve.capper, mgorman,
	steven.price, broonie, cai, ard.biesheuvel, cpandya, arunks,
	dan.j.williams, Robin.Murphy, logang, valentin.schneider,
	suzuki.poulose, osalvador



On 09/16/2019 07:14 AM, Balbir Singh wrote:
> 
> 
> On 3/9/19 7:45 pm, Anshuman Khandual wrote:
>> Memory hot remove uses get_nid_for_pfn() while tearing down linked sysfs
> 
> I could not find this path in the code, the only called for get_nid_for_pfn()
> was register_mem_sect_under_node() when the system is under boot.
> 
>> entries between memory block and node. It first checks pfn validity with
>> pfn_valid_within() before fetching nid. With CONFIG_HOLES_IN_ZONE config
>> (arm64 has this enabled) pfn_valid_within() calls pfn_valid().
>>
>> pfn_valid() is an arch implementation on arm64 (CONFIG_HAVE_ARCH_PFN_VALID)
>> which scans all mapped memblock regions with memblock_is_map_memory(). This
>> creates a problem in memory hot remove path which has already removed given
>> memory range from memory block with memblock_[remove|free] before arriving
>> at unregister_mem_sect_under_nodes(). Hence get_nid_for_pfn() returns -1
>> skipping subsequent sysfs_remove_link() calls leaving node <-> memory block
>> sysfs entries as is. Subsequent memory add operation hits BUG_ON() because
>> of existing sysfs entries.
>>
>> [   62.007176] NUMA: Unknown node for memory at 0x680000000, assuming node 0
>> [   62.052517] ------------[ cut here ]------------
> 
> This seems like arm64 is not ready for probe_store() via drivers/base/memory.c/ode.c
> 
>> [   62.053211] kernel BUG at mm/memory_hotplug.c:1143!
> 
> 
> 
>> [   62.053868] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
>> [   62.054589] Modules linked in:
>> [   62.054999] CPU: 19 PID: 3275 Comm: bash Not tainted 5.1.0-rc2-00004-g28cea40b2683 #41
>> [   62.056274] Hardware name: linux,dummy-virt (DT)
>> [   62.057166] pstate: 40400005 (nZcv daif +PAN -UAO)
>> [   62.058083] pc : add_memory_resource+0x1cc/0x1d8
>> [   62.058961] lr : add_memory_resource+0x10c/0x1d8
>> [   62.059842] sp : ffff0000168b3ce0
>> [   62.060477] x29: ffff0000168b3ce0 x28: ffff8005db546c00
>> [   62.061501] x27: 0000000000000000 x26: 0000000000000000
>> [   62.062509] x25: ffff0000111ef000 x24: ffff0000111ef5d0
>> [   62.063520] x23: 0000000000000000 x22: 00000006bfffffff
>> [   62.064540] x21: 00000000ffffffef x20: 00000000006c0000
>> [   62.065558] x19: 0000000000680000 x18: 0000000000000024
>> [   62.066566] x17: 0000000000000000 x16: 0000000000000000
>> [   62.067579] x15: ffffffffffffffff x14: ffff8005e412e890
>> [   62.068588] x13: ffff8005d6b105d8 x12: 0000000000000000
>> [   62.069610] x11: ffff8005d6b10490 x10: 0000000000000040
>> [   62.070615] x9 : ffff8005e412e898 x8 : ffff8005e412e890
>> [   62.071631] x7 : ffff8005d6b105d8 x6 : ffff8005db546c00
>> [   62.072640] x5 : 0000000000000001 x4 : 0000000000000002
>> [   62.073654] x3 : ffff8005d7049480 x2 : 0000000000000002
>> [   62.074666] x1 : 0000000000000003 x0 : 00000000ffffffef
>> [   62.075685] Process bash (pid: 3275, stack limit = 0x00000000d754280f)
>> [   62.076930] Call trace:
>> [   62.077411]  add_memory_resource+0x1cc/0x1d8
>> [   62.078227]  __add_memory+0x70/0xa8
>> [   62.078901]  probe_store+0xa4/0xc8
>> [   62.079561]  dev_attr_store+0x18/0x28
>> [   62.080270]  sysfs_kf_write+0x40/0x58
>> [   62.080992]  kernfs_fop_write+0xcc/0x1d8
>> [   62.081744]  __vfs_write+0x18/0x40
>> [   62.082400]  vfs_write+0xa4/0x1b0
>> [   62.083037]  ksys_write+0x5c/0xc0
>> [   62.083681]  __arm64_sys_write+0x18/0x20
>> [   62.084432]  el0_svc_handler+0x88/0x100
>> [   62.085177]  el0_svc+0x8/0xc
>>
>> Re-ordering memblock_[free|remove]() with arch_remove_memory() solves the
>> problem on arm64 as pfn_valid() behaves correctly and returns positive
>> as memblock for the address range still exists. arch_remove_memory()
>> removes applicable memory sections from zone with __remove_pages() and
>> tears down kernel linear mapping. Removing memblock regions afterwards
>> is safe because there is no other memblock (bootmem) allocator user that
>> late. So nobody is going to allocate from the removed range just to blow
>> up later. Also nobody should be using the bootmem allocated range else
>> we wouldn't allow to remove it. So reordering is indeed safe.
>>
>> Reviewed-by: David Hildenbrand <david@redhat.com>
>> Reviewed-by: Oscar Salvador <osalvador@suse.de>
>> Acked-by: Mark Rutland <mark.rutland@arm.com>
>> Acked-by: Michal Hocko <mhocko@suse.com>
>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>> ---
> 
> Honestly, the issue is not clear from the changelog, largely
> because I can't find the use of get_nid_for_pfn()  being used
> in memory hotunplug. I can see why using pfn_valid() after
> memblock_free/remove is bad on the architecture.
> 
> I think the checks to pfn_valid() can be avoided from the
> remove paths if we did the following
> 
> memblock_isolate_regions()
> for each isolate_region {
> 	memblock_free
> 	memblock_remove
> 	arch_memory_remove
> 
> 	# ensure that __remove_memory can avoid calling pfn_valid
> }
> 
> Having said that, your patch is easier and if your assumption
> about not using the memblocks is valid (after arch_memory_remove())
> then might be the least resistant way forward

The context for this patch has changed a bit which now reflects in
it's current posting (https://patchwork.kernel.org/patch/11146361/)

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, back to index

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-03  9:45 [PATCH V7 0/3] arm64/mm: Enable memory hot remove Anshuman Khandual
2019-09-03  9:45 ` [PATCH V7 1/3] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory() Anshuman Khandual
2019-09-04  8:16   ` David Hildenbrand
2019-09-05  4:27     ` Anshuman Khandual
2019-09-16  1:44   ` Balbir Singh
2019-09-18  9:28     ` Anshuman Khandual
2019-09-03  9:45 ` [PATCH V7 2/3] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump Anshuman Khandual
2019-09-15  2:35   ` Balbir Singh
2019-09-18  9:12     ` Anshuman Khandual
2019-09-03  9:45 ` [PATCH V7 3/3] arm64/mm: Enable memory hot remove Anshuman Khandual
2019-09-10 16:17   ` Catalin Marinas
2019-09-11 10:31     ` David Hildenbrand
2019-09-12  4:28     ` Anshuman Khandual
2019-09-12  8:37       ` Anshuman Khandual
2019-09-12 20:15   ` Catalin Marinas
2019-09-13  5:58     ` Anshuman Khandual
2019-09-13 10:09       ` Catalin Marinas
2019-09-17  4:36         ` Anshuman Khandual
2019-09-17 15:08           ` Catalin Marinas

Linux-ARM-Kernel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-arm-kernel/0 linux-arm-kernel/git/0.git
	git clone --mirror https://lore.kernel.org/linux-arm-kernel/1 linux-arm-kernel/git/1.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-arm-kernel linux-arm-kernel/ https://lore.kernel.org/linux-arm-kernel \
		linux-arm-kernel@lists.infradead.org infradead-linux-arm-kernel@archiver.kernel.org
	public-inbox-index linux-arm-kernel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.infradead.lists.linux-arm-kernel


AGPL code for this site: git clone https://public-inbox.org/ public-inbox