* [PATCH V8 0/2] arm64/mm: Enable memory hot remove @ 2019-09-23 5:43 ` Anshuman Khandual 0 siblings, 0 replies; 20+ messages in thread From: Anshuman Khandual @ 2019-09-23 5:43 UTC (permalink / raw) To: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas, will Cc: mark.rutland, mhocko, david, cai, logang, cpandya, arunks, dan.j.williams, mgorman, osalvador, ard.biesheuvel, steve.capper, broonie, valentin.schneider, Robin.Murphy, steven.price, suzuki.poulose, ira.weiny This series enables memory hot remove on arm64 after fixing a possible arm64 platform specific kernel page table race condition. This series is based on linux-next (next-20190920). Concurrent vmalloc() and hot-remove conflict: As pointed out earlier on the v5 thread [2] there can be potential conflict between concurrent vmalloc() and memory hot-remove operation. The problem here is caused by inadequate locking in vmalloc() which protects installation of a page table page but not the walk or the leaf entry modification. Now free_empty_tables() and it's children functions take into account a maximum possible range on which it operates as a floor-ceiling boundary. This along with changed free_pxx_table() makes sure that no page table page is freed unless its fully within the maximum possible range as decided by the caller. Testing: Memory hot remove has been tested on arm64 for 4K, 16K, 64K page config options with all possible CONFIG_ARM64_VA_BITS and CONFIG_PGTABLE_LEVELS combinations. Changes in V8: - Dropped the first patch (memblock_[free|remove] reorder) from the series which is no longer needed for arm64 hot-remove enablement and was posted separately as (https://patchwork.kernel.org/patch/11146361/) - Dropped vmalloc-vmemmap detection and subsequent skipping of free_empty_tables() - Changed free_empty_[pxx]_tables() functions which now accepts a possible maximum floor-ceiling address range on which it operates. Also changed free_pxx_table() functions to check against required alignment as well as maximum floor-ceiling range as another prerequisite before freeing the page table page. - Dropped remove_pagetable(), instead call it's constituent functions directly Changes in V7: (https://lkml.org/lkml/2019/9/3/326) - vmalloc_vmemmap_overlap gets evaluated early during boot for a given config - free_empty_tables() gets conditionally called based on vmalloc_vmemmap_overlap Changes in V6: (https://lkml.org/lkml/2019/7/15/36) - Implemented most of the suggestions from Mark Rutland - Added <linux/memory_hotplug.h> in ptdump - remove_pagetable() now has two distinct passes over the kernel page table - First pass unmap_hotplug_range() removes leaf level entries at all level - Second pass free_empty_tables() removes empty page table pages - Kernel page table lock has been dropped completely - vmemmap_free() does not call freee_empty_tables() to avoid conflict with vmalloc() - All address range scanning are converted to do {} while() loop - Added 'unsigned long end' in __remove_pgd_mapping() - Callers need not provide starting pointer argument to free_[pte|pmd|pud]_table() - Drop the starting pointer argument from free_[pte|pmd|pud]_table() functions - Fetching pxxp[i] in free_[pte|pmd|pud]_table() is wrapped around in READ_ONCE() - free_[pte|pmd|pud]_table() now computes starting pointer inside the function - Fixed TLB handling while freeing huge page section mappings at PMD or PUD level - Added WARN_ON(!page) in free_hotplug_page_range() - Added WARN_ON(![pm|pud]_table(pud|pmd)) when there is no section mapping - [PATCH 1/3] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory() - Request earlier for separate merger (https://patchwork.kernel.org/patch/10986599/) - s/__remove_memory/try_remove_memory in the subject line - s/arch_remove_memory/memblock_[free|remove] in the subject line - A small change in the commit message as re-order happens now for memblock remove functions not for arch_remove_memory() Changes in V5: (https://lkml.org/lkml/2019/5/29/218) - Have some agreement [1] over using memory_hotplug_lock for arm64 ptdump - Change 7ba36eccb3f8 ("arm64/mm: Inhibit huge-vmap with ptdump") already merged - Dropped the above patch from this series - Fixed indentation problem in arch_[add|remove]_memory() as per David - Collected all new Acked-by tags Changes in V4: (https://lkml.org/lkml/2019/5/20/19) - Implemented most of the suggestions from Mark Rutland - Interchanged patch [PATCH 2/4] <---> [PATCH 3/4] and updated commit message - Moved CONFIG_PGTABLE_LEVELS inside free_[pud|pmd]_table() - Used READ_ONCE() in missing instances while accessing page table entries - s/p???_present()/p???_none() for checking valid kernel page table entries - WARN_ON() when an entry is !p???_none() and !p???_present() at the same time - Updated memory hot-remove commit message with additional details as suggested - Rebased the series on 5.2-rc1 with hotplug changes from David and Michal Hocko - Collected all new Acked-by tags Changes in V3: (https://lkml.org/lkml/2019/5/14/197) - Implemented most of the suggestions from Mark Rutland for remove_pagetable() - Fixed applicable PGTABLE_LEVEL wrappers around pgtable page freeing functions - Replaced 'direct' with 'sparse_vmap' in remove_pagetable() with inverted polarity - Changed pointer names ('p' at end) and removed tmp from iterations - Perform intermediate TLB invalidation while clearing pgtable entries - Dropped flush_tlb_kernel_range() in remove_pagetable() - Added flush_tlb_kernel_range() in remove_pte_table() instead - Renamed page freeing functions for pgtable page and mapped pages - Used page range size instead of order while freeing mapped or pgtable pages - Removed all PageReserved() handling while freeing mapped or pgtable pages - Replaced XXX_index() with XXX_offset() while walking the kernel page table - Used READ_ONCE() while fetching individual pgtable entries - Taken overall init_mm.page_table_lock instead of just while changing an entry - Dropped previously added [pmd|pud]_index() which are not required anymore - Added a new patch to protect kernel page table race condition for ptdump - Added a new patch from Mark Rutland to prevent huge-vmap with ptdump Changes in V2: (https://lkml.org/lkml/2019/4/14/5) - Added all received review and ack tags - Split the series from ZONE_DEVICE enablement for better review - Moved memblock re-order patch to the front as per Robin Murphy - Updated commit message on memblock re-order patch per Michal Hocko - Dropped [pmd|pud]_large() definitions - Used existing [pmd|pud]_sect() instead of earlier [pmd|pud]_large() - Removed __meminit and __ref tags as per Oscar Salvador - Dropped unnecessary 'ret' init in arch_add_memory() per Robin Murphy - Skipped calling into pgtable_page_dtor() for linear mapping page table pages and updated all relevant functions Changes in V1: (https://lkml.org/lkml/2019/4/3/28) References: [1] https://lkml.org/lkml/2019/5/28/584 [2] https://lkml.org/lkml/2019/6/11/709 Anshuman Khandual (2): arm64/mm: Hold memory hotplug lock while walking for kernel page table dump arm64/mm: Enable memory hot remove arch/arm64/Kconfig | 3 + arch/arm64/include/asm/memory.h | 1 + arch/arm64/mm/mmu.c | 305 ++++++++++++++++++++++++++++++++++++++-- arch/arm64/mm/ptdump_debugfs.c | 4 + 4 files changed, 304 insertions(+), 9 deletions(-) -- 2.7.4 ^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH V8 0/2] arm64/mm: Enable memory hot remove @ 2019-09-23 5:43 ` Anshuman Khandual 0 siblings, 0 replies; 20+ messages in thread From: Anshuman Khandual @ 2019-09-23 5:43 UTC (permalink / raw) To: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas, will Cc: mark.rutland, mhocko, ira.weiny, david, mgorman, steve.capper, Robin.Murphy, steven.price, broonie, cai, ard.biesheuvel, cpandya, arunks, dan.j.williams, logang, valentin.schneider, suzuki.poulose, osalvador This series enables memory hot remove on arm64 after fixing a possible arm64 platform specific kernel page table race condition. This series is based on linux-next (next-20190920). Concurrent vmalloc() and hot-remove conflict: As pointed out earlier on the v5 thread [2] there can be potential conflict between concurrent vmalloc() and memory hot-remove operation. The problem here is caused by inadequate locking in vmalloc() which protects installation of a page table page but not the walk or the leaf entry modification. Now free_empty_tables() and it's children functions take into account a maximum possible range on which it operates as a floor-ceiling boundary. This along with changed free_pxx_table() makes sure that no page table page is freed unless its fully within the maximum possible range as decided by the caller. Testing: Memory hot remove has been tested on arm64 for 4K, 16K, 64K page config options with all possible CONFIG_ARM64_VA_BITS and CONFIG_PGTABLE_LEVELS combinations. Changes in V8: - Dropped the first patch (memblock_[free|remove] reorder) from the series which is no longer needed for arm64 hot-remove enablement and was posted separately as (https://patchwork.kernel.org/patch/11146361/) - Dropped vmalloc-vmemmap detection and subsequent skipping of free_empty_tables() - Changed free_empty_[pxx]_tables() functions which now accepts a possible maximum floor-ceiling address range on which it operates. Also changed free_pxx_table() functions to check against required alignment as well as maximum floor-ceiling range as another prerequisite before freeing the page table page. - Dropped remove_pagetable(), instead call it's constituent functions directly Changes in V7: (https://lkml.org/lkml/2019/9/3/326) - vmalloc_vmemmap_overlap gets evaluated early during boot for a given config - free_empty_tables() gets conditionally called based on vmalloc_vmemmap_overlap Changes in V6: (https://lkml.org/lkml/2019/7/15/36) - Implemented most of the suggestions from Mark Rutland - Added <linux/memory_hotplug.h> in ptdump - remove_pagetable() now has two distinct passes over the kernel page table - First pass unmap_hotplug_range() removes leaf level entries at all level - Second pass free_empty_tables() removes empty page table pages - Kernel page table lock has been dropped completely - vmemmap_free() does not call freee_empty_tables() to avoid conflict with vmalloc() - All address range scanning are converted to do {} while() loop - Added 'unsigned long end' in __remove_pgd_mapping() - Callers need not provide starting pointer argument to free_[pte|pmd|pud]_table() - Drop the starting pointer argument from free_[pte|pmd|pud]_table() functions - Fetching pxxp[i] in free_[pte|pmd|pud]_table() is wrapped around in READ_ONCE() - free_[pte|pmd|pud]_table() now computes starting pointer inside the function - Fixed TLB handling while freeing huge page section mappings at PMD or PUD level - Added WARN_ON(!page) in free_hotplug_page_range() - Added WARN_ON(![pm|pud]_table(pud|pmd)) when there is no section mapping - [PATCH 1/3] mm/hotplug: Reorder memblock_[free|remove]() calls in try_remove_memory() - Request earlier for separate merger (https://patchwork.kernel.org/patch/10986599/) - s/__remove_memory/try_remove_memory in the subject line - s/arch_remove_memory/memblock_[free|remove] in the subject line - A small change in the commit message as re-order happens now for memblock remove functions not for arch_remove_memory() Changes in V5: (https://lkml.org/lkml/2019/5/29/218) - Have some agreement [1] over using memory_hotplug_lock for arm64 ptdump - Change 7ba36eccb3f8 ("arm64/mm: Inhibit huge-vmap with ptdump") already merged - Dropped the above patch from this series - Fixed indentation problem in arch_[add|remove]_memory() as per David - Collected all new Acked-by tags Changes in V4: (https://lkml.org/lkml/2019/5/20/19) - Implemented most of the suggestions from Mark Rutland - Interchanged patch [PATCH 2/4] <---> [PATCH 3/4] and updated commit message - Moved CONFIG_PGTABLE_LEVELS inside free_[pud|pmd]_table() - Used READ_ONCE() in missing instances while accessing page table entries - s/p???_present()/p???_none() for checking valid kernel page table entries - WARN_ON() when an entry is !p???_none() and !p???_present() at the same time - Updated memory hot-remove commit message with additional details as suggested - Rebased the series on 5.2-rc1 with hotplug changes from David and Michal Hocko - Collected all new Acked-by tags Changes in V3: (https://lkml.org/lkml/2019/5/14/197) - Implemented most of the suggestions from Mark Rutland for remove_pagetable() - Fixed applicable PGTABLE_LEVEL wrappers around pgtable page freeing functions - Replaced 'direct' with 'sparse_vmap' in remove_pagetable() with inverted polarity - Changed pointer names ('p' at end) and removed tmp from iterations - Perform intermediate TLB invalidation while clearing pgtable entries - Dropped flush_tlb_kernel_range() in remove_pagetable() - Added flush_tlb_kernel_range() in remove_pte_table() instead - Renamed page freeing functions for pgtable page and mapped pages - Used page range size instead of order while freeing mapped or pgtable pages - Removed all PageReserved() handling while freeing mapped or pgtable pages - Replaced XXX_index() with XXX_offset() while walking the kernel page table - Used READ_ONCE() while fetching individual pgtable entries - Taken overall init_mm.page_table_lock instead of just while changing an entry - Dropped previously added [pmd|pud]_index() which are not required anymore - Added a new patch to protect kernel page table race condition for ptdump - Added a new patch from Mark Rutland to prevent huge-vmap with ptdump Changes in V2: (https://lkml.org/lkml/2019/4/14/5) - Added all received review and ack tags - Split the series from ZONE_DEVICE enablement for better review - Moved memblock re-order patch to the front as per Robin Murphy - Updated commit message on memblock re-order patch per Michal Hocko - Dropped [pmd|pud]_large() definitions - Used existing [pmd|pud]_sect() instead of earlier [pmd|pud]_large() - Removed __meminit and __ref tags as per Oscar Salvador - Dropped unnecessary 'ret' init in arch_add_memory() per Robin Murphy - Skipped calling into pgtable_page_dtor() for linear mapping page table pages and updated all relevant functions Changes in V1: (https://lkml.org/lkml/2019/4/3/28) References: [1] https://lkml.org/lkml/2019/5/28/584 [2] https://lkml.org/lkml/2019/6/11/709 Anshuman Khandual (2): arm64/mm: Hold memory hotplug lock while walking for kernel page table dump arm64/mm: Enable memory hot remove arch/arm64/Kconfig | 3 + arch/arm64/include/asm/memory.h | 1 + arch/arm64/mm/mmu.c | 305 ++++++++++++++++++++++++++++++++++++++-- arch/arm64/mm/ptdump_debugfs.c | 4 + 4 files changed, 304 insertions(+), 9 deletions(-) -- 2.7.4 _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH V8 1/2] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump 2019-09-23 5:43 ` Anshuman Khandual @ 2019-09-23 5:43 ` Anshuman Khandual -1 siblings, 0 replies; 20+ messages in thread From: Anshuman Khandual @ 2019-09-23 5:43 UTC (permalink / raw) To: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas, will Cc: mark.rutland, mhocko, david, cai, logang, cpandya, arunks, dan.j.williams, mgorman, osalvador, ard.biesheuvel, steve.capper, broonie, valentin.schneider, Robin.Murphy, steven.price, suzuki.poulose, ira.weiny The arm64 page table dump code can race with concurrent modification of the kernel page tables. When a leaf entries are modified concurrently, the dump code may log stale or inconsistent information for a VA range, but this is otherwise not harmful. When intermediate levels of table are freed, the dump code will continue to use memory which has been freed and potentially reallocated for another purpose. In such cases, the dump code may dereference bogus addresses, leading to a number of potential problems. Intermediate levels of table may by freed during memory hot-remove, which will be enabled by a subsequent patch. To avoid racing with this, take the memory hotplug lock when walking the kernel page table. Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> --- arch/arm64/mm/ptdump_debugfs.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c index 064163f..b5eebc8 100644 --- a/arch/arm64/mm/ptdump_debugfs.c +++ b/arch/arm64/mm/ptdump_debugfs.c @@ -1,5 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 #include <linux/debugfs.h> +#include <linux/memory_hotplug.h> #include <linux/seq_file.h> #include <asm/ptdump.h> @@ -7,7 +8,10 @@ static int ptdump_show(struct seq_file *m, void *v) { struct ptdump_info *info = m->private; + + get_online_mems(); ptdump_walk_pgd(m, info); + put_online_mems(); return 0; } DEFINE_SHOW_ATTRIBUTE(ptdump); -- 2.7.4 ^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH V8 1/2] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump @ 2019-09-23 5:43 ` Anshuman Khandual 0 siblings, 0 replies; 20+ messages in thread From: Anshuman Khandual @ 2019-09-23 5:43 UTC (permalink / raw) To: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas, will Cc: mark.rutland, mhocko, ira.weiny, david, mgorman, steve.capper, Robin.Murphy, steven.price, broonie, cai, ard.biesheuvel, cpandya, arunks, dan.j.williams, logang, valentin.schneider, suzuki.poulose, osalvador The arm64 page table dump code can race with concurrent modification of the kernel page tables. When a leaf entries are modified concurrently, the dump code may log stale or inconsistent information for a VA range, but this is otherwise not harmful. When intermediate levels of table are freed, the dump code will continue to use memory which has been freed and potentially reallocated for another purpose. In such cases, the dump code may dereference bogus addresses, leading to a number of potential problems. Intermediate levels of table may by freed during memory hot-remove, which will be enabled by a subsequent patch. To avoid racing with this, take the memory hotplug lock when walking the kernel page table. Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> --- arch/arm64/mm/ptdump_debugfs.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c index 064163f..b5eebc8 100644 --- a/arch/arm64/mm/ptdump_debugfs.c +++ b/arch/arm64/mm/ptdump_debugfs.c @@ -1,5 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 #include <linux/debugfs.h> +#include <linux/memory_hotplug.h> #include <linux/seq_file.h> #include <asm/ptdump.h> @@ -7,7 +8,10 @@ static int ptdump_show(struct seq_file *m, void *v) { struct ptdump_info *info = m->private; + + get_online_mems(); ptdump_walk_pgd(m, info); + put_online_mems(); return 0; } DEFINE_SHOW_ATTRIBUTE(ptdump); -- 2.7.4 _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH V8 2/2] arm64/mm: Enable memory hot remove 2019-09-23 5:43 ` Anshuman Khandual @ 2019-09-23 5:43 ` Anshuman Khandual -1 siblings, 0 replies; 20+ messages in thread From: Anshuman Khandual @ 2019-09-23 5:43 UTC (permalink / raw) To: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas, will Cc: mark.rutland, mhocko, david, cai, logang, cpandya, arunks, dan.j.williams, mgorman, osalvador, ard.biesheuvel, steve.capper, broonie, valentin.schneider, Robin.Murphy, steven.price, suzuki.poulose, ira.weiny The arch code for hot-remove must tear down portions of the linear map and vmemmap corresponding to memory being removed. In both cases the page tables mapping these regions must be freed, and when sparse vmemmap is in use the memory backing the vmemmap must also be freed. This patch adds unmap_hotplug_range() and free_empty_tables() helpers which can be used to tear down either region and calls it from vmemmap_free() and ___remove_pgd_mapping(). The sparse_vmap argument determines whether the backing memory will be freed. It makes two distinct passes over the kernel page table. In the first pass with unmap_hotplug_range() it unmaps, invalidates applicable TLB cache and frees backing memory if required (vmemmap) for each mapped leaf entry. In the second pass with free_empty_tables() it looks for empty page table sections whose page table page can be unmapped, TLB invalidated and freed. While freeing intermediate level page table pages bail out if any of its entries are still valid. This can happen for partially filled kernel page table either from a previously attempted failed memory hot add or while removing an address range which does not span the entire page table page range. The vmemmap region may share levels of table with the vmalloc region. There can be conflicts between hot remove freeing page table pages with a concurrent vmalloc() walking the kernel page table. This conflict can not just be solved by taking the init_mm ptl because of existing locking scheme in vmalloc(). So free_empty_tables() implements a floor and ceiling method which is borrowed from user page table tear with free_pgd_range() which skips freeing page table pages if intermediate address range is not aligned or maximum floor-ceiling might not own the entire page table page. While here update arch_add_memory() to handle __add_pages() failures by just unmapping recently added kernel linear mapping. Now enable memory hot remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE. This implementation is overall inspired from kernel page table tear down procedure on X86 architecture and user page table tear down method. Acked-by: Steve Capper <steve.capper@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> --- arch/arm64/Kconfig | 3 + arch/arm64/include/asm/memory.h | 1 + arch/arm64/mm/mmu.c | 305 ++++++++++++++++++++++++++++++++++++++-- 3 files changed, 300 insertions(+), 9 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 41a9b42..511249f 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -274,6 +274,9 @@ config ZONE_DMA32 config ARCH_ENABLE_MEMORY_HOTPLUG def_bool y +config ARCH_ENABLE_MEMORY_HOTREMOVE + def_bool y + config SMP def_bool y diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h index b61b50b..615dcd0 100644 --- a/arch/arm64/include/asm/memory.h +++ b/arch/arm64/include/asm/memory.h @@ -54,6 +54,7 @@ #define MODULES_VADDR (BPF_JIT_REGION_END) #define MODULES_VSIZE (SZ_128M) #define VMEMMAP_START (-VMEMMAP_SIZE - SZ_2M) +#define VMEMMAP_END (VMEMMAP_START + VMEMMAP_SIZE) #define PCI_IO_END (VMEMMAP_START - SZ_2M) #define PCI_IO_START (PCI_IO_END - PCI_IO_SIZE) #define FIXADDR_TOP (PCI_IO_START - SZ_2M) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 60c929f..6178837 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -725,6 +725,277 @@ int kern_addr_valid(unsigned long addr) return pfn_valid(pte_pfn(pte)); } + +#ifdef CONFIG_MEMORY_HOTPLUG +static void free_hotplug_page_range(struct page *page, size_t size) +{ + WARN_ON(!page || PageReserved(page)); + free_pages((unsigned long)page_address(page), get_order(size)); +} + +static void free_hotplug_pgtable_page(struct page *page) +{ + free_hotplug_page_range(page, PAGE_SIZE); +} + +static bool pgtable_range_aligned(unsigned long start, unsigned long end, + unsigned long floor, unsigned long ceiling, + unsigned long mask) +{ + start &= mask; + if (start < floor) + return false; + + if (ceiling) { + ceiling &= mask; + if (!ceiling) + return false; + } + + if (end - 1 > ceiling - 1) + return false; + return true; +} + +static void free_pte_table(pmd_t *pmdp, unsigned long addr, unsigned long end, + unsigned long floor, unsigned long ceiling) +{ + struct page *page; + pte_t *ptep; + int i; + + if (!pgtable_range_aligned(addr, end, floor, ceiling, PMD_MASK)) + return; + + ptep = pte_offset_kernel(pmdp, 0UL); + for (i = 0; i < PTRS_PER_PTE; i++) { + if (!pte_none(READ_ONCE(ptep[i]))) + return; + } + + page = pmd_page(READ_ONCE(*pmdp)); + pmd_clear(pmdp); + __flush_tlb_kernel_pgtable(addr); + free_hotplug_pgtable_page(page); +} + +static void free_pmd_table(pud_t *pudp, unsigned long addr, unsigned long end, + unsigned long floor, unsigned long ceiling) +{ + struct page *page; + pmd_t *pmdp; + int i; + + if (CONFIG_PGTABLE_LEVELS <= 2) + return; + + if (!pgtable_range_aligned(addr, end, floor, ceiling, PUD_MASK)) + return; + + pmdp = pmd_offset(pudp, 0UL); + for (i = 0; i < PTRS_PER_PMD; i++) { + if (!pmd_none(READ_ONCE(pmdp[i]))) + return; + } + + page = pud_page(READ_ONCE(*pudp)); + pud_clear(pudp); + __flush_tlb_kernel_pgtable(addr); + free_hotplug_pgtable_page(page); +} + +static void free_pud_table(pgd_t *pgdp, unsigned long addr, unsigned long end, + unsigned long floor, unsigned long ceiling) +{ + struct page *page; + pud_t *pudp; + int i; + + if (CONFIG_PGTABLE_LEVELS <= 3) + return; + + if (!pgtable_range_aligned(addr, end, floor, ceiling, PGDIR_MASK)) + return; + + pudp = pud_offset(pgdp, 0UL); + for (i = 0; i < PTRS_PER_PUD; i++) { + if (!pud_none(READ_ONCE(pudp[i]))) + return; + } + + page = pgd_page(READ_ONCE(*pgdp)); + pgd_clear(pgdp); + __flush_tlb_kernel_pgtable(addr); + free_hotplug_pgtable_page(page); +} + +static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr, + unsigned long end, bool sparse_vmap) +{ + struct page *page; + pte_t *ptep, pte; + + do { + ptep = pte_offset_kernel(pmdp, addr); + pte = READ_ONCE(*ptep); + if (pte_none(pte)) + continue; + + WARN_ON(!pte_present(pte)); + page = sparse_vmap ? pte_page(pte) : NULL; + pte_clear(&init_mm, addr, ptep); + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); + if (sparse_vmap) + free_hotplug_page_range(page, PAGE_SIZE); + } while (addr += PAGE_SIZE, addr < end); +} + +static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr, + unsigned long end, bool sparse_vmap) +{ + unsigned long next; + struct page *page; + pmd_t *pmdp, pmd; + + do { + next = pmd_addr_end(addr, end); + pmdp = pmd_offset(pudp, addr); + pmd = READ_ONCE(*pmdp); + if (pmd_none(pmd)) + continue; + + WARN_ON(!pmd_present(pmd)); + if (pmd_sect(pmd)) { + page = sparse_vmap ? pmd_page(pmd) : NULL; + pmd_clear(pmdp); + flush_tlb_kernel_range(addr, next); + if (sparse_vmap) + free_hotplug_page_range(page, PMD_SIZE); + continue; + } + WARN_ON(!pmd_table(pmd)); + unmap_hotplug_pte_range(pmdp, addr, next, sparse_vmap); + } while (addr = next, addr < end); +} + +static void unmap_hotplug_pud_range(pgd_t *pgdp, unsigned long addr, + unsigned long end, bool sparse_vmap) +{ + unsigned long next; + struct page *page; + pud_t *pudp, pud; + + do { + next = pud_addr_end(addr, end); + pudp = pud_offset(pgdp, addr); + pud = READ_ONCE(*pudp); + if (pud_none(pud)) + continue; + + WARN_ON(!pud_present(pud)); + if (pud_sect(pud)) { + page = sparse_vmap ? pud_page(pud) : NULL; + pud_clear(pudp); + flush_tlb_kernel_range(addr, next); + if (sparse_vmap) + free_hotplug_page_range(page, PUD_SIZE); + continue; + } + WARN_ON(!pud_table(pud)); + unmap_hotplug_pmd_range(pudp, addr, next, sparse_vmap); + } while (addr = next, addr < end); +} + +static void unmap_hotplug_range(unsigned long addr, unsigned long end, + bool sparse_vmap) +{ + unsigned long next; + pgd_t *pgdp, pgd; + + do { + next = pgd_addr_end(addr, end); + pgdp = pgd_offset_k(addr); + pgd = READ_ONCE(*pgdp); + if (pgd_none(pgd)) + continue; + + WARN_ON(!pgd_present(pgd)); + unmap_hotplug_pud_range(pgdp, addr, next, sparse_vmap); + } while (addr = next, addr < end); +} + +static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr, + unsigned long end) +{ + pte_t *ptep, pte; + + do { + ptep = pte_offset_kernel(pmdp, addr); + pte = READ_ONCE(*ptep); + WARN_ON(!pte_none(pte)); + } while (addr += PAGE_SIZE, addr < end); +} + +static void free_empty_pmd_table(pud_t *pudp, unsigned long addr, + unsigned long end, unsigned long floor, + unsigned long ceiling) +{ + unsigned long next; + pmd_t *pmdp, pmd; + + do { + next = pmd_addr_end(addr, end); + pmdp = pmd_offset(pudp, addr); + pmd = READ_ONCE(*pmdp); + if (pmd_none(pmd)) + continue; + + WARN_ON(!pmd_present(pmd) || !pmd_table(pmd) || pmd_sect(pmd)); + free_empty_pte_table(pmdp, addr, next); + free_pte_table(pmdp, addr, next, floor, ceiling); + } while (addr = next, addr < end); +} + +static void free_empty_pud_table(pgd_t *pgdp, unsigned long addr, + unsigned long end, unsigned long floor, + unsigned long ceiling) +{ + unsigned long next; + pud_t *pudp, pud; + + do { + next = pud_addr_end(addr, end); + pudp = pud_offset(pgdp, addr); + pud = READ_ONCE(*pudp); + if (pud_none(pud)) + continue; + + WARN_ON(!pud_present(pud) || !pud_table(pud) || pud_sect(pud)); + free_empty_pmd_table(pudp, addr, next, floor, ceiling); + free_pmd_table(pudp, addr, next, floor, ceiling); + } while (addr = next, addr < end); +} + +static void free_empty_tables(unsigned long addr, unsigned long end, + unsigned long floor, unsigned long ceiling) +{ + unsigned long next; + pgd_t *pgdp, pgd; + + do { + next = pgd_addr_end(addr, end); + pgdp = pgd_offset_k(addr); + pgd = READ_ONCE(*pgdp); + if (pgd_none(pgd)) + continue; + + WARN_ON(!pgd_present(pgd)); + free_empty_pud_table(pgdp, addr, next, floor, ceiling); + free_pud_table(pgdp, addr, next, floor, ceiling); + } while (addr = next, addr < end); +} +#endif + #ifdef CONFIG_SPARSEMEM_VMEMMAP #if !ARM64_SWAPPER_USES_SECTION_MAPS int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, @@ -772,6 +1043,12 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, void vmemmap_free(unsigned long start, unsigned long end, struct vmem_altmap *altmap) { +#ifdef CONFIG_MEMORY_HOTPLUG + WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END)); + + unmap_hotplug_range(start, end, true); + free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END); +#endif } #endif /* CONFIG_SPARSEMEM_VMEMMAP */ @@ -1050,10 +1327,21 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr) } #ifdef CONFIG_MEMORY_HOTPLUG +static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size) +{ + unsigned long end = start + size; + + WARN_ON(pgdir != init_mm.pgd); + WARN_ON((start < PAGE_OFFSET) || (end > PAGE_END)); + + unmap_hotplug_range(start, end, false); + free_empty_tables(start, end, PAGE_OFFSET, PAGE_END); +} + int arch_add_memory(int nid, u64 start, u64 size, struct mhp_restrictions *restrictions) { - int flags = 0; + int ret, flags = 0; if (rodata_full || debug_pagealloc_enabled()) flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; @@ -1061,9 +1349,14 @@ int arch_add_memory(int nid, u64 start, u64 size, __create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start), size, PAGE_KERNEL, __pgd_pgtable_alloc, flags); - return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT, + ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT, restrictions); + if (ret) + __remove_pgd_mapping(swapper_pg_dir, + __phys_to_virt(start), size); + return ret; } + void arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap) { @@ -1071,14 +1364,8 @@ void arch_remove_memory(int nid, u64 start, u64 size, unsigned long nr_pages = size >> PAGE_SHIFT; struct zone *zone; - /* - * FIXME: Cleanup page tables (also in arch_add_memory() in case - * adding fails). Until then, this function should only be used - * during memory hotplug (adding memory), not for memory - * unplug. ARCH_ENABLE_MEMORY_HOTREMOVE must not be - * unlocked yet. - */ zone = page_zone(pfn_to_page(start_pfn)); __remove_pages(zone, start_pfn, nr_pages, altmap); + __remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size); } #endif -- 2.7.4 ^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH V8 2/2] arm64/mm: Enable memory hot remove @ 2019-09-23 5:43 ` Anshuman Khandual 0 siblings, 0 replies; 20+ messages in thread From: Anshuman Khandual @ 2019-09-23 5:43 UTC (permalink / raw) To: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas, will Cc: mark.rutland, mhocko, ira.weiny, david, mgorman, steve.capper, Robin.Murphy, steven.price, broonie, cai, ard.biesheuvel, cpandya, arunks, dan.j.williams, logang, valentin.schneider, suzuki.poulose, osalvador The arch code for hot-remove must tear down portions of the linear map and vmemmap corresponding to memory being removed. In both cases the page tables mapping these regions must be freed, and when sparse vmemmap is in use the memory backing the vmemmap must also be freed. This patch adds unmap_hotplug_range() and free_empty_tables() helpers which can be used to tear down either region and calls it from vmemmap_free() and ___remove_pgd_mapping(). The sparse_vmap argument determines whether the backing memory will be freed. It makes two distinct passes over the kernel page table. In the first pass with unmap_hotplug_range() it unmaps, invalidates applicable TLB cache and frees backing memory if required (vmemmap) for each mapped leaf entry. In the second pass with free_empty_tables() it looks for empty page table sections whose page table page can be unmapped, TLB invalidated and freed. While freeing intermediate level page table pages bail out if any of its entries are still valid. This can happen for partially filled kernel page table either from a previously attempted failed memory hot add or while removing an address range which does not span the entire page table page range. The vmemmap region may share levels of table with the vmalloc region. There can be conflicts between hot remove freeing page table pages with a concurrent vmalloc() walking the kernel page table. This conflict can not just be solved by taking the init_mm ptl because of existing locking scheme in vmalloc(). So free_empty_tables() implements a floor and ceiling method which is borrowed from user page table tear with free_pgd_range() which skips freeing page table pages if intermediate address range is not aligned or maximum floor-ceiling might not own the entire page table page. While here update arch_add_memory() to handle __add_pages() failures by just unmapping recently added kernel linear mapping. Now enable memory hot remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE. This implementation is overall inspired from kernel page table tear down procedure on X86 architecture and user page table tear down method. Acked-by: Steve Capper <steve.capper@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> --- arch/arm64/Kconfig | 3 + arch/arm64/include/asm/memory.h | 1 + arch/arm64/mm/mmu.c | 305 ++++++++++++++++++++++++++++++++++++++-- 3 files changed, 300 insertions(+), 9 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 41a9b42..511249f 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -274,6 +274,9 @@ config ZONE_DMA32 config ARCH_ENABLE_MEMORY_HOTPLUG def_bool y +config ARCH_ENABLE_MEMORY_HOTREMOVE + def_bool y + config SMP def_bool y diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h index b61b50b..615dcd0 100644 --- a/arch/arm64/include/asm/memory.h +++ b/arch/arm64/include/asm/memory.h @@ -54,6 +54,7 @@ #define MODULES_VADDR (BPF_JIT_REGION_END) #define MODULES_VSIZE (SZ_128M) #define VMEMMAP_START (-VMEMMAP_SIZE - SZ_2M) +#define VMEMMAP_END (VMEMMAP_START + VMEMMAP_SIZE) #define PCI_IO_END (VMEMMAP_START - SZ_2M) #define PCI_IO_START (PCI_IO_END - PCI_IO_SIZE) #define FIXADDR_TOP (PCI_IO_START - SZ_2M) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 60c929f..6178837 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -725,6 +725,277 @@ int kern_addr_valid(unsigned long addr) return pfn_valid(pte_pfn(pte)); } + +#ifdef CONFIG_MEMORY_HOTPLUG +static void free_hotplug_page_range(struct page *page, size_t size) +{ + WARN_ON(!page || PageReserved(page)); + free_pages((unsigned long)page_address(page), get_order(size)); +} + +static void free_hotplug_pgtable_page(struct page *page) +{ + free_hotplug_page_range(page, PAGE_SIZE); +} + +static bool pgtable_range_aligned(unsigned long start, unsigned long end, + unsigned long floor, unsigned long ceiling, + unsigned long mask) +{ + start &= mask; + if (start < floor) + return false; + + if (ceiling) { + ceiling &= mask; + if (!ceiling) + return false; + } + + if (end - 1 > ceiling - 1) + return false; + return true; +} + +static void free_pte_table(pmd_t *pmdp, unsigned long addr, unsigned long end, + unsigned long floor, unsigned long ceiling) +{ + struct page *page; + pte_t *ptep; + int i; + + if (!pgtable_range_aligned(addr, end, floor, ceiling, PMD_MASK)) + return; + + ptep = pte_offset_kernel(pmdp, 0UL); + for (i = 0; i < PTRS_PER_PTE; i++) { + if (!pte_none(READ_ONCE(ptep[i]))) + return; + } + + page = pmd_page(READ_ONCE(*pmdp)); + pmd_clear(pmdp); + __flush_tlb_kernel_pgtable(addr); + free_hotplug_pgtable_page(page); +} + +static void free_pmd_table(pud_t *pudp, unsigned long addr, unsigned long end, + unsigned long floor, unsigned long ceiling) +{ + struct page *page; + pmd_t *pmdp; + int i; + + if (CONFIG_PGTABLE_LEVELS <= 2) + return; + + if (!pgtable_range_aligned(addr, end, floor, ceiling, PUD_MASK)) + return; + + pmdp = pmd_offset(pudp, 0UL); + for (i = 0; i < PTRS_PER_PMD; i++) { + if (!pmd_none(READ_ONCE(pmdp[i]))) + return; + } + + page = pud_page(READ_ONCE(*pudp)); + pud_clear(pudp); + __flush_tlb_kernel_pgtable(addr); + free_hotplug_pgtable_page(page); +} + +static void free_pud_table(pgd_t *pgdp, unsigned long addr, unsigned long end, + unsigned long floor, unsigned long ceiling) +{ + struct page *page; + pud_t *pudp; + int i; + + if (CONFIG_PGTABLE_LEVELS <= 3) + return; + + if (!pgtable_range_aligned(addr, end, floor, ceiling, PGDIR_MASK)) + return; + + pudp = pud_offset(pgdp, 0UL); + for (i = 0; i < PTRS_PER_PUD; i++) { + if (!pud_none(READ_ONCE(pudp[i]))) + return; + } + + page = pgd_page(READ_ONCE(*pgdp)); + pgd_clear(pgdp); + __flush_tlb_kernel_pgtable(addr); + free_hotplug_pgtable_page(page); +} + +static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr, + unsigned long end, bool sparse_vmap) +{ + struct page *page; + pte_t *ptep, pte; + + do { + ptep = pte_offset_kernel(pmdp, addr); + pte = READ_ONCE(*ptep); + if (pte_none(pte)) + continue; + + WARN_ON(!pte_present(pte)); + page = sparse_vmap ? pte_page(pte) : NULL; + pte_clear(&init_mm, addr, ptep); + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); + if (sparse_vmap) + free_hotplug_page_range(page, PAGE_SIZE); + } while (addr += PAGE_SIZE, addr < end); +} + +static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr, + unsigned long end, bool sparse_vmap) +{ + unsigned long next; + struct page *page; + pmd_t *pmdp, pmd; + + do { + next = pmd_addr_end(addr, end); + pmdp = pmd_offset(pudp, addr); + pmd = READ_ONCE(*pmdp); + if (pmd_none(pmd)) + continue; + + WARN_ON(!pmd_present(pmd)); + if (pmd_sect(pmd)) { + page = sparse_vmap ? pmd_page(pmd) : NULL; + pmd_clear(pmdp); + flush_tlb_kernel_range(addr, next); + if (sparse_vmap) + free_hotplug_page_range(page, PMD_SIZE); + continue; + } + WARN_ON(!pmd_table(pmd)); + unmap_hotplug_pte_range(pmdp, addr, next, sparse_vmap); + } while (addr = next, addr < end); +} + +static void unmap_hotplug_pud_range(pgd_t *pgdp, unsigned long addr, + unsigned long end, bool sparse_vmap) +{ + unsigned long next; + struct page *page; + pud_t *pudp, pud; + + do { + next = pud_addr_end(addr, end); + pudp = pud_offset(pgdp, addr); + pud = READ_ONCE(*pudp); + if (pud_none(pud)) + continue; + + WARN_ON(!pud_present(pud)); + if (pud_sect(pud)) { + page = sparse_vmap ? pud_page(pud) : NULL; + pud_clear(pudp); + flush_tlb_kernel_range(addr, next); + if (sparse_vmap) + free_hotplug_page_range(page, PUD_SIZE); + continue; + } + WARN_ON(!pud_table(pud)); + unmap_hotplug_pmd_range(pudp, addr, next, sparse_vmap); + } while (addr = next, addr < end); +} + +static void unmap_hotplug_range(unsigned long addr, unsigned long end, + bool sparse_vmap) +{ + unsigned long next; + pgd_t *pgdp, pgd; + + do { + next = pgd_addr_end(addr, end); + pgdp = pgd_offset_k(addr); + pgd = READ_ONCE(*pgdp); + if (pgd_none(pgd)) + continue; + + WARN_ON(!pgd_present(pgd)); + unmap_hotplug_pud_range(pgdp, addr, next, sparse_vmap); + } while (addr = next, addr < end); +} + +static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr, + unsigned long end) +{ + pte_t *ptep, pte; + + do { + ptep = pte_offset_kernel(pmdp, addr); + pte = READ_ONCE(*ptep); + WARN_ON(!pte_none(pte)); + } while (addr += PAGE_SIZE, addr < end); +} + +static void free_empty_pmd_table(pud_t *pudp, unsigned long addr, + unsigned long end, unsigned long floor, + unsigned long ceiling) +{ + unsigned long next; + pmd_t *pmdp, pmd; + + do { + next = pmd_addr_end(addr, end); + pmdp = pmd_offset(pudp, addr); + pmd = READ_ONCE(*pmdp); + if (pmd_none(pmd)) + continue; + + WARN_ON(!pmd_present(pmd) || !pmd_table(pmd) || pmd_sect(pmd)); + free_empty_pte_table(pmdp, addr, next); + free_pte_table(pmdp, addr, next, floor, ceiling); + } while (addr = next, addr < end); +} + +static void free_empty_pud_table(pgd_t *pgdp, unsigned long addr, + unsigned long end, unsigned long floor, + unsigned long ceiling) +{ + unsigned long next; + pud_t *pudp, pud; + + do { + next = pud_addr_end(addr, end); + pudp = pud_offset(pgdp, addr); + pud = READ_ONCE(*pudp); + if (pud_none(pud)) + continue; + + WARN_ON(!pud_present(pud) || !pud_table(pud) || pud_sect(pud)); + free_empty_pmd_table(pudp, addr, next, floor, ceiling); + free_pmd_table(pudp, addr, next, floor, ceiling); + } while (addr = next, addr < end); +} + +static void free_empty_tables(unsigned long addr, unsigned long end, + unsigned long floor, unsigned long ceiling) +{ + unsigned long next; + pgd_t *pgdp, pgd; + + do { + next = pgd_addr_end(addr, end); + pgdp = pgd_offset_k(addr); + pgd = READ_ONCE(*pgdp); + if (pgd_none(pgd)) + continue; + + WARN_ON(!pgd_present(pgd)); + free_empty_pud_table(pgdp, addr, next, floor, ceiling); + free_pud_table(pgdp, addr, next, floor, ceiling); + } while (addr = next, addr < end); +} +#endif + #ifdef CONFIG_SPARSEMEM_VMEMMAP #if !ARM64_SWAPPER_USES_SECTION_MAPS int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, @@ -772,6 +1043,12 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, void vmemmap_free(unsigned long start, unsigned long end, struct vmem_altmap *altmap) { +#ifdef CONFIG_MEMORY_HOTPLUG + WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END)); + + unmap_hotplug_range(start, end, true); + free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END); +#endif } #endif /* CONFIG_SPARSEMEM_VMEMMAP */ @@ -1050,10 +1327,21 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr) } #ifdef CONFIG_MEMORY_HOTPLUG +static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size) +{ + unsigned long end = start + size; + + WARN_ON(pgdir != init_mm.pgd); + WARN_ON((start < PAGE_OFFSET) || (end > PAGE_END)); + + unmap_hotplug_range(start, end, false); + free_empty_tables(start, end, PAGE_OFFSET, PAGE_END); +} + int arch_add_memory(int nid, u64 start, u64 size, struct mhp_restrictions *restrictions) { - int flags = 0; + int ret, flags = 0; if (rodata_full || debug_pagealloc_enabled()) flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; @@ -1061,9 +1349,14 @@ int arch_add_memory(int nid, u64 start, u64 size, __create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start), size, PAGE_KERNEL, __pgd_pgtable_alloc, flags); - return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT, + ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT, restrictions); + if (ret) + __remove_pgd_mapping(swapper_pg_dir, + __phys_to_virt(start), size); + return ret; } + void arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap) { @@ -1071,14 +1364,8 @@ void arch_remove_memory(int nid, u64 start, u64 size, unsigned long nr_pages = size >> PAGE_SHIFT; struct zone *zone; - /* - * FIXME: Cleanup page tables (also in arch_add_memory() in case - * adding fails). Until then, this function should only be used - * during memory hotplug (adding memory), not for memory - * unplug. ARCH_ENABLE_MEMORY_HOTREMOVE must not be - * unlocked yet. - */ zone = page_zone(pfn_to_page(start_pfn)); __remove_pages(zone, start_pfn, nr_pages, altmap); + __remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size); } #endif -- 2.7.4 _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH V8 2/2] arm64/mm: Enable memory hot remove 2019-09-23 5:43 ` Anshuman Khandual @ 2019-09-23 11:17 ` Matthew Wilcox -1 siblings, 0 replies; 20+ messages in thread From: Matthew Wilcox @ 2019-09-23 11:17 UTC (permalink / raw) To: Anshuman Khandual Cc: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas, will, mark.rutland, mhocko, david, cai, logang, cpandya, arunks, dan.j.williams, mgorman, osalvador, ard.biesheuvel, steve.capper, broonie, valentin.schneider, Robin.Murphy, steven.price, suzuki.poulose, ira.weiny On Mon, Sep 23, 2019 at 11:13:45AM +0530, Anshuman Khandual wrote: > +#ifdef CONFIG_MEMORY_HOTPLUG > +static void free_hotplug_page_range(struct page *page, size_t size) > +{ > + WARN_ON(!page || PageReserved(page)); WARN_ON(!page) isn't terribly useful. You're going to crash on the very next line when you call page_address() anyway. If this line were if (WARN_ON(!page || PageReserved(page))) return; it would make sense, or if it were just WARN_ON(PageReserved(page)) it would also make sense. > + free_pages((unsigned long)page_address(page), get_order(size)); > +} ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH V8 2/2] arm64/mm: Enable memory hot remove @ 2019-09-23 11:17 ` Matthew Wilcox 0 siblings, 0 replies; 20+ messages in thread From: Matthew Wilcox @ 2019-09-23 11:17 UTC (permalink / raw) To: Anshuman Khandual Cc: mark.rutland, mhocko, david, catalin.marinas, linux-mm, arunks, cpandya, will, ira.weiny, steven.price, valentin.schneider, suzuki.poulose, Robin.Murphy, broonie, cai, ard.biesheuvel, dan.j.williams, linux-arm-kernel, osalvador, steve.capper, logang, linux-kernel, akpm, mgorman On Mon, Sep 23, 2019 at 11:13:45AM +0530, Anshuman Khandual wrote: > +#ifdef CONFIG_MEMORY_HOTPLUG > +static void free_hotplug_page_range(struct page *page, size_t size) > +{ > + WARN_ON(!page || PageReserved(page)); WARN_ON(!page) isn't terribly useful. You're going to crash on the very next line when you call page_address() anyway. If this line were if (WARN_ON(!page || PageReserved(page))) return; it would make sense, or if it were just WARN_ON(PageReserved(page)) it would also make sense. > + free_pages((unsigned long)page_address(page), get_order(size)); > +} _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH V8 2/2] arm64/mm: Enable memory hot remove 2019-09-23 11:17 ` Matthew Wilcox @ 2019-09-24 8:41 ` Anshuman Khandual -1 siblings, 0 replies; 20+ messages in thread From: Anshuman Khandual @ 2019-09-24 8:41 UTC (permalink / raw) To: Matthew Wilcox Cc: linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas, will, mark.rutland, mhocko, david, cai, logang, cpandya, arunks, dan.j.williams, mgorman, osalvador, ard.biesheuvel, steve.capper, broonie, valentin.schneider, Robin.Murphy, steven.price, suzuki.poulose, ira.weiny On 09/23/2019 04:47 PM, Matthew Wilcox wrote: > On Mon, Sep 23, 2019 at 11:13:45AM +0530, Anshuman Khandual wrote: >> +#ifdef CONFIG_MEMORY_HOTPLUG >> +static void free_hotplug_page_range(struct page *page, size_t size) >> +{ >> + WARN_ON(!page || PageReserved(page)); > > WARN_ON(!page) isn't terribly useful. You're going to crash on the very > next line when you call page_address() anyway. If this line were > > if (WARN_ON(!page || PageReserved(page))) > return; > > it would make sense, or if it were just > > WARN_ON(PageReserved(page)) > > it would also make sense. I guess WARN_ON(PageReserved(page)) should be good enough to make sure that page being freed here was originally allocated at runtime for a previous memory hot add operation and did not some how come from the memblock reserved area. That was the original objective for this check. Will change it. > >> + free_pages((unsigned long)page_address(page), get_order(size)); >> +} > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH V8 2/2] arm64/mm: Enable memory hot remove @ 2019-09-24 8:41 ` Anshuman Khandual 0 siblings, 0 replies; 20+ messages in thread From: Anshuman Khandual @ 2019-09-24 8:41 UTC (permalink / raw) To: Matthew Wilcox Cc: mark.rutland, mhocko, david, catalin.marinas, linux-mm, arunks, cpandya, will, ira.weiny, steven.price, valentin.schneider, suzuki.poulose, Robin.Murphy, broonie, cai, ard.biesheuvel, dan.j.williams, linux-arm-kernel, osalvador, steve.capper, logang, linux-kernel, akpm, mgorman On 09/23/2019 04:47 PM, Matthew Wilcox wrote: > On Mon, Sep 23, 2019 at 11:13:45AM +0530, Anshuman Khandual wrote: >> +#ifdef CONFIG_MEMORY_HOTPLUG >> +static void free_hotplug_page_range(struct page *page, size_t size) >> +{ >> + WARN_ON(!page || PageReserved(page)); > > WARN_ON(!page) isn't terribly useful. You're going to crash on the very > next line when you call page_address() anyway. If this line were > > if (WARN_ON(!page || PageReserved(page))) > return; > > it would make sense, or if it were just > > WARN_ON(PageReserved(page)) > > it would also make sense. I guess WARN_ON(PageReserved(page)) should be good enough to make sure that page being freed here was originally allocated at runtime for a previous memory hot add operation and did not some how come from the memblock reserved area. That was the original objective for this check. Will change it. > >> + free_pages((unsigned long)page_address(page), get_order(size)); >> +} > _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH V8 2/2] arm64/mm: Enable memory hot remove 2019-09-23 5:43 ` Anshuman Khandual @ 2019-09-23 17:39 ` kbuild test robot -1 siblings, 0 replies; 20+ messages in thread From: kbuild test robot @ 2019-09-23 17:39 UTC (permalink / raw) To: Anshuman Khandual Cc: kbuild-all, linux-mm, linux-kernel, linux-arm-kernel, akpm, catalin.marinas, will, mark.rutland, mhocko, david, cai, logang, cpandya, arunks, dan.j.williams, mgorman, osalvador, ard.biesheuvel, steve.capper, broonie, valentin.schneider, Robin.Murphy, steven.price, suzuki.poulose, ira.weiny [-- Attachment #1: Type: text/plain, Size: 4255 bytes --] Hi Anshuman, Thank you for the patch! Perhaps something to improve: [auto build test WARNING on linus/master] [cannot apply to v5.3 next-20190920] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system. BTW, we also suggest to use '--base' option to specify the base tree in git format-patch, please see https://stackoverflow.com/a/37406982] url: https://github.com/0day-ci/linux/commits/Anshuman-Khandual/arm64-mm-Hold-memory-hotplug-lock-while-walking-for-kernel-page-table-dump/20190923-134733 config: arm64-allmodconfig (attached as .config) compiler: aarch64-linux-gcc (GCC) 7.4.0 reproduce: wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree GCC_VERSION=7.4.0 make.cross ARCH=arm64 If you fix the issue, kindly add following tag Reported-by: kbuild test robot <lkp@intel.com> All warnings (new ones prefixed by >>): >> mm/memremap.c:16:0: warning: "SECTION_MASK" redefined #define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1) In file included from arch/arm64/include/asm/processor.h:34:0, from include/linux/mutex.h:19, from include/linux/kernfs.h:12, from include/linux/sysfs.h:16, from include/linux/kobject.h:20, from include/linux/device.h:16, from mm/memremap.c:3: arch/arm64/include/asm/pgtable-hwdef.h:79:0: note: this is the location of the previous definition #define SECTION_MASK (~(SECTION_SIZE-1)) >> mm/memremap.c:17:0: warning: "SECTION_SIZE" redefined #define SECTION_SIZE (1UL << PA_SECTION_SHIFT) In file included from arch/arm64/include/asm/processor.h:34:0, from include/linux/mutex.h:19, from include/linux/kernfs.h:12, from include/linux/sysfs.h:16, from include/linux/kobject.h:20, from include/linux/device.h:16, from mm/memremap.c:3: arch/arm64/include/asm/pgtable-hwdef.h:78:0: note: this is the location of the previous definition #define SECTION_SIZE (_AC(1, UL) << SECTION_SHIFT) vim +/SECTION_MASK +16 mm/memremap.c 7d3dcf26a6559f kernel/memremap.c Christoph Hellwig 2015-08-10 @3 #include <linux/device.h> 92281dee825f6d kernel/memremap.c Dan Williams 2015-08-10 4 #include <linux/io.h> 0207df4fa1a869 kernel/memremap.c Andrey Ryabinin 2018-08-17 5 #include <linux/kasan.h> 41e94a851304f7 kernel/memremap.c Christoph Hellwig 2015-08-17 6 #include <linux/memory_hotplug.h> bcfa4b72111c9a kernel/memremap.c Matthew Wilcox 2018-08-15 7 #include <linux/mm.h> bcfa4b72111c9a kernel/memremap.c Matthew Wilcox 2018-08-15 8 #include <linux/pfn_t.h> 5042db43cc26f5 kernel/memremap.c Jérôme Glisse 2017-09-08 9 #include <linux/swap.h> 5042db43cc26f5 kernel/memremap.c Jérôme Glisse 2017-09-08 10 #include <linux/swapops.h> bcfa4b72111c9a kernel/memremap.c Matthew Wilcox 2018-08-15 11 #include <linux/types.h> e7638488434415 kernel/memremap.c Dan Williams 2018-05-16 12 #include <linux/wait_bit.h> bcfa4b72111c9a kernel/memremap.c Matthew Wilcox 2018-08-15 13 #include <linux/xarray.h> 92281dee825f6d kernel/memremap.c Dan Williams 2015-08-10 14 bcfa4b72111c9a kernel/memremap.c Matthew Wilcox 2018-08-15 15 static DEFINE_XARRAY(pgmap_array); 9476df7d80dfc4 kernel/memremap.c Dan Williams 2016-01-15 @16 #define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1) 9476df7d80dfc4 kernel/memremap.c Dan Williams 2016-01-15 @17 #define SECTION_SIZE (1UL << PA_SECTION_SHIFT) 9476df7d80dfc4 kernel/memremap.c Dan Williams 2016-01-15 18 :::::: The code at line 16 was first introduced by commit :::::: 9476df7d80dfc425b37bfecf1d89edf8ec81fcb6 mm: introduce find_dev_pagemap() :::::: TO: Dan Williams <dan.j.williams@intel.com> :::::: CC: Linus Torvalds <torvalds@linux-foundation.org> --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 67082 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH V8 2/2] arm64/mm: Enable memory hot remove @ 2019-09-23 17:39 ` kbuild test robot 0 siblings, 0 replies; 20+ messages in thread From: kbuild test robot @ 2019-09-23 17:39 UTC (permalink / raw) To: Anshuman Khandual Cc: mark.rutland, mhocko, david, catalin.marinas, linux-mm, arunks, cpandya, will, ira.weiny, steven.price, valentin.schneider, suzuki.poulose, Robin.Murphy, broonie, cai, ard.biesheuvel, dan.j.williams, linux-arm-kernel, osalvador, steve.capper, logang, linux-kernel, kbuild-all, akpm, mgorman [-- Attachment #1: Type: text/plain, Size: 4255 bytes --] Hi Anshuman, Thank you for the patch! Perhaps something to improve: [auto build test WARNING on linus/master] [cannot apply to v5.3 next-20190920] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system. BTW, we also suggest to use '--base' option to specify the base tree in git format-patch, please see https://stackoverflow.com/a/37406982] url: https://github.com/0day-ci/linux/commits/Anshuman-Khandual/arm64-mm-Hold-memory-hotplug-lock-while-walking-for-kernel-page-table-dump/20190923-134733 config: arm64-allmodconfig (attached as .config) compiler: aarch64-linux-gcc (GCC) 7.4.0 reproduce: wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree GCC_VERSION=7.4.0 make.cross ARCH=arm64 If you fix the issue, kindly add following tag Reported-by: kbuild test robot <lkp@intel.com> All warnings (new ones prefixed by >>): >> mm/memremap.c:16:0: warning: "SECTION_MASK" redefined #define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1) In file included from arch/arm64/include/asm/processor.h:34:0, from include/linux/mutex.h:19, from include/linux/kernfs.h:12, from include/linux/sysfs.h:16, from include/linux/kobject.h:20, from include/linux/device.h:16, from mm/memremap.c:3: arch/arm64/include/asm/pgtable-hwdef.h:79:0: note: this is the location of the previous definition #define SECTION_MASK (~(SECTION_SIZE-1)) >> mm/memremap.c:17:0: warning: "SECTION_SIZE" redefined #define SECTION_SIZE (1UL << PA_SECTION_SHIFT) In file included from arch/arm64/include/asm/processor.h:34:0, from include/linux/mutex.h:19, from include/linux/kernfs.h:12, from include/linux/sysfs.h:16, from include/linux/kobject.h:20, from include/linux/device.h:16, from mm/memremap.c:3: arch/arm64/include/asm/pgtable-hwdef.h:78:0: note: this is the location of the previous definition #define SECTION_SIZE (_AC(1, UL) << SECTION_SHIFT) vim +/SECTION_MASK +16 mm/memremap.c 7d3dcf26a6559f kernel/memremap.c Christoph Hellwig 2015-08-10 @3 #include <linux/device.h> 92281dee825f6d kernel/memremap.c Dan Williams 2015-08-10 4 #include <linux/io.h> 0207df4fa1a869 kernel/memremap.c Andrey Ryabinin 2018-08-17 5 #include <linux/kasan.h> 41e94a851304f7 kernel/memremap.c Christoph Hellwig 2015-08-17 6 #include <linux/memory_hotplug.h> bcfa4b72111c9a kernel/memremap.c Matthew Wilcox 2018-08-15 7 #include <linux/mm.h> bcfa4b72111c9a kernel/memremap.c Matthew Wilcox 2018-08-15 8 #include <linux/pfn_t.h> 5042db43cc26f5 kernel/memremap.c Jérôme Glisse 2017-09-08 9 #include <linux/swap.h> 5042db43cc26f5 kernel/memremap.c Jérôme Glisse 2017-09-08 10 #include <linux/swapops.h> bcfa4b72111c9a kernel/memremap.c Matthew Wilcox 2018-08-15 11 #include <linux/types.h> e7638488434415 kernel/memremap.c Dan Williams 2018-05-16 12 #include <linux/wait_bit.h> bcfa4b72111c9a kernel/memremap.c Matthew Wilcox 2018-08-15 13 #include <linux/xarray.h> 92281dee825f6d kernel/memremap.c Dan Williams 2015-08-10 14 bcfa4b72111c9a kernel/memremap.c Matthew Wilcox 2018-08-15 15 static DEFINE_XARRAY(pgmap_array); 9476df7d80dfc4 kernel/memremap.c Dan Williams 2016-01-15 @16 #define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1) 9476df7d80dfc4 kernel/memremap.c Dan Williams 2016-01-15 @17 #define SECTION_SIZE (1UL << PA_SECTION_SHIFT) 9476df7d80dfc4 kernel/memremap.c Dan Williams 2016-01-15 18 :::::: The code at line 16 was first introduced by commit :::::: 9476df7d80dfc425b37bfecf1d89edf8ec81fcb6 mm: introduce find_dev_pagemap() :::::: TO: Dan Williams <dan.j.williams@intel.com> :::::: CC: Linus Torvalds <torvalds@linux-foundation.org> --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 67082 bytes --] [-- Attachment #3: Type: text/plain, Size: 176 bytes --] _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH V8 2/2] arm64/mm: Enable memory hot remove 2019-09-23 5:43 ` Anshuman Khandual @ 2019-10-07 14:17 ` Catalin Marinas -1 siblings, 0 replies; 20+ messages in thread From: Catalin Marinas @ 2019-10-07 14:17 UTC (permalink / raw) To: Anshuman Khandual Cc: linux-mm, linux-kernel, linux-arm-kernel, akpm, will, mark.rutland, mhocko, david, cai, logang, cpandya, arunks, dan.j.williams, mgorman, osalvador, ard.biesheuvel, steve.capper, broonie, valentin.schneider, Robin.Murphy, steven.price, suzuki.poulose, ira.weiny On Mon, Sep 23, 2019 at 11:13:45AM +0530, Anshuman Khandual wrote: > The arch code for hot-remove must tear down portions of the linear map and > vmemmap corresponding to memory being removed. In both cases the page > tables mapping these regions must be freed, and when sparse vmemmap is in > use the memory backing the vmemmap must also be freed. > > This patch adds unmap_hotplug_range() and free_empty_tables() helpers which > can be used to tear down either region and calls it from vmemmap_free() and > ___remove_pgd_mapping(). The sparse_vmap argument determines whether the > backing memory will be freed. Can you change the 'sparse_vmap' name to something more meaningful which would suggest freeing of the backing memory? > It makes two distinct passes over the kernel page table. In the first pass > with unmap_hotplug_range() it unmaps, invalidates applicable TLB cache and > frees backing memory if required (vmemmap) for each mapped leaf entry. In > the second pass with free_empty_tables() it looks for empty page table > sections whose page table page can be unmapped, TLB invalidated and freed. > > While freeing intermediate level page table pages bail out if any of its > entries are still valid. This can happen for partially filled kernel page > table either from a previously attempted failed memory hot add or while > removing an address range which does not span the entire page table page > range. > > The vmemmap region may share levels of table with the vmalloc region. > There can be conflicts between hot remove freeing page table pages with > a concurrent vmalloc() walking the kernel page table. This conflict can > not just be solved by taking the init_mm ptl because of existing locking > scheme in vmalloc(). So free_empty_tables() implements a floor and ceiling > method which is borrowed from user page table tear with free_pgd_range() > which skips freeing page table pages if intermediate address range is not > aligned or maximum floor-ceiling might not own the entire page table page. > > While here update arch_add_memory() to handle __add_pages() failures by > just unmapping recently added kernel linear mapping. Now enable memory hot > remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE. > > This implementation is overall inspired from kernel page table tear down > procedure on X86 architecture and user page table tear down method. > > Acked-by: Steve Capper <steve.capper@arm.com> > Acked-by: David Hildenbrand <david@redhat.com> > Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Given the amount of changes since version 7, do the acks still stand? [...] > +static void free_pte_table(pmd_t *pmdp, unsigned long addr, unsigned long end, > + unsigned long floor, unsigned long ceiling) > +{ > + struct page *page; > + pte_t *ptep; > + int i; > + > + if (!pgtable_range_aligned(addr, end, floor, ceiling, PMD_MASK)) > + return; > + > + ptep = pte_offset_kernel(pmdp, 0UL); > + for (i = 0; i < PTRS_PER_PTE; i++) { > + if (!pte_none(READ_ONCE(ptep[i]))) > + return; > + } > + > + page = pmd_page(READ_ONCE(*pmdp)); Arguably, that's not the pmd page we are freeing here. Even if you get the same result, pmd_page() is normally used for huge pages pointed at by the pmd entry. Since you have the ptep already, why not use virt_to_page(ptep)? > + pmd_clear(pmdp); > + __flush_tlb_kernel_pgtable(addr); > + free_hotplug_pgtable_page(page); > +} > + > +static void free_pmd_table(pud_t *pudp, unsigned long addr, unsigned long end, > + unsigned long floor, unsigned long ceiling) > +{ > + struct page *page; > + pmd_t *pmdp; > + int i; > + > + if (CONFIG_PGTABLE_LEVELS <= 2) > + return; > + > + if (!pgtable_range_aligned(addr, end, floor, ceiling, PUD_MASK)) > + return; > + > + pmdp = pmd_offset(pudp, 0UL); > + for (i = 0; i < PTRS_PER_PMD; i++) { > + if (!pmd_none(READ_ONCE(pmdp[i]))) > + return; > + } > + > + page = pud_page(READ_ONCE(*pudp)); Same here, virt_to_page(pmdp). > + pud_clear(pudp); > + __flush_tlb_kernel_pgtable(addr); > + free_hotplug_pgtable_page(page); > +} > + > +static void free_pud_table(pgd_t *pgdp, unsigned long addr, unsigned long end, > + unsigned long floor, unsigned long ceiling) > +{ > + struct page *page; > + pud_t *pudp; > + int i; > + > + if (CONFIG_PGTABLE_LEVELS <= 3) > + return; > + > + if (!pgtable_range_aligned(addr, end, floor, ceiling, PGDIR_MASK)) > + return; > + > + pudp = pud_offset(pgdp, 0UL); > + for (i = 0; i < PTRS_PER_PUD; i++) { > + if (!pud_none(READ_ONCE(pudp[i]))) > + return; > + } > + > + page = pgd_page(READ_ONCE(*pgdp)); As above. > + pgd_clear(pgdp); > + __flush_tlb_kernel_pgtable(addr); > + free_hotplug_pgtable_page(page); > +} > + > +static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr, > + unsigned long end, bool sparse_vmap) > +{ > + struct page *page; > + pte_t *ptep, pte; > + > + do { > + ptep = pte_offset_kernel(pmdp, addr); > + pte = READ_ONCE(*ptep); > + if (pte_none(pte)) > + continue; > + > + WARN_ON(!pte_present(pte)); > + page = sparse_vmap ? pte_page(pte) : NULL; > + pte_clear(&init_mm, addr, ptep); > + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); > + if (sparse_vmap) > + free_hotplug_page_range(page, PAGE_SIZE); You could only set 'page' if sparse_vmap (or even drop 'page' entirely). The compiler is probably smart enough to optimise it but using a pointless ternary operator just makes the code harder to follow. > + } while (addr += PAGE_SIZE, addr < end); > +} [...] > +static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr, > + unsigned long end) > +{ > + pte_t *ptep, pte; > + > + do { > + ptep = pte_offset_kernel(pmdp, addr); > + pte = READ_ONCE(*ptep); > + WARN_ON(!pte_none(pte)); > + } while (addr += PAGE_SIZE, addr < end); > +} > + > +static void free_empty_pmd_table(pud_t *pudp, unsigned long addr, > + unsigned long end, unsigned long floor, > + unsigned long ceiling) > +{ > + unsigned long next; > + pmd_t *pmdp, pmd; > + > + do { > + next = pmd_addr_end(addr, end); > + pmdp = pmd_offset(pudp, addr); > + pmd = READ_ONCE(*pmdp); > + if (pmd_none(pmd)) > + continue; > + > + WARN_ON(!pmd_present(pmd) || !pmd_table(pmd) || pmd_sect(pmd)); > + free_empty_pte_table(pmdp, addr, next); > + free_pte_table(pmdp, addr, next, floor, ceiling); Do we need two closely named functions here? Can you not collapse free_empty_pud_table() and free_pte_table() into a single one? The same comment for the pmd/pud variants. I just find this confusing. > + } while (addr = next, addr < end); You could make these function in two steps: first, as above, invoke the next level recursively; second, after the do..while loop, check whether it's empty and free the pmd page as in free_pmd_table(). > +} [...] -- Catalin ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH V8 2/2] arm64/mm: Enable memory hot remove @ 2019-10-07 14:17 ` Catalin Marinas 0 siblings, 0 replies; 20+ messages in thread From: Catalin Marinas @ 2019-10-07 14:17 UTC (permalink / raw) To: Anshuman Khandual Cc: mark.rutland, mhocko, david, linux-mm, arunks, cpandya, will, ira.weiny, steven.price, valentin.schneider, suzuki.poulose, Robin.Murphy, broonie, cai, ard.biesheuvel, dan.j.williams, linux-arm-kernel, osalvador, steve.capper, logang, linux-kernel, akpm, mgorman On Mon, Sep 23, 2019 at 11:13:45AM +0530, Anshuman Khandual wrote: > The arch code for hot-remove must tear down portions of the linear map and > vmemmap corresponding to memory being removed. In both cases the page > tables mapping these regions must be freed, and when sparse vmemmap is in > use the memory backing the vmemmap must also be freed. > > This patch adds unmap_hotplug_range() and free_empty_tables() helpers which > can be used to tear down either region and calls it from vmemmap_free() and > ___remove_pgd_mapping(). The sparse_vmap argument determines whether the > backing memory will be freed. Can you change the 'sparse_vmap' name to something more meaningful which would suggest freeing of the backing memory? > It makes two distinct passes over the kernel page table. In the first pass > with unmap_hotplug_range() it unmaps, invalidates applicable TLB cache and > frees backing memory if required (vmemmap) for each mapped leaf entry. In > the second pass with free_empty_tables() it looks for empty page table > sections whose page table page can be unmapped, TLB invalidated and freed. > > While freeing intermediate level page table pages bail out if any of its > entries are still valid. This can happen for partially filled kernel page > table either from a previously attempted failed memory hot add or while > removing an address range which does not span the entire page table page > range. > > The vmemmap region may share levels of table with the vmalloc region. > There can be conflicts between hot remove freeing page table pages with > a concurrent vmalloc() walking the kernel page table. This conflict can > not just be solved by taking the init_mm ptl because of existing locking > scheme in vmalloc(). So free_empty_tables() implements a floor and ceiling > method which is borrowed from user page table tear with free_pgd_range() > which skips freeing page table pages if intermediate address range is not > aligned or maximum floor-ceiling might not own the entire page table page. > > While here update arch_add_memory() to handle __add_pages() failures by > just unmapping recently added kernel linear mapping. Now enable memory hot > remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE. > > This implementation is overall inspired from kernel page table tear down > procedure on X86 architecture and user page table tear down method. > > Acked-by: Steve Capper <steve.capper@arm.com> > Acked-by: David Hildenbrand <david@redhat.com> > Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Given the amount of changes since version 7, do the acks still stand? [...] > +static void free_pte_table(pmd_t *pmdp, unsigned long addr, unsigned long end, > + unsigned long floor, unsigned long ceiling) > +{ > + struct page *page; > + pte_t *ptep; > + int i; > + > + if (!pgtable_range_aligned(addr, end, floor, ceiling, PMD_MASK)) > + return; > + > + ptep = pte_offset_kernel(pmdp, 0UL); > + for (i = 0; i < PTRS_PER_PTE; i++) { > + if (!pte_none(READ_ONCE(ptep[i]))) > + return; > + } > + > + page = pmd_page(READ_ONCE(*pmdp)); Arguably, that's not the pmd page we are freeing here. Even if you get the same result, pmd_page() is normally used for huge pages pointed at by the pmd entry. Since you have the ptep already, why not use virt_to_page(ptep)? > + pmd_clear(pmdp); > + __flush_tlb_kernel_pgtable(addr); > + free_hotplug_pgtable_page(page); > +} > + > +static void free_pmd_table(pud_t *pudp, unsigned long addr, unsigned long end, > + unsigned long floor, unsigned long ceiling) > +{ > + struct page *page; > + pmd_t *pmdp; > + int i; > + > + if (CONFIG_PGTABLE_LEVELS <= 2) > + return; > + > + if (!pgtable_range_aligned(addr, end, floor, ceiling, PUD_MASK)) > + return; > + > + pmdp = pmd_offset(pudp, 0UL); > + for (i = 0; i < PTRS_PER_PMD; i++) { > + if (!pmd_none(READ_ONCE(pmdp[i]))) > + return; > + } > + > + page = pud_page(READ_ONCE(*pudp)); Same here, virt_to_page(pmdp). > + pud_clear(pudp); > + __flush_tlb_kernel_pgtable(addr); > + free_hotplug_pgtable_page(page); > +} > + > +static void free_pud_table(pgd_t *pgdp, unsigned long addr, unsigned long end, > + unsigned long floor, unsigned long ceiling) > +{ > + struct page *page; > + pud_t *pudp; > + int i; > + > + if (CONFIG_PGTABLE_LEVELS <= 3) > + return; > + > + if (!pgtable_range_aligned(addr, end, floor, ceiling, PGDIR_MASK)) > + return; > + > + pudp = pud_offset(pgdp, 0UL); > + for (i = 0; i < PTRS_PER_PUD; i++) { > + if (!pud_none(READ_ONCE(pudp[i]))) > + return; > + } > + > + page = pgd_page(READ_ONCE(*pgdp)); As above. > + pgd_clear(pgdp); > + __flush_tlb_kernel_pgtable(addr); > + free_hotplug_pgtable_page(page); > +} > + > +static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr, > + unsigned long end, bool sparse_vmap) > +{ > + struct page *page; > + pte_t *ptep, pte; > + > + do { > + ptep = pte_offset_kernel(pmdp, addr); > + pte = READ_ONCE(*ptep); > + if (pte_none(pte)) > + continue; > + > + WARN_ON(!pte_present(pte)); > + page = sparse_vmap ? pte_page(pte) : NULL; > + pte_clear(&init_mm, addr, ptep); > + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); > + if (sparse_vmap) > + free_hotplug_page_range(page, PAGE_SIZE); You could only set 'page' if sparse_vmap (or even drop 'page' entirely). The compiler is probably smart enough to optimise it but using a pointless ternary operator just makes the code harder to follow. > + } while (addr += PAGE_SIZE, addr < end); > +} [...] > +static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr, > + unsigned long end) > +{ > + pte_t *ptep, pte; > + > + do { > + ptep = pte_offset_kernel(pmdp, addr); > + pte = READ_ONCE(*ptep); > + WARN_ON(!pte_none(pte)); > + } while (addr += PAGE_SIZE, addr < end); > +} > + > +static void free_empty_pmd_table(pud_t *pudp, unsigned long addr, > + unsigned long end, unsigned long floor, > + unsigned long ceiling) > +{ > + unsigned long next; > + pmd_t *pmdp, pmd; > + > + do { > + next = pmd_addr_end(addr, end); > + pmdp = pmd_offset(pudp, addr); > + pmd = READ_ONCE(*pmdp); > + if (pmd_none(pmd)) > + continue; > + > + WARN_ON(!pmd_present(pmd) || !pmd_table(pmd) || pmd_sect(pmd)); > + free_empty_pte_table(pmdp, addr, next); > + free_pte_table(pmdp, addr, next, floor, ceiling); Do we need two closely named functions here? Can you not collapse free_empty_pud_table() and free_pte_table() into a single one? The same comment for the pmd/pud variants. I just find this confusing. > + } while (addr = next, addr < end); You could make these function in two steps: first, as above, invoke the next level recursively; second, after the do..while loop, check whether it's empty and free the pmd page as in free_pmd_table(). > +} [...] -- Catalin _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH V8 2/2] arm64/mm: Enable memory hot remove 2019-10-07 14:17 ` Catalin Marinas @ 2019-10-08 4:36 ` Anshuman Khandual -1 siblings, 0 replies; 20+ messages in thread From: Anshuman Khandual @ 2019-10-08 4:36 UTC (permalink / raw) To: Catalin Marinas Cc: linux-mm, linux-kernel, linux-arm-kernel, akpm, will, mark.rutland, mhocko, david, cai, logang, cpandya, arunks, dan.j.williams, mgorman, osalvador, ard.biesheuvel, steve.capper, broonie, valentin.schneider, Robin.Murphy, steven.price, suzuki.poulose, ira.weiny On 10/07/2019 07:47 PM, Catalin Marinas wrote: > On Mon, Sep 23, 2019 at 11:13:45AM +0530, Anshuman Khandual wrote: >> The arch code for hot-remove must tear down portions of the linear map and >> vmemmap corresponding to memory being removed. In both cases the page >> tables mapping these regions must be freed, and when sparse vmemmap is in >> use the memory backing the vmemmap must also be freed. >> >> This patch adds unmap_hotplug_range() and free_empty_tables() helpers which >> can be used to tear down either region and calls it from vmemmap_free() and >> ___remove_pgd_mapping(). The sparse_vmap argument determines whether the >> backing memory will be freed. > > Can you change the 'sparse_vmap' name to something more meaningful which > would suggest freeing of the backing memory? free_mapped_mem or free_backed_mem ? Even shorter forms like free_mapped or free_backed might do as well. Do you have a particular preference here ? But yes, sparse_vmap has been very much specific to vmemmap for these functions which are now very generic in nature. > >> It makes two distinct passes over the kernel page table. In the first pass >> with unmap_hotplug_range() it unmaps, invalidates applicable TLB cache and >> frees backing memory if required (vmemmap) for each mapped leaf entry. In >> the second pass with free_empty_tables() it looks for empty page table >> sections whose page table page can be unmapped, TLB invalidated and freed. >> >> While freeing intermediate level page table pages bail out if any of its >> entries are still valid. This can happen for partially filled kernel page >> table either from a previously attempted failed memory hot add or while >> removing an address range which does not span the entire page table page >> range. >> >> The vmemmap region may share levels of table with the vmalloc region. >> There can be conflicts between hot remove freeing page table pages with >> a concurrent vmalloc() walking the kernel page table. This conflict can >> not just be solved by taking the init_mm ptl because of existing locking >> scheme in vmalloc(). So free_empty_tables() implements a floor and ceiling >> method which is borrowed from user page table tear with free_pgd_range() >> which skips freeing page table pages if intermediate address range is not >> aligned or maximum floor-ceiling might not own the entire page table page. >> >> While here update arch_add_memory() to handle __add_pages() failures by >> just unmapping recently added kernel linear mapping. Now enable memory hot >> remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE. >> >> This implementation is overall inspired from kernel page table tear down >> procedure on X86 architecture and user page table tear down method. >> >> Acked-by: Steve Capper <steve.capper@arm.com> >> Acked-by: David Hildenbrand <david@redhat.com> >> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> > > Given the amount of changes since version 7, do the acks still stand? I had taken the liberty to carry them till V7 where the implementation has been sort of structurally similar but as you mention, there have been some basic changes in the approach since V7. Will drop these tags in next version and request their fresh ACKs once again. > > [...] >> +static void free_pte_table(pmd_t *pmdp, unsigned long addr, unsigned long end, >> + unsigned long floor, unsigned long ceiling) >> +{ >> + struct page *page; >> + pte_t *ptep; >> + int i; >> + >> + if (!pgtable_range_aligned(addr, end, floor, ceiling, PMD_MASK)) >> + return; >> + >> + ptep = pte_offset_kernel(pmdp, 0UL); >> + for (i = 0; i < PTRS_PER_PTE; i++) { >> + if (!pte_none(READ_ONCE(ptep[i]))) >> + return; >> + } >> + >> + page = pmd_page(READ_ONCE(*pmdp)); > > Arguably, that's not the pmd page we are freeing here. Even if you get > the same result, pmd_page() is normally used for huge pages pointed at > by the pmd entry. Since you have the ptep already, why not use > virt_to_page(ptep)? Makes sense, will do. > >> + pmd_clear(pmdp); >> + __flush_tlb_kernel_pgtable(addr); >> + free_hotplug_pgtable_page(page); >> +} >> + >> +static void free_pmd_table(pud_t *pudp, unsigned long addr, unsigned long end, >> + unsigned long floor, unsigned long ceiling) >> +{ >> + struct page *page; >> + pmd_t *pmdp; >> + int i; >> + >> + if (CONFIG_PGTABLE_LEVELS <= 2) >> + return; >> + >> + if (!pgtable_range_aligned(addr, end, floor, ceiling, PUD_MASK)) >> + return; >> + >> + pmdp = pmd_offset(pudp, 0UL); >> + for (i = 0; i < PTRS_PER_PMD; i++) { >> + if (!pmd_none(READ_ONCE(pmdp[i]))) >> + return; >> + } >> + >> + page = pud_page(READ_ONCE(*pudp)); > > Same here, virt_to_page(pmdp). Will do. > >> + pud_clear(pudp); >> + __flush_tlb_kernel_pgtable(addr); >> + free_hotplug_pgtable_page(page); >> +} >> + >> +static void free_pud_table(pgd_t *pgdp, unsigned long addr, unsigned long end, >> + unsigned long floor, unsigned long ceiling) >> +{ >> + struct page *page; >> + pud_t *pudp; >> + int i; >> + >> + if (CONFIG_PGTABLE_LEVELS <= 3) >> + return; >> + >> + if (!pgtable_range_aligned(addr, end, floor, ceiling, PGDIR_MASK)) >> + return; >> + >> + pudp = pud_offset(pgdp, 0UL); >> + for (i = 0; i < PTRS_PER_PUD; i++) { >> + if (!pud_none(READ_ONCE(pudp[i]))) >> + return; >> + } >> + >> + page = pgd_page(READ_ONCE(*pgdp)); > > As above. Will do. > >> + pgd_clear(pgdp); >> + __flush_tlb_kernel_pgtable(addr); >> + free_hotplug_pgtable_page(page); >> +} >> + >> +static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr, >> + unsigned long end, bool sparse_vmap) >> +{ >> + struct page *page; >> + pte_t *ptep, pte; >> + >> + do { >> + ptep = pte_offset_kernel(pmdp, addr); >> + pte = READ_ONCE(*ptep); >> + if (pte_none(pte)) >> + continue; >> + >> + WARN_ON(!pte_present(pte)); >> + page = sparse_vmap ? pte_page(pte) : NULL; >> + pte_clear(&init_mm, addr, ptep); >> + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); >> + if (sparse_vmap) >> + free_hotplug_page_range(page, PAGE_SIZE); > > You could only set 'page' if sparse_vmap (or even drop 'page' entirely). I am afraid 'page' is being used to hold pte_page(pte) extraction which needs to be freed (sparse_vmap) as we are going to clear the ptep entry in the next statement and lose access to it for good. We will need some where to hold onto pte_page(pte) across pte_clear() as we cannot free it before clearing it's entry and flushing the TLB. Hence wondering how the 'page' can be completely dropped. > The compiler is probably smart enough to optimise it but using a > pointless ternary operator just makes the code harder to follow. Not sure I got this but are you suggesting for an 'if' statement here if (sparse_vmap) page = pte_page(pte); instead of the current assignment ? page = sparse_vmap ? pte_page(pte) : NULL; > >> + } while (addr += PAGE_SIZE, addr < end); >> +} > [...] >> +static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr, >> + unsigned long end) >> +{ >> + pte_t *ptep, pte; >> + >> + do { >> + ptep = pte_offset_kernel(pmdp, addr); >> + pte = READ_ONCE(*ptep); >> + WARN_ON(!pte_none(pte)); >> + } while (addr += PAGE_SIZE, addr < end); >> +} >> + >> +static void free_empty_pmd_table(pud_t *pudp, unsigned long addr, >> + unsigned long end, unsigned long floor, >> + unsigned long ceiling) >> +{ >> + unsigned long next; >> + pmd_t *pmdp, pmd; >> + >> + do { >> + next = pmd_addr_end(addr, end); >> + pmdp = pmd_offset(pudp, addr); >> + pmd = READ_ONCE(*pmdp); >> + if (pmd_none(pmd)) >> + continue; >> + >> + WARN_ON(!pmd_present(pmd) || !pmd_table(pmd) || pmd_sect(pmd)); >> + free_empty_pte_table(pmdp, addr, next); >> + free_pte_table(pmdp, addr, next, floor, ceiling); > > Do we need two closely named functions here? Can you not collapse > free_empty_pud_table() and free_pte_table() into a single one? The same > comment for the pmd/pud variants. I just find this confusing. The two functions could be collapsed into a single one. But just wanted to keep free_pxx_table() part which checks floor/ceiling alignment, non-zero entries clear off the actual page table walking. > >> + } while (addr = next, addr < end); > > You could make these function in two steps: first, as above, invoke the > next level recursively; second, after the do..while loop, check whether > it's empty and free the pmd page as in free_pmd_table(). free_pte_table() freeing attempt actually belongs to free_empty_pte_table(). Yes, free_pte_table() part can be moved inside free_empty_pte_table() after it's do..while(). Also s/free_pte_table/free_pte_page to make it sound more distinct with respect to free_empty_pte_table(). Just that the pgtable page freeing part is still wrapped in a helper function to hide it's details. But if you prefer not to have these helpers free_pxx_page() and directly encode the second step in free_empty_pxx_table(), then will that instead. > >> +} > [...] > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH V8 2/2] arm64/mm: Enable memory hot remove @ 2019-10-08 4:36 ` Anshuman Khandual 0 siblings, 0 replies; 20+ messages in thread From: Anshuman Khandual @ 2019-10-08 4:36 UTC (permalink / raw) To: Catalin Marinas Cc: mark.rutland, mhocko, david, linux-mm, arunks, cpandya, will, ira.weiny, steven.price, valentin.schneider, suzuki.poulose, Robin.Murphy, broonie, cai, ard.biesheuvel, dan.j.williams, linux-arm-kernel, osalvador, steve.capper, logang, linux-kernel, akpm, mgorman On 10/07/2019 07:47 PM, Catalin Marinas wrote: > On Mon, Sep 23, 2019 at 11:13:45AM +0530, Anshuman Khandual wrote: >> The arch code for hot-remove must tear down portions of the linear map and >> vmemmap corresponding to memory being removed. In both cases the page >> tables mapping these regions must be freed, and when sparse vmemmap is in >> use the memory backing the vmemmap must also be freed. >> >> This patch adds unmap_hotplug_range() and free_empty_tables() helpers which >> can be used to tear down either region and calls it from vmemmap_free() and >> ___remove_pgd_mapping(). The sparse_vmap argument determines whether the >> backing memory will be freed. > > Can you change the 'sparse_vmap' name to something more meaningful which > would suggest freeing of the backing memory? free_mapped_mem or free_backed_mem ? Even shorter forms like free_mapped or free_backed might do as well. Do you have a particular preference here ? But yes, sparse_vmap has been very much specific to vmemmap for these functions which are now very generic in nature. > >> It makes two distinct passes over the kernel page table. In the first pass >> with unmap_hotplug_range() it unmaps, invalidates applicable TLB cache and >> frees backing memory if required (vmemmap) for each mapped leaf entry. In >> the second pass with free_empty_tables() it looks for empty page table >> sections whose page table page can be unmapped, TLB invalidated and freed. >> >> While freeing intermediate level page table pages bail out if any of its >> entries are still valid. This can happen for partially filled kernel page >> table either from a previously attempted failed memory hot add or while >> removing an address range which does not span the entire page table page >> range. >> >> The vmemmap region may share levels of table with the vmalloc region. >> There can be conflicts between hot remove freeing page table pages with >> a concurrent vmalloc() walking the kernel page table. This conflict can >> not just be solved by taking the init_mm ptl because of existing locking >> scheme in vmalloc(). So free_empty_tables() implements a floor and ceiling >> method which is borrowed from user page table tear with free_pgd_range() >> which skips freeing page table pages if intermediate address range is not >> aligned or maximum floor-ceiling might not own the entire page table page. >> >> While here update arch_add_memory() to handle __add_pages() failures by >> just unmapping recently added kernel linear mapping. Now enable memory hot >> remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE. >> >> This implementation is overall inspired from kernel page table tear down >> procedure on X86 architecture and user page table tear down method. >> >> Acked-by: Steve Capper <steve.capper@arm.com> >> Acked-by: David Hildenbrand <david@redhat.com> >> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> > > Given the amount of changes since version 7, do the acks still stand? I had taken the liberty to carry them till V7 where the implementation has been sort of structurally similar but as you mention, there have been some basic changes in the approach since V7. Will drop these tags in next version and request their fresh ACKs once again. > > [...] >> +static void free_pte_table(pmd_t *pmdp, unsigned long addr, unsigned long end, >> + unsigned long floor, unsigned long ceiling) >> +{ >> + struct page *page; >> + pte_t *ptep; >> + int i; >> + >> + if (!pgtable_range_aligned(addr, end, floor, ceiling, PMD_MASK)) >> + return; >> + >> + ptep = pte_offset_kernel(pmdp, 0UL); >> + for (i = 0; i < PTRS_PER_PTE; i++) { >> + if (!pte_none(READ_ONCE(ptep[i]))) >> + return; >> + } >> + >> + page = pmd_page(READ_ONCE(*pmdp)); > > Arguably, that's not the pmd page we are freeing here. Even if you get > the same result, pmd_page() is normally used for huge pages pointed at > by the pmd entry. Since you have the ptep already, why not use > virt_to_page(ptep)? Makes sense, will do. > >> + pmd_clear(pmdp); >> + __flush_tlb_kernel_pgtable(addr); >> + free_hotplug_pgtable_page(page); >> +} >> + >> +static void free_pmd_table(pud_t *pudp, unsigned long addr, unsigned long end, >> + unsigned long floor, unsigned long ceiling) >> +{ >> + struct page *page; >> + pmd_t *pmdp; >> + int i; >> + >> + if (CONFIG_PGTABLE_LEVELS <= 2) >> + return; >> + >> + if (!pgtable_range_aligned(addr, end, floor, ceiling, PUD_MASK)) >> + return; >> + >> + pmdp = pmd_offset(pudp, 0UL); >> + for (i = 0; i < PTRS_PER_PMD; i++) { >> + if (!pmd_none(READ_ONCE(pmdp[i]))) >> + return; >> + } >> + >> + page = pud_page(READ_ONCE(*pudp)); > > Same here, virt_to_page(pmdp). Will do. > >> + pud_clear(pudp); >> + __flush_tlb_kernel_pgtable(addr); >> + free_hotplug_pgtable_page(page); >> +} >> + >> +static void free_pud_table(pgd_t *pgdp, unsigned long addr, unsigned long end, >> + unsigned long floor, unsigned long ceiling) >> +{ >> + struct page *page; >> + pud_t *pudp; >> + int i; >> + >> + if (CONFIG_PGTABLE_LEVELS <= 3) >> + return; >> + >> + if (!pgtable_range_aligned(addr, end, floor, ceiling, PGDIR_MASK)) >> + return; >> + >> + pudp = pud_offset(pgdp, 0UL); >> + for (i = 0; i < PTRS_PER_PUD; i++) { >> + if (!pud_none(READ_ONCE(pudp[i]))) >> + return; >> + } >> + >> + page = pgd_page(READ_ONCE(*pgdp)); > > As above. Will do. > >> + pgd_clear(pgdp); >> + __flush_tlb_kernel_pgtable(addr); >> + free_hotplug_pgtable_page(page); >> +} >> + >> +static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr, >> + unsigned long end, bool sparse_vmap) >> +{ >> + struct page *page; >> + pte_t *ptep, pte; >> + >> + do { >> + ptep = pte_offset_kernel(pmdp, addr); >> + pte = READ_ONCE(*ptep); >> + if (pte_none(pte)) >> + continue; >> + >> + WARN_ON(!pte_present(pte)); >> + page = sparse_vmap ? pte_page(pte) : NULL; >> + pte_clear(&init_mm, addr, ptep); >> + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); >> + if (sparse_vmap) >> + free_hotplug_page_range(page, PAGE_SIZE); > > You could only set 'page' if sparse_vmap (or even drop 'page' entirely). I am afraid 'page' is being used to hold pte_page(pte) extraction which needs to be freed (sparse_vmap) as we are going to clear the ptep entry in the next statement and lose access to it for good. We will need some where to hold onto pte_page(pte) across pte_clear() as we cannot free it before clearing it's entry and flushing the TLB. Hence wondering how the 'page' can be completely dropped. > The compiler is probably smart enough to optimise it but using a > pointless ternary operator just makes the code harder to follow. Not sure I got this but are you suggesting for an 'if' statement here if (sparse_vmap) page = pte_page(pte); instead of the current assignment ? page = sparse_vmap ? pte_page(pte) : NULL; > >> + } while (addr += PAGE_SIZE, addr < end); >> +} > [...] >> +static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr, >> + unsigned long end) >> +{ >> + pte_t *ptep, pte; >> + >> + do { >> + ptep = pte_offset_kernel(pmdp, addr); >> + pte = READ_ONCE(*ptep); >> + WARN_ON(!pte_none(pte)); >> + } while (addr += PAGE_SIZE, addr < end); >> +} >> + >> +static void free_empty_pmd_table(pud_t *pudp, unsigned long addr, >> + unsigned long end, unsigned long floor, >> + unsigned long ceiling) >> +{ >> + unsigned long next; >> + pmd_t *pmdp, pmd; >> + >> + do { >> + next = pmd_addr_end(addr, end); >> + pmdp = pmd_offset(pudp, addr); >> + pmd = READ_ONCE(*pmdp); >> + if (pmd_none(pmd)) >> + continue; >> + >> + WARN_ON(!pmd_present(pmd) || !pmd_table(pmd) || pmd_sect(pmd)); >> + free_empty_pte_table(pmdp, addr, next); >> + free_pte_table(pmdp, addr, next, floor, ceiling); > > Do we need two closely named functions here? Can you not collapse > free_empty_pud_table() and free_pte_table() into a single one? The same > comment for the pmd/pud variants. I just find this confusing. The two functions could be collapsed into a single one. But just wanted to keep free_pxx_table() part which checks floor/ceiling alignment, non-zero entries clear off the actual page table walking. > >> + } while (addr = next, addr < end); > > You could make these function in two steps: first, as above, invoke the > next level recursively; second, after the do..while loop, check whether > it's empty and free the pmd page as in free_pmd_table(). free_pte_table() freeing attempt actually belongs to free_empty_pte_table(). Yes, free_pte_table() part can be moved inside free_empty_pte_table() after it's do..while(). Also s/free_pte_table/free_pte_page to make it sound more distinct with respect to free_empty_pte_table(). Just that the pgtable page freeing part is still wrapped in a helper function to hide it's details. But if you prefer not to have these helpers free_pxx_page() and directly encode the second step in free_empty_pxx_table(), then will that instead. > >> +} > [...] > _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH V8 2/2] arm64/mm: Enable memory hot remove 2019-10-08 4:36 ` Anshuman Khandual @ 2019-10-08 10:55 ` Catalin Marinas -1 siblings, 0 replies; 20+ messages in thread From: Catalin Marinas @ 2019-10-08 10:55 UTC (permalink / raw) To: Anshuman Khandual Cc: linux-mm, linux-kernel, linux-arm-kernel, akpm, will, mark.rutland, mhocko, david, cai, logang, cpandya, arunks, dan.j.williams, mgorman, osalvador, ard.biesheuvel, steve.capper, broonie, valentin.schneider, Robin.Murphy, steven.price, suzuki.poulose, ira.weiny On Tue, Oct 08, 2019 at 10:06:26AM +0530, Anshuman Khandual wrote: > On 10/07/2019 07:47 PM, Catalin Marinas wrote: > > On Mon, Sep 23, 2019 at 11:13:45AM +0530, Anshuman Khandual wrote: > >> The arch code for hot-remove must tear down portions of the linear map and > >> vmemmap corresponding to memory being removed. In both cases the page > >> tables mapping these regions must be freed, and when sparse vmemmap is in > >> use the memory backing the vmemmap must also be freed. > >> > >> This patch adds unmap_hotplug_range() and free_empty_tables() helpers which > >> can be used to tear down either region and calls it from vmemmap_free() and > >> ___remove_pgd_mapping(). The sparse_vmap argument determines whether the > >> backing memory will be freed. > > > > Can you change the 'sparse_vmap' name to something more meaningful which > > would suggest freeing of the backing memory? > > free_mapped_mem or free_backed_mem ? Even shorter forms like free_mapped or > free_backed might do as well. Do you have a particular preference here ? But > yes, sparse_vmap has been very much specific to vmemmap for these functions > which are now very generic in nature. free_mapped would do. > >> +static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr, > >> + unsigned long end, bool sparse_vmap) > >> +{ > >> + struct page *page; > >> + pte_t *ptep, pte; > >> + > >> + do { > >> + ptep = pte_offset_kernel(pmdp, addr); > >> + pte = READ_ONCE(*ptep); > >> + if (pte_none(pte)) > >> + continue; > >> + > >> + WARN_ON(!pte_present(pte)); > >> + page = sparse_vmap ? pte_page(pte) : NULL; > >> + pte_clear(&init_mm, addr, ptep); > >> + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); > >> + if (sparse_vmap) > >> + free_hotplug_page_range(page, PAGE_SIZE); > > > > You could only set 'page' if sparse_vmap (or even drop 'page' entirely). > > I am afraid 'page' is being used to hold pte_page(pte) extraction which > needs to be freed (sparse_vmap) as we are going to clear the ptep entry > in the next statement and lose access to it for good. You clear *ptep, not pte. > We will need some > where to hold onto pte_page(pte) across pte_clear() as we cannot free it > before clearing it's entry and flushing the TLB. Hence wondering how the > 'page' can be completely dropped. > > > The compiler is probably smart enough to optimise it but using a > > pointless ternary operator just makes the code harder to follow. > > Not sure I got this but are you suggesting for an 'if' statement here > > if (sparse_vmap) > page = pte_page(pte); > > instead of the current assignment ? > > page = sparse_vmap ? pte_page(pte) : NULL; I suggest: if (sparse_vmap) free_hotplug_pgtable_page(pte_page(pte), PAGE_SIZE); > >> + } while (addr += PAGE_SIZE, addr < end); > >> +} > > [...] > >> +static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr, > >> + unsigned long end) > >> +{ > >> + pte_t *ptep, pte; > >> + > >> + do { > >> + ptep = pte_offset_kernel(pmdp, addr); > >> + pte = READ_ONCE(*ptep); > >> + WARN_ON(!pte_none(pte)); > >> + } while (addr += PAGE_SIZE, addr < end); > >> +} > >> + > >> +static void free_empty_pmd_table(pud_t *pudp, unsigned long addr, > >> + unsigned long end, unsigned long floor, > >> + unsigned long ceiling) > >> +{ > >> + unsigned long next; > >> + pmd_t *pmdp, pmd; > >> + > >> + do { > >> + next = pmd_addr_end(addr, end); > >> + pmdp = pmd_offset(pudp, addr); > >> + pmd = READ_ONCE(*pmdp); > >> + if (pmd_none(pmd)) > >> + continue; > >> + > >> + WARN_ON(!pmd_present(pmd) || !pmd_table(pmd) || pmd_sect(pmd)); > >> + free_empty_pte_table(pmdp, addr, next); > >> + free_pte_table(pmdp, addr, next, floor, ceiling); > > > > Do we need two closely named functions here? Can you not collapse > > free_empty_pud_table() and free_pte_table() into a single one? The same > > comment for the pmd/pud variants. I just find this confusing. > > The two functions could be collapsed into a single one. But just wanted to > keep free_pxx_table() part which checks floor/ceiling alignment, non-zero > entries clear off the actual page table walking. With the pmd variant, they both take the floor/ceiling argument while the free_empty_pte_table() doesn't even free anything. So not entirely consistent. Can you not just copy the free_pgd_range() functions but instead of p*d_free_tlb() just do the TLB invalidation followed by page freeing? That seems to be an easier pattern to follow. -- Catalin ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH V8 2/2] arm64/mm: Enable memory hot remove @ 2019-10-08 10:55 ` Catalin Marinas 0 siblings, 0 replies; 20+ messages in thread From: Catalin Marinas @ 2019-10-08 10:55 UTC (permalink / raw) To: Anshuman Khandual Cc: mark.rutland, mhocko, david, linux-mm, arunks, cpandya, will, ira.weiny, steven.price, valentin.schneider, suzuki.poulose, Robin.Murphy, broonie, cai, ard.biesheuvel, dan.j.williams, linux-arm-kernel, osalvador, steve.capper, logang, linux-kernel, akpm, mgorman On Tue, Oct 08, 2019 at 10:06:26AM +0530, Anshuman Khandual wrote: > On 10/07/2019 07:47 PM, Catalin Marinas wrote: > > On Mon, Sep 23, 2019 at 11:13:45AM +0530, Anshuman Khandual wrote: > >> The arch code for hot-remove must tear down portions of the linear map and > >> vmemmap corresponding to memory being removed. In both cases the page > >> tables mapping these regions must be freed, and when sparse vmemmap is in > >> use the memory backing the vmemmap must also be freed. > >> > >> This patch adds unmap_hotplug_range() and free_empty_tables() helpers which > >> can be used to tear down either region and calls it from vmemmap_free() and > >> ___remove_pgd_mapping(). The sparse_vmap argument determines whether the > >> backing memory will be freed. > > > > Can you change the 'sparse_vmap' name to something more meaningful which > > would suggest freeing of the backing memory? > > free_mapped_mem or free_backed_mem ? Even shorter forms like free_mapped or > free_backed might do as well. Do you have a particular preference here ? But > yes, sparse_vmap has been very much specific to vmemmap for these functions > which are now very generic in nature. free_mapped would do. > >> +static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr, > >> + unsigned long end, bool sparse_vmap) > >> +{ > >> + struct page *page; > >> + pte_t *ptep, pte; > >> + > >> + do { > >> + ptep = pte_offset_kernel(pmdp, addr); > >> + pte = READ_ONCE(*ptep); > >> + if (pte_none(pte)) > >> + continue; > >> + > >> + WARN_ON(!pte_present(pte)); > >> + page = sparse_vmap ? pte_page(pte) : NULL; > >> + pte_clear(&init_mm, addr, ptep); > >> + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); > >> + if (sparse_vmap) > >> + free_hotplug_page_range(page, PAGE_SIZE); > > > > You could only set 'page' if sparse_vmap (or even drop 'page' entirely). > > I am afraid 'page' is being used to hold pte_page(pte) extraction which > needs to be freed (sparse_vmap) as we are going to clear the ptep entry > in the next statement and lose access to it for good. You clear *ptep, not pte. > We will need some > where to hold onto pte_page(pte) across pte_clear() as we cannot free it > before clearing it's entry and flushing the TLB. Hence wondering how the > 'page' can be completely dropped. > > > The compiler is probably smart enough to optimise it but using a > > pointless ternary operator just makes the code harder to follow. > > Not sure I got this but are you suggesting for an 'if' statement here > > if (sparse_vmap) > page = pte_page(pte); > > instead of the current assignment ? > > page = sparse_vmap ? pte_page(pte) : NULL; I suggest: if (sparse_vmap) free_hotplug_pgtable_page(pte_page(pte), PAGE_SIZE); > >> + } while (addr += PAGE_SIZE, addr < end); > >> +} > > [...] > >> +static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr, > >> + unsigned long end) > >> +{ > >> + pte_t *ptep, pte; > >> + > >> + do { > >> + ptep = pte_offset_kernel(pmdp, addr); > >> + pte = READ_ONCE(*ptep); > >> + WARN_ON(!pte_none(pte)); > >> + } while (addr += PAGE_SIZE, addr < end); > >> +} > >> + > >> +static void free_empty_pmd_table(pud_t *pudp, unsigned long addr, > >> + unsigned long end, unsigned long floor, > >> + unsigned long ceiling) > >> +{ > >> + unsigned long next; > >> + pmd_t *pmdp, pmd; > >> + > >> + do { > >> + next = pmd_addr_end(addr, end); > >> + pmdp = pmd_offset(pudp, addr); > >> + pmd = READ_ONCE(*pmdp); > >> + if (pmd_none(pmd)) > >> + continue; > >> + > >> + WARN_ON(!pmd_present(pmd) || !pmd_table(pmd) || pmd_sect(pmd)); > >> + free_empty_pte_table(pmdp, addr, next); > >> + free_pte_table(pmdp, addr, next, floor, ceiling); > > > > Do we need two closely named functions here? Can you not collapse > > free_empty_pud_table() and free_pte_table() into a single one? The same > > comment for the pmd/pud variants. I just find this confusing. > > The two functions could be collapsed into a single one. But just wanted to > keep free_pxx_table() part which checks floor/ceiling alignment, non-zero > entries clear off the actual page table walking. With the pmd variant, they both take the floor/ceiling argument while the free_empty_pte_table() doesn't even free anything. So not entirely consistent. Can you not just copy the free_pgd_range() functions but instead of p*d_free_tlb() just do the TLB invalidation followed by page freeing? That seems to be an easier pattern to follow. -- Catalin _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH V8 2/2] arm64/mm: Enable memory hot remove 2019-10-08 10:55 ` Catalin Marinas @ 2019-10-08 11:48 ` Anshuman Khandual -1 siblings, 0 replies; 20+ messages in thread From: Anshuman Khandual @ 2019-10-08 11:48 UTC (permalink / raw) To: Catalin Marinas Cc: linux-mm, linux-kernel, linux-arm-kernel, akpm, will, mark.rutland, mhocko, david, cai, logang, cpandya, arunks, dan.j.williams, mgorman, osalvador, ard.biesheuvel, steve.capper, broonie, valentin.schneider, Robin.Murphy, steven.price, suzuki.poulose, ira.weiny On 10/08/2019 04:25 PM, Catalin Marinas wrote: > On Tue, Oct 08, 2019 at 10:06:26AM +0530, Anshuman Khandual wrote: >> On 10/07/2019 07:47 PM, Catalin Marinas wrote: >>> On Mon, Sep 23, 2019 at 11:13:45AM +0530, Anshuman Khandual wrote: >>>> The arch code for hot-remove must tear down portions of the linear map and >>>> vmemmap corresponding to memory being removed. In both cases the page >>>> tables mapping these regions must be freed, and when sparse vmemmap is in >>>> use the memory backing the vmemmap must also be freed. >>>> >>>> This patch adds unmap_hotplug_range() and free_empty_tables() helpers which >>>> can be used to tear down either region and calls it from vmemmap_free() and >>>> ___remove_pgd_mapping(). The sparse_vmap argument determines whether the >>>> backing memory will be freed. >>> >>> Can you change the 'sparse_vmap' name to something more meaningful which >>> would suggest freeing of the backing memory? >> >> free_mapped_mem or free_backed_mem ? Even shorter forms like free_mapped or >> free_backed might do as well. Do you have a particular preference here ? But >> yes, sparse_vmap has been very much specific to vmemmap for these functions >> which are now very generic in nature. > > free_mapped would do. Sure. > >>>> +static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr, >>>> + unsigned long end, bool sparse_vmap) >>>> +{ >>>> + struct page *page; >>>> + pte_t *ptep, pte; >>>> + >>>> + do { >>>> + ptep = pte_offset_kernel(pmdp, addr); >>>> + pte = READ_ONCE(*ptep); >>>> + if (pte_none(pte)) >>>> + continue; >>>> + >>>> + WARN_ON(!pte_present(pte)); >>>> + page = sparse_vmap ? pte_page(pte) : NULL; >>>> + pte_clear(&init_mm, addr, ptep); >>>> + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); >>>> + if (sparse_vmap) >>>> + free_hotplug_page_range(page, PAGE_SIZE); >>> >>> You could only set 'page' if sparse_vmap (or even drop 'page' entirely). >> >> I am afraid 'page' is being used to hold pte_page(pte) extraction which >> needs to be freed (sparse_vmap) as we are going to clear the ptep entry >> in the next statement and lose access to it for good. > > You clear *ptep, not pte. Ahh, missed that. We have already captured the contents with READ_ONCE(). > >> We will need some >> where to hold onto pte_page(pte) across pte_clear() as we cannot free it >> before clearing it's entry and flushing the TLB. Hence wondering how the >> 'page' can be completely dropped. >> >>> The compiler is probably smart enough to optimise it but using a >>> pointless ternary operator just makes the code harder to follow. >> >> Not sure I got this but are you suggesting for an 'if' statement here >> >> if (sparse_vmap) >> page = pte_page(pte); >> >> instead of the current assignment ? >> >> page = sparse_vmap ? pte_page(pte) : NULL; > > I suggest: > > if (sparse_vmap) > free_hotplug_pgtable_page(pte_page(pte), PAGE_SIZE); Sure, will do. > >>>> + } while (addr += PAGE_SIZE, addr < end); >>>> +} >>> [...] >>>> +static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr, >>>> + unsigned long end) >>>> +{ >>>> + pte_t *ptep, pte; >>>> + >>>> + do { >>>> + ptep = pte_offset_kernel(pmdp, addr); >>>> + pte = READ_ONCE(*ptep); >>>> + WARN_ON(!pte_none(pte)); >>>> + } while (addr += PAGE_SIZE, addr < end); >>>> +} >>>> + >>>> +static void free_empty_pmd_table(pud_t *pudp, unsigned long addr, >>>> + unsigned long end, unsigned long floor, >>>> + unsigned long ceiling) >>>> +{ >>>> + unsigned long next; >>>> + pmd_t *pmdp, pmd; >>>> + >>>> + do { >>>> + next = pmd_addr_end(addr, end); >>>> + pmdp = pmd_offset(pudp, addr); >>>> + pmd = READ_ONCE(*pmdp); >>>> + if (pmd_none(pmd)) >>>> + continue; >>>> + >>>> + WARN_ON(!pmd_present(pmd) || !pmd_table(pmd) || pmd_sect(pmd)); >>>> + free_empty_pte_table(pmdp, addr, next); >>>> + free_pte_table(pmdp, addr, next, floor, ceiling); >>> >>> Do we need two closely named functions here? Can you not collapse >>> free_empty_pud_table() and free_pte_table() into a single one? The same >>> comment for the pmd/pud variants. I just find this confusing. >> >> The two functions could be collapsed into a single one. But just wanted to >> keep free_pxx_table() part which checks floor/ceiling alignment, non-zero >> entries clear off the actual page table walking. > > With the pmd variant, they both take the floor/ceiling argument while > the free_empty_pte_table() doesn't even free anything. So not entirely > consistent.> > Can you not just copy the free_pgd_range() functions but instead of > p*d_free_tlb() just do the TLB invalidation followed by page freeing? > That seems to be an easier pattern to follow. > Sure, will follow that pattern. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH V8 2/2] arm64/mm: Enable memory hot remove @ 2019-10-08 11:48 ` Anshuman Khandual 0 siblings, 0 replies; 20+ messages in thread From: Anshuman Khandual @ 2019-10-08 11:48 UTC (permalink / raw) To: Catalin Marinas Cc: mark.rutland, mhocko, david, linux-mm, arunks, cpandya, will, ira.weiny, steven.price, valentin.schneider, suzuki.poulose, Robin.Murphy, broonie, cai, ard.biesheuvel, dan.j.williams, linux-arm-kernel, osalvador, steve.capper, logang, linux-kernel, akpm, mgorman On 10/08/2019 04:25 PM, Catalin Marinas wrote: > On Tue, Oct 08, 2019 at 10:06:26AM +0530, Anshuman Khandual wrote: >> On 10/07/2019 07:47 PM, Catalin Marinas wrote: >>> On Mon, Sep 23, 2019 at 11:13:45AM +0530, Anshuman Khandual wrote: >>>> The arch code for hot-remove must tear down portions of the linear map and >>>> vmemmap corresponding to memory being removed. In both cases the page >>>> tables mapping these regions must be freed, and when sparse vmemmap is in >>>> use the memory backing the vmemmap must also be freed. >>>> >>>> This patch adds unmap_hotplug_range() and free_empty_tables() helpers which >>>> can be used to tear down either region and calls it from vmemmap_free() and >>>> ___remove_pgd_mapping(). The sparse_vmap argument determines whether the >>>> backing memory will be freed. >>> >>> Can you change the 'sparse_vmap' name to something more meaningful which >>> would suggest freeing of the backing memory? >> >> free_mapped_mem or free_backed_mem ? Even shorter forms like free_mapped or >> free_backed might do as well. Do you have a particular preference here ? But >> yes, sparse_vmap has been very much specific to vmemmap for these functions >> which are now very generic in nature. > > free_mapped would do. Sure. > >>>> +static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr, >>>> + unsigned long end, bool sparse_vmap) >>>> +{ >>>> + struct page *page; >>>> + pte_t *ptep, pte; >>>> + >>>> + do { >>>> + ptep = pte_offset_kernel(pmdp, addr); >>>> + pte = READ_ONCE(*ptep); >>>> + if (pte_none(pte)) >>>> + continue; >>>> + >>>> + WARN_ON(!pte_present(pte)); >>>> + page = sparse_vmap ? pte_page(pte) : NULL; >>>> + pte_clear(&init_mm, addr, ptep); >>>> + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); >>>> + if (sparse_vmap) >>>> + free_hotplug_page_range(page, PAGE_SIZE); >>> >>> You could only set 'page' if sparse_vmap (or even drop 'page' entirely). >> >> I am afraid 'page' is being used to hold pte_page(pte) extraction which >> needs to be freed (sparse_vmap) as we are going to clear the ptep entry >> in the next statement and lose access to it for good. > > You clear *ptep, not pte. Ahh, missed that. We have already captured the contents with READ_ONCE(). > >> We will need some >> where to hold onto pte_page(pte) across pte_clear() as we cannot free it >> before clearing it's entry and flushing the TLB. Hence wondering how the >> 'page' can be completely dropped. >> >>> The compiler is probably smart enough to optimise it but using a >>> pointless ternary operator just makes the code harder to follow. >> >> Not sure I got this but are you suggesting for an 'if' statement here >> >> if (sparse_vmap) >> page = pte_page(pte); >> >> instead of the current assignment ? >> >> page = sparse_vmap ? pte_page(pte) : NULL; > > I suggest: > > if (sparse_vmap) > free_hotplug_pgtable_page(pte_page(pte), PAGE_SIZE); Sure, will do. > >>>> + } while (addr += PAGE_SIZE, addr < end); >>>> +} >>> [...] >>>> +static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr, >>>> + unsigned long end) >>>> +{ >>>> + pte_t *ptep, pte; >>>> + >>>> + do { >>>> + ptep = pte_offset_kernel(pmdp, addr); >>>> + pte = READ_ONCE(*ptep); >>>> + WARN_ON(!pte_none(pte)); >>>> + } while (addr += PAGE_SIZE, addr < end); >>>> +} >>>> + >>>> +static void free_empty_pmd_table(pud_t *pudp, unsigned long addr, >>>> + unsigned long end, unsigned long floor, >>>> + unsigned long ceiling) >>>> +{ >>>> + unsigned long next; >>>> + pmd_t *pmdp, pmd; >>>> + >>>> + do { >>>> + next = pmd_addr_end(addr, end); >>>> + pmdp = pmd_offset(pudp, addr); >>>> + pmd = READ_ONCE(*pmdp); >>>> + if (pmd_none(pmd)) >>>> + continue; >>>> + >>>> + WARN_ON(!pmd_present(pmd) || !pmd_table(pmd) || pmd_sect(pmd)); >>>> + free_empty_pte_table(pmdp, addr, next); >>>> + free_pte_table(pmdp, addr, next, floor, ceiling); >>> >>> Do we need two closely named functions here? Can you not collapse >>> free_empty_pud_table() and free_pte_table() into a single one? The same >>> comment for the pmd/pud variants. I just find this confusing. >> >> The two functions could be collapsed into a single one. But just wanted to >> keep free_pxx_table() part which checks floor/ceiling alignment, non-zero >> entries clear off the actual page table walking. > > With the pmd variant, they both take the floor/ceiling argument while > the free_empty_pte_table() doesn't even free anything. So not entirely > consistent.> > Can you not just copy the free_pgd_range() functions but instead of > p*d_free_tlb() just do the TLB invalidation followed by page freeing? > That seems to be an easier pattern to follow. > Sure, will follow that pattern. _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2019-10-08 11:48 UTC | newest] Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-09-23 5:43 [PATCH V8 0/2] arm64/mm: Enable memory hot remove Anshuman Khandual 2019-09-23 5:43 ` Anshuman Khandual 2019-09-23 5:43 ` [PATCH V8 1/2] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump Anshuman Khandual 2019-09-23 5:43 ` Anshuman Khandual 2019-09-23 5:43 ` [PATCH V8 2/2] arm64/mm: Enable memory hot remove Anshuman Khandual 2019-09-23 5:43 ` Anshuman Khandual 2019-09-23 11:17 ` Matthew Wilcox 2019-09-23 11:17 ` Matthew Wilcox 2019-09-24 8:41 ` Anshuman Khandual 2019-09-24 8:41 ` Anshuman Khandual 2019-09-23 17:39 ` kbuild test robot 2019-09-23 17:39 ` kbuild test robot 2019-10-07 14:17 ` Catalin Marinas 2019-10-07 14:17 ` Catalin Marinas 2019-10-08 4:36 ` Anshuman Khandual 2019-10-08 4:36 ` Anshuman Khandual 2019-10-08 10:55 ` Catalin Marinas 2019-10-08 10:55 ` Catalin Marinas 2019-10-08 11:48 ` Anshuman Khandual 2019-10-08 11:48 ` Anshuman Khandual
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.