* [PATCH 0/6] mremap fixes
@ 2021-06-10  8:35 Aneesh Kumar K.V
  2021-06-10  8:35 ` [PATCH 1/6] selftest/mremap_test: Update the test to handle pagesize other than 4K Aneesh Kumar K.V
  ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-10  8:35 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Aneesh Kumar K.V, Linus Torvalds, npiggin, kaleshsingh, joel,
	Kirill A . Shutemov, linuxppc-dev

This patch series is split out from [PATCH v7 00/11] Speedup mremap on ppc64
(https://lore.kernel.org/linux-mm/20210607055131.156184-1-aneesh.kumar@linux.ibm.com),
dropping the ppc64-specific changes. I will send the ppc64-specific changes
separately once we agree on how to handle the TLB flush changes.

Aneesh Kumar K.V (6):
  selftest/mremap_test: Update the test to handle pagesize other than 4K
  selftest/mremap_test: Avoid crash with static build
  mm/mremap: Convert huge PUD move to separate helper
  mm/mremap: Don't enable optimized PUD move if page table levels is 2
  mm/mremap: Use pmd/pud_populate to update page table entries
  mm/mremap: hold the rmap lock in write mode when moving page table
    entries.

 mm/mremap.c                              |  93 +++++++++++++++---
 tools/testing/selftests/vm/mremap_test.c | 118 ++++++++++++-----------
 2 files changed, 143 insertions(+), 68 deletions(-)

-- 
2.31.1

^ permalink raw reply	[flat|nested] 15+ messages in thread
* [PATCH 1/6] selftest/mremap_test: Update the test to handle pagesize other than 4K
  2021-06-10  8:35 [PATCH 0/6] mremap fixes Aneesh Kumar K.V
@ 2021-06-10  8:35 ` Aneesh Kumar K.V
  2021-06-10  8:35 ` [PATCH 2/6] selftest/mremap_test: Avoid crash with static build Aneesh Kumar K.V
  ` (4 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-10  8:35 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Aneesh Kumar K.V, Linus Torvalds, npiggin, kaleshsingh, joel,
	Kirill A . Shutemov, linuxppc-dev

Instead of hardcoding the 4K page size, fetch it using sysconf(). The
performance measurement tests still assume 2M and 1G are hugepage sizes.

Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 tools/testing/selftests/vm/mremap_test.c | 113 ++++++++++++-----------
 1 file changed, 61 insertions(+), 52 deletions(-)

diff --git a/tools/testing/selftests/vm/mremap_test.c b/tools/testing/selftests/vm/mremap_test.c
index 9c391d016922..c9a5461eb786 100644
--- a/tools/testing/selftests/vm/mremap_test.c
+++ b/tools/testing/selftests/vm/mremap_test.c
@@ -45,14 +45,15 @@ enum {
 	_4MB = 4ULL << 20,
 	_1GB = 1ULL << 30,
 	_2GB = 2ULL << 30,
-	PTE = _4KB,
 	PMD = _2MB,
 	PUD = _1GB,
 };
 
+#define PTE page_size
+
 #define MAKE_TEST(source_align, destination_align, size,	\
 		  overlaps, should_fail, test_name)		\
-{								\
+(struct test){							\
 	.name = test_name,					\
 	.config = {						\
 		.src_alignment = source_align,			\
@@ -252,12 +253,17 @@ static int parse_args(int argc, char **argv, unsigned int *threshold_mb,
 	return 0;
 }
 
+#define MAX_TEST 13
+#define MAX_PERF_TEST 3
 int main(int argc, char **argv)
 {
 	int failures = 0;
 	int i, run_perf_tests;
 	unsigned int threshold_mb = VALIDATION_DEFAULT_THRESHOLD;
 	unsigned int pattern_seed;
+	struct test test_cases[MAX_TEST];
+	struct test perf_test_cases[MAX_PERF_TEST];
+	int page_size;
 	time_t t;
 
 	pattern_seed = (unsigned int) time(&t);
@@ -268,56 +274,59 @@ int main(int argc, char **argv)
ksft_print_msg("Test configs:\n\tthreshold_mb=%u\n\tpattern_seed=%u\n\n", threshold_mb, pattern_seed); - struct test test_cases[] = { - /* Expected mremap failures */ - MAKE_TEST(_4KB, _4KB, _4KB, OVERLAPPING, EXPECT_FAILURE, - "mremap - Source and Destination Regions Overlapping"), - MAKE_TEST(_4KB, _1KB, _4KB, NON_OVERLAPPING, EXPECT_FAILURE, - "mremap - Destination Address Misaligned (1KB-aligned)"), - MAKE_TEST(_1KB, _4KB, _4KB, NON_OVERLAPPING, EXPECT_FAILURE, - "mremap - Source Address Misaligned (1KB-aligned)"), - - /* Src addr PTE aligned */ - MAKE_TEST(PTE, PTE, _8KB, NON_OVERLAPPING, EXPECT_SUCCESS, - "8KB mremap - Source PTE-aligned, Destination PTE-aligned"), - - /* Src addr 1MB aligned */ - MAKE_TEST(_1MB, PTE, _2MB, NON_OVERLAPPING, EXPECT_SUCCESS, - "2MB mremap - Source 1MB-aligned, Destination PTE-aligned"), - MAKE_TEST(_1MB, _1MB, _2MB, NON_OVERLAPPING, EXPECT_SUCCESS, - "2MB mremap - Source 1MB-aligned, Destination 1MB-aligned"), - - /* Src addr PMD aligned */ - MAKE_TEST(PMD, PTE, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS, - "4MB mremap - Source PMD-aligned, Destination PTE-aligned"), - MAKE_TEST(PMD, _1MB, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS, - "4MB mremap - Source PMD-aligned, Destination 1MB-aligned"), - MAKE_TEST(PMD, PMD, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS, - "4MB mremap - Source PMD-aligned, Destination PMD-aligned"), - - /* Src addr PUD aligned */ - MAKE_TEST(PUD, PTE, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS, - "2GB mremap - Source PUD-aligned, Destination PTE-aligned"), - MAKE_TEST(PUD, _1MB, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS, - "2GB mremap - Source PUD-aligned, Destination 1MB-aligned"), - MAKE_TEST(PUD, PMD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS, - "2GB mremap - Source PUD-aligned, Destination PMD-aligned"), - MAKE_TEST(PUD, PUD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS, - "2GB mremap - Source PUD-aligned, Destination PUD-aligned"), - }; - - struct test perf_test_cases[] = { - /* - * mremap 1GB region - Page table level aligned time - * 
comparison. - */ - MAKE_TEST(PTE, PTE, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS, - "1GB mremap - Source PTE-aligned, Destination PTE-aligned"), - MAKE_TEST(PMD, PMD, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS, - "1GB mremap - Source PMD-aligned, Destination PMD-aligned"), - MAKE_TEST(PUD, PUD, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS, - "1GB mremap - Source PUD-aligned, Destination PUD-aligned"), - }; + page_size = sysconf(_SC_PAGESIZE); + + /* Expected mremap failures */ + test_cases[0] = MAKE_TEST(page_size, page_size, page_size, + OVERLAPPING, EXPECT_FAILURE, + "mremap - Source and Destination Regions Overlapping"); + + test_cases[1] = MAKE_TEST(page_size, page_size/4, page_size, + NON_OVERLAPPING, EXPECT_FAILURE, + "mremap - Destination Address Misaligned (1KB-aligned)"); + test_cases[2] = MAKE_TEST(page_size/4, page_size, page_size, + NON_OVERLAPPING, EXPECT_FAILURE, + "mremap - Source Address Misaligned (1KB-aligned)"); + + /* Src addr PTE aligned */ + test_cases[3] = MAKE_TEST(PTE, PTE, PTE * 2, + NON_OVERLAPPING, EXPECT_SUCCESS, + "8KB mremap - Source PTE-aligned, Destination PTE-aligned"); + + /* Src addr 1MB aligned */ + test_cases[4] = MAKE_TEST(_1MB, PTE, _2MB, NON_OVERLAPPING, EXPECT_SUCCESS, + "2MB mremap - Source 1MB-aligned, Destination PTE-aligned"); + test_cases[5] = MAKE_TEST(_1MB, _1MB, _2MB, NON_OVERLAPPING, EXPECT_SUCCESS, + "2MB mremap - Source 1MB-aligned, Destination 1MB-aligned"); + + /* Src addr PMD aligned */ + test_cases[6] = MAKE_TEST(PMD, PTE, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS, + "4MB mremap - Source PMD-aligned, Destination PTE-aligned"); + test_cases[7] = MAKE_TEST(PMD, _1MB, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS, + "4MB mremap - Source PMD-aligned, Destination 1MB-aligned"); + test_cases[8] = MAKE_TEST(PMD, PMD, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS, + "4MB mremap - Source PMD-aligned, Destination PMD-aligned"); + + /* Src addr PUD aligned */ + test_cases[9] = MAKE_TEST(PUD, PTE, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS, + "2GB mremap - 
Source PUD-aligned, Destination PTE-aligned"); + test_cases[10] = MAKE_TEST(PUD, _1MB, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS, + "2GB mremap - Source PUD-aligned, Destination 1MB-aligned"); + test_cases[11] = MAKE_TEST(PUD, PMD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS, + "2GB mremap - Source PUD-aligned, Destination PMD-aligned"); + test_cases[12] = MAKE_TEST(PUD, PUD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS, + "2GB mremap - Source PUD-aligned, Destination PUD-aligned"); + + perf_test_cases[0] = MAKE_TEST(page_size, page_size, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS, + "1GB mremap - Source PTE-aligned, Destination PTE-aligned"); + /* + * mremap 1GB region - Page table level aligned time + * comparison. + */ + perf_test_cases[1] = MAKE_TEST(PMD, PMD, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS, + "1GB mremap - Source PMD-aligned, Destination PMD-aligned"); + perf_test_cases[2] = MAKE_TEST(PUD, PUD, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS, + "1GB mremap - Source PUD-aligned, Destination PUD-aligned"); run_perf_tests = (threshold_mb == VALIDATION_NO_THRESHOLD) || (threshold_mb * _1MB >= _1GB); -- 2.31.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 2/6] selftest/mremap_test: Avoid crash with static build
  2021-06-10  8:35 [PATCH 0/6] mremap fixes Aneesh Kumar K.V
  2021-06-10  8:35 ` [PATCH 1/6] selftest/mremap_test: Update the test to handle pagesize other than 4K Aneesh Kumar K.V
@ 2021-06-10  8:35 ` Aneesh Kumar K.V
  2021-06-10  8:35 ` [PATCH 3/6] mm/mremap: Convert huge PUD move to separate helper Aneesh Kumar K.V
  ` (3 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-10  8:35 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Aneesh Kumar K.V, Linus Torvalds, npiggin, kaleshsingh, joel,
	Kirill A . Shutemov, linuxppc-dev

With a large mmap size, we can overlap with the text area, and using
MAP_FIXED results in unmapping that area. Switch to MAP_FIXED_NOREPLACE
and handle the EEXIST error.

Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 tools/testing/selftests/vm/mremap_test.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/vm/mremap_test.c b/tools/testing/selftests/vm/mremap_test.c
index c9a5461eb786..0624d1bd71b5 100644
--- a/tools/testing/selftests/vm/mremap_test.c
+++ b/tools/testing/selftests/vm/mremap_test.c
@@ -75,9 +75,10 @@ static void *get_source_mapping(struct config c)
 retry:
 	addr += c.src_alignment;
 	src_addr = mmap((void *) addr, c.region_size, PROT_READ | PROT_WRITE,
-			MAP_FIXED | MAP_ANONYMOUS | MAP_SHARED, -1, 0);
+			MAP_FIXED_NOREPLACE | MAP_ANONYMOUS | MAP_SHARED,
+			-1, 0);
 	if (src_addr == MAP_FAILED) {
-		if (errno == EPERM)
+		if (errno == EPERM || errno == EEXIST)
			goto retry;
		goto error;
	}
-- 
2.31.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread
* [PATCH 3/6] mm/mremap: Convert huge PUD move to separate helper
  2021-06-10  8:35 [PATCH 0/6] mremap fixes Aneesh Kumar K.V
  2021-06-10  8:35 ` [PATCH 1/6] selftest/mremap_test: Update the test to handle pagesize other than 4K Aneesh Kumar K.V
  2021-06-10  8:35 ` [PATCH 2/6] selftest/mremap_test: Avoid crash with static build Aneesh Kumar K.V
@ 2021-06-10  8:35 ` Aneesh Kumar K.V
  2021-06-10 22:03   ` Hugh Dickins
  2021-06-10  8:35 ` [PATCH 4/6] mm/mremap: Don't enable optimized PUD move if page table levels is 2 Aneesh Kumar K.V
  ` (2 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-10  8:35 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Aneesh Kumar K.V, Linus Torvalds, npiggin, kaleshsingh, joel,
	Kirill A . Shutemov, linuxppc-dev

With TRANSPARENT_HUGEPAGE_PUD enabled, the kernel can find huge PUD entries.
Add a helper to move huge PUD entries on mremap().

This will be used by a later patch to optimize mremap of PUD_SIZE-aligned,
level-4-PTE-mapped addresses.

This also makes sure we support mremap on huge PUD entries even with
CONFIG_HAVE_MOVE_PUD disabled.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 mm/mremap.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 73 insertions(+), 7 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 47c255b60150..92ab7d24a587 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -324,10 +324,62 @@ static inline bool move_normal_pud(struct vm_area_struct *vma,
 }
 #endif
 
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PUD
+static bool move_huge_pud(struct vm_area_struct *vma, unsigned long old_addr,
+			  unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
+{
+	spinlock_t *old_ptl, *new_ptl;
+	struct mm_struct *mm = vma->vm_mm;
+	pud_t pud;
+
+	/*
+	 * The destination pud shouldn't be established, free_pgtables()
+	 * should have released it.
+	 */
+	if (WARN_ON_ONCE(!pud_none(*new_pud)))
+		return false;
+
+	/*
+	 * We don't have to worry about the ordering of src and dst
+	 * ptlocks because exclusive mmap_lock prevents deadlock.
+	 */
+	old_ptl = pud_lock(vma->vm_mm, old_pud);
+	new_ptl = pud_lockptr(mm, new_pud);
+	if (new_ptl != old_ptl)
+		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+
+	/* Clear the pud */
+	pud = *old_pud;
+	pud_clear(old_pud);
+
+	VM_BUG_ON(!pud_none(*new_pud));
+
+	/* Set the new pud */
+	/* mark soft_dirty when we add pud level soft dirty support */
+	set_pud_at(mm, new_addr, new_pud, pud);
+	flush_pud_tlb_range(vma, old_addr, old_addr + HPAGE_PUD_SIZE);
+	if (new_ptl != old_ptl)
+		spin_unlock(new_ptl);
+	spin_unlock(old_ptl);
+
+	return true;
+}
+#else
+static bool move_huge_pud(struct vm_area_struct *vma, unsigned long old_addr,
+			  unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
+{
+	WARN_ON_ONCE(1);
+	return false;
+
+}
+#endif
+
 enum pgt_entry {
 	NORMAL_PMD,
 	HPAGE_PMD,
 	NORMAL_PUD,
+	HPAGE_PUD,
 };
 
 /*
@@ -347,6 +399,7 @@ static __always_inline unsigned long get_extent(enum pgt_entry entry,
 		mask = PMD_MASK;
 		size = PMD_SIZE;
 		break;
+	case HPAGE_PUD:
 	case NORMAL_PUD:
 		mask = PUD_MASK;
 		size = PUD_SIZE;
@@ -395,6 +448,11 @@ static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
 		move_huge_pmd(vma, old_addr, new_addr, old_entry,
 			      new_entry);
 		break;
+	case HPAGE_PUD:
+		moved = move_huge_pud(vma, old_addr, new_addr, old_entry,
+				      new_entry);
+		break;
+
 	default:
 		WARN_ON_ONCE(1);
 		break;
@@ -414,6 +472,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	unsigned long extent, old_end;
 	struct mmu_notifier_range range;
 	pmd_t *old_pmd, *new_pmd;
+	pud_t *old_pud, *new_pud;
 
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);
@@ -429,15 +488,22 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		 * PUD level if possible.
 		 */
 		extent = get_extent(NORMAL_PUD, old_addr, old_end, new_addr);
-		if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == PUD_SIZE) {
-			pud_t *old_pud, *new_pud;
 
-			old_pud = get_old_pud(vma->vm_mm, old_addr);
-			if (!old_pud)
+		old_pud = get_old_pud(vma->vm_mm, old_addr);
+		if (!old_pud)
+			continue;
+		new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
+		if (!new_pud)
+			break;
+		if (pud_trans_huge(*old_pud) || pud_devmap(*old_pud)) {
+			if (extent == HPAGE_PUD_SIZE) {
+				move_pgt_entry(HPAGE_PUD, vma, old_addr, new_addr,
					       old_pud, new_pud, need_rmap_locks);
+				/* We ignore and continue on error? */
 				continue;
-			new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
-			if (!new_pud)
-				break;
+			}
+		} else if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == PUD_SIZE) {
+
 			if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr,
 					   old_pud, new_pud, need_rmap_locks))
 				continue;
-- 
2.31.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: [PATCH 3/6] mm/mremap: Convert huge PUD move to separate helper 2021-06-10 8:35 ` [PATCH 3/6] mm/mremap: Convert huge PUD move to separate helper Aneesh Kumar K.V @ 2021-06-10 22:03 ` Hugh Dickins 0 siblings, 0 replies; 15+ messages in thread From: Hugh Dickins @ 2021-06-10 22:03 UTC (permalink / raw) To: Aneesh Kumar K.V Cc: Linus Torvalds, npiggin, linux-mm, kaleshsingh, joel, Kirill A . Shutemov, akpm, linuxppc-dev On Thu, 10 Jun 2021, Aneesh Kumar K.V wrote: > With TRANSPARENT_HUGEPAGE_PUD enabled the kernel can find huge PUD entries. > Add a helper to move huge PUD entries on mremap(). > > This will be used by a later patch to optimize mremap of PUD_SIZE aligned > level 4 PTE mapped address > > This also make sure we support mremap on huge PUD entries even with > CONFIG_HAVE_MOVE_PUD disabled. > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> > --- > mm/mremap.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++----- > 1 file changed, 73 insertions(+), 7 deletions(-) > > diff --git a/mm/mremap.c b/mm/mremap.c > index 47c255b60150..92ab7d24a587 100644 > --- a/mm/mremap.c > +++ b/mm/mremap.c > @@ -324,10 +324,62 @@ static inline bool move_normal_pud(struct vm_area_struct *vma, > } > #endif > > + > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PUD Should that say #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD ? (I'm a PUD-THP-sceptic, but if it's just for DAX then probably okay.) > +static bool move_huge_pud(struct vm_area_struct *vma, unsigned long old_addr, > + unsigned long new_addr, pud_t *old_pud, pud_t *new_pud) > +{ > + spinlock_t *old_ptl, *new_ptl; > + struct mm_struct *mm = vma->vm_mm; > + pud_t pud; > + > + /* > + * The destination pud shouldn't be established, free_pgtables() > + * should have released it. > + */ > + if (WARN_ON_ONCE(!pud_none(*new_pud))) > + return false; > + > + /* > + * We don't have to worry about the ordering of src and dst > + * ptlocks because exclusive mmap_lock prevents deadlock. 
> + */ > + old_ptl = pud_lock(vma->vm_mm, old_pud); > + new_ptl = pud_lockptr(mm, new_pud); > + if (new_ptl != old_ptl) > + spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING); > + > + /* Clear the pud */ > + pud = *old_pud; > + pud_clear(old_pud); > + > + VM_BUG_ON(!pud_none(*new_pud)); > + > + /* Set the new pud */ > + /* mark soft_ditry when we add pud level soft dirty support */ > + set_pud_at(mm, new_addr, new_pud, pud); > + flush_pud_tlb_range(vma, old_addr, old_addr + HPAGE_PUD_SIZE); > + if (new_ptl != old_ptl) > + spin_unlock(new_ptl); > + spin_unlock(old_ptl); > + > + return true; > +} > +#else > +static bool move_huge_pud(struct vm_area_struct *vma, unsigned long old_addr, > + unsigned long new_addr, pud_t *old_pud, pud_t *new_pud) > +{ > + WARN_ON_ONCE(1); > + return false; > + > +} > +#endif > + > enum pgt_entry { > NORMAL_PMD, > HPAGE_PMD, > NORMAL_PUD, > + HPAGE_PUD, > }; > > /* > @@ -347,6 +399,7 @@ static __always_inline unsigned long get_extent(enum pgt_entry entry, > mask = PMD_MASK; > size = PMD_SIZE; > break; > + case HPAGE_PUD: > case NORMAL_PUD: > mask = PUD_MASK; > size = PUD_SIZE; > @@ -395,6 +448,11 @@ static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma, > move_huge_pmd(vma, old_addr, new_addr, old_entry, > new_entry); > break; > + case HPAGE_PUD: > + moved = move_huge_pud(vma, old_addr, new_addr, old_entry, > + new_entry); > + break; > + > default: > WARN_ON_ONCE(1); > break; > @@ -414,6 +472,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma, > unsigned long extent, old_end; > struct mmu_notifier_range range; > pmd_t *old_pmd, *new_pmd; > + pud_t *old_pud, *new_pud; > > old_end = old_addr + len; > flush_cache_range(vma, old_addr, old_end); > @@ -429,15 +488,22 @@ unsigned long move_page_tables(struct vm_area_struct *vma, > * PUD level if possible. 
> */ > extent = get_extent(NORMAL_PUD, old_addr, old_end, new_addr); > - if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == PUD_SIZE) { > - pud_t *old_pud, *new_pud; > > - old_pud = get_old_pud(vma->vm_mm, old_addr); > - if (!old_pud) > + old_pud = get_old_pud(vma->vm_mm, old_addr); > + if (!old_pud) > + continue; > + new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr); > + if (!new_pud) > + break; > + if (pud_trans_huge(*old_pud) || pud_devmap(*old_pud)) { > + if (extent == HPAGE_PUD_SIZE) { > + move_pgt_entry(HPAGE_PUD, vma, old_addr, new_addr, > + old_pud, new_pud, need_rmap_locks); > + /* We ignore and continue on error? */ > continue; > - new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr); > - if (!new_pud) > - break; > + } > + } else if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == PUD_SIZE) { > + > if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr, > old_pud, new_pud, need_rmap_locks)) > continue; > -- > 2.31.1 > > > ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 4/6] mm/mremap: Don't enable optimized PUD move if page table levels is 2
  2021-06-10  8:35 [PATCH 0/6] mremap fixes Aneesh Kumar K.V
  ` (2 preceding siblings ...)
  2021-06-10  8:35 ` [PATCH 3/6] mm/mremap: Convert huge PUD move to separate helper Aneesh Kumar K.V
@ 2021-06-10  8:35 ` Aneesh Kumar K.V
  2021-06-10  8:35 ` [PATCH 5/6] mm/mremap: Use pmd/pud_populate to update page table entries Aneesh Kumar K.V
  2021-06-10  8:35 ` [PATCH 6/6] mm/mremap: hold the rmap lock in write mode when moving " Aneesh Kumar K.V
  5 siblings, 0 replies; 15+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-10  8:35 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Aneesh Kumar K.V, Linus Torvalds, npiggin, kaleshsingh, joel,
	Kirill A . Shutemov, linuxppc-dev

With two-level page tables, don't enable move_normal_pud().

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 mm/mremap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 92ab7d24a587..795a7d628b53 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -276,7 +276,7 @@ static inline bool move_normal_pmd(struct vm_area_struct *vma,
 }
 #endif
 
-#ifdef CONFIG_HAVE_MOVE_PUD
+#if CONFIG_PGTABLE_LEVELS > 2 && defined(CONFIG_HAVE_MOVE_PUD)
 static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
 			    unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
 {
-- 
2.31.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread
* [PATCH 5/6] mm/mremap: Use pmd/pud_populate to update page table entries
  2021-06-10  8:35 [PATCH 0/6] mremap fixes Aneesh Kumar K.V
  ` (3 preceding siblings ...)
  2021-06-10  8:35 ` [PATCH 4/6] mm/mremap: Don't enable optimized PUD move if page table levels is 2 Aneesh Kumar K.V
@ 2021-06-10  8:35 ` Aneesh Kumar K.V
  2021-06-10 18:16   ` Linus Torvalds
  2021-06-10  8:35 ` [PATCH 6/6] mm/mremap: hold the rmap lock in write mode when moving " Aneesh Kumar K.V
  5 siblings, 1 reply; 15+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-10  8:35 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Aneesh Kumar K.V, Linus Torvalds, npiggin, kaleshsingh, joel,
	Kirill A . Shutemov, linuxppc-dev

pmd/pud_populate is the right interface to use to set the respective page
table entries. Some architectures, like ppc64, assume that set_pmd/pud_at
can only be used to set a hugepage PTE. Since we are not setting up a
hugepage PTE here, use the pmd/pud_populate interface.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 mm/mremap.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 795a7d628b53..dacfa9111ab1 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -26,6 +26,7 @@
 
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
+#include <asm/pgalloc.h>
 
 #include "internal.h"
 
@@ -258,8 +259,7 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 
 	VM_BUG_ON(!pmd_none(*new_pmd));
 
-	/* Set the new pmd */
-	set_pmd_at(mm, new_addr, new_pmd, pmd);
+	pmd_populate(mm, new_pmd, pmd_pgtable(pmd));
 	flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
 	if (new_ptl != old_ptl)
 		spin_unlock(new_ptl);
@@ -306,8 +306,7 @@ static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
 
 	VM_BUG_ON(!pud_none(*new_pud));
 
-	/* Set the new pud */
-	set_pud_at(mm, new_addr, new_pud, pud);
+	pud_populate(mm, new_pud, (pmd_t *)pud_page_vaddr(pud));
 	flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);
 	if (new_ptl != old_ptl)
 		spin_unlock(new_ptl);
-- 
2.31.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: [PATCH 5/6] mm/mremap: Use pmd/pud_poplulate to update page table entries 2021-06-10 8:35 ` [PATCH 5/6] mm/mremap: Use pmd/pud_poplulate to update page table entries Aneesh Kumar K.V @ 2021-06-10 18:16 ` Linus Torvalds 2021-06-13 9:06 ` Aneesh Kumar K.V 0 siblings, 1 reply; 15+ messages in thread From: Linus Torvalds @ 2021-06-10 18:16 UTC (permalink / raw) To: Aneesh Kumar K.V Cc: Nick Piggin, Linux-MM, Kalesh Singh, Joel Fernandes, Kirill A . Shutemov, Andrew Morton, linuxppc-dev On Thu, Jun 10, 2021 at 1:36 AM Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote: > > @@ -306,8 +306,7 @@ static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr, > > VM_BUG_ON(!pud_none(*new_pud)); > > - /* Set the new pud */ > - set_pud_at(mm, new_addr, new_pud, pud); > + pud_populate(mm, new_pud, (pmd_t *)pud_page_vaddr(pud)); > flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE); > if (new_ptl != old_ptl) > spin_unlock(new_ptl); That "(pmd_t *)pud_page_vaddr(pud)" looks a bit odd and doesn't match the pattern. Should we perhaps have a wrapper for it? Maybe called "pud_pgtable()", the same way we have pmd_pgtable()? And that wrapper would be good to have a comment or two about what it actually is. The same way that pmd_pgtable() should but doesn't ;( Linus ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 5/6] mm/mremap: Use pmd/pud_poplulate to update page table entries 2021-06-10 18:16 ` Linus Torvalds @ 2021-06-13 9:06 ` Aneesh Kumar K.V 2021-06-13 10:50 ` Matthew Wilcox 2021-06-13 18:53 ` Linus Torvalds 0 siblings, 2 replies; 15+ messages in thread From: Aneesh Kumar K.V @ 2021-06-13 9:06 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Linux-MM, Kalesh Singh, Joel Fernandes, Kirill A . Shutemov, Andrew Morton, linuxppc-dev Linus Torvalds <torvalds@linux-foundation.org> writes: > On Thu, Jun 10, 2021 at 1:36 AM Aneesh Kumar K.V > <aneesh.kumar@linux.ibm.com> wrote: >> >> @@ -306,8 +306,7 @@ static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr, >> >> VM_BUG_ON(!pud_none(*new_pud)); >> >> - /* Set the new pud */ >> - set_pud_at(mm, new_addr, new_pud, pud); >> + pud_populate(mm, new_pud, (pmd_t *)pud_page_vaddr(pud)); >> flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE); >> if (new_ptl != old_ptl) >> spin_unlock(new_ptl); > > That "(pmd_t *)pud_page_vaddr(pud)" looks a bit odd and doesn't match > the pattern. > > Should we perhaps have a wrapper for it? Maybe called "pud_pgtable()", > the same way we have pmd_pgtable()? IIUC the reason why we do have pmd_pgtable() is that pgtable_t type is arch dependent. On some architecture it is pte_t * and on the other struct page *. The reason being highmem and level 4 page table can be located in highmem. The above is not true with the level 3 table and hence we didn't add an extra type to point to level 3 table. We could add pud_pgtable(), but then it will essentially be a typecast as I did above? Even if we want to do that, that should be done as a separate patch than this series? > > And that wrapper would be good to have a comment or two about what it > actually is. The same way that pmd_pgtable() should but doesn't ;( > > Linus ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 5/6] mm/mremap: Use pmd/pud_poplulate to update page table entries 2021-06-13 9:06 ` Aneesh Kumar K.V @ 2021-06-13 10:50 ` Matthew Wilcox 2021-06-13 11:13 ` Aneesh Kumar K.V 2021-06-13 18:53 ` Linus Torvalds 1 sibling, 1 reply; 15+ messages in thread From: Matthew Wilcox @ 2021-06-13 10:50 UTC (permalink / raw) To: Aneesh Kumar K.V Cc: linuxppc-dev, Nick Piggin, Linux-MM, Kalesh Singh, Joel Fernandes, Kirill A . Shutemov, Andrew Morton, Linus Torvalds On Sun, Jun 13, 2021 at 02:36:13PM +0530, Aneesh Kumar K.V wrote: > IIUC the reason why we do have pmd_pgtable() is that pgtable_t type > is arch dependent. On some architecture it is pte_t * and on the other > struct page *. The reason being highmem and level 4 page table can > be located in highmem. That is ahistorical. See 2f569afd9ced9ebec9a6eb3dbf6f83429be0a7b4 -- we have pgtable_t for the benefit of s390's crazy sub-page page table sizes. Also, please stop numbering page tables upside down. PTEs are first level, not fourth. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 5/6] mm/mremap: Use pmd/pud_poplulate to update page table entries 2021-06-13 10:50 ` Matthew Wilcox @ 2021-06-13 11:13 ` Aneesh Kumar K.V 2021-06-14 5:27 ` Christophe Leroy 0 siblings, 1 reply; 15+ messages in thread From: Aneesh Kumar K.V @ 2021-06-13 11:13 UTC (permalink / raw) To: Matthew Wilcox Cc: linuxppc-dev, Nick Piggin, Linux-MM, Kalesh Singh, Joel Fernandes, Kirill A . Shutemov, Andrew Morton, Linus Torvalds On 6/13/21 4:20 PM, Matthew Wilcox wrote: > On Sun, Jun 13, 2021 at 02:36:13PM +0530, Aneesh Kumar K.V wrote: >> IIUC the reason why we do have pmd_pgtable() is that pgtable_t type >> is arch dependent. On some architecture it is pte_t * and on the other >> struct page *. The reason being highmem and level 4 page table can >> be located in highmem. > > That is ahistorical. See 2f569afd9ced9ebec9a6eb3dbf6f83429be0a7b4 -- > we have pgtable_t for the benefit of s390's crazy sub-page page table > sizes. That is also true with ppc64. We do sub-page page table size. I was trying to explain why it can't be pte_t * everywhere and why we have it as struct page *. > > Also, please stop numbering page tables upside down. PTEs are first > level, not fourth. > POWER ISA do name it the other way. I also see some pages explaining levels the other way https://www.bottomupcs.com/virtual_memory_linux.xhtml whereas https://en.wikipedia.org/wiki/Intel_5-level_paging#/media/File:Page_Tables_(5_levels).svg I am pretty sure I had commits that explained page table level as I did in this thread. I will switch to your suggestion in further discussions. May be the best solution is to attribute it with more details like level 1 (pte_t *)? -aneesh ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 5/6] mm/mremap: Use pmd/pud_poplulate to update page table entries 2021-06-13 11:13 ` Aneesh Kumar K.V @ 2021-06-14 5:27 ` Christophe Leroy 0 siblings, 0 replies; 15+ messages in thread From: Christophe Leroy @ 2021-06-14 5:27 UTC (permalink / raw) To: Aneesh Kumar K.V, Matthew Wilcox Cc: Linus Torvalds, Nick Piggin, Linux-MM, Kalesh Singh, Joel Fernandes, Kirill A . Shutemov, Andrew Morton, linuxppc-dev Le 13/06/2021 à 13:13, Aneesh Kumar K.V a écrit : > On 6/13/21 4:20 PM, Matthew Wilcox wrote: >> On Sun, Jun 13, 2021 at 02:36:13PM +0530, Aneesh Kumar K.V wrote: >>> IIUC the reason why we do have pmd_pgtable() is that pgtable_t type >>> is arch dependent. On some architecture it is pte_t * and on the other >>> struct page *. The reason being highmem and level 4 page table can >>> be located in highmem. >> >> That is ahistorical. See 2f569afd9ced9ebec9a6eb3dbf6f83429be0a7b4 -- >> we have pgtable_t for the benefit of s390's crazy sub-page page table >> sizes. > > That is also true with ppc64. We do sub-page page table size. I was trying to explain why it can't > be pte_t * everywhere and why we have > it as struct page *. ppc32 as well. On the 8xx, with 16k size pages, the HW still use 4k page tables, so we do use sub-pages. In order too keep the code simple, we have converted all powerpc to sub-pages for that, allthough some powerpc platforms have only one sub-page per page. Christophe ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 5/6] mm/mremap: Use pmd/pud_poplulate to update page table entries 2021-06-13 9:06 ` Aneesh Kumar K.V 2021-06-13 10:50 ` Matthew Wilcox @ 2021-06-13 18:53 ` Linus Torvalds 1 sibling, 0 replies; 15+ messages in thread From: Linus Torvalds @ 2021-06-13 18:53 UTC (permalink / raw) To: Aneesh Kumar K.V Cc: Nick Piggin, Linux-MM, Kalesh Singh, Joel Fernandes, Kirill A . Shutemov, Andrew Morton, linuxppc-dev On Sun, Jun 13, 2021 at 2:06 AM Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote: > > IIUC the reason why we do have pmd_pgtable() is that pgtable_t type > is arch dependent. On some architecture it is pte_t * and on the other > struct page *. The reason being highmem and level 4 page table can > be located in highmem. Honestly, the same confusion is real - in a different way - about pud_page_vaddr(). I really hate that function. Just grep for the uses, and the definitions, to see what I mean. It's crazy. I'm perfectly happy not having a "pud_pagetable()" function, but that cast on pud_page_vaddr() is indicative of real problems. One solution might be to just say "pud_page_vaddr()" must return a "pmd_t *". I think it's what all the users actually want anyway. Linus ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 6/6] mm/mremap: hold the rmap lock in write mode when moving page table entries.
  2021-06-10  8:35 [PATCH 0/6] mremap fixes Aneesh Kumar K.V
                   ` (4 preceding siblings ...)
  2021-06-10  8:35 ` [PATCH 5/6] mm/mremap: Use pmd/pud_poplulate to update page table entries Aneesh Kumar K.V
@ 2021-06-10  8:35 ` Aneesh Kumar K.V
  2021-06-11  8:11   ` Jann Horn
  5 siblings, 1 reply; 15+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-10  8:35 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Aneesh Kumar K.V, Hugh Dickins, Linus Torvalds, npiggin,
	kaleshsingh, joel, Kirill A . Shutemov, linuxppc-dev,
	Kirill A . Shutemov

To avoid a race between the rmap walk and mremap, mremap does
take_rmap_locks(). The lock is taken to ensure that the rmap walk
doesn't miss a page table entry due to PTE moves via move_pagetables().
The kernel further optimizes this locking: if the rmap walk will find
the newly added vma after the old vma, the rmap lock is not taken. This
is because the rmap walk finds the vmas in the same order, and if we
don't find the page table attached to the older vma we will find it
with the new vma, which is iterated later.

As explained in commit eb66ae030829 ("mremap: properly flush TLB before
releasing the page") mremap is special in that it doesn't take
ownership of the page. The optimized version for PUD/PMD aligned mremap
also doesn't hold the ptl lock. This can result in stale TLB entries as
shown below.

This patch updates the rmap locking requirement in mremap to handle the
race condition explained below with optimized mremap::

	Optimized PMD move

	CPU 1				CPU 2					CPU 3

	mremap(old_addr, new_addr)	page_shrinker/try_to_unmap_one

	mmap_write_lock_killable()

					addr = old_addr
					lock(pte_ptl)
	lock(pmd_ptl)
	pmd = *old_pmd
	pmd_clear(old_pmd)
	flush_tlb_range(old_addr)

	*new_pmd = pmd
									*new_addr = 10; and fills
									TLB with new addr and
									old pfn

	unlock(pmd_ptl)
					ptep_clear_flush()
					old pfn is free.
									Stale TLB entry

Optimized PUD move also suffers from a similar race.
Both of the above race conditions can be fixed if we force the mremap
path to take the rmap lock.

Fixes: 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions")
Fixes: c49dd3401802 ("mm: speedup mremap on 1GB or larger regions")
Link: https://lore.kernel.org/linux-mm/CAHk-=wgXVR04eBNtxQfevontWnP6FDm+oj5vauQXP3S-huwbPw@mail.gmail.com
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 mm/mremap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index dacfa9111ab1..b8eed7645cea 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -504,7 +504,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	} else if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == PUD_SIZE) {
 
 		if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr,
-				   old_pud, new_pud, need_rmap_locks))
+				   old_pud, new_pud, true))
 			continue;
 	}
@@ -531,7 +531,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	 * moving at the PMD level if possible.
 	 */
 	if (move_pgt_entry(NORMAL_PMD, vma, old_addr, new_addr,
-			   old_pmd, new_pmd, need_rmap_locks))
+			   old_pmd, new_pmd, true))
 		continue;
 	}
-- 
2.31.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: [PATCH 6/6] mm/mremap: hold the rmap lock in write mode when moving page table entries.
  2021-06-10  8:35 ` [PATCH 6/6] mm/mremap: hold the rmap lock in write mode when moving " Aneesh Kumar K.V
@ 2021-06-11  8:11   ` Jann Horn
  0 siblings, 0 replies; 15+ messages in thread
From: Jann Horn @ 2021-06-11  8:11 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Linus Torvalds, Hugh Dickins, Nicholas Piggin, Linux-MM,
	Kalesh Singh, Joel Fernandes (Google), Kirill A . Shutemov,
	Andrew Morton, linuxppc-dev, Kirill A . Shutemov

On Thu, Jun 10, 2021 at 10:35 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
> To avoid a race between rmap walk and mremap, mremap does take_rmap_locks().
> The lock was taken to ensure that rmap walk don't miss a page table entry due to
> PTE moves via move_pagetables(). The kernel does further optimization of
> this lock such that if we are going to find the newly added vma after the
> old vma, the rmap lock is not taken. This is because rmap walk would find the
> vmas in the same order and if we don't find the page table attached to
> older vma we would find it with the new vma which we would iterate later.
[...]
> Fixes: 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions")
> Fixes: c49dd3401802 ("mm: speedup mremap on 1GB or larger regions")

probably also "Cc: stable@vger.kernel.org"?

^ permalink raw reply	[flat|nested] 15+ messages in thread
end of thread, other threads:[~2021-06-14  5:27 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-10  8:35 [PATCH 0/6] mremap fixes Aneesh Kumar K.V
2021-06-10  8:35 ` [PATCH 1/6] selftest/mremap_test: Update the test to handle pagesize other than 4K Aneesh Kumar K.V
2021-06-10  8:35 ` [PATCH 2/6] selftest/mremap_test: Avoid crash with static build Aneesh Kumar K.V
2021-06-10  8:35 ` [PATCH 3/6] mm/mremap: Convert huge PUD move to separate helper Aneesh Kumar K.V
2021-06-10 22:03   ` Hugh Dickins
2021-06-10  8:35 ` [PATCH 4/6] mm/mremap: Don't enable optimized PUD move if page table levels is 2 Aneesh Kumar K.V
2021-06-10  8:35 ` [PATCH 5/6] mm/mremap: Use pmd/pud_poplulate to update page table entries Aneesh Kumar K.V
2021-06-10 18:16   ` Linus Torvalds
2021-06-13  9:06     ` Aneesh Kumar K.V
2021-06-13 10:50       ` Matthew Wilcox
2021-06-13 11:13         ` Aneesh Kumar K.V
2021-06-14  5:27           ` Christophe Leroy
2021-06-13 18:53       ` Linus Torvalds
2021-06-10  8:35 ` [PATCH 6/6] mm/mremap: hold the rmap lock in write mode when moving " Aneesh Kumar K.V
2021-06-11  8:11   ` Jann Horn