* [PATCH RFC 0/2] mm: Rework zap ptes on swap entries
@ 2021-11-10  8:29 Peter Xu
  2021-11-10  8:29 ` [PATCH RFC 1/2] mm: Don't skip swap entry even if zap_details specified Peter Xu
  2021-11-10  8:29 ` [PATCH RFC 2/2] mm: Rework swap handling of zap_pte_range Peter Xu

From: Peter Xu @ 2021-11-10  8:29 UTC
To: linux-kernel, linux-mm
Cc: peterx, Andrew Morton, Yang Shi, Hugh Dickins, David Hildenbrand, Alistair Popple, Andrea Arcangeli, Vlastimil Babka, Kirill A. Shutemov

The goal of this small series is to replace the previous patch (the 5th patch of that series):

https://lore.kernel.org/linux-mm/20210908163628.215052-1-peterx@redhat.com/

This series takes a more aggressive (but IMHO cleaner and more correct) approach: instead of keeping the trick that skips swap entries, remove it, so swap entries are always handled.

The "skip swap entries" behavior has existed since the initial git commit: when zapping ptes, swap entries are skipped whenever a zap_details pointer is specified. I believe this is now broken because of the later introduction of page migration, which did not exist in the world of the first git commit: we can erroneously skip scanning the swap entries of file-backed memory, like shmem, when we should not. I'm afraid the RSS accounting will be wrong for those shmem pages during migration, so there could be leftover SHMEM RSS counts.

Patch 1 removes that trick; details are in its commit message. Patch 2 is a further cleanup of the zap pte swap handling that becomes possible after patch 1; no functional change is intended there.

The change only affects the slow path for zapping swap entries (none/present ptes are always handled in the early code path, so they are not affected at all). Still, if anyone worries about a specific workload that this patchset may affect, please let me know and I'll be happy to run more tests.
I could also have overlooked something buried in history; in that case, please kindly point it out. I'm marking the patchset RFC for that reason. Smoke tested only.

Please review, thanks.

Peter Xu (2):
  mm: Don't skip swap entry even if zap_details specified
  mm: Rework swap handling of zap_pte_range

 mm/memory.c | 31 ++++++++++---------------------
 1 file changed, 10 insertions(+), 21 deletions(-)

-- 
2.32.0
* [PATCH RFC 1/2] mm: Don't skip swap entry even if zap_details specified
  2021-11-10  8:29 ` Peter Xu

From: Peter Xu @ 2021-11-10  8:29 UTC
To: linux-kernel, linux-mm
Cc: peterx, Andrew Morton, Yang Shi, Hugh Dickins, David Hildenbrand, Alistair Popple, Andrea Arcangeli, Vlastimil Babka, Kirill A. Shutemov

This check has existed since the first git commit of the Linux repository, but at that time there was no page migration yet, so I think it was fine. With page migration enabled, it should logically be possible that we zap some shmem pages during migration. When that happens, IIUC the old code can get the MM_SHMEMPAGES RSS counter wrong, because we zap the ptes without decreasing the counters for the migrating entries. I have no unit test to prove it, though, as I don't know an easy way to trigger this condition.

Besides, the optimization itself is already confusing to me in a few ways:

  - The wording "skip swap entries" is confusing, because we are not skipping
    all swap entries: device private/exclusive pages are handled before that
    point.

  - The skip behavior is enabled whenever any zap_details pointer is passed
    in. That is very hard to figure out for a new zap caller, because it is
    unclear why swap entries should be skipped just because zap_details was
    specified.

  - On modern systems, especially in performance-critical use cases, swap
    entries should be rare, so I doubt the usefulness of this optimization;
    it is on a slow path anyway.

  - It is not aligned with what we do for huge pmd swap entries, where
    zap_huge_pmd() does the accounting unconditionally.

This patch drops that trick, so we handle swap ptes coherently. Meanwhile, we apply the same mapping check to migration entries too.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8f1de811a1dc..e454f3c6aeb9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1382,16 +1382,14 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			continue;
 		}
 
-		/* If details->check_mapping, we leave swap entries. */
-		if (unlikely(details))
-			continue;
-
 		if (!non_swap_entry(entry))
 			rss[MM_SWAPENTS]--;
 		else if (is_migration_entry(entry)) {
 			struct page *page;
 
 			page = pfn_swap_entry_to_page(entry);
+			if (unlikely(zap_skip_check_mapping(details, page)))
+				continue;
 			rss[mm_counter(page)]--;
 		}
 		if (unlikely(!free_swap_and_cache(entry)))
-- 
2.32.0
* [PATCH RFC 2/2] mm: Rework swap handling of zap_pte_range
  2021-11-10  8:29 ` Peter Xu
  2021-11-15 11:21   ` Alistair Popple

From: Peter Xu @ 2021-11-10  8:29 UTC
To: linux-kernel, linux-mm
Cc: peterx, Andrew Morton, Yang Shi, Hugh Dickins, David Hildenbrand, Alistair Popple, Andrea Arcangeli, Vlastimil Babka, Kirill A. Shutemov

Clean the code up by merging the device private/exclusive swap entry handling with the rest, then merge the pte clear operation too.

"struct page *page" is declared in multiple places in the function; move it upward.

free_swap_and_cache() is only useful for the !non_swap_entry() case; move it into that condition.

No functional change intended.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c | 25 ++++++++----------------
 1 file changed, 8 insertions(+), 17 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index e454f3c6aeb9..e5d59a6b6479 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1326,6 +1326,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	arch_enter_lazy_mmu_mode();
 	do {
 		pte_t ptent = *pte;
+		struct page *page;
+
 		if (pte_none(ptent))
 			continue;
 
@@ -1333,8 +1335,6 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			break;
 
 		if (pte_present(ptent)) {
-			struct page *page;
-
 			page = vm_normal_page(vma, addr, ptent);
 			if (unlikely(zap_skip_check_mapping(details, page)))
 				continue;
@@ -1368,32 +1368,23 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		entry = pte_to_swp_entry(ptent);
 		if (is_device_private_entry(entry) ||
 		    is_device_exclusive_entry(entry)) {
-			struct page *page = pfn_swap_entry_to_page(entry);
-
+			page = pfn_swap_entry_to_page(entry);
 			if (unlikely(zap_skip_check_mapping(details, page)))
 				continue;
-			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 			rss[mm_counter(page)]--;
-
 			if (is_device_private_entry(entry))
 				page_remove_rmap(page, false);
-
 			put_page(page);
-			continue;
-		}
-
-		if (!non_swap_entry(entry))
-			rss[MM_SWAPENTS]--;
-		else if (is_migration_entry(entry)) {
-			struct page *page;
-
+		} else if (is_migration_entry(entry)) {
 			page = pfn_swap_entry_to_page(entry);
 			if (unlikely(zap_skip_check_mapping(details, page)))
 				continue;
 			rss[mm_counter(page)]--;
+		} else if (!non_swap_entry(entry)) {
+			rss[MM_SWAPENTS]--;
+			if (unlikely(!free_swap_and_cache(entry)))
+				print_bad_pte(vma, addr, ptent, NULL);
 		}
-		if (unlikely(!free_swap_and_cache(entry)))
-			print_bad_pte(vma, addr, ptent, NULL);
 		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 	} while (pte++, addr += PAGE_SIZE, addr != end);
-- 
2.32.0
* Re: [PATCH RFC 2/2] mm: Rework swap handling of zap_pte_range
  2021-11-15 11:21 ` Alistair Popple
  2021-11-15 11:37   ` Peter Xu

From: Alistair Popple @ 2021-11-15 11:21 UTC
To: linux-kernel, linux-mm, Peter Xu
Cc: peterx, Andrew Morton, Yang Shi, Hugh Dickins, David Hildenbrand, Andrea Arcangeli, Vlastimil Babka, Kirill A. Shutemov

Hi Peter,

I was having trouble applying this cleanly to any of my local trees, so I was wondering which sha1 I should be applying it on top of? Thanks.

 - Alistair

On Wednesday, 10 November 2021 7:29:52 PM AEDT Peter Xu wrote:
> Clean the code up by merging the device private/exclusive swap entry handling
> with the rest, then we merge the pte clear operation too.
>
> struct* page is defined in multiple places in the function, move it upward.
>
> free_swap_and_cache() is only useful for !non_swap_entry() case, put it into
> the condition.
>
> No functional change intended.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/memory.c | 25 ++++++++----------------
>  1 file changed, 8 insertions(+), 17 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index e454f3c6aeb9..e5d59a6b6479 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1326,6 +1326,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  	arch_enter_lazy_mmu_mode();
>  	do {
>  		pte_t ptent = *pte;
> +		struct page *page;
> +
>  		if (pte_none(ptent))
>  			continue;
> 
> @@ -1333,8 +1335,6 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  			break;
> 
>  		if (pte_present(ptent)) {
> -			struct page *page;
> -
>  			page = vm_normal_page(vma, addr, ptent);
>  			if (unlikely(zap_skip_check_mapping(details, page)))
>  				continue;
> @@ -1368,32 +1368,23 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  		entry = pte_to_swp_entry(ptent);
>  		if (is_device_private_entry(entry) ||
>  		    is_device_exclusive_entry(entry)) {
> -			struct page *page = pfn_swap_entry_to_page(entry);
> -
> +			page = pfn_swap_entry_to_page(entry);
>  			if (unlikely(zap_skip_check_mapping(details, page)))
>  				continue;
> -			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>  			rss[mm_counter(page)]--;
> -
>  			if (is_device_private_entry(entry))
>  				page_remove_rmap(page, false);
> -
>  			put_page(page);
> -			continue;
> -		}
> -
> -		if (!non_swap_entry(entry))
> -			rss[MM_SWAPENTS]--;
> -		else if (is_migration_entry(entry)) {
> -			struct page *page;
> -
> +		} else if (is_migration_entry(entry)) {
>  			page = pfn_swap_entry_to_page(entry);
>  			if (unlikely(zap_skip_check_mapping(details, page)))
>  				continue;
>  			rss[mm_counter(page)]--;
> +		} else if (!non_swap_entry(entry)) {
> +			rss[MM_SWAPENTS]--;
> +			if (unlikely(!free_swap_and_cache(entry)))
> +				print_bad_pte(vma, addr, ptent, NULL);
>  		}
> -		if (unlikely(!free_swap_and_cache(entry)))
> -			print_bad_pte(vma, addr, ptent, NULL);
>  		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>  	} while (pte++, addr += PAGE_SIZE, addr != end);
* Re: [PATCH RFC 2/2] mm: Rework swap handling of zap_pte_range
  2021-11-15 11:37 ` Peter Xu

From: Peter Xu @ 2021-11-15 11:37 UTC
To: Alistair Popple
Cc: linux-kernel, linux-mm, Andrew Morton, Yang Shi, Hugh Dickins, David Hildenbrand, Andrea Arcangeli, Vlastimil Babka, Kirill A. Shutemov

On Mon, Nov 15, 2021 at 10:21:18PM +1100, Alistair Popple wrote:
> Hi Peter,

Hi, Alistair,

> I was having trouble applying this cleanly to any of my local trees so was
> wondering which sha1 should I be applying this on top of? Thanks.

Thanks for considering trying it out. I thought it would apply easily onto any recent branch as long as -mm's rc1 material was applied, and I just applied it onto Linus's 5.16-rc1 in my uffd-wp rebase:

https://github.com/xzpeter/linux/commits/uffd-wp-shmem-hugetlbfs

This commit is here:

https://github.com/xzpeter/linux/commit/c32043436282bb352e6fe10eb5fa693340fe5281

It could be that "git rebase" is normally smarter, so I didn't notice it doesn't apply directly. I'll repost a new version soon; please also consider fetching directly from the git tree before I do so.

Thanks,

-- 
Peter Xu
end of thread, other threads: [~2021-11-15 11:37 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-10  8:29 [PATCH RFC 0/2] mm: Rework zap ptes on swap entries Peter Xu
2021-11-10  8:29 ` [PATCH RFC 1/2] mm: Don't skip swap entry even if zap_details specified Peter Xu
2021-11-10  8:29 ` [PATCH RFC 2/2] mm: Rework swap handling of zap_pte_range Peter Xu
2021-11-15 11:21   ` Alistair Popple
2021-11-15 11:37     ` Peter Xu