* [PATCH RFC 0/2] mm: Rework zap ptes on swap entries
@ 2021-11-10  8:29 Peter Xu
  2021-11-10  8:29 ` [PATCH RFC 1/2] mm: Don't skip swap entry even if zap_details specified Peter Xu
  2021-11-10  8:29 ` [PATCH RFC 2/2] mm: Rework swap handling of zap_pte_range Peter Xu
  0 siblings, 2 replies; 5+ messages in thread
From: Peter Xu @ 2021-11-10  8:29 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: peterx, Andrew Morton, Yang Shi, Hugh Dickins, David Hildenbrand,
	Alistair Popple, Andrea Arcangeli, Vlastimil Babka,
	Kirill A . Shutemov

The goal of this small series is to replace the previous patch (the 5th patch
of the series linked below):

https://lore.kernel.org/linux-mm/20210908163628.215052-1-peterx@redhat.com/

This version takes a more aggressive (but IMHO cleaner and more correct) approach:
remove the trick of skipping swap entries, and always handle swap entries instead.

The behavior of "skipping swap entries" has existed since the initial git commit:
when zapping ptes, we skip swap entries whenever a zap_details pointer is
specified.

I believe it is probably broken by the introduction of the page migration
mechanism, which did not yet exist in the world of the 1st git commit: we can
erroneously skip scanning the swap entries of file-backed memory, like shmem,
when we should not.  I'm afraid the RSS accounting can go wrong for such shmem
pages during migration, so there could be leftover shmem RSS counts.
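
To make it concrete, here is a minimal sketch (condensed from the zap_pte_range()
hunk in patch 1 below, not verbatim kernel code) of the pre-patch swap entry path;
the early "continue" is what can skip the RSS adjustment for a shmem page whose
pte has been replaced by a migration entry:

	entry = pte_to_swp_entry(ptent);
	/* ... device private/exclusive entries are handled before this point ... */

	/* If details->check_mapping, we leave swap entries. */
	if (unlikely(details))
		continue;		/* <-- skips the accounting below */

	if (!non_swap_entry(entry))
		rss[MM_SWAPENTS]--;
	else if (is_migration_entry(entry)) {
		struct page *page = pfn_swap_entry_to_page(entry);

		rss[mm_counter(page)]--;	/* MM_SHMEMPAGES for shmem */
	}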

Patch 1 removes that trick; details are in the commit message.

Patch 2 is a further cleanup of the zap pte swap handling that becomes possible
after patch 1; no functional change is intended there.

The change only affects the slow path of zapping swap entries (e.g., none/present
ptes are still handled in the early code path, so they are not affected at all).
Still, if anyone worries about a specific workload that may be affected by this
patchset, please let me know and I'll be happy to run more tests.  I could also
have overlooked something buried in history; in that case please kindly point it
out.  Hence marking the patchset RFC.

Smoke tested only.  Please review, thanks.

Peter Xu (2):
  mm: Don't skip swap entry even if zap_details specified
  mm: Rework swap handling of zap_pte_range

 mm/memory.c | 31 ++++++++++---------------------
 1 file changed, 10 insertions(+), 21 deletions(-)

-- 
2.32.0




* [PATCH RFC 1/2] mm: Don't skip swap entry even if zap_details specified
  2021-11-10  8:29 [PATCH RFC 0/2] mm: Rework zap ptes on swap entries Peter Xu
@ 2021-11-10  8:29 ` Peter Xu
  2021-11-10  8:29 ` [PATCH RFC 2/2] mm: Rework swap handling of zap_pte_range Peter Xu
  1 sibling, 0 replies; 5+ messages in thread
From: Peter Xu @ 2021-11-10  8:29 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: peterx, Andrew Morton, Yang Shi, Hugh Dickins, David Hildenbrand,
	Alistair Popple, Andrea Arcangeli, Vlastimil Babka,
	Kirill A . Shutemov

This check has existed since the 1st git commit of the Linux repository.  At
that time there was no page migration yet, so it was probably fine.

With page migration enabled, it should logically be possible that we zap some
shmem pages during migration.  When that happens, IIUC the old code can get the
MM_SHMEMPAGES RSS counter wrong, because we zap the ptes without decreasing the
counters for the migrating entries.  I have no unit test to prove it, though, as
I don't know an easy way to trigger this condition.
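
For what it's worth, one crude way to look for the symptom (only an assumption on
how it could be observed, not a proper test) would be to watch the per-process
shmem RSS before and after unmapping a shmem region while migration races with
the zap, e.g.:

	# hypothetical check: <pid> is the process mapping the shmem region
	grep RssShmem /proc/<pid>/status

If the skip path is taken for a migration entry of a shmem page, RssShmem could
stay inflated after the mapping is gone.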

Besides, the optimization itself is IMHO already confusing in a few ways:

  - The wording "skip swap entries" is confusing, because we're not skipping all
    swap entries - we handle device private/exclusive pages before that.

  - The skip behavior is enabled as long as a zap_details pointer is passed in.
    That is very hard to figure out for a new zap caller, because it's unclear
    why we should skip swap entries whenever zap_details is specified.

  - On modern systems, especially for performance-critical use cases, swap
    entries should be rare, so I doubt the usefulness of this optimization
    since it is on a slow path anyway.

  - It is not aligned with what we do with huge pmd swap entries, where in
    zap_huge_pmd() we'll do the accounting unconditionally (a rough sketch
    follows right below this list).
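
For comparison with the last point, this is a rough paraphrase (heavily trimmed,
not verbatim) of how zap_huge_pmd() handles a pmd migration entry of a file/shmem
huge page; note that the counter update there does not depend on zap_details at
all:

	/* zap_huge_pmd(), paraphrased: pmd not present, maybe a migration entry */
	if (is_pmd_migration_entry(orig_pmd)) {
		entry = pmd_to_swp_entry(orig_pmd);
		page = pfn_swap_entry_to_page(entry);
	}
	/* ... */
	/* done whether or not a zap_details pointer was passed in */
	add_mm_counter(tlb->mm, mm_counter_file(page), -HPAGE_PMD_NR);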

This patch drops that trick, so that swap ptes are handled coherently.  Meanwhile,
apply the same mapping check to migration entries too.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8f1de811a1dc..e454f3c6aeb9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1382,16 +1382,14 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			continue;
 		}
 
-		/* If details->check_mapping, we leave swap entries. */
-		if (unlikely(details))
-			continue;
-
 		if (!non_swap_entry(entry))
 			rss[MM_SWAPENTS]--;
 		else if (is_migration_entry(entry)) {
 			struct page *page;
 
 			page = pfn_swap_entry_to_page(entry);
+			if (unlikely(zap_skip_check_mapping(details, page)))
+				continue;
 			rss[mm_counter(page)]--;
 		}
 		if (unlikely(!free_swap_and_cache(entry)))
-- 
2.32.0




* [PATCH RFC 2/2] mm: Rework swap handling of zap_pte_range
  2021-11-10  8:29 [PATCH RFC 0/2] mm: Rework zap ptes on swap entries Peter Xu
  2021-11-10  8:29 ` [PATCH RFC 1/2] mm: Don't skip swap entry even if zap_details specified Peter Xu
@ 2021-11-10  8:29 ` Peter Xu
  2021-11-15 11:21   ` Alistair Popple
  1 sibling, 1 reply; 5+ messages in thread
From: Peter Xu @ 2021-11-10  8:29 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: peterx, Andrew Morton, Yang Shi, Hugh Dickins, David Hildenbrand,
	Alistair Popple, Andrea Arcangeli, Vlastimil Babka,
	Kirill A . Shutemov

Clean the code up by merging the device private/exclusive swap entry handling
with the rest of the swap entry handling, and merge the pte clear operation too.

The struct page pointer is declared in multiple places in the function; move
the declaration upward.

free_swap_and_cache() is only useful for the !non_swap_entry() case, so move it
into that condition.

No functional change intended.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/memory.c | 25 ++++++++-----------------
 1 file changed, 8 insertions(+), 17 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index e454f3c6aeb9..e5d59a6b6479 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1326,6 +1326,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	arch_enter_lazy_mmu_mode();
 	do {
 		pte_t ptent = *pte;
+		struct page *page;
+
 		if (pte_none(ptent))
 			continue;
 
@@ -1333,8 +1335,6 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			break;
 
 		if (pte_present(ptent)) {
-			struct page *page;
-
 			page = vm_normal_page(vma, addr, ptent);
 			if (unlikely(zap_skip_check_mapping(details, page)))
 				continue;
@@ -1368,32 +1368,23 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		entry = pte_to_swp_entry(ptent);
 		if (is_device_private_entry(entry) ||
 		    is_device_exclusive_entry(entry)) {
-			struct page *page = pfn_swap_entry_to_page(entry);
-
+			page = pfn_swap_entry_to_page(entry);
 			if (unlikely(zap_skip_check_mapping(details, page)))
 				continue;
-			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 			rss[mm_counter(page)]--;
-
 			if (is_device_private_entry(entry))
 				page_remove_rmap(page, false);
-
 			put_page(page);
-			continue;
-		}
-
-		if (!non_swap_entry(entry))
-			rss[MM_SWAPENTS]--;
-		else if (is_migration_entry(entry)) {
-			struct page *page;
-
+		} else if (is_migration_entry(entry)) {
 			page = pfn_swap_entry_to_page(entry);
 			if (unlikely(zap_skip_check_mapping(details, page)))
 				continue;
 			rss[mm_counter(page)]--;
+		} else if (!non_swap_entry(entry)) {
+			rss[MM_SWAPENTS]--;
+			if (unlikely(!free_swap_and_cache(entry)))
+				print_bad_pte(vma, addr, ptent, NULL);
 		}
-		if (unlikely(!free_swap_and_cache(entry)))
-			print_bad_pte(vma, addr, ptent, NULL);
 		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 
-- 
2.32.0




* Re: [PATCH RFC 2/2] mm: Rework swap handling of zap_pte_range
  2021-11-10  8:29 ` [PATCH RFC 2/2] mm: Rework swap handling of zap_pte_range Peter Xu
@ 2021-11-15 11:21   ` Alistair Popple
  2021-11-15 11:37     ` Peter Xu
  0 siblings, 1 reply; 5+ messages in thread
From: Alistair Popple @ 2021-11-15 11:21 UTC (permalink / raw)
  To: linux-kernel, linux-mm, Peter Xu
  Cc: peterx, Andrew Morton, Yang Shi, Hugh Dickins, David Hildenbrand,
	Andrea Arcangeli, Vlastimil Babka, Kirill A . Shutemov

Hi Peter,

I was having trouble applying this cleanly to any of my local trees, so I was
wondering which sha1 I should be applying it on top of.  Thanks.

 - Alistair

On Wednesday, 10 November 2021 7:29:52 PM AEDT Peter Xu wrote:
> Clean the code up by merging the device private/exclusive swap entry handling
> with the rest of the swap entry handling, and merge the pte clear operation too.
> 
> The struct page pointer is declared in multiple places in the function; move
> the declaration upward.
> 
> free_swap_and_cache() is only useful for the !non_swap_entry() case, so move it
> into that condition.
> 
> No functional change intended.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/memory.c | 25 ++++++++-----------------
>  1 file changed, 8 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index e454f3c6aeb9..e5d59a6b6479 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1326,6 +1326,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  	arch_enter_lazy_mmu_mode();
>  	do {
>  		pte_t ptent = *pte;
> +		struct page *page;
> +
>  		if (pte_none(ptent))
>  			continue;
>  
> @@ -1333,8 +1335,6 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  			break;
>  
>  		if (pte_present(ptent)) {
> -			struct page *page;
> -
>  			page = vm_normal_page(vma, addr, ptent);
>  			if (unlikely(zap_skip_check_mapping(details, page)))
>  				continue;
> @@ -1368,32 +1368,23 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  		entry = pte_to_swp_entry(ptent);
>  		if (is_device_private_entry(entry) ||
>  		    is_device_exclusive_entry(entry)) {
> -			struct page *page = pfn_swap_entry_to_page(entry);
> -
> +			page = pfn_swap_entry_to_page(entry);
>  			if (unlikely(zap_skip_check_mapping(details, page)))
>  				continue;
> -			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>  			rss[mm_counter(page)]--;
> -
>  			if (is_device_private_entry(entry))
>  				page_remove_rmap(page, false);
> -
>  			put_page(page);
> -			continue;
> -		}
> -
> -		if (!non_swap_entry(entry))
> -			rss[MM_SWAPENTS]--;
> -		else if (is_migration_entry(entry)) {
> -			struct page *page;
> -
> +		} else if (is_migration_entry(entry)) {
>  			page = pfn_swap_entry_to_page(entry);
>  			if (unlikely(zap_skip_check_mapping(details, page)))
>  				continue;
>  			rss[mm_counter(page)]--;
> +		} else if (!non_swap_entry(entry)) {
> +			rss[MM_SWAPENTS]--;
> +			if (unlikely(!free_swap_and_cache(entry)))
> +				print_bad_pte(vma, addr, ptent, NULL);
>  		}
> -		if (unlikely(!free_swap_and_cache(entry)))
> -			print_bad_pte(vma, addr, ptent, NULL);
>  		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>  	} while (pte++, addr += PAGE_SIZE, addr != end);
>  
> 







* Re: [PATCH RFC 2/2] mm: Rework swap handling of zap_pte_range
  2021-11-15 11:21   ` Alistair Popple
@ 2021-11-15 11:37     ` Peter Xu
  0 siblings, 0 replies; 5+ messages in thread
From: Peter Xu @ 2021-11-15 11:37 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-kernel, linux-mm, Andrew Morton, Yang Shi, Hugh Dickins,
	David Hildenbrand, Andrea Arcangeli, Vlastimil Babka,
	Kirill A . Shutemov

On Mon, Nov 15, 2021 at 10:21:18PM +1100, Alistair Popple wrote:
> Hi Peter,

Hi, Alistair,

> 
> I was having trouble applying this cleanly to any of my local trees, so I was
> wondering which sha1 I should be applying it on top of.  Thanks.

Thanks for considering trying it out.  I thought it would apply easily onto any
of the recent branches as long as -mm's rc1 is applied, and I just applied it to
Linus's 5.16-rc1 in my uffd-wp rebase:

https://github.com/xzpeter/linux/commits/uffd-wp-shmem-hugetlbfs

This commit is here:

https://github.com/xzpeter/linux/commit/c32043436282bb352e6fe10eb5fa693340fe5281

It could be that "git rebase" is normally smarter, so I didn't notice that it
does not apply directly.

I'll repost a new version soon; before that, please also consider fetching
directly from the git tree.
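
For example (the branch name here is just taken from the URL above, so this is
only a sketch of one way to do it):

	git fetch https://github.com/xzpeter/linux uffd-wp-shmem-hugetlbfs
	git checkout FETCH_HEAD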

Thanks,

-- 
Peter Xu


