From: Charan Teja Kalla <quic_charante@quicinc.com>
To: "Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>, <linux-mm@kvack.org>
Subject: Re: [PATCH] mm: Migrate high-order folios in swap cache correctly
Date: Tue, 6 Feb 2024 20:24:30 +0530	[thread overview]
Message-ID: <69cb784f-578d-ded1-cd9f-c6db04696336@quicinc.com> (raw)
In-Reply-To: <20231214045841.961776-1-willy@infradead.org>

Hi Matthew,

It seems the issue is not completely fixed by the change below. This
time it is because ->private is not filled in for the subpages of a THP,
which can result in the same type of issue this patch tried to fix:

1) For reclaim of an anon THP, the page is first added to the swap
cache. As part of this, contiguous swap space of THP size is allocated,
folio_nr_pages() swap cache indices are filled with the folio, and each
subpage's ->private is set to its respective swap entry:
add_to_swap_cache():
	for (i = 0; i < nr; i++) {
		set_page_private(folio_page(folio, i), entry.val + i);
		xas_store(&xas, folio);
		xas_next(&xas);
	}

2) Migrating a folio that is sitting in the swap cache sets only the
head page's newfolio->private to the swap entry; the tail pages are missed:
folio_migrate_mapping():
	newfolio->private = folio_get_private(folio);
	for (i = 0; i < entries; i++) {
		xas_store(&xas, newfolio);
		xas_next(&xas);
	}

3) When migration of this page, still sitting in the swap cache, is
attempted again, the THP can be split (see
migrate_pages()->try_split_thp), which ends up storing the subpages in
the respective swap cache indices, but the subpages' ->private does not
contain a valid swap entry:
__split_huge_page():
	for (i = nr - 1; i >= 1; i--) {
		if (...) {
		...
		} else if (swap_cache) {
			__xa_store(&swap_cache->i_pages, offset + i,
					head + i, 0);
		}
	}

This leads to a state where all the subpages are sitting in the swap
cache with ->private not holding a valid value. When
delete_from_swap_cache() is tried on a subpage, it replaces the __wrong
swap cache index__ with a NULL/shadow value and subsequently decreases
the refcount of the subpage:
	for (i = 0; i < nr; i++) {
		if (subpage == page)
			continue;
			
		free_page_and_swap_cache(subpage):
			free_swap_cache(page); /* calls delete_from_swap_cache() */
			put_page(page);
	}

Consider a folio that is just sitting in the swap cache: its subpages'
refcounts drop from 3 (isolation + swap cache + lru_add_page_tail()) to
1 (decremented by the swap cache deletion + put_page). When
migrate_pages() is tried again on these split THP pages, the refcount of
1 makes the pages get freed directly (unmap_and_move).

So this leads to a "freed page entry in the swap cache" state, which can
cause various corruptions, infinite loops under the RCU read lock
(observed in mapping_get_entry()), etc.

It seems the change below is also required here.

diff --git a/mm/migrate.c b/mm/migrate.c
index 9f5f52d..8049f4e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -427,10 +427,8 @@ int folio_migrate_mapping(struct address_space *mapping,
        folio_ref_add(newfolio, nr); /* add cache reference */
        if (folio_test_swapbacked(folio)) {
                __folio_set_swapbacked(newfolio);
-               if (folio_test_swapcache(folio)) {
+               if (folio_test_swapcache(folio))
                        folio_set_swapcache(newfolio);
-                       newfolio->private = folio_get_private(folio);
-               }
                entries = nr;
        } else {
                VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
@@ -446,6 +444,8 @@ int folio_migrate_mapping(struct address_space *mapping,

        /* Swap cache still stores N entries instead of a high-order entry */
        for (i = 0; i < entries; i++) {
+               set_page_private(folio_page(newfolio, i),
+                               folio_page(folio, i)->private);
                xas_store(&xas, newfolio);
                xas_next(&xas);
        }



On 12/14/2023 10:28 AM, Matthew Wilcox (Oracle) wrote:
> From: Charan Teja Kalla <quic_charante@quicinc.com>
> 
> Large folios occupy N consecutive entries in the swap cache
> instead of using multi-index entries like the page cache.
> However, if a large folio is re-added to the LRU list, it can
> be migrated.  The migration code was not aware of the difference
> between the swap cache and the page cache and assumed that a single
> xas_store() would be sufficient.
> 
> This leaves potentially many stale pointers to the now-migrated folio
> in the swap cache, which can lead to almost arbitrary data corruption
> in the future.  This can also manifest as infinite loops with the
> RCU read lock held.
> 
> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
> [modifications to the changelog & tweaked the fix]
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/migrate.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index d9d2b9432e81..2d67ca47d2e2 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -405,6 +405,7 @@ int folio_migrate_mapping(struct address_space *mapping,
>  	int dirty;
>  	int expected_count = folio_expected_refs(mapping, folio) + extra_count;
>  	long nr = folio_nr_pages(folio);
> +	long entries, i;
>  
>  	if (!mapping) {
>  		/* Anonymous page without mapping */
> @@ -442,8 +443,10 @@ int folio_migrate_mapping(struct address_space *mapping,
>  			folio_set_swapcache(newfolio);
>  			newfolio->private = folio_get_private(folio);
>  		}
> +		entries = nr;
>  	} else {
>  		VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
> +		entries = 1;
>  	}
>  
>  	/* Move dirty while page refs frozen and newpage not yet exposed */
> @@ -453,7 +456,11 @@ int folio_migrate_mapping(struct address_space *mapping,
>  		folio_set_dirty(newfolio);
>  	}
>  
> -	xas_store(&xas, newfolio);
> +	/* Swap cache still stores N entries instead of a high-order entry */
> +	for (i = 0; i < entries; i++) {
> +		xas_store(&xas, newfolio);
> +		xas_next(&xas);
> +	}
>  
>  	/*
>  	 * Drop cache reference from old page by unfreezing


Thread overview: 4+ messages
2023-12-14  4:58 [PATCH] mm: Migrate high-order folios in swap cache correctly Matthew Wilcox (Oracle)
2023-12-14 22:11 ` Andrew Morton
2024-02-06 14:54 ` Charan Teja Kalla [this message]
2024-02-07 14:38   ` Charan Teja Kalla
