* [PATCH] mm: Migrate high-order folios in swap cache correctly
@ 2023-12-14  4:58 Matthew Wilcox (Oracle)
  2023-12-14 22:11 ` Andrew Morton
  2024-02-06 14:54 ` Charan Teja Kalla
  0 siblings, 2 replies; 4+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-12-14  4:58 UTC (permalink / raw)
  To: Andrew Morton, linux-mm; +Cc: Charan Teja Kalla, Matthew Wilcox

From: Charan Teja Kalla <quic_charante@quicinc.com>

Large folios occupy N consecutive entries in the swap cache instead of
using multi-index entries like the page cache.  However, if a large
folio is re-added to the LRU list, it can be migrated.  The migration
code was not aware of the difference between the swap cache and the
page cache and assumed that a single xas_store() would be sufficient.

This leaves potentially many stale pointers to the now-migrated folio
in the swap cache, which can lead to almost arbitrary data corruption
in the future.  This can also manifest as infinite loops with the RCU
read lock held.

Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
[modifications to the changelog & tweaked the fix]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/migrate.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index d9d2b9432e81..2d67ca47d2e2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -405,6 +405,7 @@ int folio_migrate_mapping(struct address_space *mapping,
 	int dirty;
 	int expected_count = folio_expected_refs(mapping, folio) + extra_count;
 	long nr = folio_nr_pages(folio);
+	long entries, i;
 
 	if (!mapping) {
 		/* Anonymous page without mapping */
@@ -442,8 +443,10 @@ int folio_migrate_mapping(struct address_space *mapping,
 			folio_set_swapcache(newfolio);
 			newfolio->private = folio_get_private(folio);
 		}
+		entries = nr;
 	} else {
 		VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
+		entries = 1;
 	}
 
 	/* Move dirty while page refs frozen and newpage not yet exposed */
@@ -453,7 +456,11 @@ int folio_migrate_mapping(struct address_space *mapping,
 		folio_set_dirty(newfolio);
 	}
 
-	xas_store(&xas, newfolio);
+	/* Swap cache still stores N entries instead of a high-order entry */
+	for (i = 0; i < entries; i++) {
+		xas_store(&xas, newfolio);
+		xas_next(&xas);
+	}
 
 	/*
 	 * Drop cache reference from old page by unfreezing
-- 
2.42.0

^ permalink raw reply related	[flat|nested] 4+ messages in thread
* Re: [PATCH] mm: Migrate high-order folios in swap cache correctly
  2023-12-14  4:58 [PATCH] mm: Migrate high-order folios in swap cache correctly Matthew Wilcox (Oracle)
@ 2023-12-14 22:11 ` Andrew Morton
  2024-02-06 14:54 ` Charan Teja Kalla
  1 sibling, 0 replies; 4+ messages in thread
From: Andrew Morton @ 2023-12-14 22:11 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle); +Cc: linux-mm, Charan Teja Kalla

On Thu, 14 Dec 2023 04:58:41 +0000 "Matthew Wilcox (Oracle)" <willy@infradead.org> wrote:

> From: Charan Teja Kalla <quic_charante@quicinc.com>
>
> Large folios occupy N consecutive entries in the swap cache
> instead of using multi-index entries like the page cache.
> However, if a large folio is re-added to the LRU list, it can
> be migrated.  The migration code was not aware of the difference
> between the swap cache and the page cache and assumed that a single
> xas_store() would be sufficient.
>
> This leaves potentially many stale pointers to the now-migrated folio
> in the swap cache, which can lead to almost arbitrary data corruption
> in the future.  This can also manifest as infinite loops with the
> RCU read lock held.
>
> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
> [modifications to the changelog & tweaked the fix]
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

I'm thinking

Fixes: 3417013e0d183be ("mm/migrate: Add folio_migrate_mapping()")

hence

Cc: <stable@vger.kernel.org>

And also

Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
Closes: https://lkml.kernel.org/r/1700569840-17327-1-git-send-email-quic_charante@quicinc.com

^ permalink raw reply	[flat|nested] 4+ messages in thread
* Re: [PATCH] mm: Migrate high-order folios in swap cache correctly
  2023-12-14  4:58 [PATCH] mm: Migrate high-order folios in swap cache correctly Matthew Wilcox (Oracle)
  2023-12-14 22:11 ` Andrew Morton
@ 2024-02-06 14:54 ` Charan Teja Kalla
  2024-02-07 14:38   ` Charan Teja Kalla
  1 sibling, 1 reply; 4+ messages in thread
From: Charan Teja Kalla @ 2024-02-06 14:54 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle), Andrew Morton, linux-mm

Hi Matthew,

It seems that the issue is not completely fixed by the change below.
This time it is because the ->private field is not filled for the
subpages of a THP, which can result in the same type of issue this
patch tried to fix:

1) For reclaim of an anon THP, the page is first added to the swap
cache.  As part of this, contiguous swap space of THP size is
allocated, the folio_nr_pages swap cache indices are filled with the
folio, and each subpage's ->private is set to its respective swap
entry.

add_to_swap_cache():
	for (i = 0; i < nr; i++) {
		set_page_private(folio_page(folio, i), entry.val + i);
		xas_store(&xas, folio);
		xas_next(&xas);
	}

2) Migrating a folio that is sitting in the swap cache sets only the
head page's newfolio->private to the swap entry.  The tail pages are
missed.

folio_migrate_mapping():
	newfolio->private = folio_get_private(folio);
	for (i = 0; i < entries; i++) {
		xas_store(&xas, newfolio);
		xas_next(&xas);
	}

3) Now, when this migrated page, still sitting in the swap cache, is
migrated again, the THP can be split (see
migrate_pages()->try_split_thp), which ends up storing the subpages in
the respective swap cache indices -- but these subpages' ->private
does not contain a valid swap entry.

__split_huge_page():
	for (i = nr - 1; i >= 1; i--) {
		if {
			...
		} else if (swap_cache) {
			__xa_store(&swap_cache->i_pages, offset + i,
					head + i, 0);
		}
	}

This leads to a state where all the subpages are sitting in the swap
cache with ->private not holding a valid value.  When
delete_from_swap_cache() is then tried on a subpage, it replaces the
__wrong swap cache index__ with a NULL/shadow value and subsequently
decreases the refcount of the subpage.

	for (i = 0; i < nr; i++) {
		if (subpage == page)
			continue;

		free_page_and_swap_cache(subpage):
			free_swap_cache(page);
			(calls delete_from_swap_cache())
			put_page(page);
	}

Consider a folio just sitting in the swap cache: its subpages' refcounts
drop from 3 (isolate + swap cache + lru_add_page_tail()) to 1 (refcounts
decremented by the swap cache deletion + put_page).  When
migrate_pages() is tried again on these split THP pages, the refcount of
1 makes the pages be freed directly (unmap_and_move).

So this leads to a state of "freed page entry in the swap cache", which
can cause various corruptions, loops under the RCU lock (observed in
mapping_get_entry), etc.

It seems the change below is also required here.

diff --git a/mm/migrate.c b/mm/migrate.c
index 9f5f52d..8049f4e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -427,10 +427,8 @@ int folio_migrate_mapping(struct address_space *mapping,
 	folio_ref_add(newfolio, nr); /* add cache reference */
 	if (folio_test_swapbacked(folio)) {
 		__folio_set_swapbacked(newfolio);
-		if (folio_test_swapcache(folio)) {
+		if (folio_test_swapcache(folio))
 			folio_set_swapcache(newfolio);
-			newfolio->private = folio_get_private(folio);
-		}
 		entries = nr;
 	} else {
 		VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
@@ -446,6 +444,8 @@ int folio_migrate_mapping(struct address_space *mapping,
 
 	/* Swap cache still stores N entries instead of a high-order entry */
 	for (i = 0; i < entries; i++) {
+		set_page_private(folio_page(newfolio, i),
+				folio_page(folio, i)->private);
 		xas_store(&xas, newfolio);
 		xas_next(&xas);
 	}

On 12/14/2023 10:28 AM, Matthew Wilcox (Oracle) wrote:
> From: Charan Teja Kalla <quic_charante@quicinc.com>
>
> Large folios occupy N consecutive entries in the swap cache
> instead of using multi-index entries like the page cache.
> However, if a large folio is re-added to the LRU list, it can
> be migrated.  The migration code was not aware of the difference
> between the swap cache and the page cache and assumed that a single
> xas_store() would be sufficient.
>
> This leaves potentially many stale pointers to the now-migrated folio
> in the swap cache, which can lead to almost arbitrary data corruption
> in the future.  This can also manifest as infinite loops with the
> RCU read lock held.
>
> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
> [modifications to the changelog & tweaked the fix]
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

^ permalink raw reply related	[flat|nested] 4+ messages in thread
* Re: [PATCH] mm: Migrate high-order folios in swap cache correctly
  2024-02-06 14:54 ` Charan Teja Kalla
@ 2024-02-07 14:38   ` Charan Teja Kalla
  0 siblings, 0 replies; 4+ messages in thread
From: Charan Teja Kalla @ 2024-02-07 14:38 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle), Andrew Morton, linux-mm, David Hildenbrand, Suren Baghdasaryan

I am sorry that I didn't check the latest kernel before posting.  We are
working on the 6.1 LTS kernel, where the mentioned issue is observed.  I
assume the analysis still holds true on the 6.1 kernel.

I see that this is not a problem on the latest kernels because of [1],
by David.

[1] https://lore.kernel.org/linux-mm/20230821160849.531668-2-david@redhat.com/#Z31mm:huge_memory.c

Thanks,
Charan

On 2/6/2024 8:24 PM, Charan Teja Kalla wrote:
> Hi Matthew,
>
> It seems that the issue is not completely fixed by the change below.
> This time it is because the ->private field is not filled for the
> subpages of a THP, which can result in the same type of issue this
> patch tried to fix:
>
> It seems the change below is also required here.
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 9f5f52d..8049f4e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -446,6 +444,8 @@ int folio_migrate_mapping(struct address_space *mapping,
>
>  	/* Swap cache still stores N entries instead of a high-order entry */
>  	for (i = 0; i < entries; i++) {
> +		set_page_private(folio_page(newfolio, i),
> +				folio_page(folio, i)->private);

I see that the expectation is that page_tail->private should be zero for
__split_huge_page_tail, so this patch is also wrong :(.

>  		xas_store(&xas, newfolio);
>  		xas_next(&xas);
>  	}

^ permalink raw reply	[flat|nested] 4+ messages in thread