On 27 Mar 2024, at 10:45, Ryan Roberts wrote: > Now that we no longer have a convenient flag in the cluster to determine > if a folio is large, free_swap_and_cache() will take a reference and > lock a large folio much more often, which could lead to contention and > (e.g.) failure to split large folios, etc. > > Let's solve that problem by batch freeing swap and cache with a new > function, free_swap_and_cache_nr(), to free a contiguous range of swap > entries together. This allows us to first drop a reference to each swap > slot before we try to release the cache folio. This means we only try to > release the folio once, only taking the reference and lock once - much > better than the previous 512 times for the 2M THP case. > > Contiguous swap entries are gathered in zap_pte_range() and > madvise_free_pte_range() in a similar way to how present ptes are > already gathered in zap_pte_range(). > > While we are at it, let's simplify by converting the return type of both > functions to void. The return value was used only by zap_pte_range() to > print a bad pte, and was ignored by everyone else, so the extra > reporting wasn't exactly guaranteed. We will still get the warning with > most of the information from get_swap_device(). With the batch version, > we wouldn't know which pte was bad anyway so could print the wrong one. > > Signed-off-by: Ryan Roberts > --- > include/linux/pgtable.h | 28 +++++++++++++++ > include/linux/swap.h | 12 +++++-- > mm/internal.h | 48 +++++++++++++++++++++++++ > mm/madvise.c | 12 ++++--- > mm/memory.c | 13 +++---- > mm/swapfile.c | 78 ++++++++++++++++++++++++++++++----------- > 6 files changed, 157 insertions(+), 34 deletions(-) > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index 09c85c7bf9c2..8185939df1e8 100644 > --- a/include/linux/pgtable.h > +++ b/include/linux/pgtable.h > @@ -708,6 +708,34 @@ static inline void pte_clear_not_present_full(struct mm_struct *mm, > } > #endif > > +#ifndef clear_not_present_full_ptes > +/** > + * clear_not_present_full_ptes - Clear consecutive not present PTEs. > + * @mm: Address space the ptes represent. > + * @addr: Address of the first pte. > + * @ptep: Page table pointer for the first entry. > + * @nr: Number of entries to clear. > + * @full: Whether we are clearing a full mm. > + * > + * May be overridden by the architecture; otherwise, implemented as a simple > + * loop over pte_clear_not_present_full(). > + * > + * Context: The caller holds the page table lock. The PTEs are all not present. > + * The PTEs are all in the same PMD. > + */ > +static inline void clear_not_present_full_ptes(struct mm_struct *mm, > + unsigned long addr, pte_t *ptep, unsigned int nr, int full) > +{ > + for (;;) { > + pte_clear_not_present_full(mm, addr, ptep, full); > + if (--nr == 0) > + break; > + ptep++; > + addr += PAGE_SIZE; > + } > +} > +#endif > + Would the code below be better? for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE) pte_clear_not_present_full(mm, addr, ptep, full); -- Best Regards, Yan, Zi