All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5 0/6] Swap-out mTHP without splitting
@ 2024-03-27 14:45 Ryan Roberts
  2024-03-27 14:45 ` [PATCH v5 1/6] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
                   ` (5 more replies)
  0 siblings, 6 replies; 35+ messages in thread
From: Ryan Roberts @ 2024-03-27 14:45 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Barry Song, Chris Li, Lance Yang
  Cc: Ryan Roberts, linux-mm, linux-kernel

Hi All,

This series adds support for swapping out multi-size THP (mTHP) without needing
to first split the large folio via split_huge_page_to_list_to_order(). It
closely follows the approach already used to swap-out PMD-sized THP.

There are a couple of reasons for swapping out mTHP without splitting:

  - Performance: It is expensive to split a large folio and under extreme memory
    pressure some workloads regressed performance when using 64K mTHP vs 4K
    small folios because of this extra cost in the swap-out path. This series
    not only eliminates the regression but makes it faster to swap out 64K mTHP
    vs 4K small folios.

  - Memory fragmentation avoidance: If we can avoid splitting a large folio
    memory is less likely to become fragmented, making it easier to re-allocate
    a large folio in future.

  - Performance: Enables a separate series [5] to swap-in whole mTHPs, which
    means we won't lose the TLB-efficiency benefits of mTHP once the memory has
    been through a swap cycle.

I've done what I thought was the smallest change possible, and as a result, this
approach is only employed when the swap is backed by a non-rotating block device
(just as PMD-sized THP is supported today). Discussion against the RFC concluded
that this is sufficient.


Performance Testing
===================

I've run some swap performance tests on Ampere Altra VM (arm64) with 8 CPUs. The
VM is set up with a 35G block ram device as the swap device and the test is run
from inside a memcg limited to 40G memory. I've then run `usemem` from
vm-scalability with 70 processes, each allocating and writing 1G of memory. I've
repeated everything 6 times and taken the mean performance improvement relative
to 4K page baseline:

| alloc size |                baseline |           + this series |
|            | mm-unstable (~v6.9-rc1) |                         |
|:-----------|------------------------:|------------------------:|
| 4K Page    |                    0.0% |                    1.3% |
| 64K THP    |                  -13.6% |                   46.3% |
| 2M THP     |                   91.4% |                   89.6% |

So with this change, the 64K swap performance goes from a 14% regression to a
46% improvement. While 2M shows a small regression I'm confident that this is
just noise.

---
The series applies against mm-unstable (4e567abb6482) with the addition of a
small fix for an arm64 build break (reported at [6]).


Changes since v4 [4]
====================

  - patch #3:
    - Added R-B from Huang, Ying - thanks!
  - patch #4:
    - get_swap_pages() now takes order instead of nr_pages (per Huang, Ying)
    - Removed WARN_ON_ONCE() from get_swap_pages()
    - Reworded comment for scan_swap_map_try_ssd_cluster() (per Huang, Ying)
    - Unified VM_WARN_ON()s in scan_swap_map_slots() to scan: (per Huang, Ying)
    - Removed redundant "order == 0" check (per Huang, Ying)
  - patch #5:
    - Marked list_empty() check with data_race() (per David)
    - Added R-B from Barry and David - thanks!
  - patch #6:
    - Implemented mkold_ptes() generic helper (pre David)
    - Enhanced folio_pte_batch() to report any_young (per David)
    - madvise_cold_or_pageout_pte_range() sets old in batch (per David)
    - Added R-B from Barry - thanks!


Changes since v3 [3]
====================

 - Renamed SWAP_NEXT_NULL -> SWAP_NEXT_INVALID (per Huang, Ying)
 - Simplified max offset calculation (per Huang, Ying)
 - Reinstated struct percpu_cluster to contain per-cluster, per-order `next`
   offset (per Huang, Ying)
 - Removed swap_alloc_large() and merged its functionality into
   scan_swap_map_slots() (per Huang, Ying)
 - Avoid extra cost of folio ref and lock due to removal of CLUSTER_FLAG_HUGE
   by freeing swap entries in batches (see patch 2) (per DavidH)
 - vmscan splits folio if its partially mapped (per Barry Song, DavidH)
 - Avoid splitting in MADV_PAGEOUT path (per Barry Song)
 - Dropped "mm: swap: Simplify ssd behavior when scanner steals entry" patch
   since it's not actually a problem for THP as I first thought.


Changes since v2 [2]
====================

 - Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
   allocation. This required some refactoring to make everything work nicely
   (new patches 2 and 3).
 - Fix bug where nr_swap_pages would say there are pages available but the
   scanner would not be able to allocate them because they were reserved for the
   per-cpu allocator. We now allow stealing of order-0 entries from the high
   order per-cpu clusters (in addition to exisiting stealing from order-0
   per-cpu clusters).


Changes since v1 [1]
====================

 - patch 1:
    - Use cluster_set_count() instead of cluster_set_count_flag() in
      swap_alloc_cluster() since we no longer have any flag to set. I was unable
      to kill cluster_set_count_flag() as proposed against v1 as other call
      sites depend explicitly setting flags to 0.
 - patch 2:
    - Moved large_next[] array into percpu_cluster to make it per-cpu
      (recommended by Huang, Ying).
    - large_next[] array is dynamically allocated because PMD_ORDER is not
      compile-time constant for powerpc (fixes build error).


[1] https://lore.kernel.org/linux-mm/20231010142111.3997780-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/
[4] https://lore.kernel.org/all/20240311150058.1122862-1-ryan.roberts@arm.com/
[5] https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/
[6] https://lore.kernel.org/all/b9944ac1-3919-4bb2-8b65-f3e5c52bc2aa@arm.com/

Thanks,
Ryan

Ryan Roberts (6):
  mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
  mm: swap: Simplify struct percpu_cluster
  mm: swap: Allow storage of all mTHP orders
  mm: vmscan: Avoid split during shrink_folio_list()
  mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

 include/linux/pgtable.h |  58 ++++++++
 include/linux/swap.h    |  35 +++--
 mm/huge_memory.c        |   3 -
 mm/internal.h           |  60 +++++++-
 mm/madvise.c            | 100 +++++++------
 mm/memory.c             |  17 +--
 mm/swap_slots.c         |   6 +-
 mm/swapfile.c           | 306 ++++++++++++++++++++++------------------
 mm/vmscan.c             |   9 +-
 9 files changed, 380 insertions(+), 214 deletions(-)

--
2.25.1


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v5 1/6] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-03-27 14:45 [PATCH v5 0/6] Swap-out mTHP without splitting Ryan Roberts
@ 2024-03-27 14:45 ` Ryan Roberts
  2024-03-29  1:56   ` Huang, Ying
  2024-04-05  9:22   ` David Hildenbrand
  2024-03-27 14:45 ` [PATCH v5 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache() Ryan Roberts
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 35+ messages in thread
From: Ryan Roberts @ 2024-03-27 14:45 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Barry Song, Chris Li, Lance Yang
  Cc: Ryan Roberts, linux-mm, linux-kernel

As preparation for supporting small-sized THP in the swap-out path,
without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE,
which, when present, always implies PMD-sized THP, which is the same as
the cluster size.

The only use of the flag was to determine whether a swap entry refers to
a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
Instead of relying on the flag, we now pass in nr_pages, which
originates from the folio's number of pages. This allows the logic to
work for folios of any order.

The one snag is that one of the swap_page_trans_huge_swapped() call
sites does not have the folio. But it was only being called there to
shortcut a call __try_to_reclaim_swap() in some cases.
__try_to_reclaim_swap() gets the folio and (via some other functions)
calls swap_page_trans_huge_swapped(). So I've removed the problematic
call site and believe the new logic should be functionally equivalent.

That said, removing the fast path means that we will take a reference
and trylock a large folio much more often, which we would like to avoid.
The next patch will solve this.

Removing CLUSTER_FLAG_HUGE also means we can remove split_swap_cluster()
which used to be called during folio splitting, since
split_swap_cluster()'s only job was to remove the flag.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/swap.h | 10 ----------
 mm/huge_memory.c     |  3 ---
 mm/swapfile.c        | 47 ++++++++------------------------------------
 3 files changed, 8 insertions(+), 52 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a211a0383425..f6f78198f000 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -259,7 +259,6 @@ struct swap_cluster_info {
 };
 #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
 #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
-#define CLUSTER_FLAG_HUGE 4 /* This cluster is backing a transparent huge page */
 
 /*
  * We assign a cluster to each CPU, so each CPU can allocate swap entry from
@@ -590,15 +589,6 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
 }
 #endif /* CONFIG_SWAP */
 
-#ifdef CONFIG_THP_SWAP
-extern int split_swap_cluster(swp_entry_t entry);
-#else
-static inline int split_swap_cluster(swp_entry_t entry)
-{
-	return 0;
-}
-#endif
-
 #ifdef CONFIG_MEMCG
 static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b49fcb8a16cc..8c1f3393994a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2961,9 +2961,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		shmem_uncharge(folio->mapping->host, nr_dropped);
 	remap_page(folio, nr);
 
-	if (folio_test_swapcache(folio))
-		split_swap_cluster(folio->swap);
-
 	/*
 	 * set page to its compound_head when split to non order-0 pages, so
 	 * we can skip unlocking it below, since PG_locked is transferred to
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5e6d2304a2a4..0d44ee2b4f9c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -343,18 +343,6 @@ static inline void cluster_set_null(struct swap_cluster_info *info)
 	info->data = 0;
 }
 
-static inline bool cluster_is_huge(struct swap_cluster_info *info)
-{
-	if (IS_ENABLED(CONFIG_THP_SWAP))
-		return info->flags & CLUSTER_FLAG_HUGE;
-	return false;
-}
-
-static inline void cluster_clear_huge(struct swap_cluster_info *info)
-{
-	info->flags &= ~CLUSTER_FLAG_HUGE;
-}
-
 static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
 						     unsigned long offset)
 {
@@ -1027,7 +1015,7 @@ static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
 	offset = idx * SWAPFILE_CLUSTER;
 	ci = lock_cluster(si, offset);
 	alloc_cluster(si, idx);
-	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, CLUSTER_FLAG_HUGE);
+	cluster_set_count(ci, SWAPFILE_CLUSTER);
 
 	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
 	unlock_cluster(ci);
@@ -1365,7 +1353,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 
 	ci = lock_cluster_or_swap_info(si, offset);
 	if (size == SWAPFILE_CLUSTER) {
-		VM_BUG_ON(!cluster_is_huge(ci));
 		map = si->swap_map + offset;
 		for (i = 0; i < SWAPFILE_CLUSTER; i++) {
 			val = map[i];
@@ -1373,7 +1360,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 			if (val == SWAP_HAS_CACHE)
 				free_entries++;
 		}
-		cluster_clear_huge(ci);
 		if (free_entries == SWAPFILE_CLUSTER) {
 			unlock_cluster_or_swap_info(si, ci);
 			spin_lock(&si->lock);
@@ -1395,23 +1381,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	unlock_cluster_or_swap_info(si, ci);
 }
 
-#ifdef CONFIG_THP_SWAP
-int split_swap_cluster(swp_entry_t entry)
-{
-	struct swap_info_struct *si;
-	struct swap_cluster_info *ci;
-	unsigned long offset = swp_offset(entry);
-
-	si = _swap_info_get(entry);
-	if (!si)
-		return -EBUSY;
-	ci = lock_cluster(si, offset);
-	cluster_clear_huge(ci);
-	unlock_cluster(ci);
-	return 0;
-}
-#endif
-
 static int swp_entry_cmp(const void *ent1, const void *ent2)
 {
 	const swp_entry_t *e1 = ent1, *e2 = ent2;
@@ -1519,22 +1488,23 @@ int swp_swapcount(swp_entry_t entry)
 }
 
 static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
-					 swp_entry_t entry)
+					 swp_entry_t entry,
+					 unsigned int nr_pages)
 {
 	struct swap_cluster_info *ci;
 	unsigned char *map = si->swap_map;
 	unsigned long roffset = swp_offset(entry);
-	unsigned long offset = round_down(roffset, SWAPFILE_CLUSTER);
+	unsigned long offset = round_down(roffset, nr_pages);
 	int i;
 	bool ret = false;
 
 	ci = lock_cluster_or_swap_info(si, offset);
-	if (!ci || !cluster_is_huge(ci)) {
+	if (!ci || nr_pages == 1) {
 		if (swap_count(map[roffset]))
 			ret = true;
 		goto unlock_out;
 	}
-	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+	for (i = 0; i < nr_pages; i++) {
 		if (swap_count(map[offset + i])) {
 			ret = true;
 			break;
@@ -1556,7 +1526,7 @@ static bool folio_swapped(struct folio *folio)
 	if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio)))
 		return swap_swapcount(si, entry) != 0;
 
-	return swap_page_trans_huge_swapped(si, entry);
+	return swap_page_trans_huge_swapped(si, entry, folio_nr_pages(folio));
 }
 
 /**
@@ -1622,8 +1592,7 @@ int free_swap_and_cache(swp_entry_t entry)
 		}
 
 		count = __swap_entry_free(p, entry);
-		if (count == SWAP_HAS_CACHE &&
-		    !swap_page_trans_huge_swapped(p, entry))
+		if (count == SWAP_HAS_CACHE)
 			__try_to_reclaim_swap(p, swp_offset(entry),
 					      TTRS_UNMAPPED | TTRS_FULL);
 		put_swap_device(p);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v5 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
  2024-03-27 14:45 [PATCH v5 0/6] Swap-out mTHP without splitting Ryan Roberts
  2024-03-27 14:45 ` [PATCH v5 1/6] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
@ 2024-03-27 14:45 ` Ryan Roberts
  2024-04-01  5:52   ` Huang, Ying
  2024-04-03  0:30   ` Zi Yan
  2024-03-27 14:45 ` [PATCH v5 3/6] mm: swap: Simplify struct percpu_cluster Ryan Roberts
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 35+ messages in thread
From: Ryan Roberts @ 2024-03-27 14:45 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Barry Song, Chris Li, Lance Yang
  Cc: Ryan Roberts, linux-mm, linux-kernel

Now that we no longer have a convenient flag in the cluster to determine
if a folio is large, free_swap_and_cache() will take a reference and
lock a large folio much more often, which could lead to contention and
(e.g.) failure to split large folios, etc.

Let's solve that problem by batch freeing swap and cache with a new
function, free_swap_and_cache_nr(), to free a contiguous range of swap
entries together. This allows us to first drop a reference to each swap
slot before we try to release the cache folio. This means we only try to
release the folio once, only taking the reference and lock once - much
better than the previous 512 times for the 2M THP case.

Contiguous swap entries are gathered in zap_pte_range() and
madvise_free_pte_range() in a similar way to how present ptes are
already gathered in zap_pte_range().

While we are at it, let's simplify by converting the return type of both
functions to void. The return value was used only by zap_pte_range() to
print a bad pte, and was ignored by everyone else, so the extra
reporting wasn't exactly guaranteed. We will still get the warning with
most of the information from get_swap_device(). With the batch version,
we wouldn't know which pte was bad anyway so could print the wrong one.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/pgtable.h | 28 +++++++++++++++
 include/linux/swap.h    | 12 +++++--
 mm/internal.h           | 48 +++++++++++++++++++++++++
 mm/madvise.c            | 12 ++++---
 mm/memory.c             | 13 +++----
 mm/swapfile.c           | 78 ++++++++++++++++++++++++++++++-----------
 6 files changed, 157 insertions(+), 34 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 09c85c7bf9c2..8185939df1e8 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -708,6 +708,34 @@ static inline void pte_clear_not_present_full(struct mm_struct *mm,
 }
 #endif
 
+#ifndef clear_not_present_full_ptes
+/**
+ * clear_not_present_full_ptes - Clear consecutive not present PTEs.
+ * @mm: Address space the ptes represent.
+ * @addr: Address of the first pte.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear.
+ * @full: Whether we are clearing a full mm.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over pte_clear_not_present_full().
+ *
+ * Context: The caller holds the page table lock.  The PTEs are all not present.
+ * The PTEs are all in the same PMD.
+ */
+static inline void clear_not_present_full_ptes(struct mm_struct *mm,
+		unsigned long addr, pte_t *ptep, unsigned int nr, int full)
+{
+	for (;;) {
+		pte_clear_not_present_full(mm, addr, ptep, full);
+		if (--nr == 0)
+			break;
+		ptep++;
+		addr += PAGE_SIZE;
+	}
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
 extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
 			      unsigned long address,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index f6f78198f000..5737236dc3ce 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -471,7 +471,7 @@ extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
 extern void swapcache_free_entries(swp_entry_t *entries, int n);
-extern int free_swap_and_cache(swp_entry_t);
+extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
 int swap_type_of(dev_t device, sector_t offset);
 int find_first_swap(dev_t *device);
 extern unsigned int count_swap_pages(int, int);
@@ -520,8 +520,9 @@ static inline void put_swap_device(struct swap_info_struct *si)
 #define free_pages_and_swap_cache(pages, nr) \
 	release_pages((pages), (nr));
 
-/* used to sanity check ptes in zap_pte_range when CONFIG_SWAP=0 */
-#define free_swap_and_cache(e) is_pfn_swap_entry(e)
+static inline void free_swap_and_cache_nr(swp_entry_t entry, int nr)
+{
+}
 
 static inline void free_swap_cache(struct folio *folio)
 {
@@ -589,6 +590,11 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
 }
 #endif /* CONFIG_SWAP */
 
+static inline void free_swap_and_cache(swp_entry_t entry)
+{
+	free_swap_and_cache_nr(entry, 1);
+}
+
 #ifdef CONFIG_MEMCG
 static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 {
diff --git a/mm/internal.h b/mm/internal.h
index 8e11f7b2da21..eadb79c3a357 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -11,6 +11,8 @@
 #include <linux/mm.h>
 #include <linux/pagemap.h>
 #include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 #include <linux/tracepoint-defs.h>
 
 struct folio_batch;
@@ -189,6 +191,52 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 
 	return min(ptep - start_ptep, max_nr);
 }
+
+/**
+ * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
+ * @start_ptep: Page table pointer for the first entry.
+ * @max_nr: The maximum number of table entries to consider.
+ * @entry: Swap entry recovered from the first table entry.
+ *
+ * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
+ * containing swap entries all with consecutive offsets and targeting the same
+ * swap type.
+ *
+ * max_nr must be at least one and must be limited by the caller so scanning
+ * cannot exceed a single page table.
+ *
+ * Return: the number of table entries in the batch.
+ */
+static inline int swap_pte_batch(pte_t *start_ptep, int max_nr,
+				 swp_entry_t entry)
+{
+	const pte_t *end_ptep = start_ptep + max_nr;
+	unsigned long expected_offset = swp_offset(entry) + 1;
+	unsigned int expected_type = swp_type(entry);
+	pte_t *ptep = start_ptep + 1;
+
+	VM_WARN_ON(max_nr < 1);
+	VM_WARN_ON(non_swap_entry(entry));
+
+	while (ptep < end_ptep) {
+		pte_t pte = ptep_get(ptep);
+
+		if (pte_none(pte) || pte_present(pte))
+			break;
+
+		entry = pte_to_swp_entry(pte);
+
+		if (non_swap_entry(entry) ||
+		    swp_type(entry) != expected_type ||
+		    swp_offset(entry) != expected_offset)
+			break;
+
+		expected_offset++;
+		ptep++;
+	}
+
+	return ptep - start_ptep;
+}
 #endif /* CONFIG_MMU */
 
 void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
diff --git a/mm/madvise.c b/mm/madvise.c
index 1f77a51baaac..070bedb4996e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	struct folio *folio;
 	int nr_swap = 0;
 	unsigned long next;
+	int nr, max_nr;
 
 	next = pmd_addr_end(addr, end);
 	if (pmd_trans_huge(*pmd))
@@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		return 0;
 	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
-	for (; addr != end; pte++, addr += PAGE_SIZE) {
+	for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
+		nr = 1;
 		ptent = ptep_get(pte);
 
 		if (pte_none(ptent))
@@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 
 			entry = pte_to_swp_entry(ptent);
 			if (!non_swap_entry(entry)) {
-				nr_swap--;
-				free_swap_and_cache(entry);
-				pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+				max_nr = (end - addr) / PAGE_SIZE;
+				nr = swap_pte_batch(pte, max_nr, entry);
+				nr_swap -= nr;
+				free_swap_and_cache_nr(entry, nr);
+				clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
 			} else if (is_hwpoison_entry(entry) ||
 				   is_poisoned_swp_entry(entry)) {
 				pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
diff --git a/mm/memory.c b/mm/memory.c
index 36191a9c799c..9d844582ba38 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1631,12 +1631,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				folio_remove_rmap_pte(folio, page, vma);
 			folio_put(folio);
 		} else if (!non_swap_entry(entry)) {
-			/* Genuine swap entry, hence a private anon page */
+			max_nr = (end - addr) / PAGE_SIZE;
+			nr = swap_pte_batch(pte, max_nr, entry);
+			/* Genuine swap entries, hence a private anon pages */
 			if (!should_zap_cows(details))
 				continue;
-			rss[MM_SWAPENTS]--;
-			if (unlikely(!free_swap_and_cache(entry)))
-				print_bad_pte(vma, addr, ptent, NULL);
+			rss[MM_SWAPENTS] -= nr;
+			free_swap_and_cache_nr(entry, nr);
 		} else if (is_migration_entry(entry)) {
 			folio = pfn_swap_entry_folio(entry);
 			if (!should_zap_folio(details, folio))
@@ -1659,8 +1660,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
 			WARN_ON_ONCE(1);
 		}
-		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
-		zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
+		clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
+		zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
 	} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
 
 	add_mm_rss_vec(mm, rss);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0d44ee2b4f9c..cedfc82d37e5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -130,7 +130,11 @@ static inline unsigned char swap_count(unsigned char ent)
 /* Reclaim the swap entry if swap is getting full*/
 #define TTRS_FULL		0x4
 
-/* returns 1 if swap entry is freed */
+/*
+ * returns number of pages in the folio that backs the swap entry. If positive,
+ * the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no
+ * folio was associated with the swap entry.
+ */
 static int __try_to_reclaim_swap(struct swap_info_struct *si,
 				 unsigned long offset, unsigned long flags)
 {
@@ -155,6 +159,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 			ret = folio_free_swap(folio);
 		folio_unlock(folio);
 	}
+	ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio);
 	folio_put(folio);
 	return ret;
 }
@@ -895,7 +900,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 		swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
 		spin_lock(&si->lock);
 		/* entry was freed successfully, try to use this again */
-		if (swap_was_freed)
+		if (swap_was_freed > 0)
 			goto checks;
 		goto scan; /* check next one */
 	}
@@ -1572,32 +1577,63 @@ bool folio_free_swap(struct folio *folio)
 	return true;
 }
 
-/*
- * Free the swap entry like above, but also try to
- * free the page cache entry if it is the last user.
- */
-int free_swap_and_cache(swp_entry_t entry)
+void free_swap_and_cache_nr(swp_entry_t entry, int nr)
 {
-	struct swap_info_struct *p;
-	unsigned char count;
+	unsigned long end = swp_offset(entry) + nr;
+	unsigned int type = swp_type(entry);
+	struct swap_info_struct *si;
+	unsigned long offset;
 
 	if (non_swap_entry(entry))
-		return 1;
+		return;
 
-	p = get_swap_device(entry);
-	if (p) {
-		if (WARN_ON(data_race(!p->swap_map[swp_offset(entry)]))) {
-			put_swap_device(p);
-			return 0;
-		}
+	si = get_swap_device(entry);
+	if (!si)
+		return;
 
-		count = __swap_entry_free(p, entry);
-		if (count == SWAP_HAS_CACHE)
-			__try_to_reclaim_swap(p, swp_offset(entry),
+	if (WARN_ON(end > si->max))
+		goto out;
+
+	/*
+	 * First free all entries in the range.
+	 */
+	for (offset = swp_offset(entry); offset < end; offset++) {
+		if (!WARN_ON(data_race(!si->swap_map[offset])))
+			__swap_entry_free(si, swp_entry(type, offset));
+	}
+
+	/*
+	 * Now go back over the range trying to reclaim the swap cache. This is
+	 * more efficient for large folios because we will only try to reclaim
+	 * the swap once per folio in the common case. If we do
+	 * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the
+	 * latter will get a reference and lock the folio for every individual
+	 * page but will only succeed once the swap slot for every subpage is
+	 * zero.
+	 */
+	for (offset = swp_offset(entry); offset < end; offset += nr) {
+		nr = 1;
+		if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
+			/*
+			 * Folios are always naturally aligned in swap so
+			 * advance forward to the next boundary. Zero means no
+			 * folio was found for the swap entry, so advance by 1
+			 * in this case. Negative value means folio was found
+			 * but could not be reclaimed. Here we can still advance
+			 * to the next boundary.
+			 */
+			nr = __try_to_reclaim_swap(si, offset,
 					      TTRS_UNMAPPED | TTRS_FULL);
-		put_swap_device(p);
+			if (nr == 0)
+				nr = 1;
+			else if (nr < 0)
+				nr = -nr;
+			nr = ALIGN(offset + 1, nr) - offset;
+		}
 	}
-	return p != NULL;
+
+out:
+	put_swap_device(si);
 }
 
 #ifdef CONFIG_HIBERNATION
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v5 3/6] mm: swap: Simplify struct percpu_cluster
  2024-03-27 14:45 [PATCH v5 0/6] Swap-out mTHP without splitting Ryan Roberts
  2024-03-27 14:45 ` [PATCH v5 1/6] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
  2024-03-27 14:45 ` [PATCH v5 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache() Ryan Roberts
@ 2024-03-27 14:45 ` Ryan Roberts
  2024-03-27 14:45 ` [PATCH v5 4/6] mm: swap: Allow storage of all mTHP orders Ryan Roberts
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 35+ messages in thread
From: Ryan Roberts @ 2024-03-27 14:45 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Barry Song, Chris Li, Lance Yang
  Cc: Ryan Roberts, linux-mm, linux-kernel

struct percpu_cluster stores the index of cpu's current cluster and the
offset of the next entry that will be allocated for the cpu. These two
pieces of information are redundant because the cluster index is just
(offset / SWAPFILE_CLUSTER). The only reason for explicitly keeping the
cluster index is because the structure used for it also has a flag to
indicate "no cluster". However this data structure also contains a spin
lock, which is never used in this context, as a side effect the code
copies the spinlock_t structure, which is questionable coding practice
in my view.

So let's clean this up and store only the next offset, and use a
sentinal value (SWAP_NEXT_INVALID) to indicate "no cluster".
SWAP_NEXT_INVALID is chosen to be 0, because 0 will never be seen
legitimately; The first page in the swap file is the swap header, which
is always marked bad to prevent it from being allocated as an entry.
This also prevents the cluster to which it belongs being marked free, so
it will never appear on the free list.

This change saves 16 bytes per cpu. And given we are shortly going to
extend this mechanism to be per-cpu-AND-per-order, we will end up saving
16 * 9 = 144 bytes per cpu, which adds up if you have 256 cpus in the
system.

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/swap.h |  9 ++++++++-
 mm/swapfile.c        | 22 +++++++++++-----------
 2 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5737236dc3ce..5e1e4f5bf0cb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -260,13 +260,20 @@ struct swap_cluster_info {
 #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
 #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
 
+/*
+ * The first page in the swap file is the swap header, which is always marked
+ * bad to prevent it from being allocated as an entry. This also prevents the
+ * cluster to which it belongs being marked free. Therefore 0 is safe to use as
+ * a sentinel to indicate next is not valid in percpu_cluster.
+ */
+#define SWAP_NEXT_INVALID	0
+
 /*
  * We assign a cluster to each CPU, so each CPU can allocate swap entry from
  * its own cluster and swapout sequentially. The purpose is to optimize swapout
  * throughput.
  */
 struct percpu_cluster {
-	struct swap_cluster_info index; /* Current cluster index */
 	unsigned int next; /* Likely next allocation offset */
 };
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index cedfc82d37e5..1393966b77af 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -609,7 +609,7 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
 		return false;
 
 	percpu_cluster = this_cpu_ptr(si->percpu_cluster);
-	cluster_set_null(&percpu_cluster->index);
+	percpu_cluster->next = SWAP_NEXT_INVALID;
 	return true;
 }
 
@@ -622,14 +622,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 {
 	struct percpu_cluster *cluster;
 	struct swap_cluster_info *ci;
-	unsigned long tmp, max;
+	unsigned int tmp, max;
 
 new_cluster:
 	cluster = this_cpu_ptr(si->percpu_cluster);
-	if (cluster_is_null(&cluster->index)) {
+	tmp = cluster->next;
+	if (tmp == SWAP_NEXT_INVALID) {
 		if (!cluster_list_empty(&si->free_clusters)) {
-			cluster->index = si->free_clusters.head;
-			cluster->next = cluster_next(&cluster->index) *
+			tmp = cluster_next(&si->free_clusters.head) *
 					SWAPFILE_CLUSTER;
 		} else if (!cluster_list_empty(&si->discard_clusters)) {
 			/*
@@ -649,9 +649,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	 * Other CPUs can use our cluster if they can't find a free cluster,
 	 * check if there is still free entry in the cluster
 	 */
-	tmp = cluster->next;
-	max = min_t(unsigned long, si->max,
-		    (cluster_next(&cluster->index) + 1) * SWAPFILE_CLUSTER);
+	max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER));
 	if (tmp < max) {
 		ci = lock_cluster(si, tmp);
 		while (tmp < max) {
@@ -662,12 +660,13 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 		unlock_cluster(ci);
 	}
 	if (tmp >= max) {
-		cluster_set_null(&cluster->index);
+		cluster->next = SWAP_NEXT_INVALID;
 		goto new_cluster;
 	}
-	cluster->next = tmp + 1;
 	*offset = tmp;
 	*scan_base = tmp;
+	tmp += 1;
+	cluster->next = tmp < max ? tmp : SWAP_NEXT_INVALID;
 	return true;
 }
 
@@ -3138,8 +3137,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		}
 		for_each_possible_cpu(cpu) {
 			struct percpu_cluster *cluster;
+
 			cluster = per_cpu_ptr(p->percpu_cluster, cpu);
-			cluster_set_null(&cluster->index);
+			cluster->next = SWAP_NEXT_INVALID;
 		}
 	} else {
 		atomic_inc(&nr_rotate_swap);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v5 4/6] mm: swap: Allow storage of all mTHP orders
  2024-03-27 14:45 [PATCH v5 0/6] Swap-out mTHP without splitting Ryan Roberts
                   ` (2 preceding siblings ...)
  2024-03-27 14:45 ` [PATCH v5 3/6] mm: swap: Simplify struct percpu_cluster Ryan Roberts
@ 2024-03-27 14:45 ` Ryan Roberts
  2024-04-01  3:15   ` Huang, Ying
  2024-03-27 14:45 ` [PATCH v5 5/6] mm: vmscan: Avoid split during shrink_folio_list() Ryan Roberts
  2024-03-27 14:45 ` [PATCH v5 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD Ryan Roberts
  5 siblings, 1 reply; 35+ messages in thread
From: Ryan Roberts @ 2024-03-27 14:45 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Barry Song, Chris Li, Lance Yang
  Cc: Ryan Roberts, linux-mm, linux-kernel

Multi-size THP enables performance improvements by allocating large,
pte-mapped folios for anonymous memory. However I've observed that on an
arm64 system running a parallel workload (e.g. kernel compilation)
across many cores, under high memory pressure, the speed regresses. This
is due to bottlenecking on the increased number of TLBIs added due to
all the extra folio splitting when the large folios are swapped out.

Therefore, solve this regression by adding support for swapping out mTHP
without needing to split the folio, just like is already done for
PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled,
and when the swap backing store is a non-rotating block device. These
are the same constraints as for the existing PMD-sized THP swap-out
support.

Note that no attempt is made to swap-in (m)THP here - this is still done
page-by-page, like for PMD-sized THP. But swapping-out mTHP is a
prerequisite for swapping-in mTHP.

The main change here is to improve the swap entry allocator so that it
can allocate any power-of-2 number of contiguous entries between [1, (1
<< PMD_ORDER)]. This is done by allocating a cluster for each distinct
order and allocating sequentially from it until the cluster is full.
This ensures that we don't need to search the map and we get no
fragmentation due to alignment padding for different orders in the
cluster. If there is no current cluster for a given order, we attempt to
allocate a free cluster from the list. If there are no free clusters, we
fail the allocation and the caller can fall back to splitting the folio
and allocates individual entries (as per existing PMD-sized THP
fallback).

The per-order current clusters are maintained per-cpu using the existing
infrastructure. This is done to avoid interleving pages from different
tasks, which would prevent IO being batched. This is already done for
the order-0 allocations so we follow the same pattern.

As is done for order-0 per-cpu clusters, the scanner now can steal
order-0 entries from any per-cpu-per-order reserved cluster. This
ensures that when the swap file is getting full, space doesn't get tied
up in the per-cpu reserves.

This change only modifies swap to be able to accept any order mTHP. It
doesn't change the callers to elide doing the actual split. That will be
done in separate changes.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/swap.h |  10 ++-
 mm/swap_slots.c      |   6 +-
 mm/swapfile.c        | 175 ++++++++++++++++++++++++-------------------
 3 files changed, 109 insertions(+), 82 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5e1e4f5bf0cb..11c53692f65f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -268,13 +268,19 @@ struct swap_cluster_info {
  */
 #define SWAP_NEXT_INVALID	0
 
+#ifdef CONFIG_THP_SWAP
+#define SWAP_NR_ORDERS		(PMD_ORDER + 1)
+#else
+#define SWAP_NR_ORDERS		1
+#endif
+
 /*
  * We assign a cluster to each CPU, so each CPU can allocate swap entry from
  * its own cluster and swapout sequentially. The purpose is to optimize swapout
  * throughput.
  */
 struct percpu_cluster {
-	unsigned int next; /* Likely next allocation offset */
+	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
 };
 
 struct swap_cluster_list {
@@ -471,7 +477,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio);
 bool folio_free_swap(struct folio *folio);
 void put_swap_folio(struct folio *folio, swp_entry_t entry);
 extern swp_entry_t get_swap_page_of_type(int);
-extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size);
+extern int get_swap_pages(int n, swp_entry_t swp_entries[], int order);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 53abeaf1371d..13ab3b771409 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -264,7 +264,7 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
 	cache->cur = 0;
 	if (swap_slot_cache_active)
 		cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE,
-					   cache->slots, 1);
+					   cache->slots, 0);
 
 	return cache->nr;
 }
@@ -311,7 +311,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
 
 	if (folio_test_large(folio)) {
 		if (IS_ENABLED(CONFIG_THP_SWAP))
-			get_swap_pages(1, &entry, folio_nr_pages(folio));
+			get_swap_pages(1, &entry, folio_order(folio));
 		goto out;
 	}
 
@@ -343,7 +343,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
 			goto out;
 	}
 
-	get_swap_pages(1, &entry, 1);
+	get_swap_pages(1, &entry, 0);
 out:
 	if (mem_cgroup_try_charge_swap(folio, entry)) {
 		put_swap_folio(folio, entry);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1393966b77af..d56cdc547a06 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -278,15 +278,15 @@ static void discard_swap_cluster(struct swap_info_struct *si,
 #ifdef CONFIG_THP_SWAP
 #define SWAPFILE_CLUSTER	HPAGE_PMD_NR
 
-#define swap_entry_size(size)	(size)
+#define swap_entry_order(order)	(order)
 #else
 #define SWAPFILE_CLUSTER	256
 
 /*
- * Define swap_entry_size() as constant to let compiler to optimize
+ * Define swap_entry_order() as constant to let compiler to optimize
  * out some code if !CONFIG_THP_SWAP
  */
-#define swap_entry_size(size)	1
+#define swap_entry_order(order)	0
 #endif
 #define LATENCY_LIMIT		256
 
@@ -551,10 +551,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
 
 /*
  * The cluster corresponding to page_nr will be used. The cluster will be
- * removed from free cluster list and its usage counter will be increased.
+ * removed from free cluster list and its usage counter will be increased by
+ * count.
  */
-static void inc_cluster_info_page(struct swap_info_struct *p,
-	struct swap_cluster_info *cluster_info, unsigned long page_nr)
+static void add_cluster_info_page(struct swap_info_struct *p,
+	struct swap_cluster_info *cluster_info, unsigned long page_nr,
+	unsigned long count)
 {
 	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
 
@@ -563,9 +565,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
 	if (cluster_is_free(&cluster_info[idx]))
 		alloc_cluster(p, idx);
 
-	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
+	VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
 	cluster_set_count(&cluster_info[idx],
-		cluster_count(&cluster_info[idx]) + 1);
+		cluster_count(&cluster_info[idx]) + count);
+}
+
+/*
+ * The cluster corresponding to page_nr will be used. The cluster will be
+ * removed from free cluster list and its usage counter will be increased by 1.
+ */
+static void inc_cluster_info_page(struct swap_info_struct *p,
+	struct swap_cluster_info *cluster_info, unsigned long page_nr)
+{
+	add_cluster_info_page(p, cluster_info, page_nr, 1);
 }
 
 /*
@@ -595,7 +607,7 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
  */
 static bool
 scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
-	unsigned long offset)
+	unsigned long offset, int order)
 {
 	struct percpu_cluster *percpu_cluster;
 	bool conflict;
@@ -609,24 +621,39 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
 		return false;
 
 	percpu_cluster = this_cpu_ptr(si->percpu_cluster);
-	percpu_cluster->next = SWAP_NEXT_INVALID;
+	percpu_cluster->next[order] = SWAP_NEXT_INVALID;
+	return true;
+}
+
+static inline bool swap_range_empty(char *swap_map, unsigned int start,
+				    unsigned int nr_pages)
+{
+	unsigned int i;
+
+	for (i = 0; i < nr_pages; i++) {
+		if (swap_map[start + i])
+			return false;
+	}
+
 	return true;
 }
 
 /*
- * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
- * might involve allocating a new cluster for current CPU too.
+ * Try to get swap entries with specified order from current cpu's swap entry
+ * pool (a cluster). This might involve allocating a new cluster for current CPU
+ * too.
  */
 static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
-	unsigned long *offset, unsigned long *scan_base)
+	unsigned long *offset, unsigned long *scan_base, int order)
 {
+	unsigned int nr_pages = 1 << order;
 	struct percpu_cluster *cluster;
 	struct swap_cluster_info *ci;
 	unsigned int tmp, max;
 
 new_cluster:
 	cluster = this_cpu_ptr(si->percpu_cluster);
-	tmp = cluster->next;
+	tmp = cluster->next[order];
 	if (tmp == SWAP_NEXT_INVALID) {
 		if (!cluster_list_empty(&si->free_clusters)) {
 			tmp = cluster_next(&si->free_clusters.head) *
@@ -647,26 +674,27 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 
 	/*
 	 * Other CPUs can use our cluster if they can't find a free cluster,
-	 * check if there is still free entry in the cluster
+	 * check if there is still free entry in the cluster, maintaining
+	 * natural alignment.
 	 */
 	max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER));
 	if (tmp < max) {
 		ci = lock_cluster(si, tmp);
 		while (tmp < max) {
-			if (!si->swap_map[tmp])
+			if (swap_range_empty(si->swap_map, tmp, nr_pages))
 				break;
-			tmp++;
+			tmp += nr_pages;
 		}
 		unlock_cluster(ci);
 	}
 	if (tmp >= max) {
-		cluster->next = SWAP_NEXT_INVALID;
+		cluster->next[order] = SWAP_NEXT_INVALID;
 		goto new_cluster;
 	}
 	*offset = tmp;
 	*scan_base = tmp;
-	tmp += 1;
-	cluster->next = tmp < max ? tmp : SWAP_NEXT_INVALID;
+	tmp += nr_pages;
+	cluster->next[order] = tmp < max ? tmp : SWAP_NEXT_INVALID;
 	return true;
 }
 
@@ -796,13 +824,14 @@ static bool swap_offset_available_and_locked(struct swap_info_struct *si,
 
 static int scan_swap_map_slots(struct swap_info_struct *si,
 			       unsigned char usage, int nr,
-			       swp_entry_t slots[])
+			       swp_entry_t slots[], int order)
 {
 	struct swap_cluster_info *ci;
 	unsigned long offset;
 	unsigned long scan_base;
 	unsigned long last_in_cluster = 0;
 	int latency_ration = LATENCY_LIMIT;
+	unsigned int nr_pages = 1 << order;
 	int n_ret = 0;
 	bool scanned_many = false;
 
@@ -817,6 +846,25 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	 * And we let swap pages go all over an SSD partition.  Hugh
 	 */
 
+	if (order > 0) {
+		/*
+		 * Should not even be attempting large allocations when huge
+		 * page swap is disabled.  Warn and fail the allocation.
+		 */
+		if (!IS_ENABLED(CONFIG_THP_SWAP) ||
+		    nr_pages > SWAPFILE_CLUSTER) {
+			VM_WARN_ON_ONCE(1);
+			return 0;
+		}
+
+		/*
+		 * Swapfile is not block device or not using clusters so unable
+		 * to allocate large entries.
+		 */
+		if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
+			return 0;
+	}
+
 	si->flags += SWP_SCANNING;
 	/*
 	 * Use percpu scan base for SSD to reduce lock contention on
@@ -831,8 +879,11 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 
 	/* SSD algorithm */
 	if (si->cluster_info) {
-		if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
+		if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) {
+			if (order > 0)
+				goto no_page;
 			goto scan;
+		}
 	} else if (unlikely(!si->cluster_nr--)) {
 		if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
 			si->cluster_nr = SWAPFILE_CLUSTER - 1;
@@ -874,13 +925,16 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 
 checks:
 	if (si->cluster_info) {
-		while (scan_swap_map_ssd_cluster_conflict(si, offset)) {
+		while (scan_swap_map_ssd_cluster_conflict(si, offset, order)) {
 		/* take a break if we already got some slots */
 			if (n_ret)
 				goto done;
 			if (!scan_swap_map_try_ssd_cluster(si, &offset,
-							&scan_base))
+							&scan_base, order)) {
+				if (order > 0)
+					goto no_page;
 				goto scan;
+			}
 		}
 	}
 	if (!(si->flags & SWP_WRITEOK))
@@ -911,11 +965,11 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 		else
 			goto done;
 	}
-	WRITE_ONCE(si->swap_map[offset], usage);
-	inc_cluster_info_page(si, si->cluster_info, offset);
+	memset(si->swap_map + offset, usage, nr_pages);
+	add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
 	unlock_cluster(ci);
 
-	swap_range_alloc(si, offset, 1);
+	swap_range_alloc(si, offset, nr_pages);
 	slots[n_ret++] = swp_entry(si->type, offset);
 
 	/* got enough slots or reach max slots? */
@@ -936,8 +990,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 
 	/* try to get more slots in cluster */
 	if (si->cluster_info) {
-		if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
+		if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order))
 			goto checks;
+		if (order > 0)
+			goto done;
 	} else if (si->cluster_nr && !si->swap_map[++offset]) {
 		/* non-ssd case, still more slots in cluster? */
 		--si->cluster_nr;
@@ -964,11 +1020,13 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	}
 
 done:
-	set_cluster_next(si, offset + 1);
+	if (order == 0)
+		set_cluster_next(si, offset + 1);
 	si->flags -= SWP_SCANNING;
 	return n_ret;
 
 scan:
+	VM_WARN_ON(order > 0);
 	spin_unlock(&si->lock);
 	while (++offset <= READ_ONCE(si->highest_bit)) {
 		if (unlikely(--latency_ration < 0)) {
@@ -997,38 +1055,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	return n_ret;
 }
 
-static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
-{
-	unsigned long idx;
-	struct swap_cluster_info *ci;
-	unsigned long offset;
-
-	/*
-	 * Should not even be attempting cluster allocations when huge
-	 * page swap is disabled.  Warn and fail the allocation.
-	 */
-	if (!IS_ENABLED(CONFIG_THP_SWAP)) {
-		VM_WARN_ON_ONCE(1);
-		return 0;
-	}
-
-	if (cluster_list_empty(&si->free_clusters))
-		return 0;
-
-	idx = cluster_list_first(&si->free_clusters);
-	offset = idx * SWAPFILE_CLUSTER;
-	ci = lock_cluster(si, offset);
-	alloc_cluster(si, idx);
-	cluster_set_count(ci, SWAPFILE_CLUSTER);
-
-	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
-	unlock_cluster(ci);
-	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
-	*slot = swp_entry(si->type, offset);
-
-	return 1;
-}
-
 static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
 {
 	unsigned long offset = idx * SWAPFILE_CLUSTER;
@@ -1042,17 +1068,15 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
 	swap_range_free(si, offset, SWAPFILE_CLUSTER);
 }
 
-int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
+int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 {
-	unsigned long size = swap_entry_size(entry_size);
+	int order = swap_entry_order(entry_order);
+	unsigned long size = 1 << order;
 	struct swap_info_struct *si, *next;
 	long avail_pgs;
 	int n_ret = 0;
 	int node;
 
-	/* Only single cluster request supported */
-	WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
-
 	spin_lock(&swap_avail_lock);
 
 	avail_pgs = atomic_long_read(&nr_swap_pages) / size;
@@ -1088,14 +1112,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 			spin_unlock(&si->lock);
 			goto nextsi;
 		}
-		if (size == SWAPFILE_CLUSTER) {
-			if (si->flags & SWP_BLKDEV)
-				n_ret = swap_alloc_cluster(si, swp_entries);
-		} else
-			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
-						    n_goal, swp_entries);
+		n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
+					    n_goal, swp_entries, order);
 		spin_unlock(&si->lock);
-		if (n_ret || size == SWAPFILE_CLUSTER)
+		if (n_ret || size > 1)
 			goto check_out;
 		cond_resched();
 
@@ -1349,7 +1369,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	unsigned char *map;
 	unsigned int i, free_entries = 0;
 	unsigned char val;
-	int size = swap_entry_size(folio_nr_pages(folio));
+	int size = 1 << swap_entry_order(folio_order(folio));
 
 	si = _swap_info_get(entry);
 	if (!si)
@@ -1647,7 +1667,7 @@ swp_entry_t get_swap_page_of_type(int type)
 
 	/* This is called for allocating swap entry, not cache */
 	spin_lock(&si->lock);
-	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry))
+	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
 		atomic_long_dec(&nr_swap_pages);
 	spin_unlock(&si->lock);
 fail:
@@ -3101,7 +3121,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		p->flags |= SWP_SYNCHRONOUS_IO;
 
 	if (p->bdev && bdev_nonrot(p->bdev)) {
-		int cpu;
+		int cpu, i;
 		unsigned long ci, nr_cluster;
 
 		p->flags |= SWP_SOLIDSTATE;
@@ -3139,7 +3159,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 			struct percpu_cluster *cluster;
 
 			cluster = per_cpu_ptr(p->percpu_cluster, cpu);
-			cluster->next = SWAP_NEXT_INVALID;
+			for (i = 0; i < SWAP_NR_ORDERS; i++)
+				cluster->next[i] = SWAP_NEXT_INVALID;
 		}
 	} else {
 		atomic_inc(&nr_rotate_swap);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v5 5/6] mm: vmscan: Avoid split during shrink_folio_list()
  2024-03-27 14:45 [PATCH v5 0/6] Swap-out mTHP without splitting Ryan Roberts
                   ` (3 preceding siblings ...)
  2024-03-27 14:45 ` [PATCH v5 4/6] mm: swap: Allow storage of all mTHP orders Ryan Roberts
@ 2024-03-27 14:45 ` Ryan Roberts
  2024-03-28  8:18   ` Barry Song
  2024-03-27 14:45 ` [PATCH v5 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD Ryan Roberts
  5 siblings, 1 reply; 35+ messages in thread
From: Ryan Roberts @ 2024-03-27 14:45 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Barry Song, Chris Li, Lance Yang
  Cc: Ryan Roberts, linux-mm, linux-kernel, Barry Song

Now that swap supports storing all mTHP sizes, avoid splitting large
folios before swap-out. This benefits performance of the swap-out path
by eliding split_folio_to_list(), which is expensive, and also sets us
up for swapping in large folios in a future series.

If the folio is partially mapped, we continue to split it since we want
to avoid the extra IO overhead and storage of writing out pages
uneccessarily.

Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/vmscan.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 00adaf1cb2c3..293120fe54f3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1223,11 +1223,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 					if (!can_split_folio(folio, NULL))
 						goto activate_locked;
 					/*
-					 * Split folios without a PMD map right
-					 * away. Chances are some or all of the
-					 * tail pages can be freed without IO.
+					 * Split partially mapped folios right
+					 * away. We can free the unmapped pages
+					 * without IO.
 					 */
-					if (!folio_entire_mapcount(folio) &&
+					if (data_race(!list_empty(
+						&folio->_deferred_list)) &&
 					    split_folio_to_list(folio,
 								folio_list))
 						goto activate_locked;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v5 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD
  2024-03-27 14:45 [PATCH v5 0/6] Swap-out mTHP without splitting Ryan Roberts
                   ` (4 preceding siblings ...)
  2024-03-27 14:45 ` [PATCH v5 5/6] mm: vmscan: Avoid split during shrink_folio_list() Ryan Roberts
@ 2024-03-27 14:45 ` Ryan Roberts
  2024-04-01 12:25   ` Lance Yang
  2024-04-02 10:16   ` Barry Song
  5 siblings, 2 replies; 35+ messages in thread
From: Ryan Roberts @ 2024-03-27 14:45 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Barry Song, Chris Li, Lance Yang
  Cc: Ryan Roberts, linux-mm, linux-kernel, Barry Song

Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
folio that is fully and contiguously mapped in the pageout/cold vm
range. This change means that large folios will be maintained all the
way to swap storage. This both improves performance during swap-out, by
eliding the cost of splitting the folio, and sets us up nicely for
maintaining the large folio when it is swapped back in (to be covered in
a separate series).

Folios that are not fully mapped in the target range are still split,
but note that behavior is changed so that if the split fails for any
reason (folio locked, shared, etc) we now leave it as is and move to the
next pte in the range and continue work on the proceeding folios.
Previously any failure of this sort would cause the entire operation to
give up and no folios mapped at higher addresses were paged out or made
cold. Given large folios are becoming more common, this old behavior
would have likely lead to wasted opportunities.

While we are at it, change the code that clears young from the ptes to
use ptep_test_and_clear_young(), via the new mkold_ptes() batch helper
function. This is more efficent than get_and_clear/modify/set,
especially for contpte mappings on arm64, where the old approach would
require unfolding/refolding and the new approach can be done in place.

Reviewed-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/pgtable.h | 30 ++++++++++++++
 mm/internal.h           | 12 +++++-
 mm/madvise.c            | 88 ++++++++++++++++++++++++-----------------
 mm/memory.c             |  4 +-
 4 files changed, 93 insertions(+), 41 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 8185939df1e8..391f56a1b188 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -361,6 +361,36 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 }
 #endif
 
+#ifndef mkold_ptes
+/**
+ * mkold_ptes - Mark PTEs that map consecutive pages of the same folio as old.
+ * @vma: VMA the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to mark old.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_test_and_clear_young().
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock.  The PTEs map consecutive
+ * pages that belong to the same folio.  The PTEs are all in the same PMD.
+ */
+static inline void mkold_ptes(struct vm_area_struct *vma, unsigned long addr,
+		pte_t *ptep, unsigned int nr)
+{
+	for (;;) {
+		ptep_test_and_clear_young(vma, addr, ptep);
+		if (--nr == 0)
+			break;
+		ptep++;
+		addr += PAGE_SIZE;
+	}
+}
+#endif
+
 #ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
 static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
diff --git a/mm/internal.h b/mm/internal.h
index eadb79c3a357..efee8e4cd2af 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -130,6 +130,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
  * @flags: Flags to modify the PTE batch semantics.
  * @any_writable: Optional pointer to indicate whether any entry except the
  *		  first one is writable.
+ * @any_young: Optional pointer to indicate whether any entry except the
+ *		  first one is young.
  *
  * Detect a PTE batch: consecutive (present) PTEs that map consecutive
  * pages of the same large folio.
@@ -145,16 +147,18 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
  */
 static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
-		bool *any_writable)
+		bool *any_writable, bool *any_young)
 {
 	unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
 	const pte_t *end_ptep = start_ptep + max_nr;
 	pte_t expected_pte, *ptep;
-	bool writable;
+	bool writable, young;
 	int nr;
 
 	if (any_writable)
 		*any_writable = false;
+	if (any_young)
+		*any_young = false;
 
 	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
 	VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
@@ -168,6 +172,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		pte = ptep_get(ptep);
 		if (any_writable)
 			writable = !!pte_write(pte);
+		if (any_young)
+			young = !!pte_young(pte);
 		pte = __pte_batch_clear_ignored(pte, flags);
 
 		if (!pte_same(pte, expected_pte))
@@ -183,6 +189,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 
 		if (any_writable)
 			*any_writable |= writable;
+		if (any_young)
+			*any_young |= young;
 
 		nr = pte_batch_hint(ptep, pte);
 		expected_pte = pte_advance_pfn(expected_pte, nr);
diff --git a/mm/madvise.c b/mm/madvise.c
index 070bedb4996e..bd00b83e7c50 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	LIST_HEAD(folio_list);
 	bool pageout_anon_only_filter;
 	unsigned int batch_count = 0;
+	int nr;
 
 	if (fatal_signal_pending(current))
 		return -EINTR;
@@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		return 0;
 	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
-	for (; addr < end; pte++, addr += PAGE_SIZE) {
+	for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
+		nr = 1;
 		ptent = ptep_get(pte);
 
 		if (++batch_count == SWAP_CLUSTER_MAX) {
@@ -447,55 +449,67 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			continue;
 
 		/*
-		 * Creating a THP page is expensive so split it only if we
-		 * are sure it's worth. Split it if we are only owner.
+		 * If we encounter a large folio, only split it if it is not
+		 * fully mapped within the range we are operating on. Otherwise
+		 * leave it as is so that it can be swapped out whole. If we
+		 * fail to split a folio, leave it in place and advance to the
+		 * next pte in the range.
 		 */
 		if (folio_test_large(folio)) {
-			int err;
-
-			if (folio_likely_mapped_shared(folio))
-				break;
-			if (pageout_anon_only_filter && !folio_test_anon(folio))
-				break;
-			if (!folio_trylock(folio))
-				break;
-			folio_get(folio);
-			arch_leave_lazy_mmu_mode();
-			pte_unmap_unlock(start_pte, ptl);
-			start_pte = NULL;
-			err = split_folio(folio);
-			folio_unlock(folio);
-			folio_put(folio);
-			if (err)
-				break;
-			start_pte = pte =
-				pte_offset_map_lock(mm, pmd, addr, &ptl);
-			if (!start_pte)
-				break;
-			arch_enter_lazy_mmu_mode();
-			pte--;
-			addr -= PAGE_SIZE;
-			continue;
+			const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
+						FPB_IGNORE_SOFT_DIRTY;
+			int max_nr = (end - addr) / PAGE_SIZE;
+			bool any_young;
+
+			nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
+					     fpb_flags, NULL, &any_young);
+			if (any_young)
+				ptent = pte_mkyoung(ptent);
+
+			if (nr < folio_nr_pages(folio)) {
+				int err;
+
+				if (folio_likely_mapped_shared(folio))
+					continue;
+				if (pageout_anon_only_filter && !folio_test_anon(folio))
+					continue;
+				if (!folio_trylock(folio))
+					continue;
+				folio_get(folio);
+				arch_leave_lazy_mmu_mode();
+				pte_unmap_unlock(start_pte, ptl);
+				start_pte = NULL;
+				err = split_folio(folio);
+				folio_unlock(folio);
+				folio_put(folio);
+				if (err)
+					continue;
+				start_pte = pte =
+					pte_offset_map_lock(mm, pmd, addr, &ptl);
+				if (!start_pte)
+					break;
+				arch_enter_lazy_mmu_mode();
+				nr = 0;
+				continue;
+			}
 		}
 
 		/*
 		 * Do not interfere with other mappings of this folio and
-		 * non-LRU folio.
+		 * non-LRU folio. If we have a large folio at this point, we
+		 * know it is fully mapped so if its mapcount is the same as its
+		 * number of pages, it must be exclusive.
 		 */
-		if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
+		if (!folio_test_lru(folio) ||
+		    folio_mapcount(folio) != folio_nr_pages(folio))
 			continue;
 
 		if (pageout_anon_only_filter && !folio_test_anon(folio))
 			continue;
 
-		VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-
 		if (!pageout && pte_young(ptent)) {
-			ptent = ptep_get_and_clear_full(mm, addr, pte,
-							tlb->fullmm);
-			ptent = pte_mkold(ptent);
-			set_pte_at(mm, addr, pte, ptent);
-			tlb_remove_tlb_entry(tlb, pte, addr);
+			mkold_ptes(vma, addr, pte, nr);
+			tlb_remove_tlb_entries(tlb, pte, nr, addr);
 		}
 
 		/*
diff --git a/mm/memory.c b/mm/memory.c
index 9d844582ba38..b5b48f4cf2af 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -989,7 +989,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 			flags |= FPB_IGNORE_SOFT_DIRTY;
 
 		nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags,
-				     &any_writable);
+				     &any_writable, NULL);
 		folio_ref_add(folio, nr);
 		if (folio_test_anon(folio)) {
 			if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
@@ -1553,7 +1553,7 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
 	 */
 	if (unlikely(folio_test_large(folio) && max_nr != 1)) {
 		nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, fpb_flags,
-				     NULL);
+				     NULL, NULL);
 
 		zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
 				       addr, details, rss, force_flush,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 5/6] mm: vmscan: Avoid split during shrink_folio_list()
  2024-03-27 14:45 ` [PATCH v5 5/6] mm: vmscan: Avoid split during shrink_folio_list() Ryan Roberts
@ 2024-03-28  8:18   ` Barry Song
  2024-03-28  8:48     ` Ryan Roberts
  2024-04-02 13:10     ` Ryan Roberts
  0 siblings, 2 replies; 35+ messages in thread
From: Barry Song @ 2024-03-28  8:18 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Chris Li, Lance Yang, linux-mm, linux-kernel, Barry Song

On Thu, Mar 28, 2024 at 3:45 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Now that swap supports storing all mTHP sizes, avoid splitting large
> folios before swap-out. This benefits performance of the swap-out path
> by eliding split_folio_to_list(), which is expensive, and also sets us
> up for swapping in large folios in a future series.
>
> If the folio is partially mapped, we continue to split it since we want
> to avoid the extra IO overhead and storage of writing out pages
> uneccessarily.
>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  mm/vmscan.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 00adaf1cb2c3..293120fe54f3 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1223,11 +1223,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>                                         if (!can_split_folio(folio, NULL))
>                                                 goto activate_locked;
>                                         /*
> -                                        * Split folios without a PMD map right
> -                                        * away. Chances are some or all of the
> -                                        * tail pages can be freed without IO.
> +                                        * Split partially mapped folios right
> +                                        * away. We can free the unmapped pages
> +                                        * without IO.
>                                          */
> -                                       if (!folio_entire_mapcount(folio) &&
> +                                       if (data_race(!list_empty(
> +                                               &folio->_deferred_list)) &&
>                                             split_folio_to_list(folio,
>                                                                 folio_list))
>                                                 goto activate_locked;

Hi Ryan,

Sorry for bringing up another minor issue at this late stage.

During the debugging of thp counter patch v2, I noticed the discrepancy between
THP_SWPOUT_FALLBACK and THP_SWPOUT.

Should we make adjustments to the counter?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 293120fe54f3..d7856603f689 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1241,8 +1241,10 @@ static unsigned int shrink_folio_list(struct
list_head *folio_list,
                                                                folio_list))
                                                goto activate_locked;
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-
count_memcg_folio_events(folio, THP_SWPOUT_FALLBACK, 1);
-                                       count_vm_event(THP_SWPOUT_FALLBACK);
+                                       if (folio_test_pmd_mappable(folio)) {
+
count_memcg_folio_events(folio, THP_SWPOUT_FALLBACK, 1);
+
count_vm_event(THP_SWPOUT_FALLBACK);
+                                       }
 #endif
                                        if (!add_to_swap(folio))
                                                goto activate_locked_split;


Because THP_SWPOUT is only for pmd:

static inline void count_swpout_vm_event(struct folio *folio)
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
        if (unlikely(folio_test_pmd_mappable(folio))) {
                count_memcg_folio_events(folio, THP_SWPOUT, 1);
                count_vm_event(THP_SWPOUT);
        }
#endif
        count_vm_events(PSWPOUT, folio_nr_pages(folio));
}

I can provide per-order counters for this in my THP counter patch.

> --
> 2.25.1
>

Thanks
Barry

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 5/6] mm: vmscan: Avoid split during shrink_folio_list()
  2024-03-28  8:18   ` Barry Song
@ 2024-03-28  8:48     ` Ryan Roberts
  2024-04-02 13:10     ` Ryan Roberts
  1 sibling, 0 replies; 35+ messages in thread
From: Ryan Roberts @ 2024-03-28  8:48 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Chris Li, Lance Yang, linux-mm, linux-kernel, Barry Song

On 28/03/2024 08:18, Barry Song wrote:
> On Thu, Mar 28, 2024 at 3:45 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Now that swap supports storing all mTHP sizes, avoid splitting large
>> folios before swap-out. This benefits performance of the swap-out path
>> by eliding split_folio_to_list(), which is expensive, and also sets us
>> up for swapping in large folios in a future series.
>>
>> If the folio is partially mapped, we continue to split it since we want
>> to avoid the extra IO overhead and storage of writing out pages
>> uneccessarily.
>>
>> Reviewed-by: David Hildenbrand <david@redhat.com>
>> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  mm/vmscan.c | 9 +++++----
>>  1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 00adaf1cb2c3..293120fe54f3 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1223,11 +1223,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>                                         if (!can_split_folio(folio, NULL))
>>                                                 goto activate_locked;
>>                                         /*
>> -                                        * Split folios without a PMD map right
>> -                                        * away. Chances are some or all of the
>> -                                        * tail pages can be freed without IO.
>> +                                        * Split partially mapped folios right
>> +                                        * away. We can free the unmapped pages
>> +                                        * without IO.
>>                                          */
>> -                                       if (!folio_entire_mapcount(folio) &&
>> +                                       if (data_race(!list_empty(
>> +                                               &folio->_deferred_list)) &&
>>                                             split_folio_to_list(folio,
>>                                                                 folio_list))
>>                                                 goto activate_locked;
> 
> Hi Ryan,
> 
> Sorry for bringing up another minor issue at this late stage.
> 
> During the debugging of thp counter patch v2, I noticed the discrepancy between
> THP_SWPOUT_FALLBACK and THP_SWPOUT.

Ahh good spot! I had noticed this previously and clearly forgot all about it.

I'm on holiday today and over the long weekend in the UK. I'll take a proper
look next week and send a fix.

> 
> Should we make adjustments to the counter?
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 293120fe54f3..d7856603f689 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1241,8 +1241,10 @@ static unsigned int shrink_folio_list(struct
> list_head *folio_list,
>                                                                 folio_list))
>                                                 goto activate_locked;
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -
> count_memcg_folio_events(folio, THP_SWPOUT_FALLBACK, 1);
> -                                       count_vm_event(THP_SWPOUT_FALLBACK);
> +                                       if (folio_test_pmd_mappable(folio)) {
> +
> count_memcg_folio_events(folio, THP_SWPOUT_FALLBACK, 1);
> +
> count_vm_event(THP_SWPOUT_FALLBACK);
> +                                       }
>  #endif
>                                         if (!add_to_swap(folio))
>                                                 goto activate_locked_split;
> 
> 
> Because THP_SWPOUT is only for pmd:
> 
> static inline void count_swpout_vm_event(struct folio *folio)
> {
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>         if (unlikely(folio_test_pmd_mappable(folio))) {
>                 count_memcg_folio_events(folio, THP_SWPOUT, 1);
>                 count_vm_event(THP_SWPOUT);
>         }
> #endif
>         count_vm_events(PSWPOUT, folio_nr_pages(folio));
> }
> 
> I can provide per-order counters for this in my THP counter patch.
> 
>> --
>> 2.25.1
>>
> 
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 1/6] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-03-27 14:45 ` [PATCH v5 1/6] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
@ 2024-03-29  1:56   ` Huang, Ying
  2024-04-05  9:22   ` David Hildenbrand
  1 sibling, 0 replies; 35+ messages in thread
From: Huang, Ying @ 2024-03-29  1:56 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, Barry Song,
	Chris Li, Lance Yang, linux-mm, linux-kernel

Ryan Roberts <ryan.roberts@arm.com> writes:

> As preparation for supporting small-sized THP in the swap-out path,
> without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE,
> which, when present, always implies PMD-sized THP, which is the same as
> the cluster size.
>
> The only use of the flag was to determine whether a swap entry refers to
> a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
> Instead of relying on the flag, we now pass in nr_pages, which
> originates from the folio's number of pages. This allows the logic to
> work for folios of any order.
>
> The one snag is that one of the swap_page_trans_huge_swapped() call
> sites does not have the folio. But it was only being called there to
> shortcut a call __try_to_reclaim_swap() in some cases.
> __try_to_reclaim_swap() gets the folio and (via some other functions)
> calls swap_page_trans_huge_swapped(). So I've removed the problematic
> call site and believe the new logic should be functionally equivalent.
>
> That said, removing the fast path means that we will take a reference
> and trylock a large folio much more often, which we would like to avoid.
> The next patch will solve this.
>
> Removing CLUSTER_FLAG_HUGE also means we can remove split_swap_cluster()
> which used to be called during folio splitting, since
> split_swap_cluster()'s only job was to remove the flag.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>

LGTM, Thanks!

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

> ---
>  include/linux/swap.h | 10 ----------
>  mm/huge_memory.c     |  3 ---
>  mm/swapfile.c        | 47 ++++++++------------------------------------
>  3 files changed, 8 insertions(+), 52 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a211a0383425..f6f78198f000 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -259,7 +259,6 @@ struct swap_cluster_info {
>  };
>  #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
>  #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
> -#define CLUSTER_FLAG_HUGE 4 /* This cluster is backing a transparent huge page */
>  
>  /*
>   * We assign a cluster to each CPU, so each CPU can allocate swap entry from
> @@ -590,15 +589,6 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
>  }
>  #endif /* CONFIG_SWAP */
>  
> -#ifdef CONFIG_THP_SWAP
> -extern int split_swap_cluster(swp_entry_t entry);
> -#else
> -static inline int split_swap_cluster(swp_entry_t entry)
> -{
> -	return 0;
> -}
> -#endif
> -
>  #ifdef CONFIG_MEMCG
>  static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
>  {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index b49fcb8a16cc..8c1f3393994a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2961,9 +2961,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  		shmem_uncharge(folio->mapping->host, nr_dropped);
>  	remap_page(folio, nr);
>  
> -	if (folio_test_swapcache(folio))
> -		split_swap_cluster(folio->swap);
> -
>  	/*
>  	 * set page to its compound_head when split to non order-0 pages, so
>  	 * we can skip unlocking it below, since PG_locked is transferred to
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 5e6d2304a2a4..0d44ee2b4f9c 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -343,18 +343,6 @@ static inline void cluster_set_null(struct swap_cluster_info *info)
>  	info->data = 0;
>  }
>  
> -static inline bool cluster_is_huge(struct swap_cluster_info *info)
> -{
> -	if (IS_ENABLED(CONFIG_THP_SWAP))
> -		return info->flags & CLUSTER_FLAG_HUGE;
> -	return false;
> -}
> -
> -static inline void cluster_clear_huge(struct swap_cluster_info *info)
> -{
> -	info->flags &= ~CLUSTER_FLAG_HUGE;
> -}
> -
>  static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
>  						     unsigned long offset)
>  {
> @@ -1027,7 +1015,7 @@ static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
>  	offset = idx * SWAPFILE_CLUSTER;
>  	ci = lock_cluster(si, offset);
>  	alloc_cluster(si, idx);
> -	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, CLUSTER_FLAG_HUGE);
> +	cluster_set_count(ci, SWAPFILE_CLUSTER);
>  
>  	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
>  	unlock_cluster(ci);
> @@ -1365,7 +1353,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
>  
>  	ci = lock_cluster_or_swap_info(si, offset);
>  	if (size == SWAPFILE_CLUSTER) {
> -		VM_BUG_ON(!cluster_is_huge(ci));
>  		map = si->swap_map + offset;
>  		for (i = 0; i < SWAPFILE_CLUSTER; i++) {
>  			val = map[i];
> @@ -1373,7 +1360,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
>  			if (val == SWAP_HAS_CACHE)
>  				free_entries++;
>  		}
> -		cluster_clear_huge(ci);
>  		if (free_entries == SWAPFILE_CLUSTER) {
>  			unlock_cluster_or_swap_info(si, ci);
>  			spin_lock(&si->lock);
> @@ -1395,23 +1381,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
>  	unlock_cluster_or_swap_info(si, ci);
>  }
>  
> -#ifdef CONFIG_THP_SWAP
> -int split_swap_cluster(swp_entry_t entry)
> -{
> -	struct swap_info_struct *si;
> -	struct swap_cluster_info *ci;
> -	unsigned long offset = swp_offset(entry);
> -
> -	si = _swap_info_get(entry);
> -	if (!si)
> -		return -EBUSY;
> -	ci = lock_cluster(si, offset);
> -	cluster_clear_huge(ci);
> -	unlock_cluster(ci);
> -	return 0;
> -}
> -#endif
> -
>  static int swp_entry_cmp(const void *ent1, const void *ent2)
>  {
>  	const swp_entry_t *e1 = ent1, *e2 = ent2;
> @@ -1519,22 +1488,23 @@ int swp_swapcount(swp_entry_t entry)
>  }
>  
>  static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
> -					 swp_entry_t entry)
> +					 swp_entry_t entry,
> +					 unsigned int nr_pages)
>  {
>  	struct swap_cluster_info *ci;
>  	unsigned char *map = si->swap_map;
>  	unsigned long roffset = swp_offset(entry);
> -	unsigned long offset = round_down(roffset, SWAPFILE_CLUSTER);
> +	unsigned long offset = round_down(roffset, nr_pages);
>  	int i;
>  	bool ret = false;
>  
>  	ci = lock_cluster_or_swap_info(si, offset);
> -	if (!ci || !cluster_is_huge(ci)) {
> +	if (!ci || nr_pages == 1) {
>  		if (swap_count(map[roffset]))
>  			ret = true;
>  		goto unlock_out;
>  	}
> -	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
> +	for (i = 0; i < nr_pages; i++) {
>  		if (swap_count(map[offset + i])) {
>  			ret = true;
>  			break;
> @@ -1556,7 +1526,7 @@ static bool folio_swapped(struct folio *folio)
>  	if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio)))
>  		return swap_swapcount(si, entry) != 0;
>  
> -	return swap_page_trans_huge_swapped(si, entry);
> +	return swap_page_trans_huge_swapped(si, entry, folio_nr_pages(folio));
>  }
>  
>  /**
> @@ -1622,8 +1592,7 @@ int free_swap_and_cache(swp_entry_t entry)
>  		}
>  
>  		count = __swap_entry_free(p, entry);
> -		if (count == SWAP_HAS_CACHE &&
> -		    !swap_page_trans_huge_swapped(p, entry))
> +		if (count == SWAP_HAS_CACHE)
>  			__try_to_reclaim_swap(p, swp_offset(entry),
>  					      TTRS_UNMAPPED | TTRS_FULL);
>  		put_swap_device(p);

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 4/6] mm: swap: Allow storage of all mTHP orders
  2024-03-27 14:45 ` [PATCH v5 4/6] mm: swap: Allow storage of all mTHP orders Ryan Roberts
@ 2024-04-01  3:15   ` Huang, Ying
  2024-04-02 11:18     ` Ryan Roberts
  0 siblings, 1 reply; 35+ messages in thread
From: Huang, Ying @ 2024-04-01  3:15 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, Barry Song,
	Chris Li, Lance Yang, linux-mm, linux-kernel

Ryan Roberts <ryan.roberts@arm.com> writes:

> Multi-size THP enables performance improvements by allocating large,
> pte-mapped folios for anonymous memory. However I've observed that on an
> arm64 system running a parallel workload (e.g. kernel compilation)
> across many cores, under high memory pressure, the speed regresses. This
> is due to bottlenecking on the increased number of TLBIs added due to
> all the extra folio splitting when the large folios are swapped out.
>
> Therefore, solve this regression by adding support for swapping out mTHP
> without needing to split the folio, just like is already done for
> PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled,
> and when the swap backing store is a non-rotating block device. These
> are the same constraints as for the existing PMD-sized THP swap-out
> support.
>
> Note that no attempt is made to swap-in (m)THP here - this is still done
> page-by-page, like for PMD-sized THP. But swapping-out mTHP is a
> prerequisite for swapping-in mTHP.
>
> The main change here is to improve the swap entry allocator so that it
> can allocate any power-of-2 number of contiguous entries between [1, (1
> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
> order and allocating sequentially from it until the cluster is full.
> This ensures that we don't need to search the map and we get no
> fragmentation due to alignment padding for different orders in the
> cluster. If there is no current cluster for a given order, we attempt to
> allocate a free cluster from the list. If there are no free clusters, we
> fail the allocation and the caller can fall back to splitting the folio
> and allocates individual entries (as per existing PMD-sized THP
> fallback).
>
> The per-order current clusters are maintained per-cpu using the existing
> infrastructure. This is done to avoid interleving pages from different
> tasks, which would prevent IO being batched. This is already done for
> the order-0 allocations so we follow the same pattern.
>
> As is done for order-0 per-cpu clusters, the scanner now can steal
> order-0 entries from any per-cpu-per-order reserved cluster. This
> ensures that when the swap file is getting full, space doesn't get tied
> up in the per-cpu reserves.
>
> This change only modifies swap to be able to accept any order mTHP. It
> doesn't change the callers to elide doing the actual split. That will be
> done in separate changes.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  include/linux/swap.h |  10 ++-
>  mm/swap_slots.c      |   6 +-
>  mm/swapfile.c        | 175 ++++++++++++++++++++++++-------------------
>  3 files changed, 109 insertions(+), 82 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 5e1e4f5bf0cb..11c53692f65f 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -268,13 +268,19 @@ struct swap_cluster_info {
>   */
>  #define SWAP_NEXT_INVALID	0
>  
> +#ifdef CONFIG_THP_SWAP
> +#define SWAP_NR_ORDERS		(PMD_ORDER + 1)
> +#else
> +#define SWAP_NR_ORDERS		1
> +#endif
> +
>  /*
>   * We assign a cluster to each CPU, so each CPU can allocate swap entry from
>   * its own cluster and swapout sequentially. The purpose is to optimize swapout
>   * throughput.
>   */
>  struct percpu_cluster {
> -	unsigned int next; /* Likely next allocation offset */
> +	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
>  };
>  
>  struct swap_cluster_list {
> @@ -471,7 +477,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio);
>  bool folio_free_swap(struct folio *folio);
>  void put_swap_folio(struct folio *folio, swp_entry_t entry);
>  extern swp_entry_t get_swap_page_of_type(int);
> -extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size);
> +extern int get_swap_pages(int n, swp_entry_t swp_entries[], int order);
>  extern int add_swap_count_continuation(swp_entry_t, gfp_t);
>  extern void swap_shmem_alloc(swp_entry_t);
>  extern int swap_duplicate(swp_entry_t);
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index 53abeaf1371d..13ab3b771409 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -264,7 +264,7 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
>  	cache->cur = 0;
>  	if (swap_slot_cache_active)
>  		cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE,
> -					   cache->slots, 1);
> +					   cache->slots, 0);
>  
>  	return cache->nr;
>  }
> @@ -311,7 +311,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
>  
>  	if (folio_test_large(folio)) {
>  		if (IS_ENABLED(CONFIG_THP_SWAP))
> -			get_swap_pages(1, &entry, folio_nr_pages(folio));
> +			get_swap_pages(1, &entry, folio_order(folio));
>  		goto out;
>  	}
>  
> @@ -343,7 +343,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
>  			goto out;
>  	}
>  
> -	get_swap_pages(1, &entry, 1);
> +	get_swap_pages(1, &entry, 0);
>  out:
>  	if (mem_cgroup_try_charge_swap(folio, entry)) {
>  		put_swap_folio(folio, entry);
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 1393966b77af..d56cdc547a06 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -278,15 +278,15 @@ static void discard_swap_cluster(struct swap_info_struct *si,
>  #ifdef CONFIG_THP_SWAP
>  #define SWAPFILE_CLUSTER	HPAGE_PMD_NR
>  
> -#define swap_entry_size(size)	(size)
> +#define swap_entry_order(order)	(order)
>  #else
>  #define SWAPFILE_CLUSTER	256
>  
>  /*
> - * Define swap_entry_size() as constant to let compiler to optimize
> + * Define swap_entry_order() as constant to let compiler to optimize
>   * out some code if !CONFIG_THP_SWAP
>   */
> -#define swap_entry_size(size)	1
> +#define swap_entry_order(order)	0
>  #endif
>  #define LATENCY_LIMIT		256
>  
> @@ -551,10 +551,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>  
>  /*
>   * The cluster corresponding to page_nr will be used. The cluster will be
> - * removed from free cluster list and its usage counter will be increased.
> + * removed from free cluster list and its usage counter will be increased by
> + * count.
>   */
> -static void inc_cluster_info_page(struct swap_info_struct *p,
> -	struct swap_cluster_info *cluster_info, unsigned long page_nr)
> +static void add_cluster_info_page(struct swap_info_struct *p,
> +	struct swap_cluster_info *cluster_info, unsigned long page_nr,
> +	unsigned long count)
>  {
>  	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>  
> @@ -563,9 +565,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
>  	if (cluster_is_free(&cluster_info[idx]))
>  		alloc_cluster(p, idx);
>  
> -	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
> +	VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
>  	cluster_set_count(&cluster_info[idx],
> -		cluster_count(&cluster_info[idx]) + 1);
> +		cluster_count(&cluster_info[idx]) + count);
> +}
> +
> +/*
> + * The cluster corresponding to page_nr will be used. The cluster will be
> + * removed from free cluster list and its usage counter will be increased by 1.
> + */
> +static void inc_cluster_info_page(struct swap_info_struct *p,
> +	struct swap_cluster_info *cluster_info, unsigned long page_nr)
> +{
> +	add_cluster_info_page(p, cluster_info, page_nr, 1);
>  }
>  
>  /*
> @@ -595,7 +607,7 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
>   */
>  static bool
>  scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> -	unsigned long offset)
> +	unsigned long offset, int order)
>  {
>  	struct percpu_cluster *percpu_cluster;
>  	bool conflict;
> @@ -609,24 +621,39 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>  		return false;
>  
>  	percpu_cluster = this_cpu_ptr(si->percpu_cluster);
> -	percpu_cluster->next = SWAP_NEXT_INVALID;
> +	percpu_cluster->next[order] = SWAP_NEXT_INVALID;
> +	return true;
> +}
> +
> +static inline bool swap_range_empty(char *swap_map, unsigned int start,
> +				    unsigned int nr_pages)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < nr_pages; i++) {
> +		if (swap_map[start + i])
> +			return false;
> +	}
> +
>  	return true;
>  }
>  
>  /*
> - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
> - * might involve allocating a new cluster for current CPU too.
> + * Try to get swap entries with specified order from current cpu's swap entry
> + * pool (a cluster). This might involve allocating a new cluster for current CPU
> + * too.
>   */
>  static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> -	unsigned long *offset, unsigned long *scan_base)
> +	unsigned long *offset, unsigned long *scan_base, int order)
>  {
> +	unsigned int nr_pages = 1 << order;

Use swap_entry_order()?

>  	struct percpu_cluster *cluster;
>  	struct swap_cluster_info *ci;
>  	unsigned int tmp, max;
>  
>  new_cluster:
>  	cluster = this_cpu_ptr(si->percpu_cluster);
> -	tmp = cluster->next;
> +	tmp = cluster->next[order];
>  	if (tmp == SWAP_NEXT_INVALID) {
>  		if (!cluster_list_empty(&si->free_clusters)) {
>  			tmp = cluster_next(&si->free_clusters.head) *
> @@ -647,26 +674,27 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>  
>  	/*
>  	 * Other CPUs can use our cluster if they can't find a free cluster,
> -	 * check if there is still free entry in the cluster
> +	 * check if there is still free entry in the cluster, maintaining
> +	 * natural alignment.
>  	 */
>  	max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER));
>  	if (tmp < max) {
>  		ci = lock_cluster(si, tmp);
>  		while (tmp < max) {
> -			if (!si->swap_map[tmp])
> +			if (swap_range_empty(si->swap_map, tmp, nr_pages))
>  				break;
> -			tmp++;
> +			tmp += nr_pages;
>  		}
>  		unlock_cluster(ci);
>  	}
>  	if (tmp >= max) {
> -		cluster->next = SWAP_NEXT_INVALID;
> +		cluster->next[order] = SWAP_NEXT_INVALID;
>  		goto new_cluster;
>  	}
>  	*offset = tmp;
>  	*scan_base = tmp;
> -	tmp += 1;
> -	cluster->next = tmp < max ? tmp : SWAP_NEXT_INVALID;
> +	tmp += nr_pages;
> +	cluster->next[order] = tmp < max ? tmp : SWAP_NEXT_INVALID;
>  	return true;
>  }
>  
> @@ -796,13 +824,14 @@ static bool swap_offset_available_and_locked(struct swap_info_struct *si,
>  
>  static int scan_swap_map_slots(struct swap_info_struct *si,
>  			       unsigned char usage, int nr,
> -			       swp_entry_t slots[])
> +			       swp_entry_t slots[], int order)
>  {
>  	struct swap_cluster_info *ci;
>  	unsigned long offset;
>  	unsigned long scan_base;
>  	unsigned long last_in_cluster = 0;
>  	int latency_ration = LATENCY_LIMIT;
> +	unsigned int nr_pages = 1 << order;

ditto.

Otherwise LGTM, feel free to add

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

in the future versions.

--
Best Regards,
Huang, Ying

>  	int n_ret = 0;
>  	bool scanned_many = false;
>  
> @@ -817,6 +846,25 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  	 * And we let swap pages go all over an SSD partition.  Hugh
>  	 */
>  
> +	if (order > 0) {
> +		/*
> +		 * Should not even be attempting large allocations when huge
> +		 * page swap is disabled.  Warn and fail the allocation.
> +		 */
> +		if (!IS_ENABLED(CONFIG_THP_SWAP) ||
> +		    nr_pages > SWAPFILE_CLUSTER) {
> +			VM_WARN_ON_ONCE(1);
> +			return 0;
> +		}
> +
> +		/*
> +		 * Swapfile is not block device or not using clusters so unable
> +		 * to allocate large entries.
> +		 */
> +		if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
> +			return 0;
> +	}
> +
>  	si->flags += SWP_SCANNING;
>  	/*
>  	 * Use percpu scan base for SSD to reduce lock contention on
> @@ -831,8 +879,11 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  
>  	/* SSD algorithm */
>  	if (si->cluster_info) {
> -		if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
> +		if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) {
> +			if (order > 0)
> +				goto no_page;
>  			goto scan;
> +		}
>  	} else if (unlikely(!si->cluster_nr--)) {
>  		if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
>  			si->cluster_nr = SWAPFILE_CLUSTER - 1;
> @@ -874,13 +925,16 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  
>  checks:
>  	if (si->cluster_info) {
> -		while (scan_swap_map_ssd_cluster_conflict(si, offset)) {
> +		while (scan_swap_map_ssd_cluster_conflict(si, offset, order)) {
>  		/* take a break if we already got some slots */
>  			if (n_ret)
>  				goto done;
>  			if (!scan_swap_map_try_ssd_cluster(si, &offset,
> -							&scan_base))
> +							&scan_base, order)) {
> +				if (order > 0)
> +					goto no_page;
>  				goto scan;
> +			}
>  		}
>  	}
>  	if (!(si->flags & SWP_WRITEOK))
> @@ -911,11 +965,11 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  		else
>  			goto done;
>  	}
> -	WRITE_ONCE(si->swap_map[offset], usage);
> -	inc_cluster_info_page(si, si->cluster_info, offset);
> +	memset(si->swap_map + offset, usage, nr_pages);
> +	add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>  	unlock_cluster(ci);
>  
> -	swap_range_alloc(si, offset, 1);
> +	swap_range_alloc(si, offset, nr_pages);
>  	slots[n_ret++] = swp_entry(si->type, offset);
>  
>  	/* got enough slots or reach max slots? */
> @@ -936,8 +990,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  
>  	/* try to get more slots in cluster */
>  	if (si->cluster_info) {
> -		if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
> +		if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order))
>  			goto checks;
> +		if (order > 0)
> +			goto done;
>  	} else if (si->cluster_nr && !si->swap_map[++offset]) {
>  		/* non-ssd case, still more slots in cluster? */
>  		--si->cluster_nr;
> @@ -964,11 +1020,13 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  	}
>  
>  done:
> -	set_cluster_next(si, offset + 1);
> +	if (order == 0)
> +		set_cluster_next(si, offset + 1);
>  	si->flags -= SWP_SCANNING;
>  	return n_ret;
>  
>  scan:
> +	VM_WARN_ON(order > 0);
>  	spin_unlock(&si->lock);
>  	while (++offset <= READ_ONCE(si->highest_bit)) {
>  		if (unlikely(--latency_ration < 0)) {
> @@ -997,38 +1055,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  	return n_ret;
>  }
>  
> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
> -{
> -	unsigned long idx;
> -	struct swap_cluster_info *ci;
> -	unsigned long offset;
> -
> -	/*
> -	 * Should not even be attempting cluster allocations when huge
> -	 * page swap is disabled.  Warn and fail the allocation.
> -	 */
> -	if (!IS_ENABLED(CONFIG_THP_SWAP)) {
> -		VM_WARN_ON_ONCE(1);
> -		return 0;
> -	}
> -
> -	if (cluster_list_empty(&si->free_clusters))
> -		return 0;
> -
> -	idx = cluster_list_first(&si->free_clusters);
> -	offset = idx * SWAPFILE_CLUSTER;
> -	ci = lock_cluster(si, offset);
> -	alloc_cluster(si, idx);
> -	cluster_set_count(ci, SWAPFILE_CLUSTER);
> -
> -	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
> -	unlock_cluster(ci);
> -	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
> -	*slot = swp_entry(si->type, offset);
> -
> -	return 1;
> -}
> -
>  static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>  {
>  	unsigned long offset = idx * SWAPFILE_CLUSTER;
> @@ -1042,17 +1068,15 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>  	swap_range_free(si, offset, SWAPFILE_CLUSTER);
>  }
>  
> -int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
> +int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
>  {
> -	unsigned long size = swap_entry_size(entry_size);
> +	int order = swap_entry_order(entry_order);
> +	unsigned long size = 1 << order;
>  	struct swap_info_struct *si, *next;
>  	long avail_pgs;
>  	int n_ret = 0;
>  	int node;
>  
> -	/* Only single cluster request supported */
> -	WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
> -
>  	spin_lock(&swap_avail_lock);
>  
>  	avail_pgs = atomic_long_read(&nr_swap_pages) / size;
> @@ -1088,14 +1112,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>  			spin_unlock(&si->lock);
>  			goto nextsi;
>  		}
> -		if (size == SWAPFILE_CLUSTER) {
> -			if (si->flags & SWP_BLKDEV)
> -				n_ret = swap_alloc_cluster(si, swp_entries);
> -		} else
> -			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
> -						    n_goal, swp_entries);
> +		n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
> +					    n_goal, swp_entries, order);
>  		spin_unlock(&si->lock);
> -		if (n_ret || size == SWAPFILE_CLUSTER)
> +		if (n_ret || size > 1)
>  			goto check_out;
>  		cond_resched();
>  
> @@ -1349,7 +1369,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
>  	unsigned char *map;
>  	unsigned int i, free_entries = 0;
>  	unsigned char val;
> -	int size = swap_entry_size(folio_nr_pages(folio));
> +	int size = 1 << swap_entry_order(folio_order(folio));
>  
>  	si = _swap_info_get(entry);
>  	if (!si)
> @@ -1647,7 +1667,7 @@ swp_entry_t get_swap_page_of_type(int type)
>  
>  	/* This is called for allocating swap entry, not cache */
>  	spin_lock(&si->lock);
> -	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry))
> +	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
>  		atomic_long_dec(&nr_swap_pages);
>  	spin_unlock(&si->lock);
>  fail:
> @@ -3101,7 +3121,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>  		p->flags |= SWP_SYNCHRONOUS_IO;
>  
>  	if (p->bdev && bdev_nonrot(p->bdev)) {
> -		int cpu;
> +		int cpu, i;
>  		unsigned long ci, nr_cluster;
>  
>  		p->flags |= SWP_SOLIDSTATE;
> @@ -3139,7 +3159,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>  			struct percpu_cluster *cluster;
>  
>  			cluster = per_cpu_ptr(p->percpu_cluster, cpu);
> -			cluster->next = SWAP_NEXT_INVALID;
> +			for (i = 0; i < SWAP_NR_ORDERS; i++)
> +				cluster->next[i] = SWAP_NEXT_INVALID;
>  		}
>  	} else {
>  		atomic_inc(&nr_rotate_swap);

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
  2024-03-27 14:45 ` [PATCH v5 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache() Ryan Roberts
@ 2024-04-01  5:52   ` Huang, Ying
  2024-04-02 11:15     ` Ryan Roberts
  2024-04-03  0:30   ` Zi Yan
  1 sibling, 1 reply; 35+ messages in thread
From: Huang, Ying @ 2024-04-01  5:52 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, Barry Song,
	Chris Li, Lance Yang, linux-mm, linux-kernel

Ryan Roberts <ryan.roberts@arm.com> writes:

> Now that we no longer have a convenient flag in the cluster to determine
> if a folio is large, free_swap_and_cache() will take a reference and
> lock a large folio much more often, which could lead to contention and
> (e.g.) failure to split large folios, etc.
>
> Let's solve that problem by batch freeing swap and cache with a new
> function, free_swap_and_cache_nr(), to free a contiguous range of swap
> entries together. This allows us to first drop a reference to each swap
> slot before we try to release the cache folio. This means we only try to
> release the folio once, only taking the reference and lock once - much
> better than the previous 512 times for the 2M THP case.
>
> Contiguous swap entries are gathered in zap_pte_range() and
> madvise_free_pte_range() in a similar way to how present ptes are
> already gathered in zap_pte_range().
>
> While we are at it, let's simplify by converting the return type of both
> functions to void. The return value was used only by zap_pte_range() to
> print a bad pte, and was ignored by everyone else, so the extra
> reporting wasn't exactly guaranteed. We will still get the warning with
> most of the information from get_swap_device(). With the batch version,
> we wouldn't know which pte was bad anyway so could print the wrong one.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  include/linux/pgtable.h | 28 +++++++++++++++
>  include/linux/swap.h    | 12 +++++--
>  mm/internal.h           | 48 +++++++++++++++++++++++++
>  mm/madvise.c            | 12 ++++---
>  mm/memory.c             | 13 +++----
>  mm/swapfile.c           | 78 ++++++++++++++++++++++++++++++-----------
>  6 files changed, 157 insertions(+), 34 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 09c85c7bf9c2..8185939df1e8 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -708,6 +708,34 @@ static inline void pte_clear_not_present_full(struct mm_struct *mm,
>  }
>  #endif
>  
> +#ifndef clear_not_present_full_ptes
> +/**
> + * clear_not_present_full_ptes - Clear consecutive not present PTEs.
> + * @mm: Address space the ptes represent.
> + * @addr: Address of the first pte.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to clear.
> + * @full: Whether we are clearing a full mm.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over pte_clear_not_present_full().
> + *
> + * Context: The caller holds the page table lock.  The PTEs are all not present.
> + * The PTEs are all in the same PMD.
> + */
> +static inline void clear_not_present_full_ptes(struct mm_struct *mm,
> +		unsigned long addr, pte_t *ptep, unsigned int nr, int full)
> +{
> +	for (;;) {
> +		pte_clear_not_present_full(mm, addr, ptep, full);
> +		if (--nr == 0)
> +			break;
> +		ptep++;
> +		addr += PAGE_SIZE;
> +	}
> +}
> +#endif
> +
>  #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
>  extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
>  			      unsigned long address,
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index f6f78198f000..5737236dc3ce 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -471,7 +471,7 @@ extern int swap_duplicate(swp_entry_t);
>  extern int swapcache_prepare(swp_entry_t);
>  extern void swap_free(swp_entry_t);
>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> -extern int free_swap_and_cache(swp_entry_t);
> +extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>  int swap_type_of(dev_t device, sector_t offset);
>  int find_first_swap(dev_t *device);
>  extern unsigned int count_swap_pages(int, int);
> @@ -520,8 +520,9 @@ static inline void put_swap_device(struct swap_info_struct *si)
>  #define free_pages_and_swap_cache(pages, nr) \
>  	release_pages((pages), (nr));
>  
> -/* used to sanity check ptes in zap_pte_range when CONFIG_SWAP=0 */
> -#define free_swap_and_cache(e) is_pfn_swap_entry(e)
> +static inline void free_swap_and_cache_nr(swp_entry_t entry, int nr)
> +{
> +}
>  
>  static inline void free_swap_cache(struct folio *folio)
>  {
> @@ -589,6 +590,11 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
>  }
>  #endif /* CONFIG_SWAP */
>  
> +static inline void free_swap_and_cache(swp_entry_t entry)
> +{
> +	free_swap_and_cache_nr(entry, 1);
> +}
> +
>  #ifdef CONFIG_MEMCG
>  static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
>  {
> diff --git a/mm/internal.h b/mm/internal.h
> index 8e11f7b2da21..eadb79c3a357 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -11,6 +11,8 @@
>  #include <linux/mm.h>
>  #include <linux/pagemap.h>
>  #include <linux/rmap.h>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
>  #include <linux/tracepoint-defs.h>
>  
>  struct folio_batch;
> @@ -189,6 +191,52 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>  
>  	return min(ptep - start_ptep, max_nr);
>  }
> +
> +/**
> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
> + * @start_ptep: Page table pointer for the first entry.
> + * @max_nr: The maximum number of table entries to consider.
> + * @entry: Swap entry recovered from the first table entry.
> + *
> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
> + * containing swap entries all with consecutive offsets and targeting the same
> + * swap type.
> + *
> + * max_nr must be at least one and must be limited by the caller so scanning
> + * cannot exceed a single page table.
> + *
> + * Return: the number of table entries in the batch.
> + */
> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr,
> +				 swp_entry_t entry)
> +{
> +	const pte_t *end_ptep = start_ptep + max_nr;
> +	unsigned long expected_offset = swp_offset(entry) + 1;
> +	unsigned int expected_type = swp_type(entry);
> +	pte_t *ptep = start_ptep + 1;
> +
> +	VM_WARN_ON(max_nr < 1);
> +	VM_WARN_ON(non_swap_entry(entry));
> +
> +	while (ptep < end_ptep) {
> +		pte_t pte = ptep_get(ptep);
> +
> +		if (pte_none(pte) || pte_present(pte))
> +			break;
> +
> +		entry = pte_to_swp_entry(pte);
> +
> +		if (non_swap_entry(entry) ||
> +		    swp_type(entry) != expected_type ||
> +		    swp_offset(entry) != expected_offset)
> +			break;
> +
> +		expected_offset++;
> +		ptep++;
> +	}
> +
> +	return ptep - start_ptep;
> +}
>  #endif /* CONFIG_MMU */
>  
>  void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 1f77a51baaac..070bedb4996e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  	struct folio *folio;
>  	int nr_swap = 0;
>  	unsigned long next;
> +	int nr, max_nr;
>  
>  	next = pmd_addr_end(addr, end);
>  	if (pmd_trans_huge(*pmd))
> @@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  		return 0;
>  	flush_tlb_batched_pending(mm);
>  	arch_enter_lazy_mmu_mode();
> -	for (; addr != end; pte++, addr += PAGE_SIZE) {
> +	for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
> +		nr = 1;
>  		ptent = ptep_get(pte);
>  
>  		if (pte_none(ptent))
> @@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  
>  			entry = pte_to_swp_entry(ptent);
>  			if (!non_swap_entry(entry)) {
> -				nr_swap--;
> -				free_swap_and_cache(entry);
> -				pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> +				max_nr = (end - addr) / PAGE_SIZE;
> +				nr = swap_pte_batch(pte, max_nr, entry);
> +				nr_swap -= nr;
> +				free_swap_and_cache_nr(entry, nr);
> +				clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>  			} else if (is_hwpoison_entry(entry) ||
>  				   is_poisoned_swp_entry(entry)) {
>  				pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> diff --git a/mm/memory.c b/mm/memory.c
> index 36191a9c799c..9d844582ba38 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1631,12 +1631,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  				folio_remove_rmap_pte(folio, page, vma);
>  			folio_put(folio);
>  		} else if (!non_swap_entry(entry)) {
> -			/* Genuine swap entry, hence a private anon page */
> +			max_nr = (end - addr) / PAGE_SIZE;
> +			nr = swap_pte_batch(pte, max_nr, entry);
> +			/* Genuine swap entries, hence a private anon pages */
>  			if (!should_zap_cows(details))
>  				continue;
> -			rss[MM_SWAPENTS]--;
> -			if (unlikely(!free_swap_and_cache(entry)))
> -				print_bad_pte(vma, addr, ptent, NULL);
> +			rss[MM_SWAPENTS] -= nr;
> +			free_swap_and_cache_nr(entry, nr);
>  		} else if (is_migration_entry(entry)) {
>  			folio = pfn_swap_entry_folio(entry);
>  			if (!should_zap_folio(details, folio))
> @@ -1659,8 +1660,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  			pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
>  			WARN_ON_ONCE(1);
>  		}
> -		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> -		zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
> +		clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
> +		zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
>  	} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>  
>  	add_mm_rss_vec(mm, rss);
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 0d44ee2b4f9c..cedfc82d37e5 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -130,7 +130,11 @@ static inline unsigned char swap_count(unsigned char ent)
>  /* Reclaim the swap entry if swap is getting full*/
>  #define TTRS_FULL		0x4
>  
> -/* returns 1 if swap entry is freed */
> +/*
> + * returns number of pages in the folio that backs the swap entry. If positive,
> + * the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no
> + * folio was associated with the swap entry.
> + */
>  static int __try_to_reclaim_swap(struct swap_info_struct *si,
>  				 unsigned long offset, unsigned long flags)
>  {
> @@ -155,6 +159,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
>  			ret = folio_free_swap(folio);
>  		folio_unlock(folio);
>  	}
> +	ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio);
>  	folio_put(folio);
>  	return ret;
>  }
> @@ -895,7 +900,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  		swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
>  		spin_lock(&si->lock);
>  		/* entry was freed successfully, try to use this again */
> -		if (swap_was_freed)
> +		if (swap_was_freed > 0)
>  			goto checks;
>  		goto scan; /* check next one */
>  	}
> @@ -1572,32 +1577,63 @@ bool folio_free_swap(struct folio *folio)
>  	return true;
>  }
>  
> -/*
> - * Free the swap entry like above, but also try to
> - * free the page cache entry if it is the last user.
> - */
> -int free_swap_and_cache(swp_entry_t entry)
> +void free_swap_and_cache_nr(swp_entry_t entry, int nr)
>  {
> -	struct swap_info_struct *p;
> -	unsigned char count;
> +	unsigned long end = swp_offset(entry) + nr;
> +	unsigned int type = swp_type(entry);
> +	struct swap_info_struct *si;
> +	unsigned long offset;
>  
>  	if (non_swap_entry(entry))
> -		return 1;
> +		return;
>  
> -	p = get_swap_device(entry);
> -	if (p) {
> -		if (WARN_ON(data_race(!p->swap_map[swp_offset(entry)]))) {
> -			put_swap_device(p);
> -			return 0;
> -		}
> +	si = get_swap_device(entry);
> +	if (!si)
> +		return;
>  
> -		count = __swap_entry_free(p, entry);
> -		if (count == SWAP_HAS_CACHE)
> -			__try_to_reclaim_swap(p, swp_offset(entry),
> +	if (WARN_ON(end > si->max))
> +		goto out;
> +
> +	/*
> +	 * First free all entries in the range.
> +	 */
> +	for (offset = swp_offset(entry); offset < end; offset++) {
> +		if (!WARN_ON(data_race(!si->swap_map[offset])))
> +			__swap_entry_free(si, swp_entry(type, offset));

I think that it's better to check the return value of
__swap_entry_free() here.  When the return value != SWAP_HAS_CACHE, we
can try to reclaim all swap entries we have checked before, then restart
the check with the new start.

> +	}
> +
> +	/*
> +	 * Now go back over the range trying to reclaim the swap cache. This is
> +	 * more efficient for large folios because we will only try to reclaim
> +	 * the swap once per folio in the common case. If we do
> +	 * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the
> +	 * latter will get a reference and lock the folio for every individual
> +	 * page but will only succeed once the swap slot for every subpage is
> +	 * zero.
> +	 */
> +	for (offset = swp_offset(entry); offset < end; offset += nr) {
> +		nr = 1;
> +		if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
> +			/*
> +			 * Folios are always naturally aligned in swap so
> +			 * advance forward to the next boundary. Zero means no
> +			 * folio was found for the swap entry, so advance by 1
> +			 * in this case. Negative value means folio was found
> +			 * but could not be reclaimed. Here we can still advance
> +			 * to the next boundary.
> +			 */
> +			nr = __try_to_reclaim_swap(si, offset,
>  					      TTRS_UNMAPPED | TTRS_FULL);
> -		put_swap_device(p);
> +			if (nr == 0)
> +				nr = 1;
> +			else if (nr < 0)
> +				nr = -nr;
> +			nr = ALIGN(offset + 1, nr) - offset;
> +		}
>  	}
> -	return p != NULL;
> +
> +out:
> +	put_swap_device(si);
>  }
>  
>  #ifdef CONFIG_HIBERNATION

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD
  2024-03-27 14:45 ` [PATCH v5 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD Ryan Roberts
@ 2024-04-01 12:25   ` Lance Yang
  2024-04-02 11:20     ` Ryan Roberts
  2024-04-02 10:16   ` Barry Song
  1 sibling, 1 reply; 35+ messages in thread
From: Lance Yang @ 2024-04-01 12:25 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Barry Song, Chris Li, linux-mm, linux-kernel, Barry Song

On Wed, Mar 27, 2024 at 10:46 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> folio that is fully and contiguously mapped in the pageout/cold vm
> range. This change means that large folios will be maintained all the
> way to swap storage. This both improves performance during swap-out, by
> eliding the cost of splitting the folio, and sets us up nicely for
> maintaining the large folio when it is swapped back in (to be covered in
> a separate series).
>
> Folios that are not fully mapped in the target range are still split,
> but note that behavior is changed so that if the split fails for any
> reason (folio locked, shared, etc) we now leave it as is and move to the
> next pte in the range and continue work on the proceeding folios.
> Previously any failure of this sort would cause the entire operation to
> give up and no folios mapped at higher addresses were paged out or made
> cold. Given large folios are becoming more common, this old behavior
> would have likely lead to wasted opportunities.
>
> While we are at it, change the code that clears young from the ptes to
> use ptep_test_and_clear_young(), via the new mkold_ptes() batch helper
> function. This is more efficent than get_and_clear/modify/set,
> especially for contpte mappings on arm64, where the old approach would
> require unfolding/refolding and the new approach can be done in place.
>
> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  include/linux/pgtable.h | 30 ++++++++++++++
>  mm/internal.h           | 12 +++++-
>  mm/madvise.c            | 88 ++++++++++++++++++++++++-----------------
>  mm/memory.c             |  4 +-
>  4 files changed, 93 insertions(+), 41 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 8185939df1e8..391f56a1b188 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -361,6 +361,36 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>  }
>  #endif
>
> +#ifndef mkold_ptes
> +/**
> + * mkold_ptes - Mark PTEs that map consecutive pages of the same folio as old.
> + * @vma: VMA the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to mark old.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_test_and_clear_young().
> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> + * some PTEs might be write-protected.
> + *
> + * Context: The caller holds the page table lock.  The PTEs map consecutive
> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
> + */
> +static inline void mkold_ptes(struct vm_area_struct *vma, unsigned long addr,
> +               pte_t *ptep, unsigned int nr)
> +{
> +       for (;;) {
> +               ptep_test_and_clear_young(vma, addr, ptep);

IIUC, if the first PTE is a CONT-PTE, then calling ptep_test_and_clear_young()
will clear the young bit for the entire contig range to avoid having
to unfold. So,
the other PTEs within the range don't need to clear again.

Maybe we should consider overriding mkold_ptes for arm64?

Thanks,
Lance

> +               if (--nr == 0)
> +                       break;
> +               ptep++;
> +               addr += PAGE_SIZE;
> +       }
> +}
> +#endif
> +
>  #ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
>  static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
> diff --git a/mm/internal.h b/mm/internal.h
> index eadb79c3a357..efee8e4cd2af 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -130,6 +130,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>   * @flags: Flags to modify the PTE batch semantics.
>   * @any_writable: Optional pointer to indicate whether any entry except the
>   *               first one is writable.
> + * @any_young: Optional pointer to indicate whether any entry except the
> + *               first one is young.
>   *
>   * Detect a PTE batch: consecutive (present) PTEs that map consecutive
>   * pages of the same large folio.
> @@ -145,16 +147,18 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>   */
>  static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>                 pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> -               bool *any_writable)
> +               bool *any_writable, bool *any_young)
>  {
>         unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>         const pte_t *end_ptep = start_ptep + max_nr;
>         pte_t expected_pte, *ptep;
> -       bool writable;
> +       bool writable, young;
>         int nr;
>
>         if (any_writable)
>                 *any_writable = false;
> +       if (any_young)
> +               *any_young = false;
>
>         VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>         VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
> @@ -168,6 +172,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>                 pte = ptep_get(ptep);
>                 if (any_writable)
>                         writable = !!pte_write(pte);
> +               if (any_young)
> +                       young = !!pte_young(pte);
>                 pte = __pte_batch_clear_ignored(pte, flags);
>
>                 if (!pte_same(pte, expected_pte))
> @@ -183,6 +189,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>
>                 if (any_writable)
>                         *any_writable |= writable;
> +               if (any_young)
> +                       *any_young |= young;
>
>                 nr = pte_batch_hint(ptep, pte);
>                 expected_pte = pte_advance_pfn(expected_pte, nr);
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 070bedb4996e..bd00b83e7c50 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>         LIST_HEAD(folio_list);
>         bool pageout_anon_only_filter;
>         unsigned int batch_count = 0;
> +       int nr;
>
>         if (fatal_signal_pending(current))
>                 return -EINTR;
> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>                 return 0;
>         flush_tlb_batched_pending(mm);
>         arch_enter_lazy_mmu_mode();
> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> +               nr = 1;
>                 ptent = ptep_get(pte);
>
>                 if (++batch_count == SWAP_CLUSTER_MAX) {
> @@ -447,55 +449,67 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>                         continue;
>
>                 /*
> -                * Creating a THP page is expensive so split it only if we
> -                * are sure it's worth. Split it if we are only owner.
> +                * If we encounter a large folio, only split it if it is not
> +                * fully mapped within the range we are operating on. Otherwise
> +                * leave it as is so that it can be swapped out whole. If we
> +                * fail to split a folio, leave it in place and advance to the
> +                * next pte in the range.
>                  */
>                 if (folio_test_large(folio)) {
> -                       int err;
> -
> -                       if (folio_likely_mapped_shared(folio))
> -                               break;
> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
> -                               break;
> -                       if (!folio_trylock(folio))
> -                               break;
> -                       folio_get(folio);
> -                       arch_leave_lazy_mmu_mode();
> -                       pte_unmap_unlock(start_pte, ptl);
> -                       start_pte = NULL;
> -                       err = split_folio(folio);
> -                       folio_unlock(folio);
> -                       folio_put(folio);
> -                       if (err)
> -                               break;
> -                       start_pte = pte =
> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
> -                       if (!start_pte)
> -                               break;
> -                       arch_enter_lazy_mmu_mode();
> -                       pte--;
> -                       addr -= PAGE_SIZE;
> -                       continue;
> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> +                                               FPB_IGNORE_SOFT_DIRTY;
> +                       int max_nr = (end - addr) / PAGE_SIZE;
> +                       bool any_young;
> +
> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> +                                            fpb_flags, NULL, &any_young);
> +                       if (any_young)
> +                               ptent = pte_mkyoung(ptent);
> +
> +                       if (nr < folio_nr_pages(folio)) {
> +                               int err;
> +
> +                               if (folio_likely_mapped_shared(folio))
> +                                       continue;
> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
> +                                       continue;
> +                               if (!folio_trylock(folio))
> +                                       continue;
> +                               folio_get(folio);
> +                               arch_leave_lazy_mmu_mode();
> +                               pte_unmap_unlock(start_pte, ptl);
> +                               start_pte = NULL;
> +                               err = split_folio(folio);
> +                               folio_unlock(folio);
> +                               folio_put(folio);
> +                               if (err)
> +                                       continue;
> +                               start_pte = pte =
> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
> +                               if (!start_pte)
> +                                       break;
> +                               arch_enter_lazy_mmu_mode();
> +                               nr = 0;
> +                               continue;
> +                       }
>                 }
>
>                 /*
>                  * Do not interfere with other mappings of this folio and
> -                * non-LRU folio.
> +                * non-LRU folio. If we have a large folio at this point, we
> +                * know it is fully mapped so if its mapcount is the same as its
> +                * number of pages, it must be exclusive.
>                  */
> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> +               if (!folio_test_lru(folio) ||
> +                   folio_mapcount(folio) != folio_nr_pages(folio))
>                         continue;
>
>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
>                         continue;
>
> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> -
>                 if (!pageout && pte_young(ptent)) {
> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
> -                                                       tlb->fullmm);
> -                       ptent = pte_mkold(ptent);
> -                       set_pte_at(mm, addr, pte, ptent);
> -                       tlb_remove_tlb_entry(tlb, pte, addr);
> +                       mkold_ptes(vma, addr, pte, nr);
> +                       tlb_remove_tlb_entries(tlb, pte, nr, addr);
>                 }
>
>                 /*
> diff --git a/mm/memory.c b/mm/memory.c
> index 9d844582ba38..b5b48f4cf2af 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -989,7 +989,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>                         flags |= FPB_IGNORE_SOFT_DIRTY;
>
>                 nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags,
> -                                    &any_writable);
> +                                    &any_writable, NULL);
>                 folio_ref_add(folio, nr);
>                 if (folio_test_anon(folio)) {
>                         if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
> @@ -1553,7 +1553,7 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
>          */
>         if (unlikely(folio_test_large(folio) && max_nr != 1)) {
>                 nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, fpb_flags,
> -                                    NULL);
> +                                    NULL, NULL);
>
>                 zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
>                                        addr, details, rss, force_flush,
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD
  2024-03-27 14:45 ` [PATCH v5 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD Ryan Roberts
  2024-04-01 12:25   ` Lance Yang
@ 2024-04-02 10:16   ` Barry Song
  2024-04-02 10:56     ` Ryan Roberts
  1 sibling, 1 reply; 35+ messages in thread
From: Barry Song @ 2024-04-02 10:16 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Chris Li, Lance Yang, linux-mm, linux-kernel, Barry Song

On Thu, Mar 28, 2024 at 3:46 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> folio that is fully and contiguously mapped in the pageout/cold vm
> range. This change means that large folios will be maintained all the
> way to swap storage. This both improves performance during swap-out, by
> eliding the cost of splitting the folio, and sets us up nicely for
> maintaining the large folio when it is swapped back in (to be covered in
> a separate series).
>
> Folios that are not fully mapped in the target range are still split,
> but note that behavior is changed so that if the split fails for any
> reason (folio locked, shared, etc) we now leave it as is and move to the
> next pte in the range and continue work on the proceeding folios.
> Previously any failure of this sort would cause the entire operation to
> give up and no folios mapped at higher addresses were paged out or made
> cold. Given large folios are becoming more common, this old behavior
> would have likely lead to wasted opportunities.
>
> While we are at it, change the code that clears young from the ptes to
> use ptep_test_and_clear_young(), via the new mkold_ptes() batch helper
> function. This is more efficent than get_and_clear/modify/set,
> especially for contpte mappings on arm64, where the old approach would
> require unfolding/refolding and the new approach can be done in place.
>
> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---

Hi Ryan,

I'm not entirely certain if this issue is related to this patch, but
I've encountered
the KNIC twice while using the latest mm-unstable kernel. Each time I attempted
to debug it, the issue vanished. I'm posting here to see if you have
any ideas on it :-)

[   50.444066]
[   50.444495] =====================================
[   50.444954] WARNING: bad unlock balance detected!
[   50.445443] 6.9.0-rc2-00257-g2d9f63c285db #128 Not tainted
[   50.446233] -------------------------------------
[   50.446684] singlethread/102 is trying to release lock
(ptlock_ptr(ptdesc)) at:
[   50.447635] [<ffffb03155fe302c>]
madvise_cold_or_pageout_pte_range+0x80c/0xea0
[   50.449066] but there are no more locks to release!
[   50.449535]
[   50.449535] other info that might help us debug this:
[   50.450140] 1 lock held by singlethread/102:
[   50.450688]  #0: ffff0000c001f208 (&mm->mmap_lock){++++}-{4:4}, at:
do_madvise.part.0+0x178/0x518
[   50.452321]
[   50.452321] stack backtrace:
[   50.452959] CPU: 3 PID: 102 Comm: singlethread Not tainted
6.9.0-rc2-00257-g2d9f63c285db #128
[   50.453812] Hardware name: linux,dummy-virt (DT)
[   50.454373] Call trace:
[   50.454755]  dump_backtrace+0x9c/0x100
[   50.455246]  show_stack+0x20/0x38
[   50.455667]  dump_stack_lvl+0xec/0x150
[   50.456111]  dump_stack+0x18/0x28
[   50.456533]  print_unlock_imbalance_bug+0x130/0x148
[   50.457014]  lock_release+0x2e0/0x360
[   50.457487]  _raw_spin_unlock+0x2c/0x78
[   50.457997]  madvise_cold_or_pageout_pte_range+0x80c/0xea0
[   50.458635]  walk_pgd_range+0x388/0x7d8
[   50.459168]  __walk_page_range+0x1e0/0x1f0
[   50.459682]  walk_page_range+0x1f0/0x2c8
[   50.460225]  madvise_pageout+0xf8/0x280
[   50.460711]  madvise_vma_behavior+0x310/0x9b8
[   50.461169]  madvise_walk_vmas+0xc0/0x128
[   50.461605]  do_madvise.part.0+0xf8/0x518
[   50.462041]  __arm64_sys_madvise+0x68/0x88
[   50.462529]  invoke_syscall+0x50/0x128
[   50.463001]  el0_svc_common.constprop.0+0x48/0xf8
[   50.463508]  do_el0_svc+0x28/0x40
[   50.464004]  el0_svc+0x50/0x150
[   50.464492]  el0t_64_sync_handler+0x13c/0x158
[   50.465021]  el0t_64_sync+0x1a4/0x1a8
[   50.466959] ------------[ cut here ]------------
[   50.467451] WARNING: CPU: 3 PID: 102 at
kernel/rcu/tree_plugin.h:431 __rcu_read_unlock+0x74/0x218
[   50.468160] Modules linked in:
[   50.468803] CPU: 3 PID: 102 Comm: singlethread Not tainted
6.9.0-rc2-00257-g2d9f63c285db #128
[   50.469658] Hardware name: linux,dummy-virt (DT)
[   50.470293] pstate: a3400005 (NzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[   50.470991] pc : __rcu_read_unlock+0x74/0x218
[   50.471594] lr : madvise_cold_or_pageout_pte_range+0x828/0xea0
[   50.472236] sp : ffff800080abb7e0
[   50.472622] pmr_save: 000000e0
[   50.473010] x29: ffff800080abb7e0 x28: 0000ffffa467a000 x27: fffffdffc3128c00
[   50.474006] x26: 0010000000000001 x25: 000000000000001b x24: ffff0000c32d73d0
[   50.474971] x23: 0060000104a3afc3 x22: ffff0000c2492840 x21: 0400000000000001
[   50.475943] x20: ff77fffffffffbff x19: ffff0000c3230000 x18: ffffffffffffffff
[   50.477286] x17: 672d37353230302d x16: 3263722d302e392e x15: ffff800100abb227
[   50.478373] x14: 0000000000000001 x13: 38613178302f3461 x12: 3178302b636e7973
[   50.479354] x11: fffffffffffe0000 x10: ffffb03159697d08 x9 : ffffb03155fe3048
[   50.480265] x8 : 00000000ffffefff x7 : ffffb03159697d08 x6 : 0000000000000000
[   50.481154] x5 : 0000000000000001 x4 : ffff800080abbfe0 x3 : 0000000000000000
[   50.482035] x2 : ffff4fd055074000 x1 : 00000000ffffffff x0 : 000000003fffffff
[   50.483163] Call trace:
[   50.483599]  __rcu_read_unlock+0x74/0x218
[   50.484152]  madvise_cold_or_pageout_pte_range+0x828/0xea0
[   50.484780]  walk_pgd_range+0x388/0x7d8
[   50.485328]  __walk_page_range+0x1e0/0x1f0
[   50.485725]  walk_page_range+0x1f0/0x2c8
[   50.486117]  madvise_pageout+0xf8/0x280
[   50.486547]  madvise_vma_behavior+0x310/0x9b8
[   50.486975]  madvise_walk_vmas+0xc0/0x128
[   50.487403]  do_madvise.part.0+0xf8/0x518
[   50.487845]  __arm64_sys_madvise+0x68/0x88
[   50.488374]  invoke_syscall+0x50/0x128
[   50.488946]  el0_svc_common.constprop.0+0x48/0xf8
[   50.489732]  do_el0_svc+0x28/0x40
[   50.490210]  el0_svc+0x50/0x150
[   50.490674]  el0t_64_sync_handler+0x13c/0x158
[   50.491257]  el0t_64_sync+0x1a4/0x1a8
[   50.491793] irq event stamp: 3087
[   50.492243] hardirqs last  enabled at (3087): [<ffffb031570d89d8>]
_raw_spin_unlock_irq+0x38/0x90
[   50.492917] hardirqs last disabled at (3086): [<ffffb031570d8acc>]
_raw_spin_lock_irq+0x9c/0xc0
[   50.493742] softirqs last  enabled at (2470): [<ffffb03155c10d94>]
__do_softirq+0x534/0x588
[   50.494567] softirqs last disabled at (2461): [<ffffb03155c17238>]
____do_softirq+0x18/0x30
[   50.495328] ---[ end trace 0000000000000000 ]---
[   50.497110] BUG: sleeping function called from invalid context at
kernel/locking/rwsem.c:1578
[   50.497544] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid:
102, name: singlethread
[   50.497652] preempt_count: ffffffff, expected: 0
[   50.497728] RCU nest depth: -1, expected: 0
[   50.497851] INFO: lockdep is turned off.
[   50.498023] CPU: 3 PID: 102 Comm: singlethread Tainted: G        W
        6.9.0-rc2-00257-g2d9f63c285db #128
[   50.498166] Hardware name: linux,dummy-virt (DT)
[   50.498221] Call trace:
[   50.498260]  dump_backtrace+0x9c/0x100
[   50.498378]  show_stack+0x20/0x38
[   50.498487]  dump_stack_lvl+0xec/0x150
[   50.498574]  dump_stack+0x18/0x28
[   50.498659]  __might_resched+0x158/0x278
[   50.498741]  __might_sleep+0x50/0xa0
[   50.498849]  down_write+0x30/0x1a8
[   50.498950]  split_huge_page_to_list_to_order+0x3c8/0x1130
[   50.499052]  madvise_cold_or_pageout_pte_range+0x84c/0xea0
[   50.499138]  walk_pgd_range+0x388/0x7d8
[   50.499224]  __walk_page_range+0x1e0/0x1f0
[   50.499334]  walk_page_range+0x1f0/0x2c8
[   50.499458]  madvise_pageout+0xf8/0x280
[   50.499554]  madvise_vma_behavior+0x310/0x9b8
[   50.499657]  madvise_walk_vmas+0xc0/0x128
[   50.499739]  do_madvise.part.0+0xf8/0x518
[   50.499851]  __arm64_sys_madvise+0x68/0x88
[   50.499953]  invoke_syscall+0x50/0x128
[   50.500037]  el0_svc_common.constprop.0+0x48/0xf8
[   50.500121]  do_el0_svc+0x28/0x40
[   50.500203]  el0_svc+0x50/0x150
[   50.500322]  el0t_64_sync_handler+0x13c/0x158
[   50.500422]  el0t_64_sync+0x1a4/0x1a8
[   50.501378] BUG: scheduling while atomic: singlethread/102/0x00000000
[   50.517641] INFO: lockdep is turned off.
[   50.518206] Modules linked in:
[   50.521135] CPU: 2 PID: 102 Comm: singlethread Tainted: G        W
        6.9.0-rc2-00257-g2d9f63c285db #128
[   50.522026] Hardware name: linux,dummy-virt (DT)
[   50.522623] Call trace:
[   50.522993]  dump_backtrace+0x9c/0x100
[   50.523527]  show_stack+0x20/0x38
[   50.523950]  dump_stack_lvl+0xec/0x150
[   50.524405]  dump_stack+0x18/0x28
[   50.524849]  __schedule_bug+0x80/0xe0
[   50.525309]  __schedule+0xb1c/0xc00
[   50.525750]  schedule+0x58/0x170
[   50.526227]  schedule_preempt_disabled+0x2c/0x50
[   50.526762]  rwsem_down_write_slowpath+0x1ac/0x718
[   50.527342]  down_write+0xf8/0x1a8
[   50.527857]  split_huge_page_to_list_to_order+0x3c8/0x1130
[   50.528437]  madvise_cold_or_pageout_pte_range+0x84c/0xea0
[   50.529012]  walk_pgd_range+0x388/0x7d8
[   50.529442]  __walk_page_range+0x1e0/0x1f0
[   50.529896]  walk_page_range+0x1f0/0x2c8
[   50.530342]  madvise_pageout+0xf8/0x280
[   50.530878]  madvise_vma_behavior+0x310/0x9b8
[   50.531395]  madvise_walk_vmas+0xc0/0x128
[   50.531849]  do_madvise.part.0+0xf8/0x518
[   50.532330]  __arm64_sys_madvise+0x68/0x88
[   50.532829]  invoke_syscall+0x50/0x128
[   50.533374]  el0_svc_common.constprop.0+0x48/0xf8
[   50.533992]  do_el0_svc+0x28/0x40
[   50.534498]  el0_svc+0x50/0x150
[   50.535029]  el0t_64_sync_handler+0x13c/0x158
[   50.535588]  el0t_64_sync+0x1a4/0x1a8



>  include/linux/pgtable.h | 30 ++++++++++++++
>  mm/internal.h           | 12 +++++-
>  mm/madvise.c            | 88 ++++++++++++++++++++++++-----------------
>  mm/memory.c             |  4 +-
>  4 files changed, 93 insertions(+), 41 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 8185939df1e8..391f56a1b188 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -361,6 +361,36 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>  }
>  #endif
>
> +#ifndef mkold_ptes
> +/**
> + * mkold_ptes - Mark PTEs that map consecutive pages of the same folio as old.
> + * @vma: VMA the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to mark old.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_test_and_clear_young().
> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> + * some PTEs might be write-protected.
> + *
> + * Context: The caller holds the page table lock.  The PTEs map consecutive
> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
> + */
> +static inline void mkold_ptes(struct vm_area_struct *vma, unsigned long addr,
> +               pte_t *ptep, unsigned int nr)
> +{
> +       for (;;) {
> +               ptep_test_and_clear_young(vma, addr, ptep);
> +               if (--nr == 0)
> +                       break;
> +               ptep++;
> +               addr += PAGE_SIZE;
> +       }
> +}
> +#endif
> +
>  #ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
>  static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
> diff --git a/mm/internal.h b/mm/internal.h
> index eadb79c3a357..efee8e4cd2af 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -130,6 +130,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>   * @flags: Flags to modify the PTE batch semantics.
>   * @any_writable: Optional pointer to indicate whether any entry except the
>   *               first one is writable.
> + * @any_young: Optional pointer to indicate whether any entry except the
> + *               first one is young.
>   *
>   * Detect a PTE batch: consecutive (present) PTEs that map consecutive
>   * pages of the same large folio.
> @@ -145,16 +147,18 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>   */
>  static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>                 pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> -               bool *any_writable)
> +               bool *any_writable, bool *any_young)
>  {
>         unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>         const pte_t *end_ptep = start_ptep + max_nr;
>         pte_t expected_pte, *ptep;
> -       bool writable;
> +       bool writable, young;
>         int nr;
>
>         if (any_writable)
>                 *any_writable = false;
> +       if (any_young)
> +               *any_young = false;
>
>         VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>         VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
> @@ -168,6 +172,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>                 pte = ptep_get(ptep);
>                 if (any_writable)
>                         writable = !!pte_write(pte);
> +               if (any_young)
> +                       young = !!pte_young(pte);
>                 pte = __pte_batch_clear_ignored(pte, flags);
>
>                 if (!pte_same(pte, expected_pte))
> @@ -183,6 +189,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>
>                 if (any_writable)
>                         *any_writable |= writable;
> +               if (any_young)
> +                       *any_young |= young;
>
>                 nr = pte_batch_hint(ptep, pte);
>                 expected_pte = pte_advance_pfn(expected_pte, nr);
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 070bedb4996e..bd00b83e7c50 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>         LIST_HEAD(folio_list);
>         bool pageout_anon_only_filter;
>         unsigned int batch_count = 0;
> +       int nr;
>
>         if (fatal_signal_pending(current))
>                 return -EINTR;
> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>                 return 0;
>         flush_tlb_batched_pending(mm);
>         arch_enter_lazy_mmu_mode();
> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> +               nr = 1;
>                 ptent = ptep_get(pte);
>
>                 if (++batch_count == SWAP_CLUSTER_MAX) {
> @@ -447,55 +449,67 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>                         continue;
>
>                 /*
> -                * Creating a THP page is expensive so split it only if we
> -                * are sure it's worth. Split it if we are only owner.
> +                * If we encounter a large folio, only split it if it is not
> +                * fully mapped within the range we are operating on. Otherwise
> +                * leave it as is so that it can be swapped out whole. If we
> +                * fail to split a folio, leave it in place and advance to the
> +                * next pte in the range.
>                  */
>                 if (folio_test_large(folio)) {
> -                       int err;
> -
> -                       if (folio_likely_mapped_shared(folio))
> -                               break;
> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
> -                               break;
> -                       if (!folio_trylock(folio))
> -                               break;
> -                       folio_get(folio);
> -                       arch_leave_lazy_mmu_mode();
> -                       pte_unmap_unlock(start_pte, ptl);
> -                       start_pte = NULL;
> -                       err = split_folio(folio);
> -                       folio_unlock(folio);
> -                       folio_put(folio);
> -                       if (err)
> -                               break;
> -                       start_pte = pte =
> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
> -                       if (!start_pte)
> -                               break;
> -                       arch_enter_lazy_mmu_mode();
> -                       pte--;
> -                       addr -= PAGE_SIZE;
> -                       continue;
> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> +                                               FPB_IGNORE_SOFT_DIRTY;
> +                       int max_nr = (end - addr) / PAGE_SIZE;
> +                       bool any_young;
> +
> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> +                                            fpb_flags, NULL, &any_young);
> +                       if (any_young)
> +                               ptent = pte_mkyoung(ptent);
> +
> +                       if (nr < folio_nr_pages(folio)) {
> +                               int err;
> +
> +                               if (folio_likely_mapped_shared(folio))
> +                                       continue;
> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
> +                                       continue;
> +                               if (!folio_trylock(folio))
> +                                       continue;
> +                               folio_get(folio);
> +                               arch_leave_lazy_mmu_mode();
> +                               pte_unmap_unlock(start_pte, ptl);
> +                               start_pte = NULL;
> +                               err = split_folio(folio);
> +                               folio_unlock(folio);
> +                               folio_put(folio);
> +                               if (err)
> +                                       continue;
> +                               start_pte = pte =
> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
> +                               if (!start_pte)
> +                                       break;
> +                               arch_enter_lazy_mmu_mode();
> +                               nr = 0;
> +                               continue;
> +                       }
>                 }
>
>                 /*
>                  * Do not interfere with other mappings of this folio and
> -                * non-LRU folio.
> +                * non-LRU folio. If we have a large folio at this point, we
> +                * know it is fully mapped so if its mapcount is the same as its
> +                * number of pages, it must be exclusive.
>                  */
> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> +               if (!folio_test_lru(folio) ||
> +                   folio_mapcount(folio) != folio_nr_pages(folio))
>                         continue;
>
>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
>                         continue;
>
> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> -
>                 if (!pageout && pte_young(ptent)) {
> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
> -                                                       tlb->fullmm);
> -                       ptent = pte_mkold(ptent);
> -                       set_pte_at(mm, addr, pte, ptent);
> -                       tlb_remove_tlb_entry(tlb, pte, addr);
> +                       mkold_ptes(vma, addr, pte, nr);
> +                       tlb_remove_tlb_entries(tlb, pte, nr, addr);
>                 }
>
>                 /*
> diff --git a/mm/memory.c b/mm/memory.c
> index 9d844582ba38..b5b48f4cf2af 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -989,7 +989,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>                         flags |= FPB_IGNORE_SOFT_DIRTY;
>
>                 nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags,
> -                                    &any_writable);
> +                                    &any_writable, NULL);
>                 folio_ref_add(folio, nr);
>                 if (folio_test_anon(folio)) {
>                         if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
> @@ -1553,7 +1553,7 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
>          */
>         if (unlikely(folio_test_large(folio) && max_nr != 1)) {
>                 nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, fpb_flags,
> -                                    NULL);
> +                                    NULL, NULL);
>
>                 zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
>                                        addr, details, rss, force_flush,
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD
  2024-04-02 10:16   ` Barry Song
@ 2024-04-02 10:56     ` Ryan Roberts
  2024-04-02 11:01       ` Ryan Roberts
  0 siblings, 1 reply; 35+ messages in thread
From: Ryan Roberts @ 2024-04-02 10:56 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Chris Li, Lance Yang, linux-mm, linux-kernel, Barry Song

On 02/04/2024 11:16, Barry Song wrote:
> On Thu, Mar 28, 2024 at 3:46 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
>> folio that is fully and contiguously mapped in the pageout/cold vm
>> range. This change means that large folios will be maintained all the
>> way to swap storage. This both improves performance during swap-out, by
>> eliding the cost of splitting the folio, and sets us up nicely for
>> maintaining the large folio when it is swapped back in (to be covered in
>> a separate series).
>>
>> Folios that are not fully mapped in the target range are still split,
>> but note that behavior is changed so that if the split fails for any
>> reason (folio locked, shared, etc) we now leave it as is and move to the
>> next pte in the range and continue work on the proceeding folios.
>> Previously any failure of this sort would cause the entire operation to
>> give up and no folios mapped at higher addresses were paged out or made
>> cold. Given large folios are becoming more common, this old behavior
>> would have likely lead to wasted opportunities.
>>
>> While we are at it, change the code that clears young from the ptes to
>> use ptep_test_and_clear_young(), via the new mkold_ptes() batch helper
>> function. This is more efficent than get_and_clear/modify/set,
>> especially for contpte mappings on arm64, where the old approach would
>> require unfolding/refolding and the new approach can be done in place.
>>
>> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
> 
> Hi Ryan,
> 
> I'm not entirely certain if this issue is related to this patch, but
> I've encountered
> the KNIC twice while using the latest mm-unstable kernel. Each time I attempted
> to debug it, the issue vanished. I'm posting here to see if you have
> any ideas on it :-)

Thanks for the report! I think I see the problem below...


> 
> [   50.444066]
> [   50.444495] =====================================
> [   50.444954] WARNING: bad unlock balance detected!
> [   50.445443] 6.9.0-rc2-00257-g2d9f63c285db #128 Not tainted
> [   50.446233] -------------------------------------
> [   50.446684] singlethread/102 is trying to release lock
> (ptlock_ptr(ptdesc)) at:
> [   50.447635] [<ffffb03155fe302c>]
> madvise_cold_or_pageout_pte_range+0x80c/0xea0
> [   50.449066] but there are no more locks to release!
> [   50.449535]
> [   50.449535] other info that might help us debug this:
> [   50.450140] 1 lock held by singlethread/102:
> [   50.450688]  #0: ffff0000c001f208 (&mm->mmap_lock){++++}-{4:4}, at:
> do_madvise.part.0+0x178/0x518
> [   50.452321]
> [   50.452321] stack backtrace:
> [   50.452959] CPU: 3 PID: 102 Comm: singlethread Not tainted
> 6.9.0-rc2-00257-g2d9f63c285db #128
> [   50.453812] Hardware name: linux,dummy-virt (DT)
> [   50.454373] Call trace:
> [   50.454755]  dump_backtrace+0x9c/0x100
> [   50.455246]  show_stack+0x20/0x38
> [   50.455667]  dump_stack_lvl+0xec/0x150
> [   50.456111]  dump_stack+0x18/0x28
> [   50.456533]  print_unlock_imbalance_bug+0x130/0x148
> [   50.457014]  lock_release+0x2e0/0x360
> [   50.457487]  _raw_spin_unlock+0x2c/0x78
> [   50.457997]  madvise_cold_or_pageout_pte_range+0x80c/0xea0
> [   50.458635]  walk_pgd_range+0x388/0x7d8
> [   50.459168]  __walk_page_range+0x1e0/0x1f0
> [   50.459682]  walk_page_range+0x1f0/0x2c8
> [   50.460225]  madvise_pageout+0xf8/0x280
> [   50.460711]  madvise_vma_behavior+0x310/0x9b8
> [   50.461169]  madvise_walk_vmas+0xc0/0x128
> [   50.461605]  do_madvise.part.0+0xf8/0x518
> [   50.462041]  __arm64_sys_madvise+0x68/0x88
> [   50.462529]  invoke_syscall+0x50/0x128
> [   50.463001]  el0_svc_common.constprop.0+0x48/0xf8
> [   50.463508]  do_el0_svc+0x28/0x40
> [   50.464004]  el0_svc+0x50/0x150
> [   50.464492]  el0t_64_sync_handler+0x13c/0x158
> [   50.465021]  el0t_64_sync+0x1a4/0x1a8
> [   50.466959] ------------[ cut here ]------------
> [   50.467451] WARNING: CPU: 3 PID: 102 at
> kernel/rcu/tree_plugin.h:431 __rcu_read_unlock+0x74/0x218
> [   50.468160] Modules linked in:
> [   50.468803] CPU: 3 PID: 102 Comm: singlethread Not tainted
> 6.9.0-rc2-00257-g2d9f63c285db #128
> [   50.469658] Hardware name: linux,dummy-virt (DT)
> [   50.470293] pstate: a3400005 (NzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> [   50.470991] pc : __rcu_read_unlock+0x74/0x218
> [   50.471594] lr : madvise_cold_or_pageout_pte_range+0x828/0xea0
> [   50.472236] sp : ffff800080abb7e0
> [   50.472622] pmr_save: 000000e0
> [   50.473010] x29: ffff800080abb7e0 x28: 0000ffffa467a000 x27: fffffdffc3128c00
> [   50.474006] x26: 0010000000000001 x25: 000000000000001b x24: ffff0000c32d73d0
> [   50.474971] x23: 0060000104a3afc3 x22: ffff0000c2492840 x21: 0400000000000001
> [   50.475943] x20: ff77fffffffffbff x19: ffff0000c3230000 x18: ffffffffffffffff
> [   50.477286] x17: 672d37353230302d x16: 3263722d302e392e x15: ffff800100abb227
> [   50.478373] x14: 0000000000000001 x13: 38613178302f3461 x12: 3178302b636e7973
> [   50.479354] x11: fffffffffffe0000 x10: ffffb03159697d08 x9 : ffffb03155fe3048
> [   50.480265] x8 : 00000000ffffefff x7 : ffffb03159697d08 x6 : 0000000000000000
> [   50.481154] x5 : 0000000000000001 x4 : ffff800080abbfe0 x3 : 0000000000000000
> [   50.482035] x2 : ffff4fd055074000 x1 : 00000000ffffffff x0 : 000000003fffffff
> [   50.483163] Call trace:
> [   50.483599]  __rcu_read_unlock+0x74/0x218
> [   50.484152]  madvise_cold_or_pageout_pte_range+0x828/0xea0
> [   50.484780]  walk_pgd_range+0x388/0x7d8
> [   50.485328]  __walk_page_range+0x1e0/0x1f0
> [   50.485725]  walk_page_range+0x1f0/0x2c8
> [   50.486117]  madvise_pageout+0xf8/0x280
> [   50.486547]  madvise_vma_behavior+0x310/0x9b8
> [   50.486975]  madvise_walk_vmas+0xc0/0x128
> [   50.487403]  do_madvise.part.0+0xf8/0x518
> [   50.487845]  __arm64_sys_madvise+0x68/0x88
> [   50.488374]  invoke_syscall+0x50/0x128
> [   50.488946]  el0_svc_common.constprop.0+0x48/0xf8
> [   50.489732]  do_el0_svc+0x28/0x40
> [   50.490210]  el0_svc+0x50/0x150
> [   50.490674]  el0t_64_sync_handler+0x13c/0x158
> [   50.491257]  el0t_64_sync+0x1a4/0x1a8
> [   50.491793] irq event stamp: 3087
> [   50.492243] hardirqs last  enabled at (3087): [<ffffb031570d89d8>]
> _raw_spin_unlock_irq+0x38/0x90
> [   50.492917] hardirqs last disabled at (3086): [<ffffb031570d8acc>]
> _raw_spin_lock_irq+0x9c/0xc0
> [   50.493742] softirqs last  enabled at (2470): [<ffffb03155c10d94>]
> __do_softirq+0x534/0x588
> [   50.494567] softirqs last disabled at (2461): [<ffffb03155c17238>]
> ____do_softirq+0x18/0x30
> [   50.495328] ---[ end trace 0000000000000000 ]---
> [   50.497110] BUG: sleeping function called from invalid context at
> kernel/locking/rwsem.c:1578
> [   50.497544] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid:
> 102, name: singlethread
> [   50.497652] preempt_count: ffffffff, expected: 0
> [   50.497728] RCU nest depth: -1, expected: 0
> [   50.497851] INFO: lockdep is turned off.
> [   50.498023] CPU: 3 PID: 102 Comm: singlethread Tainted: G        W
>         6.9.0-rc2-00257-g2d9f63c285db #128
> [   50.498166] Hardware name: linux,dummy-virt (DT)
> [   50.498221] Call trace:
> [   50.498260]  dump_backtrace+0x9c/0x100
> [   50.498378]  show_stack+0x20/0x38
> [   50.498487]  dump_stack_lvl+0xec/0x150
> [   50.498574]  dump_stack+0x18/0x28
> [   50.498659]  __might_resched+0x158/0x278
> [   50.498741]  __might_sleep+0x50/0xa0
> [   50.498849]  down_write+0x30/0x1a8
> [   50.498950]  split_huge_page_to_list_to_order+0x3c8/0x1130
> [   50.499052]  madvise_cold_or_pageout_pte_range+0x84c/0xea0
> [   50.499138]  walk_pgd_range+0x388/0x7d8
> [   50.499224]  __walk_page_range+0x1e0/0x1f0
> [   50.499334]  walk_page_range+0x1f0/0x2c8
> [   50.499458]  madvise_pageout+0xf8/0x280
> [   50.499554]  madvise_vma_behavior+0x310/0x9b8
> [   50.499657]  madvise_walk_vmas+0xc0/0x128
> [   50.499739]  do_madvise.part.0+0xf8/0x518
> [   50.499851]  __arm64_sys_madvise+0x68/0x88
> [   50.499953]  invoke_syscall+0x50/0x128
> [   50.500037]  el0_svc_common.constprop.0+0x48/0xf8
> [   50.500121]  do_el0_svc+0x28/0x40
> [   50.500203]  el0_svc+0x50/0x150
> [   50.500322]  el0t_64_sync_handler+0x13c/0x158
> [   50.500422]  el0t_64_sync+0x1a4/0x1a8
> [   50.501378] BUG: scheduling while atomic: singlethread/102/0x00000000
> [   50.517641] INFO: lockdep is turned off.
> [   50.518206] Modules linked in:
> [   50.521135] CPU: 2 PID: 102 Comm: singlethread Tainted: G        W
>         6.9.0-rc2-00257-g2d9f63c285db #128
> [   50.522026] Hardware name: linux,dummy-virt (DT)
> [   50.522623] Call trace:
> [   50.522993]  dump_backtrace+0x9c/0x100
> [   50.523527]  show_stack+0x20/0x38
> [   50.523950]  dump_stack_lvl+0xec/0x150
> [   50.524405]  dump_stack+0x18/0x28
> [   50.524849]  __schedule_bug+0x80/0xe0
> [   50.525309]  __schedule+0xb1c/0xc00
> [   50.525750]  schedule+0x58/0x170
> [   50.526227]  schedule_preempt_disabled+0x2c/0x50
> [   50.526762]  rwsem_down_write_slowpath+0x1ac/0x718
> [   50.527342]  down_write+0xf8/0x1a8
> [   50.527857]  split_huge_page_to_list_to_order+0x3c8/0x1130
> [   50.528437]  madvise_cold_or_pageout_pte_range+0x84c/0xea0
> [   50.529012]  walk_pgd_range+0x388/0x7d8
> [   50.529442]  __walk_page_range+0x1e0/0x1f0
> [   50.529896]  walk_page_range+0x1f0/0x2c8
> [   50.530342]  madvise_pageout+0xf8/0x280
> [   50.530878]  madvise_vma_behavior+0x310/0x9b8
> [   50.531395]  madvise_walk_vmas+0xc0/0x128
> [   50.531849]  do_madvise.part.0+0xf8/0x518
> [   50.532330]  __arm64_sys_madvise+0x68/0x88
> [   50.532829]  invoke_syscall+0x50/0x128
> [   50.533374]  el0_svc_common.constprop.0+0x48/0xf8
> [   50.533992]  do_el0_svc+0x28/0x40
> [   50.534498]  el0_svc+0x50/0x150
> [   50.535029]  el0t_64_sync_handler+0x13c/0x158
> [   50.535588]  el0t_64_sync+0x1a4/0x1a8
> 
> 
> 
>>  include/linux/pgtable.h | 30 ++++++++++++++
>>  mm/internal.h           | 12 +++++-
>>  mm/madvise.c            | 88 ++++++++++++++++++++++++-----------------
>>  mm/memory.c             |  4 +-
>>  4 files changed, 93 insertions(+), 41 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 8185939df1e8..391f56a1b188 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -361,6 +361,36 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>  }
>>  #endif
>>
>> +#ifndef mkold_ptes
>> +/**
>> + * mkold_ptes - Mark PTEs that map consecutive pages of the same folio as old.
>> + * @vma: VMA the pages are mapped into.
>> + * @addr: Address the first page is mapped at.
>> + * @ptep: Page table pointer for the first entry.
>> + * @nr: Number of entries to mark old.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over ptep_test_and_clear_young().
>> + *
>> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
>> + * some PTEs might be write-protected.
>> + *
>> + * Context: The caller holds the page table lock.  The PTEs map consecutive
>> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
>> + */
>> +static inline void mkold_ptes(struct vm_area_struct *vma, unsigned long addr,
>> +               pte_t *ptep, unsigned int nr)
>> +{
>> +       for (;;) {
>> +               ptep_test_and_clear_young(vma, addr, ptep);
>> +               if (--nr == 0)
>> +                       break;
>> +               ptep++;
>> +               addr += PAGE_SIZE;
>> +       }
>> +}
>> +#endif
>> +
>>  #ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
>>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
>>  static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
>> diff --git a/mm/internal.h b/mm/internal.h
>> index eadb79c3a357..efee8e4cd2af 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -130,6 +130,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>   * @flags: Flags to modify the PTE batch semantics.
>>   * @any_writable: Optional pointer to indicate whether any entry except the
>>   *               first one is writable.
>> + * @any_young: Optional pointer to indicate whether any entry except the
>> + *               first one is young.
>>   *
>>   * Detect a PTE batch: consecutive (present) PTEs that map consecutive
>>   * pages of the same large folio.
>> @@ -145,16 +147,18 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>   */
>>  static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>                 pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
>> -               bool *any_writable)
>> +               bool *any_writable, bool *any_young)
>>  {
>>         unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>>         const pte_t *end_ptep = start_ptep + max_nr;
>>         pte_t expected_pte, *ptep;
>> -       bool writable;
>> +       bool writable, young;
>>         int nr;
>>
>>         if (any_writable)
>>                 *any_writable = false;
>> +       if (any_young)
>> +               *any_young = false;
>>
>>         VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>>         VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
>> @@ -168,6 +172,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>                 pte = ptep_get(ptep);
>>                 if (any_writable)
>>                         writable = !!pte_write(pte);
>> +               if (any_young)
>> +                       young = !!pte_young(pte);
>>                 pte = __pte_batch_clear_ignored(pte, flags);
>>
>>                 if (!pte_same(pte, expected_pte))
>> @@ -183,6 +189,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>
>>                 if (any_writable)
>>                         *any_writable |= writable;
>> +               if (any_young)
>> +                       *any_young |= young;
>>
>>                 nr = pte_batch_hint(ptep, pte);
>>                 expected_pte = pte_advance_pfn(expected_pte, nr);
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 070bedb4996e..bd00b83e7c50 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>         LIST_HEAD(folio_list);
>>         bool pageout_anon_only_filter;
>>         unsigned int batch_count = 0;
>> +       int nr;
>>
>>         if (fatal_signal_pending(current))
>>                 return -EINTR;
>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>                 return 0;
>>         flush_tlb_batched_pending(mm);
>>         arch_enter_lazy_mmu_mode();
>> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
>> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
>> +               nr = 1;
>>                 ptent = ptep_get(pte);
>>
>>                 if (++batch_count == SWAP_CLUSTER_MAX) {
>> @@ -447,55 +449,67 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>                         continue;
>>
>>                 /*
>> -                * Creating a THP page is expensive so split it only if we
>> -                * are sure it's worth. Split it if we are only owner.
>> +                * If we encounter a large folio, only split it if it is not
>> +                * fully mapped within the range we are operating on. Otherwise
>> +                * leave it as is so that it can be swapped out whole. If we
>> +                * fail to split a folio, leave it in place and advance to the
>> +                * next pte in the range.
>>                  */
>>                 if (folio_test_large(folio)) {
>> -                       int err;
>> -
>> -                       if (folio_likely_mapped_shared(folio))
>> -                               break;
>> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
>> -                               break;
>> -                       if (!folio_trylock(folio))
>> -                               break;
>> -                       folio_get(folio);
>> -                       arch_leave_lazy_mmu_mode();
>> -                       pte_unmap_unlock(start_pte, ptl);
>> -                       start_pte = NULL;
>> -                       err = split_folio(folio);
>> -                       folio_unlock(folio);
>> -                       folio_put(folio);
>> -                       if (err)
>> -                               break;
>> -                       start_pte = pte =
>> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
>> -                       if (!start_pte)
>> -                               break;
>> -                       arch_enter_lazy_mmu_mode();
>> -                       pte--;
>> -                       addr -= PAGE_SIZE;
>> -                       continue;
>> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>> +                                               FPB_IGNORE_SOFT_DIRTY;
>> +                       int max_nr = (end - addr) / PAGE_SIZE;
>> +                       bool any_young;
>> +
>> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>> +                                            fpb_flags, NULL, &any_young);
>> +                       if (any_young)
>> +                               ptent = pte_mkyoung(ptent);
>> +
>> +                       if (nr < folio_nr_pages(folio)) {
>> +                               int err;
>> +
>> +                               if (folio_likely_mapped_shared(folio))
>> +                                       continue;
>> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
>> +                                       continue;
>> +                               if (!folio_trylock(folio))
>> +                                       continue;
>> +                               folio_get(folio);
>> +                               arch_leave_lazy_mmu_mode();
>> +                               pte_unmap_unlock(start_pte, ptl);
>> +                               start_pte = NULL;
>> +                               err = split_folio(folio);
>> +                               folio_unlock(folio);
>> +                               folio_put(folio);
>> +                               if (err)
>> +                                       continue;

The ptl is unlocked at this point. This used to break, but now it continues
without the lock held!

>> +                               start_pte = pte =
>> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
>> +                               if (!start_pte)
>> +                                       break;

I think we would want to move the condition to here:

                                   if (err)
                                           continue;

I'll fix it in the next version.

Thanks,
Ryan


>> +                               arch_enter_lazy_mmu_mode();
>> +                               nr = 0;
>> +                               continue;
>> +                       }
>>                 }
>>
>>                 /*
>>                  * Do not interfere with other mappings of this folio and
>> -                * non-LRU folio.
>> +                * non-LRU folio. If we have a large folio at this point, we
>> +                * know it is fully mapped so if its mapcount is the same as its
>> +                * number of pages, it must be exclusive.
>>                  */
>> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
>> +               if (!folio_test_lru(folio) ||
>> +                   folio_mapcount(folio) != folio_nr_pages(folio))
>>                         continue;
>>
>>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
>>                         continue;
>>
>> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>> -
>>                 if (!pageout && pte_young(ptent)) {
>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
>> -                                                       tlb->fullmm);
>> -                       ptent = pte_mkold(ptent);
>> -                       set_pte_at(mm, addr, pte, ptent);
>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
>> +                       mkold_ptes(vma, addr, pte, nr);
>> +                       tlb_remove_tlb_entries(tlb, pte, nr, addr);
>>                 }
>>
>>                 /*
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 9d844582ba38..b5b48f4cf2af 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -989,7 +989,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>>                         flags |= FPB_IGNORE_SOFT_DIRTY;
>>
>>                 nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags,
>> -                                    &any_writable);
>> +                                    &any_writable, NULL);
>>                 folio_ref_add(folio, nr);
>>                 if (folio_test_anon(folio)) {
>>                         if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
>> @@ -1553,7 +1553,7 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
>>          */
>>         if (unlikely(folio_test_large(folio) && max_nr != 1)) {
>>                 nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, fpb_flags,
>> -                                    NULL);
>> +                                    NULL, NULL);
>>
>>                 zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
>>                                        addr, details, rss, force_flush,
>> --
>> 2.25.1
>>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD
  2024-04-02 10:56     ` Ryan Roberts
@ 2024-04-02 11:01       ` Ryan Roberts
  0 siblings, 0 replies; 35+ messages in thread
From: Ryan Roberts @ 2024-04-02 11:01 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Chris Li, Lance Yang, linux-mm, linux-kernel, Barry Song

Hi Andrew,

Please could you remove the series to which this patch belongs from mm-unstable?
I'll need to respin to fix the below bug. I also have a couple of other minor
changes to make based on feedback. Hopefully will be able to send out an updated
version in the next couple of days.

Thanks,
Ryan


On 02/04/2024 11:56, Ryan Roberts wrote:
> On 02/04/2024 11:16, Barry Song wrote:
>> On Thu, Mar 28, 2024 at 3:46 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
>>> folio that is fully and contiguously mapped in the pageout/cold vm
>>> range. This change means that large folios will be maintained all the
>>> way to swap storage. This both improves performance during swap-out, by
>>> eliding the cost of splitting the folio, and sets us up nicely for
>>> maintaining the large folio when it is swapped back in (to be covered in
>>> a separate series).
>>>
>>> Folios that are not fully mapped in the target range are still split,
>>> but note that behavior is changed so that if the split fails for any
>>> reason (folio locked, shared, etc) we now leave it as is and move to the
>>> next pte in the range and continue work on the proceeding folios.
>>> Previously any failure of this sort would cause the entire operation to
>>> give up and no folios mapped at higher addresses were paged out or made
>>> cold. Given large folios are becoming more common, this old behavior
>>> would have likely lead to wasted opportunities.
>>>
>>> While we are at it, change the code that clears young from the ptes to
>>> use ptep_test_and_clear_young(), via the new mkold_ptes() batch helper
>>> function. This is more efficent than get_and_clear/modify/set,
>>> especially for contpte mappings on arm64, where the old approach would
>>> require unfolding/refolding and the new approach can be done in place.
>>>
>>> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>
>> Hi Ryan,
>>
>> I'm not entirely certain if this issue is related to this patch, but
>> I've encountered
>> the KNIC twice while using the latest mm-unstable kernel. Each time I attempted
>> to debug it, the issue vanished. I'm posting here to see if you have
>> any ideas on it :-)
> 
> Thanks for the report! I think I see the problem below...
> 
> 
>>
>> [   50.444066]
>> [   50.444495] =====================================
>> [   50.444954] WARNING: bad unlock balance detected!
>> [   50.445443] 6.9.0-rc2-00257-g2d9f63c285db #128 Not tainted
>> [   50.446233] -------------------------------------
>> [   50.446684] singlethread/102 is trying to release lock
>> (ptlock_ptr(ptdesc)) at:
>> [   50.447635] [<ffffb03155fe302c>]
>> madvise_cold_or_pageout_pte_range+0x80c/0xea0
>> [   50.449066] but there are no more locks to release!
>> [   50.449535]
>> [   50.449535] other info that might help us debug this:
>> [   50.450140] 1 lock held by singlethread/102:
>> [   50.450688]  #0: ffff0000c001f208 (&mm->mmap_lock){++++}-{4:4}, at:
>> do_madvise.part.0+0x178/0x518
>> [   50.452321]
>> [   50.452321] stack backtrace:
>> [   50.452959] CPU: 3 PID: 102 Comm: singlethread Not tainted
>> 6.9.0-rc2-00257-g2d9f63c285db #128
>> [   50.453812] Hardware name: linux,dummy-virt (DT)
>> [   50.454373] Call trace:
>> [   50.454755]  dump_backtrace+0x9c/0x100
>> [   50.455246]  show_stack+0x20/0x38
>> [   50.455667]  dump_stack_lvl+0xec/0x150
>> [   50.456111]  dump_stack+0x18/0x28
>> [   50.456533]  print_unlock_imbalance_bug+0x130/0x148
>> [   50.457014]  lock_release+0x2e0/0x360
>> [   50.457487]  _raw_spin_unlock+0x2c/0x78
>> [   50.457997]  madvise_cold_or_pageout_pte_range+0x80c/0xea0
>> [   50.458635]  walk_pgd_range+0x388/0x7d8
>> [   50.459168]  __walk_page_range+0x1e0/0x1f0
>> [   50.459682]  walk_page_range+0x1f0/0x2c8
>> [   50.460225]  madvise_pageout+0xf8/0x280
>> [   50.460711]  madvise_vma_behavior+0x310/0x9b8
>> [   50.461169]  madvise_walk_vmas+0xc0/0x128
>> [   50.461605]  do_madvise.part.0+0xf8/0x518
>> [   50.462041]  __arm64_sys_madvise+0x68/0x88
>> [   50.462529]  invoke_syscall+0x50/0x128
>> [   50.463001]  el0_svc_common.constprop.0+0x48/0xf8
>> [   50.463508]  do_el0_svc+0x28/0x40
>> [   50.464004]  el0_svc+0x50/0x150
>> [   50.464492]  el0t_64_sync_handler+0x13c/0x158
>> [   50.465021]  el0t_64_sync+0x1a4/0x1a8
>> [   50.466959] ------------[ cut here ]------------
>> [   50.467451] WARNING: CPU: 3 PID: 102 at
>> kernel/rcu/tree_plugin.h:431 __rcu_read_unlock+0x74/0x218
>> [   50.468160] Modules linked in:
>> [   50.468803] CPU: 3 PID: 102 Comm: singlethread Not tainted
>> 6.9.0-rc2-00257-g2d9f63c285db #128
>> [   50.469658] Hardware name: linux,dummy-virt (DT)
>> [   50.470293] pstate: a3400005 (NzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
>> [   50.470991] pc : __rcu_read_unlock+0x74/0x218
>> [   50.471594] lr : madvise_cold_or_pageout_pte_range+0x828/0xea0
>> [   50.472236] sp : ffff800080abb7e0
>> [   50.472622] pmr_save: 000000e0
>> [   50.473010] x29: ffff800080abb7e0 x28: 0000ffffa467a000 x27: fffffdffc3128c00
>> [   50.474006] x26: 0010000000000001 x25: 000000000000001b x24: ffff0000c32d73d0
>> [   50.474971] x23: 0060000104a3afc3 x22: ffff0000c2492840 x21: 0400000000000001
>> [   50.475943] x20: ff77fffffffffbff x19: ffff0000c3230000 x18: ffffffffffffffff
>> [   50.477286] x17: 672d37353230302d x16: 3263722d302e392e x15: ffff800100abb227
>> [   50.478373] x14: 0000000000000001 x13: 38613178302f3461 x12: 3178302b636e7973
>> [   50.479354] x11: fffffffffffe0000 x10: ffffb03159697d08 x9 : ffffb03155fe3048
>> [   50.480265] x8 : 00000000ffffefff x7 : ffffb03159697d08 x6 : 0000000000000000
>> [   50.481154] x5 : 0000000000000001 x4 : ffff800080abbfe0 x3 : 0000000000000000
>> [   50.482035] x2 : ffff4fd055074000 x1 : 00000000ffffffff x0 : 000000003fffffff
>> [   50.483163] Call trace:
>> [   50.483599]  __rcu_read_unlock+0x74/0x218
>> [   50.484152]  madvise_cold_or_pageout_pte_range+0x828/0xea0
>> [   50.484780]  walk_pgd_range+0x388/0x7d8
>> [   50.485328]  __walk_page_range+0x1e0/0x1f0
>> [   50.485725]  walk_page_range+0x1f0/0x2c8
>> [   50.486117]  madvise_pageout+0xf8/0x280
>> [   50.486547]  madvise_vma_behavior+0x310/0x9b8
>> [   50.486975]  madvise_walk_vmas+0xc0/0x128
>> [   50.487403]  do_madvise.part.0+0xf8/0x518
>> [   50.487845]  __arm64_sys_madvise+0x68/0x88
>> [   50.488374]  invoke_syscall+0x50/0x128
>> [   50.488946]  el0_svc_common.constprop.0+0x48/0xf8
>> [   50.489732]  do_el0_svc+0x28/0x40
>> [   50.490210]  el0_svc+0x50/0x150
>> [   50.490674]  el0t_64_sync_handler+0x13c/0x158
>> [   50.491257]  el0t_64_sync+0x1a4/0x1a8
>> [   50.491793] irq event stamp: 3087
>> [   50.492243] hardirqs last  enabled at (3087): [<ffffb031570d89d8>]
>> _raw_spin_unlock_irq+0x38/0x90
>> [   50.492917] hardirqs last disabled at (3086): [<ffffb031570d8acc>]
>> _raw_spin_lock_irq+0x9c/0xc0
>> [   50.493742] softirqs last  enabled at (2470): [<ffffb03155c10d94>]
>> __do_softirq+0x534/0x588
>> [   50.494567] softirqs last disabled at (2461): [<ffffb03155c17238>]
>> ____do_softirq+0x18/0x30
>> [   50.495328] ---[ end trace 0000000000000000 ]---
>> [   50.497110] BUG: sleeping function called from invalid context at
>> kernel/locking/rwsem.c:1578
>> [   50.497544] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid:
>> 102, name: singlethread
>> [   50.497652] preempt_count: ffffffff, expected: 0
>> [   50.497728] RCU nest depth: -1, expected: 0
>> [   50.497851] INFO: lockdep is turned off.
>> [   50.498023] CPU: 3 PID: 102 Comm: singlethread Tainted: G        W
>>         6.9.0-rc2-00257-g2d9f63c285db #128
>> [   50.498166] Hardware name: linux,dummy-virt (DT)
>> [   50.498221] Call trace:
>> [   50.498260]  dump_backtrace+0x9c/0x100
>> [   50.498378]  show_stack+0x20/0x38
>> [   50.498487]  dump_stack_lvl+0xec/0x150
>> [   50.498574]  dump_stack+0x18/0x28
>> [   50.498659]  __might_resched+0x158/0x278
>> [   50.498741]  __might_sleep+0x50/0xa0
>> [   50.498849]  down_write+0x30/0x1a8
>> [   50.498950]  split_huge_page_to_list_to_order+0x3c8/0x1130
>> [   50.499052]  madvise_cold_or_pageout_pte_range+0x84c/0xea0
>> [   50.499138]  walk_pgd_range+0x388/0x7d8
>> [   50.499224]  __walk_page_range+0x1e0/0x1f0
>> [   50.499334]  walk_page_range+0x1f0/0x2c8
>> [   50.499458]  madvise_pageout+0xf8/0x280
>> [   50.499554]  madvise_vma_behavior+0x310/0x9b8
>> [   50.499657]  madvise_walk_vmas+0xc0/0x128
>> [   50.499739]  do_madvise.part.0+0xf8/0x518
>> [   50.499851]  __arm64_sys_madvise+0x68/0x88
>> [   50.499953]  invoke_syscall+0x50/0x128
>> [   50.500037]  el0_svc_common.constprop.0+0x48/0xf8
>> [   50.500121]  do_el0_svc+0x28/0x40
>> [   50.500203]  el0_svc+0x50/0x150
>> [   50.500322]  el0t_64_sync_handler+0x13c/0x158
>> [   50.500422]  el0t_64_sync+0x1a4/0x1a8
>> [   50.501378] BUG: scheduling while atomic: singlethread/102/0x00000000
>> [   50.517641] INFO: lockdep is turned off.
>> [   50.518206] Modules linked in:
>> [   50.521135] CPU: 2 PID: 102 Comm: singlethread Tainted: G        W
>>         6.9.0-rc2-00257-g2d9f63c285db #128
>> [   50.522026] Hardware name: linux,dummy-virt (DT)
>> [   50.522623] Call trace:
>> [   50.522993]  dump_backtrace+0x9c/0x100
>> [   50.523527]  show_stack+0x20/0x38
>> [   50.523950]  dump_stack_lvl+0xec/0x150
>> [   50.524405]  dump_stack+0x18/0x28
>> [   50.524849]  __schedule_bug+0x80/0xe0
>> [   50.525309]  __schedule+0xb1c/0xc00
>> [   50.525750]  schedule+0x58/0x170
>> [   50.526227]  schedule_preempt_disabled+0x2c/0x50
>> [   50.526762]  rwsem_down_write_slowpath+0x1ac/0x718
>> [   50.527342]  down_write+0xf8/0x1a8
>> [   50.527857]  split_huge_page_to_list_to_order+0x3c8/0x1130
>> [   50.528437]  madvise_cold_or_pageout_pte_range+0x84c/0xea0
>> [   50.529012]  walk_pgd_range+0x388/0x7d8
>> [   50.529442]  __walk_page_range+0x1e0/0x1f0
>> [   50.529896]  walk_page_range+0x1f0/0x2c8
>> [   50.530342]  madvise_pageout+0xf8/0x280
>> [   50.530878]  madvise_vma_behavior+0x310/0x9b8
>> [   50.531395]  madvise_walk_vmas+0xc0/0x128
>> [   50.531849]  do_madvise.part.0+0xf8/0x518
>> [   50.532330]  __arm64_sys_madvise+0x68/0x88
>> [   50.532829]  invoke_syscall+0x50/0x128
>> [   50.533374]  el0_svc_common.constprop.0+0x48/0xf8
>> [   50.533992]  do_el0_svc+0x28/0x40
>> [   50.534498]  el0_svc+0x50/0x150
>> [   50.535029]  el0t_64_sync_handler+0x13c/0x158
>> [   50.535588]  el0t_64_sync+0x1a4/0x1a8
>>
>>
>>
>>>  include/linux/pgtable.h | 30 ++++++++++++++
>>>  mm/internal.h           | 12 +++++-
>>>  mm/madvise.c            | 88 ++++++++++++++++++++++++-----------------
>>>  mm/memory.c             |  4 +-
>>>  4 files changed, 93 insertions(+), 41 deletions(-)
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index 8185939df1e8..391f56a1b188 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -361,6 +361,36 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>  }
>>>  #endif
>>>
>>> +#ifndef mkold_ptes
>>> +/**
>>> + * mkold_ptes - Mark PTEs that map consecutive pages of the same folio as old.
>>> + * @vma: VMA the pages are mapped into.
>>> + * @addr: Address the first page is mapped at.
>>> + * @ptep: Page table pointer for the first entry.
>>> + * @nr: Number of entries to mark old.
>>> + *
>>> + * May be overridden by the architecture; otherwise, implemented as a simple
>>> + * loop over ptep_test_and_clear_young().
>>> + *
>>> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
>>> + * some PTEs might be write-protected.
>>> + *
>>> + * Context: The caller holds the page table lock.  The PTEs map consecutive
>>> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
>>> + */
>>> +static inline void mkold_ptes(struct vm_area_struct *vma, unsigned long addr,
>>> +               pte_t *ptep, unsigned int nr)
>>> +{
>>> +       for (;;) {
>>> +               ptep_test_and_clear_young(vma, addr, ptep);
>>> +               if (--nr == 0)
>>> +                       break;
>>> +               ptep++;
>>> +               addr += PAGE_SIZE;
>>> +       }
>>> +}
>>> +#endif
>>> +
>>>  #ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
>>>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
>>>  static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index eadb79c3a357..efee8e4cd2af 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -130,6 +130,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>>   * @flags: Flags to modify the PTE batch semantics.
>>>   * @any_writable: Optional pointer to indicate whether any entry except the
>>>   *               first one is writable.
>>> + * @any_young: Optional pointer to indicate whether any entry except the
>>> + *               first one is young.
>>>   *
>>>   * Detect a PTE batch: consecutive (present) PTEs that map consecutive
>>>   * pages of the same large folio.
>>> @@ -145,16 +147,18 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>>   */
>>>  static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>>                 pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
>>> -               bool *any_writable)
>>> +               bool *any_writable, bool *any_young)
>>>  {
>>>         unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>>>         const pte_t *end_ptep = start_ptep + max_nr;
>>>         pte_t expected_pte, *ptep;
>>> -       bool writable;
>>> +       bool writable, young;
>>>         int nr;
>>>
>>>         if (any_writable)
>>>                 *any_writable = false;
>>> +       if (any_young)
>>> +               *any_young = false;
>>>
>>>         VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>>>         VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
>>> @@ -168,6 +172,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>>                 pte = ptep_get(ptep);
>>>                 if (any_writable)
>>>                         writable = !!pte_write(pte);
>>> +               if (any_young)
>>> +                       young = !!pte_young(pte);
>>>                 pte = __pte_batch_clear_ignored(pte, flags);
>>>
>>>                 if (!pte_same(pte, expected_pte))
>>> @@ -183,6 +189,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>>
>>>                 if (any_writable)
>>>                         *any_writable |= writable;
>>> +               if (any_young)
>>> +                       *any_young |= young;
>>>
>>>                 nr = pte_batch_hint(ptep, pte);
>>>                 expected_pte = pte_advance_pfn(expected_pte, nr);
>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>> index 070bedb4996e..bd00b83e7c50 100644
>>> --- a/mm/madvise.c
>>> +++ b/mm/madvise.c
>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>         LIST_HEAD(folio_list);
>>>         bool pageout_anon_only_filter;
>>>         unsigned int batch_count = 0;
>>> +       int nr;
>>>
>>>         if (fatal_signal_pending(current))
>>>                 return -EINTR;
>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>                 return 0;
>>>         flush_tlb_batched_pending(mm);
>>>         arch_enter_lazy_mmu_mode();
>>> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
>>> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
>>> +               nr = 1;
>>>                 ptent = ptep_get(pte);
>>>
>>>                 if (++batch_count == SWAP_CLUSTER_MAX) {
>>> @@ -447,55 +449,67 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>                         continue;
>>>
>>>                 /*
>>> -                * Creating a THP page is expensive so split it only if we
>>> -                * are sure it's worth. Split it if we are only owner.
>>> +                * If we encounter a large folio, only split it if it is not
>>> +                * fully mapped within the range we are operating on. Otherwise
>>> +                * leave it as is so that it can be swapped out whole. If we
>>> +                * fail to split a folio, leave it in place and advance to the
>>> +                * next pte in the range.
>>>                  */
>>>                 if (folio_test_large(folio)) {
>>> -                       int err;
>>> -
>>> -                       if (folio_likely_mapped_shared(folio))
>>> -                               break;
>>> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
>>> -                               break;
>>> -                       if (!folio_trylock(folio))
>>> -                               break;
>>> -                       folio_get(folio);
>>> -                       arch_leave_lazy_mmu_mode();
>>> -                       pte_unmap_unlock(start_pte, ptl);
>>> -                       start_pte = NULL;
>>> -                       err = split_folio(folio);
>>> -                       folio_unlock(folio);
>>> -                       folio_put(folio);
>>> -                       if (err)
>>> -                               break;
>>> -                       start_pte = pte =
>>> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
>>> -                       if (!start_pte)
>>> -                               break;
>>> -                       arch_enter_lazy_mmu_mode();
>>> -                       pte--;
>>> -                       addr -= PAGE_SIZE;
>>> -                       continue;
>>> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>>> +                                               FPB_IGNORE_SOFT_DIRTY;
>>> +                       int max_nr = (end - addr) / PAGE_SIZE;
>>> +                       bool any_young;
>>> +
>>> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>> +                                            fpb_flags, NULL, &any_young);
>>> +                       if (any_young)
>>> +                               ptent = pte_mkyoung(ptent);
>>> +
>>> +                       if (nr < folio_nr_pages(folio)) {
>>> +                               int err;
>>> +
>>> +                               if (folio_likely_mapped_shared(folio))
>>> +                                       continue;
>>> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
>>> +                                       continue;
>>> +                               if (!folio_trylock(folio))
>>> +                                       continue;
>>> +                               folio_get(folio);
>>> +                               arch_leave_lazy_mmu_mode();
>>> +                               pte_unmap_unlock(start_pte, ptl);
>>> +                               start_pte = NULL;
>>> +                               err = split_folio(folio);
>>> +                               folio_unlock(folio);
>>> +                               folio_put(folio);
>>> +                               if (err)
>>> +                                       continue;
> 
> The ptl is unlocked at this point. This used to break, but now it continues
> without the lock held!
> 
>>> +                               start_pte = pte =
>>> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
>>> +                               if (!start_pte)
>>> +                                       break;
> 
> I think we would want to move the condition to here:
> 
>                                    if (err)
>                                            continue;
> 
> I'll fix it in the next version.
> 
> Thanks,
> Ryan
> 
> 
>>> +                               arch_enter_lazy_mmu_mode();
>>> +                               nr = 0;
>>> +                               continue;
>>> +                       }
>>>                 }
>>>
>>>                 /*
>>>                  * Do not interfere with other mappings of this folio and
>>> -                * non-LRU folio.
>>> +                * non-LRU folio. If we have a large folio at this point, we
>>> +                * know it is fully mapped so if its mapcount is the same as its
>>> +                * number of pages, it must be exclusive.
>>>                  */
>>> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
>>> +               if (!folio_test_lru(folio) ||
>>> +                   folio_mapcount(folio) != folio_nr_pages(folio))
>>>                         continue;
>>>
>>>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>                         continue;
>>>
>>> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>> -
>>>                 if (!pageout && pte_young(ptent)) {
>>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
>>> -                                                       tlb->fullmm);
>>> -                       ptent = pte_mkold(ptent);
>>> -                       set_pte_at(mm, addr, pte, ptent);
>>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
>>> +                       mkold_ptes(vma, addr, pte, nr);
>>> +                       tlb_remove_tlb_entries(tlb, pte, nr, addr);
>>>                 }
>>>
>>>                 /*
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 9d844582ba38..b5b48f4cf2af 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -989,7 +989,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>>>                         flags |= FPB_IGNORE_SOFT_DIRTY;
>>>
>>>                 nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags,
>>> -                                    &any_writable);
>>> +                                    &any_writable, NULL);
>>>                 folio_ref_add(folio, nr);
>>>                 if (folio_test_anon(folio)) {
>>>                         if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
>>> @@ -1553,7 +1553,7 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
>>>          */
>>>         if (unlikely(folio_test_large(folio) && max_nr != 1)) {
>>>                 nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, fpb_flags,
>>> -                                    NULL);
>>> +                                    NULL, NULL);
>>>
>>>                 zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
>>>                                        addr, details, rss, force_flush,
>>> --
>>> 2.25.1
>>>
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
  2024-04-01  5:52   ` Huang, Ying
@ 2024-04-02 11:15     ` Ryan Roberts
  2024-04-03  3:57       ` Huang, Ying
  0 siblings, 1 reply; 35+ messages in thread
From: Ryan Roberts @ 2024-04-02 11:15 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, Barry Song,
	Chris Li, Lance Yang, linux-mm, linux-kernel

On 01/04/2024 06:52, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> Now that we no longer have a convenient flag in the cluster to determine
>> if a folio is large, free_swap_and_cache() will take a reference and
>> lock a large folio much more often, which could lead to contention and
>> (e.g.) failure to split large folios, etc.
>>
>> Let's solve that problem by batch freeing swap and cache with a new
>> function, free_swap_and_cache_nr(), to free a contiguous range of swap
>> entries together. This allows us to first drop a reference to each swap
>> slot before we try to release the cache folio. This means we only try to
>> release the folio once, only taking the reference and lock once - much
>> better than the previous 512 times for the 2M THP case.
>>
>> Contiguous swap entries are gathered in zap_pte_range() and
>> madvise_free_pte_range() in a similar way to how present ptes are
>> already gathered in zap_pte_range().
>>
>> While we are at it, let's simplify by converting the return type of both
>> functions to void. The return value was used only by zap_pte_range() to
>> print a bad pte, and was ignored by everyone else, so the extra
>> reporting wasn't exactly guaranteed. We will still get the warning with
>> most of the information from get_swap_device(). With the batch version,
>> we wouldn't know which pte was bad anyway so could print the wrong one.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  include/linux/pgtable.h | 28 +++++++++++++++
>>  include/linux/swap.h    | 12 +++++--
>>  mm/internal.h           | 48 +++++++++++++++++++++++++
>>  mm/madvise.c            | 12 ++++---
>>  mm/memory.c             | 13 +++----
>>  mm/swapfile.c           | 78 ++++++++++++++++++++++++++++++-----------
>>  6 files changed, 157 insertions(+), 34 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 09c85c7bf9c2..8185939df1e8 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -708,6 +708,34 @@ static inline void pte_clear_not_present_full(struct mm_struct *mm,
>>  }
>>  #endif
>>  
>> +#ifndef clear_not_present_full_ptes
>> +/**
>> + * clear_not_present_full_ptes - Clear consecutive not present PTEs.
>> + * @mm: Address space the ptes represent.
>> + * @addr: Address of the first pte.
>> + * @ptep: Page table pointer for the first entry.
>> + * @nr: Number of entries to clear.
>> + * @full: Whether we are clearing a full mm.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over pte_clear_not_present_full().
>> + *
>> + * Context: The caller holds the page table lock.  The PTEs are all not present.
>> + * The PTEs are all in the same PMD.
>> + */
>> +static inline void clear_not_present_full_ptes(struct mm_struct *mm,
>> +		unsigned long addr, pte_t *ptep, unsigned int nr, int full)
>> +{
>> +	for (;;) {
>> +		pte_clear_not_present_full(mm, addr, ptep, full);
>> +		if (--nr == 0)
>> +			break;
>> +		ptep++;
>> +		addr += PAGE_SIZE;
>> +	}
>> +}
>> +#endif
>> +
>>  #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
>>  extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
>>  			      unsigned long address,
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index f6f78198f000..5737236dc3ce 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -471,7 +471,7 @@ extern int swap_duplicate(swp_entry_t);
>>  extern int swapcache_prepare(swp_entry_t);
>>  extern void swap_free(swp_entry_t);
>>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>> -extern int free_swap_and_cache(swp_entry_t);
>> +extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>>  int swap_type_of(dev_t device, sector_t offset);
>>  int find_first_swap(dev_t *device);
>>  extern unsigned int count_swap_pages(int, int);
>> @@ -520,8 +520,9 @@ static inline void put_swap_device(struct swap_info_struct *si)
>>  #define free_pages_and_swap_cache(pages, nr) \
>>  	release_pages((pages), (nr));
>>  
>> -/* used to sanity check ptes in zap_pte_range when CONFIG_SWAP=0 */
>> -#define free_swap_and_cache(e) is_pfn_swap_entry(e)
>> +static inline void free_swap_and_cache_nr(swp_entry_t entry, int nr)
>> +{
>> +}
>>  
>>  static inline void free_swap_cache(struct folio *folio)
>>  {
>> @@ -589,6 +590,11 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
>>  }
>>  #endif /* CONFIG_SWAP */
>>  
>> +static inline void free_swap_and_cache(swp_entry_t entry)
>> +{
>> +	free_swap_and_cache_nr(entry, 1);
>> +}
>> +
>>  #ifdef CONFIG_MEMCG
>>  static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
>>  {
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 8e11f7b2da21..eadb79c3a357 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -11,6 +11,8 @@
>>  #include <linux/mm.h>
>>  #include <linux/pagemap.h>
>>  #include <linux/rmap.h>
>> +#include <linux/swap.h>
>> +#include <linux/swapops.h>
>>  #include <linux/tracepoint-defs.h>
>>  
>>  struct folio_batch;
>> @@ -189,6 +191,52 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>  
>>  	return min(ptep - start_ptep, max_nr);
>>  }
>> +
>> +/**
>> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
>> + * @start_ptep: Page table pointer for the first entry.
>> + * @max_nr: The maximum number of table entries to consider.
>> + * @entry: Swap entry recovered from the first table entry.
>> + *
>> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
>> + * containing swap entries all with consecutive offsets and targeting the same
>> + * swap type.
>> + *
>> + * max_nr must be at least one and must be limited by the caller so scanning
>> + * cannot exceed a single page table.
>> + *
>> + * Return: the number of table entries in the batch.
>> + */
>> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr,
>> +				 swp_entry_t entry)
>> +{
>> +	const pte_t *end_ptep = start_ptep + max_nr;
>> +	unsigned long expected_offset = swp_offset(entry) + 1;
>> +	unsigned int expected_type = swp_type(entry);
>> +	pte_t *ptep = start_ptep + 1;
>> +
>> +	VM_WARN_ON(max_nr < 1);
>> +	VM_WARN_ON(non_swap_entry(entry));
>> +
>> +	while (ptep < end_ptep) {
>> +		pte_t pte = ptep_get(ptep);
>> +
>> +		if (pte_none(pte) || pte_present(pte))
>> +			break;
>> +
>> +		entry = pte_to_swp_entry(pte);
>> +
>> +		if (non_swap_entry(entry) ||
>> +		    swp_type(entry) != expected_type ||
>> +		    swp_offset(entry) != expected_offset)
>> +			break;
>> +
>> +		expected_offset++;
>> +		ptep++;
>> +	}
>> +
>> +	return ptep - start_ptep;
>> +}
>>  #endif /* CONFIG_MMU */
>>  
>>  void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 1f77a51baaac..070bedb4996e 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>>  	struct folio *folio;
>>  	int nr_swap = 0;
>>  	unsigned long next;
>> +	int nr, max_nr;
>>  
>>  	next = pmd_addr_end(addr, end);
>>  	if (pmd_trans_huge(*pmd))
>> @@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>>  		return 0;
>>  	flush_tlb_batched_pending(mm);
>>  	arch_enter_lazy_mmu_mode();
>> -	for (; addr != end; pte++, addr += PAGE_SIZE) {
>> +	for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
>> +		nr = 1;
>>  		ptent = ptep_get(pte);
>>  
>>  		if (pte_none(ptent))
>> @@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>>  
>>  			entry = pte_to_swp_entry(ptent);
>>  			if (!non_swap_entry(entry)) {
>> -				nr_swap--;
>> -				free_swap_and_cache(entry);
>> -				pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> +				max_nr = (end - addr) / PAGE_SIZE;
>> +				nr = swap_pte_batch(pte, max_nr, entry);
>> +				nr_swap -= nr;
>> +				free_swap_and_cache_nr(entry, nr);
>> +				clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>>  			} else if (is_hwpoison_entry(entry) ||
>>  				   is_poisoned_swp_entry(entry)) {
>>  				pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 36191a9c799c..9d844582ba38 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1631,12 +1631,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>  				folio_remove_rmap_pte(folio, page, vma);
>>  			folio_put(folio);
>>  		} else if (!non_swap_entry(entry)) {
>> -			/* Genuine swap entry, hence a private anon page */
>> +			max_nr = (end - addr) / PAGE_SIZE;
>> +			nr = swap_pte_batch(pte, max_nr, entry);
>> +			/* Genuine swap entries, hence a private anon pages */
>>  			if (!should_zap_cows(details))
>>  				continue;
>> -			rss[MM_SWAPENTS]--;
>> -			if (unlikely(!free_swap_and_cache(entry)))
>> -				print_bad_pte(vma, addr, ptent, NULL);
>> +			rss[MM_SWAPENTS] -= nr;
>> +			free_swap_and_cache_nr(entry, nr);
>>  		} else if (is_migration_entry(entry)) {
>>  			folio = pfn_swap_entry_folio(entry);
>>  			if (!should_zap_folio(details, folio))
>> @@ -1659,8 +1660,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>  			pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
>>  			WARN_ON_ONCE(1);
>>  		}
>> -		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> -		zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
>> +		clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>> +		zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
>>  	} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>>  
>>  	add_mm_rss_vec(mm, rss);
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 0d44ee2b4f9c..cedfc82d37e5 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -130,7 +130,11 @@ static inline unsigned char swap_count(unsigned char ent)
>>  /* Reclaim the swap entry if swap is getting full*/
>>  #define TTRS_FULL		0x4
>>  
>> -/* returns 1 if swap entry is freed */
>> +/*
>> + * returns number of pages in the folio that backs the swap entry. If positive,
>> + * the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no
>> + * folio was associated with the swap entry.
>> + */
>>  static int __try_to_reclaim_swap(struct swap_info_struct *si,
>>  				 unsigned long offset, unsigned long flags)
>>  {
>> @@ -155,6 +159,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
>>  			ret = folio_free_swap(folio);
>>  		folio_unlock(folio);
>>  	}
>> +	ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio);
>>  	folio_put(folio);
>>  	return ret;
>>  }
>> @@ -895,7 +900,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  		swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
>>  		spin_lock(&si->lock);
>>  		/* entry was freed successfully, try to use this again */
>> -		if (swap_was_freed)
>> +		if (swap_was_freed > 0)
>>  			goto checks;
>>  		goto scan; /* check next one */
>>  	}
>> @@ -1572,32 +1577,63 @@ bool folio_free_swap(struct folio *folio)
>>  	return true;
>>  }
>>  
>> -/*
>> - * Free the swap entry like above, but also try to
>> - * free the page cache entry if it is the last user.
>> - */
>> -int free_swap_and_cache(swp_entry_t entry)
>> +void free_swap_and_cache_nr(swp_entry_t entry, int nr)
>>  {
>> -	struct swap_info_struct *p;
>> -	unsigned char count;
>> +	unsigned long end = swp_offset(entry) + nr;
>> +	unsigned int type = swp_type(entry);
>> +	struct swap_info_struct *si;
>> +	unsigned long offset;
>>  
>>  	if (non_swap_entry(entry))
>> -		return 1;
>> +		return;
>>  
>> -	p = get_swap_device(entry);
>> -	if (p) {
>> -		if (WARN_ON(data_race(!p->swap_map[swp_offset(entry)]))) {
>> -			put_swap_device(p);
>> -			return 0;
>> -		}
>> +	si = get_swap_device(entry);
>> +	if (!si)
>> +		return;
>>  
>> -		count = __swap_entry_free(p, entry);
>> -		if (count == SWAP_HAS_CACHE)
>> -			__try_to_reclaim_swap(p, swp_offset(entry),
>> +	if (WARN_ON(end > si->max))
>> +		goto out;
>> +
>> +	/*
>> +	 * First free all entries in the range.
>> +	 */
>> +	for (offset = swp_offset(entry); offset < end; offset++) {
>> +		if (!WARN_ON(data_race(!si->swap_map[offset])))
>> +			__swap_entry_free(si, swp_entry(type, offset));
> 
> I think that it's better to check the return value of
> __swap_entry_free() here.  When the return value != SWAP_HAS_CACHE, we
> can try to reclaim all swap entries we have checked before, then restart
> the check with the new start.

What's the benefit of your proposed aproach? I only see a drawback: if there are
large swap entries for which some pages have higher ref counts than others, we
will end up trying to reclaim (and fail) multiple times per folio. Whereas with
my current approach we only attempt reclaim once per folio.

Do you see a specific bug with what I'm currently doing?

Thanks,
Ryan

> 
>> +	}
>> +
>> +	/*
>> +	 * Now go back over the range trying to reclaim the swap cache. This is
>> +	 * more efficient for large folios because we will only try to reclaim
>> +	 * the swap once per folio in the common case. If we do
>> +	 * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the
>> +	 * latter will get a reference and lock the folio for every individual
>> +	 * page but will only succeed once the swap slot for every subpage is
>> +	 * zero.
>> +	 */
>> +	for (offset = swp_offset(entry); offset < end; offset += nr) {
>> +		nr = 1;
>> +		if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
>> +			/*
>> +			 * Folios are always naturally aligned in swap so
>> +			 * advance forward to the next boundary. Zero means no
>> +			 * folio was found for the swap entry, so advance by 1
>> +			 * in this case. Negative value means folio was found
>> +			 * but could not be reclaimed. Here we can still advance
>> +			 * to the next boundary.
>> +			 */
>> +			nr = __try_to_reclaim_swap(si, offset,
>>  					      TTRS_UNMAPPED | TTRS_FULL);
>> -		put_swap_device(p);
>> +			if (nr == 0)
>> +				nr = 1;
>> +			else if (nr < 0)
>> +				nr = -nr;
>> +			nr = ALIGN(offset + 1, nr) - offset;
>> +		}
>>  	}
>> -	return p != NULL;
>> +
>> +out:
>> +	put_swap_device(si);
>>  }
>>  
>>  #ifdef CONFIG_HIBERNATION
> 
> --
> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 4/6] mm: swap: Allow storage of all mTHP orders
  2024-04-01  3:15   ` Huang, Ying
@ 2024-04-02 11:18     ` Ryan Roberts
  2024-04-03  3:07       ` Huang, Ying
  0 siblings, 1 reply; 35+ messages in thread
From: Ryan Roberts @ 2024-04-02 11:18 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, Barry Song,
	Chris Li, Lance Yang, linux-mm, linux-kernel

On 01/04/2024 04:15, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> Multi-size THP enables performance improvements by allocating large,
>> pte-mapped folios for anonymous memory. However I've observed that on an
>> arm64 system running a parallel workload (e.g. kernel compilation)
>> across many cores, under high memory pressure, the speed regresses. This
>> is due to bottlenecking on the increased number of TLBIs added due to
>> all the extra folio splitting when the large folios are swapped out.
>>
>> Therefore, solve this regression by adding support for swapping out mTHP
>> without needing to split the folio, just like is already done for
>> PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled,
>> and when the swap backing store is a non-rotating block device. These
>> are the same constraints as for the existing PMD-sized THP swap-out
>> support.
>>
>> Note that no attempt is made to swap-in (m)THP here - this is still done
>> page-by-page, like for PMD-sized THP. But swapping-out mTHP is a
>> prerequisite for swapping-in mTHP.
>>
>> The main change here is to improve the swap entry allocator so that it
>> can allocate any power-of-2 number of contiguous entries between [1, (1
>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>> order and allocating sequentially from it until the cluster is full.
>> This ensures that we don't need to search the map and we get no
>> fragmentation due to alignment padding for different orders in the
>> cluster. If there is no current cluster for a given order, we attempt to
>> allocate a free cluster from the list. If there are no free clusters, we
>> fail the allocation and the caller can fall back to splitting the folio
>> and allocates individual entries (as per existing PMD-sized THP
>> fallback).
>>
>> The per-order current clusters are maintained per-cpu using the existing
>> infrastructure. This is done to avoid interleving pages from different
>> tasks, which would prevent IO being batched. This is already done for
>> the order-0 allocations so we follow the same pattern.
>>
>> As is done for order-0 per-cpu clusters, the scanner now can steal
>> order-0 entries from any per-cpu-per-order reserved cluster. This
>> ensures that when the swap file is getting full, space doesn't get tied
>> up in the per-cpu reserves.
>>
>> This change only modifies swap to be able to accept any order mTHP. It
>> doesn't change the callers to elide doing the actual split. That will be
>> done in separate changes.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  include/linux/swap.h |  10 ++-
>>  mm/swap_slots.c      |   6 +-
>>  mm/swapfile.c        | 175 ++++++++++++++++++++++++-------------------
>>  3 files changed, 109 insertions(+), 82 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 5e1e4f5bf0cb..11c53692f65f 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -268,13 +268,19 @@ struct swap_cluster_info {
>>   */
>>  #define SWAP_NEXT_INVALID	0
>>  
>> +#ifdef CONFIG_THP_SWAP
>> +#define SWAP_NR_ORDERS		(PMD_ORDER + 1)
>> +#else
>> +#define SWAP_NR_ORDERS		1
>> +#endif
>> +
>>  /*
>>   * We assign a cluster to each CPU, so each CPU can allocate swap entry from
>>   * its own cluster and swapout sequentially. The purpose is to optimize swapout
>>   * throughput.
>>   */
>>  struct percpu_cluster {
>> -	unsigned int next; /* Likely next allocation offset */
>> +	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
>>  };
>>  
>>  struct swap_cluster_list {
>> @@ -471,7 +477,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio);
>>  bool folio_free_swap(struct folio *folio);
>>  void put_swap_folio(struct folio *folio, swp_entry_t entry);
>>  extern swp_entry_t get_swap_page_of_type(int);
>> -extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size);
>> +extern int get_swap_pages(int n, swp_entry_t swp_entries[], int order);
>>  extern int add_swap_count_continuation(swp_entry_t, gfp_t);
>>  extern void swap_shmem_alloc(swp_entry_t);
>>  extern int swap_duplicate(swp_entry_t);
>> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
>> index 53abeaf1371d..13ab3b771409 100644
>> --- a/mm/swap_slots.c
>> +++ b/mm/swap_slots.c
>> @@ -264,7 +264,7 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
>>  	cache->cur = 0;
>>  	if (swap_slot_cache_active)
>>  		cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE,
>> -					   cache->slots, 1);
>> +					   cache->slots, 0);
>>  
>>  	return cache->nr;
>>  }
>> @@ -311,7 +311,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
>>  
>>  	if (folio_test_large(folio)) {
>>  		if (IS_ENABLED(CONFIG_THP_SWAP))
>> -			get_swap_pages(1, &entry, folio_nr_pages(folio));
>> +			get_swap_pages(1, &entry, folio_order(folio));
>>  		goto out;
>>  	}
>>  
>> @@ -343,7 +343,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
>>  			goto out;
>>  	}
>>  
>> -	get_swap_pages(1, &entry, 1);
>> +	get_swap_pages(1, &entry, 0);
>>  out:
>>  	if (mem_cgroup_try_charge_swap(folio, entry)) {
>>  		put_swap_folio(folio, entry);
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 1393966b77af..d56cdc547a06 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -278,15 +278,15 @@ static void discard_swap_cluster(struct swap_info_struct *si,
>>  #ifdef CONFIG_THP_SWAP
>>  #define SWAPFILE_CLUSTER	HPAGE_PMD_NR
>>  
>> -#define swap_entry_size(size)	(size)
>> +#define swap_entry_order(order)	(order)
>>  #else
>>  #define SWAPFILE_CLUSTER	256
>>  
>>  /*
>> - * Define swap_entry_size() as constant to let compiler to optimize
>> + * Define swap_entry_order() as constant to let compiler to optimize
>>   * out some code if !CONFIG_THP_SWAP
>>   */
>> -#define swap_entry_size(size)	1
>> +#define swap_entry_order(order)	0
>>  #endif
>>  #define LATENCY_LIMIT		256
>>  
>> @@ -551,10 +551,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>>  
>>  /*
>>   * The cluster corresponding to page_nr will be used. The cluster will be
>> - * removed from free cluster list and its usage counter will be increased.
>> + * removed from free cluster list and its usage counter will be increased by
>> + * count.
>>   */
>> -static void inc_cluster_info_page(struct swap_info_struct *p,
>> -	struct swap_cluster_info *cluster_info, unsigned long page_nr)
>> +static void add_cluster_info_page(struct swap_info_struct *p,
>> +	struct swap_cluster_info *cluster_info, unsigned long page_nr,
>> +	unsigned long count)
>>  {
>>  	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>>  
>> @@ -563,9 +565,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
>>  	if (cluster_is_free(&cluster_info[idx]))
>>  		alloc_cluster(p, idx);
>>  
>> -	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
>> +	VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
>>  	cluster_set_count(&cluster_info[idx],
>> -		cluster_count(&cluster_info[idx]) + 1);
>> +		cluster_count(&cluster_info[idx]) + count);
>> +}
>> +
>> +/*
>> + * The cluster corresponding to page_nr will be used. The cluster will be
>> + * removed from free cluster list and its usage counter will be increased by 1.
>> + */
>> +static void inc_cluster_info_page(struct swap_info_struct *p,
>> +	struct swap_cluster_info *cluster_info, unsigned long page_nr)
>> +{
>> +	add_cluster_info_page(p, cluster_info, page_nr, 1);
>>  }
>>  
>>  /*
>> @@ -595,7 +607,7 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
>>   */
>>  static bool
>>  scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>> -	unsigned long offset)
>> +	unsigned long offset, int order)
>>  {
>>  	struct percpu_cluster *percpu_cluster;
>>  	bool conflict;
>> @@ -609,24 +621,39 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>  		return false;
>>  
>>  	percpu_cluster = this_cpu_ptr(si->percpu_cluster);
>> -	percpu_cluster->next = SWAP_NEXT_INVALID;
>> +	percpu_cluster->next[order] = SWAP_NEXT_INVALID;
>> +	return true;
>> +}
>> +
>> +static inline bool swap_range_empty(char *swap_map, unsigned int start,
>> +				    unsigned int nr_pages)
>> +{
>> +	unsigned int i;
>> +
>> +	for (i = 0; i < nr_pages; i++) {
>> +		if (swap_map[start + i])
>> +			return false;
>> +	}
>> +
>>  	return true;
>>  }
>>  
>>  /*
>> - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
>> - * might involve allocating a new cluster for current CPU too.
>> + * Try to get swap entries with specified order from current cpu's swap entry
>> + * pool (a cluster). This might involve allocating a new cluster for current CPU
>> + * too.
>>   */
>>  static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>> -	unsigned long *offset, unsigned long *scan_base)
>> +	unsigned long *offset, unsigned long *scan_base, int order)
>>  {
>> +	unsigned int nr_pages = 1 << order;
> 
> Use swap_entry_order()?

I had previously convinced myself that the compiler should be smart enough to
propagate the constant from

get_swap_pages -> scan_swap_map_slots -> scan_swap_map_try_ssd_cluster

But I'll add the explicit macro for the next version, as you suggest.

> 
>>  	struct percpu_cluster *cluster;
>>  	struct swap_cluster_info *ci;
>>  	unsigned int tmp, max;
>>  
>>  new_cluster:
>>  	cluster = this_cpu_ptr(si->percpu_cluster);
>> -	tmp = cluster->next;
>> +	tmp = cluster->next[order];
>>  	if (tmp == SWAP_NEXT_INVALID) {
>>  		if (!cluster_list_empty(&si->free_clusters)) {
>>  			tmp = cluster_next(&si->free_clusters.head) *
>> @@ -647,26 +674,27 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>  
>>  	/*
>>  	 * Other CPUs can use our cluster if they can't find a free cluster,
>> -	 * check if there is still free entry in the cluster
>> +	 * check if there is still free entry in the cluster, maintaining
>> +	 * natural alignment.
>>  	 */
>>  	max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER));
>>  	if (tmp < max) {
>>  		ci = lock_cluster(si, tmp);
>>  		while (tmp < max) {
>> -			if (!si->swap_map[tmp])
>> +			if (swap_range_empty(si->swap_map, tmp, nr_pages))
>>  				break;
>> -			tmp++;
>> +			tmp += nr_pages;
>>  		}
>>  		unlock_cluster(ci);
>>  	}
>>  	if (tmp >= max) {
>> -		cluster->next = SWAP_NEXT_INVALID;
>> +		cluster->next[order] = SWAP_NEXT_INVALID;
>>  		goto new_cluster;
>>  	}
>>  	*offset = tmp;
>>  	*scan_base = tmp;
>> -	tmp += 1;
>> -	cluster->next = tmp < max ? tmp : SWAP_NEXT_INVALID;
>> +	tmp += nr_pages;
>> +	cluster->next[order] = tmp < max ? tmp : SWAP_NEXT_INVALID;
>>  	return true;
>>  }
>>  
>> @@ -796,13 +824,14 @@ static bool swap_offset_available_and_locked(struct swap_info_struct *si,
>>  
>>  static int scan_swap_map_slots(struct swap_info_struct *si,
>>  			       unsigned char usage, int nr,
>> -			       swp_entry_t slots[])
>> +			       swp_entry_t slots[], int order)
>>  {
>>  	struct swap_cluster_info *ci;
>>  	unsigned long offset;
>>  	unsigned long scan_base;
>>  	unsigned long last_in_cluster = 0;
>>  	int latency_ration = LATENCY_LIMIT;
>> +	unsigned int nr_pages = 1 << order;
> 
> ditto.

ditto my answer, but I'll add the explicit macro for the next version, as you
suggest.

> 
> Otherwise LGTM, feel free to add
> 
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

Thanks!

> 
> in the future versions.
> 
> --
> Best Regards,
> Huang, Ying
> 
>>  	int n_ret = 0;
>>  	bool scanned_many = false;
>>  
>> @@ -817,6 +846,25 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  	 * And we let swap pages go all over an SSD partition.  Hugh
>>  	 */
>>  
>> +	if (order > 0) {
>> +		/*
>> +		 * Should not even be attempting large allocations when huge
>> +		 * page swap is disabled.  Warn and fail the allocation.
>> +		 */
>> +		if (!IS_ENABLED(CONFIG_THP_SWAP) ||
>> +		    nr_pages > SWAPFILE_CLUSTER) {
>> +			VM_WARN_ON_ONCE(1);
>> +			return 0;
>> +		}
>> +
>> +		/*
>> +		 * Swapfile is not block device or not using clusters so unable
>> +		 * to allocate large entries.
>> +		 */
>> +		if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
>> +			return 0;
>> +	}
>> +
>>  	si->flags += SWP_SCANNING;
>>  	/*
>>  	 * Use percpu scan base for SSD to reduce lock contention on
>> @@ -831,8 +879,11 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  
>>  	/* SSD algorithm */
>>  	if (si->cluster_info) {
>> -		if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
>> +		if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) {
>> +			if (order > 0)
>> +				goto no_page;
>>  			goto scan;
>> +		}
>>  	} else if (unlikely(!si->cluster_nr--)) {
>>  		if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
>>  			si->cluster_nr = SWAPFILE_CLUSTER - 1;
>> @@ -874,13 +925,16 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  
>>  checks:
>>  	if (si->cluster_info) {
>> -		while (scan_swap_map_ssd_cluster_conflict(si, offset)) {
>> +		while (scan_swap_map_ssd_cluster_conflict(si, offset, order)) {
>>  		/* take a break if we already got some slots */
>>  			if (n_ret)
>>  				goto done;
>>  			if (!scan_swap_map_try_ssd_cluster(si, &offset,
>> -							&scan_base))
>> +							&scan_base, order)) {
>> +				if (order > 0)
>> +					goto no_page;
>>  				goto scan;
>> +			}
>>  		}
>>  	}
>>  	if (!(si->flags & SWP_WRITEOK))
>> @@ -911,11 +965,11 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  		else
>>  			goto done;
>>  	}
>> -	WRITE_ONCE(si->swap_map[offset], usage);
>> -	inc_cluster_info_page(si, si->cluster_info, offset);
>> +	memset(si->swap_map + offset, usage, nr_pages);
>> +	add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>>  	unlock_cluster(ci);
>>  
>> -	swap_range_alloc(si, offset, 1);
>> +	swap_range_alloc(si, offset, nr_pages);
>>  	slots[n_ret++] = swp_entry(si->type, offset);
>>  
>>  	/* got enough slots or reach max slots? */
>> @@ -936,8 +990,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  
>>  	/* try to get more slots in cluster */
>>  	if (si->cluster_info) {
>> -		if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
>> +		if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order))
>>  			goto checks;
>> +		if (order > 0)
>> +			goto done;
>>  	} else if (si->cluster_nr && !si->swap_map[++offset]) {
>>  		/* non-ssd case, still more slots in cluster? */
>>  		--si->cluster_nr;
>> @@ -964,11 +1020,13 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  	}
>>  
>>  done:
>> -	set_cluster_next(si, offset + 1);
>> +	if (order == 0)
>> +		set_cluster_next(si, offset + 1);
>>  	si->flags -= SWP_SCANNING;
>>  	return n_ret;
>>  
>>  scan:
>> +	VM_WARN_ON(order > 0);
>>  	spin_unlock(&si->lock);
>>  	while (++offset <= READ_ONCE(si->highest_bit)) {
>>  		if (unlikely(--latency_ration < 0)) {
>> @@ -997,38 +1055,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  	return n_ret;
>>  }
>>  
>> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
>> -{
>> -	unsigned long idx;
>> -	struct swap_cluster_info *ci;
>> -	unsigned long offset;
>> -
>> -	/*
>> -	 * Should not even be attempting cluster allocations when huge
>> -	 * page swap is disabled.  Warn and fail the allocation.
>> -	 */
>> -	if (!IS_ENABLED(CONFIG_THP_SWAP)) {
>> -		VM_WARN_ON_ONCE(1);
>> -		return 0;
>> -	}
>> -
>> -	if (cluster_list_empty(&si->free_clusters))
>> -		return 0;
>> -
>> -	idx = cluster_list_first(&si->free_clusters);
>> -	offset = idx * SWAPFILE_CLUSTER;
>> -	ci = lock_cluster(si, offset);
>> -	alloc_cluster(si, idx);
>> -	cluster_set_count(ci, SWAPFILE_CLUSTER);
>> -
>> -	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
>> -	unlock_cluster(ci);
>> -	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
>> -	*slot = swp_entry(si->type, offset);
>> -
>> -	return 1;
>> -}
>> -
>>  static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>>  {
>>  	unsigned long offset = idx * SWAPFILE_CLUSTER;
>> @@ -1042,17 +1068,15 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>>  	swap_range_free(si, offset, SWAPFILE_CLUSTER);
>>  }
>>  
>> -int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>> +int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
>>  {
>> -	unsigned long size = swap_entry_size(entry_size);
>> +	int order = swap_entry_order(entry_order);
>> +	unsigned long size = 1 << order;
>>  	struct swap_info_struct *si, *next;
>>  	long avail_pgs;
>>  	int n_ret = 0;
>>  	int node;
>>  
>> -	/* Only single cluster request supported */
>> -	WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
>> -
>>  	spin_lock(&swap_avail_lock);
>>  
>>  	avail_pgs = atomic_long_read(&nr_swap_pages) / size;
>> @@ -1088,14 +1112,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>>  			spin_unlock(&si->lock);
>>  			goto nextsi;
>>  		}
>> -		if (size == SWAPFILE_CLUSTER) {
>> -			if (si->flags & SWP_BLKDEV)
>> -				n_ret = swap_alloc_cluster(si, swp_entries);
>> -		} else
>> -			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
>> -						    n_goal, swp_entries);
>> +		n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
>> +					    n_goal, swp_entries, order);
>>  		spin_unlock(&si->lock);
>> -		if (n_ret || size == SWAPFILE_CLUSTER)
>> +		if (n_ret || size > 1)
>>  			goto check_out;
>>  		cond_resched();
>>  
>> @@ -1349,7 +1369,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
>>  	unsigned char *map;
>>  	unsigned int i, free_entries = 0;
>>  	unsigned char val;
>> -	int size = swap_entry_size(folio_nr_pages(folio));
>> +	int size = 1 << swap_entry_order(folio_order(folio));
>>  
>>  	si = _swap_info_get(entry);
>>  	if (!si)
>> @@ -1647,7 +1667,7 @@ swp_entry_t get_swap_page_of_type(int type)
>>  
>>  	/* This is called for allocating swap entry, not cache */
>>  	spin_lock(&si->lock);
>> -	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry))
>> +	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
>>  		atomic_long_dec(&nr_swap_pages);
>>  	spin_unlock(&si->lock);
>>  fail:
>> @@ -3101,7 +3121,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>  		p->flags |= SWP_SYNCHRONOUS_IO;
>>  
>>  	if (p->bdev && bdev_nonrot(p->bdev)) {
>> -		int cpu;
>> +		int cpu, i;
>>  		unsigned long ci, nr_cluster;
>>  
>>  		p->flags |= SWP_SOLIDSTATE;
>> @@ -3139,7 +3159,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>  			struct percpu_cluster *cluster;
>>  
>>  			cluster = per_cpu_ptr(p->percpu_cluster, cpu);
>> -			cluster->next = SWAP_NEXT_INVALID;
>> +			for (i = 0; i < SWAP_NR_ORDERS; i++)
>> +				cluster->next[i] = SWAP_NEXT_INVALID;
>>  		}
>>  	} else {
>>  		atomic_inc(&nr_rotate_swap);


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD
  2024-04-01 12:25   ` Lance Yang
@ 2024-04-02 11:20     ` Ryan Roberts
  2024-04-02 11:30       ` Lance Yang
  0 siblings, 1 reply; 35+ messages in thread
From: Ryan Roberts @ 2024-04-02 11:20 UTC (permalink / raw)
  To: Lance Yang
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Barry Song, Chris Li, linux-mm, linux-kernel, Barry Song

On 01/04/2024 13:25, Lance Yang wrote:
> On Wed, Mar 27, 2024 at 10:46 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
>> folio that is fully and contiguously mapped in the pageout/cold vm
>> range. This change means that large folios will be maintained all the
>> way to swap storage. This both improves performance during swap-out, by
>> eliding the cost of splitting the folio, and sets us up nicely for
>> maintaining the large folio when it is swapped back in (to be covered in
>> a separate series).
>>
>> Folios that are not fully mapped in the target range are still split,
>> but note that behavior is changed so that if the split fails for any
>> reason (folio locked, shared, etc) we now leave it as is and move to the
>> next pte in the range and continue work on the proceeding folios.
>> Previously any failure of this sort would cause the entire operation to
>> give up and no folios mapped at higher addresses were paged out or made
>> cold. Given large folios are becoming more common, this old behavior
>> would have likely lead to wasted opportunities.
>>
>> While we are at it, change the code that clears young from the ptes to
>> use ptep_test_and_clear_young(), via the new mkold_ptes() batch helper
>> function. This is more efficent than get_and_clear/modify/set,
>> especially for contpte mappings on arm64, where the old approach would
>> require unfolding/refolding and the new approach can be done in place.
>>
>> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  include/linux/pgtable.h | 30 ++++++++++++++
>>  mm/internal.h           | 12 +++++-
>>  mm/madvise.c            | 88 ++++++++++++++++++++++++-----------------
>>  mm/memory.c             |  4 +-
>>  4 files changed, 93 insertions(+), 41 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 8185939df1e8..391f56a1b188 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -361,6 +361,36 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>  }
>>  #endif
>>
>> +#ifndef mkold_ptes
>> +/**
>> + * mkold_ptes - Mark PTEs that map consecutive pages of the same folio as old.
>> + * @vma: VMA the pages are mapped into.
>> + * @addr: Address the first page is mapped at.
>> + * @ptep: Page table pointer for the first entry.
>> + * @nr: Number of entries to mark old.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over ptep_test_and_clear_young().
>> + *
>> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
>> + * some PTEs might be write-protected.
>> + *
>> + * Context: The caller holds the page table lock.  The PTEs map consecutive
>> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
>> + */
>> +static inline void mkold_ptes(struct vm_area_struct *vma, unsigned long addr,
>> +               pte_t *ptep, unsigned int nr)
>> +{
>> +       for (;;) {
>> +               ptep_test_and_clear_young(vma, addr, ptep);
> 
> IIUC, if the first PTE is a CONT-PTE, then calling ptep_test_and_clear_young()
> will clear the young bit for the entire contig range to avoid having
> to unfold. So,
> the other PTEs within the range don't need to clear again.
> 
> Maybe we should consider overriding mkold_ptes for arm64?

Yes completely agree. I was saving this for a separate submission though, to
reduce the complexity of this initial series as much as possible. Let me know if
you disagree and want to see that change as part of this series.

> 
> Thanks,
> Lance
> 
>> +               if (--nr == 0)
>> +                       break;
>> +               ptep++;
>> +               addr += PAGE_SIZE;
>> +       }
>> +}
>> +#endif
>> +
>>  #ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
>>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
>>  static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
>> diff --git a/mm/internal.h b/mm/internal.h
>> index eadb79c3a357..efee8e4cd2af 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -130,6 +130,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>   * @flags: Flags to modify the PTE batch semantics.
>>   * @any_writable: Optional pointer to indicate whether any entry except the
>>   *               first one is writable.
>> + * @any_young: Optional pointer to indicate whether any entry except the
>> + *               first one is young.
>>   *
>>   * Detect a PTE batch: consecutive (present) PTEs that map consecutive
>>   * pages of the same large folio.
>> @@ -145,16 +147,18 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>>   */
>>  static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>                 pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
>> -               bool *any_writable)
>> +               bool *any_writable, bool *any_young)
>>  {
>>         unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>>         const pte_t *end_ptep = start_ptep + max_nr;
>>         pte_t expected_pte, *ptep;
>> -       bool writable;
>> +       bool writable, young;
>>         int nr;
>>
>>         if (any_writable)
>>                 *any_writable = false;
>> +       if (any_young)
>> +               *any_young = false;
>>
>>         VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>>         VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
>> @@ -168,6 +172,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>                 pte = ptep_get(ptep);
>>                 if (any_writable)
>>                         writable = !!pte_write(pte);
>> +               if (any_young)
>> +                       young = !!pte_young(pte);
>>                 pte = __pte_batch_clear_ignored(pte, flags);
>>
>>                 if (!pte_same(pte, expected_pte))
>> @@ -183,6 +189,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>
>>                 if (any_writable)
>>                         *any_writable |= writable;
>> +               if (any_young)
>> +                       *any_young |= young;
>>
>>                 nr = pte_batch_hint(ptep, pte);
>>                 expected_pte = pte_advance_pfn(expected_pte, nr);
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 070bedb4996e..bd00b83e7c50 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>         LIST_HEAD(folio_list);
>>         bool pageout_anon_only_filter;
>>         unsigned int batch_count = 0;
>> +       int nr;
>>
>>         if (fatal_signal_pending(current))
>>                 return -EINTR;
>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>                 return 0;
>>         flush_tlb_batched_pending(mm);
>>         arch_enter_lazy_mmu_mode();
>> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
>> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
>> +               nr = 1;
>>                 ptent = ptep_get(pte);
>>
>>                 if (++batch_count == SWAP_CLUSTER_MAX) {
>> @@ -447,55 +449,67 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>                         continue;
>>
>>                 /*
>> -                * Creating a THP page is expensive so split it only if we
>> -                * are sure it's worth. Split it if we are only owner.
>> +                * If we encounter a large folio, only split it if it is not
>> +                * fully mapped within the range we are operating on. Otherwise
>> +                * leave it as is so that it can be swapped out whole. If we
>> +                * fail to split a folio, leave it in place and advance to the
>> +                * next pte in the range.
>>                  */
>>                 if (folio_test_large(folio)) {
>> -                       int err;
>> -
>> -                       if (folio_likely_mapped_shared(folio))
>> -                               break;
>> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
>> -                               break;
>> -                       if (!folio_trylock(folio))
>> -                               break;
>> -                       folio_get(folio);
>> -                       arch_leave_lazy_mmu_mode();
>> -                       pte_unmap_unlock(start_pte, ptl);
>> -                       start_pte = NULL;
>> -                       err = split_folio(folio);
>> -                       folio_unlock(folio);
>> -                       folio_put(folio);
>> -                       if (err)
>> -                               break;
>> -                       start_pte = pte =
>> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
>> -                       if (!start_pte)
>> -                               break;
>> -                       arch_enter_lazy_mmu_mode();
>> -                       pte--;
>> -                       addr -= PAGE_SIZE;
>> -                       continue;
>> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>> +                                               FPB_IGNORE_SOFT_DIRTY;
>> +                       int max_nr = (end - addr) / PAGE_SIZE;
>> +                       bool any_young;
>> +
>> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>> +                                            fpb_flags, NULL, &any_young);
>> +                       if (any_young)
>> +                               ptent = pte_mkyoung(ptent);
>> +
>> +                       if (nr < folio_nr_pages(folio)) {
>> +                               int err;
>> +
>> +                               if (folio_likely_mapped_shared(folio))
>> +                                       continue;
>> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
>> +                                       continue;
>> +                               if (!folio_trylock(folio))
>> +                                       continue;
>> +                               folio_get(folio);
>> +                               arch_leave_lazy_mmu_mode();
>> +                               pte_unmap_unlock(start_pte, ptl);
>> +                               start_pte = NULL;
>> +                               err = split_folio(folio);
>> +                               folio_unlock(folio);
>> +                               folio_put(folio);
>> +                               if (err)
>> +                                       continue;
>> +                               start_pte = pte =
>> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
>> +                               if (!start_pte)
>> +                                       break;
>> +                               arch_enter_lazy_mmu_mode();
>> +                               nr = 0;
>> +                               continue;
>> +                       }
>>                 }
>>
>>                 /*
>>                  * Do not interfere with other mappings of this folio and
>> -                * non-LRU folio.
>> +                * non-LRU folio. If we have a large folio at this point, we
>> +                * know it is fully mapped so if its mapcount is the same as its
>> +                * number of pages, it must be exclusive.
>>                  */
>> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
>> +               if (!folio_test_lru(folio) ||
>> +                   folio_mapcount(folio) != folio_nr_pages(folio))
>>                         continue;
>>
>>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
>>                         continue;
>>
>> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>> -
>>                 if (!pageout && pte_young(ptent)) {
>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
>> -                                                       tlb->fullmm);
>> -                       ptent = pte_mkold(ptent);
>> -                       set_pte_at(mm, addr, pte, ptent);
>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
>> +                       mkold_ptes(vma, addr, pte, nr);
>> +                       tlb_remove_tlb_entries(tlb, pte, nr, addr);
>>                 }
>>
>>                 /*
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 9d844582ba38..b5b48f4cf2af 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -989,7 +989,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>>                         flags |= FPB_IGNORE_SOFT_DIRTY;
>>
>>                 nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags,
>> -                                    &any_writable);
>> +                                    &any_writable, NULL);
>>                 folio_ref_add(folio, nr);
>>                 if (folio_test_anon(folio)) {
>>                         if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
>> @@ -1553,7 +1553,7 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
>>          */
>>         if (unlikely(folio_test_large(folio) && max_nr != 1)) {
>>                 nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, fpb_flags,
>> -                                    NULL);
>> +                                    NULL, NULL);
>>
>>                 zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
>>                                        addr, details, rss, force_flush,
>> --
>> 2.25.1
>>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD
  2024-04-02 11:20     ` Ryan Roberts
@ 2024-04-02 11:30       ` Lance Yang
  0 siblings, 0 replies; 35+ messages in thread
From: Lance Yang @ 2024-04-02 11:30 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Barry Song, Chris Li, linux-mm, linux-kernel, Barry Song

On Tue, Apr 2, 2024 at 7:20 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 01/04/2024 13:25, Lance Yang wrote:
> > On Wed, Mar 27, 2024 at 10:46 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> >> folio that is fully and contiguously mapped in the pageout/cold vm
> >> range. This change means that large folios will be maintained all the
> >> way to swap storage. This both improves performance during swap-out, by
> >> eliding the cost of splitting the folio, and sets us up nicely for
> >> maintaining the large folio when it is swapped back in (to be covered in
> >> a separate series).
> >>
> >> Folios that are not fully mapped in the target range are still split,
> >> but note that behavior is changed so that if the split fails for any
> >> reason (folio locked, shared, etc) we now leave it as is and move to the
> >> next pte in the range and continue work on the proceeding folios.
> >> Previously any failure of this sort would cause the entire operation to
> >> give up and no folios mapped at higher addresses were paged out or made
> >> cold. Given large folios are becoming more common, this old behavior
> >> would have likely lead to wasted opportunities.
> >>
> >> While we are at it, change the code that clears young from the ptes to
> >> use ptep_test_and_clear_young(), via the new mkold_ptes() batch helper
> >> function. This is more efficent than get_and_clear/modify/set,
> >> especially for contpte mappings on arm64, where the old approach would
> >> require unfolding/refolding and the new approach can be done in place.
> >>
> >> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >> ---
> >>  include/linux/pgtable.h | 30 ++++++++++++++
> >>  mm/internal.h           | 12 +++++-
> >>  mm/madvise.c            | 88 ++++++++++++++++++++++++-----------------
> >>  mm/memory.c             |  4 +-
> >>  4 files changed, 93 insertions(+), 41 deletions(-)
> >>
> >> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >> index 8185939df1e8..391f56a1b188 100644
> >> --- a/include/linux/pgtable.h
> >> +++ b/include/linux/pgtable.h
> >> @@ -361,6 +361,36 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> >>  }
> >>  #endif
> >>
> >> +#ifndef mkold_ptes
> >> +/**
> >> + * mkold_ptes - Mark PTEs that map consecutive pages of the same folio as old.
> >> + * @vma: VMA the pages are mapped into.
> >> + * @addr: Address the first page is mapped at.
> >> + * @ptep: Page table pointer for the first entry.
> >> + * @nr: Number of entries to mark old.
> >> + *
> >> + * May be overridden by the architecture; otherwise, implemented as a simple
> >> + * loop over ptep_test_and_clear_young().
> >> + *
> >> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> >> + * some PTEs might be write-protected.
> >> + *
> >> + * Context: The caller holds the page table lock.  The PTEs map consecutive
> >> + * pages that belong to the same folio.  The PTEs are all in the same PMD.
> >> + */
> >> +static inline void mkold_ptes(struct vm_area_struct *vma, unsigned long addr,
> >> +               pte_t *ptep, unsigned int nr)
> >> +{
> >> +       for (;;) {
> >> +               ptep_test_and_clear_young(vma, addr, ptep);
> >
> > IIUC, if the first PTE is a CONT-PTE, then calling ptep_test_and_clear_young()
> > will clear the young bit for the entire contig range to avoid having
> > to unfold. So,
> > the other PTEs within the range don't need to clear again.
> >
> > Maybe we should consider overriding mkold_ptes for arm64?
>
> Yes completely agree. I was saving this for a separate submission though, to
> reduce the complexity of this initial series as much as possible. Let me know if
> you disagree and want to see that change as part of this series.

Feel free to save the change for a separate submission :)

>
> >
> > Thanks,
> > Lance
> >
> >> +               if (--nr == 0)
> >> +                       break;
> >> +               ptep++;
> >> +               addr += PAGE_SIZE;
> >> +       }
> >> +}
> >> +#endif
> >> +
> >>  #ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
> >>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
> >>  static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
> >> diff --git a/mm/internal.h b/mm/internal.h
> >> index eadb79c3a357..efee8e4cd2af 100644
> >> --- a/mm/internal.h
> >> +++ b/mm/internal.h
> >> @@ -130,6 +130,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> >>   * @flags: Flags to modify the PTE batch semantics.
> >>   * @any_writable: Optional pointer to indicate whether any entry except the
> >>   *               first one is writable.
> >> + * @any_young: Optional pointer to indicate whether any entry except the
> >> + *               first one is young.
> >>   *
> >>   * Detect a PTE batch: consecutive (present) PTEs that map consecutive
> >>   * pages of the same large folio.
> >> @@ -145,16 +147,18 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> >>   */
> >>  static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> >>                 pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> >> -               bool *any_writable)
> >> +               bool *any_writable, bool *any_young)
> >>  {
> >>         unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
> >>         const pte_t *end_ptep = start_ptep + max_nr;
> >>         pte_t expected_pte, *ptep;
> >> -       bool writable;
> >> +       bool writable, young;
> >>         int nr;
> >>
> >>         if (any_writable)
> >>                 *any_writable = false;
> >> +       if (any_young)
> >> +               *any_young = false;
> >>
> >>         VM_WARN_ON_FOLIO(!pte_present(pte), folio);
> >>         VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
> >> @@ -168,6 +172,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> >>                 pte = ptep_get(ptep);
> >>                 if (any_writable)
> >>                         writable = !!pte_write(pte);
> >> +               if (any_young)
> >> +                       young = !!pte_young(pte);
> >>                 pte = __pte_batch_clear_ignored(pte, flags);
> >>
> >>                 if (!pte_same(pte, expected_pte))
> >> @@ -183,6 +189,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> >>
> >>                 if (any_writable)
> >>                         *any_writable |= writable;
> >> +               if (any_young)
> >> +                       *any_young |= young;
> >>
> >>                 nr = pte_batch_hint(ptep, pte);
> >>                 expected_pte = pte_advance_pfn(expected_pte, nr);
> >> diff --git a/mm/madvise.c b/mm/madvise.c
> >> index 070bedb4996e..bd00b83e7c50 100644
> >> --- a/mm/madvise.c
> >> +++ b/mm/madvise.c
> >> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>         LIST_HEAD(folio_list);
> >>         bool pageout_anon_only_filter;
> >>         unsigned int batch_count = 0;
> >> +       int nr;
> >>
> >>         if (fatal_signal_pending(current))
> >>                 return -EINTR;
> >> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>                 return 0;
> >>         flush_tlb_batched_pending(mm);
> >>         arch_enter_lazy_mmu_mode();
> >> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
> >> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> >> +               nr = 1;
> >>                 ptent = ptep_get(pte);
> >>
> >>                 if (++batch_count == SWAP_CLUSTER_MAX) {
> >> @@ -447,55 +449,67 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>                         continue;
> >>
> >>                 /*
> >> -                * Creating a THP page is expensive so split it only if we
> >> -                * are sure it's worth. Split it if we are only owner.
> >> +                * If we encounter a large folio, only split it if it is not
> >> +                * fully mapped within the range we are operating on. Otherwise
> >> +                * leave it as is so that it can be swapped out whole. If we
> >> +                * fail to split a folio, leave it in place and advance to the
> >> +                * next pte in the range.
> >>                  */
> >>                 if (folio_test_large(folio)) {
> >> -                       int err;
> >> -
> >> -                       if (folio_likely_mapped_shared(folio))
> >> -                               break;
> >> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
> >> -                               break;
> >> -                       if (!folio_trylock(folio))
> >> -                               break;
> >> -                       folio_get(folio);
> >> -                       arch_leave_lazy_mmu_mode();
> >> -                       pte_unmap_unlock(start_pte, ptl);
> >> -                       start_pte = NULL;
> >> -                       err = split_folio(folio);
> >> -                       folio_unlock(folio);
> >> -                       folio_put(folio);
> >> -                       if (err)
> >> -                               break;
> >> -                       start_pte = pte =
> >> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
> >> -                       if (!start_pte)
> >> -                               break;
> >> -                       arch_enter_lazy_mmu_mode();
> >> -                       pte--;
> >> -                       addr -= PAGE_SIZE;
> >> -                       continue;
> >> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> >> +                                               FPB_IGNORE_SOFT_DIRTY;
> >> +                       int max_nr = (end - addr) / PAGE_SIZE;
> >> +                       bool any_young;
> >> +
> >> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >> +                                            fpb_flags, NULL, &any_young);
> >> +                       if (any_young)
> >> +                               ptent = pte_mkyoung(ptent);
> >> +
> >> +                       if (nr < folio_nr_pages(folio)) {
> >> +                               int err;
> >> +
> >> +                               if (folio_likely_mapped_shared(folio))
> >> +                                       continue;
> >> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
> >> +                                       continue;
> >> +                               if (!folio_trylock(folio))
> >> +                                       continue;
> >> +                               folio_get(folio);
> >> +                               arch_leave_lazy_mmu_mode();
> >> +                               pte_unmap_unlock(start_pte, ptl);
> >> +                               start_pte = NULL;
> >> +                               err = split_folio(folio);
> >> +                               folio_unlock(folio);
> >> +                               folio_put(folio);
> >> +                               if (err)
> >> +                                       continue;
> >> +                               start_pte = pte =
> >> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
> >> +                               if (!start_pte)
> >> +                                       break;
> >> +                               arch_enter_lazy_mmu_mode();
> >> +                               nr = 0;
> >> +                               continue;
> >> +                       }
> >>                 }
> >>
> >>                 /*
> >>                  * Do not interfere with other mappings of this folio and
> >> -                * non-LRU folio.
> >> +                * non-LRU folio. If we have a large folio at this point, we
> >> +                * know it is fully mapped so if its mapcount is the same as its
> >> +                * number of pages, it must be exclusive.
> >>                  */
> >> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> >> +               if (!folio_test_lru(folio) ||
> >> +                   folio_mapcount(folio) != folio_nr_pages(folio))
> >>                         continue;
> >>
> >>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>                         continue;
> >>
> >> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >> -
> >>                 if (!pageout && pte_young(ptent)) {
> >> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
> >> -                                                       tlb->fullmm);
> >> -                       ptent = pte_mkold(ptent);
> >> -                       set_pte_at(mm, addr, pte, ptent);
> >> -                       tlb_remove_tlb_entry(tlb, pte, addr);
> >> +                       mkold_ptes(vma, addr, pte, nr);
> >> +                       tlb_remove_tlb_entries(tlb, pte, nr, addr);
> >>                 }
> >>
> >>                 /*
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index 9d844582ba38..b5b48f4cf2af 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -989,7 +989,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
> >>                         flags |= FPB_IGNORE_SOFT_DIRTY;
> >>
> >>                 nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags,
> >> -                                    &any_writable);
> >> +                                    &any_writable, NULL);
> >>                 folio_ref_add(folio, nr);
> >>                 if (folio_test_anon(folio)) {
> >>                         if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
> >> @@ -1553,7 +1553,7 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
> >>          */
> >>         if (unlikely(folio_test_large(folio) && max_nr != 1)) {
> >>                 nr = folio_pte_batch(folio, addr, pte, ptent, max_nr, fpb_flags,
> >> -                                    NULL);
> >> +                                    NULL, NULL);
> >>
> >>                 zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
> >>                                        addr, details, rss, force_flush,
> >> --
> >> 2.25.1
> >>
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 5/6] mm: vmscan: Avoid split during shrink_folio_list()
  2024-03-28  8:18   ` Barry Song
  2024-03-28  8:48     ` Ryan Roberts
@ 2024-04-02 13:10     ` Ryan Roberts
  2024-04-02 13:22       ` Lance Yang
                         ` (2 more replies)
  1 sibling, 3 replies; 35+ messages in thread
From: Ryan Roberts @ 2024-04-02 13:10 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Chris Li, Lance Yang, linux-mm, linux-kernel, Barry Song

On 28/03/2024 08:18, Barry Song wrote:
> On Thu, Mar 28, 2024 at 3:45 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Now that swap supports storing all mTHP sizes, avoid splitting large
>> folios before swap-out. This benefits performance of the swap-out path
>> by eliding split_folio_to_list(), which is expensive, and also sets us
>> up for swapping in large folios in a future series.
>>
>> If the folio is partially mapped, we continue to split it since we want
>> to avoid the extra IO overhead and storage of writing out pages
>> uneccessarily.
>>
>> Reviewed-by: David Hildenbrand <david@redhat.com>
>> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  mm/vmscan.c | 9 +++++----
>>  1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 00adaf1cb2c3..293120fe54f3 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1223,11 +1223,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>                                         if (!can_split_folio(folio, NULL))
>>                                                 goto activate_locked;
>>                                         /*
>> -                                        * Split folios without a PMD map right
>> -                                        * away. Chances are some or all of the
>> -                                        * tail pages can be freed without IO.
>> +                                        * Split partially mapped folios right
>> +                                        * away. We can free the unmapped pages
>> +                                        * without IO.
>>                                          */
>> -                                       if (!folio_entire_mapcount(folio) &&
>> +                                       if (data_race(!list_empty(
>> +                                               &folio->_deferred_list)) &&
>>                                             split_folio_to_list(folio,
>>                                                                 folio_list))
>>                                                 goto activate_locked;
> 
> Hi Ryan,
> 
> Sorry for bringing up another minor issue at this late stage.

No problem - I'd rather take a bit longer and get it right, rather than rush it
and get it wrong!

> 
> During the debugging of thp counter patch v2, I noticed the discrepancy between
> THP_SWPOUT_FALLBACK and THP_SWPOUT.
> 
> Should we make adjustments to the counter?

Yes, agreed; we want to be consistent here with all the other existing THP
counters; they only refer to PMD-sized THP. I'll make the change for the next
version.

I guess we will eventually want equivalent counters for per-size mTHP using the
framework you are adding.

> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 293120fe54f3..d7856603f689 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1241,8 +1241,10 @@ static unsigned int shrink_folio_list(struct
> list_head *folio_list,
>                                                                 folio_list))
>                                                 goto activate_locked;
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -
> count_memcg_folio_events(folio, THP_SWPOUT_FALLBACK, 1);
> -                                       count_vm_event(THP_SWPOUT_FALLBACK);
> +                                       if (folio_test_pmd_mappable(folio)) {
> +
> count_memcg_folio_events(folio, THP_SWPOUT_FALLBACK, 1);
> +
> count_vm_event(THP_SWPOUT_FALLBACK);
> +                                       }
>  #endif
>                                         if (!add_to_swap(folio))
>                                                 goto activate_locked_split;
> 
> 
> Because THP_SWPOUT is only for pmd:
> 
> static inline void count_swpout_vm_event(struct folio *folio)
> {
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>         if (unlikely(folio_test_pmd_mappable(folio))) {
>                 count_memcg_folio_events(folio, THP_SWPOUT, 1);
>                 count_vm_event(THP_SWPOUT);
>         }
> #endif
>         count_vm_events(PSWPOUT, folio_nr_pages(folio));
> }
> 
> I can provide per-order counters for this in my THP counter patch.
> 
>> --
>> 2.25.1
>>
> 
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 5/6] mm: vmscan: Avoid split during shrink_folio_list()
  2024-04-02 13:10     ` Ryan Roberts
@ 2024-04-02 13:22       ` Lance Yang
  2024-04-02 13:22       ` Ryan Roberts
  2024-04-05  4:06       ` Barry Song
  2 siblings, 0 replies; 35+ messages in thread
From: Lance Yang @ 2024-04-02 13:22 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Barry Song, Andrew Morton, David Hildenbrand, Matthew Wilcox,
	Huang Ying, Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko,
	Kefeng Wang, Chris Li, linux-mm, linux-kernel, Barry Song

Baolin Wang's patch[1] has avoided confusion with PMD mapped THP
related statistics.

So, these three counters (THP_SPLIT_PAGE, THP_SPLIT_PAGE_FAILED,
and THP_DEFERRED_SPLIT_PAGE) no longer include mTHP.

[1] https://lore.kernel.org/linux-mm/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com/

On Tue, Apr 2, 2024 at 9:10 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 28/03/2024 08:18, Barry Song wrote:
> > On Thu, Mar 28, 2024 at 3:45 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> Now that swap supports storing all mTHP sizes, avoid splitting large
> >> folios before swap-out. This benefits performance of the swap-out path
> >> by eliding split_folio_to_list(), which is expensive, and also sets us
> >> up for swapping in large folios in a future series.
> >>
> >> If the folio is partially mapped, we continue to split it since we want
> >> to avoid the extra IO overhead and storage of writing out pages
> >> uneccessarily.
> >>
> >> Reviewed-by: David Hildenbrand <david@redhat.com>
> >> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >> ---
> >>  mm/vmscan.c | 9 +++++----
> >>  1 file changed, 5 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 00adaf1cb2c3..293120fe54f3 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1223,11 +1223,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>                                         if (!can_split_folio(folio, NULL))
> >>                                                 goto activate_locked;
> >>                                         /*
> >> -                                        * Split folios without a PMD map right
> >> -                                        * away. Chances are some or all of the
> >> -                                        * tail pages can be freed without IO.
> >> +                                        * Split partially mapped folios right
> >> +                                        * away. We can free the unmapped pages
> >> +                                        * without IO.
> >>                                          */
> >> -                                       if (!folio_entire_mapcount(folio) &&
> >> +                                       if (data_race(!list_empty(
> >> +                                               &folio->_deferred_list)) &&
> >>                                             split_folio_to_list(folio,
> >>                                                                 folio_list))
> >>                                                 goto activate_locked;
> >
> > Hi Ryan,
> >
> > Sorry for bringing up another minor issue at this late stage.
>
> No problem - I'd rather take a bit longer and get it right, rather than rush it
> and get it wrong!
>
> >
> > During the debugging of thp counter patch v2, I noticed the discrepancy between
> > THP_SWPOUT_FALLBACK and THP_SWPOUT.
> >
> > Should we make adjustments to the counter?
>
> Yes, agreed; we want to be consistent here with all the other existing THP
> counters; they only refer to PMD-sized THP. I'll make the change for the next
> version.
>
> I guess we will eventually want equivalent counters for per-size mTHP using the
> framework you are adding.
>
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 293120fe54f3..d7856603f689 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1241,8 +1241,10 @@ static unsigned int shrink_folio_list(struct
> > list_head *folio_list,
> >                                                                 folio_list))
> >                                                 goto activate_locked;
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > -
> > count_memcg_folio_events(folio, THP_SWPOUT_FALLBACK, 1);
> > -                                       count_vm_event(THP_SWPOUT_FALLBACK);
> > +                                       if (folio_test_pmd_mappable(folio)) {
> > +
> > count_memcg_folio_events(folio, THP_SWPOUT_FALLBACK, 1);
> > +
> > count_vm_event(THP_SWPOUT_FALLBACK);
> > +                                       }
> >  #endif
> >                                         if (!add_to_swap(folio))
> >                                                 goto activate_locked_split;
> >
> >
> > Because THP_SWPOUT is only for pmd:
> >
> > static inline void count_swpout_vm_event(struct folio *folio)
> > {
> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >         if (unlikely(folio_test_pmd_mappable(folio))) {
> >                 count_memcg_folio_events(folio, THP_SWPOUT, 1);
> >                 count_vm_event(THP_SWPOUT);
> >         }
> > #endif
> >         count_vm_events(PSWPOUT, folio_nr_pages(folio));
> > }
> >
> > I can provide per-order counters for this in my THP counter patch.
> >
> >> --
> >> 2.25.1
> >>
> >
> > Thanks
> > Barry
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 5/6] mm: vmscan: Avoid split during shrink_folio_list()
  2024-04-02 13:10     ` Ryan Roberts
  2024-04-02 13:22       ` Lance Yang
@ 2024-04-02 13:22       ` Ryan Roberts
  2024-04-02 22:54         ` Barry Song
  2024-04-05  4:06       ` Barry Song
  2 siblings, 1 reply; 35+ messages in thread
From: Ryan Roberts @ 2024-04-02 13:22 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Chris Li, Lance Yang, linux-mm, linux-kernel, Barry Song

On 02/04/2024 14:10, Ryan Roberts wrote:
> On 28/03/2024 08:18, Barry Song wrote:
>> On Thu, Mar 28, 2024 at 3:45 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> Now that swap supports storing all mTHP sizes, avoid splitting large
>>> folios before swap-out. This benefits performance of the swap-out path
>>> by eliding split_folio_to_list(), which is expensive, and also sets us
>>> up for swapping in large folios in a future series.
>>>
>>> If the folio is partially mapped, we continue to split it since we want
>>> to avoid the extra IO overhead and storage of writing out pages
>>> uneccessarily.
>>>
>>> Reviewed-by: David Hildenbrand <david@redhat.com>
>>> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>  mm/vmscan.c | 9 +++++----
>>>  1 file changed, 5 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 00adaf1cb2c3..293120fe54f3 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1223,11 +1223,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>                                         if (!can_split_folio(folio, NULL))
>>>                                                 goto activate_locked;
>>>                                         /*
>>> -                                        * Split folios without a PMD map right
>>> -                                        * away. Chances are some or all of the
>>> -                                        * tail pages can be freed without IO.
>>> +                                        * Split partially mapped folios right
>>> +                                        * away. We can free the unmapped pages
>>> +                                        * without IO.
>>>                                          */
>>> -                                       if (!folio_entire_mapcount(folio) &&
>>> +                                       if (data_race(!list_empty(
>>> +                                               &folio->_deferred_list)) &&
>>>                                             split_folio_to_list(folio,
>>>                                                                 folio_list))
>>>                                                 goto activate_locked;
>>
>> Hi Ryan,
>>
>> Sorry for bringing up another minor issue at this late stage.
> 
> No problem - I'd rather take a bit longer and get it right, rather than rush it
> and get it wrong!
> 
>>
>> During the debugging of thp counter patch v2, I noticed the discrepancy between
>> THP_SWPOUT_FALLBACK and THP_SWPOUT.
>>
>> Should we make adjustments to the counter?
> 
> Yes, agreed; we want to be consistent here with all the other existing THP
> counters; they only refer to PMD-sized THP. I'll make the change for the next
> version.
> 
> I guess we will eventually want equivalent counters for per-size mTHP using the
> framework you are adding.
> 
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 293120fe54f3..d7856603f689 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1241,8 +1241,10 @@ static unsigned int shrink_folio_list(struct
>> list_head *folio_list,
>>                                                                 folio_list))
>>                                                 goto activate_locked;
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> -
>> count_memcg_folio_events(folio, THP_SWPOUT_FALLBACK, 1);
>> -                                       count_vm_event(THP_SWPOUT_FALLBACK);
>> +                                       if (folio_test_pmd_mappable(folio)) {

This doesn't quite work because we have already split the folio here, so this
will always return false. I've changed it to:

if (nr_pages >= HPAGE_PMD_NR) {


>> +
>> count_memcg_folio_events(folio, THP_SWPOUT_FALLBACK, 1);
>> +
>> count_vm_event(THP_SWPOUT_FALLBACK);
>> +                                       }
>>  #endif
>>                                         if (!add_to_swap(folio))
>>                                                 goto activate_locked_split;
>>
>>
>> Because THP_SWPOUT is only for pmd:
>>
>> static inline void count_swpout_vm_event(struct folio *folio)
>> {
>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>         if (unlikely(folio_test_pmd_mappable(folio))) {
>>                 count_memcg_folio_events(folio, THP_SWPOUT, 1);
>>                 count_vm_event(THP_SWPOUT);
>>         }
>> #endif
>>         count_vm_events(PSWPOUT, folio_nr_pages(folio));
>> }
>>
>> I can provide per-order counters for this in my THP counter patch.
>>
>>> --
>>> 2.25.1
>>>
>>
>> Thanks
>> Barry
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 5/6] mm: vmscan: Avoid split during shrink_folio_list()
  2024-04-02 13:22       ` Ryan Roberts
@ 2024-04-02 22:54         ` Barry Song
  0 siblings, 0 replies; 35+ messages in thread
From: Barry Song @ 2024-04-02 22:54 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Chris Li, Lance Yang, linux-mm, linux-kernel, Barry Song

On Wed, Apr 3, 2024 at 2:22 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 02/04/2024 14:10, Ryan Roberts wrote:
> > On 28/03/2024 08:18, Barry Song wrote:
> >> On Thu, Mar 28, 2024 at 3:45 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>
> >>> Now that swap supports storing all mTHP sizes, avoid splitting large
> >>> folios before swap-out. This benefits performance of the swap-out path
> >>> by eliding split_folio_to_list(), which is expensive, and also sets us
> >>> up for swapping in large folios in a future series.
> >>>
> >>> If the folio is partially mapped, we continue to split it since we want
> >>> to avoid the extra IO overhead and storage of writing out pages
> >>> uneccessarily.
> >>>
> >>> Reviewed-by: David Hildenbrand <david@redhat.com>
> >>> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
> >>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >>> ---
> >>>  mm/vmscan.c | 9 +++++----
> >>>  1 file changed, 5 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>> index 00adaf1cb2c3..293120fe54f3 100644
> >>> --- a/mm/vmscan.c
> >>> +++ b/mm/vmscan.c
> >>> @@ -1223,11 +1223,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>                                         if (!can_split_folio(folio, NULL))
> >>>                                                 goto activate_locked;
> >>>                                         /*
> >>> -                                        * Split folios without a PMD map right
> >>> -                                        * away. Chances are some or all of the
> >>> -                                        * tail pages can be freed without IO.
> >>> +                                        * Split partially mapped folios right
> >>> +                                        * away. We can free the unmapped pages
> >>> +                                        * without IO.
> >>>                                          */
> >>> -                                       if (!folio_entire_mapcount(folio) &&
> >>> +                                       if (data_race(!list_empty(
> >>> +                                               &folio->_deferred_list)) &&
> >>>                                             split_folio_to_list(folio,
> >>>                                                                 folio_list))
> >>>                                                 goto activate_locked;
> >>
> >> Hi Ryan,
> >>
> >> Sorry for bringing up another minor issue at this late stage.
> >
> > No problem - I'd rather take a bit longer and get it right, rather than rush it
> > and get it wrong!
> >
> >>
> >> During the debugging of thp counter patch v2, I noticed the discrepancy between
> >> THP_SWPOUT_FALLBACK and THP_SWPOUT.
> >>
> >> Should we make adjustments to the counter?
> >
> > Yes, agreed; we want to be consistent here with all the other existing THP
> > counters; they only refer to PMD-sized THP. I'll make the change for the next
> > version.
> >
> > I guess we will eventually want equivalent counters for per-size mTHP using the
> > framework you are adding.
> >
> >>
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 293120fe54f3..d7856603f689 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1241,8 +1241,10 @@ static unsigned int shrink_folio_list(struct
> >> list_head *folio_list,
> >>                                                                 folio_list))
> >>                                                 goto activate_locked;
> >>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >> -
> >> count_memcg_folio_events(folio, THP_SWPOUT_FALLBACK, 1);
> >> -                                       count_vm_event(THP_SWPOUT_FALLBACK);
> >> +                                       if (folio_test_pmd_mappable(folio)) {
>
> This doesn't quite work because we have already split the folio here, so this
> will always return false. I've changed it to:
>
> if (nr_pages >= HPAGE_PMD_NR) {

make sense to me.

>
>
> >> +
> >> count_memcg_folio_events(folio, THP_SWPOUT_FALLBACK, 1);
> >> +
> >> count_vm_event(THP_SWPOUT_FALLBACK);
> >> +                                       }
> >>  #endif
> >>                                         if (!add_to_swap(folio))
> >>                                                 goto activate_locked_split;
> >>
> >>
> >> Because THP_SWPOUT is only for pmd:
> >>
> >> static inline void count_swpout_vm_event(struct folio *folio)
> >> {
> >> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>         if (unlikely(folio_test_pmd_mappable(folio))) {
> >>                 count_memcg_folio_events(folio, THP_SWPOUT, 1);
> >>                 count_vm_event(THP_SWPOUT);
> >>         }
> >> #endif
> >>         count_vm_events(PSWPOUT, folio_nr_pages(folio));
> >> }
> >>
> >> I can provide per-order counters for this in my THP counter patch.
> >>
> >>> --
> >>> 2.25.1
> >>>
> >>

Thanks
Barry

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
  2024-03-27 14:45 ` [PATCH v5 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache() Ryan Roberts
  2024-04-01  5:52   ` Huang, Ying
@ 2024-04-03  0:30   ` Zi Yan
  2024-04-03  0:47     ` Lance Yang
  2024-04-03  7:21     ` Ryan Roberts
  1 sibling, 2 replies; 35+ messages in thread
From: Zi Yan @ 2024-04-03  0:30 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Barry Song, Chris Li, Lance Yang, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3204 bytes --]

On 27 Mar 2024, at 10:45, Ryan Roberts wrote:

> Now that we no longer have a convenient flag in the cluster to determine
> if a folio is large, free_swap_and_cache() will take a reference and
> lock a large folio much more often, which could lead to contention and
> (e.g.) failure to split large folios, etc.
>
> Let's solve that problem by batch freeing swap and cache with a new
> function, free_swap_and_cache_nr(), to free a contiguous range of swap
> entries together. This allows us to first drop a reference to each swap
> slot before we try to release the cache folio. This means we only try to
> release the folio once, only taking the reference and lock once - much
> better than the previous 512 times for the 2M THP case.
>
> Contiguous swap entries are gathered in zap_pte_range() and
> madvise_free_pte_range() in a similar way to how present ptes are
> already gathered in zap_pte_range().
>
> While we are at it, let's simplify by converting the return type of both
> functions to void. The return value was used only by zap_pte_range() to
> print a bad pte, and was ignored by everyone else, so the extra
> reporting wasn't exactly guaranteed. We will still get the warning with
> most of the information from get_swap_device(). With the batch version,
> we wouldn't know which pte was bad anyway so could print the wrong one.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  include/linux/pgtable.h | 28 +++++++++++++++
>  include/linux/swap.h    | 12 +++++--
>  mm/internal.h           | 48 +++++++++++++++++++++++++
>  mm/madvise.c            | 12 ++++---
>  mm/memory.c             | 13 +++----
>  mm/swapfile.c           | 78 ++++++++++++++++++++++++++++++-----------
>  6 files changed, 157 insertions(+), 34 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 09c85c7bf9c2..8185939df1e8 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -708,6 +708,34 @@ static inline void pte_clear_not_present_full(struct mm_struct *mm,
>  }
>  #endif
>
> +#ifndef clear_not_present_full_ptes
> +/**
> + * clear_not_present_full_ptes - Clear consecutive not present PTEs.
> + * @mm: Address space the ptes represent.
> + * @addr: Address of the first pte.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to clear.
> + * @full: Whether we are clearing a full mm.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over pte_clear_not_present_full().
> + *
> + * Context: The caller holds the page table lock.  The PTEs are all not present.
> + * The PTEs are all in the same PMD.
> + */
> +static inline void clear_not_present_full_ptes(struct mm_struct *mm,
> +		unsigned long addr, pte_t *ptep, unsigned int nr, int full)
> +{
> +	for (;;) {
> +		pte_clear_not_present_full(mm, addr, ptep, full);
> +		if (--nr == 0)
> +			break;
> +		ptep++;
> +		addr += PAGE_SIZE;
> +	}
> +}
> +#endif
> +

Would the code below be better?

for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE)
	pte_clear_not_present_full(mm, addr, ptep, full);

--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
  2024-04-03  0:30   ` Zi Yan
@ 2024-04-03  0:47     ` Lance Yang
  2024-04-03  7:21     ` Ryan Roberts
  1 sibling, 0 replies; 35+ messages in thread
From: Lance Yang @ 2024-04-03  0:47 UTC (permalink / raw)
  To: Zi Yan, Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Barry Song, Chris Li, Lance Yang, linux-mm, linux-kernel

April 3, 2024 at 8:30 AM, "Zi Yan" <ziy@nvidia.com> wrote:



> 
> On 27 Mar 2024, at 10:45, Ryan Roberts wrote:
> 
> > 
> > Now that we no longer have a convenient flag in the cluster to determine
> > 
> >  if a folio is large, free_swap_and_cache() will take a reference and
> > 
> >  lock a large folio much more often, which could lead to contention and
> > 
> >  (e.g.) failure to split large folios, etc.
> > 
> >  Let's solve that problem by batch freeing swap and cache with a new
> > 
> >  function, free_swap_and_cache_nr(), to free a contiguous range of swap
> > 
> >  entries together. This allows us to first drop a reference to each swap
> > 
> >  slot before we try to release the cache folio. This means we only try to
> > 
> >  release the folio once, only taking the reference and lock once - much
> > 
> >  better than the previous 512 times for the 2M THP case.
> > 
> >  Contiguous swap entries are gathered in zap_pte_range() and
> > 
> >  madvise_free_pte_range() in a similar way to how present ptes are
> > 
> >  already gathered in zap_pte_range().
> > 
> >  While we are at it, let's simplify by converting the return type of both
> > 
> >  functions to void. The return value was used only by zap_pte_range() to
> > 
> >  print a bad pte, and was ignored by everyone else, so the extra
> > 
> >  reporting wasn't exactly guaranteed. We will still get the warning with
> > 
> >  most of the information from get_swap_device(). With the batch version,
> > 
> >  we wouldn't know which pte was bad anyway so could print the wrong one.
> > 
> >  Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> > 
> >  ---
> > 
> >  include/linux/pgtable.h | 28 +++++++++++++++
> > 
> >  include/linux/swap.h | 12 +++++--
> > 
> >  mm/internal.h | 48 +++++++++++++++++++++++++
> > 
> >  mm/madvise.c | 12 ++++---
> > 
> >  mm/memory.c | 13 +++----
> > 
> >  mm/swapfile.c | 78 ++++++++++++++++++++++++++++++-----------
> > 
> >  6 files changed, 157 insertions(+), 34 deletions(-)
> > 
> >  diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > 
> >  index 09c85c7bf9c2..8185939df1e8 100644
> > 
> >  --- a/include/linux/pgtable.h
> > 
> >  +++ b/include/linux/pgtable.h
> > 
> >  @@ -708,6 +708,34 @@ static inline void pte_clear_not_present_full(struct mm_struct *mm,
> > 
> >  }
> > 
> >  #endif
> > 
> >  +#ifndef clear_not_present_full_ptes
> > 
> >  +/**
> > 
> >  + * clear_not_present_full_ptes - Clear consecutive not present PTEs.
> > 
> >  + * @mm: Address space the ptes represent.
> > 
> >  + * @addr: Address of the first pte.
> > 
> >  + * @ptep: Page table pointer for the first entry.
> > 
> >  + * @nr: Number of entries to clear.
> > 
> >  + * @full: Whether we are clearing a full mm.
> > 
> >  + *
> > 
> >  + * May be overridden by the architecture; otherwise, implemented as a simple
> > 
> >  + * loop over pte_clear_not_present_full().
> > 
> >  + *
> > 
> >  + * Context: The caller holds the page table lock. The PTEs are all not present.
> > 
> >  + * The PTEs are all in the same PMD.
> > 
> >  + */
> > 
> >  +static inline void clear_not_present_full_ptes(struct mm_struct *mm,
> > 
> >  + unsigned long addr, pte_t *ptep, unsigned int nr, int full)
> > 
> >  +{
> > 
> >  + for (;;) {
> > 
> >  + pte_clear_not_present_full(mm, addr, ptep, full);
> > 
> >  + if (--nr == 0)
> > 
> >  + break;
> > 
> >  + ptep++;
> > 
> >  + addr += PAGE_SIZE;
> > 
> >  + }
> > 
> >  +}
> > 
> >  +#endif
> > 
> >  +
> > 
> 
> Would the code below be better?
> 
> for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE)
> 

FWIW

for (; nr-- > 0; ptep++, addr += PAGE_SIZE)
  pte_clear_not_present_full(mm, addr, ptep, full);

Thanks,
Lance


>  pte_clear_not_present_full(mm, addr, ptep, full);
> 
> --
> 
> Best Regards,
> 
> Yan, Zi
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 4/6] mm: swap: Allow storage of all mTHP orders
  2024-04-02 11:18     ` Ryan Roberts
@ 2024-04-03  3:07       ` Huang, Ying
  2024-04-03  7:48         ` Ryan Roberts
  0 siblings, 1 reply; 35+ messages in thread
From: Huang, Ying @ 2024-04-03  3:07 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, Barry Song,
	Chris Li, Lance Yang, linux-mm, linux-kernel

Ryan Roberts <ryan.roberts@arm.com> writes:

> On 01/04/2024 04:15, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> Multi-size THP enables performance improvements by allocating large,
>>> pte-mapped folios for anonymous memory. However I've observed that on an
>>> arm64 system running a parallel workload (e.g. kernel compilation)
>>> across many cores, under high memory pressure, the speed regresses. This
>>> is due to bottlenecking on the increased number of TLBIs added due to
>>> all the extra folio splitting when the large folios are swapped out.
>>>
>>> Therefore, solve this regression by adding support for swapping out mTHP
>>> without needing to split the folio, just like is already done for
>>> PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled,
>>> and when the swap backing store is a non-rotating block device. These
>>> are the same constraints as for the existing PMD-sized THP swap-out
>>> support.
>>>
>>> Note that no attempt is made to swap-in (m)THP here - this is still done
>>> page-by-page, like for PMD-sized THP. But swapping-out mTHP is a
>>> prerequisite for swapping-in mTHP.
>>>
>>> The main change here is to improve the swap entry allocator so that it
>>> can allocate any power-of-2 number of contiguous entries between [1, (1
>>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>>> order and allocating sequentially from it until the cluster is full.
>>> This ensures that we don't need to search the map and we get no
>>> fragmentation due to alignment padding for different orders in the
>>> cluster. If there is no current cluster for a given order, we attempt to
>>> allocate a free cluster from the list. If there are no free clusters, we
>>> fail the allocation and the caller can fall back to splitting the folio
>>> and allocates individual entries (as per existing PMD-sized THP
>>> fallback).
>>>
>>> The per-order current clusters are maintained per-cpu using the existing
>>> infrastructure. This is done to avoid interleving pages from different
>>> tasks, which would prevent IO being batched. This is already done for
>>> the order-0 allocations so we follow the same pattern.
>>>
>>> As is done for order-0 per-cpu clusters, the scanner now can steal
>>> order-0 entries from any per-cpu-per-order reserved cluster. This
>>> ensures that when the swap file is getting full, space doesn't get tied
>>> up in the per-cpu reserves.
>>>
>>> This change only modifies swap to be able to accept any order mTHP. It
>>> doesn't change the callers to elide doing the actual split. That will be
>>> done in separate changes.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>  include/linux/swap.h |  10 ++-
>>>  mm/swap_slots.c      |   6 +-
>>>  mm/swapfile.c        | 175 ++++++++++++++++++++++++-------------------
>>>  3 files changed, 109 insertions(+), 82 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 5e1e4f5bf0cb..11c53692f65f 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -268,13 +268,19 @@ struct swap_cluster_info {
>>>   */
>>>  #define SWAP_NEXT_INVALID	0
>>>  
>>> +#ifdef CONFIG_THP_SWAP
>>> +#define SWAP_NR_ORDERS		(PMD_ORDER + 1)
>>> +#else
>>> +#define SWAP_NR_ORDERS		1
>>> +#endif
>>> +
>>>  /*
>>>   * We assign a cluster to each CPU, so each CPU can allocate swap entry from
>>>   * its own cluster and swapout sequentially. The purpose is to optimize swapout
>>>   * throughput.
>>>   */
>>>  struct percpu_cluster {
>>> -	unsigned int next; /* Likely next allocation offset */
>>> +	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
>>>  };
>>>  
>>>  struct swap_cluster_list {
>>> @@ -471,7 +477,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio);
>>>  bool folio_free_swap(struct folio *folio);
>>>  void put_swap_folio(struct folio *folio, swp_entry_t entry);
>>>  extern swp_entry_t get_swap_page_of_type(int);
>>> -extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size);
>>> +extern int get_swap_pages(int n, swp_entry_t swp_entries[], int order);
>>>  extern int add_swap_count_continuation(swp_entry_t, gfp_t);
>>>  extern void swap_shmem_alloc(swp_entry_t);
>>>  extern int swap_duplicate(swp_entry_t);
>>> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
>>> index 53abeaf1371d..13ab3b771409 100644
>>> --- a/mm/swap_slots.c
>>> +++ b/mm/swap_slots.c
>>> @@ -264,7 +264,7 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
>>>  	cache->cur = 0;
>>>  	if (swap_slot_cache_active)
>>>  		cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE,
>>> -					   cache->slots, 1);
>>> +					   cache->slots, 0);
>>>  
>>>  	return cache->nr;
>>>  }
>>> @@ -311,7 +311,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
>>>  
>>>  	if (folio_test_large(folio)) {
>>>  		if (IS_ENABLED(CONFIG_THP_SWAP))
>>> -			get_swap_pages(1, &entry, folio_nr_pages(folio));
>>> +			get_swap_pages(1, &entry, folio_order(folio));
>>>  		goto out;
>>>  	}
>>>  
>>> @@ -343,7 +343,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
>>>  			goto out;
>>>  	}
>>>  
>>> -	get_swap_pages(1, &entry, 1);
>>> +	get_swap_pages(1, &entry, 0);
>>>  out:
>>>  	if (mem_cgroup_try_charge_swap(folio, entry)) {
>>>  		put_swap_folio(folio, entry);
>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>> index 1393966b77af..d56cdc547a06 100644
>>> --- a/mm/swapfile.c
>>> +++ b/mm/swapfile.c
>>> @@ -278,15 +278,15 @@ static void discard_swap_cluster(struct swap_info_struct *si,
>>>  #ifdef CONFIG_THP_SWAP
>>>  #define SWAPFILE_CLUSTER	HPAGE_PMD_NR
>>>  
>>> -#define swap_entry_size(size)	(size)
>>> +#define swap_entry_order(order)	(order)
>>>  #else
>>>  #define SWAPFILE_CLUSTER	256
>>>  
>>>  /*
>>> - * Define swap_entry_size() as constant to let compiler to optimize
>>> + * Define swap_entry_order() as constant to let compiler to optimize
>>>   * out some code if !CONFIG_THP_SWAP
>>>   */
>>> -#define swap_entry_size(size)	1
>>> +#define swap_entry_order(order)	0
>>>  #endif
>>>  #define LATENCY_LIMIT		256
>>>  
>>> @@ -551,10 +551,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>>>  
>>>  /*
>>>   * The cluster corresponding to page_nr will be used. The cluster will be
>>> - * removed from free cluster list and its usage counter will be increased.
>>> + * removed from free cluster list and its usage counter will be increased by
>>> + * count.
>>>   */
>>> -static void inc_cluster_info_page(struct swap_info_struct *p,
>>> -	struct swap_cluster_info *cluster_info, unsigned long page_nr)
>>> +static void add_cluster_info_page(struct swap_info_struct *p,
>>> +	struct swap_cluster_info *cluster_info, unsigned long page_nr,
>>> +	unsigned long count)
>>>  {
>>>  	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>>>  
>>> @@ -563,9 +565,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
>>>  	if (cluster_is_free(&cluster_info[idx]))
>>>  		alloc_cluster(p, idx);
>>>  
>>> -	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
>>> +	VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
>>>  	cluster_set_count(&cluster_info[idx],
>>> -		cluster_count(&cluster_info[idx]) + 1);
>>> +		cluster_count(&cluster_info[idx]) + count);
>>> +}
>>> +
>>> +/*
>>> + * The cluster corresponding to page_nr will be used. The cluster will be
>>> + * removed from free cluster list and its usage counter will be increased by 1.
>>> + */
>>> +static void inc_cluster_info_page(struct swap_info_struct *p,
>>> +	struct swap_cluster_info *cluster_info, unsigned long page_nr)
>>> +{
>>> +	add_cluster_info_page(p, cluster_info, page_nr, 1);
>>>  }
>>>  
>>>  /*
>>> @@ -595,7 +607,7 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
>>>   */
>>>  static bool
>>>  scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>> -	unsigned long offset)
>>> +	unsigned long offset, int order)
>>>  {
>>>  	struct percpu_cluster *percpu_cluster;
>>>  	bool conflict;
>>> @@ -609,24 +621,39 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>>  		return false;
>>>  
>>>  	percpu_cluster = this_cpu_ptr(si->percpu_cluster);
>>> -	percpu_cluster->next = SWAP_NEXT_INVALID;
>>> +	percpu_cluster->next[order] = SWAP_NEXT_INVALID;
>>> +	return true;
>>> +}
>>> +
>>> +static inline bool swap_range_empty(char *swap_map, unsigned int start,
>>> +				    unsigned int nr_pages)
>>> +{
>>> +	unsigned int i;
>>> +
>>> +	for (i = 0; i < nr_pages; i++) {
>>> +		if (swap_map[start + i])
>>> +			return false;
>>> +	}
>>> +
>>>  	return true;
>>>  }
>>>  
>>>  /*
>>> - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
>>> - * might involve allocating a new cluster for current CPU too.
>>> + * Try to get swap entries with specified order from current cpu's swap entry
>>> + * pool (a cluster). This might involve allocating a new cluster for current CPU
>>> + * too.
>>>   */
>>>  static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>> -	unsigned long *offset, unsigned long *scan_base)
>>> +	unsigned long *offset, unsigned long *scan_base, int order)
>>>  {
>>> +	unsigned int nr_pages = 1 << order;
>> 
>> Use swap_entry_order()?
>
> I had previously convinced myself that the compiler should be smart enough to
> propagate the constant from
>
> get_swap_pages -> scan_swap_map_slots -> scan_swap_map_try_ssd_cluster

Do some experiments via calling function with constants and check the
compiled code.  It seems that "interprocedural constant propagation" in
compiler can optimize the code at least if the callee is "static".

> But I'll add the explicit macro for the next version, as you suggest.

So, I will leave it to you to decide whether to do that.

--
Best Regards,
Huang, Ying

[snip]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
  2024-04-02 11:15     ` Ryan Roberts
@ 2024-04-03  3:57       ` Huang, Ying
  2024-04-03  7:16         ` Ryan Roberts
  0 siblings, 1 reply; 35+ messages in thread
From: Huang, Ying @ 2024-04-03  3:57 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, Barry Song,
	Chris Li, Lance Yang, linux-mm, linux-kernel

Ryan Roberts <ryan.roberts@arm.com> writes:

> On 01/04/2024 06:52, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> Now that we no longer have a convenient flag in the cluster to determine
>>> if a folio is large, free_swap_and_cache() will take a reference and
>>> lock a large folio much more often, which could lead to contention and
>>> (e.g.) failure to split large folios, etc.
>>>
>>> Let's solve that problem by batch freeing swap and cache with a new
>>> function, free_swap_and_cache_nr(), to free a contiguous range of swap
>>> entries together. This allows us to first drop a reference to each swap
>>> slot before we try to release the cache folio. This means we only try to
>>> release the folio once, only taking the reference and lock once - much
>>> better than the previous 512 times for the 2M THP case.
>>>
>>> Contiguous swap entries are gathered in zap_pte_range() and
>>> madvise_free_pte_range() in a similar way to how present ptes are
>>> already gathered in zap_pte_range().
>>>
>>> While we are at it, let's simplify by converting the return type of both
>>> functions to void. The return value was used only by zap_pte_range() to
>>> print a bad pte, and was ignored by everyone else, so the extra
>>> reporting wasn't exactly guaranteed. We will still get the warning with
>>> most of the information from get_swap_device(). With the batch version,
>>> we wouldn't know which pte was bad anyway so could print the wrong one.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>  include/linux/pgtable.h | 28 +++++++++++++++
>>>  include/linux/swap.h    | 12 +++++--
>>>  mm/internal.h           | 48 +++++++++++++++++++++++++
>>>  mm/madvise.c            | 12 ++++---
>>>  mm/memory.c             | 13 +++----
>>>  mm/swapfile.c           | 78 ++++++++++++++++++++++++++++++-----------
>>>  6 files changed, 157 insertions(+), 34 deletions(-)
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index 09c85c7bf9c2..8185939df1e8 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -708,6 +708,34 @@ static inline void pte_clear_not_present_full(struct mm_struct *mm,
>>>  }
>>>  #endif
>>>  
>>> +#ifndef clear_not_present_full_ptes
>>> +/**
>>> + * clear_not_present_full_ptes - Clear consecutive not present PTEs.
>>> + * @mm: Address space the ptes represent.
>>> + * @addr: Address of the first pte.
>>> + * @ptep: Page table pointer for the first entry.
>>> + * @nr: Number of entries to clear.
>>> + * @full: Whether we are clearing a full mm.
>>> + *
>>> + * May be overridden by the architecture; otherwise, implemented as a simple
>>> + * loop over pte_clear_not_present_full().
>>> + *
>>> + * Context: The caller holds the page table lock.  The PTEs are all not present.
>>> + * The PTEs are all in the same PMD.
>>> + */
>>> +static inline void clear_not_present_full_ptes(struct mm_struct *mm,
>>> +		unsigned long addr, pte_t *ptep, unsigned int nr, int full)
>>> +{
>>> +	for (;;) {
>>> +		pte_clear_not_present_full(mm, addr, ptep, full);
>>> +		if (--nr == 0)
>>> +			break;
>>> +		ptep++;
>>> +		addr += PAGE_SIZE;
>>> +	}
>>> +}
>>> +#endif
>>> +
>>>  #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
>>>  extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
>>>  			      unsigned long address,
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index f6f78198f000..5737236dc3ce 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -471,7 +471,7 @@ extern int swap_duplicate(swp_entry_t);
>>>  extern int swapcache_prepare(swp_entry_t);
>>>  extern void swap_free(swp_entry_t);
>>>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>>> -extern int free_swap_and_cache(swp_entry_t);
>>> +extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>>>  int swap_type_of(dev_t device, sector_t offset);
>>>  int find_first_swap(dev_t *device);
>>>  extern unsigned int count_swap_pages(int, int);
>>> @@ -520,8 +520,9 @@ static inline void put_swap_device(struct swap_info_struct *si)
>>>  #define free_pages_and_swap_cache(pages, nr) \
>>>  	release_pages((pages), (nr));
>>>  
>>> -/* used to sanity check ptes in zap_pte_range when CONFIG_SWAP=0 */
>>> -#define free_swap_and_cache(e) is_pfn_swap_entry(e)
>>> +static inline void free_swap_and_cache_nr(swp_entry_t entry, int nr)
>>> +{
>>> +}
>>>  
>>>  static inline void free_swap_cache(struct folio *folio)
>>>  {
>>> @@ -589,6 +590,11 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
>>>  }
>>>  #endif /* CONFIG_SWAP */
>>>  
>>> +static inline void free_swap_and_cache(swp_entry_t entry)
>>> +{
>>> +	free_swap_and_cache_nr(entry, 1);
>>> +}
>>> +
>>>  #ifdef CONFIG_MEMCG
>>>  static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
>>>  {
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index 8e11f7b2da21..eadb79c3a357 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -11,6 +11,8 @@
>>>  #include <linux/mm.h>
>>>  #include <linux/pagemap.h>
>>>  #include <linux/rmap.h>
>>> +#include <linux/swap.h>
>>> +#include <linux/swapops.h>
>>>  #include <linux/tracepoint-defs.h>
>>>  
>>>  struct folio_batch;
>>> @@ -189,6 +191,52 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>>  
>>>  	return min(ptep - start_ptep, max_nr);
>>>  }
>>> +
>>> +/**
>>> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
>>> + * @start_ptep: Page table pointer for the first entry.
>>> + * @max_nr: The maximum number of table entries to consider.
>>> + * @entry: Swap entry recovered from the first table entry.
>>> + *
>>> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
>>> + * containing swap entries all with consecutive offsets and targeting the same
>>> + * swap type.
>>> + *
>>> + * max_nr must be at least one and must be limited by the caller so scanning
>>> + * cannot exceed a single page table.
>>> + *
>>> + * Return: the number of table entries in the batch.
>>> + */
>>> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr,
>>> +				 swp_entry_t entry)
>>> +{
>>> +	const pte_t *end_ptep = start_ptep + max_nr;
>>> +	unsigned long expected_offset = swp_offset(entry) + 1;
>>> +	unsigned int expected_type = swp_type(entry);
>>> +	pte_t *ptep = start_ptep + 1;
>>> +
>>> +	VM_WARN_ON(max_nr < 1);
>>> +	VM_WARN_ON(non_swap_entry(entry));
>>> +
>>> +	while (ptep < end_ptep) {
>>> +		pte_t pte = ptep_get(ptep);
>>> +
>>> +		if (pte_none(pte) || pte_present(pte))
>>> +			break;
>>> +
>>> +		entry = pte_to_swp_entry(pte);
>>> +
>>> +		if (non_swap_entry(entry) ||
>>> +		    swp_type(entry) != expected_type ||
>>> +		    swp_offset(entry) != expected_offset)
>>> +			break;
>>> +
>>> +		expected_offset++;
>>> +		ptep++;
>>> +	}
>>> +
>>> +	return ptep - start_ptep;
>>> +}
>>>  #endif /* CONFIG_MMU */
>>>  
>>>  void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>> index 1f77a51baaac..070bedb4996e 100644
>>> --- a/mm/madvise.c
>>> +++ b/mm/madvise.c
>>> @@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>>>  	struct folio *folio;
>>>  	int nr_swap = 0;
>>>  	unsigned long next;
>>> +	int nr, max_nr;
>>>  
>>>  	next = pmd_addr_end(addr, end);
>>>  	if (pmd_trans_huge(*pmd))
>>> @@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>>>  		return 0;
>>>  	flush_tlb_batched_pending(mm);
>>>  	arch_enter_lazy_mmu_mode();
>>> -	for (; addr != end; pte++, addr += PAGE_SIZE) {
>>> +	for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
>>> +		nr = 1;
>>>  		ptent = ptep_get(pte);
>>>  
>>>  		if (pte_none(ptent))
>>> @@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>>>  
>>>  			entry = pte_to_swp_entry(ptent);
>>>  			if (!non_swap_entry(entry)) {
>>> -				nr_swap--;
>>> -				free_swap_and_cache(entry);
>>> -				pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>>> +				max_nr = (end - addr) / PAGE_SIZE;
>>> +				nr = swap_pte_batch(pte, max_nr, entry);
>>> +				nr_swap -= nr;
>>> +				free_swap_and_cache_nr(entry, nr);
>>> +				clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>>>  			} else if (is_hwpoison_entry(entry) ||
>>>  				   is_poisoned_swp_entry(entry)) {
>>>  				pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 36191a9c799c..9d844582ba38 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -1631,12 +1631,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>>  				folio_remove_rmap_pte(folio, page, vma);
>>>  			folio_put(folio);
>>>  		} else if (!non_swap_entry(entry)) {
>>> -			/* Genuine swap entry, hence a private anon page */
>>> +			max_nr = (end - addr) / PAGE_SIZE;
>>> +			nr = swap_pte_batch(pte, max_nr, entry);
>>> +			/* Genuine swap entries, hence a private anon pages */
>>>  			if (!should_zap_cows(details))
>>>  				continue;
>>> -			rss[MM_SWAPENTS]--;
>>> -			if (unlikely(!free_swap_and_cache(entry)))
>>> -				print_bad_pte(vma, addr, ptent, NULL);
>>> +			rss[MM_SWAPENTS] -= nr;
>>> +			free_swap_and_cache_nr(entry, nr);
>>>  		} else if (is_migration_entry(entry)) {
>>>  			folio = pfn_swap_entry_folio(entry);
>>>  			if (!should_zap_folio(details, folio))
>>> @@ -1659,8 +1660,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>>  			pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
>>>  			WARN_ON_ONCE(1);
>>>  		}
>>> -		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>>> -		zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
>>> +		clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>>> +		zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
>>>  	} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>>>  
>>>  	add_mm_rss_vec(mm, rss);
>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>> index 0d44ee2b4f9c..cedfc82d37e5 100644
>>> --- a/mm/swapfile.c
>>> +++ b/mm/swapfile.c
>>> @@ -130,7 +130,11 @@ static inline unsigned char swap_count(unsigned char ent)
>>>  /* Reclaim the swap entry if swap is getting full*/
>>>  #define TTRS_FULL		0x4
>>>  
>>> -/* returns 1 if swap entry is freed */
>>> +/*
>>> + * returns number of pages in the folio that backs the swap entry. If positive,
>>> + * the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no
>>> + * folio was associated with the swap entry.
>>> + */
>>>  static int __try_to_reclaim_swap(struct swap_info_struct *si,
>>>  				 unsigned long offset, unsigned long flags)
>>>  {
>>> @@ -155,6 +159,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
>>>  			ret = folio_free_swap(folio);
>>>  		folio_unlock(folio);
>>>  	}
>>> +	ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio);
>>>  	folio_put(folio);
>>>  	return ret;
>>>  }
>>> @@ -895,7 +900,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>  		swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
>>>  		spin_lock(&si->lock);
>>>  		/* entry was freed successfully, try to use this again */
>>> -		if (swap_was_freed)
>>> +		if (swap_was_freed > 0)
>>>  			goto checks;
>>>  		goto scan; /* check next one */
>>>  	}
>>> @@ -1572,32 +1577,63 @@ bool folio_free_swap(struct folio *folio)
>>>  	return true;
>>>  }
>>>  
>>> -/*
>>> - * Free the swap entry like above, but also try to
>>> - * free the page cache entry if it is the last user.
>>> - */
>>> -int free_swap_and_cache(swp_entry_t entry)
>>> +void free_swap_and_cache_nr(swp_entry_t entry, int nr)
>>>  {
>>> -	struct swap_info_struct *p;
>>> -	unsigned char count;
>>> +	unsigned long end = swp_offset(entry) + nr;
>>> +	unsigned int type = swp_type(entry);
>>> +	struct swap_info_struct *si;
>>> +	unsigned long offset;
>>>  
>>>  	if (non_swap_entry(entry))
>>> -		return 1;
>>> +		return;
>>>  
>>> -	p = get_swap_device(entry);
>>> -	if (p) {
>>> -		if (WARN_ON(data_race(!p->swap_map[swp_offset(entry)]))) {
>>> -			put_swap_device(p);
>>> -			return 0;
>>> -		}
>>> +	si = get_swap_device(entry);
>>> +	if (!si)
>>> +		return;
>>>  
>>> -		count = __swap_entry_free(p, entry);
>>> -		if (count == SWAP_HAS_CACHE)
>>> -			__try_to_reclaim_swap(p, swp_offset(entry),
>>> +	if (WARN_ON(end > si->max))
>>> +		goto out;
>>> +
>>> +	/*
>>> +	 * First free all entries in the range.
>>> +	 */
>>> +	for (offset = swp_offset(entry); offset < end; offset++) {
>>> +		if (!WARN_ON(data_race(!si->swap_map[offset])))
>>> +			__swap_entry_free(si, swp_entry(type, offset));
>> 
>> I think that it's better to check the return value of
>> __swap_entry_free() here.  When the return value != SWAP_HAS_CACHE, we
>> can try to reclaim all swap entries we have checked before, then restart
>> the check with the new start.
>
> What's the benefit of your proposed aproach? I only see a drawback: if there are
> large swap entries for which some pages have higher ref counts than others, we
> will end up trying to reclaim (and fail) multiple times per folio. Whereas with
> my current approach we only attempt reclaim once per folio.
>
> Do you see a specific bug with what I'm currently doing?

No.  Just want to find some opportunity to optimize.  With my proposal,
we can skip reclaim if "return value != SWAP_HAS_CACHE", for example, if
the folio has been removed from the swap cache (fully reclaimed).  But
this has some drawbacks too, as you pointed out above.

At least, we can check whether any return value of __swap_entry_free()
== SWAP_HAS_CACHE, and only try to reclaim if so.  This can optimize the
most common cases.

[snip]

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
  2024-04-03  3:57       ` Huang, Ying
@ 2024-04-03  7:16         ` Ryan Roberts
  0 siblings, 0 replies; 35+ messages in thread
From: Ryan Roberts @ 2024-04-03  7:16 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, Barry Song,
	Chris Li, Lance Yang, linux-mm, linux-kernel

On 03/04/2024 04:57, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> On 01/04/2024 06:52, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> Now that we no longer have a convenient flag in the cluster to determine
>>>> if a folio is large, free_swap_and_cache() will take a reference and
>>>> lock a large folio much more often, which could lead to contention and
>>>> (e.g.) failure to split large folios, etc.
>>>>
>>>> Let's solve that problem by batch freeing swap and cache with a new
>>>> function, free_swap_and_cache_nr(), to free a contiguous range of swap
>>>> entries together. This allows us to first drop a reference to each swap
>>>> slot before we try to release the cache folio. This means we only try to
>>>> release the folio once, only taking the reference and lock once - much
>>>> better than the previous 512 times for the 2M THP case.
>>>>
>>>> Contiguous swap entries are gathered in zap_pte_range() and
>>>> madvise_free_pte_range() in a similar way to how present ptes are
>>>> already gathered in zap_pte_range().
>>>>
>>>> While we are at it, let's simplify by converting the return type of both
>>>> functions to void. The return value was used only by zap_pte_range() to
>>>> print a bad pte, and was ignored by everyone else, so the extra
>>>> reporting wasn't exactly guaranteed. We will still get the warning with
>>>> most of the information from get_swap_device(). With the batch version,
>>>> we wouldn't know which pte was bad anyway so could print the wrong one.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>  include/linux/pgtable.h | 28 +++++++++++++++
>>>>  include/linux/swap.h    | 12 +++++--
>>>>  mm/internal.h           | 48 +++++++++++++++++++++++++
>>>>  mm/madvise.c            | 12 ++++---
>>>>  mm/memory.c             | 13 +++----
>>>>  mm/swapfile.c           | 78 ++++++++++++++++++++++++++++++-----------
>>>>  6 files changed, 157 insertions(+), 34 deletions(-)
>>>>
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index 09c85c7bf9c2..8185939df1e8 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -708,6 +708,34 @@ static inline void pte_clear_not_present_full(struct mm_struct *mm,
>>>>  }
>>>>  #endif
>>>>  
>>>> +#ifndef clear_not_present_full_ptes
>>>> +/**
>>>> + * clear_not_present_full_ptes - Clear consecutive not present PTEs.
>>>> + * @mm: Address space the ptes represent.
>>>> + * @addr: Address of the first pte.
>>>> + * @ptep: Page table pointer for the first entry.
>>>> + * @nr: Number of entries to clear.
>>>> + * @full: Whether we are clearing a full mm.
>>>> + *
>>>> + * May be overridden by the architecture; otherwise, implemented as a simple
>>>> + * loop over pte_clear_not_present_full().
>>>> + *
>>>> + * Context: The caller holds the page table lock.  The PTEs are all not present.
>>>> + * The PTEs are all in the same PMD.
>>>> + */
>>>> +static inline void clear_not_present_full_ptes(struct mm_struct *mm,
>>>> +		unsigned long addr, pte_t *ptep, unsigned int nr, int full)
>>>> +{
>>>> +	for (;;) {
>>>> +		pte_clear_not_present_full(mm, addr, ptep, full);
>>>> +		if (--nr == 0)
>>>> +			break;
>>>> +		ptep++;
>>>> +		addr += PAGE_SIZE;
>>>> +	}
>>>> +}
>>>> +#endif
>>>> +
>>>>  #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
>>>>  extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
>>>>  			      unsigned long address,
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index f6f78198f000..5737236dc3ce 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -471,7 +471,7 @@ extern int swap_duplicate(swp_entry_t);
>>>>  extern int swapcache_prepare(swp_entry_t);
>>>>  extern void swap_free(swp_entry_t);
>>>>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>>>> -extern int free_swap_and_cache(swp_entry_t);
>>>> +extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>>>>  int swap_type_of(dev_t device, sector_t offset);
>>>>  int find_first_swap(dev_t *device);
>>>>  extern unsigned int count_swap_pages(int, int);
>>>> @@ -520,8 +520,9 @@ static inline void put_swap_device(struct swap_info_struct *si)
>>>>  #define free_pages_and_swap_cache(pages, nr) \
>>>>  	release_pages((pages), (nr));
>>>>  
>>>> -/* used to sanity check ptes in zap_pte_range when CONFIG_SWAP=0 */
>>>> -#define free_swap_and_cache(e) is_pfn_swap_entry(e)
>>>> +static inline void free_swap_and_cache_nr(swp_entry_t entry, int nr)
>>>> +{
>>>> +}
>>>>  
>>>>  static inline void free_swap_cache(struct folio *folio)
>>>>  {
>>>> @@ -589,6 +590,11 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
>>>>  }
>>>>  #endif /* CONFIG_SWAP */
>>>>  
>>>> +static inline void free_swap_and_cache(swp_entry_t entry)
>>>> +{
>>>> +	free_swap_and_cache_nr(entry, 1);
>>>> +}
>>>> +
>>>>  #ifdef CONFIG_MEMCG
>>>>  static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
>>>>  {
>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>> index 8e11f7b2da21..eadb79c3a357 100644
>>>> --- a/mm/internal.h
>>>> +++ b/mm/internal.h
>>>> @@ -11,6 +11,8 @@
>>>>  #include <linux/mm.h>
>>>>  #include <linux/pagemap.h>
>>>>  #include <linux/rmap.h>
>>>> +#include <linux/swap.h>
>>>> +#include <linux/swapops.h>
>>>>  #include <linux/tracepoint-defs.h>
>>>>  
>>>>  struct folio_batch;
>>>> @@ -189,6 +191,52 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>>>  
>>>>  	return min(ptep - start_ptep, max_nr);
>>>>  }
>>>> +
>>>> +/**
>>>> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
>>>> + * @start_ptep: Page table pointer for the first entry.
>>>> + * @max_nr: The maximum number of table entries to consider.
>>>> + * @entry: Swap entry recovered from the first table entry.
>>>> + *
>>>> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
>>>> + * containing swap entries all with consecutive offsets and targeting the same
>>>> + * swap type.
>>>> + *
>>>> + * max_nr must be at least one and must be limited by the caller so scanning
>>>> + * cannot exceed a single page table.
>>>> + *
>>>> + * Return: the number of table entries in the batch.
>>>> + */
>>>> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr,
>>>> +				 swp_entry_t entry)
>>>> +{
>>>> +	const pte_t *end_ptep = start_ptep + max_nr;
>>>> +	unsigned long expected_offset = swp_offset(entry) + 1;
>>>> +	unsigned int expected_type = swp_type(entry);
>>>> +	pte_t *ptep = start_ptep + 1;
>>>> +
>>>> +	VM_WARN_ON(max_nr < 1);
>>>> +	VM_WARN_ON(non_swap_entry(entry));
>>>> +
>>>> +	while (ptep < end_ptep) {
>>>> +		pte_t pte = ptep_get(ptep);
>>>> +
>>>> +		if (pte_none(pte) || pte_present(pte))
>>>> +			break;
>>>> +
>>>> +		entry = pte_to_swp_entry(pte);
>>>> +
>>>> +		if (non_swap_entry(entry) ||
>>>> +		    swp_type(entry) != expected_type ||
>>>> +		    swp_offset(entry) != expected_offset)
>>>> +			break;
>>>> +
>>>> +		expected_offset++;
>>>> +		ptep++;
>>>> +	}
>>>> +
>>>> +	return ptep - start_ptep;
>>>> +}
>>>>  #endif /* CONFIG_MMU */
>>>>  
>>>>  void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>> index 1f77a51baaac..070bedb4996e 100644
>>>> --- a/mm/madvise.c
>>>> +++ b/mm/madvise.c
>>>> @@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>>>>  	struct folio *folio;
>>>>  	int nr_swap = 0;
>>>>  	unsigned long next;
>>>> +	int nr, max_nr;
>>>>  
>>>>  	next = pmd_addr_end(addr, end);
>>>>  	if (pmd_trans_huge(*pmd))
>>>> @@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>>>>  		return 0;
>>>>  	flush_tlb_batched_pending(mm);
>>>>  	arch_enter_lazy_mmu_mode();
>>>> -	for (; addr != end; pte++, addr += PAGE_SIZE) {
>>>> +	for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
>>>> +		nr = 1;
>>>>  		ptent = ptep_get(pte);
>>>>  
>>>>  		if (pte_none(ptent))
>>>> @@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>>>>  
>>>>  			entry = pte_to_swp_entry(ptent);
>>>>  			if (!non_swap_entry(entry)) {
>>>> -				nr_swap--;
>>>> -				free_swap_and_cache(entry);
>>>> -				pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>>>> +				max_nr = (end - addr) / PAGE_SIZE;
>>>> +				nr = swap_pte_batch(pte, max_nr, entry);
>>>> +				nr_swap -= nr;
>>>> +				free_swap_and_cache_nr(entry, nr);
>>>> +				clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>>>>  			} else if (is_hwpoison_entry(entry) ||
>>>>  				   is_poisoned_swp_entry(entry)) {
>>>>  				pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 36191a9c799c..9d844582ba38 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -1631,12 +1631,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>>>  				folio_remove_rmap_pte(folio, page, vma);
>>>>  			folio_put(folio);
>>>>  		} else if (!non_swap_entry(entry)) {
>>>> -			/* Genuine swap entry, hence a private anon page */
>>>> +			max_nr = (end - addr) / PAGE_SIZE;
>>>> +			nr = swap_pte_batch(pte, max_nr, entry);
>>>> +			/* Genuine swap entries, hence a private anon pages */
>>>>  			if (!should_zap_cows(details))
>>>>  				continue;
>>>> -			rss[MM_SWAPENTS]--;
>>>> -			if (unlikely(!free_swap_and_cache(entry)))
>>>> -				print_bad_pte(vma, addr, ptent, NULL);
>>>> +			rss[MM_SWAPENTS] -= nr;
>>>> +			free_swap_and_cache_nr(entry, nr);
>>>>  		} else if (is_migration_entry(entry)) {
>>>>  			folio = pfn_swap_entry_folio(entry);
>>>>  			if (!should_zap_folio(details, folio))
>>>> @@ -1659,8 +1660,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>>>  			pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
>>>>  			WARN_ON_ONCE(1);
>>>>  		}
>>>> -		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>>>> -		zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
>>>> +		clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>>>> +		zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
>>>>  	} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>>>>  
>>>>  	add_mm_rss_vec(mm, rss);
>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>> index 0d44ee2b4f9c..cedfc82d37e5 100644
>>>> --- a/mm/swapfile.c
>>>> +++ b/mm/swapfile.c
>>>> @@ -130,7 +130,11 @@ static inline unsigned char swap_count(unsigned char ent)
>>>>  /* Reclaim the swap entry if swap is getting full*/
>>>>  #define TTRS_FULL		0x4
>>>>  
>>>> -/* returns 1 if swap entry is freed */
>>>> +/*
>>>> + * returns number of pages in the folio that backs the swap entry. If positive,
>>>> + * the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no
>>>> + * folio was associated with the swap entry.
>>>> + */
>>>>  static int __try_to_reclaim_swap(struct swap_info_struct *si,
>>>>  				 unsigned long offset, unsigned long flags)
>>>>  {
>>>> @@ -155,6 +159,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
>>>>  			ret = folio_free_swap(folio);
>>>>  		folio_unlock(folio);
>>>>  	}
>>>> +	ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio);
>>>>  	folio_put(folio);
>>>>  	return ret;
>>>>  }
>>>> @@ -895,7 +900,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>>  		swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
>>>>  		spin_lock(&si->lock);
>>>>  		/* entry was freed successfully, try to use this again */
>>>> -		if (swap_was_freed)
>>>> +		if (swap_was_freed > 0)
>>>>  			goto checks;
>>>>  		goto scan; /* check next one */
>>>>  	}
>>>> @@ -1572,32 +1577,63 @@ bool folio_free_swap(struct folio *folio)
>>>>  	return true;
>>>>  }
>>>>  
>>>> -/*
>>>> - * Free the swap entry like above, but also try to
>>>> - * free the page cache entry if it is the last user.
>>>> - */
>>>> -int free_swap_and_cache(swp_entry_t entry)
>>>> +void free_swap_and_cache_nr(swp_entry_t entry, int nr)
>>>>  {
>>>> -	struct swap_info_struct *p;
>>>> -	unsigned char count;
>>>> +	unsigned long end = swp_offset(entry) + nr;
>>>> +	unsigned int type = swp_type(entry);
>>>> +	struct swap_info_struct *si;
>>>> +	unsigned long offset;
>>>>  
>>>>  	if (non_swap_entry(entry))
>>>> -		return 1;
>>>> +		return;
>>>>  
>>>> -	p = get_swap_device(entry);
>>>> -	if (p) {
>>>> -		if (WARN_ON(data_race(!p->swap_map[swp_offset(entry)]))) {
>>>> -			put_swap_device(p);
>>>> -			return 0;
>>>> -		}
>>>> +	si = get_swap_device(entry);
>>>> +	if (!si)
>>>> +		return;
>>>>  
>>>> -		count = __swap_entry_free(p, entry);
>>>> -		if (count == SWAP_HAS_CACHE)
>>>> -			__try_to_reclaim_swap(p, swp_offset(entry),
>>>> +	if (WARN_ON(end > si->max))
>>>> +		goto out;
>>>> +
>>>> +	/*
>>>> +	 * First free all entries in the range.
>>>> +	 */
>>>> +	for (offset = swp_offset(entry); offset < end; offset++) {
>>>> +		if (!WARN_ON(data_race(!si->swap_map[offset])))
>>>> +			__swap_entry_free(si, swp_entry(type, offset));
>>>
>>> I think that it's better to check the return value of
>>> __swap_entry_free() here.  When the return value != SWAP_HAS_CACHE, we
>>> can try to reclaim all swap entries we have checked before, then restart
>>> the check with the new start.
>>
>> What's the benefit of your proposed aproach? I only see a drawback: if there are
>> large swap entries for which some pages have higher ref counts than others, we
>> will end up trying to reclaim (and fail) multiple times per folio. Whereas with
>> my current approach we only attempt reclaim once per folio.
>>
>> Do you see a specific bug with what I'm currently doing?
> 
> No.  Just want to find some opportunity to optimize.  With my proposal,
> we can skip reclaim if "return value != SWAP_HAS_CACHE", for example, if
> the folio has been removed from the swap cache (fully reclaimed).  But
> this has some drawbacks too, as you pointed out above.
> 
> At least, we can check whether any return value of __swap_entry_free()
> == SWAP_HAS_CACHE, and only try to reclaim if so.  This can optimize the
> most common cases.

Yes, good idea - I'll incorporate this in the next version. Thanks!

> 
> [snip]
> 
> --
> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
  2024-04-03  0:30   ` Zi Yan
  2024-04-03  0:47     ` Lance Yang
@ 2024-04-03  7:21     ` Ryan Roberts
  2024-04-05  9:24       ` David Hildenbrand
  1 sibling, 1 reply; 35+ messages in thread
From: Ryan Roberts @ 2024-04-03  7:21 UTC (permalink / raw)
  To: Zi Yan
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Barry Song, Chris Li, Lance Yang, linux-mm, linux-kernel

On 03/04/2024 01:30, Zi Yan wrote:
> On 27 Mar 2024, at 10:45, Ryan Roberts wrote:
> 
>> Now that we no longer have a convenient flag in the cluster to determine
>> if a folio is large, free_swap_and_cache() will take a reference and
>> lock a large folio much more often, which could lead to contention and
>> (e.g.) failure to split large folios, etc.
>>
>> Let's solve that problem by batch freeing swap and cache with a new
>> function, free_swap_and_cache_nr(), to free a contiguous range of swap
>> entries together. This allows us to first drop a reference to each swap
>> slot before we try to release the cache folio. This means we only try to
>> release the folio once, only taking the reference and lock once - much
>> better than the previous 512 times for the 2M THP case.
>>
>> Contiguous swap entries are gathered in zap_pte_range() and
>> madvise_free_pte_range() in a similar way to how present ptes are
>> already gathered in zap_pte_range().
>>
>> While we are at it, let's simplify by converting the return type of both
>> functions to void. The return value was used only by zap_pte_range() to
>> print a bad pte, and was ignored by everyone else, so the extra
>> reporting wasn't exactly guaranteed. We will still get the warning with
>> most of the information from get_swap_device(). With the batch version,
>> we wouldn't know which pte was bad anyway so could print the wrong one.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  include/linux/pgtable.h | 28 +++++++++++++++
>>  include/linux/swap.h    | 12 +++++--
>>  mm/internal.h           | 48 +++++++++++++++++++++++++
>>  mm/madvise.c            | 12 ++++---
>>  mm/memory.c             | 13 +++----
>>  mm/swapfile.c           | 78 ++++++++++++++++++++++++++++++-----------
>>  6 files changed, 157 insertions(+), 34 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 09c85c7bf9c2..8185939df1e8 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -708,6 +708,34 @@ static inline void pte_clear_not_present_full(struct mm_struct *mm,
>>  }
>>  #endif
>>
>> +#ifndef clear_not_present_full_ptes
>> +/**
>> + * clear_not_present_full_ptes - Clear consecutive not present PTEs.
>> + * @mm: Address space the ptes represent.
>> + * @addr: Address of the first pte.
>> + * @ptep: Page table pointer for the first entry.
>> + * @nr: Number of entries to clear.
>> + * @full: Whether we are clearing a full mm.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over pte_clear_not_present_full().
>> + *
>> + * Context: The caller holds the page table lock.  The PTEs are all not present.
>> + * The PTEs are all in the same PMD.
>> + */
>> +static inline void clear_not_present_full_ptes(struct mm_struct *mm,
>> +		unsigned long addr, pte_t *ptep, unsigned int nr, int full)
>> +{
>> +	for (;;) {
>> +		pte_clear_not_present_full(mm, addr, ptep, full);
>> +		if (--nr == 0)
>> +			break;
>> +		ptep++;
>> +		addr += PAGE_SIZE;
>> +	}
>> +}
>> +#endif
>> +
> 
> Would the code below be better?
> 
> for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE)
> 	pte_clear_not_present_full(mm, addr, ptep, full);

I certainly agree that this is cleaner and more standard. But I'm copying the
pattern used by the other batch helpers. I believe this pattern was first done
by Willy for set_ptes(), then continued by DavidH for wrprotect_ptes() and
clear_full_ptes().

I guess the benefit is that ptep and addr are only incremented if we are going
around the loop again. I'd rather continue to be consistent with those other
helpers.


> 
> --
> Best Regards,
> Yan, Zi


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 4/6] mm: swap: Allow storage of all mTHP orders
  2024-04-03  3:07       ` Huang, Ying
@ 2024-04-03  7:48         ` Ryan Roberts
  0 siblings, 0 replies; 35+ messages in thread
From: Ryan Roberts @ 2024-04-03  7:48 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, Barry Song,
	Chris Li, Lance Yang, linux-mm, linux-kernel

On 03/04/2024 04:07, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> On 01/04/2024 04:15, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> Multi-size THP enables performance improvements by allocating large,
>>>> pte-mapped folios for anonymous memory. However I've observed that on an
>>>> arm64 system running a parallel workload (e.g. kernel compilation)
>>>> across many cores, under high memory pressure, the speed regresses. This
>>>> is due to bottlenecking on the increased number of TLBIs added due to
>>>> all the extra folio splitting when the large folios are swapped out.
>>>>
>>>> Therefore, solve this regression by adding support for swapping out mTHP
>>>> without needing to split the folio, just like is already done for
>>>> PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled,
>>>> and when the swap backing store is a non-rotating block device. These
>>>> are the same constraints as for the existing PMD-sized THP swap-out
>>>> support.
>>>>
>>>> Note that no attempt is made to swap-in (m)THP here - this is still done
>>>> page-by-page, like for PMD-sized THP. But swapping-out mTHP is a
>>>> prerequisite for swapping-in mTHP.
>>>>
>>>> The main change here is to improve the swap entry allocator so that it
>>>> can allocate any power-of-2 number of contiguous entries between [1, (1
>>>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>>>> order and allocating sequentially from it until the cluster is full.
>>>> This ensures that we don't need to search the map and we get no
>>>> fragmentation due to alignment padding for different orders in the
>>>> cluster. If there is no current cluster for a given order, we attempt to
>>>> allocate a free cluster from the list. If there are no free clusters, we
>>>> fail the allocation and the caller can fall back to splitting the folio
>>>> and allocates individual entries (as per existing PMD-sized THP
>>>> fallback).
>>>>
>>>> The per-order current clusters are maintained per-cpu using the existing
>>>> infrastructure. This is done to avoid interleving pages from different
>>>> tasks, which would prevent IO being batched. This is already done for
>>>> the order-0 allocations so we follow the same pattern.
>>>>
>>>> As is done for order-0 per-cpu clusters, the scanner now can steal
>>>> order-0 entries from any per-cpu-per-order reserved cluster. This
>>>> ensures that when the swap file is getting full, space doesn't get tied
>>>> up in the per-cpu reserves.
>>>>
>>>> This change only modifies swap to be able to accept any order mTHP. It
>>>> doesn't change the callers to elide doing the actual split. That will be
>>>> done in separate changes.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>  include/linux/swap.h |  10 ++-
>>>>  mm/swap_slots.c      |   6 +-
>>>>  mm/swapfile.c        | 175 ++++++++++++++++++++++++-------------------
>>>>  3 files changed, 109 insertions(+), 82 deletions(-)
>>>>
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index 5e1e4f5bf0cb..11c53692f65f 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -268,13 +268,19 @@ struct swap_cluster_info {
>>>>   */
>>>>  #define SWAP_NEXT_INVALID	0
>>>>  
>>>> +#ifdef CONFIG_THP_SWAP
>>>> +#define SWAP_NR_ORDERS		(PMD_ORDER + 1)
>>>> +#else
>>>> +#define SWAP_NR_ORDERS		1
>>>> +#endif
>>>> +
>>>>  /*
>>>>   * We assign a cluster to each CPU, so each CPU can allocate swap entry from
>>>>   * its own cluster and swapout sequentially. The purpose is to optimize swapout
>>>>   * throughput.
>>>>   */
>>>>  struct percpu_cluster {
>>>> -	unsigned int next; /* Likely next allocation offset */
>>>> +	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
>>>>  };
>>>>  
>>>>  struct swap_cluster_list {
>>>> @@ -471,7 +477,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio);
>>>>  bool folio_free_swap(struct folio *folio);
>>>>  void put_swap_folio(struct folio *folio, swp_entry_t entry);
>>>>  extern swp_entry_t get_swap_page_of_type(int);
>>>> -extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size);
>>>> +extern int get_swap_pages(int n, swp_entry_t swp_entries[], int order);
>>>>  extern int add_swap_count_continuation(swp_entry_t, gfp_t);
>>>>  extern void swap_shmem_alloc(swp_entry_t);
>>>>  extern int swap_duplicate(swp_entry_t);
>>>> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
>>>> index 53abeaf1371d..13ab3b771409 100644
>>>> --- a/mm/swap_slots.c
>>>> +++ b/mm/swap_slots.c
>>>> @@ -264,7 +264,7 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
>>>>  	cache->cur = 0;
>>>>  	if (swap_slot_cache_active)
>>>>  		cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE,
>>>> -					   cache->slots, 1);
>>>> +					   cache->slots, 0);
>>>>  
>>>>  	return cache->nr;
>>>>  }
>>>> @@ -311,7 +311,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
>>>>  
>>>>  	if (folio_test_large(folio)) {
>>>>  		if (IS_ENABLED(CONFIG_THP_SWAP))
>>>> -			get_swap_pages(1, &entry, folio_nr_pages(folio));
>>>> +			get_swap_pages(1, &entry, folio_order(folio));
>>>>  		goto out;
>>>>  	}
>>>>  
>>>> @@ -343,7 +343,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
>>>>  			goto out;
>>>>  	}
>>>>  
>>>> -	get_swap_pages(1, &entry, 1);
>>>> +	get_swap_pages(1, &entry, 0);
>>>>  out:
>>>>  	if (mem_cgroup_try_charge_swap(folio, entry)) {
>>>>  		put_swap_folio(folio, entry);
>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>> index 1393966b77af..d56cdc547a06 100644
>>>> --- a/mm/swapfile.c
>>>> +++ b/mm/swapfile.c
>>>> @@ -278,15 +278,15 @@ static void discard_swap_cluster(struct swap_info_struct *si,
>>>>  #ifdef CONFIG_THP_SWAP
>>>>  #define SWAPFILE_CLUSTER	HPAGE_PMD_NR
>>>>  
>>>> -#define swap_entry_size(size)	(size)
>>>> +#define swap_entry_order(order)	(order)
>>>>  #else
>>>>  #define SWAPFILE_CLUSTER	256
>>>>  
>>>>  /*
>>>> - * Define swap_entry_size() as constant to let compiler to optimize
>>>> + * Define swap_entry_order() as constant to let compiler to optimize
>>>>   * out some code if !CONFIG_THP_SWAP
>>>>   */
>>>> -#define swap_entry_size(size)	1
>>>> +#define swap_entry_order(order)	0
>>>>  #endif
>>>>  #define LATENCY_LIMIT		256
>>>>  
>>>> @@ -551,10 +551,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>>>>  
>>>>  /*
>>>>   * The cluster corresponding to page_nr will be used. The cluster will be
>>>> - * removed from free cluster list and its usage counter will be increased.
>>>> + * removed from free cluster list and its usage counter will be increased by
>>>> + * count.
>>>>   */
>>>> -static void inc_cluster_info_page(struct swap_info_struct *p,
>>>> -	struct swap_cluster_info *cluster_info, unsigned long page_nr)
>>>> +static void add_cluster_info_page(struct swap_info_struct *p,
>>>> +	struct swap_cluster_info *cluster_info, unsigned long page_nr,
>>>> +	unsigned long count)
>>>>  {
>>>>  	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>>>>  
>>>> @@ -563,9 +565,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
>>>>  	if (cluster_is_free(&cluster_info[idx]))
>>>>  		alloc_cluster(p, idx);
>>>>  
>>>> -	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
>>>> +	VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
>>>>  	cluster_set_count(&cluster_info[idx],
>>>> -		cluster_count(&cluster_info[idx]) + 1);
>>>> +		cluster_count(&cluster_info[idx]) + count);
>>>> +}
>>>> +
>>>> +/*
>>>> + * The cluster corresponding to page_nr will be used. The cluster will be
>>>> + * removed from free cluster list and its usage counter will be increased by 1.
>>>> + */
>>>> +static void inc_cluster_info_page(struct swap_info_struct *p,
>>>> +	struct swap_cluster_info *cluster_info, unsigned long page_nr)
>>>> +{
>>>> +	add_cluster_info_page(p, cluster_info, page_nr, 1);
>>>>  }
>>>>  
>>>>  /*
>>>> @@ -595,7 +607,7 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
>>>>   */
>>>>  static bool
>>>>  scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>>> -	unsigned long offset)
>>>> +	unsigned long offset, int order)
>>>>  {
>>>>  	struct percpu_cluster *percpu_cluster;
>>>>  	bool conflict;
>>>> @@ -609,24 +621,39 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>>>  		return false;
>>>>  
>>>>  	percpu_cluster = this_cpu_ptr(si->percpu_cluster);
>>>> -	percpu_cluster->next = SWAP_NEXT_INVALID;
>>>> +	percpu_cluster->next[order] = SWAP_NEXT_INVALID;
>>>> +	return true;
>>>> +}
>>>> +
>>>> +static inline bool swap_range_empty(char *swap_map, unsigned int start,
>>>> +				    unsigned int nr_pages)
>>>> +{
>>>> +	unsigned int i;
>>>> +
>>>> +	for (i = 0; i < nr_pages; i++) {
>>>> +		if (swap_map[start + i])
>>>> +			return false;
>>>> +	}
>>>> +
>>>>  	return true;
>>>>  }
>>>>  
>>>>  /*
>>>> - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
>>>> - * might involve allocating a new cluster for current CPU too.
>>>> + * Try to get swap entries with specified order from current cpu's swap entry
>>>> + * pool (a cluster). This might involve allocating a new cluster for current CPU
>>>> + * too.
>>>>   */
>>>>  static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>>> -	unsigned long *offset, unsigned long *scan_base)
>>>> +	unsigned long *offset, unsigned long *scan_base, int order)
>>>>  {
>>>> +	unsigned int nr_pages = 1 << order;
>>>
>>> Use swap_entry_order()?
>>
>> I had previously convinced myself that the compiler should be smart enough to
>> propagate the constant from
>>
>> get_swap_pages -> scan_swap_map_slots -> scan_swap_map_try_ssd_cluster
> 
> Do some experiments via calling function with constants and check the
> compiled code.  It seems that "interprocedural constant propagation" in
> compiler can optimize the code at least if the callee is "static".

Yes; I just confirmed this by compiling swapfile.c to assembly. For the
!CONFIG_THP_SWAP case, as long as get_swap_pages() is using swap_entry_order(),
the constant order=0 is propagated to scan_swap_map_slots() and
scan_swap_map_try_ssd_cluster() implicitly and those functions' assembly is
hardcoded for order=0.

So at least for arm64 with this specific toolchain, it all works as I assumed
and swap_entry_order() is not required in the static functions.

aarch64-none-linux-gnu-gcc (Arm GNU Toolchain 13.2.rel1 (Build arm-13.7)) 13.2.1
20231009

> 
>> But I'll add the explicit macro for the next version, as you suggest.
> 
> So, I will leave it to you to decide whether to do that.

On this basis, I'd rather leave the compiler to do the optimizations itself and
reduce swap_entry_order() usage to a minimum (i.e. only at the non-static entry
points).

Thanks,
Ryan

> 
> --
> Best Regards,
> Huang, Ying
> 
> [snip]


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 5/6] mm: vmscan: Avoid split during shrink_folio_list()
  2024-04-02 13:10     ` Ryan Roberts
  2024-04-02 13:22       ` Lance Yang
  2024-04-02 13:22       ` Ryan Roberts
@ 2024-04-05  4:06       ` Barry Song
  2024-04-05  7:28         ` Ryan Roberts
  2 siblings, 1 reply; 35+ messages in thread
From: Barry Song @ 2024-04-05  4:06 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Chris Li, Lance Yang, linux-mm, linux-kernel, Barry Song

On Wed, Apr 3, 2024 at 2:10 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 28/03/2024 08:18, Barry Song wrote:
> > On Thu, Mar 28, 2024 at 3:45 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> Now that swap supports storing all mTHP sizes, avoid splitting large
> >> folios before swap-out. This benefits performance of the swap-out path
> >> by eliding split_folio_to_list(), which is expensive, and also sets us
> >> up for swapping in large folios in a future series.
> >>
> >> If the folio is partially mapped, we continue to split it since we want
> >> to avoid the extra IO overhead and storage of writing out pages
> >> uneccessarily.
> >>
> >> Reviewed-by: David Hildenbrand <david@redhat.com>
> >> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >> ---
> >>  mm/vmscan.c | 9 +++++----
> >>  1 file changed, 5 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 00adaf1cb2c3..293120fe54f3 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1223,11 +1223,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>                                         if (!can_split_folio(folio, NULL))
> >>                                                 goto activate_locked;
> >>                                         /*
> >> -                                        * Split folios without a PMD map right
> >> -                                        * away. Chances are some or all of the
> >> -                                        * tail pages can be freed without IO.
> >> +                                        * Split partially mapped folios right
> >> +                                        * away. We can free the unmapped pages
> >> +                                        * without IO.
> >>                                          */
> >> -                                       if (!folio_entire_mapcount(folio) &&
> >> +                                       if (data_race(!list_empty(
> >> +                                               &folio->_deferred_list)) &&
> >>                                             split_folio_to_list(folio,
> >>                                                                 folio_list))
> >>                                                 goto activate_locked;
> >
> > Hi Ryan,
> >
> > Sorry for bringing up another minor issue at this late stage.
>
> No problem - I'd rather take a bit longer and get it right, rather than rush it
> and get it wrong!
>
> >
> > During the debugging of thp counter patch v2, I noticed the discrepancy between
> > THP_SWPOUT_FALLBACK and THP_SWPOUT.
> >
> > Should we make adjustments to the counter?
>
> Yes, agreed; we want to be consistent here with all the other existing THP
> counters; they only refer to PMD-sized THP. I'll make the change for the next
> version.
>
> I guess we will eventually want equivalent counters for per-size mTHP using the
> framework you are adding.

Hi Ryan,

Today, I created counters for per-order SWPOUT and SWPOUT_FALLBACK.
I'd appreciate any
suggestions you might have before I submit this as patch 2/2 of my
mTHP counters series.

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index cc13fa14aa32..762a6d8759b9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -267,6 +267,8 @@ unsigned long thp_vma_allowable_orders(struct
vm_area_struct *vma,
 enum thp_stat_item {
        THP_STAT_ANON_ALLOC,
        THP_STAT_ANON_ALLOC_FALLBACK,
+       THP_STAT_ANON_SWPOUT,
+       THP_STAT_ANON_SWPOUT_FALLBACK,
        __THP_STAT_COUNT
 };

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e704b4408181..7f2b5d2852cc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -554,10 +554,14 @@ static struct kobj_attribute _name##_attr =
__ATTR_RO(_name)

 THP_STATE_ATTR(anon_alloc, THP_STAT_ANON_ALLOC);
 THP_STATE_ATTR(anon_alloc_fallback, THP_STAT_ANON_ALLOC_FALLBACK);
+THP_STATE_ATTR(anon_swpout, THP_STAT_ANON_SWPOUT);
+THP_STATE_ATTR(anon_swpout_fallback, THP_STAT_ANON_SWPOUT_FALLBACK);

 static struct attribute *stats_attrs[] = {
        &anon_alloc_attr.attr,
        &anon_alloc_fallback_attr.attr,
+       &anon_swpout_attr.attr,
+       &anon_swpout_fallback_attr.attr,
        NULL,
 };

diff --git a/mm/page_io.c b/mm/page_io.c
index a9a7c236aecc..be4f822b39f8 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -212,13 +212,16 @@ int swap_writepage(struct page *page, struct
writeback_control *wbc)

 static inline void count_swpout_vm_event(struct folio *folio)
 {
+       long nr_pages = folio_nr_pages(folio);
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
        if (unlikely(folio_test_pmd_mappable(folio))) {
                count_memcg_folio_events(folio, THP_SWPOUT, 1);
                count_vm_event(THP_SWPOUT);
        }
+       if (nr_pages > 0 && nr_pages <= HPAGE_PMD_NR)
+               count_thp_state(folio_order(folio), THP_STAT_ANON_SWPOUT);
 #endif
-       count_vm_events(PSWPOUT, folio_nr_pages(folio));
+       count_vm_events(PSWPOUT, nr_pages);
 }

 #if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ffc4553c8615..b7c5fbd830b6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1247,6 +1247,10 @@ static unsigned int shrink_folio_list(struct
list_head *folio_list,
                                                count_vm_event(
                                                        THP_SWPOUT_FALLBACK);
                                        }
+                                       if (nr_pages > 0 && nr_pages
<= HPAGE_PMD_NR)
+
count_thp_state(folio_order(folio),
+
THP_STAT_ANON_SWPOUT_FALLBACK);
+
 #endif
                                        if (!add_to_swap(folio))
                                                goto activate_locked_split;


Thanks
Barry

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 5/6] mm: vmscan: Avoid split during shrink_folio_list()
  2024-04-05  4:06       ` Barry Song
@ 2024-04-05  7:28         ` Ryan Roberts
  0 siblings, 0 replies; 35+ messages in thread
From: Ryan Roberts @ 2024-04-05  7:28 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Chris Li, Lance Yang, linux-mm, linux-kernel, Barry Song

On 05/04/2024 05:06, Barry Song wrote:
> On Wed, Apr 3, 2024 at 2:10 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 28/03/2024 08:18, Barry Song wrote:
>>> On Thu, Mar 28, 2024 at 3:45 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> Now that swap supports storing all mTHP sizes, avoid splitting large
>>>> folios before swap-out. This benefits performance of the swap-out path
>>>> by eliding split_folio_to_list(), which is expensive, and also sets us
>>>> up for swapping in large folios in a future series.
>>>>
>>>> If the folio is partially mapped, we continue to split it since we want
>>>> to avoid the extra IO overhead and storage of writing out pages
>>>> uneccessarily.
>>>>
>>>> Reviewed-by: David Hildenbrand <david@redhat.com>
>>>> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>  mm/vmscan.c | 9 +++++----
>>>>  1 file changed, 5 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 00adaf1cb2c3..293120fe54f3 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -1223,11 +1223,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>>                                         if (!can_split_folio(folio, NULL))
>>>>                                                 goto activate_locked;
>>>>                                         /*
>>>> -                                        * Split folios without a PMD map right
>>>> -                                        * away. Chances are some or all of the
>>>> -                                        * tail pages can be freed without IO.
>>>> +                                        * Split partially mapped folios right
>>>> +                                        * away. We can free the unmapped pages
>>>> +                                        * without IO.
>>>>                                          */
>>>> -                                       if (!folio_entire_mapcount(folio) &&
>>>> +                                       if (data_race(!list_empty(
>>>> +                                               &folio->_deferred_list)) &&
>>>>                                             split_folio_to_list(folio,
>>>>                                                                 folio_list))
>>>>                                                 goto activate_locked;
>>>
>>> Hi Ryan,
>>>
>>> Sorry for bringing up another minor issue at this late stage.
>>
>> No problem - I'd rather take a bit longer and get it right, rather than rush it
>> and get it wrong!
>>
>>>
>>> During the debugging of thp counter patch v2, I noticed the discrepancy between
>>> THP_SWPOUT_FALLBACK and THP_SWPOUT.
>>>
>>> Should we make adjustments to the counter?
>>
>> Yes, agreed; we want to be consistent here with all the other existing THP
>> counters; they only refer to PMD-sized THP. I'll make the change for the next
>> version.
>>
>> I guess we will eventually want equivalent counters for per-size mTHP using the
>> framework you are adding.
> 
> Hi Ryan,
> 
> Today, I created counters for per-order SWPOUT and SWPOUT_FALLBACK.
> I'd appreciate any
> suggestions you might have before I submit this as patch 2/2 of my
> mTHP counters series.

Amazing - this is going to be very useful!

> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index cc13fa14aa32..762a6d8759b9 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -267,6 +267,8 @@ unsigned long thp_vma_allowable_orders(struct
> vm_area_struct *vma,
>  enum thp_stat_item {
>         THP_STAT_ANON_ALLOC,
>         THP_STAT_ANON_ALLOC_FALLBACK,
> +       THP_STAT_ANON_SWPOUT,
> +       THP_STAT_ANON_SWPOUT_FALLBACK,
>         __THP_STAT_COUNT
>  };
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e704b4408181..7f2b5d2852cc 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -554,10 +554,14 @@ static struct kobj_attribute _name##_attr =
> __ATTR_RO(_name)
> 
>  THP_STATE_ATTR(anon_alloc, THP_STAT_ANON_ALLOC);
>  THP_STATE_ATTR(anon_alloc_fallback, THP_STAT_ANON_ALLOC_FALLBACK);
> +THP_STATE_ATTR(anon_swpout, THP_STAT_ANON_SWPOUT);
> +THP_STATE_ATTR(anon_swpout_fallback, THP_STAT_ANON_SWPOUT_FALLBACK);
> 
>  static struct attribute *stats_attrs[] = {
>         &anon_alloc_attr.attr,
>         &anon_alloc_fallback_attr.attr,
> +       &anon_swpout_attr.attr,
> +       &anon_swpout_fallback_attr.attr,
>         NULL,
>  };
> 
> diff --git a/mm/page_io.c b/mm/page_io.c
> index a9a7c236aecc..be4f822b39f8 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -212,13 +212,16 @@ int swap_writepage(struct page *page, struct
> writeback_control *wbc)
> 
>  static inline void count_swpout_vm_event(struct folio *folio)
>  {
> +       long nr_pages = folio_nr_pages(folio);
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>         if (unlikely(folio_test_pmd_mappable(folio))) {
>                 count_memcg_folio_events(folio, THP_SWPOUT, 1);
>                 count_vm_event(THP_SWPOUT);
>         }
> +       if (nr_pages > 0 && nr_pages <= HPAGE_PMD_NR)

The guard is a bit ugly; I wonder if we should at least check that order is in
bounds in count_thp_state(), since all callers could benefit? Then we only have
to care about the nr_pages > 0 condition here. Just a thought...

> +               count_thp_state(folio_order(folio), THP_STAT_ANON_SWPOUT);

So you're counting THPs, not pages; I agree with that approach.

>  #endif
> -       count_vm_events(PSWPOUT, folio_nr_pages(folio));
> +       count_vm_events(PSWPOUT, nr_pages);
>  }
> 
>  #if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ffc4553c8615..b7c5fbd830b6 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1247,6 +1247,10 @@ static unsigned int shrink_folio_list(struct
> list_head *folio_list,
>                                                 count_vm_event(
>                                                         THP_SWPOUT_FALLBACK);
>                                         }
> +                                       if (nr_pages > 0 && nr_pages
> <= HPAGE_PMD_NR)
> +
> count_thp_state(folio_order(folio),
> +
> THP_STAT_ANON_SWPOUT_FALLBACK);
> +
>  #endif
>                                         if (!add_to_swap(folio))
>                                                 goto activate_locked_split;
> 
> 
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 1/6] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-03-27 14:45 ` [PATCH v5 1/6] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
  2024-03-29  1:56   ` Huang, Ying
@ 2024-04-05  9:22   ` David Hildenbrand
  1 sibling, 0 replies; 35+ messages in thread
From: David Hildenbrand @ 2024-04-05  9:22 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Barry Song, Chris Li, Lance Yang
  Cc: linux-mm, linux-kernel

On 27.03.24 15:45, Ryan Roberts wrote:
> As preparation for supporting small-sized THP in the swap-out path,
> without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE,
> which, when present, always implies PMD-sized THP, which is the same as
> the cluster size.
> 
> The only use of the flag was to determine whether a swap entry refers to
> a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
> Instead of relying on the flag, we now pass in nr_pages, which
> originates from the folio's number of pages. This allows the logic to
> work for folios of any order.
> 
> The one snag is that one of the swap_page_trans_huge_swapped() call
> sites does not have the folio. But it was only being called there to
> shortcut a call __try_to_reclaim_swap() in some cases.
> __try_to_reclaim_swap() gets the folio and (via some other functions)
> calls swap_page_trans_huge_swapped(). So I've removed the problematic
> call site and believe the new logic should be functionally equivalent.
> 
> That said, removing the fast path means that we will take a reference
> and trylock a large folio much more often, which we would like to avoid.
> The next patch will solve this.
> 
> Removing CLUSTER_FLAG_HUGE also means we can remove split_swap_cluster()
> which used to be called during folio splitting, since
> split_swap_cluster()'s only job was to remove the flag.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---

Looks like a reasonable cleanup independent of everything else

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v5 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
  2024-04-03  7:21     ` Ryan Roberts
@ 2024-04-05  9:24       ` David Hildenbrand
  0 siblings, 0 replies; 35+ messages in thread
From: David Hildenbrand @ 2024-04-05  9:24 UTC (permalink / raw)
  To: Ryan Roberts, Zi Yan
  Cc: Andrew Morton, Matthew Wilcox, Huang Ying, Gao Xiang, Yu Zhao,
	Yang Shi, Michal Hocko, Kefeng Wang, Barry Song, Chris Li,
	Lance Yang, linux-mm, linux-kernel

On 03.04.24 09:21, Ryan Roberts wrote:
> On 03/04/2024 01:30, Zi Yan wrote:
>> On 27 Mar 2024, at 10:45, Ryan Roberts wrote:
>>
>>> Now that we no longer have a convenient flag in the cluster to determine
>>> if a folio is large, free_swap_and_cache() will take a reference and
>>> lock a large folio much more often, which could lead to contention and
>>> (e.g.) failure to split large folios, etc.
>>>
>>> Let's solve that problem by batch freeing swap and cache with a new
>>> function, free_swap_and_cache_nr(), to free a contiguous range of swap
>>> entries together. This allows us to first drop a reference to each swap
>>> slot before we try to release the cache folio. This means we only try to
>>> release the folio once, only taking the reference and lock once - much
>>> better than the previous 512 times for the 2M THP case.
>>>
>>> Contiguous swap entries are gathered in zap_pte_range() and
>>> madvise_free_pte_range() in a similar way to how present ptes are
>>> already gathered in zap_pte_range().
>>>
>>> While we are at it, let's simplify by converting the return type of both
>>> functions to void. The return value was used only by zap_pte_range() to
>>> print a bad pte, and was ignored by everyone else, so the extra
>>> reporting wasn't exactly guaranteed. We will still get the warning with
>>> most of the information from get_swap_device(). With the batch version,
>>> we wouldn't know which pte was bad anyway so could print the wrong one.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>   include/linux/pgtable.h | 28 +++++++++++++++
>>>   include/linux/swap.h    | 12 +++++--
>>>   mm/internal.h           | 48 +++++++++++++++++++++++++
>>>   mm/madvise.c            | 12 ++++---
>>>   mm/memory.c             | 13 +++----
>>>   mm/swapfile.c           | 78 ++++++++++++++++++++++++++++++-----------
>>>   6 files changed, 157 insertions(+), 34 deletions(-)
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index 09c85c7bf9c2..8185939df1e8 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -708,6 +708,34 @@ static inline void pte_clear_not_present_full(struct mm_struct *mm,
>>>   }
>>>   #endif
>>>
>>> +#ifndef clear_not_present_full_ptes
>>> +/**
>>> + * clear_not_present_full_ptes - Clear consecutive not present PTEs.
>>> + * @mm: Address space the ptes represent.
>>> + * @addr: Address of the first pte.
>>> + * @ptep: Page table pointer for the first entry.
>>> + * @nr: Number of entries to clear.
>>> + * @full: Whether we are clearing a full mm.
>>> + *
>>> + * May be overridden by the architecture; otherwise, implemented as a simple
>>> + * loop over pte_clear_not_present_full().
>>> + *
>>> + * Context: The caller holds the page table lock.  The PTEs are all not present.
>>> + * The PTEs are all in the same PMD.
>>> + */
>>> +static inline void clear_not_present_full_ptes(struct mm_struct *mm,
>>> +		unsigned long addr, pte_t *ptep, unsigned int nr, int full)
>>> +{
>>> +	for (;;) {
>>> +		pte_clear_not_present_full(mm, addr, ptep, full);
>>> +		if (--nr == 0)
>>> +			break;
>>> +		ptep++;
>>> +		addr += PAGE_SIZE;
>>> +	}
>>> +}
>>> +#endif
>>> +
>>
>> Would the code below be better?
>>
>> for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE)
>> 	pte_clear_not_present_full(mm, addr, ptep, full);
> 
> I certainly agree that this is cleaner and more standard. But I'm copying the
> pattern used by the other batch helpers. I believe this pattern was first done
> by Willy for set_ptes(), then continued by DavidH for wrprotect_ptes() and
> clear_full_ptes().
> 
> I guess the benefit is that ptep and addr are only incremented if we are going
> around the loop again. I'd rather continue to be consistent with those other
> helpers.

Yes please. I remember Willy found that variant to be most micro-optimized.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2024-04-05  9:24 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-27 14:45 [PATCH v5 0/6] Swap-out mTHP without splitting Ryan Roberts
2024-03-27 14:45 ` [PATCH v5 1/6] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
2024-03-29  1:56   ` Huang, Ying
2024-04-05  9:22   ` David Hildenbrand
2024-03-27 14:45 ` [PATCH v5 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache() Ryan Roberts
2024-04-01  5:52   ` Huang, Ying
2024-04-02 11:15     ` Ryan Roberts
2024-04-03  3:57       ` Huang, Ying
2024-04-03  7:16         ` Ryan Roberts
2024-04-03  0:30   ` Zi Yan
2024-04-03  0:47     ` Lance Yang
2024-04-03  7:21     ` Ryan Roberts
2024-04-05  9:24       ` David Hildenbrand
2024-03-27 14:45 ` [PATCH v5 3/6] mm: swap: Simplify struct percpu_cluster Ryan Roberts
2024-03-27 14:45 ` [PATCH v5 4/6] mm: swap: Allow storage of all mTHP orders Ryan Roberts
2024-04-01  3:15   ` Huang, Ying
2024-04-02 11:18     ` Ryan Roberts
2024-04-03  3:07       ` Huang, Ying
2024-04-03  7:48         ` Ryan Roberts
2024-03-27 14:45 ` [PATCH v5 5/6] mm: vmscan: Avoid split during shrink_folio_list() Ryan Roberts
2024-03-28  8:18   ` Barry Song
2024-03-28  8:48     ` Ryan Roberts
2024-04-02 13:10     ` Ryan Roberts
2024-04-02 13:22       ` Lance Yang
2024-04-02 13:22       ` Ryan Roberts
2024-04-02 22:54         ` Barry Song
2024-04-05  4:06       ` Barry Song
2024-04-05  7:28         ` Ryan Roberts
2024-03-27 14:45 ` [PATCH v5 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD Ryan Roberts
2024-04-01 12:25   ` Lance Yang
2024-04-02 11:20     ` Ryan Roberts
2024-04-02 11:30       ` Lance Yang
2024-04-02 10:16   ` Barry Song
2024-04-02 10:56     ` Ryan Roberts
2024-04-02 11:01       ` Ryan Roberts

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.