linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/4] Swap-out small-sized THP without splitting
@ 2023-10-25 14:45 Ryan Roberts
  2023-10-25 14:45 ` [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
                   ` (5 more replies)
  0 siblings, 6 replies; 116+ messages in thread
From: Ryan Roberts @ 2023-10-25 14:45 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: Ryan Roberts, linux-kernel, linux-mm

Hi All,

This is v3 of a series to add support for swapping out small-sized THP without
needing to first split the large folio via __split_huge_page(). It closely
follows the approach already used by PMD-sized THP.

"Small-sized THP" is an upcoming feature that enables performance improvements
by allocating large folios for anonymous memory, where the large folio size is
smaller than the traditional PMD-size. See [3].

In some circumstances I've observed a performance regression (see patch 2 for
details), and this series is an attempt to fix the regression in advance of
merging small-sized THP support.

I've done what I thought was the smallest change possible, and as a result, this
approach is only employed when the swap is backed by a non-rotating block device
(just as PMD-sized THP is supported today). Discussion against the RFC concluded
that this is probably sufficient.

The series applies against mm-unstable (1a3c85fa684a)


Changes since v2 [2]
====================

 - Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
   allocation. This required some refactoring to make everything work nicely
   (new patches 2 and 3).
 - Fix bug where nr_swap_pages would say there are pages available but the
   scanner would not be able to allocate them because they were reserved for the
   per-cpu allocator. We now allow stealing of order-0 entries from the high
   order per-cpu clusters (in addition to exisiting stealing from order-0
   per-cpu clusters).

Thanks to Huang, Ying for the review feedback and suggestions!


Changes since v1 [1]
====================

 - patch 1:
    - Use cluster_set_count() instead of cluster_set_count_flag() in
      swap_alloc_cluster() since we no longer have any flag to set. I was unable
      to kill cluster_set_count_flag() as proposed against v1 as other call
      sites depend explicitly setting flags to 0.
 - patch 2:
    - Moved large_next[] array into percpu_cluster to make it per-cpu
      (recommended by Huang, Ying).
    - large_next[] array is dynamically allocated because PMD_ORDER is not
      compile-time constant for powerpc (fixes build error).


Thanks,
Ryan

P.S. I know we agreed this is not a prerequisite for merging small-sized THP,
but given Huang Ying had provided some review feedback, I wanted to progress it.
All the actual prerequisites are either complete or being worked on by others.


[1] https://lore.kernel.org/linux-mm/20231010142111.3997780-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/linux-mm/15a52c3d-9584-449b-8228-1335e0753b04@arm.com/


Ryan Roberts (4):
  mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  mm: swap: Remove struct percpu_cluster
  mm: swap: Simplify ssd behavior when scanner steals entry
  mm: swap: Swap-out small-sized THP without splitting

 include/linux/swap.h |  31 +++---
 mm/huge_memory.c     |   3 -
 mm/swapfile.c        | 232 ++++++++++++++++++++++++-------------------
 mm/vmscan.c          |  10 +-
 4 files changed, 149 insertions(+), 127 deletions(-)

--
2.25.1


^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2023-10-25 14:45 [PATCH v3 0/4] Swap-out small-sized THP without splitting Ryan Roberts
@ 2023-10-25 14:45 ` Ryan Roberts
  2024-02-22 10:19   ` David Hildenbrand
  2023-10-25 14:45 ` [PATCH v3 2/4] mm: swap: Remove struct percpu_cluster Ryan Roberts
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 116+ messages in thread
From: Ryan Roberts @ 2023-10-25 14:45 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: Ryan Roberts, linux-kernel, linux-mm

As preparation for supporting small-sized THP in the swap-out path,
without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE,
which, when present, always implies PMD-sized THP, which is the same as
the cluster size.

The only use of the flag was to determine whether a swap entry refers to
a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
Instead of relying on the flag, we now pass in nr_pages, which
originates from the folio's number of pages. This allows the logic to
work for folios of any order.

The one snag is that one of the swap_page_trans_huge_swapped() call
sites does not have the folio. But it was only being called there to
avoid bothering to call __try_to_reclaim_swap() in some cases.
__try_to_reclaim_swap() gets the folio and (via some other functions)
calls swap_page_trans_huge_swapped(). So I've removed the problematic
call site and believe the new logic should be equivalent.

Removing CLUSTER_FLAG_HUGE also means we can remove split_swap_cluster()
which used to be called during folio splitting, since
split_swap_cluster()'s only job was to remove the flag.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/swap.h | 10 ----------
 mm/huge_memory.c     |  3 ---
 mm/swapfile.c        | 47 ++++++++------------------------------------
 3 files changed, 8 insertions(+), 52 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 19f30a29e1f1..a073366a227c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -259,7 +259,6 @@ struct swap_cluster_info {
 };
 #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
 #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
-#define CLUSTER_FLAG_HUGE 4 /* This cluster is backing a transparent huge page */
 
 /*
  * We assign a cluster to each CPU, so each CPU can allocate swap entry from
@@ -595,15 +594,6 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
 }
 #endif /* CONFIG_SWAP */
 
-#ifdef CONFIG_THP_SWAP
-extern int split_swap_cluster(swp_entry_t entry);
-#else
-static inline int split_swap_cluster(swp_entry_t entry)
-{
-	return 0;
-}
-#endif
-
 #ifdef CONFIG_MEMCG
 static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f31f02472396..b411dd4f1612 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2598,9 +2598,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		shmem_uncharge(head->mapping->host, nr_dropped);
 	remap_page(folio, nr);
 
-	if (folio_test_swapcache(folio))
-		split_swap_cluster(folio->swap);
-
 	for (i = 0; i < nr; i++) {
 		struct page *subpage = head + i;
 		if (subpage == page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e52f486834eb..b83ad77e04c0 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -342,18 +342,6 @@ static inline void cluster_set_null(struct swap_cluster_info *info)
 	info->data = 0;
 }
 
-static inline bool cluster_is_huge(struct swap_cluster_info *info)
-{
-	if (IS_ENABLED(CONFIG_THP_SWAP))
-		return info->flags & CLUSTER_FLAG_HUGE;
-	return false;
-}
-
-static inline void cluster_clear_huge(struct swap_cluster_info *info)
-{
-	info->flags &= ~CLUSTER_FLAG_HUGE;
-}
-
 static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
 						     unsigned long offset)
 {
@@ -1021,7 +1009,7 @@ static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
 	offset = idx * SWAPFILE_CLUSTER;
 	ci = lock_cluster(si, offset);
 	alloc_cluster(si, idx);
-	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, CLUSTER_FLAG_HUGE);
+	cluster_set_count(ci, SWAPFILE_CLUSTER);
 
 	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
 	unlock_cluster(ci);
@@ -1354,7 +1342,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 
 	ci = lock_cluster_or_swap_info(si, offset);
 	if (size == SWAPFILE_CLUSTER) {
-		VM_BUG_ON(!cluster_is_huge(ci));
 		map = si->swap_map + offset;
 		for (i = 0; i < SWAPFILE_CLUSTER; i++) {
 			val = map[i];
@@ -1362,7 +1349,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 			if (val == SWAP_HAS_CACHE)
 				free_entries++;
 		}
-		cluster_clear_huge(ci);
 		if (free_entries == SWAPFILE_CLUSTER) {
 			unlock_cluster_or_swap_info(si, ci);
 			spin_lock(&si->lock);
@@ -1384,23 +1370,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	unlock_cluster_or_swap_info(si, ci);
 }
 
-#ifdef CONFIG_THP_SWAP
-int split_swap_cluster(swp_entry_t entry)
-{
-	struct swap_info_struct *si;
-	struct swap_cluster_info *ci;
-	unsigned long offset = swp_offset(entry);
-
-	si = _swap_info_get(entry);
-	if (!si)
-		return -EBUSY;
-	ci = lock_cluster(si, offset);
-	cluster_clear_huge(ci);
-	unlock_cluster(ci);
-	return 0;
-}
-#endif
-
 static int swp_entry_cmp(const void *ent1, const void *ent2)
 {
 	const swp_entry_t *e1 = ent1, *e2 = ent2;
@@ -1508,22 +1477,23 @@ int swp_swapcount(swp_entry_t entry)
 }
 
 static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
-					 swp_entry_t entry)
+					 swp_entry_t entry,
+					 unsigned int nr_pages)
 {
 	struct swap_cluster_info *ci;
 	unsigned char *map = si->swap_map;
 	unsigned long roffset = swp_offset(entry);
-	unsigned long offset = round_down(roffset, SWAPFILE_CLUSTER);
+	unsigned long offset = round_down(roffset, nr_pages);
 	int i;
 	bool ret = false;
 
 	ci = lock_cluster_or_swap_info(si, offset);
-	if (!ci || !cluster_is_huge(ci)) {
+	if (!ci || nr_pages == 1) {
 		if (swap_count(map[roffset]))
 			ret = true;
 		goto unlock_out;
 	}
-	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+	for (i = 0; i < nr_pages; i++) {
 		if (swap_count(map[offset + i])) {
 			ret = true;
 			break;
@@ -1545,7 +1515,7 @@ static bool folio_swapped(struct folio *folio)
 	if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio)))
 		return swap_swapcount(si, entry) != 0;
 
-	return swap_page_trans_huge_swapped(si, entry);
+	return swap_page_trans_huge_swapped(si, entry, folio_nr_pages(folio));
 }
 
 /**
@@ -1606,8 +1576,7 @@ int free_swap_and_cache(swp_entry_t entry)
 	p = _swap_info_get(entry);
 	if (p) {
 		count = __swap_entry_free(p, entry);
-		if (count == SWAP_HAS_CACHE &&
-		    !swap_page_trans_huge_swapped(p, entry))
+		if (count == SWAP_HAS_CACHE)
 			__try_to_reclaim_swap(p, swp_offset(entry),
 					      TTRS_UNMAPPED | TTRS_FULL);
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v3 2/4] mm: swap: Remove struct percpu_cluster
  2023-10-25 14:45 [PATCH v3 0/4] Swap-out small-sized THP without splitting Ryan Roberts
  2023-10-25 14:45 ` [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
@ 2023-10-25 14:45 ` Ryan Roberts
  2023-10-25 14:45 ` [PATCH v3 3/4] mm: swap: Simplify ssd behavior when scanner steals entry Ryan Roberts
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 116+ messages in thread
From: Ryan Roberts @ 2023-10-25 14:45 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: Ryan Roberts, linux-kernel, linux-mm

struct percpu_cluster stores the index of cpu's current cluster and the
offset of the next entry that will be allocated for the cpu. These two
pieces of information are redundant because the cluster index is just
(offset / SWAPFILE_CLUSTER). The only reason for explicitly keeping the
cluster index is because the structure used for it also has a flag to
indicate "no cluster". However this data structure also contains a spin
lock, which is never used in this context, as a side effect the code
copies the spinlock_t structure, which is questionable coding practice
in my view.

So let's clean this up and store only the next offset, and use a
sentinal value (SWAP_NEXT_NULL) to indicate "no cluster". SWAP_NEXT_NULL
is chosen to be 0, because 0 will never be seen legitimately; The first
page in the swap file is the swap header, which is always marked bad to
prevent it from being allocated as an entry. This also prevents the
cluster to which it belongs being marked free, so it will never appear
on the free list.

This change saves 16 bytes per cpu. And given we are shortly going to
extend this mechanism to be per-cpu-AND-per-order, we will end up saving
16 * 9 = 144 bytes per cpu, which adds up if you have 256 cpus in the
system.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/swap.h | 21 +++++++++++++--------
 mm/swapfile.c        | 43 +++++++++++++++++++------------------------
 2 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a073366a227c..0ca8aaa098ba 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -261,14 +261,12 @@ struct swap_cluster_info {
 #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
 
 /*
- * We assign a cluster to each CPU, so each CPU can allocate swap entry from
- * its own cluster and swapout sequentially. The purpose is to optimize swapout
- * throughput.
+ * The first page in the swap file is the swap header, which is always marked
+ * bad to prevent it from being allocated as an entry. This also prevents the
+ * cluster to which it belongs being marked free. Therefore 0 is safe to use as
+ * a sentinel to indicate cpu_next is not valid in swap_info_struct.
  */
-struct percpu_cluster {
-	struct swap_cluster_info index; /* Current cluster index */
-	unsigned int next; /* Likely next allocation offset */
-};
+#define SWAP_NEXT_NULL	0
 
 struct swap_cluster_list {
 	struct swap_cluster_info head;
@@ -295,7 +293,14 @@ struct swap_info_struct {
 	unsigned int cluster_next;	/* likely index for next allocation */
 	unsigned int cluster_nr;	/* countdown to next cluster search */
 	unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
-	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
+	unsigned int __percpu *cpu_next;/*
+					 * Likely next allocation offset. We
+					 * assign a cluster to each CPU, so each
+					 * CPU can allocate swap entry from its
+					 * own cluster and swapout sequentially.
+					 * The purpose is to optimize swapout
+					 * throughput.
+					 */
 	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
 	struct block_device *bdev;	/* swap device or bdev of swap file */
 	struct file *swap_file;		/* seldom referenced */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b83ad77e04c0..617e34b8cdbe 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -591,7 +591,6 @@ static bool
 scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
 	unsigned long offset)
 {
-	struct percpu_cluster *percpu_cluster;
 	bool conflict;
 
 	offset /= SWAPFILE_CLUSTER;
@@ -602,8 +601,7 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
 	if (!conflict)
 		return false;
 
-	percpu_cluster = this_cpu_ptr(si->percpu_cluster);
-	cluster_set_null(&percpu_cluster->index);
+	*this_cpu_ptr(si->cpu_next) = SWAP_NEXT_NULL;
 	return true;
 }
 
@@ -614,16 +612,16 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
 static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	unsigned long *offset, unsigned long *scan_base)
 {
-	struct percpu_cluster *cluster;
 	struct swap_cluster_info *ci;
-	unsigned long tmp, max;
+	unsigned int tmp, max;
+	unsigned int *cpu_next;
 
 new_cluster:
-	cluster = this_cpu_ptr(si->percpu_cluster);
-	if (cluster_is_null(&cluster->index)) {
+	cpu_next = this_cpu_ptr(si->cpu_next);
+	tmp = *cpu_next;
+	if (tmp == SWAP_NEXT_NULL) {
 		if (!cluster_list_empty(&si->free_clusters)) {
-			cluster->index = si->free_clusters.head;
-			cluster->next = cluster_next(&cluster->index) *
+			tmp = cluster_next(&si->free_clusters.head) *
 					SWAPFILE_CLUSTER;
 		} else if (!cluster_list_empty(&si->discard_clusters)) {
 			/*
@@ -643,9 +641,8 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	 * Other CPUs can use our cluster if they can't find a free cluster,
 	 * check if there is still free entry in the cluster
 	 */
-	tmp = cluster->next;
 	max = min_t(unsigned long, si->max,
-		    (cluster_next(&cluster->index) + 1) * SWAPFILE_CLUSTER);
+		    ALIGN_DOWN(tmp, SWAPFILE_CLUSTER) + SWAPFILE_CLUSTER);
 	if (tmp < max) {
 		ci = lock_cluster(si, tmp);
 		while (tmp < max) {
@@ -656,12 +653,13 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 		unlock_cluster(ci);
 	}
 	if (tmp >= max) {
-		cluster_set_null(&cluster->index);
+		*cpu_next = SWAP_NEXT_NULL;
 		goto new_cluster;
 	}
-	cluster->next = tmp + 1;
 	*offset = tmp;
 	*scan_base = tmp;
+	tmp += 1;
+	*cpu_next = tmp < max ? tmp : SWAP_NEXT_NULL;
 	return true;
 }
 
@@ -2488,8 +2486,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	arch_swap_invalidate_area(p->type);
 	zswap_swapoff(p->type);
 	mutex_unlock(&swapon_mutex);
-	free_percpu(p->percpu_cluster);
-	p->percpu_cluster = NULL;
+	free_percpu(p->cpu_next);
+	p->cpu_next = NULL;
 	free_percpu(p->cluster_next_cpu);
 	p->cluster_next_cpu = NULL;
 	vfree(swap_map);
@@ -3073,16 +3071,13 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		for (ci = 0; ci < nr_cluster; ci++)
 			spin_lock_init(&((cluster_info + ci)->lock));
 
-		p->percpu_cluster = alloc_percpu(struct percpu_cluster);
-		if (!p->percpu_cluster) {
+		p->cpu_next = alloc_percpu(unsigned int);
+		if (!p->cpu_next) {
 			error = -ENOMEM;
 			goto bad_swap_unlock_inode;
 		}
-		for_each_possible_cpu(cpu) {
-			struct percpu_cluster *cluster;
-			cluster = per_cpu_ptr(p->percpu_cluster, cpu);
-			cluster_set_null(&cluster->index);
-		}
+		for_each_possible_cpu(cpu)
+			per_cpu(*p->cpu_next, cpu) = SWAP_NEXT_NULL;
 	} else {
 		atomic_inc(&nr_rotate_swap);
 		inced_nr_rotate_swap = true;
@@ -3171,8 +3166,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 bad_swap_unlock_inode:
 	inode_unlock(inode);
 bad_swap:
-	free_percpu(p->percpu_cluster);
-	p->percpu_cluster = NULL;
+	free_percpu(p->cpu_next);
+	p->cpu_next = NULL;
 	free_percpu(p->cluster_next_cpu);
 	p->cluster_next_cpu = NULL;
 	if (inode && S_ISBLK(inode->i_mode) && p->bdev) {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v3 3/4] mm: swap: Simplify ssd behavior when scanner steals entry
  2023-10-25 14:45 [PATCH v3 0/4] Swap-out small-sized THP without splitting Ryan Roberts
  2023-10-25 14:45 ` [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
  2023-10-25 14:45 ` [PATCH v3 2/4] mm: swap: Remove struct percpu_cluster Ryan Roberts
@ 2023-10-25 14:45 ` Ryan Roberts
  2023-10-25 14:45 ` [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting Ryan Roberts
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 116+ messages in thread
From: Ryan Roberts @ 2023-10-25 14:45 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: Ryan Roberts, linux-kernel, linux-mm

When a CPU fails to reserve a cluster (due to free list exhaustion), we
revert to the scanner to find a free entry somewhere in the swap file.
This might cause an entry to be stolen from another CPU's reserved
cluster. Upon noticing this, the CPU with the stolen entry would
previously scan forward to the end of the cluster trying to find a free
entry to use. If there were none, it would try to reserve a new pre-cpu
cluster and allocate from that.

This scanning behavior does not scale well to high-order allocations,
which will be introduced in a future patch since would need to scan for
a contiguous area that was naturally aligned. Given stealing is a rare
occurrence, let's remove the scanning behavior from the ssd allocator
and simply drop the cluster and try to allocate a new one. Given the
purpose of the per-cpu cluster is to ensure a given task's pages are
sequential on disk to aid readahead, allocating a new cluster at this
point makes most sense.

Furthermore, si->max will always be greater than or equal to the end of
the last cluster because any partial cluster will never be put on the
free cluster list. Therefore we can simplify this logic too.

These changes make it simpler to generalize
scan_swap_map_try_ssd_cluster() to handle any allocation order.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/swapfile.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 617e34b8cdbe..94f7cc225eb9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -639,27 +639,24 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 
 	/*
 	 * Other CPUs can use our cluster if they can't find a free cluster,
-	 * check if there is still free entry in the cluster
+	 * check if the expected entry is still free. If not, drop it and
+	 * reserve a new cluster.
 	 */
-	max = min_t(unsigned long, si->max,
-		    ALIGN_DOWN(tmp, SWAPFILE_CLUSTER) + SWAPFILE_CLUSTER);
-	if (tmp < max) {
-		ci = lock_cluster(si, tmp);
-		while (tmp < max) {
-			if (!si->swap_map[tmp])
-				break;
-			tmp++;
-		}
+	ci = lock_cluster(si, tmp);
+	if (si->swap_map[tmp]) {
 		unlock_cluster(ci);
-	}
-	if (tmp >= max) {
 		*cpu_next = SWAP_NEXT_NULL;
 		goto new_cluster;
 	}
+	unlock_cluster(ci);
+
 	*offset = tmp;
 	*scan_base = tmp;
+
+	max = ALIGN_DOWN(tmp, SWAPFILE_CLUSTER) + SWAPFILE_CLUSTER;
 	tmp += 1;
 	*cpu_next = tmp < max ? tmp : SWAP_NEXT_NULL;
+
 	return true;
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2023-10-25 14:45 [PATCH v3 0/4] Swap-out small-sized THP without splitting Ryan Roberts
                   ` (2 preceding siblings ...)
  2023-10-25 14:45 ` [PATCH v3 3/4] mm: swap: Simplify ssd behavior when scanner steals entry Ryan Roberts
@ 2023-10-25 14:45 ` Ryan Roberts
  2023-10-30  8:18   ` Huang, Ying
                     ` (3 more replies)
  2023-11-29  7:47 ` [PATCH v3 0/4] " Barry Song
  2024-01-18 11:10 ` [PATCH RFC 0/6] mm: support large folios swap-in Barry Song
  5 siblings, 4 replies; 116+ messages in thread
From: Ryan Roberts @ 2023-10-25 14:45 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: Ryan Roberts, linux-kernel, linux-mm

The upcoming anonymous small-sized THP feature enables performance
improvements by allocating large folios for anonymous memory. However
I've observed that on an arm64 system running a parallel workload (e.g.
kernel compilation) across many cores, under high memory pressure, the
speed regresses. This is due to bottlenecking on the increased number of
TLBIs added due to all the extra folio splitting.

Therefore, solve this regression by adding support for swapping out
small-sized THP without needing to split the folio, just like is already
done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
enabled, and when the swap backing store is a non-rotating block device.
These are the same constraints as for the existing PMD-sized THP
swap-out support.

Note that no attempt is made to swap-in THP here - this is still done
page-by-page, like for PMD-sized THP.

The main change here is to improve the swap entry allocator so that it
can allocate any power-of-2 number of contiguous entries between [1, (1
<< PMD_ORDER)]. This is done by allocating a cluster for each distinct
order and allocating sequentially from it until the cluster is full.
This ensures that we don't need to search the map and we get no
fragmentation due to alignment padding for different orders in the
cluster. If there is no current cluster for a given order, we attempt to
allocate a free cluster from the list. If there are no free clusters, we
fail the allocation and the caller falls back to splitting the folio and
allocates individual entries (as per existing PMD-sized THP fallback).

The per-order current clusters are maintained per-cpu using the existing
infrastructure. This is done to avoid interleving pages from different
tasks, which would prevent IO being batched. This is already done for
the order-0 allocations so we follow the same pattern.
__scan_swap_map_try_ssd_cluster() is introduced to deal with arbitrary
orders and scan_swap_map_try_ssd_cluster() is refactored as a wrapper
for order-0.

As is done for order-0 per-cpu clusters, the scanner now can steal
order-0 entries from any per-cpu-per-order reserved cluster. This
ensures that when the swap file is getting full, space doesn't get tied
up in the per-cpu reserves.

I've run the tests on Ampere Altra (arm64), set up with a 35G block ram
device as the swap device and from inside a memcg limited to 40G memory.
I've then run `usemem` from vm-scalability with 70 processes (each has
its own core), each allocating and writing 1G of memory. I've repeated
everything 5 times and taken the mean:

Mean Performance Improvement vs 4K/baseline

| alloc size |            baseline |       + this series |
|            |  v6.6-rc4+anonfolio |                     |
|:-----------|--------------------:|--------------------:|
| 4K Page    |                0.0% |                4.9% |
| 64K THP    |              -44.1% |               10.7% |
| 2M THP     |               56.0% |               65.9% |

So with this change, the regression for 64K swap performance goes away
and 4K and 2M swap improves slightly too.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/swap.h |  10 +--
 mm/swapfile.c        | 149 +++++++++++++++++++++++++++++++------------
 mm/vmscan.c          |  10 +--
 3 files changed, 119 insertions(+), 50 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0ca8aaa098ba..ccbca5db851b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -295,11 +295,11 @@ struct swap_info_struct {
 	unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
 	unsigned int __percpu *cpu_next;/*
 					 * Likely next allocation offset. We
-					 * assign a cluster to each CPU, so each
-					 * CPU can allocate swap entry from its
-					 * own cluster and swapout sequentially.
-					 * The purpose is to optimize swapout
-					 * throughput.
+					 * assign a cluster per-order to each
+					 * CPU, so each CPU can allocate swap
+					 * entry from its own cluster and
+					 * swapout sequentially. The purpose is
+					 * to optimize swapout throughput.
 					 */
 	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
 	struct block_device *bdev;	/* swap device or bdev of swap file */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 94f7cc225eb9..b50bce50bed9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -545,10 +545,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
 
 /*
  * The cluster corresponding to page_nr will be used. The cluster will be
- * removed from free cluster list and its usage counter will be increased.
+ * removed from free cluster list and its usage counter will be increased by
+ * count.
  */
-static void inc_cluster_info_page(struct swap_info_struct *p,
-	struct swap_cluster_info *cluster_info, unsigned long page_nr)
+static void add_cluster_info_page(struct swap_info_struct *p,
+	struct swap_cluster_info *cluster_info, unsigned long page_nr,
+	unsigned long count)
 {
 	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
 
@@ -557,9 +559,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
 	if (cluster_is_free(&cluster_info[idx]))
 		alloc_cluster(p, idx);
 
-	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
+	VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
 	cluster_set_count(&cluster_info[idx],
-		cluster_count(&cluster_info[idx]) + 1);
+		cluster_count(&cluster_info[idx]) + count);
+}
+
+/*
+ * The cluster corresponding to page_nr will be used. The cluster will be
+ * removed from free cluster list and its usage counter will be increased.
+ */
+static void inc_cluster_info_page(struct swap_info_struct *p,
+	struct swap_cluster_info *cluster_info, unsigned long page_nr)
+{
+	add_cluster_info_page(p, cluster_info, page_nr, 1);
 }
 
 /*
@@ -588,8 +600,8 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
  * cluster list. Avoiding such abuse to avoid list corruption.
  */
 static bool
-scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
-	unsigned long offset)
+__scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
+	unsigned long offset, int order)
 {
 	bool conflict;
 
@@ -601,23 +613,36 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
 	if (!conflict)
 		return false;
 
-	*this_cpu_ptr(si->cpu_next) = SWAP_NEXT_NULL;
+	this_cpu_ptr(si->cpu_next)[order] = SWAP_NEXT_NULL;
 	return true;
 }
 
 /*
- * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
- * might involve allocating a new cluster for current CPU too.
+ * It's possible scan_swap_map_slots() uses a free cluster in the middle of free
+ * cluster list. Avoiding such abuse to avoid list corruption.
  */
-static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
-	unsigned long *offset, unsigned long *scan_base)
+static bool
+scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
+	unsigned long offset)
+{
+	return __scan_swap_map_ssd_cluster_conflict(si, offset, 0);
+}
+
+/*
+ * Try to get a swap entry (or size indicated by order) from current cpu's swap
+ * entry pool (a cluster). This might involve allocating a new cluster for
+ * current CPU too.
+ */
+static bool __scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
+	unsigned long *offset, unsigned long *scan_base, int order)
 {
 	struct swap_cluster_info *ci;
-	unsigned int tmp, max;
+	unsigned int tmp, max, i;
 	unsigned int *cpu_next;
+	unsigned int nr_pages = 1 << order;
 
 new_cluster:
-	cpu_next = this_cpu_ptr(si->cpu_next);
+	cpu_next = &this_cpu_ptr(si->cpu_next)[order];
 	tmp = *cpu_next;
 	if (tmp == SWAP_NEXT_NULL) {
 		if (!cluster_list_empty(&si->free_clusters)) {
@@ -643,10 +668,12 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	 * reserve a new cluster.
 	 */
 	ci = lock_cluster(si, tmp);
-	if (si->swap_map[tmp]) {
-		unlock_cluster(ci);
-		*cpu_next = SWAP_NEXT_NULL;
-		goto new_cluster;
+	for (i = 0; i < nr_pages; i++) {
+		if (si->swap_map[tmp + i]) {
+			unlock_cluster(ci);
+			*cpu_next = SWAP_NEXT_NULL;
+			goto new_cluster;
+		}
 	}
 	unlock_cluster(ci);
 
@@ -654,12 +681,22 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	*scan_base = tmp;
 
 	max = ALIGN_DOWN(tmp, SWAPFILE_CLUSTER) + SWAPFILE_CLUSTER;
-	tmp += 1;
+	tmp += nr_pages;
 	*cpu_next = tmp < max ? tmp : SWAP_NEXT_NULL;
 
 	return true;
 }
 
+/*
+ * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
+ * might involve allocating a new cluster for current CPU too.
+ */
+static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
+	unsigned long *offset, unsigned long *scan_base)
+{
+	return __scan_swap_map_try_ssd_cluster(si, offset, scan_base, 0);
+}
+
 static void __del_from_avail_list(struct swap_info_struct *p)
 {
 	int nid;
@@ -982,35 +1019,58 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	return n_ret;
 }
 
-static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
+static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
+			    unsigned int nr_pages)
 {
-	unsigned long idx;
 	struct swap_cluster_info *ci;
-	unsigned long offset;
+	unsigned long offset, scan_base;
+	int order = ilog2(nr_pages);
+	bool ret;
 
 	/*
-	 * Should not even be attempting cluster allocations when huge
+	 * Should not even be attempting large allocations when huge
 	 * page swap is disabled.  Warn and fail the allocation.
 	 */
-	if (!IS_ENABLED(CONFIG_THP_SWAP)) {
+	if (!IS_ENABLED(CONFIG_THP_SWAP) ||
+	    nr_pages < 2 || nr_pages > SWAPFILE_CLUSTER ||
+	    !is_power_of_2(nr_pages)) {
 		VM_WARN_ON_ONCE(1);
 		return 0;
 	}
 
-	if (cluster_list_empty(&si->free_clusters))
+	/*
+	 * Swapfile is not block device or not using clusters so unable to
+	 * allocate large entries.
+	 */
+	if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
 		return 0;
 
-	idx = cluster_list_first(&si->free_clusters);
-	offset = idx * SWAPFILE_CLUSTER;
-	ci = lock_cluster(si, offset);
-	alloc_cluster(si, idx);
-	cluster_set_count(ci, SWAPFILE_CLUSTER);
+again:
+	/*
+	 * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
+	 * so indicate that we are scanning to synchronise with swapoff.
+	 */
+	si->flags += SWP_SCANNING;
+	ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
+	si->flags -= SWP_SCANNING;
+
+	/*
+	 * If we failed to allocate or if swapoff is waiting for us (due to lock
+	 * being dropped for discard above), return immediately.
+	 */
+	if (!ret || !(si->flags & SWP_WRITEOK))
+		return 0;
 
-	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
+	if (__scan_swap_map_ssd_cluster_conflict(si, offset, order))
+		goto again;
+
+	ci = lock_cluster(si, offset);
+	memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
+	add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
 	unlock_cluster(ci);
-	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
-	*slot = swp_entry(si->type, offset);
 
+	swap_range_alloc(si, offset, nr_pages);
+	*slot = swp_entry(si->type, offset);
 	return 1;
 }
 
@@ -1036,7 +1096,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 	int node;
 
 	/* Only single cluster request supported */
-	WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
+	WARN_ON_ONCE(n_goal > 1 && size > 1);
 
 	spin_lock(&swap_avail_lock);
 
@@ -1073,14 +1133,13 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 			spin_unlock(&si->lock);
 			goto nextsi;
 		}
-		if (size == SWAPFILE_CLUSTER) {
-			if (si->flags & SWP_BLKDEV)
-				n_ret = swap_alloc_cluster(si, swp_entries);
+		if (size > 1) {
+			n_ret = swap_alloc_large(si, swp_entries, size);
 		} else
 			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
 						    n_goal, swp_entries);
 		spin_unlock(&si->lock);
-		if (n_ret || size == SWAPFILE_CLUSTER)
+		if (n_ret || size > 1)
 			goto check_out;
 		cond_resched();
 
@@ -3041,6 +3100,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	if (p->bdev && bdev_nonrot(p->bdev)) {
 		int cpu;
 		unsigned long ci, nr_cluster;
+		int nr_order;
+		int i;
 
 		p->flags |= SWP_SOLIDSTATE;
 		p->cluster_next_cpu = alloc_percpu(unsigned int);
@@ -3068,13 +3129,19 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		for (ci = 0; ci < nr_cluster; ci++)
 			spin_lock_init(&((cluster_info + ci)->lock));
 
-		p->cpu_next = alloc_percpu(unsigned int);
+		nr_order = IS_ENABLED(CONFIG_THP_SWAP) ? PMD_ORDER + 1 : 1;
+		p->cpu_next = __alloc_percpu(sizeof(unsigned int) * nr_order,
+					     __alignof__(unsigned int));
 		if (!p->cpu_next) {
 			error = -ENOMEM;
 			goto bad_swap_unlock_inode;
 		}
-		for_each_possible_cpu(cpu)
-			per_cpu(*p->cpu_next, cpu) = SWAP_NEXT_NULL;
+		for_each_possible_cpu(cpu) {
+			unsigned int *cpu_next = per_cpu_ptr(p->cpu_next, cpu);
+
+			for (i = 0; i < nr_order; i++)
+				cpu_next[i] = SWAP_NEXT_NULL;
+		}
 	} else {
 		atomic_inc(&nr_rotate_swap);
 		inced_nr_rotate_swap = true;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2cc0cb41fb32..ea19710aa4cd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 					if (!can_split_folio(folio, NULL))
 						goto activate_locked;
 					/*
-					 * Split folios without a PMD map right
-					 * away. Chances are some or all of the
-					 * tail pages can be freed without IO.
+					 * Split PMD-mappable folios without a
+					 * PMD map right away. Chances are some
+					 * or all of the tail pages can be freed
+					 * without IO.
 					 */
-					if (!folio_entire_mapcount(folio) &&
+					if (folio_test_pmd_mappable(folio) &&
+					    !folio_entire_mapcount(folio) &&
 					    split_folio_to_list(folio,
 								folio_list))
 						goto activate_locked;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2023-10-25 14:45 ` [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting Ryan Roberts
@ 2023-10-30  8:18   ` Huang, Ying
  2023-10-30 13:59     ` Ryan Roberts
  2023-11-02  7:40   ` Barry Song
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 116+ messages in thread
From: Huang, Ying @ 2023-10-30  8:18 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, linux-kernel,
	linux-mm

Hi, Ryan,

Ryan Roberts <ryan.roberts@arm.com> writes:

> The upcoming anonymous small-sized THP feature enables performance
> improvements by allocating large folios for anonymous memory. However
> I've observed that on an arm64 system running a parallel workload (e.g.
> kernel compilation) across many cores, under high memory pressure, the
> speed regresses. This is due to bottlenecking on the increased number of
> TLBIs added due to all the extra folio splitting.
>
> Therefore, solve this regression by adding support for swapping out
> small-sized THP without needing to split the folio, just like is already
> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
> enabled, and when the swap backing store is a non-rotating block device.
> These are the same constraints as for the existing PMD-sized THP
> swap-out support.
>
> Note that no attempt is made to swap-in THP here - this is still done
> page-by-page, like for PMD-sized THP.
>
> The main change here is to improve the swap entry allocator so that it
> can allocate any power-of-2 number of contiguous entries between [1, (1
> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
> order and allocating sequentially from it until the cluster is full.
> This ensures that we don't need to search the map and we get no
> fragmentation due to alignment padding for different orders in the
> cluster. If there is no current cluster for a given order, we attempt to
> allocate a free cluster from the list. If there are no free clusters, we
> fail the allocation and the caller falls back to splitting the folio and
> allocates individual entries (as per existing PMD-sized THP fallback).
>
> The per-order current clusters are maintained per-cpu using the existing
> infrastructure. This is done to avoid interleving pages from different
> tasks, which would prevent IO being batched. This is already done for
> the order-0 allocations so we follow the same pattern.
> __scan_swap_map_try_ssd_cluster() is introduced to deal with arbitrary
> orders and scan_swap_map_try_ssd_cluster() is refactored as a wrapper
> for order-0.
>
> As is done for order-0 per-cpu clusters, the scanner now can steal
> order-0 entries from any per-cpu-per-order reserved cluster. This
> ensures that when the swap file is getting full, space doesn't get tied
> up in the per-cpu reserves.
>
> I've run the tests on Ampere Altra (arm64), set up with a 35G block ram
> device as the swap device and from inside a memcg limited to 40G memory.
> I've then run `usemem` from vm-scalability with 70 processes (each has
> its own core), each allocating and writing 1G of memory. I've repeated
> everything 5 times and taken the mean:
>
> Mean Performance Improvement vs 4K/baseline
>
> | alloc size |            baseline |       + this series |
> |            |  v6.6-rc4+anonfolio |                     |
> |:-----------|--------------------:|--------------------:|
> | 4K Page    |                0.0% |                4.9% |
> | 64K THP    |              -44.1% |               10.7% |
> | 2M THP     |               56.0% |               65.9% |
>
> So with this change, the regression for 64K swap performance goes away
> and 4K and 2M swap improves slightly too.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  include/linux/swap.h |  10 +--
>  mm/swapfile.c        | 149 +++++++++++++++++++++++++++++++------------
>  mm/vmscan.c          |  10 +--
>  3 files changed, 119 insertions(+), 50 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 0ca8aaa098ba..ccbca5db851b 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -295,11 +295,11 @@ struct swap_info_struct {
>  	unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
>  	unsigned int __percpu *cpu_next;/*
>  					 * Likely next allocation offset. We
> -					 * assign a cluster to each CPU, so each
> -					 * CPU can allocate swap entry from its
> -					 * own cluster and swapout sequentially.
> -					 * The purpose is to optimize swapout
> -					 * throughput.
> +					 * assign a cluster per-order to each
> +					 * CPU, so each CPU can allocate swap
> +					 * entry from its own cluster and
> +					 * swapout sequentially. The purpose is
> +					 * to optimize swapout throughput.
>  					 */

This is kind of hard to understand.  Better to define some intermediate
data structure to improve readability.  For example,

#ifdef CONFIG_THP_SWAP
#define NR_SWAP_ORDER   PMD_ORDER
#else
#define NR_SWAP_ORDER   1
#endif

struct percpu_clusters {
        unsigned int alloc_next[NR_SWAP_ORDER];
};

PMD_ORDER isn't a constant on powerpc, but THP_SWAP isn't supported on
powerpc too.

>  	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
>  	struct block_device *bdev;	/* swap device or bdev of swap file */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 94f7cc225eb9..b50bce50bed9 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -545,10 +545,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>  
>  /*
>   * The cluster corresponding to page_nr will be used. The cluster will be
> - * removed from free cluster list and its usage counter will be increased.
> + * removed from free cluster list and its usage counter will be increased by
> + * count.
>   */
> -static void inc_cluster_info_page(struct swap_info_struct *p,
> -	struct swap_cluster_info *cluster_info, unsigned long page_nr)
> +static void add_cluster_info_page(struct swap_info_struct *p,
> +	struct swap_cluster_info *cluster_info, unsigned long page_nr,
> +	unsigned long count)
>  {
>  	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>  
> @@ -557,9 +559,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
>  	if (cluster_is_free(&cluster_info[idx]))
>  		alloc_cluster(p, idx);
>  
> -	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
> +	VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
>  	cluster_set_count(&cluster_info[idx],
> -		cluster_count(&cluster_info[idx]) + 1);
> +		cluster_count(&cluster_info[idx]) + count);
> +}
> +
> +/*
> + * The cluster corresponding to page_nr will be used. The cluster will be
> + * removed from free cluster list and its usage counter will be increased.
> + */
> +static void inc_cluster_info_page(struct swap_info_struct *p,
> +	struct swap_cluster_info *cluster_info, unsigned long page_nr)
> +{
> +	add_cluster_info_page(p, cluster_info, page_nr, 1);
>  }
>  
>  /*
> @@ -588,8 +600,8 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
>   * cluster list. Avoiding such abuse to avoid list corruption.
>   */
>  static bool
> -scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> -	unsigned long offset)
> +__scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> +	unsigned long offset, int order)
>  {
>  	bool conflict;
>  
> @@ -601,23 +613,36 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>  	if (!conflict)
>  		return false;
>  
> -	*this_cpu_ptr(si->cpu_next) = SWAP_NEXT_NULL;
> +	this_cpu_ptr(si->cpu_next)[order] = SWAP_NEXT_NULL;

This is added in the previous patch.  I don't think SWAP_NEXT_NULL is a
good name.  Because NEXT isn't a pointer (while cluster_next is). Better
to name it as SWAP_NEXT_INVALID, etc.

>  	return true;
>  }
>  
>  /*
> - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
> - * might involve allocating a new cluster for current CPU too.
> + * It's possible scan_swap_map_slots() uses a free cluster in the middle of free
> + * cluster list. Avoiding such abuse to avoid list corruption.
>   */
> -static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> -	unsigned long *offset, unsigned long *scan_base)
> +static bool
> +scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> +	unsigned long offset)
> +{
> +	return __scan_swap_map_ssd_cluster_conflict(si, offset, 0);
> +}
> +
> +/*
> + * Try to get a swap entry (or size indicated by order) from current cpu's swap
> + * entry pool (a cluster). This might involve allocating a new cluster for
> + * current CPU too.
> + */
> +static bool __scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> +	unsigned long *offset, unsigned long *scan_base, int order)
>  {
>  	struct swap_cluster_info *ci;
> -	unsigned int tmp, max;
> +	unsigned int tmp, max, i;
>  	unsigned int *cpu_next;
> +	unsigned int nr_pages = 1 << order;
>  
>  new_cluster:
> -	cpu_next = this_cpu_ptr(si->cpu_next);
> +	cpu_next = &this_cpu_ptr(si->cpu_next)[order];
>  	tmp = *cpu_next;
>  	if (tmp == SWAP_NEXT_NULL) {
>  		if (!cluster_list_empty(&si->free_clusters)) {
> @@ -643,10 +668,12 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>  	 * reserve a new cluster.
>  	 */
>  	ci = lock_cluster(si, tmp);
> -	if (si->swap_map[tmp]) {
> -		unlock_cluster(ci);
> -		*cpu_next = SWAP_NEXT_NULL;
> -		goto new_cluster;
> +	for (i = 0; i < nr_pages; i++) {
> +		if (si->swap_map[tmp + i]) {
> +			unlock_cluster(ci);
> +			*cpu_next = SWAP_NEXT_NULL;
> +			goto new_cluster;
> +		}
>  	}
>  	unlock_cluster(ci);
>  
> @@ -654,12 +681,22 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>  	*scan_base = tmp;
>  
>  	max = ALIGN_DOWN(tmp, SWAPFILE_CLUSTER) + SWAPFILE_CLUSTER;

This line is added in a previous patch.  Can we just use

        max = ALIGN(tmp + 1, SWAPFILE_CLUSTER);

Or, add ALIGN_UP() for this?

> -	tmp += 1;
> +	tmp += nr_pages;
>  	*cpu_next = tmp < max ? tmp : SWAP_NEXT_NULL;
>  
>  	return true;
>  }
>  
> +/*
> + * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
> + * might involve allocating a new cluster for current CPU too.
> + */
> +static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> +	unsigned long *offset, unsigned long *scan_base)
> +{
> +	return __scan_swap_map_try_ssd_cluster(si, offset, scan_base, 0);
> +}
> +
>  static void __del_from_avail_list(struct swap_info_struct *p)
>  {
>  	int nid;
> @@ -982,35 +1019,58 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  	return n_ret;
>  }
>  
> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
> +			    unsigned int nr_pages)

IMHO, it's better to make scan_swap_map_slots() to support order > 0
instead of making swap_alloc_cluster() to support order != PMD_ORDER.
And, we may merge swap_alloc_cluster() with scan_swap_map_slots() after
that.

>  {
> -	unsigned long idx;
>  	struct swap_cluster_info *ci;
> -	unsigned long offset;
> +	unsigned long offset, scan_base;
> +	int order = ilog2(nr_pages);
> +	bool ret;
>  
>  	/*
> -	 * Should not even be attempting cluster allocations when huge
> +	 * Should not even be attempting large allocations when huge
>  	 * page swap is disabled.  Warn and fail the allocation.
>  	 */
> -	if (!IS_ENABLED(CONFIG_THP_SWAP)) {
> +	if (!IS_ENABLED(CONFIG_THP_SWAP) ||
> +	    nr_pages < 2 || nr_pages > SWAPFILE_CLUSTER ||
> +	    !is_power_of_2(nr_pages)) {
>  		VM_WARN_ON_ONCE(1);
>  		return 0;
>  	}
>  
> -	if (cluster_list_empty(&si->free_clusters))
> +	/*
> +	 * Swapfile is not block device or not using clusters so unable to
> +	 * allocate large entries.
> +	 */
> +	if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
>  		return 0;
>  
> -	idx = cluster_list_first(&si->free_clusters);
> -	offset = idx * SWAPFILE_CLUSTER;
> -	ci = lock_cluster(si, offset);
> -	alloc_cluster(si, idx);
> -	cluster_set_count(ci, SWAPFILE_CLUSTER);
> +again:
> +	/*
> +	 * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
> +	 * so indicate that we are scanning to synchronise with swapoff.
> +	 */
> +	si->flags += SWP_SCANNING;
> +	ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
> +	si->flags -= SWP_SCANNING;
> +
> +	/*
> +	 * If we failed to allocate or if swapoff is waiting for us (due to lock
> +	 * being dropped for discard above), return immediately.
> +	 */
> +	if (!ret || !(si->flags & SWP_WRITEOK))
> +		return 0;
>  
> -	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
> +	if (__scan_swap_map_ssd_cluster_conflict(si, offset, order))
> +		goto again;
> +
> +	ci = lock_cluster(si, offset);
> +	memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
> +	add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>  	unlock_cluster(ci);
> -	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
> -	*slot = swp_entry(si->type, offset);
>  
> +	swap_range_alloc(si, offset, nr_pages);
> +	*slot = swp_entry(si->type, offset);
>  	return 1;
>  }
>  
> @@ -1036,7 +1096,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>  	int node;
>  
>  	/* Only single cluster request supported */
> -	WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
> +	WARN_ON_ONCE(n_goal > 1 && size > 1);
>  
>  	spin_lock(&swap_avail_lock);
>  
> @@ -1073,14 +1133,13 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>  			spin_unlock(&si->lock);
>  			goto nextsi;
>  		}
> -		if (size == SWAPFILE_CLUSTER) {
> -			if (si->flags & SWP_BLKDEV)
> -				n_ret = swap_alloc_cluster(si, swp_entries);
> +		if (size > 1) {
> +			n_ret = swap_alloc_large(si, swp_entries, size);
>  		} else
>  			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
>  						    n_goal, swp_entries);
>  		spin_unlock(&si->lock);
> -		if (n_ret || size == SWAPFILE_CLUSTER)
> +		if (n_ret || size > 1)
>  			goto check_out;
>  		cond_resched();
>  
> @@ -3041,6 +3100,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>  	if (p->bdev && bdev_nonrot(p->bdev)) {
>  		int cpu;
>  		unsigned long ci, nr_cluster;
> +		int nr_order;
> +		int i;
>  
>  		p->flags |= SWP_SOLIDSTATE;
>  		p->cluster_next_cpu = alloc_percpu(unsigned int);
> @@ -3068,13 +3129,19 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>  		for (ci = 0; ci < nr_cluster; ci++)
>  			spin_lock_init(&((cluster_info + ci)->lock));
>  
> -		p->cpu_next = alloc_percpu(unsigned int);
> +		nr_order = IS_ENABLED(CONFIG_THP_SWAP) ? PMD_ORDER + 1 : 1;
> +		p->cpu_next = __alloc_percpu(sizeof(unsigned int) * nr_order,
> +					     __alignof__(unsigned int));
>  		if (!p->cpu_next) {
>  			error = -ENOMEM;
>  			goto bad_swap_unlock_inode;
>  		}
> -		for_each_possible_cpu(cpu)
> -			per_cpu(*p->cpu_next, cpu) = SWAP_NEXT_NULL;
> +		for_each_possible_cpu(cpu) {
> +			unsigned int *cpu_next = per_cpu_ptr(p->cpu_next, cpu);
> +
> +			for (i = 0; i < nr_order; i++)
> +				cpu_next[i] = SWAP_NEXT_NULL;
> +		}
>  	} else {
>  		atomic_inc(&nr_rotate_swap);
>  		inced_nr_rotate_swap = true;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2cc0cb41fb32..ea19710aa4cd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  					if (!can_split_folio(folio, NULL))
>  						goto activate_locked;
>  					/*
> -					 * Split folios without a PMD map right
> -					 * away. Chances are some or all of the
> -					 * tail pages can be freed without IO.
> +					 * Split PMD-mappable folios without a
> +					 * PMD map right away. Chances are some
> +					 * or all of the tail pages can be freed
> +					 * without IO.
>  					 */
> -					if (!folio_entire_mapcount(folio) &&
> +					if (folio_test_pmd_mappable(folio) &&
> +					    !folio_entire_mapcount(folio) &&
>  					    split_folio_to_list(folio,
>  								folio_list))
>  						goto activate_locked;

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2023-10-30  8:18   ` Huang, Ying
@ 2023-10-30 13:59     ` Ryan Roberts
  2023-10-31  8:12       ` Huang, Ying
  0 siblings, 1 reply; 116+ messages in thread
From: Ryan Roberts @ 2023-10-30 13:59 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, linux-kernel,
	linux-mm

On 30/10/2023 08:18, Huang, Ying wrote:
> Hi, Ryan,
> 
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> The upcoming anonymous small-sized THP feature enables performance
>> improvements by allocating large folios for anonymous memory. However
>> I've observed that on an arm64 system running a parallel workload (e.g.
>> kernel compilation) across many cores, under high memory pressure, the
>> speed regresses. This is due to bottlenecking on the increased number of
>> TLBIs added due to all the extra folio splitting.
>>
>> Therefore, solve this regression by adding support for swapping out
>> small-sized THP without needing to split the folio, just like is already
>> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
>> enabled, and when the swap backing store is a non-rotating block device.
>> These are the same constraints as for the existing PMD-sized THP
>> swap-out support.
>>
>> Note that no attempt is made to swap-in THP here - this is still done
>> page-by-page, like for PMD-sized THP.
>>
>> The main change here is to improve the swap entry allocator so that it
>> can allocate any power-of-2 number of contiguous entries between [1, (1
>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>> order and allocating sequentially from it until the cluster is full.
>> This ensures that we don't need to search the map and we get no
>> fragmentation due to alignment padding for different orders in the
>> cluster. If there is no current cluster for a given order, we attempt to
>> allocate a free cluster from the list. If there are no free clusters, we
>> fail the allocation and the caller falls back to splitting the folio and
>> allocates individual entries (as per existing PMD-sized THP fallback).
>>
>> The per-order current clusters are maintained per-cpu using the existing
>> infrastructure. This is done to avoid interleving pages from different
>> tasks, which would prevent IO being batched. This is already done for
>> the order-0 allocations so we follow the same pattern.
>> __scan_swap_map_try_ssd_cluster() is introduced to deal with arbitrary
>> orders and scan_swap_map_try_ssd_cluster() is refactored as a wrapper
>> for order-0.
>>
>> As is done for order-0 per-cpu clusters, the scanner now can steal
>> order-0 entries from any per-cpu-per-order reserved cluster. This
>> ensures that when the swap file is getting full, space doesn't get tied
>> up in the per-cpu reserves.
>>
>> I've run the tests on Ampere Altra (arm64), set up with a 35G block ram
>> device as the swap device and from inside a memcg limited to 40G memory.
>> I've then run `usemem` from vm-scalability with 70 processes (each has
>> its own core), each allocating and writing 1G of memory. I've repeated
>> everything 5 times and taken the mean:
>>
>> Mean Performance Improvement vs 4K/baseline
>>
>> | alloc size |            baseline |       + this series |
>> |            |  v6.6-rc4+anonfolio |                     |
>> |:-----------|--------------------:|--------------------:|
>> | 4K Page    |                0.0% |                4.9% |
>> | 64K THP    |              -44.1% |               10.7% |
>> | 2M THP     |               56.0% |               65.9% |
>>
>> So with this change, the regression for 64K swap performance goes away
>> and 4K and 2M swap improves slightly too.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  include/linux/swap.h |  10 +--
>>  mm/swapfile.c        | 149 +++++++++++++++++++++++++++++++------------
>>  mm/vmscan.c          |  10 +--
>>  3 files changed, 119 insertions(+), 50 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 0ca8aaa098ba..ccbca5db851b 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -295,11 +295,11 @@ struct swap_info_struct {
>>  	unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
>>  	unsigned int __percpu *cpu_next;/*
>>  					 * Likely next allocation offset. We
>> -					 * assign a cluster to each CPU, so each
>> -					 * CPU can allocate swap entry from its
>> -					 * own cluster and swapout sequentially.
>> -					 * The purpose is to optimize swapout
>> -					 * throughput.
>> +					 * assign a cluster per-order to each
>> +					 * CPU, so each CPU can allocate swap
>> +					 * entry from its own cluster and
>> +					 * swapout sequentially. The purpose is
>> +					 * to optimize swapout throughput.
>>  					 */
> 
> This is kind of hard to understand.  Better to define some intermediate
> data structure to improve readability.  For example,
> 
> #ifdef CONFIG_THP_SWAP
> #define NR_SWAP_ORDER   PMD_ORDER
> #else
> #define NR_SWAP_ORDER   1
> #endif
> 
> struct percpu_clusters {
>         unsigned int alloc_next[NR_SWAP_ORDER];
> };
> 
> PMD_ORDER isn't a constant on powerpc, but THP_SWAP isn't supported on
> powerpc too.

I get your point, but this is just making it more difficult for powerpc to ever
enable the feature in future - you're implicitly depending on !powerpc, which
seems fragile. How about if I change the first line of the coment to be "per-cpu
array indexed by allocation order"? Would that be enough?

> 
>>  	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
>>  	struct block_device *bdev;	/* swap device or bdev of swap file */
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 94f7cc225eb9..b50bce50bed9 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -545,10 +545,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>>  
>>  /*
>>   * The cluster corresponding to page_nr will be used. The cluster will be
>> - * removed from free cluster list and its usage counter will be increased.
>> + * removed from free cluster list and its usage counter will be increased by
>> + * count.
>>   */
>> -static void inc_cluster_info_page(struct swap_info_struct *p,
>> -	struct swap_cluster_info *cluster_info, unsigned long page_nr)
>> +static void add_cluster_info_page(struct swap_info_struct *p,
>> +	struct swap_cluster_info *cluster_info, unsigned long page_nr,
>> +	unsigned long count)
>>  {
>>  	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>>  
>> @@ -557,9 +559,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
>>  	if (cluster_is_free(&cluster_info[idx]))
>>  		alloc_cluster(p, idx);
>>  
>> -	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
>> +	VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
>>  	cluster_set_count(&cluster_info[idx],
>> -		cluster_count(&cluster_info[idx]) + 1);
>> +		cluster_count(&cluster_info[idx]) + count);
>> +}
>> +
>> +/*
>> + * The cluster corresponding to page_nr will be used. The cluster will be
>> + * removed from free cluster list and its usage counter will be increased.
>> + */
>> +static void inc_cluster_info_page(struct swap_info_struct *p,
>> +	struct swap_cluster_info *cluster_info, unsigned long page_nr)
>> +{
>> +	add_cluster_info_page(p, cluster_info, page_nr, 1);
>>  }
>>  
>>  /*
>> @@ -588,8 +600,8 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
>>   * cluster list. Avoiding such abuse to avoid list corruption.
>>   */
>>  static bool
>> -scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>> -	unsigned long offset)
>> +__scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>> +	unsigned long offset, int order)
>>  {
>>  	bool conflict;
>>  
>> @@ -601,23 +613,36 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>  	if (!conflict)
>>  		return false;
>>  
>> -	*this_cpu_ptr(si->cpu_next) = SWAP_NEXT_NULL;
>> +	this_cpu_ptr(si->cpu_next)[order] = SWAP_NEXT_NULL;
> 
> This is added in the previous patch.  I don't think SWAP_NEXT_NULL is a
> good name.  Because NEXT isn't a pointer (while cluster_next is). Better
> to name it as SWAP_NEXT_INVALID, etc.

ACK, will make change for next version.

> 
>>  	return true;
>>  }
>>  
>>  /*
>> - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
>> - * might involve allocating a new cluster for current CPU too.
>> + * It's possible scan_swap_map_slots() uses a free cluster in the middle of free
>> + * cluster list. Avoiding such abuse to avoid list corruption.
>>   */
>> -static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>> -	unsigned long *offset, unsigned long *scan_base)
>> +static bool
>> +scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>> +	unsigned long offset)
>> +{
>> +	return __scan_swap_map_ssd_cluster_conflict(si, offset, 0);
>> +}
>> +
>> +/*
>> + * Try to get a swap entry (or size indicated by order) from current cpu's swap
>> + * entry pool (a cluster). This might involve allocating a new cluster for
>> + * current CPU too.
>> + */
>> +static bool __scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>> +	unsigned long *offset, unsigned long *scan_base, int order)
>>  {
>>  	struct swap_cluster_info *ci;
>> -	unsigned int tmp, max;
>> +	unsigned int tmp, max, i;
>>  	unsigned int *cpu_next;
>> +	unsigned int nr_pages = 1 << order;
>>  
>>  new_cluster:
>> -	cpu_next = this_cpu_ptr(si->cpu_next);
>> +	cpu_next = &this_cpu_ptr(si->cpu_next)[order];
>>  	tmp = *cpu_next;
>>  	if (tmp == SWAP_NEXT_NULL) {
>>  		if (!cluster_list_empty(&si->free_clusters)) {
>> @@ -643,10 +668,12 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>  	 * reserve a new cluster.
>>  	 */
>>  	ci = lock_cluster(si, tmp);
>> -	if (si->swap_map[tmp]) {
>> -		unlock_cluster(ci);
>> -		*cpu_next = SWAP_NEXT_NULL;
>> -		goto new_cluster;
>> +	for (i = 0; i < nr_pages; i++) {
>> +		if (si->swap_map[tmp + i]) {
>> +			unlock_cluster(ci);
>> +			*cpu_next = SWAP_NEXT_NULL;
>> +			goto new_cluster;
>> +		}
>>  	}
>>  	unlock_cluster(ci);
>>  
>> @@ -654,12 +681,22 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>  	*scan_base = tmp;
>>  
>>  	max = ALIGN_DOWN(tmp, SWAPFILE_CLUSTER) + SWAPFILE_CLUSTER;
> 
> This line is added in a previous patch.  Can we just use
> 
>         max = ALIGN(tmp + 1, SWAPFILE_CLUSTER);

Sure. This is how I originally had it, but then decided that the other approach
was a bit clearer. But I don't have a strong opinion, so I'll change it as you
suggest.

> 
> Or, add ALIGN_UP() for this?
> 
>> -	tmp += 1;
>> +	tmp += nr_pages;
>>  	*cpu_next = tmp < max ? tmp : SWAP_NEXT_NULL;
>>  
>>  	return true;
>>  }
>>  
>> +/*
>> + * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
>> + * might involve allocating a new cluster for current CPU too.
>> + */
>> +static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>> +	unsigned long *offset, unsigned long *scan_base)
>> +{
>> +	return __scan_swap_map_try_ssd_cluster(si, offset, scan_base, 0);
>> +}
>> +
>>  static void __del_from_avail_list(struct swap_info_struct *p)
>>  {
>>  	int nid;
>> @@ -982,35 +1019,58 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  	return n_ret;
>>  }
>>  
>> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
>> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
>> +			    unsigned int nr_pages)
> 
> IMHO, it's better to make scan_swap_map_slots() to support order > 0
> instead of making swap_alloc_cluster() to support order != PMD_ORDER.
> And, we may merge swap_alloc_cluster() with scan_swap_map_slots() after
> that.

I did consider adding a 5th patch to rename swap_alloc_large() to something like
swap_alloc_one_ssd_entry() (which would then be used for order=0 too) and
refactor scan_swap_map_slots() to fully delegate to it for the non-scaning ssd
allocation case. Would something like that suit?

I have reservations about making scan_swap_map_slots() take an order and be the
sole entry point:

  - in the non-ssd case, we can't support order!=0
  - there is a lot of other logic to deal with falling back to scanning which we
    would only want to do for order==0, so we would end up with a few ugly
    conditionals against order.
  - I was concerned the risk of me introducing a bug when refactoring all that
    subtle logic was high

What do you think? Is not making scan_swap_map_slots() support order > 0 a deal
breaker for you?

Thanks,
Ryan


> 
>>  {
>> -	unsigned long idx;
>>  	struct swap_cluster_info *ci;
>> -	unsigned long offset;
>> +	unsigned long offset, scan_base;
>> +	int order = ilog2(nr_pages);
>> +	bool ret;
>>  
>>  	/*
>> -	 * Should not even be attempting cluster allocations when huge
>> +	 * Should not even be attempting large allocations when huge
>>  	 * page swap is disabled.  Warn and fail the allocation.
>>  	 */
>> -	if (!IS_ENABLED(CONFIG_THP_SWAP)) {
>> +	if (!IS_ENABLED(CONFIG_THP_SWAP) ||
>> +	    nr_pages < 2 || nr_pages > SWAPFILE_CLUSTER ||
>> +	    !is_power_of_2(nr_pages)) {
>>  		VM_WARN_ON_ONCE(1);
>>  		return 0;
>>  	}
>>  
>> -	if (cluster_list_empty(&si->free_clusters))
>> +	/*
>> +	 * Swapfile is not block device or not using clusters so unable to
>> +	 * allocate large entries.
>> +	 */
>> +	if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
>>  		return 0;
>>  
>> -	idx = cluster_list_first(&si->free_clusters);
>> -	offset = idx * SWAPFILE_CLUSTER;
>> -	ci = lock_cluster(si, offset);
>> -	alloc_cluster(si, idx);
>> -	cluster_set_count(ci, SWAPFILE_CLUSTER);
>> +again:
>> +	/*
>> +	 * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
>> +	 * so indicate that we are scanning to synchronise with swapoff.
>> +	 */
>> +	si->flags += SWP_SCANNING;
>> +	ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
>> +	si->flags -= SWP_SCANNING;
>> +
>> +	/*
>> +	 * If we failed to allocate or if swapoff is waiting for us (due to lock
>> +	 * being dropped for discard above), return immediately.
>> +	 */
>> +	if (!ret || !(si->flags & SWP_WRITEOK))
>> +		return 0;
>>  
>> -	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
>> +	if (__scan_swap_map_ssd_cluster_conflict(si, offset, order))
>> +		goto again;
>> +
>> +	ci = lock_cluster(si, offset);
>> +	memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
>> +	add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>>  	unlock_cluster(ci);
>> -	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
>> -	*slot = swp_entry(si->type, offset);
>>  
>> +	swap_range_alloc(si, offset, nr_pages);
>> +	*slot = swp_entry(si->type, offset);
>>  	return 1;
>>  }
>>  
>> @@ -1036,7 +1096,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>>  	int node;
>>  
>>  	/* Only single cluster request supported */
>> -	WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
>> +	WARN_ON_ONCE(n_goal > 1 && size > 1);
>>  
>>  	spin_lock(&swap_avail_lock);
>>  
>> @@ -1073,14 +1133,13 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>>  			spin_unlock(&si->lock);
>>  			goto nextsi;
>>  		}
>> -		if (size == SWAPFILE_CLUSTER) {
>> -			if (si->flags & SWP_BLKDEV)
>> -				n_ret = swap_alloc_cluster(si, swp_entries);
>> +		if (size > 1) {
>> +			n_ret = swap_alloc_large(si, swp_entries, size);
>>  		} else
>>  			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
>>  						    n_goal, swp_entries);
>>  		spin_unlock(&si->lock);
>> -		if (n_ret || size == SWAPFILE_CLUSTER)
>> +		if (n_ret || size > 1)
>>  			goto check_out;
>>  		cond_resched();
>>  
>> @@ -3041,6 +3100,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>  	if (p->bdev && bdev_nonrot(p->bdev)) {
>>  		int cpu;
>>  		unsigned long ci, nr_cluster;
>> +		int nr_order;
>> +		int i;
>>  
>>  		p->flags |= SWP_SOLIDSTATE;
>>  		p->cluster_next_cpu = alloc_percpu(unsigned int);
>> @@ -3068,13 +3129,19 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>  		for (ci = 0; ci < nr_cluster; ci++)
>>  			spin_lock_init(&((cluster_info + ci)->lock));
>>  
>> -		p->cpu_next = alloc_percpu(unsigned int);
>> +		nr_order = IS_ENABLED(CONFIG_THP_SWAP) ? PMD_ORDER + 1 : 1;
>> +		p->cpu_next = __alloc_percpu(sizeof(unsigned int) * nr_order,
>> +					     __alignof__(unsigned int));
>>  		if (!p->cpu_next) {
>>  			error = -ENOMEM;
>>  			goto bad_swap_unlock_inode;
>>  		}
>> -		for_each_possible_cpu(cpu)
>> -			per_cpu(*p->cpu_next, cpu) = SWAP_NEXT_NULL;
>> +		for_each_possible_cpu(cpu) {
>> +			unsigned int *cpu_next = per_cpu_ptr(p->cpu_next, cpu);
>> +
>> +			for (i = 0; i < nr_order; i++)
>> +				cpu_next[i] = SWAP_NEXT_NULL;
>> +		}
>>  	} else {
>>  		atomic_inc(&nr_rotate_swap);
>>  		inced_nr_rotate_swap = true;
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 2cc0cb41fb32..ea19710aa4cd 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>  					if (!can_split_folio(folio, NULL))
>>  						goto activate_locked;
>>  					/*
>> -					 * Split folios without a PMD map right
>> -					 * away. Chances are some or all of the
>> -					 * tail pages can be freed without IO.
>> +					 * Split PMD-mappable folios without a
>> +					 * PMD map right away. Chances are some
>> +					 * or all of the tail pages can be freed
>> +					 * without IO.
>>  					 */
>> -					if (!folio_entire_mapcount(folio) &&
>> +					if (folio_test_pmd_mappable(folio) &&
>> +					    !folio_entire_mapcount(folio) &&
>>  					    split_folio_to_list(folio,
>>  								folio_list))
>>  						goto activate_locked;
> 
> --
> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2023-10-30 13:59     ` Ryan Roberts
@ 2023-10-31  8:12       ` Huang, Ying
  2023-11-03 11:42         ` Ryan Roberts
  0 siblings, 1 reply; 116+ messages in thread
From: Huang, Ying @ 2023-10-31  8:12 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, linux-kernel,
	linux-mm

Ryan Roberts <ryan.roberts@arm.com> writes:

> On 30/10/2023 08:18, Huang, Ying wrote:
>> Hi, Ryan,
>> 
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> The upcoming anonymous small-sized THP feature enables performance
>>> improvements by allocating large folios for anonymous memory. However
>>> I've observed that on an arm64 system running a parallel workload (e.g.
>>> kernel compilation) across many cores, under high memory pressure, the
>>> speed regresses. This is due to bottlenecking on the increased number of
>>> TLBIs added due to all the extra folio splitting.
>>>
>>> Therefore, solve this regression by adding support for swapping out
>>> small-sized THP without needing to split the folio, just like is already
>>> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
>>> enabled, and when the swap backing store is a non-rotating block device.
>>> These are the same constraints as for the existing PMD-sized THP
>>> swap-out support.
>>>
>>> Note that no attempt is made to swap-in THP here - this is still done
>>> page-by-page, like for PMD-sized THP.
>>>
>>> The main change here is to improve the swap entry allocator so that it
>>> can allocate any power-of-2 number of contiguous entries between [1, (1
>>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>>> order and allocating sequentially from it until the cluster is full.
>>> This ensures that we don't need to search the map and we get no
>>> fragmentation due to alignment padding for different orders in the
>>> cluster. If there is no current cluster for a given order, we attempt to
>>> allocate a free cluster from the list. If there are no free clusters, we
>>> fail the allocation and the caller falls back to splitting the folio and
>>> allocates individual entries (as per existing PMD-sized THP fallback).
>>>
>>> The per-order current clusters are maintained per-cpu using the existing
>>> infrastructure. This is done to avoid interleving pages from different
>>> tasks, which would prevent IO being batched. This is already done for
>>> the order-0 allocations so we follow the same pattern.
>>> __scan_swap_map_try_ssd_cluster() is introduced to deal with arbitrary
>>> orders and scan_swap_map_try_ssd_cluster() is refactored as a wrapper
>>> for order-0.
>>>
>>> As is done for order-0 per-cpu clusters, the scanner now can steal
>>> order-0 entries from any per-cpu-per-order reserved cluster. This
>>> ensures that when the swap file is getting full, space doesn't get tied
>>> up in the per-cpu reserves.
>>>
>>> I've run the tests on Ampere Altra (arm64), set up with a 35G block ram
>>> device as the swap device and from inside a memcg limited to 40G memory.
>>> I've then run `usemem` from vm-scalability with 70 processes (each has
>>> its own core), each allocating and writing 1G of memory. I've repeated
>>> everything 5 times and taken the mean:
>>>
>>> Mean Performance Improvement vs 4K/baseline
>>>
>>> | alloc size |            baseline |       + this series |
>>> |            |  v6.6-rc4+anonfolio |                     |
>>> |:-----------|--------------------:|--------------------:|
>>> | 4K Page    |                0.0% |                4.9% |
>>> | 64K THP    |              -44.1% |               10.7% |
>>> | 2M THP     |               56.0% |               65.9% |
>>>
>>> So with this change, the regression for 64K swap performance goes away
>>> and 4K and 2M swap improves slightly too.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>  include/linux/swap.h |  10 +--
>>>  mm/swapfile.c        | 149 +++++++++++++++++++++++++++++++------------
>>>  mm/vmscan.c          |  10 +--
>>>  3 files changed, 119 insertions(+), 50 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 0ca8aaa098ba..ccbca5db851b 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -295,11 +295,11 @@ struct swap_info_struct {
>>>  	unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
>>>  	unsigned int __percpu *cpu_next;/*
>>>  					 * Likely next allocation offset. We
>>> -					 * assign a cluster to each CPU, so each
>>> -					 * CPU can allocate swap entry from its
>>> -					 * own cluster and swapout sequentially.
>>> -					 * The purpose is to optimize swapout
>>> -					 * throughput.
>>> +					 * assign a cluster per-order to each
>>> +					 * CPU, so each CPU can allocate swap
>>> +					 * entry from its own cluster and
>>> +					 * swapout sequentially. The purpose is
>>> +					 * to optimize swapout throughput.
>>>  					 */
>> 
>> This is kind of hard to understand.  Better to define some intermediate
>> data structure to improve readability.  For example,
>> 
>> #ifdef CONFIG_THP_SWAP
>> #define NR_SWAP_ORDER   PMD_ORDER
>> #else
>> #define NR_SWAP_ORDER   1
>> #endif
>> 
>> struct percpu_clusters {
>>         unsigned int alloc_next[NR_SWAP_ORDER];
>> };
>> 
>> PMD_ORDER isn't a constant on powerpc, but THP_SWAP isn't supported on
>> powerpc too.
>
> I get your point, but this is just making it more difficult for powerpc to ever
> enable the feature in future - you're implicitly depending on !powerpc, which
> seems fragile. How about if I change the first line of the coment to be "per-cpu
> array indexed by allocation order"? Would that be enough?

Even if PMD_ORDER isn't constant on powerpc, it's not necessary for
NR_SWAP_ORDER to be variable.  At least (1 << (NR_SWAP_ORDER-1)) should
< SWAPFILE_CLUSTER.  When someone adds THP swap support on powerpc, he
can choose a reasonable constant for NR_SWAP_ORDER (for example, 10 or
7).

>> 
>>>  	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
>>>  	struct block_device *bdev;	/* swap device or bdev of swap file */
>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>> index 94f7cc225eb9..b50bce50bed9 100644
>>> --- a/mm/swapfile.c
>>> +++ b/mm/swapfile.c
>>> @@ -545,10 +545,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>>>  
>>>  /*
>>>   * The cluster corresponding to page_nr will be used. The cluster will be
>>> - * removed from free cluster list and its usage counter will be increased.
>>> + * removed from free cluster list and its usage counter will be increased by
>>> + * count.
>>>   */
>>> -static void inc_cluster_info_page(struct swap_info_struct *p,
>>> -	struct swap_cluster_info *cluster_info, unsigned long page_nr)
>>> +static void add_cluster_info_page(struct swap_info_struct *p,
>>> +	struct swap_cluster_info *cluster_info, unsigned long page_nr,
>>> +	unsigned long count)
>>>  {
>>>  	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>>>  
>>> @@ -557,9 +559,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
>>>  	if (cluster_is_free(&cluster_info[idx]))
>>>  		alloc_cluster(p, idx);
>>>  
>>> -	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
>>> +	VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
>>>  	cluster_set_count(&cluster_info[idx],
>>> -		cluster_count(&cluster_info[idx]) + 1);
>>> +		cluster_count(&cluster_info[idx]) + count);
>>> +}
>>> +
>>> +/*
>>> + * The cluster corresponding to page_nr will be used. The cluster will be
>>> + * removed from free cluster list and its usage counter will be increased.
>>> + */
>>> +static void inc_cluster_info_page(struct swap_info_struct *p,
>>> +	struct swap_cluster_info *cluster_info, unsigned long page_nr)
>>> +{
>>> +	add_cluster_info_page(p, cluster_info, page_nr, 1);
>>>  }
>>>  
>>>  /*
>>> @@ -588,8 +600,8 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
>>>   * cluster list. Avoiding such abuse to avoid list corruption.
>>>   */
>>>  static bool
>>> -scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>> -	unsigned long offset)
>>> +__scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>> +	unsigned long offset, int order)
>>>  {
>>>  	bool conflict;
>>>  
>>> @@ -601,23 +613,36 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>>  	if (!conflict)
>>>  		return false;
>>>  
>>> -	*this_cpu_ptr(si->cpu_next) = SWAP_NEXT_NULL;
>>> +	this_cpu_ptr(si->cpu_next)[order] = SWAP_NEXT_NULL;
>> 
>> This is added in the previous patch.  I don't think SWAP_NEXT_NULL is a
>> good name.  Because NEXT isn't a pointer (while cluster_next is). Better
>> to name it as SWAP_NEXT_INVALID, etc.
>
> ACK, will make change for next version.

Thanks!

>> 
>>>  	return true;
>>>  }
>>>  
>>>  /*
>>> - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
>>> - * might involve allocating a new cluster for current CPU too.
>>> + * It's possible scan_swap_map_slots() uses a free cluster in the middle of free
>>> + * cluster list. Avoiding such abuse to avoid list corruption.
>>>   */
>>> -static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>> -	unsigned long *offset, unsigned long *scan_base)
>>> +static bool
>>> +scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>> +	unsigned long offset)
>>> +{
>>> +	return __scan_swap_map_ssd_cluster_conflict(si, offset, 0);
>>> +}
>>> +
>>> +/*
>>> + * Try to get a swap entry (or size indicated by order) from current cpu's swap
>>> + * entry pool (a cluster). This might involve allocating a new cluster for
>>> + * current CPU too.
>>> + */
>>> +static bool __scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>> +	unsigned long *offset, unsigned long *scan_base, int order)
>>>  {
>>>  	struct swap_cluster_info *ci;
>>> -	unsigned int tmp, max;
>>> +	unsigned int tmp, max, i;
>>>  	unsigned int *cpu_next;
>>> +	unsigned int nr_pages = 1 << order;
>>>  
>>>  new_cluster:
>>> -	cpu_next = this_cpu_ptr(si->cpu_next);
>>> +	cpu_next = &this_cpu_ptr(si->cpu_next)[order];
>>>  	tmp = *cpu_next;
>>>  	if (tmp == SWAP_NEXT_NULL) {
>>>  		if (!cluster_list_empty(&si->free_clusters)) {
>>> @@ -643,10 +668,12 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>>  	 * reserve a new cluster.
>>>  	 */
>>>  	ci = lock_cluster(si, tmp);
>>> -	if (si->swap_map[tmp]) {
>>> -		unlock_cluster(ci);
>>> -		*cpu_next = SWAP_NEXT_NULL;
>>> -		goto new_cluster;
>>> +	for (i = 0; i < nr_pages; i++) {
>>> +		if (si->swap_map[tmp + i]) {
>>> +			unlock_cluster(ci);
>>> +			*cpu_next = SWAP_NEXT_NULL;
>>> +			goto new_cluster;
>>> +		}
>>>  	}
>>>  	unlock_cluster(ci);
>>>  
>>> @@ -654,12 +681,22 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>>  	*scan_base = tmp;
>>>  
>>>  	max = ALIGN_DOWN(tmp, SWAPFILE_CLUSTER) + SWAPFILE_CLUSTER;
>> 
>> This line is added in a previous patch.  Can we just use
>> 
>>         max = ALIGN(tmp + 1, SWAPFILE_CLUSTER);
>
> Sure. This is how I originally had it, but then decided that the other approach
> was a bit clearer. But I don't have a strong opinion, so I'll change it as you
> suggest.

Thanks!

>> 
>> Or, add ALIGN_UP() for this?
>> 
>>> -	tmp += 1;
>>> +	tmp += nr_pages;
>>>  	*cpu_next = tmp < max ? tmp : SWAP_NEXT_NULL;
>>>  
>>>  	return true;
>>>  }
>>>  
>>> +/*
>>> + * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
>>> + * might involve allocating a new cluster for current CPU too.
>>> + */
>>> +static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>> +	unsigned long *offset, unsigned long *scan_base)
>>> +{
>>> +	return __scan_swap_map_try_ssd_cluster(si, offset, scan_base, 0);
>>> +}
>>> +
>>>  static void __del_from_avail_list(struct swap_info_struct *p)
>>>  {
>>>  	int nid;
>>> @@ -982,35 +1019,58 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>  	return n_ret;
>>>  }
>>>  
>>> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
>>> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
>>> +			    unsigned int nr_pages)
>> 
>> IMHO, it's better to make scan_swap_map_slots() to support order > 0
>> instead of making swap_alloc_cluster() to support order != PMD_ORDER.
>> And, we may merge swap_alloc_cluster() with scan_swap_map_slots() after
>> that.
>
> I did consider adding a 5th patch to rename swap_alloc_large() to something like
> swap_alloc_one_ssd_entry() (which would then be used for order=0 too) and
> refactor scan_swap_map_slots() to fully delegate to it for the non-scaning ssd
> allocation case. Would something like that suit?
>
> I have reservations about making scan_swap_map_slots() take an order and be the
> sole entry point:
>
>   - in the non-ssd case, we can't support order!=0

Don't need to check ssd directly, we only support order != 0 if
si->cluster_info != NULL.

>   - there is a lot of other logic to deal with falling back to scanning which we
>     would only want to do for order==0, so we would end up with a few ugly
>     conditionals against order.

We don't need to care about them in most cases.  IIUC, only the "goto
scan" after scan_swap_map_try_ssd_cluster() return false need to "goto
no_page" for order != 0.

>   - I was concerned the risk of me introducing a bug when refactoring all that
>     subtle logic was high

IMHO, readability is more important for long term maintenance.  So, we
need to refactor the existing code for that.

> What do you think? Is not making scan_swap_map_slots() support order > 0 a deal
> breaker for you?

I just think that it's better to use scan_swap_map_slots() for any order
other than PMD_ORDER.  In that way, we share as much code as possible.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2023-10-25 14:45 ` [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting Ryan Roberts
  2023-10-30  8:18   ` Huang, Ying
@ 2023-11-02  7:40   ` Barry Song
  2023-11-02 10:21     ` Ryan Roberts
  2024-02-05  9:51   ` Barry Song
  2024-02-22  7:05   ` Barry Song
  3 siblings, 1 reply; 116+ messages in thread
From: Barry Song @ 2023-11-02  7:40 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	linux-kernel, linux-mm

On Wed, Oct 25, 2023 at 10:46 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> The upcoming anonymous small-sized THP feature enables performance
> improvements by allocating large folios for anonymous memory. However
> I've observed that on an arm64 system running a parallel workload (e.g.
> kernel compilation) across many cores, under high memory pressure, the
> speed regresses. This is due to bottlenecking on the increased number of
> TLBIs added due to all the extra folio splitting.
>
> Therefore, solve this regression by adding support for swapping out
> small-sized THP without needing to split the folio, just like is already
> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
> enabled, and when the swap backing store is a non-rotating block device.
> These are the same constraints as for the existing PMD-sized THP
> swap-out support.

Hi Ryan,

We had a problem while enabling THP SWP on arm64,
commit d0637c505f8 ("arm64: enable THP_SWAP for arm64")

this means we have to depend on !system_supports_mte().
static inline bool arch_thp_swp_supported(void)
{
        return !system_supports_mte();
}

Do we have the same problem for small-sized THP? If yes, MTE has been
widely existing in various ARM64 SoC. Does it mean we should begin to fix
the issue now?


>
> Note that no attempt is made to swap-in THP here - this is still done
> page-by-page, like for PMD-sized THP.
>
> The main change here is to improve the swap entry allocator so that it
> can allocate any power-of-2 number of contiguous entries between [1, (1
> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
> order and allocating sequentially from it until the cluster is full.
> This ensures that we don't need to search the map and we get no
> fragmentation due to alignment padding for different orders in the
> cluster. If there is no current cluster for a given order, we attempt to
> allocate a free cluster from the list. If there are no free clusters, we
> fail the allocation and the caller falls back to splitting the folio and
> allocates individual entries (as per existing PMD-sized THP fallback).
>
> The per-order current clusters are maintained per-cpu using the existing
> infrastructure. This is done to avoid interleving pages from different
> tasks, which would prevent IO being batched. This is already done for
> the order-0 allocations so we follow the same pattern.
> __scan_swap_map_try_ssd_cluster() is introduced to deal with arbitrary
> orders and scan_swap_map_try_ssd_cluster() is refactored as a wrapper
> for order-0.
>
> As is done for order-0 per-cpu clusters, the scanner now can steal
> order-0 entries from any per-cpu-per-order reserved cluster. This
> ensures that when the swap file is getting full, space doesn't get tied
> up in the per-cpu reserves.
>
> I've run the tests on Ampere Altra (arm64), set up with a 35G block ram
> device as the swap device and from inside a memcg limited to 40G memory.
> I've then run `usemem` from vm-scalability with 70 processes (each has
> its own core), each allocating and writing 1G of memory. I've repeated
> everything 5 times and taken the mean:
>
> Mean Performance Improvement vs 4K/baseline
>
> | alloc size |            baseline |       + this series |
> |            |  v6.6-rc4+anonfolio |                     |
> |:-----------|--------------------:|--------------------:|
> | 4K Page    |                0.0% |                4.9% |
> | 64K THP    |              -44.1% |               10.7% |
> | 2M THP     |               56.0% |               65.9% |
>
> So with this change, the regression for 64K swap performance goes away
> and 4K and 2M swap improves slightly too.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  include/linux/swap.h |  10 +--
>  mm/swapfile.c        | 149 +++++++++++++++++++++++++++++++------------
>  mm/vmscan.c          |  10 +--
>  3 files changed, 119 insertions(+), 50 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 0ca8aaa098ba..ccbca5db851b 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -295,11 +295,11 @@ struct swap_info_struct {
>         unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
>         unsigned int __percpu *cpu_next;/*
>                                          * Likely next allocation offset. We
> -                                        * assign a cluster to each CPU, so each
> -                                        * CPU can allocate swap entry from its
> -                                        * own cluster and swapout sequentially.
> -                                        * The purpose is to optimize swapout
> -                                        * throughput.
> +                                        * assign a cluster per-order to each
> +                                        * CPU, so each CPU can allocate swap
> +                                        * entry from its own cluster and
> +                                        * swapout sequentially. The purpose is
> +                                        * to optimize swapout throughput.
>                                          */
>         struct rb_root swap_extent_root;/* root of the swap extent rbtree */
>         struct block_device *bdev;      /* swap device or bdev of swap file */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 94f7cc225eb9..b50bce50bed9 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -545,10 +545,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>
>  /*
>   * The cluster corresponding to page_nr will be used. The cluster will be
> - * removed from free cluster list and its usage counter will be increased.
> + * removed from free cluster list and its usage counter will be increased by
> + * count.
>   */
> -static void inc_cluster_info_page(struct swap_info_struct *p,
> -       struct swap_cluster_info *cluster_info, unsigned long page_nr)
> +static void add_cluster_info_page(struct swap_info_struct *p,
> +       struct swap_cluster_info *cluster_info, unsigned long page_nr,
> +       unsigned long count)
>  {
>         unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>
> @@ -557,9 +559,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
>         if (cluster_is_free(&cluster_info[idx]))
>                 alloc_cluster(p, idx);
>
> -       VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
> +       VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
>         cluster_set_count(&cluster_info[idx],
> -               cluster_count(&cluster_info[idx]) + 1);
> +               cluster_count(&cluster_info[idx]) + count);
> +}
> +
> +/*
> + * The cluster corresponding to page_nr will be used. The cluster will be
> + * removed from free cluster list and its usage counter will be increased.
> + */
> +static void inc_cluster_info_page(struct swap_info_struct *p,
> +       struct swap_cluster_info *cluster_info, unsigned long page_nr)
> +{
> +       add_cluster_info_page(p, cluster_info, page_nr, 1);
>  }
>
>  /*
> @@ -588,8 +600,8 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
>   * cluster list. Avoiding such abuse to avoid list corruption.
>   */
>  static bool
> -scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> -       unsigned long offset)
> +__scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> +       unsigned long offset, int order)
>  {
>         bool conflict;
>
> @@ -601,23 +613,36 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>         if (!conflict)
>                 return false;
>
> -       *this_cpu_ptr(si->cpu_next) = SWAP_NEXT_NULL;
> +       this_cpu_ptr(si->cpu_next)[order] = SWAP_NEXT_NULL;
>         return true;
>  }
>
>  /*
> - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
> - * might involve allocating a new cluster for current CPU too.
> + * It's possible scan_swap_map_slots() uses a free cluster in the middle of free
> + * cluster list. Avoiding such abuse to avoid list corruption.
>   */
> -static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> -       unsigned long *offset, unsigned long *scan_base)
> +static bool
> +scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> +       unsigned long offset)
> +{
> +       return __scan_swap_map_ssd_cluster_conflict(si, offset, 0);
> +}
> +
> +/*
> + * Try to get a swap entry (or size indicated by order) from current cpu's swap
> + * entry pool (a cluster). This might involve allocating a new cluster for
> + * current CPU too.
> + */
> +static bool __scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> +       unsigned long *offset, unsigned long *scan_base, int order)
>  {
>         struct swap_cluster_info *ci;
> -       unsigned int tmp, max;
> +       unsigned int tmp, max, i;
>         unsigned int *cpu_next;
> +       unsigned int nr_pages = 1 << order;
>
>  new_cluster:
> -       cpu_next = this_cpu_ptr(si->cpu_next);
> +       cpu_next = &this_cpu_ptr(si->cpu_next)[order];
>         tmp = *cpu_next;
>         if (tmp == SWAP_NEXT_NULL) {
>                 if (!cluster_list_empty(&si->free_clusters)) {
> @@ -643,10 +668,12 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>          * reserve a new cluster.
>          */
>         ci = lock_cluster(si, tmp);
> -       if (si->swap_map[tmp]) {
> -               unlock_cluster(ci);
> -               *cpu_next = SWAP_NEXT_NULL;
> -               goto new_cluster;
> +       for (i = 0; i < nr_pages; i++) {
> +               if (si->swap_map[tmp + i]) {
> +                       unlock_cluster(ci);
> +                       *cpu_next = SWAP_NEXT_NULL;
> +                       goto new_cluster;
> +               }
>         }
>         unlock_cluster(ci);
>
> @@ -654,12 +681,22 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>         *scan_base = tmp;
>
>         max = ALIGN_DOWN(tmp, SWAPFILE_CLUSTER) + SWAPFILE_CLUSTER;
> -       tmp += 1;
> +       tmp += nr_pages;
>         *cpu_next = tmp < max ? tmp : SWAP_NEXT_NULL;
>
>         return true;
>  }
>
> +/*
> + * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
> + * might involve allocating a new cluster for current CPU too.
> + */
> +static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> +       unsigned long *offset, unsigned long *scan_base)
> +{
> +       return __scan_swap_map_try_ssd_cluster(si, offset, scan_base, 0);
> +}
> +
>  static void __del_from_avail_list(struct swap_info_struct *p)
>  {
>         int nid;
> @@ -982,35 +1019,58 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>         return n_ret;
>  }
>
> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
> +                           unsigned int nr_pages)
>  {
> -       unsigned long idx;
>         struct swap_cluster_info *ci;
> -       unsigned long offset;
> +       unsigned long offset, scan_base;
> +       int order = ilog2(nr_pages);
> +       bool ret;
>
>         /*
> -        * Should not even be attempting cluster allocations when huge
> +        * Should not even be attempting large allocations when huge
>          * page swap is disabled.  Warn and fail the allocation.
>          */
> -       if (!IS_ENABLED(CONFIG_THP_SWAP)) {
> +       if (!IS_ENABLED(CONFIG_THP_SWAP) ||
> +           nr_pages < 2 || nr_pages > SWAPFILE_CLUSTER ||
> +           !is_power_of_2(nr_pages)) {
>                 VM_WARN_ON_ONCE(1);
>                 return 0;
>         }
>
> -       if (cluster_list_empty(&si->free_clusters))
> +       /*
> +        * Swapfile is not block device or not using clusters so unable to
> +        * allocate large entries.
> +        */
> +       if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
>                 return 0;
>
> -       idx = cluster_list_first(&si->free_clusters);
> -       offset = idx * SWAPFILE_CLUSTER;
> -       ci = lock_cluster(si, offset);
> -       alloc_cluster(si, idx);
> -       cluster_set_count(ci, SWAPFILE_CLUSTER);
> +again:
> +       /*
> +        * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
> +        * so indicate that we are scanning to synchronise with swapoff.
> +        */
> +       si->flags += SWP_SCANNING;
> +       ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
> +       si->flags -= SWP_SCANNING;
> +
> +       /*
> +        * If we failed to allocate or if swapoff is waiting for us (due to lock
> +        * being dropped for discard above), return immediately.
> +        */
> +       if (!ret || !(si->flags & SWP_WRITEOK))
> +               return 0;
>
> -       memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
> +       if (__scan_swap_map_ssd_cluster_conflict(si, offset, order))
> +               goto again;
> +
> +       ci = lock_cluster(si, offset);
> +       memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
> +       add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>         unlock_cluster(ci);
> -       swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
> -       *slot = swp_entry(si->type, offset);
>
> +       swap_range_alloc(si, offset, nr_pages);
> +       *slot = swp_entry(si->type, offset);
>         return 1;
>  }
>
> @@ -1036,7 +1096,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>         int node;
>
>         /* Only single cluster request supported */
> -       WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
> +       WARN_ON_ONCE(n_goal > 1 && size > 1);
>
>         spin_lock(&swap_avail_lock);
>
> @@ -1073,14 +1133,13 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>                         spin_unlock(&si->lock);
>                         goto nextsi;
>                 }
> -               if (size == SWAPFILE_CLUSTER) {
> -                       if (si->flags & SWP_BLKDEV)
> -                               n_ret = swap_alloc_cluster(si, swp_entries);
> +               if (size > 1) {
> +                       n_ret = swap_alloc_large(si, swp_entries, size);
>                 } else
>                         n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
>                                                     n_goal, swp_entries);
>                 spin_unlock(&si->lock);
> -               if (n_ret || size == SWAPFILE_CLUSTER)
> +               if (n_ret || size > 1)
>                         goto check_out;
>                 cond_resched();
>
> @@ -3041,6 +3100,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>         if (p->bdev && bdev_nonrot(p->bdev)) {
>                 int cpu;
>                 unsigned long ci, nr_cluster;
> +               int nr_order;
> +               int i;
>
>                 p->flags |= SWP_SOLIDSTATE;
>                 p->cluster_next_cpu = alloc_percpu(unsigned int);
> @@ -3068,13 +3129,19 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>                 for (ci = 0; ci < nr_cluster; ci++)
>                         spin_lock_init(&((cluster_info + ci)->lock));
>
> -               p->cpu_next = alloc_percpu(unsigned int);
> +               nr_order = IS_ENABLED(CONFIG_THP_SWAP) ? PMD_ORDER + 1 : 1;
> +               p->cpu_next = __alloc_percpu(sizeof(unsigned int) * nr_order,
> +                                            __alignof__(unsigned int));
>                 if (!p->cpu_next) {
>                         error = -ENOMEM;
>                         goto bad_swap_unlock_inode;
>                 }
> -               for_each_possible_cpu(cpu)
> -                       per_cpu(*p->cpu_next, cpu) = SWAP_NEXT_NULL;
> +               for_each_possible_cpu(cpu) {
> +                       unsigned int *cpu_next = per_cpu_ptr(p->cpu_next, cpu);
> +
> +                       for (i = 0; i < nr_order; i++)
> +                               cpu_next[i] = SWAP_NEXT_NULL;
> +               }
>         } else {
>                 atomic_inc(&nr_rotate_swap);
>                 inced_nr_rotate_swap = true;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2cc0cb41fb32..ea19710aa4cd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>                                         if (!can_split_folio(folio, NULL))
>                                                 goto activate_locked;
>                                         /*
> -                                        * Split folios without a PMD map right
> -                                        * away. Chances are some or all of the
> -                                        * tail pages can be freed without IO.
> +                                        * Split PMD-mappable folios without a
> +                                        * PMD map right away. Chances are some
> +                                        * or all of the tail pages can be freed
> +                                        * without IO.
>                                          */
> -                                       if (!folio_entire_mapcount(folio) &&
> +                                       if (folio_test_pmd_mappable(folio) &&
> +                                           !folio_entire_mapcount(folio) &&
>                                             split_folio_to_list(folio,
>                                                                 folio_list))
>                                                 goto activate_locked;
> --
> 2.25.1
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2023-11-02  7:40   ` Barry Song
@ 2023-11-02 10:21     ` Ryan Roberts
  2023-11-02 22:36       ` Barry Song
  0 siblings, 1 reply; 116+ messages in thread
From: Ryan Roberts @ 2023-11-02 10:21 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	linux-kernel, linux-mm, Steven Price

On 02/11/2023 07:40, Barry Song wrote:
> On Wed, Oct 25, 2023 at 10:46 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> The upcoming anonymous small-sized THP feature enables performance
>> improvements by allocating large folios for anonymous memory. However
>> I've observed that on an arm64 system running a parallel workload (e.g.
>> kernel compilation) across many cores, under high memory pressure, the
>> speed regresses. This is due to bottlenecking on the increased number of
>> TLBIs added due to all the extra folio splitting.
>>
>> Therefore, solve this regression by adding support for swapping out
>> small-sized THP without needing to split the folio, just like is already
>> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
>> enabled, and when the swap backing store is a non-rotating block device.
>> These are the same constraints as for the existing PMD-sized THP
>> swap-out support.
> 
> Hi Ryan,
> 
> We had a problem while enabling THP SWP on arm64,
> commit d0637c505f8 ("arm64: enable THP_SWAP for arm64")
> 
> this means we have to depend on !system_supports_mte().
> static inline bool arch_thp_swp_supported(void)
> {
>         return !system_supports_mte();
> }
> 
> Do we have the same problem for small-sized THP? If yes, MTE has been
> widely existing in various ARM64 SoC. Does it mean we should begin to fix
> the issue now?

Hi Barry,

I'm guessing that the current problem for MTE is that when it saves the tags
prior to swap out, it assumes all folios are small (i.e. base page size) and
therefore doesn't have the logic to iterate over a large folio, saving the tags
for each page?

If that's the issue, then yes we have the same problem for small-sized THP, but
this is all safe - arch_thp_swp_supported() will return false and we continue to
use that signal to cause the page to be split prior to swap out.

But, yes, it would be nice to fix that! And if I've understood the problem
correctly, it doesn't sound like it should be too hard? Is this something you
are volunteering for?? :)

Thanks,
Ryan


> 
> 
>>
>> Note that no attempt is made to swap-in THP here - this is still done
>> page-by-page, like for PMD-sized THP.
>>
>> The main change here is to improve the swap entry allocator so that it
>> can allocate any power-of-2 number of contiguous entries between [1, (1
>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>> order and allocating sequentially from it until the cluster is full.
>> This ensures that we don't need to search the map and we get no
>> fragmentation due to alignment padding for different orders in the
>> cluster. If there is no current cluster for a given order, we attempt to
>> allocate a free cluster from the list. If there are no free clusters, we
>> fail the allocation and the caller falls back to splitting the folio and
>> allocates individual entries (as per existing PMD-sized THP fallback).
>>
>> The per-order current clusters are maintained per-cpu using the existing
>> infrastructure. This is done to avoid interleving pages from different
>> tasks, which would prevent IO being batched. This is already done for
>> the order-0 allocations so we follow the same pattern.
>> __scan_swap_map_try_ssd_cluster() is introduced to deal with arbitrary
>> orders and scan_swap_map_try_ssd_cluster() is refactored as a wrapper
>> for order-0.
>>
>> As is done for order-0 per-cpu clusters, the scanner now can steal
>> order-0 entries from any per-cpu-per-order reserved cluster. This
>> ensures that when the swap file is getting full, space doesn't get tied
>> up in the per-cpu reserves.
>>
>> I've run the tests on Ampere Altra (arm64), set up with a 35G block ram
>> device as the swap device and from inside a memcg limited to 40G memory.
>> I've then run `usemem` from vm-scalability with 70 processes (each has
>> its own core), each allocating and writing 1G of memory. I've repeated
>> everything 5 times and taken the mean:
>>
>> Mean Performance Improvement vs 4K/baseline
>>
>> | alloc size |            baseline |       + this series |
>> |            |  v6.6-rc4+anonfolio |                     |
>> |:-----------|--------------------:|--------------------:|
>> | 4K Page    |                0.0% |                4.9% |
>> | 64K THP    |              -44.1% |               10.7% |
>> | 2M THP     |               56.0% |               65.9% |
>>
>> So with this change, the regression for 64K swap performance goes away
>> and 4K and 2M swap improves slightly too.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  include/linux/swap.h |  10 +--
>>  mm/swapfile.c        | 149 +++++++++++++++++++++++++++++++------------
>>  mm/vmscan.c          |  10 +--
>>  3 files changed, 119 insertions(+), 50 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 0ca8aaa098ba..ccbca5db851b 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -295,11 +295,11 @@ struct swap_info_struct {
>>         unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
>>         unsigned int __percpu *cpu_next;/*
>>                                          * Likely next allocation offset. We
>> -                                        * assign a cluster to each CPU, so each
>> -                                        * CPU can allocate swap entry from its
>> -                                        * own cluster and swapout sequentially.
>> -                                        * The purpose is to optimize swapout
>> -                                        * throughput.
>> +                                        * assign a cluster per-order to each
>> +                                        * CPU, so each CPU can allocate swap
>> +                                        * entry from its own cluster and
>> +                                        * swapout sequentially. The purpose is
>> +                                        * to optimize swapout throughput.
>>                                          */
>>         struct rb_root swap_extent_root;/* root of the swap extent rbtree */
>>         struct block_device *bdev;      /* swap device or bdev of swap file */
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 94f7cc225eb9..b50bce50bed9 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -545,10 +545,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>>
>>  /*
>>   * The cluster corresponding to page_nr will be used. The cluster will be
>> - * removed from free cluster list and its usage counter will be increased.
>> + * removed from free cluster list and its usage counter will be increased by
>> + * count.
>>   */
>> -static void inc_cluster_info_page(struct swap_info_struct *p,
>> -       struct swap_cluster_info *cluster_info, unsigned long page_nr)
>> +static void add_cluster_info_page(struct swap_info_struct *p,
>> +       struct swap_cluster_info *cluster_info, unsigned long page_nr,
>> +       unsigned long count)
>>  {
>>         unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>>
>> @@ -557,9 +559,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
>>         if (cluster_is_free(&cluster_info[idx]))
>>                 alloc_cluster(p, idx);
>>
>> -       VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
>> +       VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
>>         cluster_set_count(&cluster_info[idx],
>> -               cluster_count(&cluster_info[idx]) + 1);
>> +               cluster_count(&cluster_info[idx]) + count);
>> +}
>> +
>> +/*
>> + * The cluster corresponding to page_nr will be used. The cluster will be
>> + * removed from free cluster list and its usage counter will be increased.
>> + */
>> +static void inc_cluster_info_page(struct swap_info_struct *p,
>> +       struct swap_cluster_info *cluster_info, unsigned long page_nr)
>> +{
>> +       add_cluster_info_page(p, cluster_info, page_nr, 1);
>>  }
>>
>>  /*
>> @@ -588,8 +600,8 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
>>   * cluster list. Avoiding such abuse to avoid list corruption.
>>   */
>>  static bool
>> -scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>> -       unsigned long offset)
>> +__scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>> +       unsigned long offset, int order)
>>  {
>>         bool conflict;
>>
>> @@ -601,23 +613,36 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>         if (!conflict)
>>                 return false;
>>
>> -       *this_cpu_ptr(si->cpu_next) = SWAP_NEXT_NULL;
>> +       this_cpu_ptr(si->cpu_next)[order] = SWAP_NEXT_NULL;
>>         return true;
>>  }
>>
>>  /*
>> - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
>> - * might involve allocating a new cluster for current CPU too.
>> + * It's possible scan_swap_map_slots() uses a free cluster in the middle of free
>> + * cluster list. Avoiding such abuse to avoid list corruption.
>>   */
>> -static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>> -       unsigned long *offset, unsigned long *scan_base)
>> +static bool
>> +scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>> +       unsigned long offset)
>> +{
>> +       return __scan_swap_map_ssd_cluster_conflict(si, offset, 0);
>> +}
>> +
>> +/*
>> + * Try to get a swap entry (or size indicated by order) from current cpu's swap
>> + * entry pool (a cluster). This might involve allocating a new cluster for
>> + * current CPU too.
>> + */
>> +static bool __scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>> +       unsigned long *offset, unsigned long *scan_base, int order)
>>  {
>>         struct swap_cluster_info *ci;
>> -       unsigned int tmp, max;
>> +       unsigned int tmp, max, i;
>>         unsigned int *cpu_next;
>> +       unsigned int nr_pages = 1 << order;
>>
>>  new_cluster:
>> -       cpu_next = this_cpu_ptr(si->cpu_next);
>> +       cpu_next = &this_cpu_ptr(si->cpu_next)[order];
>>         tmp = *cpu_next;
>>         if (tmp == SWAP_NEXT_NULL) {
>>                 if (!cluster_list_empty(&si->free_clusters)) {
>> @@ -643,10 +668,12 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>          * reserve a new cluster.
>>          */
>>         ci = lock_cluster(si, tmp);
>> -       if (si->swap_map[tmp]) {
>> -               unlock_cluster(ci);
>> -               *cpu_next = SWAP_NEXT_NULL;
>> -               goto new_cluster;
>> +       for (i = 0; i < nr_pages; i++) {
>> +               if (si->swap_map[tmp + i]) {
>> +                       unlock_cluster(ci);
>> +                       *cpu_next = SWAP_NEXT_NULL;
>> +                       goto new_cluster;
>> +               }
>>         }
>>         unlock_cluster(ci);
>>
>> @@ -654,12 +681,22 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>         *scan_base = tmp;
>>
>>         max = ALIGN_DOWN(tmp, SWAPFILE_CLUSTER) + SWAPFILE_CLUSTER;
>> -       tmp += 1;
>> +       tmp += nr_pages;
>>         *cpu_next = tmp < max ? tmp : SWAP_NEXT_NULL;
>>
>>         return true;
>>  }
>>
>> +/*
>> + * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
>> + * might involve allocating a new cluster for current CPU too.
>> + */
>> +static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>> +       unsigned long *offset, unsigned long *scan_base)
>> +{
>> +       return __scan_swap_map_try_ssd_cluster(si, offset, scan_base, 0);
>> +}
>> +
>>  static void __del_from_avail_list(struct swap_info_struct *p)
>>  {
>>         int nid;
>> @@ -982,35 +1019,58 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>         return n_ret;
>>  }
>>
>> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
>> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
>> +                           unsigned int nr_pages)
>>  {
>> -       unsigned long idx;
>>         struct swap_cluster_info *ci;
>> -       unsigned long offset;
>> +       unsigned long offset, scan_base;
>> +       int order = ilog2(nr_pages);
>> +       bool ret;
>>
>>         /*
>> -        * Should not even be attempting cluster allocations when huge
>> +        * Should not even be attempting large allocations when huge
>>          * page swap is disabled.  Warn and fail the allocation.
>>          */
>> -       if (!IS_ENABLED(CONFIG_THP_SWAP)) {
>> +       if (!IS_ENABLED(CONFIG_THP_SWAP) ||
>> +           nr_pages < 2 || nr_pages > SWAPFILE_CLUSTER ||
>> +           !is_power_of_2(nr_pages)) {
>>                 VM_WARN_ON_ONCE(1);
>>                 return 0;
>>         }
>>
>> -       if (cluster_list_empty(&si->free_clusters))
>> +       /*
>> +        * Swapfile is not block device or not using clusters so unable to
>> +        * allocate large entries.
>> +        */
>> +       if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
>>                 return 0;
>>
>> -       idx = cluster_list_first(&si->free_clusters);
>> -       offset = idx * SWAPFILE_CLUSTER;
>> -       ci = lock_cluster(si, offset);
>> -       alloc_cluster(si, idx);
>> -       cluster_set_count(ci, SWAPFILE_CLUSTER);
>> +again:
>> +       /*
>> +        * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
>> +        * so indicate that we are scanning to synchronise with swapoff.
>> +        */
>> +       si->flags += SWP_SCANNING;
>> +       ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
>> +       si->flags -= SWP_SCANNING;
>> +
>> +       /*
>> +        * If we failed to allocate or if swapoff is waiting for us (due to lock
>> +        * being dropped for discard above), return immediately.
>> +        */
>> +       if (!ret || !(si->flags & SWP_WRITEOK))
>> +               return 0;
>>
>> -       memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
>> +       if (__scan_swap_map_ssd_cluster_conflict(si, offset, order))
>> +               goto again;
>> +
>> +       ci = lock_cluster(si, offset);
>> +       memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
>> +       add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>>         unlock_cluster(ci);
>> -       swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
>> -       *slot = swp_entry(si->type, offset);
>>
>> +       swap_range_alloc(si, offset, nr_pages);
>> +       *slot = swp_entry(si->type, offset);
>>         return 1;
>>  }
>>
>> @@ -1036,7 +1096,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>>         int node;
>>
>>         /* Only single cluster request supported */
>> -       WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
>> +       WARN_ON_ONCE(n_goal > 1 && size > 1);
>>
>>         spin_lock(&swap_avail_lock);
>>
>> @@ -1073,14 +1133,13 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>>                         spin_unlock(&si->lock);
>>                         goto nextsi;
>>                 }
>> -               if (size == SWAPFILE_CLUSTER) {
>> -                       if (si->flags & SWP_BLKDEV)
>> -                               n_ret = swap_alloc_cluster(si, swp_entries);
>> +               if (size > 1) {
>> +                       n_ret = swap_alloc_large(si, swp_entries, size);
>>                 } else
>>                         n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
>>                                                     n_goal, swp_entries);
>>                 spin_unlock(&si->lock);
>> -               if (n_ret || size == SWAPFILE_CLUSTER)
>> +               if (n_ret || size > 1)
>>                         goto check_out;
>>                 cond_resched();
>>
>> @@ -3041,6 +3100,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>         if (p->bdev && bdev_nonrot(p->bdev)) {
>>                 int cpu;
>>                 unsigned long ci, nr_cluster;
>> +               int nr_order;
>> +               int i;
>>
>>                 p->flags |= SWP_SOLIDSTATE;
>>                 p->cluster_next_cpu = alloc_percpu(unsigned int);
>> @@ -3068,13 +3129,19 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>                 for (ci = 0; ci < nr_cluster; ci++)
>>                         spin_lock_init(&((cluster_info + ci)->lock));
>>
>> -               p->cpu_next = alloc_percpu(unsigned int);
>> +               nr_order = IS_ENABLED(CONFIG_THP_SWAP) ? PMD_ORDER + 1 : 1;
>> +               p->cpu_next = __alloc_percpu(sizeof(unsigned int) * nr_order,
>> +                                            __alignof__(unsigned int));
>>                 if (!p->cpu_next) {
>>                         error = -ENOMEM;
>>                         goto bad_swap_unlock_inode;
>>                 }
>> -               for_each_possible_cpu(cpu)
>> -                       per_cpu(*p->cpu_next, cpu) = SWAP_NEXT_NULL;
>> +               for_each_possible_cpu(cpu) {
>> +                       unsigned int *cpu_next = per_cpu_ptr(p->cpu_next, cpu);
>> +
>> +                       for (i = 0; i < nr_order; i++)
>> +                               cpu_next[i] = SWAP_NEXT_NULL;
>> +               }
>>         } else {
>>                 atomic_inc(&nr_rotate_swap);
>>                 inced_nr_rotate_swap = true;
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 2cc0cb41fb32..ea19710aa4cd 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>                                         if (!can_split_folio(folio, NULL))
>>                                                 goto activate_locked;
>>                                         /*
>> -                                        * Split folios without a PMD map right
>> -                                        * away. Chances are some or all of the
>> -                                        * tail pages can be freed without IO.
>> +                                        * Split PMD-mappable folios without a
>> +                                        * PMD map right away. Chances are some
>> +                                        * or all of the tail pages can be freed
>> +                                        * without IO.
>>                                          */
>> -                                       if (!folio_entire_mapcount(folio) &&
>> +                                       if (folio_test_pmd_mappable(folio) &&
>> +                                           !folio_entire_mapcount(folio) &&
>>                                             split_folio_to_list(folio,
>>                                                                 folio_list))
>>                                                 goto activate_locked;
>> --
>> 2.25.1
>>
> 
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2023-11-02 10:21     ` Ryan Roberts
@ 2023-11-02 22:36       ` Barry Song
  2023-11-03 11:31         ` Ryan Roberts
  0 siblings, 1 reply; 116+ messages in thread
From: Barry Song @ 2023-11-02 22:36 UTC (permalink / raw)
  To: ryan.roberts
  Cc: 21cnbao, Steven.Price, akpm, david, linux-kernel, linux-mm,
	mhocko, shy828301, wangkefeng.wang, willy, xiang, ying.huang,
	yuzhao

> But, yes, it would be nice to fix that! And if I've understood the problem
> correctly, it doesn't sound like it should be too hard? Is this something you
> are volunteering for?? :)

Unfornately right now I haven't a real hardware with MTE which can run the latest
kernel. but i have written a RFC, it will be nice to get someone to test it. Let
me figure out if we can get someone :-)

[RFC PATCH] arm64: mm: swap: save and restore mte tags for large folios

This patch makes MTE tags saving and restoring support large folios,
then we don't need to split them into base pages for swapping on
ARM64 SoCs with MTE.

---
 arch/arm64/include/asm/pgtable.h | 21 ++++-----------------
 arch/arm64/mm/mteswap.c          | 20 ++++++++++++++++++++
 2 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7f7d9b1df4e5..b12783dca00a 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -45,12 +45,6 @@
 	__flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-static inline bool arch_thp_swp_supported(void)
-{
-	return !system_supports_mte();
-}
-#define arch_thp_swp_supported arch_thp_swp_supported
-
 /*
  * Outside of a few very special situations (e.g. hibernation), we always
  * use broadcast TLB invalidation instructions, therefore a spurious page
@@ -1028,12 +1022,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 #ifdef CONFIG_ARM64_MTE
 
 #define __HAVE_ARCH_PREPARE_TO_SWAP
-static inline int arch_prepare_to_swap(struct page *page)
-{
-	if (system_supports_mte())
-		return mte_save_tags(page);
-	return 0;
-}
+#define arch_prepare_to_swap arch_prepare_to_swap
+extern int arch_prepare_to_swap(struct page *page);
 
 #define __HAVE_ARCH_SWAP_INVALIDATE
 static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
@@ -1049,11 +1039,8 @@ static inline void arch_swap_invalidate_area(int type)
 }
 
 #define __HAVE_ARCH_SWAP_RESTORE
-static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
-{
-	if (system_supports_mte())
-		mte_restore_tags(entry, &folio->page);
-}
+#define arch_swap_restore arch_swap_restore
+extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
 
 #endif /* CONFIG_ARM64_MTE */
 
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index a31833e3ddc5..e5637e931e4f 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -83,3 +83,23 @@ void mte_invalidate_tags_area(int type)
 	}
 	xa_unlock(&mte_pages);
 }
+
+int arch_prepare_to_swap(struct page *page)
+{
+	if (system_supports_mte()) {
+		struct folio *folio = page_folio(page);
+		long i, nr = folio_nr_pages(folio);
+		for (i = 0; i < nr; i++)
+			return mte_save_tags(folio_page(folio, i));
+	}
+	return 0;
+}
+
+void arch_swap_restore(swp_entry_t entry, struct folio *folio)
+{
+	if (system_supports_mte()) {
+		long i, nr = folio_nr_pages(folio);
+		for (i = 0; i < nr; i++)
+			mte_restore_tags(entry, folio_page(folio, i));
+	}
+}
-- 
2.25.1

> Thanks,
> Ryan

Barry

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2023-11-02 22:36       ` Barry Song
@ 2023-11-03 11:31         ` Ryan Roberts
  2023-11-03 13:57           ` Steven Price
  2023-11-04  5:49           ` Barry Song
  0 siblings, 2 replies; 116+ messages in thread
From: Ryan Roberts @ 2023-11-03 11:31 UTC (permalink / raw)
  To: Barry Song
  Cc: Steven.Price, akpm, david, linux-kernel, linux-mm, mhocko,
	shy828301, wangkefeng.wang, willy, xiang, ying.huang, yuzhao

On 02/11/2023 22:36, Barry Song wrote:
>> But, yes, it would be nice to fix that! And if I've understood the problem
>> correctly, it doesn't sound like it should be too hard? Is this something you
>> are volunteering for?? :)
> 
> Unfornately right now I haven't a real hardware with MTE which can run the latest
> kernel. but i have written a RFC, it will be nice to get someone to test it. Let
> me figure out if we can get someone :-)

OK, let me know if you find someone. Otherwise I can have a hunt around to see
if I can test it.

> 
> [RFC PATCH] arm64: mm: swap: save and restore mte tags for large folios
> 
> This patch makes MTE tags saving and restoring support large folios,
> then we don't need to split them into base pages for swapping on
> ARM64 SoCs with MTE.
> 
> ---
>  arch/arm64/include/asm/pgtable.h | 21 ++++-----------------
>  arch/arm64/mm/mteswap.c          | 20 ++++++++++++++++++++
>  2 files changed, 24 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 7f7d9b1df4e5..b12783dca00a 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -45,12 +45,6 @@
>  	__flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> -static inline bool arch_thp_swp_supported(void)
> -{
> -	return !system_supports_mte();
> -}
> -#define arch_thp_swp_supported arch_thp_swp_supported

IIRC, arm64 was the only arch implementing this, so perhaps it should be ripped
out from the core code now?

> -
>  /*
>   * Outside of a few very special situations (e.g. hibernation), we always
>   * use broadcast TLB invalidation instructions, therefore a spurious page
> @@ -1028,12 +1022,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>  #ifdef CONFIG_ARM64_MTE
>  
>  #define __HAVE_ARCH_PREPARE_TO_SWAP
> -static inline int arch_prepare_to_swap(struct page *page)
> -{
> -	if (system_supports_mte())
> -		return mte_save_tags(page);
> -	return 0;
> -}
> +#define arch_prepare_to_swap arch_prepare_to_swap
> +extern int arch_prepare_to_swap(struct page *page);

I think it would be better to modify this API to take a folio explicitly. The
caller already has the folio.

>  
>  #define __HAVE_ARCH_SWAP_INVALIDATE
>  static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> @@ -1049,11 +1039,8 @@ static inline void arch_swap_invalidate_area(int type)
>  }
>  
>  #define __HAVE_ARCH_SWAP_RESTORE
> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> -{
> -	if (system_supports_mte())
> -		mte_restore_tags(entry, &folio->page);
> -}
> +#define arch_swap_restore arch_swap_restore
> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
>  
>  #endif /* CONFIG_ARM64_MTE */
>  
> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> index a31833e3ddc5..e5637e931e4f 100644
> --- a/arch/arm64/mm/mteswap.c
> +++ b/arch/arm64/mm/mteswap.c
> @@ -83,3 +83,23 @@ void mte_invalidate_tags_area(int type)
>  	}
>  	xa_unlock(&mte_pages);
>  }
> +
> +int arch_prepare_to_swap(struct page *page)
> +{
> +	if (system_supports_mte()) {
> +		struct folio *folio = page_folio(page);
> +		long i, nr = folio_nr_pages(folio);
> +		for (i = 0; i < nr; i++)
> +			return mte_save_tags(folio_page(folio, i));

This will return after saving the first page of the folio! You will need to add
each page in a loop, and if you get an error at any point, you will need to
remove the pages that you already added successfully, by calling
arch_swap_invalidate_page() as far as I can see. Steven can you confirm?

> +	}
> +	return 0;
> +}
> +
> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> +{
> +	if (system_supports_mte()) {
> +		long i, nr = folio_nr_pages(folio);
> +		for (i = 0; i < nr; i++)
> +			mte_restore_tags(entry, folio_page(folio, i));

swap-in currently doesn't support large folios - everything is a single page
folio. So this isn't technically needed. But from the API POV, it seems
reasonable to make this change - except your implementation is broken. You are
currently setting every page in the folio to use the same tags as the first
page. You need to increment the swap entry for each page.

Thanks,
Ryan


> +	}
> +}


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2023-10-31  8:12       ` Huang, Ying
@ 2023-11-03 11:42         ` Ryan Roberts
  0 siblings, 0 replies; 116+ messages in thread
From: Ryan Roberts @ 2023-11-03 11:42 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, linux-kernel,
	linux-mm

On 31/10/2023 08:12, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> On 30/10/2023 08:18, Huang, Ying wrote:
>>> Hi, Ryan,
>>>
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> The upcoming anonymous small-sized THP feature enables performance
>>>> improvements by allocating large folios for anonymous memory. However
>>>> I've observed that on an arm64 system running a parallel workload (e.g.
>>>> kernel compilation) across many cores, under high memory pressure, the
>>>> speed regresses. This is due to bottlenecking on the increased number of
>>>> TLBIs added due to all the extra folio splitting.
>>>>
>>>> Therefore, solve this regression by adding support for swapping out
>>>> small-sized THP without needing to split the folio, just like is already
>>>> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
>>>> enabled, and when the swap backing store is a non-rotating block device.
>>>> These are the same constraints as for the existing PMD-sized THP
>>>> swap-out support.
>>>>
>>>> Note that no attempt is made to swap-in THP here - this is still done
>>>> page-by-page, like for PMD-sized THP.
>>>>
>>>> The main change here is to improve the swap entry allocator so that it
>>>> can allocate any power-of-2 number of contiguous entries between [1, (1
>>>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>>>> order and allocating sequentially from it until the cluster is full.
>>>> This ensures that we don't need to search the map and we get no
>>>> fragmentation due to alignment padding for different orders in the
>>>> cluster. If there is no current cluster for a given order, we attempt to
>>>> allocate a free cluster from the list. If there are no free clusters, we
>>>> fail the allocation and the caller falls back to splitting the folio and
>>>> allocates individual entries (as per existing PMD-sized THP fallback).
>>>>
>>>> The per-order current clusters are maintained per-cpu using the existing
>>>> infrastructure. This is done to avoid interleving pages from different
>>>> tasks, which would prevent IO being batched. This is already done for
>>>> the order-0 allocations so we follow the same pattern.
>>>> __scan_swap_map_try_ssd_cluster() is introduced to deal with arbitrary
>>>> orders and scan_swap_map_try_ssd_cluster() is refactored as a wrapper
>>>> for order-0.
>>>>
>>>> As is done for order-0 per-cpu clusters, the scanner now can steal
>>>> order-0 entries from any per-cpu-per-order reserved cluster. This
>>>> ensures that when the swap file is getting full, space doesn't get tied
>>>> up in the per-cpu reserves.
>>>>
>>>> I've run the tests on Ampere Altra (arm64), set up with a 35G block ram
>>>> device as the swap device and from inside a memcg limited to 40G memory.
>>>> I've then run `usemem` from vm-scalability with 70 processes (each has
>>>> its own core), each allocating and writing 1G of memory. I've repeated
>>>> everything 5 times and taken the mean:
>>>>
>>>> Mean Performance Improvement vs 4K/baseline
>>>>
>>>> | alloc size |            baseline |       + this series |
>>>> |            |  v6.6-rc4+anonfolio |                     |
>>>> |:-----------|--------------------:|--------------------:|
>>>> | 4K Page    |                0.0% |                4.9% |
>>>> | 64K THP    |              -44.1% |               10.7% |
>>>> | 2M THP     |               56.0% |               65.9% |
>>>>
>>>> So with this change, the regression for 64K swap performance goes away
>>>> and 4K and 2M swap improves slightly too.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>  include/linux/swap.h |  10 +--
>>>>  mm/swapfile.c        | 149 +++++++++++++++++++++++++++++++------------
>>>>  mm/vmscan.c          |  10 +--
>>>>  3 files changed, 119 insertions(+), 50 deletions(-)
>>>>
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index 0ca8aaa098ba..ccbca5db851b 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -295,11 +295,11 @@ struct swap_info_struct {
>>>>  	unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
>>>>  	unsigned int __percpu *cpu_next;/*
>>>>  					 * Likely next allocation offset. We
>>>> -					 * assign a cluster to each CPU, so each
>>>> -					 * CPU can allocate swap entry from its
>>>> -					 * own cluster and swapout sequentially.
>>>> -					 * The purpose is to optimize swapout
>>>> -					 * throughput.
>>>> +					 * assign a cluster per-order to each
>>>> +					 * CPU, so each CPU can allocate swap
>>>> +					 * entry from its own cluster and
>>>> +					 * swapout sequentially. The purpose is
>>>> +					 * to optimize swapout throughput.
>>>>  					 */
>>>
>>> This is kind of hard to understand.  Better to define some intermediate
>>> data structure to improve readability.  For example,
>>>
>>> #ifdef CONFIG_THP_SWAP
>>> #define NR_SWAP_ORDER   PMD_ORDER
>>> #else
>>> #define NR_SWAP_ORDER   1
>>> #endif
>>>
>>> struct percpu_clusters {
>>>         unsigned int alloc_next[NR_SWAP_ORDER];
>>> };
>>>
>>> PMD_ORDER isn't a constant on powerpc, but THP_SWAP isn't supported on
>>> powerpc too.
>>
>> I get your point, but this is just making it more difficult for powerpc to ever
>> enable the feature in future - you're implicitly depending on !powerpc, which
>> seems fragile. How about if I change the first line of the coment to be "per-cpu
>> array indexed by allocation order"? Would that be enough?
> 
> Even if PMD_ORDER isn't constant on powerpc, it's not necessary for
> NR_SWAP_ORDER to be variable.  At least (1 << (NR_SWAP_ORDER-1)) should
> < SWAPFILE_CLUSTER.  When someone adds THP swap support on powerpc, he
> can choose a reasonable constant for NR_SWAP_ORDER (for example, 10 or
> 7).
> 
>>>
>>>>  	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
>>>>  	struct block_device *bdev;	/* swap device or bdev of swap file */
>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>> index 94f7cc225eb9..b50bce50bed9 100644
>>>> --- a/mm/swapfile.c
>>>> +++ b/mm/swapfile.c
>>>> @@ -545,10 +545,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>>>>  
>>>>  /*
>>>>   * The cluster corresponding to page_nr will be used. The cluster will be
>>>> - * removed from free cluster list and its usage counter will be increased.
>>>> + * removed from free cluster list and its usage counter will be increased by
>>>> + * count.
>>>>   */
>>>> -static void inc_cluster_info_page(struct swap_info_struct *p,
>>>> -	struct swap_cluster_info *cluster_info, unsigned long page_nr)
>>>> +static void add_cluster_info_page(struct swap_info_struct *p,
>>>> +	struct swap_cluster_info *cluster_info, unsigned long page_nr,
>>>> +	unsigned long count)
>>>>  {
>>>>  	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>>>>  
>>>> @@ -557,9 +559,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
>>>>  	if (cluster_is_free(&cluster_info[idx]))
>>>>  		alloc_cluster(p, idx);
>>>>  
>>>> -	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
>>>> +	VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
>>>>  	cluster_set_count(&cluster_info[idx],
>>>> -		cluster_count(&cluster_info[idx]) + 1);
>>>> +		cluster_count(&cluster_info[idx]) + count);
>>>> +}
>>>> +
>>>> +/*
>>>> + * The cluster corresponding to page_nr will be used. The cluster will be
>>>> + * removed from free cluster list and its usage counter will be increased.
>>>> + */
>>>> +static void inc_cluster_info_page(struct swap_info_struct *p,
>>>> +	struct swap_cluster_info *cluster_info, unsigned long page_nr)
>>>> +{
>>>> +	add_cluster_info_page(p, cluster_info, page_nr, 1);
>>>>  }
>>>>  
>>>>  /*
>>>> @@ -588,8 +600,8 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
>>>>   * cluster list. Avoiding such abuse to avoid list corruption.
>>>>   */
>>>>  static bool
>>>> -scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>>> -	unsigned long offset)
>>>> +__scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>>> +	unsigned long offset, int order)
>>>>  {
>>>>  	bool conflict;
>>>>  
>>>> @@ -601,23 +613,36 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>>>  	if (!conflict)
>>>>  		return false;
>>>>  
>>>> -	*this_cpu_ptr(si->cpu_next) = SWAP_NEXT_NULL;
>>>> +	this_cpu_ptr(si->cpu_next)[order] = SWAP_NEXT_NULL;
>>>
>>> This is added in the previous patch.  I don't think SWAP_NEXT_NULL is a
>>> good name.  Because NEXT isn't a pointer (while cluster_next is). Better
>>> to name it as SWAP_NEXT_INVALID, etc.
>>
>> ACK, will make change for next version.
> 
> Thanks!
> 
>>>
>>>>  	return true;
>>>>  }
>>>>  
>>>>  /*
>>>> - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
>>>> - * might involve allocating a new cluster for current CPU too.
>>>> + * It's possible scan_swap_map_slots() uses a free cluster in the middle of free
>>>> + * cluster list. Avoiding such abuse to avoid list corruption.
>>>>   */
>>>> -static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>>> -	unsigned long *offset, unsigned long *scan_base)
>>>> +static bool
>>>> +scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>>> +	unsigned long offset)
>>>> +{
>>>> +	return __scan_swap_map_ssd_cluster_conflict(si, offset, 0);
>>>> +}
>>>> +
>>>> +/*
>>>> + * Try to get a swap entry (or size indicated by order) from current cpu's swap
>>>> + * entry pool (a cluster). This might involve allocating a new cluster for
>>>> + * current CPU too.
>>>> + */
>>>> +static bool __scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>>> +	unsigned long *offset, unsigned long *scan_base, int order)
>>>>  {
>>>>  	struct swap_cluster_info *ci;
>>>> -	unsigned int tmp, max;
>>>> +	unsigned int tmp, max, i;
>>>>  	unsigned int *cpu_next;
>>>> +	unsigned int nr_pages = 1 << order;
>>>>  
>>>>  new_cluster:
>>>> -	cpu_next = this_cpu_ptr(si->cpu_next);
>>>> +	cpu_next = &this_cpu_ptr(si->cpu_next)[order];
>>>>  	tmp = *cpu_next;
>>>>  	if (tmp == SWAP_NEXT_NULL) {
>>>>  		if (!cluster_list_empty(&si->free_clusters)) {
>>>> @@ -643,10 +668,12 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>>>  	 * reserve a new cluster.
>>>>  	 */
>>>>  	ci = lock_cluster(si, tmp);
>>>> -	if (si->swap_map[tmp]) {
>>>> -		unlock_cluster(ci);
>>>> -		*cpu_next = SWAP_NEXT_NULL;
>>>> -		goto new_cluster;
>>>> +	for (i = 0; i < nr_pages; i++) {
>>>> +		if (si->swap_map[tmp + i]) {
>>>> +			unlock_cluster(ci);
>>>> +			*cpu_next = SWAP_NEXT_NULL;
>>>> +			goto new_cluster;
>>>> +		}
>>>>  	}
>>>>  	unlock_cluster(ci);
>>>>  
>>>> @@ -654,12 +681,22 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>>>  	*scan_base = tmp;
>>>>  
>>>>  	max = ALIGN_DOWN(tmp, SWAPFILE_CLUSTER) + SWAPFILE_CLUSTER;
>>>
>>> This line is added in a previous patch.  Can we just use
>>>
>>>         max = ALIGN(tmp + 1, SWAPFILE_CLUSTER);
>>
>> Sure. This is how I originally had it, but then decided that the other approach
>> was a bit clearer. But I don't have a strong opinion, so I'll change it as you
>> suggest.
> 
> Thanks!
> 
>>>
>>> Or, add ALIGN_UP() for this?
>>>
>>>> -	tmp += 1;
>>>> +	tmp += nr_pages;
>>>>  	*cpu_next = tmp < max ? tmp : SWAP_NEXT_NULL;
>>>>  
>>>>  	return true;
>>>>  }
>>>>  
>>>> +/*
>>>> + * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
>>>> + * might involve allocating a new cluster for current CPU too.
>>>> + */
>>>> +static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>>> +	unsigned long *offset, unsigned long *scan_base)
>>>> +{
>>>> +	return __scan_swap_map_try_ssd_cluster(si, offset, scan_base, 0);
>>>> +}
>>>> +
>>>>  static void __del_from_avail_list(struct swap_info_struct *p)
>>>>  {
>>>>  	int nid;
>>>> @@ -982,35 +1019,58 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>>  	return n_ret;
>>>>  }
>>>>  
>>>> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
>>>> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
>>>> +			    unsigned int nr_pages)
>>>
>>> IMHO, it's better to make scan_swap_map_slots() to support order > 0
>>> instead of making swap_alloc_cluster() to support order != PMD_ORDER.
>>> And, we may merge swap_alloc_cluster() with scan_swap_map_slots() after
>>> that.
>>
>> I did consider adding a 5th patch to rename swap_alloc_large() to something like
>> swap_alloc_one_ssd_entry() (which would then be used for order=0 too) and
>> refactor scan_swap_map_slots() to fully delegate to it for the non-scaning ssd
>> allocation case. Would something like that suit?
>>
>> I have reservations about making scan_swap_map_slots() take an order and be the
>> sole entry point:
>>
>>   - in the non-ssd case, we can't support order!=0
> 
> Don't need to check ssd directly, we only support order != 0 if
> si->cluster_info != NULL.
> 
>>   - there is a lot of other logic to deal with falling back to scanning which we
>>     would only want to do for order==0, so we would end up with a few ugly
>>     conditionals against order.
> 
> We don't need to care about them in most cases.  IIUC, only the "goto
> scan" after scan_swap_map_try_ssd_cluster() return false need to "goto
> no_page" for order != 0.
> 
>>   - I was concerned the risk of me introducing a bug when refactoring all that
>>     subtle logic was high
> 
> IMHO, readability is more important for long term maintenance.  So, we
> need to refactor the existing code for that.
> 
>> What do you think? Is not making scan_swap_map_slots() support order > 0 a deal
>> breaker for you?
> 
> I just think that it's better to use scan_swap_map_slots() for any order
> other than PMD_ORDER.  In that way, we share as much code as possible.

OK, I'll take a look at implementing it as you propose, although I likely won't
have bandwidth until start of December. Will repost once I have something.

Thanks,
Ryan

> 
> --
> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2023-11-03 11:31         ` Ryan Roberts
@ 2023-11-03 13:57           ` Steven Price
  2023-11-04  9:34             ` Barry Song
  2023-11-04  5:49           ` Barry Song
  1 sibling, 1 reply; 116+ messages in thread
From: Steven Price @ 2023-11-03 13:57 UTC (permalink / raw)
  To: Ryan Roberts, Barry Song
  Cc: akpm, david, linux-kernel, linux-mm, mhocko, shy828301,
	wangkefeng.wang, willy, xiang, ying.huang, yuzhao

On 03/11/2023 11:31, Ryan Roberts wrote:
> On 02/11/2023 22:36, Barry Song wrote:
>>> But, yes, it would be nice to fix that! And if I've understood the problem
>>> correctly, it doesn't sound like it should be too hard? Is this something you
>>> are volunteering for?? :)
>>
>> Unfornately right now I haven't a real hardware with MTE which can run the latest
>> kernel. but i have written a RFC, it will be nice to get someone to test it. Let
>> me figure out if we can get someone :-)
> 
> OK, let me know if you find someone. Otherwise I can have a hunt around to see
> if I can test it.
> 
>>
>> [RFC PATCH] arm64: mm: swap: save and restore mte tags for large folios
>>
>> This patch makes MTE tags saving and restoring support large folios,
>> then we don't need to split them into base pages for swapping on
>> ARM64 SoCs with MTE.
>>
>> ---
>>  arch/arm64/include/asm/pgtable.h | 21 ++++-----------------
>>  arch/arm64/mm/mteswap.c          | 20 ++++++++++++++++++++
>>  2 files changed, 24 insertions(+), 17 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 7f7d9b1df4e5..b12783dca00a 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -45,12 +45,6 @@
>>  	__flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>  
>> -static inline bool arch_thp_swp_supported(void)
>> -{
>> -	return !system_supports_mte();
>> -}
>> -#define arch_thp_swp_supported arch_thp_swp_supported
> 
> IIRC, arm64 was the only arch implementing this, so perhaps it should be ripped
> out from the core code now?
> 
>> -
>>  /*
>>   * Outside of a few very special situations (e.g. hibernation), we always
>>   * use broadcast TLB invalidation instructions, therefore a spurious page
>> @@ -1028,12 +1022,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>>  #ifdef CONFIG_ARM64_MTE
>>  
>>  #define __HAVE_ARCH_PREPARE_TO_SWAP
>> -static inline int arch_prepare_to_swap(struct page *page)
>> -{
>> -	if (system_supports_mte())
>> -		return mte_save_tags(page);
>> -	return 0;
>> -}
>> +#define arch_prepare_to_swap arch_prepare_to_swap
>> +extern int arch_prepare_to_swap(struct page *page);
> 
> I think it would be better to modify this API to take a folio explicitly. The
> caller already has the folio.
> 
>>  
>>  #define __HAVE_ARCH_SWAP_INVALIDATE
>>  static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
>> @@ -1049,11 +1039,8 @@ static inline void arch_swap_invalidate_area(int type)
>>  }
>>  
>>  #define __HAVE_ARCH_SWAP_RESTORE
>> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>> -{
>> -	if (system_supports_mte())
>> -		mte_restore_tags(entry, &folio->page);
>> -}
>> +#define arch_swap_restore arch_swap_restore
>> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
>>  
>>  #endif /* CONFIG_ARM64_MTE */
>>  
>> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
>> index a31833e3ddc5..e5637e931e4f 100644
>> --- a/arch/arm64/mm/mteswap.c
>> +++ b/arch/arm64/mm/mteswap.c
>> @@ -83,3 +83,23 @@ void mte_invalidate_tags_area(int type)
>>  	}
>>  	xa_unlock(&mte_pages);
>>  }
>> +
>> +int arch_prepare_to_swap(struct page *page)
>> +{
>> +	if (system_supports_mte()) {
>> +		struct folio *folio = page_folio(page);
>> +		long i, nr = folio_nr_pages(folio);
>> +		for (i = 0; i < nr; i++)
>> +			return mte_save_tags(folio_page(folio, i));
> 
> This will return after saving the first page of the folio! You will need to add
> each page in a loop, and if you get an error at any point, you will need to
> remove the pages that you already added successfully, by calling
> arch_swap_invalidate_page() as far as I can see. Steven can you confirm?

Yes that's right. mte_save_tags() needs to allocate memory so can fail
and if failing then arch_prepare_to_swap() would need to put things back
how they were with calls to mte_invalidate_tags() (although I think
you'd actually want to refactor to create a function which takes a
struct page *).

Steve

>> +	}
>> +	return 0;
>> +}
>> +
>> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>> +{
>> +	if (system_supports_mte()) {
>> +		long i, nr = folio_nr_pages(folio);
>> +		for (i = 0; i < nr; i++)
>> +			mte_restore_tags(entry, folio_page(folio, i));
> 
> swap-in currently doesn't support large folios - everything is a single page
> folio. So this isn't technically needed. But from the API POV, it seems
> reasonable to make this change - except your implementation is broken. You are
> currently setting every page in the folio to use the same tags as the first
> page. You need to increment the swap entry for each page.
> 
> Thanks,
> Ryan
> 
> 
>> +	}
>> +}
> 


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2023-11-03 11:31         ` Ryan Roberts
  2023-11-03 13:57           ` Steven Price
@ 2023-11-04  5:49           ` Barry Song
  1 sibling, 0 replies; 116+ messages in thread
From: Barry Song @ 2023-11-04  5:49 UTC (permalink / raw)
  To: ryan.roberts
  Cc: 21cnbao, Steven.Price, akpm, david, linux-kernel, linux-mm,
	mhocko, shy828301, wangkefeng.wang, willy, xiang, ying.huang,
	yuzhao

>>  #define __HAVE_ARCH_PREPARE_TO_SWAP
>> -static inline int arch_prepare_to_swap(struct page *page)
>> -{
>> -	if (system_supports_mte())
>> -		return mte_save_tags(page);
>> -	return 0;
>> -}
>> +#define arch_prepare_to_swap arch_prepare_to_swap
>> +extern int arch_prepare_to_swap(struct page *page);
> 
> I think it would be better to modify this API to take a folio explicitly. The
> caller already has the folio.

agree. that was actually what i thought I should change while making this rfc,
though i didn't do it.

>> +int arch_prepare_to_swap(struct page *page)
>> +{
>> +	if (system_supports_mte()) {
>> +		struct folio *folio = page_folio(page);
>> +		long i, nr = folio_nr_pages(folio);
>> +		for (i = 0; i < nr; i++)
>> +			return mte_save_tags(folio_page(folio, i));
>
> This will return after saving the first page of the folio! You will need to add
> each page in a loop, and if you get an error at any point, you will need to
> remove the pages that you already added successfully, by calling
> arch_swap_invalidate_page() as far as I can see. Steven can you confirm?

right. oops...

> 
>> +	}
>> +	return 0;
>> +}
>> +
>> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>> +{
>> +	if (system_supports_mte()) {
>> +		long i, nr = folio_nr_pages(folio);
>> +		for (i = 0; i < nr; i++)
>> +			mte_restore_tags(entry, folio_page(folio, i));
>
> swap-in currently doesn't support large folios - everything is a single page
> folio. So this isn't technically needed. But from the API POV, it seems
> reasonable to make this change - except your implementation is broken. You are
> currently setting every page in the folio to use the same tags as the first
> page. You need to increment the swap entry for each page.

one case is that we have a chance to "swapin" a folio which is still in swapcache
and hasn't been dropped yet. i mean the process's ptes have been swap entry, but
the large folio is still in swapcache. in this case, we will hit the swapcache
while swapping in, thus we are handling a large folio. in this case, it seems
we are restoring tags multiple times? i mean, if large folio has 16 basepages,
for each page fault of each base page, we are restoring a large folio, then
for 16 page faults, we are duplicating the restore.
any thought to handle this situation? should we move arch_swap_restore() to take
page rather than folio since swapin only supports basepage at this moment.

> Thanks,
> Ryan

Thanks
Barry


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2023-11-03 13:57           ` Steven Price
@ 2023-11-04  9:34             ` Barry Song
  2023-11-06 10:12               ` Steven Price
  2023-11-07 12:46               ` Ryan Roberts
  0 siblings, 2 replies; 116+ messages in thread
From: Barry Song @ 2023-11-04  9:34 UTC (permalink / raw)
  To: steven.price
  Cc: 21cnbao, akpm, david, linux-kernel, linux-mm, mhocko,
	ryan.roberts, shy828301, wangkefeng.wang, willy, xiang,
	ying.huang, yuzhao, Barry Song

> Yes that's right. mte_save_tags() needs to allocate memory so can fail
> and if failing then arch_prepare_to_swap() would need to put things back
> how they were with calls to mte_invalidate_tags() (although I think
> you'd actually want to refactor to create a function which takes a
> struct page *).
> 
> Steve

Thanks, Steve. combining all comments from You and Ryan, I made a v2.
One tricky thing is that we are restoring one page rather than folio
in arch_restore_swap() as we are only swapping in one page at this
stage.

[RFC v2 PATCH] arm64: mm: swap: save and restore mte tags for large folios

This patch makes MTE tags saving and restoring support large folios,
then we don't need to split them into base pages for swapping on
ARM64 SoCs with MTE.

This patch moves arch_prepare_to_swap() to take folio rather than
page, as we support THP swap-out as a whole. And this patch also
drops arch_thp_swp_supported() as ARM64 MTE is the only one who
needs it.

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 arch/arm64/include/asm/pgtable.h | 21 +++------------
 arch/arm64/mm/mteswap.c          | 44 ++++++++++++++++++++++++++++++++
 include/linux/huge_mm.h          | 12 ---------
 include/linux/pgtable.h          |  2 +-
 mm/page_io.c                     |  2 +-
 mm/swap_slots.c                  |  2 +-
 6 files changed, 51 insertions(+), 32 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b19a8aee684c..d8f523dc41e7 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -45,12 +45,6 @@
 	__flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-static inline bool arch_thp_swp_supported(void)
-{
-	return !system_supports_mte();
-}
-#define arch_thp_swp_supported arch_thp_swp_supported
-
 /*
  * Outside of a few very special situations (e.g. hibernation), we always
  * use broadcast TLB invalidation instructions, therefore a spurious page
@@ -1036,12 +1030,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 #ifdef CONFIG_ARM64_MTE
 
 #define __HAVE_ARCH_PREPARE_TO_SWAP
-static inline int arch_prepare_to_swap(struct page *page)
-{
-	if (system_supports_mte())
-		return mte_save_tags(page);
-	return 0;
-}
+#define arch_prepare_to_swap arch_prepare_to_swap
+extern int arch_prepare_to_swap(struct folio *folio);
 
 #define __HAVE_ARCH_SWAP_INVALIDATE
 static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
@@ -1057,11 +1047,8 @@ static inline void arch_swap_invalidate_area(int type)
 }
 
 #define __HAVE_ARCH_SWAP_RESTORE
-static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
-{
-	if (system_supports_mte())
-		mte_restore_tags(entry, &folio->page);
-}
+#define arch_swap_restore arch_swap_restore
+extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
 
 #endif /* CONFIG_ARM64_MTE */
 
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index a31833e3ddc5..14a479e4ea8e 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -68,6 +68,12 @@ void mte_invalidate_tags(int type, pgoff_t offset)
 	mte_free_tag_storage(tags);
 }
 
+static inline void __mte_invalidate_tags(struct page *page)
+{
+	swp_entry_t entry = page_swap_entry(page);
+	mte_invalidate_tags(swp_type(entry), swp_offset(entry));
+}
+
 void mte_invalidate_tags_area(int type)
 {
 	swp_entry_t entry = swp_entry(type, 0);
@@ -83,3 +89,41 @@ void mte_invalidate_tags_area(int type)
 	}
 	xa_unlock(&mte_pages);
 }
+
+int arch_prepare_to_swap(struct folio *folio)
+{
+	int err;
+	long i;
+
+	if (system_supports_mte()) {
+		long nr = folio_nr_pages(folio);
+		for (i = 0; i < nr; i++) {
+			err = mte_save_tags(folio_page(folio, i));
+			if (err)
+				goto out;
+		}
+	}
+	return 0;
+
+out:
+	while (--i)
+		__mte_invalidate_tags(folio_page(folio, i));
+	return err;
+}
+
+void arch_swap_restore(swp_entry_t entry, struct folio *folio)
+{
+	if (system_supports_mte()) {
+		/*
+		 * We don't support large folios swap in as whole yet, but
+		 * we can hit a large folio which is still in swapcache
+		 * after those related processes' PTEs have been unmapped
+		 * but before the swapcache folio  is dropped, in this case,
+		 * we need to find the exact page which "entry" is mapping
+		 * to. If we are not hitting swapcache, this folio won't be
+		 * large
+		 */
+		struct page *page = folio_file_page(folio, swp_offset(entry));
+		mte_restore_tags(entry, page);
+	}
+}
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index fa0350b0812a..f83fb8d5241e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -400,16 +400,4 @@ static inline int split_folio(struct folio *folio)
 	return split_folio_to_list(folio, NULL);
 }
 
-/*
- * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
- * limitations in the implementation like arm64 MTE can override this to
- * false
- */
-#ifndef arch_thp_swp_supported
-static inline bool arch_thp_swp_supported(void)
-{
-	return true;
-}
-#endif
-
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index af7639c3b0a3..33ab4ddd91dd 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -897,7 +897,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
  * prototypes must be defined in the arch-specific asm/pgtable.h file.
  */
 #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
-static inline int arch_prepare_to_swap(struct page *page)
+static inline int arch_prepare_to_swap(struct folio *folio)
 {
 	return 0;
 }
diff --git a/mm/page_io.c b/mm/page_io.c
index cb559ae324c6..0fd832474c1d 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 	 * Arch code may have to preserve more data than just the page
 	 * contents, e.g. memory tags.
 	 */
-	ret = arch_prepare_to_swap(&folio->page);
+	ret = arch_prepare_to_swap(folio);
 	if (ret) {
 		folio_mark_dirty(folio);
 		folio_unlock(folio);
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 0bec1f705f8e..2325adbb1f19 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -307,7 +307,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
 	entry.val = 0;
 
 	if (folio_test_large(folio)) {
-		if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
+		if (IS_ENABLED(CONFIG_THP_SWAP))
 			get_swap_pages(1, &entry, folio_nr_pages(folio));
 		goto out;
 	}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2023-11-04  9:34             ` Barry Song
@ 2023-11-06 10:12               ` Steven Price
  2023-11-06 21:39                 ` Barry Song
  2023-11-07 12:46               ` Ryan Roberts
  1 sibling, 1 reply; 116+ messages in thread
From: Steven Price @ 2023-11-06 10:12 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, david, linux-kernel, linux-mm, mhocko, ryan.roberts,
	shy828301, wangkefeng.wang, willy, xiang, ying.huang, yuzhao,
	Barry Song

On 04/11/2023 09:34, Barry Song wrote:
>> Yes that's right. mte_save_tags() needs to allocate memory so can fail
>> and if failing then arch_prepare_to_swap() would need to put things back
>> how they were with calls to mte_invalidate_tags() (although I think
>> you'd actually want to refactor to create a function which takes a
>> struct page *).
>>
>> Steve
> 
> Thanks, Steve. combining all comments from You and Ryan, I made a v2.
> One tricky thing is that we are restoring one page rather than folio
> in arch_restore_swap() as we are only swapping in one page at this
> stage.
> 
> [RFC v2 PATCH] arm64: mm: swap: save and restore mte tags for large folios
> 
> This patch makes MTE tags saving and restoring support large folios,
> then we don't need to split them into base pages for swapping on
> ARM64 SoCs with MTE.
> 
> This patch moves arch_prepare_to_swap() to take folio rather than
> page, as we support THP swap-out as a whole. And this patch also
> drops arch_thp_swp_supported() as ARM64 MTE is the only one who
> needs it.
> 
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  arch/arm64/include/asm/pgtable.h | 21 +++------------
>  arch/arm64/mm/mteswap.c          | 44 ++++++++++++++++++++++++++++++++
>  include/linux/huge_mm.h          | 12 ---------
>  include/linux/pgtable.h          |  2 +-
>  mm/page_io.c                     |  2 +-
>  mm/swap_slots.c                  |  2 +-
>  6 files changed, 51 insertions(+), 32 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index b19a8aee684c..d8f523dc41e7 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -45,12 +45,6 @@
>  	__flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> -static inline bool arch_thp_swp_supported(void)
> -{
> -	return !system_supports_mte();
> -}
> -#define arch_thp_swp_supported arch_thp_swp_supported
> -
>  /*
>   * Outside of a few very special situations (e.g. hibernation), we always
>   * use broadcast TLB invalidation instructions, therefore a spurious page
> @@ -1036,12 +1030,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>  #ifdef CONFIG_ARM64_MTE
>  
>  #define __HAVE_ARCH_PREPARE_TO_SWAP
> -static inline int arch_prepare_to_swap(struct page *page)
> -{
> -	if (system_supports_mte())
> -		return mte_save_tags(page);
> -	return 0;
> -}
> +#define arch_prepare_to_swap arch_prepare_to_swap
> +extern int arch_prepare_to_swap(struct folio *folio);
>  
>  #define __HAVE_ARCH_SWAP_INVALIDATE
>  static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> @@ -1057,11 +1047,8 @@ static inline void arch_swap_invalidate_area(int type)
>  }
>  
>  #define __HAVE_ARCH_SWAP_RESTORE
> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> -{
> -	if (system_supports_mte())
> -		mte_restore_tags(entry, &folio->page);
> -}
> +#define arch_swap_restore arch_swap_restore
> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
>  
>  #endif /* CONFIG_ARM64_MTE */
>  
> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> index a31833e3ddc5..14a479e4ea8e 100644
> --- a/arch/arm64/mm/mteswap.c
> +++ b/arch/arm64/mm/mteswap.c
> @@ -68,6 +68,12 @@ void mte_invalidate_tags(int type, pgoff_t offset)
>  	mte_free_tag_storage(tags);
>  }
>  
> +static inline void __mte_invalidate_tags(struct page *page)
> +{
> +	swp_entry_t entry = page_swap_entry(page);
> +	mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> +}
> +
>  void mte_invalidate_tags_area(int type)
>  {
>  	swp_entry_t entry = swp_entry(type, 0);
> @@ -83,3 +89,41 @@ void mte_invalidate_tags_area(int type)
>  	}
>  	xa_unlock(&mte_pages);
>  }
> +
> +int arch_prepare_to_swap(struct folio *folio)
> +{
> +	int err;
> +	long i;
> +
> +	if (system_supports_mte()) {
> +		long nr = folio_nr_pages(folio);
> +		for (i = 0; i < nr; i++) {
> +			err = mte_save_tags(folio_page(folio, i));
> +			if (err)
> +				goto out;
> +		}
> +	}
> +	return 0;
> +
> +out:
> +	while (--i)
> +		__mte_invalidate_tags(folio_page(folio, i));
> +	return err;
> +}
> +
> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> +{
> +	if (system_supports_mte()) {
> +		/*
> +		 * We don't support large folios swap in as whole yet, but
> +		 * we can hit a large folio which is still in swapcache
> +		 * after those related processes' PTEs have been unmapped
> +		 * but before the swapcache folio  is dropped, in this case,
> +		 * we need to find the exact page which "entry" is mapping
> +		 * to. If we are not hitting swapcache, this folio won't be
> +		 * large
> +		 */

Does it make sense to keep arch_swap_restore taking a folio? I'm not
sure I understand why the change was made in the first place. It just
seems odd to have a function taking a struct folio but making the
assumption that it's actually only a single page (and having to use
entry to figure out which page).

It seems particularly broken in the case of unuse_pte() which calls
page_folio() to get the folio in the first place.

Other than that it looks correct to me.

Thanks,

Steve

> +		struct page *page = folio_file_page(folio, swp_offset(entry));
> +		mte_restore_tags(entry, page);
> +	}
> +}
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index fa0350b0812a..f83fb8d5241e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -400,16 +400,4 @@ static inline int split_folio(struct folio *folio)
>  	return split_folio_to_list(folio, NULL);
>  }
>  
> -/*
> - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> - * limitations in the implementation like arm64 MTE can override this to
> - * false
> - */
> -#ifndef arch_thp_swp_supported
> -static inline bool arch_thp_swp_supported(void)
> -{
> -	return true;
> -}
> -#endif
> -
>  #endif /* _LINUX_HUGE_MM_H */
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index af7639c3b0a3..33ab4ddd91dd 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -897,7 +897,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
>   * prototypes must be defined in the arch-specific asm/pgtable.h file.
>   */
>  #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
> -static inline int arch_prepare_to_swap(struct page *page)
> +static inline int arch_prepare_to_swap(struct folio *folio)
>  {
>  	return 0;
>  }
> diff --git a/mm/page_io.c b/mm/page_io.c
> index cb559ae324c6..0fd832474c1d 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>  	 * Arch code may have to preserve more data than just the page
>  	 * contents, e.g. memory tags.
>  	 */
> -	ret = arch_prepare_to_swap(&folio->page);
> +	ret = arch_prepare_to_swap(folio);
>  	if (ret) {
>  		folio_mark_dirty(folio);
>  		folio_unlock(folio);
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index 0bec1f705f8e..2325adbb1f19 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -307,7 +307,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
>  	entry.val = 0;
>  
>  	if (folio_test_large(folio)) {
> -		if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> +		if (IS_ENABLED(CONFIG_THP_SWAP))
>  			get_swap_pages(1, &entry, folio_nr_pages(folio));
>  		goto out;
>  	}


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2023-11-06 10:12               ` Steven Price
@ 2023-11-06 21:39                 ` Barry Song
  2023-11-08 11:51                   ` Steven Price
  0 siblings, 1 reply; 116+ messages in thread
From: Barry Song @ 2023-11-06 21:39 UTC (permalink / raw)
  To: Steven Price
  Cc: akpm, david, linux-kernel, linux-mm, mhocko, ryan.roberts,
	shy828301, wangkefeng.wang, willy, xiang, ying.huang, yuzhao,
	Barry Song

On Mon, Nov 6, 2023 at 6:12 PM Steven Price <steven.price@arm.com> wrote:
>
> On 04/11/2023 09:34, Barry Song wrote:
> >> Yes that's right. mte_save_tags() needs to allocate memory so can fail
> >> and if failing then arch_prepare_to_swap() would need to put things back
> >> how they were with calls to mte_invalidate_tags() (although I think
> >> you'd actually want to refactor to create a function which takes a
> >> struct page *).
> >>
> >> Steve
> >
> > Thanks, Steve. combining all comments from You and Ryan, I made a v2.
> > One tricky thing is that we are restoring one page rather than folio
> > in arch_restore_swap() as we are only swapping in one page at this
> > stage.
> >
> > [RFC v2 PATCH] arm64: mm: swap: save and restore mte tags for large folios
> >
> > This patch makes MTE tags saving and restoring support large folios,
> > then we don't need to split them into base pages for swapping on
> > ARM64 SoCs with MTE.
> >
> > This patch moves arch_prepare_to_swap() to take folio rather than
> > page, as we support THP swap-out as a whole. And this patch also
> > drops arch_thp_swp_supported() as ARM64 MTE is the only one who
> > needs it.
> >
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  arch/arm64/include/asm/pgtable.h | 21 +++------------
> >  arch/arm64/mm/mteswap.c          | 44 ++++++++++++++++++++++++++++++++
> >  include/linux/huge_mm.h          | 12 ---------
> >  include/linux/pgtable.h          |  2 +-
> >  mm/page_io.c                     |  2 +-
> >  mm/swap_slots.c                  |  2 +-
> >  6 files changed, 51 insertions(+), 32 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index b19a8aee684c..d8f523dc41e7 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -45,12 +45,6 @@
> >       __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > -static inline bool arch_thp_swp_supported(void)
> > -{
> > -     return !system_supports_mte();
> > -}
> > -#define arch_thp_swp_supported arch_thp_swp_supported
> > -
> >  /*
> >   * Outside of a few very special situations (e.g. hibernation), we always
> >   * use broadcast TLB invalidation instructions, therefore a spurious page
> > @@ -1036,12 +1030,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> >  #ifdef CONFIG_ARM64_MTE
> >
> >  #define __HAVE_ARCH_PREPARE_TO_SWAP
> > -static inline int arch_prepare_to_swap(struct page *page)
> > -{
> > -     if (system_supports_mte())
> > -             return mte_save_tags(page);
> > -     return 0;
> > -}
> > +#define arch_prepare_to_swap arch_prepare_to_swap
> > +extern int arch_prepare_to_swap(struct folio *folio);
> >
> >  #define __HAVE_ARCH_SWAP_INVALIDATE
> >  static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> > @@ -1057,11 +1047,8 @@ static inline void arch_swap_invalidate_area(int type)
> >  }
> >
> >  #define __HAVE_ARCH_SWAP_RESTORE
> > -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > -{
> > -     if (system_supports_mte())
> > -             mte_restore_tags(entry, &folio->page);
> > -}
> > +#define arch_swap_restore arch_swap_restore
> > +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
> >
> >  #endif /* CONFIG_ARM64_MTE */
> >
> > diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> > index a31833e3ddc5..14a479e4ea8e 100644
> > --- a/arch/arm64/mm/mteswap.c
> > +++ b/arch/arm64/mm/mteswap.c
> > @@ -68,6 +68,12 @@ void mte_invalidate_tags(int type, pgoff_t offset)
> >       mte_free_tag_storage(tags);
> >  }
> >
> > +static inline void __mte_invalidate_tags(struct page *page)
> > +{
> > +     swp_entry_t entry = page_swap_entry(page);
> > +     mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> > +}
> > +
> >  void mte_invalidate_tags_area(int type)
> >  {
> >       swp_entry_t entry = swp_entry(type, 0);
> > @@ -83,3 +89,41 @@ void mte_invalidate_tags_area(int type)
> >       }
> >       xa_unlock(&mte_pages);
> >  }
> > +
> > +int arch_prepare_to_swap(struct folio *folio)
> > +{
> > +     int err;
> > +     long i;
> > +
> > +     if (system_supports_mte()) {
> > +             long nr = folio_nr_pages(folio);
> > +             for (i = 0; i < nr; i++) {
> > +                     err = mte_save_tags(folio_page(folio, i));
> > +                     if (err)
> > +                             goto out;
> > +             }
> > +     }
> > +     return 0;
> > +
> > +out:
> > +     while (--i)
> > +             __mte_invalidate_tags(folio_page(folio, i));
> > +     return err;
> > +}
> > +
> > +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > +{
> > +     if (system_supports_mte()) {
> > +             /*
> > +              * We don't support large folios swap in as whole yet, but
> > +              * we can hit a large folio which is still in swapcache
> > +              * after those related processes' PTEs have been unmapped
> > +              * but before the swapcache folio  is dropped, in this case,
> > +              * we need to find the exact page which "entry" is mapping
> > +              * to. If we are not hitting swapcache, this folio won't be
> > +              * large
> > +              */
>
> Does it make sense to keep arch_swap_restore taking a folio? I'm not
> sure I understand why the change was made in the first place. It just
> seems odd to have a function taking a struct folio but making the
> assumption that it's actually only a single page (and having to use
> entry to figure out which page).

Steve, let me give an example. in case we have a large anon folios with
16 pages.

while reclaiming, we do add_to_swap(), this folio is added to swapcache
as a whole; then we unmap the folio; in the last step,  we try to release
the folio.

we have a good chance some processes might access the virtual address
after the folio is unmapped but before the folio is finally released. thus,
do_swap_page() will find the large folio in swapcache, there is no I/O needed.

Let's assume processes read the 3rd page of the unmapped folio, in
do_swap_page(), the code is like,

vm_fault_t do_swap_page(struct vm_fault *vmf)
{
     swp_entry_t entry;
     ...
     entry = pte_to_swp_entry(vmf->orig_pte);

     folio = swap_cache_get_folio(entry, vma, vmf->address);
     if (folio)
           page = folio_file_page(folio, swp_offset(entry));

     arch_swap_restore(entry, folio);
}

entry points to the 3rd page, but folio points to the head page. so we
can't use the entry parameter to restore the whole folio in
arch_swap_restore()

then we have two choices in arch_swap_restore()
1. we get the 1st page's swap entry and restore all 16 tags in this large folio.
2. we restore the 3rd tag only by getting the right page in the folio

if we choose 1, in all 16 page faults of do_swap_page for the 16 unmapped
PTEs, we will restore 16*16=256 tags. One pte will have one page fault
since we don't restore 16 PTEs in do_swap_page().

if we choose 2, in all 16 pages fault of do_swap_page for the 16 unmapped
PTEs, we will only restore 16 *1=16 tags.

>
> It seems particularly broken in the case of unuse_pte() which calls
> page_folio() to get the folio in the first place.
>
> Other than that it looks correct to me.
>
> Thanks,
>
> Steve
>
> > +             struct page *page = folio_file_page(folio, swp_offset(entry));
> > +             mte_restore_tags(entry, page);
> > +     }
> > +}
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index fa0350b0812a..f83fb8d5241e 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -400,16 +400,4 @@ static inline int split_folio(struct folio *folio)
> >       return split_folio_to_list(folio, NULL);
> >  }
> >
> > -/*
> > - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> > - * limitations in the implementation like arm64 MTE can override this to
> > - * false
> > - */
> > -#ifndef arch_thp_swp_supported
> > -static inline bool arch_thp_swp_supported(void)
> > -{
> > -     return true;
> > -}
> > -#endif
> > -
> >  #endif /* _LINUX_HUGE_MM_H */
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index af7639c3b0a3..33ab4ddd91dd 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -897,7 +897,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
> >   * prototypes must be defined in the arch-specific asm/pgtable.h file.
> >   */
> >  #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
> > -static inline int arch_prepare_to_swap(struct page *page)
> > +static inline int arch_prepare_to_swap(struct folio *folio)
> >  {
> >       return 0;
> >  }
> > diff --git a/mm/page_io.c b/mm/page_io.c
> > index cb559ae324c6..0fd832474c1d 100644
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> > @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> >        * Arch code may have to preserve more data than just the page
> >        * contents, e.g. memory tags.
> >        */
> > -     ret = arch_prepare_to_swap(&folio->page);
> > +     ret = arch_prepare_to_swap(folio);
> >       if (ret) {
> >               folio_mark_dirty(folio);
> >               folio_unlock(folio);
> > diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> > index 0bec1f705f8e..2325adbb1f19 100644
> > --- a/mm/swap_slots.c
> > +++ b/mm/swap_slots.c
> > @@ -307,7 +307,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
> >       entry.val = 0;
> >
> >       if (folio_test_large(folio)) {
> > -             if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> > +             if (IS_ENABLED(CONFIG_THP_SWAP))
> >                       get_swap_pages(1, &entry, folio_nr_pages(folio));
> >               goto out;
> >       }
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2023-11-04  9:34             ` Barry Song
  2023-11-06 10:12               ` Steven Price
@ 2023-11-07 12:46               ` Ryan Roberts
  2023-11-07 18:05                 ` Barry Song
  1 sibling, 1 reply; 116+ messages in thread
From: Ryan Roberts @ 2023-11-07 12:46 UTC (permalink / raw)
  To: Barry Song, steven.price
  Cc: akpm, david, linux-kernel, linux-mm, mhocko, shy828301,
	wangkefeng.wang, willy, xiang, ying.huang, yuzhao, Barry Song,
	nd

On 04/11/2023 09:34, Barry Song wrote:
>> Yes that's right. mte_save_tags() needs to allocate memory so can fail
>> and if failing then arch_prepare_to_swap() would need to put things back
>> how they were with calls to mte_invalidate_tags() (although I think
>> you'd actually want to refactor to create a function which takes a
>> struct page *).
>>
>> Steve
> 
> Thanks, Steve. combining all comments from You and Ryan, I made a v2.
> One tricky thing is that we are restoring one page rather than folio
> in arch_restore_swap() as we are only swapping in one page at this
> stage.
> 
> [RFC v2 PATCH] arm64: mm: swap: save and restore mte tags for large folios
> 
> This patch makes MTE tags saving and restoring support large folios,
> then we don't need to split them into base pages for swapping on
> ARM64 SoCs with MTE.
> 
> This patch moves arch_prepare_to_swap() to take folio rather than
> page, as we support THP swap-out as a whole. And this patch also
> drops arch_thp_swp_supported() as ARM64 MTE is the only one who
> needs it.
> 
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  arch/arm64/include/asm/pgtable.h | 21 +++------------
>  arch/arm64/mm/mteswap.c          | 44 ++++++++++++++++++++++++++++++++
>  include/linux/huge_mm.h          | 12 ---------
>  include/linux/pgtable.h          |  2 +-
>  mm/page_io.c                     |  2 +-
>  mm/swap_slots.c                  |  2 +-
>  6 files changed, 51 insertions(+), 32 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index b19a8aee684c..d8f523dc41e7 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -45,12 +45,6 @@
>  	__flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> -static inline bool arch_thp_swp_supported(void)
> -{
> -	return !system_supports_mte();
> -}
> -#define arch_thp_swp_supported arch_thp_swp_supported
> -
>  /*
>   * Outside of a few very special situations (e.g. hibernation), we always
>   * use broadcast TLB invalidation instructions, therefore a spurious page
> @@ -1036,12 +1030,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>  #ifdef CONFIG_ARM64_MTE
>  
>  #define __HAVE_ARCH_PREPARE_TO_SWAP
> -static inline int arch_prepare_to_swap(struct page *page)
> -{
> -	if (system_supports_mte())
> -		return mte_save_tags(page);
> -	return 0;
> -}
> +#define arch_prepare_to_swap arch_prepare_to_swap
> +extern int arch_prepare_to_swap(struct folio *folio);
>  
>  #define __HAVE_ARCH_SWAP_INVALIDATE
>  static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> @@ -1057,11 +1047,8 @@ static inline void arch_swap_invalidate_area(int type)
>  }
>  
>  #define __HAVE_ARCH_SWAP_RESTORE
> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> -{
> -	if (system_supports_mte())
> -		mte_restore_tags(entry, &folio->page);
> -}
> +#define arch_swap_restore arch_swap_restore
> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
>  
>  #endif /* CONFIG_ARM64_MTE */
>  
> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> index a31833e3ddc5..14a479e4ea8e 100644
> --- a/arch/arm64/mm/mteswap.c
> +++ b/arch/arm64/mm/mteswap.c
> @@ -68,6 +68,12 @@ void mte_invalidate_tags(int type, pgoff_t offset)
>  	mte_free_tag_storage(tags);
>  }
>  
> +static inline void __mte_invalidate_tags(struct page *page)
> +{
> +	swp_entry_t entry = page_swap_entry(page);
> +	mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> +}
> +
>  void mte_invalidate_tags_area(int type)
>  {
>  	swp_entry_t entry = swp_entry(type, 0);
> @@ -83,3 +89,41 @@ void mte_invalidate_tags_area(int type)
>  	}
>  	xa_unlock(&mte_pages);
>  }
> +
> +int arch_prepare_to_swap(struct folio *folio)
> +{
> +	int err;
> +	long i;
> +
> +	if (system_supports_mte()) {
> +		long nr = folio_nr_pages(folio);

nit: there should be a clear line between variable declarations and logic.

> +		for (i = 0; i < nr; i++) {
> +			err = mte_save_tags(folio_page(folio, i));
> +			if (err)
> +				goto out;
> +		}
> +	}
> +	return 0;
> +
> +out:
> +	while (--i)

If i is initially > 0, this will fail to invalidate page 0. If i is initially 0
then it will wrap and run ~forever. I think you meant `while (i--)`?

> +		__mte_invalidate_tags(folio_page(folio, i));
> +	return err;
> +}
> +
> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> +{
> +	if (system_supports_mte()) {
> +		/*
> +		 * We don't support large folios swap in as whole yet, but
> +		 * we can hit a large folio which is still in swapcache
> +		 * after those related processes' PTEs have been unmapped
> +		 * but before the swapcache folio  is dropped, in this case,
> +		 * we need to find the exact page which "entry" is mapping
> +		 * to. If we are not hitting swapcache, this folio won't be
> +		 * large
> +		 */

So the currently defined API allows a large folio to be passed but the caller is
supposed to find the single correct page using the swap entry? That feels quite
nasty to me. And that's not what the old version of the function was doing; it
always assumed that the folio was small and passed the first page (which also
doesn't feel 'nice'). If the old version was wrong, I suggest a separate commit
to fix that. If the old version is correct, then I guess this version is wrong.

Thanks,
Ryan

> +		struct page *page = folio_file_page(folio, swp_offset(entry));
> +		mte_restore_tags(entry, page);
> +	}
> +}
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index fa0350b0812a..f83fb8d5241e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -400,16 +400,4 @@ static inline int split_folio(struct folio *folio)
>  	return split_folio_to_list(folio, NULL);
>  }
>  
> -/*
> - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> - * limitations in the implementation like arm64 MTE can override this to
> - * false
> - */
> -#ifndef arch_thp_swp_supported
> -static inline bool arch_thp_swp_supported(void)
> -{
> -	return true;
> -}
> -#endif
> -
>  #endif /* _LINUX_HUGE_MM_H */
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index af7639c3b0a3..33ab4ddd91dd 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -897,7 +897,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
>   * prototypes must be defined in the arch-specific asm/pgtable.h file.
>   */
>  #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
> -static inline int arch_prepare_to_swap(struct page *page)
> +static inline int arch_prepare_to_swap(struct folio *folio)
>  {
>  	return 0;
>  }
> diff --git a/mm/page_io.c b/mm/page_io.c
> index cb559ae324c6..0fd832474c1d 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>  	 * Arch code may have to preserve more data than just the page
>  	 * contents, e.g. memory tags.
>  	 */
> -	ret = arch_prepare_to_swap(&folio->page);
> +	ret = arch_prepare_to_swap(folio);
>  	if (ret) {
>  		folio_mark_dirty(folio);
>  		folio_unlock(folio);
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index 0bec1f705f8e..2325adbb1f19 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -307,7 +307,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
>  	entry.val = 0;
>  
>  	if (folio_test_large(folio)) {
> -		if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> +		if (IS_ENABLED(CONFIG_THP_SWAP))
>  			get_swap_pages(1, &entry, folio_nr_pages(folio));
>  		goto out;
>  	}


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2023-11-07 12:46               ` Ryan Roberts
@ 2023-11-07 18:05                 ` Barry Song
  2023-11-08 11:23                   ` Barry Song
  0 siblings, 1 reply; 116+ messages in thread
From: Barry Song @ 2023-11-07 18:05 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: steven.price, akpm, david, linux-kernel, linux-mm, mhocko,
	shy828301, wangkefeng.wang, willy, xiang, ying.huang, yuzhao,
	Barry Song, nd

On Tue, Nov 7, 2023 at 8:46 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 04/11/2023 09:34, Barry Song wrote:
> >> Yes that's right. mte_save_tags() needs to allocate memory so can fail
> >> and if failing then arch_prepare_to_swap() would need to put things back
> >> how they were with calls to mte_invalidate_tags() (although I think
> >> you'd actually want to refactor to create a function which takes a
> >> struct page *).
> >>
> >> Steve
> >
> > Thanks, Steve. combining all comments from You and Ryan, I made a v2.
> > One tricky thing is that we are restoring one page rather than folio
> > in arch_restore_swap() as we are only swapping in one page at this
> > stage.
> >
> > [RFC v2 PATCH] arm64: mm: swap: save and restore mte tags for large folios
> >
> > This patch makes MTE tags saving and restoring support large folios,
> > then we don't need to split them into base pages for swapping on
> > ARM64 SoCs with MTE.
> >
> > This patch moves arch_prepare_to_swap() to take folio rather than
> > page, as we support THP swap-out as a whole. And this patch also
> > drops arch_thp_swp_supported() as ARM64 MTE is the only one who
> > needs it.
> >
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  arch/arm64/include/asm/pgtable.h | 21 +++------------
> >  arch/arm64/mm/mteswap.c          | 44 ++++++++++++++++++++++++++++++++
> >  include/linux/huge_mm.h          | 12 ---------
> >  include/linux/pgtable.h          |  2 +-
> >  mm/page_io.c                     |  2 +-
> >  mm/swap_slots.c                  |  2 +-
> >  6 files changed, 51 insertions(+), 32 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index b19a8aee684c..d8f523dc41e7 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -45,12 +45,6 @@
> >       __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > -static inline bool arch_thp_swp_supported(void)
> > -{
> > -     return !system_supports_mte();
> > -}
> > -#define arch_thp_swp_supported arch_thp_swp_supported
> > -
> >  /*
> >   * Outside of a few very special situations (e.g. hibernation), we always
> >   * use broadcast TLB invalidation instructions, therefore a spurious page
> > @@ -1036,12 +1030,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> >  #ifdef CONFIG_ARM64_MTE
> >
> >  #define __HAVE_ARCH_PREPARE_TO_SWAP
> > -static inline int arch_prepare_to_swap(struct page *page)
> > -{
> > -     if (system_supports_mte())
> > -             return mte_save_tags(page);
> > -     return 0;
> > -}
> > +#define arch_prepare_to_swap arch_prepare_to_swap
> > +extern int arch_prepare_to_swap(struct folio *folio);
> >
> >  #define __HAVE_ARCH_SWAP_INVALIDATE
> >  static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> > @@ -1057,11 +1047,8 @@ static inline void arch_swap_invalidate_area(int type)
> >  }
> >
> >  #define __HAVE_ARCH_SWAP_RESTORE
> > -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > -{
> > -     if (system_supports_mte())
> > -             mte_restore_tags(entry, &folio->page);
> > -}
> > +#define arch_swap_restore arch_swap_restore
> > +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
> >
> >  #endif /* CONFIG_ARM64_MTE */
> >
> > diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> > index a31833e3ddc5..14a479e4ea8e 100644
> > --- a/arch/arm64/mm/mteswap.c
> > +++ b/arch/arm64/mm/mteswap.c
> > @@ -68,6 +68,12 @@ void mte_invalidate_tags(int type, pgoff_t offset)
> >       mte_free_tag_storage(tags);
> >  }
> >
> > +static inline void __mte_invalidate_tags(struct page *page)
> > +{
> > +     swp_entry_t entry = page_swap_entry(page);
> > +     mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> > +}
> > +
> >  void mte_invalidate_tags_area(int type)
> >  {
> >       swp_entry_t entry = swp_entry(type, 0);
> > @@ -83,3 +89,41 @@ void mte_invalidate_tags_area(int type)
> >       }
> >       xa_unlock(&mte_pages);
> >  }
> > +
> > +int arch_prepare_to_swap(struct folio *folio)
> > +{
> > +     int err;
> > +     long i;
> > +
> > +     if (system_supports_mte()) {
> > +             long nr = folio_nr_pages(folio);
>
> nit: there should be a clear line between variable declarations and logic.

right.

>
> > +             for (i = 0; i < nr; i++) {
> > +                     err = mte_save_tags(folio_page(folio, i));
> > +                     if (err)
> > +                             goto out;
> > +             }
> > +     }
> > +     return 0;
> > +
> > +out:
> > +     while (--i)
>
> If i is initially > 0, this will fail to invalidate page 0. If i is initially 0
> then it will wrap and run ~forever. I think you meant `while (i--)`?

nop. if i=0 and we goto out, that means the page0 has failed to save tags,
there is nothing to revert. if i=3 and we goto out, that means 0,1,2 have
saved, we restore 0,1,2 and we don't restore 3.

>
> > +             __mte_invalidate_tags(folio_page(folio, i));
> > +     return err;
> > +}
> > +
> > +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > +{
> > +     if (system_supports_mte()) {
> > +             /*
> > +              * We don't support large folios swap in as whole yet, but
> > +              * we can hit a large folio which is still in swapcache
> > +              * after those related processes' PTEs have been unmapped
> > +              * but before the swapcache folio  is dropped, in this case,
> > +              * we need to find the exact page which "entry" is mapping
> > +              * to. If we are not hitting swapcache, this folio won't be
> > +              * large
> > +              */
>
> So the currently defined API allows a large folio to be passed but the caller is
> supposed to find the single correct page using the swap entry? That feels quite
> nasty to me. And that's not what the old version of the function was doing; it
> always assumed that the folio was small and passed the first page (which also
> doesn't feel 'nice'). If the old version was wrong, I suggest a separate commit
> to fix that. If the old version is correct, then I guess this version is wrong.

the original version(mainline) is wrong but it works as once we find the SoCs
support MTE, we will split large folios into small pages. so only small pages
will be added into swapcache successfully.

but now we want to swap out large folios even on SoCs with MTE as a whole,
we don't split, so this breaks the assumption do_swap_page() will always get
small pages.

>
> Thanks,
> Ryan
>
> > +             struct page *page = folio_file_page(folio, swp_offset(entry));
> > +             mte_restore_tags(entry, page);
> > +     }
> > +}
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index fa0350b0812a..f83fb8d5241e 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -400,16 +400,4 @@ static inline int split_folio(struct folio *folio)
> >       return split_folio_to_list(folio, NULL);
> >  }
> >
> > -/*
> > - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> > - * limitations in the implementation like arm64 MTE can override this to
> > - * false
> > - */
> > -#ifndef arch_thp_swp_supported
> > -static inline bool arch_thp_swp_supported(void)
> > -{
> > -     return true;
> > -}
> > -#endif
> > -
> >  #endif /* _LINUX_HUGE_MM_H */
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index af7639c3b0a3..33ab4ddd91dd 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -897,7 +897,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
> >   * prototypes must be defined in the arch-specific asm/pgtable.h file.
> >   */
> >  #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
> > -static inline int arch_prepare_to_swap(struct page *page)
> > +static inline int arch_prepare_to_swap(struct folio *folio)
> >  {
> >       return 0;
> >  }
> > diff --git a/mm/page_io.c b/mm/page_io.c
> > index cb559ae324c6..0fd832474c1d 100644
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> > @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> >        * Arch code may have to preserve more data than just the page
> >        * contents, e.g. memory tags.
> >        */
> > -     ret = arch_prepare_to_swap(&folio->page);
> > +     ret = arch_prepare_to_swap(folio);
> >       if (ret) {
> >               folio_mark_dirty(folio);
> >               folio_unlock(folio);
> > diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> > index 0bec1f705f8e..2325adbb1f19 100644
> > --- a/mm/swap_slots.c
> > +++ b/mm/swap_slots.c
> > @@ -307,7 +307,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
> >       entry.val = 0;
> >
> >       if (folio_test_large(folio)) {
> > -             if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> > +             if (IS_ENABLED(CONFIG_THP_SWAP))
> >                       get_swap_pages(1, &entry, folio_nr_pages(folio));
> >               goto out;
> >       }
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2023-11-07 18:05                 ` Barry Song
@ 2023-11-08 11:23                   ` Barry Song
  2023-11-08 20:20                     ` Ryan Roberts
  0 siblings, 1 reply; 116+ messages in thread
From: Barry Song @ 2023-11-08 11:23 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: steven.price, akpm, david, linux-kernel, linux-mm, mhocko,
	shy828301, wangkefeng.wang, willy, xiang, ying.huang, yuzhao,
	Barry Song, nd

On Wed, Nov 8, 2023 at 2:05 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Nov 7, 2023 at 8:46 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> > On 04/11/2023 09:34, Barry Song wrote:
> > >> Yes that's right. mte_save_tags() needs to allocate memory so can fail
> > >> and if failing then arch_prepare_to_swap() would need to put things back
> > >> how they were with calls to mte_invalidate_tags() (although I think
> > >> you'd actually want to refactor to create a function which takes a
> > >> struct page *).
> > >>
> > >> Steve
> > >
> > > Thanks, Steve. combining all comments from You and Ryan, I made a v2.
> > > One tricky thing is that we are restoring one page rather than folio
> > > in arch_restore_swap() as we are only swapping in one page at this
> > > stage.
> > >
> > > [RFC v2 PATCH] arm64: mm: swap: save and restore mte tags for large folios
> > >
> > > This patch makes MTE tags saving and restoring support large folios,
> > > then we don't need to split them into base pages for swapping on
> > > ARM64 SoCs with MTE.
> > >
> > > This patch moves arch_prepare_to_swap() to take folio rather than
> > > page, as we support THP swap-out as a whole. And this patch also
> > > drops arch_thp_swp_supported() as ARM64 MTE is the only one who
> > > needs it.
> > >
> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > ---
> > >  arch/arm64/include/asm/pgtable.h | 21 +++------------
> > >  arch/arm64/mm/mteswap.c          | 44 ++++++++++++++++++++++++++++++++
> > >  include/linux/huge_mm.h          | 12 ---------
> > >  include/linux/pgtable.h          |  2 +-
> > >  mm/page_io.c                     |  2 +-
> > >  mm/swap_slots.c                  |  2 +-
> > >  6 files changed, 51 insertions(+), 32 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > > index b19a8aee684c..d8f523dc41e7 100644
> > > --- a/arch/arm64/include/asm/pgtable.h
> > > +++ b/arch/arm64/include/asm/pgtable.h
> > > @@ -45,12 +45,6 @@
> > >       __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> > >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> > >
> > > -static inline bool arch_thp_swp_supported(void)
> > > -{
> > > -     return !system_supports_mte();
> > > -}
> > > -#define arch_thp_swp_supported arch_thp_swp_supported
> > > -
> > >  /*
> > >   * Outside of a few very special situations (e.g. hibernation), we always
> > >   * use broadcast TLB invalidation instructions, therefore a spurious page
> > > @@ -1036,12 +1030,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> > >  #ifdef CONFIG_ARM64_MTE
> > >
> > >  #define __HAVE_ARCH_PREPARE_TO_SWAP
> > > -static inline int arch_prepare_to_swap(struct page *page)
> > > -{
> > > -     if (system_supports_mte())
> > > -             return mte_save_tags(page);
> > > -     return 0;
> > > -}
> > > +#define arch_prepare_to_swap arch_prepare_to_swap
> > > +extern int arch_prepare_to_swap(struct folio *folio);
> > >
> > >  #define __HAVE_ARCH_SWAP_INVALIDATE
> > >  static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> > > @@ -1057,11 +1047,8 @@ static inline void arch_swap_invalidate_area(int type)
> > >  }
> > >
> > >  #define __HAVE_ARCH_SWAP_RESTORE
> > > -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > > -{
> > > -     if (system_supports_mte())
> > > -             mte_restore_tags(entry, &folio->page);
> > > -}
> > > +#define arch_swap_restore arch_swap_restore
> > > +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
> > >
> > >  #endif /* CONFIG_ARM64_MTE */
> > >
> > > diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> > > index a31833e3ddc5..14a479e4ea8e 100644
> > > --- a/arch/arm64/mm/mteswap.c
> > > +++ b/arch/arm64/mm/mteswap.c
> > > @@ -68,6 +68,12 @@ void mte_invalidate_tags(int type, pgoff_t offset)
> > >       mte_free_tag_storage(tags);
> > >  }
> > >
> > > +static inline void __mte_invalidate_tags(struct page *page)
> > > +{
> > > +     swp_entry_t entry = page_swap_entry(page);
> > > +     mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> > > +}
> > > +
> > >  void mte_invalidate_tags_area(int type)
> > >  {
> > >       swp_entry_t entry = swp_entry(type, 0);
> > > @@ -83,3 +89,41 @@ void mte_invalidate_tags_area(int type)
> > >       }
> > >       xa_unlock(&mte_pages);
> > >  }
> > > +
> > > +int arch_prepare_to_swap(struct folio *folio)
> > > +{
> > > +     int err;
> > > +     long i;
> > > +
> > > +     if (system_supports_mte()) {
> > > +             long nr = folio_nr_pages(folio);
> >
> > nit: there should be a clear line between variable declarations and logic.
>
> right.
>
> >
> > > +             for (i = 0; i < nr; i++) {
> > > +                     err = mte_save_tags(folio_page(folio, i));
> > > +                     if (err)
> > > +                             goto out;
> > > +             }
> > > +     }
> > > +     return 0;
> > > +
> > > +out:
> > > +     while (--i)
> >
> > If i is initially > 0, this will fail to invalidate page 0. If i is initially 0
> > then it will wrap and run ~forever. I think you meant `while (i--)`?
>
> nop. if i=0 and we goto out, that means the page0 has failed to save tags,
> there is nothing to revert. if i=3 and we goto out, that means 0,1,2 have
> saved, we restore 0,1,2 and we don't restore 3.

I am terribly sorry for my previous noise. You are right, Ryan. i
actually meant i--.

>
> >
> > > +             __mte_invalidate_tags(folio_page(folio, i));
> > > +     return err;
> > > +}
> > > +
> > > +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > > +{
> > > +     if (system_supports_mte()) {
> > > +             /*
> > > +              * We don't support large folios swap in as whole yet, but
> > > +              * we can hit a large folio which is still in swapcache
> > > +              * after those related processes' PTEs have been unmapped
> > > +              * but before the swapcache folio  is dropped, in this case,
> > > +              * we need to find the exact page which "entry" is mapping
> > > +              * to. If we are not hitting swapcache, this folio won't be
> > > +              * large
> > > +              */
> >
> > So the currently defined API allows a large folio to be passed but the caller is
> > supposed to find the single correct page using the swap entry? That feels quite
> > nasty to me. And that's not what the old version of the function was doing; it
> > always assumed that the folio was small and passed the first page (which also
> > doesn't feel 'nice'). If the old version was wrong, I suggest a separate commit
> > to fix that. If the old version is correct, then I guess this version is wrong.
>
> the original version(mainline) is wrong but it works as once we find the SoCs
> support MTE, we will split large folios into small pages. so only small pages
> will be added into swapcache successfully.
>
> but now we want to swap out large folios even on SoCs with MTE as a whole,
> we don't split, so this breaks the assumption do_swap_page() will always get
> small pages.

let me clarify this more. The current mainline assumes
arch_swap_restore() always
get a folio with only one page. this is true as we split large folios
if we find SoCs
have MTE. but since we are dropping the split now, that means a large
folio can be
gotten by do_swap_page(). we have a chance that try_to_unmap_one() has been done
but folio is not put. so PTEs will have swap entry but folio is still
there, and do_swap_page()
to hit cache directly and the folio won't be released.

but after getting the large folio in do_swap_page, it still only takes
one basepage particularly
for the faulted PTE and maps this 4KB PTE only. so it uses the faulted
swap_entry and
the folio as parameters to call arch_swap_restore() which can be something like:

do_swap_page()
{
        arch_swap_restore(the swap entry for the faulted 4KB PTE, large folio);
}
>
> >
> > Thanks,
> > Ryan

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2023-11-06 21:39                 ` Barry Song
@ 2023-11-08 11:51                   ` Steven Price
  0 siblings, 0 replies; 116+ messages in thread
From: Steven Price @ 2023-11-08 11:51 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, david, linux-kernel, linux-mm, mhocko, ryan.roberts,
	shy828301, wangkefeng.wang, willy, xiang, ying.huang, yuzhao,
	Barry Song

On 06/11/2023 21:39, Barry Song wrote:
> On Mon, Nov 6, 2023 at 6:12 PM Steven Price <steven.price@arm.com> wrote:
>>
>> On 04/11/2023 09:34, Barry Song wrote:
>>>> Yes that's right. mte_save_tags() needs to allocate memory so can fail
>>>> and if failing then arch_prepare_to_swap() would need to put things back
>>>> how they were with calls to mte_invalidate_tags() (although I think
>>>> you'd actually want to refactor to create a function which takes a
>>>> struct page *).
>>>>
>>>> Steve
>>>
>>> Thanks, Steve. combining all comments from You and Ryan, I made a v2.
>>> One tricky thing is that we are restoring one page rather than folio
>>> in arch_restore_swap() as we are only swapping in one page at this
>>> stage.
>>>
>>> [RFC v2 PATCH] arm64: mm: swap: save and restore mte tags for large folios
>>>
>>> This patch makes MTE tags saving and restoring support large folios,
>>> then we don't need to split them into base pages for swapping on
>>> ARM64 SoCs with MTE.
>>>
>>> This patch moves arch_prepare_to_swap() to take folio rather than
>>> page, as we support THP swap-out as a whole. And this patch also
>>> drops arch_thp_swp_supported() as ARM64 MTE is the only one who
>>> needs it.
>>>
>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>> ---
>>>  arch/arm64/include/asm/pgtable.h | 21 +++------------
>>>  arch/arm64/mm/mteswap.c          | 44 ++++++++++++++++++++++++++++++++
>>>  include/linux/huge_mm.h          | 12 ---------
>>>  include/linux/pgtable.h          |  2 +-
>>>  mm/page_io.c                     |  2 +-
>>>  mm/swap_slots.c                  |  2 +-
>>>  6 files changed, 51 insertions(+), 32 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index b19a8aee684c..d8f523dc41e7 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -45,12 +45,6 @@
>>>       __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
>>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>>
>>> -static inline bool arch_thp_swp_supported(void)
>>> -{
>>> -     return !system_supports_mte();
>>> -}
>>> -#define arch_thp_swp_supported arch_thp_swp_supported
>>> -
>>>  /*
>>>   * Outside of a few very special situations (e.g. hibernation), we always
>>>   * use broadcast TLB invalidation instructions, therefore a spurious page
>>> @@ -1036,12 +1030,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>>>  #ifdef CONFIG_ARM64_MTE
>>>
>>>  #define __HAVE_ARCH_PREPARE_TO_SWAP
>>> -static inline int arch_prepare_to_swap(struct page *page)
>>> -{
>>> -     if (system_supports_mte())
>>> -             return mte_save_tags(page);
>>> -     return 0;
>>> -}
>>> +#define arch_prepare_to_swap arch_prepare_to_swap
>>> +extern int arch_prepare_to_swap(struct folio *folio);
>>>
>>>  #define __HAVE_ARCH_SWAP_INVALIDATE
>>>  static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
>>> @@ -1057,11 +1047,8 @@ static inline void arch_swap_invalidate_area(int type)
>>>  }
>>>
>>>  #define __HAVE_ARCH_SWAP_RESTORE
>>> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>>> -{
>>> -     if (system_supports_mte())
>>> -             mte_restore_tags(entry, &folio->page);
>>> -}
>>> +#define arch_swap_restore arch_swap_restore
>>> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
>>>
>>>  #endif /* CONFIG_ARM64_MTE */
>>>
>>> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
>>> index a31833e3ddc5..14a479e4ea8e 100644
>>> --- a/arch/arm64/mm/mteswap.c
>>> +++ b/arch/arm64/mm/mteswap.c
>>> @@ -68,6 +68,12 @@ void mte_invalidate_tags(int type, pgoff_t offset)
>>>       mte_free_tag_storage(tags);
>>>  }
>>>
>>> +static inline void __mte_invalidate_tags(struct page *page)
>>> +{
>>> +     swp_entry_t entry = page_swap_entry(page);
>>> +     mte_invalidate_tags(swp_type(entry), swp_offset(entry));
>>> +}
>>> +
>>>  void mte_invalidate_tags_area(int type)
>>>  {
>>>       swp_entry_t entry = swp_entry(type, 0);
>>> @@ -83,3 +89,41 @@ void mte_invalidate_tags_area(int type)
>>>       }
>>>       xa_unlock(&mte_pages);
>>>  }
>>> +
>>> +int arch_prepare_to_swap(struct folio *folio)
>>> +{
>>> +     int err;
>>> +     long i;
>>> +
>>> +     if (system_supports_mte()) {
>>> +             long nr = folio_nr_pages(folio);
>>> +             for (i = 0; i < nr; i++) {
>>> +                     err = mte_save_tags(folio_page(folio, i));
>>> +                     if (err)
>>> +                             goto out;
>>> +             }
>>> +     }
>>> +     return 0;
>>> +
>>> +out:
>>> +     while (--i)
>>> +             __mte_invalidate_tags(folio_page(folio, i));
>>> +     return err;
>>> +}
>>> +
>>> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>>> +{
>>> +     if (system_supports_mte()) {
>>> +             /*
>>> +              * We don't support large folios swap in as whole yet, but
>>> +              * we can hit a large folio which is still in swapcache
>>> +              * after those related processes' PTEs have been unmapped
>>> +              * but before the swapcache folio  is dropped, in this case,
>>> +              * we need to find the exact page which "entry" is mapping
>>> +              * to. If we are not hitting swapcache, this folio won't be
>>> +              * large
>>> +              */
>>
>> Does it make sense to keep arch_swap_restore taking a folio? I'm not
>> sure I understand why the change was made in the first place. It just
>> seems odd to have a function taking a struct folio but making the
>> assumption that it's actually only a single page (and having to use
>> entry to figure out which page).
> 
> Steve, let me give an example. in case we have a large anon folios with
> 16 pages.
> 
> while reclaiming, we do add_to_swap(), this folio is added to swapcache
> as a whole; then we unmap the folio; in the last step,  we try to release
> the folio.
> 
> we have a good chance some processes might access the virtual address
> after the folio is unmapped but before the folio is finally released. thus,
> do_swap_page() will find the large folio in swapcache, there is no I/O needed.
> 
> Let's assume processes read the 3rd page of the unmapped folio, in
> do_swap_page(), the code is like,
> 
> vm_fault_t do_swap_page(struct vm_fault *vmf)
> {
>      swp_entry_t entry;
>      ...
>      entry = pte_to_swp_entry(vmf->orig_pte);
> 
>      folio = swap_cache_get_folio(entry, vma, vmf->address);
>      if (folio)
>            page = folio_file_page(folio, swp_offset(entry));
> 
>      arch_swap_restore(entry, folio);
> }
> 
> entry points to the 3rd page, but folio points to the head page. so we
> can't use the entry parameter to restore the whole folio in
> arch_swap_restore()

Sorry, I don't think I explained myself very clearly. My issue was that
with your patch (and currently) we have the situation where
arch_swap_restore() can only restore a single page. But the function
takes a "struct folio *" argument.

Current mainline assumes that the folio is a single page, and with your
patch we now have a big comment explaining what's going on (bonus points
for that!) and pick out the correct page from the folio. What I'm
puzzled by is why the change was made in the first place to pass a
"struct folio *" - if we passed a "struct page *" that:

 a) It would be clear that the current API only allows a single page at
    a time.

 b) The correct page could be passed by the caller rather than
    arch_swap_restore() having to obtain the offset into the folio.

> then we have two choices in arch_swap_restore()
> 1. we get the 1st page's swap entry and restore all 16 tags in this large folio.
> 2. we restore the 3rd tag only by getting the right page in the folio
> 
> if we choose 1, in all 16 page faults of do_swap_page for the 16 unmapped
> PTEs, we will restore 16*16=256 tags. One pte will have one page fault
> since we don't restore 16 PTEs in do_swap_page().
> 
> if we choose 2, in all 16 pages fault of do_swap_page for the 16 unmapped
> PTEs, we will only restore 16 *1=16 tags.

So if we choose option 1 then we're changing the API of
arch_swap_restore() to actually restore the entire folio and it makes
sense to pass a "struct folio *" - and I'm happy with that. But AFAICT
that's not what your patch currently implements as it appears to be
doing option 2.

I'm quite happy to believe that the overhead of option 2 is small and
that might be the right solution, but at the moment we've got an API
which implies arch_swap_restore() should be operating on an entire folio.

Note that I don't have any particularly strong views on this - I've not
been following the folio work very closely, but I personally find it
confusing when a function takes a "struct folio *" but then operates on
only one page of it.

Steve

>>
>> It seems particularly broken in the case of unuse_pte() which calls
>> page_folio() to get the folio in the first place.
>>
>> Other than that it looks correct to me.
>>
>> Thanks,
>>
>> Steve
>>
>>> +             struct page *page = folio_file_page(folio, swp_offset(entry));
>>> +             mte_restore_tags(entry, page);
>>> +     }
>>> +}
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index fa0350b0812a..f83fb8d5241e 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -400,16 +400,4 @@ static inline int split_folio(struct folio *folio)
>>>       return split_folio_to_list(folio, NULL);
>>>  }
>>>
>>> -/*
>>> - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
>>> - * limitations in the implementation like arm64 MTE can override this to
>>> - * false
>>> - */
>>> -#ifndef arch_thp_swp_supported
>>> -static inline bool arch_thp_swp_supported(void)
>>> -{
>>> -     return true;
>>> -}
>>> -#endif
>>> -
>>>  #endif /* _LINUX_HUGE_MM_H */
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index af7639c3b0a3..33ab4ddd91dd 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -897,7 +897,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
>>>   * prototypes must be defined in the arch-specific asm/pgtable.h file.
>>>   */
>>>  #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
>>> -static inline int arch_prepare_to_swap(struct page *page)
>>> +static inline int arch_prepare_to_swap(struct folio *folio)
>>>  {
>>>       return 0;
>>>  }
>>> diff --git a/mm/page_io.c b/mm/page_io.c
>>> index cb559ae324c6..0fd832474c1d 100644
>>> --- a/mm/page_io.c
>>> +++ b/mm/page_io.c
>>> @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>>>        * Arch code may have to preserve more data than just the page
>>>        * contents, e.g. memory tags.
>>>        */
>>> -     ret = arch_prepare_to_swap(&folio->page);
>>> +     ret = arch_prepare_to_swap(folio);
>>>       if (ret) {
>>>               folio_mark_dirty(folio);
>>>               folio_unlock(folio);
>>> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
>>> index 0bec1f705f8e..2325adbb1f19 100644
>>> --- a/mm/swap_slots.c
>>> +++ b/mm/swap_slots.c
>>> @@ -307,7 +307,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
>>>       entry.val = 0;
>>>
>>>       if (folio_test_large(folio)) {
>>> -             if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
>>> +             if (IS_ENABLED(CONFIG_THP_SWAP))
>>>                       get_swap_pages(1, &entry, folio_nr_pages(folio));
>>>               goto out;
>>>       }
>>
> 
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2023-11-08 11:23                   ` Barry Song
@ 2023-11-08 20:20                     ` Ryan Roberts
  2023-11-08 21:04                       ` Barry Song
  0 siblings, 1 reply; 116+ messages in thread
From: Ryan Roberts @ 2023-11-08 20:20 UTC (permalink / raw)
  To: Barry Song
  Cc: steven.price, akpm, david, linux-kernel, linux-mm, mhocko,
	shy828301, wangkefeng.wang, willy, xiang, ying.huang, yuzhao,
	Barry Song, nd

On 08/11/2023 11:23, Barry Song wrote:
> On Wed, Nov 8, 2023 at 2:05 AM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Tue, Nov 7, 2023 at 8:46 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> On 04/11/2023 09:34, Barry Song wrote:
>>>>> Yes that's right. mte_save_tags() needs to allocate memory so can fail
>>>>> and if failing then arch_prepare_to_swap() would need to put things back
>>>>> how they were with calls to mte_invalidate_tags() (although I think
>>>>> you'd actually want to refactor to create a function which takes a
>>>>> struct page *).
>>>>>
>>>>> Steve
>>>>
>>>> Thanks, Steve. combining all comments from You and Ryan, I made a v2.
>>>> One tricky thing is that we are restoring one page rather than folio
>>>> in arch_restore_swap() as we are only swapping in one page at this
>>>> stage.
>>>>
>>>> [RFC v2 PATCH] arm64: mm: swap: save and restore mte tags for large folios
>>>>
>>>> This patch makes MTE tags saving and restoring support large folios,
>>>> then we don't need to split them into base pages for swapping on
>>>> ARM64 SoCs with MTE.
>>>>
>>>> This patch moves arch_prepare_to_swap() to take folio rather than
>>>> page, as we support THP swap-out as a whole. And this patch also
>>>> drops arch_thp_swp_supported() as ARM64 MTE is the only one who
>>>> needs it.
>>>>
>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>> ---
>>>>  arch/arm64/include/asm/pgtable.h | 21 +++------------
>>>>  arch/arm64/mm/mteswap.c          | 44 ++++++++++++++++++++++++++++++++
>>>>  include/linux/huge_mm.h          | 12 ---------
>>>>  include/linux/pgtable.h          |  2 +-
>>>>  mm/page_io.c                     |  2 +-
>>>>  mm/swap_slots.c                  |  2 +-
>>>>  6 files changed, 51 insertions(+), 32 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>> index b19a8aee684c..d8f523dc41e7 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -45,12 +45,6 @@
>>>>       __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
>>>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>>>
>>>> -static inline bool arch_thp_swp_supported(void)
>>>> -{
>>>> -     return !system_supports_mte();
>>>> -}
>>>> -#define arch_thp_swp_supported arch_thp_swp_supported
>>>> -
>>>>  /*
>>>>   * Outside of a few very special situations (e.g. hibernation), we always
>>>>   * use broadcast TLB invalidation instructions, therefore a spurious page
>>>> @@ -1036,12 +1030,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>>>>  #ifdef CONFIG_ARM64_MTE
>>>>
>>>>  #define __HAVE_ARCH_PREPARE_TO_SWAP
>>>> -static inline int arch_prepare_to_swap(struct page *page)
>>>> -{
>>>> -     if (system_supports_mte())
>>>> -             return mte_save_tags(page);
>>>> -     return 0;
>>>> -}
>>>> +#define arch_prepare_to_swap arch_prepare_to_swap
>>>> +extern int arch_prepare_to_swap(struct folio *folio);
>>>>
>>>>  #define __HAVE_ARCH_SWAP_INVALIDATE
>>>>  static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
>>>> @@ -1057,11 +1047,8 @@ static inline void arch_swap_invalidate_area(int type)
>>>>  }
>>>>
>>>>  #define __HAVE_ARCH_SWAP_RESTORE
>>>> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>>>> -{
>>>> -     if (system_supports_mte())
>>>> -             mte_restore_tags(entry, &folio->page);
>>>> -}
>>>> +#define arch_swap_restore arch_swap_restore
>>>> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
>>>>
>>>>  #endif /* CONFIG_ARM64_MTE */
>>>>
>>>> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
>>>> index a31833e3ddc5..14a479e4ea8e 100644
>>>> --- a/arch/arm64/mm/mteswap.c
>>>> +++ b/arch/arm64/mm/mteswap.c
>>>> @@ -68,6 +68,12 @@ void mte_invalidate_tags(int type, pgoff_t offset)
>>>>       mte_free_tag_storage(tags);
>>>>  }
>>>>
>>>> +static inline void __mte_invalidate_tags(struct page *page)
>>>> +{
>>>> +     swp_entry_t entry = page_swap_entry(page);
>>>> +     mte_invalidate_tags(swp_type(entry), swp_offset(entry));
>>>> +}
>>>> +
>>>>  void mte_invalidate_tags_area(int type)
>>>>  {
>>>>       swp_entry_t entry = swp_entry(type, 0);
>>>> @@ -83,3 +89,41 @@ void mte_invalidate_tags_area(int type)
>>>>       }
>>>>       xa_unlock(&mte_pages);
>>>>  }
>>>> +
>>>> +int arch_prepare_to_swap(struct folio *folio)
>>>> +{
>>>> +     int err;
>>>> +     long i;
>>>> +
>>>> +     if (system_supports_mte()) {
>>>> +             long nr = folio_nr_pages(folio);
>>>
>>> nit: there should be a clear line between variable declarations and logic.
>>
>> right.
>>
>>>
>>>> +             for (i = 0; i < nr; i++) {
>>>> +                     err = mte_save_tags(folio_page(folio, i));
>>>> +                     if (err)
>>>> +                             goto out;
>>>> +             }
>>>> +     }
>>>> +     return 0;
>>>> +
>>>> +out:
>>>> +     while (--i)
>>>
>>> If i is initially > 0, this will fail to invalidate page 0. If i is initially 0
>>> then it will wrap and run ~forever. I think you meant `while (i--)`?
>>
>> nop. if i=0 and we goto out, that means the page0 has failed to save tags,
>> there is nothing to revert. if i=3 and we goto out, that means 0,1,2 have
>> saved, we restore 0,1,2 and we don't restore 3.
> 
> I am terribly sorry for my previous noise. You are right, Ryan. i
> actually meant i--.

No problem - it saves me from writing a long response explaining why --i is
wrong, at least!

> 
>>
>>>
>>>> +             __mte_invalidate_tags(folio_page(folio, i));
>>>> +     return err;
>>>> +}
>>>> +
>>>> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>>>> +{
>>>> +     if (system_supports_mte()) {
>>>> +             /*
>>>> +              * We don't support large folios swap in as whole yet, but
>>>> +              * we can hit a large folio which is still in swapcache
>>>> +              * after those related processes' PTEs have been unmapped
>>>> +              * but before the swapcache folio  is dropped, in this case,
>>>> +              * we need to find the exact page which "entry" is mapping
>>>> +              * to. If we are not hitting swapcache, this folio won't be
>>>> +              * large
>>>> +              */
>>>
>>> So the currently defined API allows a large folio to be passed but the caller is
>>> supposed to find the single correct page using the swap entry? That feels quite
>>> nasty to me. And that's not what the old version of the function was doing; it
>>> always assumed that the folio was small and passed the first page (which also
>>> doesn't feel 'nice'). If the old version was wrong, I suggest a separate commit
>>> to fix that. If the old version is correct, then I guess this version is wrong.
>>
>> the original version(mainline) is wrong but it works as once we find the SoCs
>> support MTE, we will split large folios into small pages. so only small pages
>> will be added into swapcache successfully.
>>
>> but now we want to swap out large folios even on SoCs with MTE as a whole,
>> we don't split, so this breaks the assumption do_swap_page() will always get
>> small pages.
> 
> let me clarify this more. The current mainline assumes
> arch_swap_restore() always
> get a folio with only one page. this is true as we split large folios
> if we find SoCs
> have MTE. but since we are dropping the split now, that means a large
> folio can be
> gotten by do_swap_page(). we have a chance that try_to_unmap_one() has been done
> but folio is not put. so PTEs will have swap entry but folio is still
> there, and do_swap_page()
> to hit cache directly and the folio won't be released.
> 
> but after getting the large folio in do_swap_page, it still only takes
> one basepage particularly
> for the faulted PTE and maps this 4KB PTE only. so it uses the faulted
> swap_entry and
> the folio as parameters to call arch_swap_restore() which can be something like:
> 
> do_swap_page()
> {
>         arch_swap_restore(the swap entry for the faulted 4KB PTE, large folio);
> }

OK, I understand what's going on, but it seems like a bad API decision. I think
Steve is saying the same thing; If its only intended to operate on a single
page, it would be much clearer to pass the actual page rather than the folio;
i.e. leave the complexity of figuring out the target page to the caller, which
understands all this.

As a side note, if the folio is still in the cache, doesn't that imply that the
tags haven't been torn down yet? So perhaps you can avoid even making the call
in this case?

>>
>>>
>>> Thanks,
>>> Ryan
> 
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2023-11-08 20:20                     ` Ryan Roberts
@ 2023-11-08 21:04                       ` Barry Song
  0 siblings, 0 replies; 116+ messages in thread
From: Barry Song @ 2023-11-08 21:04 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: steven.price, akpm, david, linux-kernel, linux-mm, mhocko,
	shy828301, wangkefeng.wang, willy, xiang, ying.huang, yuzhao,
	Barry Song, nd

On Thu, Nov 9, 2023 at 4:21 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 08/11/2023 11:23, Barry Song wrote:
> > On Wed, Nov 8, 2023 at 2:05 AM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> On Tue, Nov 7, 2023 at 8:46 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>
> >>> On 04/11/2023 09:34, Barry Song wrote:
> >>>>> Yes that's right. mte_save_tags() needs to allocate memory so can fail
> >>>>> and if failing then arch_prepare_to_swap() would need to put things back
> >>>>> how they were with calls to mte_invalidate_tags() (although I think
> >>>>> you'd actually want to refactor to create a function which takes a
> >>>>> struct page *).
> >>>>>
> >>>>> Steve
> >>>>
> >>>> Thanks, Steve. combining all comments from You and Ryan, I made a v2.
> >>>> One tricky thing is that we are restoring one page rather than folio
> >>>> in arch_restore_swap() as we are only swapping in one page at this
> >>>> stage.
> >>>>
> >>>> [RFC v2 PATCH] arm64: mm: swap: save and restore mte tags for large folios
> >>>>
> >>>> This patch makes MTE tags saving and restoring support large folios,
> >>>> then we don't need to split them into base pages for swapping on
> >>>> ARM64 SoCs with MTE.
> >>>>
> >>>> This patch moves arch_prepare_to_swap() to take folio rather than
> >>>> page, as we support THP swap-out as a whole. And this patch also
> >>>> drops arch_thp_swp_supported() as ARM64 MTE is the only one who
> >>>> needs it.
> >>>>
> >>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>>> ---
> >>>>  arch/arm64/include/asm/pgtable.h | 21 +++------------
> >>>>  arch/arm64/mm/mteswap.c          | 44 ++++++++++++++++++++++++++++++++
> >>>>  include/linux/huge_mm.h          | 12 ---------
> >>>>  include/linux/pgtable.h          |  2 +-
> >>>>  mm/page_io.c                     |  2 +-
> >>>>  mm/swap_slots.c                  |  2 +-
> >>>>  6 files changed, 51 insertions(+), 32 deletions(-)
> >>>>
> >>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >>>> index b19a8aee684c..d8f523dc41e7 100644
> >>>> --- a/arch/arm64/include/asm/pgtable.h
> >>>> +++ b/arch/arm64/include/asm/pgtable.h
> >>>> @@ -45,12 +45,6 @@
> >>>>       __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> >>>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >>>>
> >>>> -static inline bool arch_thp_swp_supported(void)
> >>>> -{
> >>>> -     return !system_supports_mte();
> >>>> -}
> >>>> -#define arch_thp_swp_supported arch_thp_swp_supported
> >>>> -
> >>>>  /*
> >>>>   * Outside of a few very special situations (e.g. hibernation), we always
> >>>>   * use broadcast TLB invalidation instructions, therefore a spurious page
> >>>> @@ -1036,12 +1030,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> >>>>  #ifdef CONFIG_ARM64_MTE
> >>>>
> >>>>  #define __HAVE_ARCH_PREPARE_TO_SWAP
> >>>> -static inline int arch_prepare_to_swap(struct page *page)
> >>>> -{
> >>>> -     if (system_supports_mte())
> >>>> -             return mte_save_tags(page);
> >>>> -     return 0;
> >>>> -}
> >>>> +#define arch_prepare_to_swap arch_prepare_to_swap
> >>>> +extern int arch_prepare_to_swap(struct folio *folio);
> >>>>
> >>>>  #define __HAVE_ARCH_SWAP_INVALIDATE
> >>>>  static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> >>>> @@ -1057,11 +1047,8 @@ static inline void arch_swap_invalidate_area(int type)
> >>>>  }
> >>>>
> >>>>  #define __HAVE_ARCH_SWAP_RESTORE
> >>>> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> >>>> -{
> >>>> -     if (system_supports_mte())
> >>>> -             mte_restore_tags(entry, &folio->page);
> >>>> -}
> >>>> +#define arch_swap_restore arch_swap_restore
> >>>> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
> >>>>
> >>>>  #endif /* CONFIG_ARM64_MTE */
> >>>>
> >>>> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> >>>> index a31833e3ddc5..14a479e4ea8e 100644
> >>>> --- a/arch/arm64/mm/mteswap.c
> >>>> +++ b/arch/arm64/mm/mteswap.c
> >>>> @@ -68,6 +68,12 @@ void mte_invalidate_tags(int type, pgoff_t offset)
> >>>>       mte_free_tag_storage(tags);
> >>>>  }
> >>>>
> >>>> +static inline void __mte_invalidate_tags(struct page *page)
> >>>> +{
> >>>> +     swp_entry_t entry = page_swap_entry(page);
> >>>> +     mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> >>>> +}
> >>>> +
> >>>>  void mte_invalidate_tags_area(int type)
> >>>>  {
> >>>>       swp_entry_t entry = swp_entry(type, 0);
> >>>> @@ -83,3 +89,41 @@ void mte_invalidate_tags_area(int type)
> >>>>       }
> >>>>       xa_unlock(&mte_pages);
> >>>>  }
> >>>> +
> >>>> +int arch_prepare_to_swap(struct folio *folio)
> >>>> +{
> >>>> +     int err;
> >>>> +     long i;
> >>>> +
> >>>> +     if (system_supports_mte()) {
> >>>> +             long nr = folio_nr_pages(folio);
> >>>
> >>> nit: there should be a clear line between variable declarations and logic.
> >>
> >> right.
> >>
> >>>
> >>>> +             for (i = 0; i < nr; i++) {
> >>>> +                     err = mte_save_tags(folio_page(folio, i));
> >>>> +                     if (err)
> >>>> +                             goto out;
> >>>> +             }
> >>>> +     }
> >>>> +     return 0;
> >>>> +
> >>>> +out:
> >>>> +     while (--i)
> >>>
> >>> If i is initially > 0, this will fail to invalidate page 0. If i is initially 0
> >>> then it will wrap and run ~forever. I think you meant `while (i--)`?
> >>
> >> nop. if i=0 and we goto out, that means the page0 has failed to save tags,
> >> there is nothing to revert. if i=3 and we goto out, that means 0,1,2 have
> >> saved, we restore 0,1,2 and we don't restore 3.
> >
> > I am terribly sorry for my previous noise. You are right, Ryan. i
> > actually meant i--.
>
> No problem - it saves me from writing a long response explaining why --i is
> wrong, at least!
>
> >
> >>
> >>>
> >>>> +             __mte_invalidate_tags(folio_page(folio, i));
> >>>> +     return err;
> >>>> +}
> >>>> +
> >>>> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> >>>> +{
> >>>> +     if (system_supports_mte()) {
> >>>> +             /*
> >>>> +              * We don't support large folios swap in as whole yet, but
> >>>> +              * we can hit a large folio which is still in swapcache
> >>>> +              * after those related processes' PTEs have been unmapped
> >>>> +              * but before the swapcache folio  is dropped, in this case,
> >>>> +              * we need to find the exact page which "entry" is mapping
> >>>> +              * to. If we are not hitting swapcache, this folio won't be
> >>>> +              * large
> >>>> +              */
> >>>
> >>> So the currently defined API allows a large folio to be passed but the caller is
> >>> supposed to find the single correct page using the swap entry? That feels quite
> >>> nasty to me. And that's not what the old version of the function was doing; it
> >>> always assumed that the folio was small and passed the first page (which also
> >>> doesn't feel 'nice'). If the old version was wrong, I suggest a separate commit
> >>> to fix that. If the old version is correct, then I guess this version is wrong.
> >>
> >> the original version(mainline) is wrong but it works as once we find the SoCs
> >> support MTE, we will split large folios into small pages. so only small pages
> >> will be added into swapcache successfully.
> >>
> >> but now we want to swap out large folios even on SoCs with MTE as a whole,
> >> we don't split, so this breaks the assumption do_swap_page() will always get
> >> small pages.
> >
> > let me clarify this more. The current mainline assumes
> > arch_swap_restore() always
> > get a folio with only one page. this is true as we split large folios
> > if we find SoCs
> > have MTE. but since we are dropping the split now, that means a large
> > folio can be
> > gotten by do_swap_page(). we have a chance that try_to_unmap_one() has been done
> > but folio is not put. so PTEs will have swap entry but folio is still
> > there, and do_swap_page()
> > to hit cache directly and the folio won't be released.
> >
> > but after getting the large folio in do_swap_page, it still only takes
> > one basepage particularly
> > for the faulted PTE and maps this 4KB PTE only. so it uses the faulted
> > swap_entry and
> > the folio as parameters to call arch_swap_restore() which can be something like:
> >
> > do_swap_page()
> > {
> >         arch_swap_restore(the swap entry for the faulted 4KB PTE, large folio);
> > }
>
> OK, I understand what's going on, but it seems like a bad API decision. I think
> Steve is saying the same thing; If its only intended to operate on a single
> page, it would be much clearer to pass the actual page rather than the folio;
> i.e. leave the complexity of figuring out the target page to the caller, which
> understands all this.

right.

>
> As a side note, if the folio is still in the cache, doesn't that imply that the
> tags haven't been torn down yet? So perhaps you can avoid even making the call
> in this case?

right. but it is practically very hard as arch_swap_restore() is
always called unconditionally.
it is hard to find a decent condition before calling
arch_swap_restore(). That is why we
actually have been doing redundant arch_swap_restore() lots of times right now.

For example, A forks B,C,D,E,F,G. now A,B,C,D,E,F,G will share one
page before CoW. After
the page is swapped out, if B is the first process to swap in, B will
add the page to swapcache,
and restore MTE. After that, A, C, D, E, F,G will directly hit the
page swapped in by B, now they
restore MTE again. so the MTE is restored 7 times but actually only B
needs to do it.

so it seems we can put a condition to only let B do restore.  But it
won't work because we can't
guarrent B is the first process who will do PTE mapping.  A, C, D, E,
F, G can map PTEs earlier than B
even if B is the one who did the I/O swapin. swapin/add swapcache and
PTE mapping are
not done atomically. PTE mapping needs to take PTL. so After B has
done swapin, A, C,E,F,G
can still begin to use the page earlier than B. So it turns out anyone
who first maps the page
should restore MTE, but the question is that: How could A,B,C,D,E,F,G
know if it is the first
one mapping the page to PTE?

>
> >>
> >>>
> >>> Thanks,
> >>> Ryan
> >
> > Thanks
> > Barry
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 0/4] Swap-out small-sized THP without splitting
  2023-10-25 14:45 [PATCH v3 0/4] Swap-out small-sized THP without splitting Ryan Roberts
                   ` (3 preceding siblings ...)
  2023-10-25 14:45 ` [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting Ryan Roberts
@ 2023-11-29  7:47 ` Barry Song
  2023-11-29 12:06   ` Ryan Roberts
  2024-01-18 11:10 ` [PATCH RFC 0/6] mm: support large folios swap-in Barry Song
  5 siblings, 1 reply; 116+ messages in thread
From: Barry Song @ 2023-11-29  7:47 UTC (permalink / raw)
  To: ryan.roberts
  Cc: akpm, david, linux-kernel, linux-mm, mhocko, shy828301,
	wangkefeng.wang, willy, xiang, ying.huang, yuzhao, hanchuanhua

> Hi All,
> 
> This is v3 of a series to add support for swapping out small-sized THP without
> needing to first split the large folio via __split_huge_page(). It closely
> follows the approach already used by PMD-sized THP.
> 
> "Small-sized THP" is an upcoming feature that enables performance improvements
> by allocating large folios for anonymous memory, where the large folio size is
> smaller than the traditional PMD-size. See [3].
> 
> In some circumstances I've observed a performance regression (see patch 2 for
> details), and this series is an attempt to fix the regression in advance of
> merging small-sized THP support.
> 
> I've done what I thought was the smallest change possible, and as a result, this
> approach is only employed when the swap is backed by a non-rotating block device
> (just as PMD-sized THP is supported today). Discussion against the RFC concluded
> that this is probably sufficient.
> 
> The series applies against mm-unstable (1a3c85fa684a)
> 
> 
> Changes since v2 [2]
> ====================
> 
>  - Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
>    allocation. This required some refactoring to make everything work nicely
>    (new patches 2 and 3).
>  - Fix bug where nr_swap_pages would say there are pages available but the
>    scanner would not be able to allocate them because they were reserved for the
>    per-cpu allocator. We now allow stealing of order-0 entries from the high
>    order per-cpu clusters (in addition to exisiting stealing from order-0
>    per-cpu clusters).
> 
> Thanks to Huang, Ying for the review feedback and suggestions!
> 
> 
> Changes since v1 [1]
> ====================
> 
>  - patch 1:
>     - Use cluster_set_count() instead of cluster_set_count_flag() in
>       swap_alloc_cluster() since we no longer have any flag to set. I was unable
>       to kill cluster_set_count_flag() as proposed against v1 as other call
>       sites depend explicitly setting flags to 0.
>  - patch 2:
>     - Moved large_next[] array into percpu_cluster to make it per-cpu
>       (recommended by Huang, Ying).
>     - large_next[] array is dynamically allocated because PMD_ORDER is not
>       compile-time constant for powerpc (fixes build error).
> 
> 
> Thanks,
> Ryan

> P.S. I know we agreed this is not a prerequisite for merging small-sized THP,
> but given Huang Ying had provided some review feedback, I wanted to progress it.
> All the actual prerequisites are either complete or being worked on by others.
> 

Hi Ryan,

this is quite important to a phone and a must-have component, so is large-folio
swapin, as i explained to you in another email. 
Luckily, we are having Chuanhua Han(Cc-ed) to prepare a patchset of largefolio
swapin on top of your this patchset, probably a port and cleanup of our
do_swap_page[1] againest yours.

Another concern is that swapslots can be fragmented, if we place small/large folios
in a swap device, since large folios always require contiguous swapslot, we can
result in failure of getting slots even we still have many free slots which are not
contiguous. To avoid this, [2] dynamic hugepage solution have two swap devices,
one for basepage, the other one for CONTPTE. we have modified the priority-based
selection of swap devices to choose swap devices based on small/large folios.
i realize this approache is super ugly and might be very hard to find a way to
upstream though, it seems not universal especially if you are a linux server (-_-)

two devices are not a nice approach though it works well for a real product,
we might still need some decent way to address this problem while the problem
is for sure not a stopper of your patchset.

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L4648
[2] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/swapfile.c#L1129

> 
> [1] https://lore.kernel.org/linux-mm/20231010142111.3997780-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@arm.com/
> [3] https://lore.kernel.org/linux-mm/15a52c3d-9584-449b-8228-1335e0753b04@arm.com/
> 
> 
> Ryan Roberts (4):
>   mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
>   mm: swap: Remove struct percpu_cluster
>   mm: swap: Simplify ssd behavior when scanner steals entry
>   mm: swap: Swap-out small-sized THP without splitting
> 
>  include/linux/swap.h |  31 +++---
>  mm/huge_memory.c     |   3 -
>  mm/swapfile.c        | 232 ++++++++++++++++++++++++-------------------
>  mm/vmscan.c          |  10 +-
>  4 files changed, 149 insertions(+), 127 deletions(-)

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 0/4] Swap-out small-sized THP without splitting
  2023-11-29  7:47 ` [PATCH v3 0/4] " Barry Song
@ 2023-11-29 12:06   ` Ryan Roberts
  2023-11-29 20:38     ` Barry Song
  0 siblings, 1 reply; 116+ messages in thread
From: Ryan Roberts @ 2023-11-29 12:06 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, david, linux-kernel, linux-mm, mhocko, shy828301,
	wangkefeng.wang, willy, xiang, ying.huang, yuzhao, hanchuanhua

On 29/11/2023 07:47, Barry Song wrote:
>> Hi All,
>>
>> This is v3 of a series to add support for swapping out small-sized THP without
>> needing to first split the large folio via __split_huge_page(). It closely
>> follows the approach already used by PMD-sized THP.
>>
>> "Small-sized THP" is an upcoming feature that enables performance improvements
>> by allocating large folios for anonymous memory, where the large folio size is
>> smaller than the traditional PMD-size. See [3].
>>
>> In some circumstances I've observed a performance regression (see patch 2 for
>> details), and this series is an attempt to fix the regression in advance of
>> merging small-sized THP support.
>>
>> I've done what I thought was the smallest change possible, and as a result, this
>> approach is only employed when the swap is backed by a non-rotating block device
>> (just as PMD-sized THP is supported today). Discussion against the RFC concluded
>> that this is probably sufficient.
>>
>> The series applies against mm-unstable (1a3c85fa684a)
>>
>>
>> Changes since v2 [2]
>> ====================
>>
>>  - Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
>>    allocation. This required some refactoring to make everything work nicely
>>    (new patches 2 and 3).
>>  - Fix bug where nr_swap_pages would say there are pages available but the
>>    scanner would not be able to allocate them because they were reserved for the
>>    per-cpu allocator. We now allow stealing of order-0 entries from the high
>>    order per-cpu clusters (in addition to exisiting stealing from order-0
>>    per-cpu clusters).
>>
>> Thanks to Huang, Ying for the review feedback and suggestions!
>>
>>
>> Changes since v1 [1]
>> ====================
>>
>>  - patch 1:
>>     - Use cluster_set_count() instead of cluster_set_count_flag() in
>>       swap_alloc_cluster() since we no longer have any flag to set. I was unable
>>       to kill cluster_set_count_flag() as proposed against v1 as other call
>>       sites depend explicitly setting flags to 0.
>>  - patch 2:
>>     - Moved large_next[] array into percpu_cluster to make it per-cpu
>>       (recommended by Huang, Ying).
>>     - large_next[] array is dynamically allocated because PMD_ORDER is not
>>       compile-time constant for powerpc (fixes build error).
>>
>>
>> Thanks,
>> Ryan
> 
>> P.S. I know we agreed this is not a prerequisite for merging small-sized THP,
>> but given Huang Ying had provided some review feedback, I wanted to progress it.
>> All the actual prerequisites are either complete or being worked on by others.
>>
> 
> Hi Ryan,
> 
> this is quite important to a phone and a must-have component, so is large-folio
> swapin, as i explained to you in another email. 

Yes understood; the "prerequisites" are just the things that must be merged
*before* small-sized THP to ensure we don't regress existing behaviour or to
ensure that small-size THP is correct/robust when enabled. Performance
improvements can be merged after the initial small-sized series.

> Luckily, we are having Chuanhua Han(Cc-ed) to prepare a patchset of largefolio
> swapin on top of your this patchset, probably a port and cleanup of our
> do_swap_page[1] againest yours.

That's great to hear - welcome aboard, Chuanhua Han! Feel free to reach out if
you have questions.

I would guess that any large swap-in changes would be independent of this
swap-out patch though? Wouldn't you just be looking for contiguous swap entries
in the page table to determine a suitable folio order, then swap-in each of
those entries into the folio? And if they happen to have contiguous swap offsets
(enabled by this swap-out series) then you potentially get a batched disk access
benefit.

That's just a guess though, perhaps you can describe your proposed approach?

> 
> Another concern is that swapslots can be fragmented, if we place small/large folios
> in a swap device, since large folios always require contiguous swapslot, we can
> result in failure of getting slots even we still have many free slots which are not
> contiguous.

This series tries to mitigate that problem by reserving a swap cluster per
order. That works well until we run out of swap clusters; a cluster can't be
freed until all contained swap entries are swapped back in and deallocated.

But I think we should start with the simple approach first, and only solve the
problems as they arise through real testing.

 To avoid this, [2] dynamic hugepage solution have two swap devices,
> one for basepage, the other one for CONTPTE. we have modified the priority-based
> selection of swap devices to choose swap devices based on small/large folios.
> i realize this approache is super ugly and might be very hard to find a way to
> upstream though, it seems not universal especially if you are a linux server (-_-)
> 
> two devices are not a nice approach though it works well for a real product,
> we might still need some decent way to address this problem while the problem
> is for sure not a stopper of your patchset.

I guess that approach works for your case because A) you only have 2 sizes, and
B) your swap device is zRAM, which dynamically allocate RAM as it needs it.

The upstream small-sized THP solution can support multiple sizes, so you would
need a swap device per size (I think 13 is the limit at the moment - PMD size
for 64K base page). And if your swap device is a physical block device, you
can't dynamically parition it the way you can with zRAM. Nether of those things
scale particularly well IMHO.

> 
> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L4648
> [2] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/swapfile.c#L1129
> 
>>
>> [1] https://lore.kernel.org/linux-mm/20231010142111.3997780-1-ryan.roberts@arm.com/
>> [2] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@arm.com/
>> [3] https://lore.kernel.org/linux-mm/15a52c3d-9584-449b-8228-1335e0753b04@arm.com/
>>
>>
>> Ryan Roberts (4):
>>   mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
>>   mm: swap: Remove struct percpu_cluster
>>   mm: swap: Simplify ssd behavior when scanner steals entry
>>   mm: swap: Swap-out small-sized THP without splitting
>>
>>  include/linux/swap.h |  31 +++---
>>  mm/huge_memory.c     |   3 -
>>  mm/swapfile.c        | 232 ++++++++++++++++++++++++-------------------
>>  mm/vmscan.c          |  10 +-
>>  4 files changed, 149 insertions(+), 127 deletions(-)
> 
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 0/4] Swap-out small-sized THP without splitting
  2023-11-29 12:06   ` Ryan Roberts
@ 2023-11-29 20:38     ` Barry Song
  0 siblings, 0 replies; 116+ messages in thread
From: Barry Song @ 2023-11-29 20:38 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: akpm, david, linux-kernel, linux-mm, mhocko, shy828301,
	wangkefeng.wang, willy, xiang, ying.huang, yuzhao, hanchuanhua

On Thu, Nov 30, 2023 at 1:06 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 29/11/2023 07:47, Barry Song wrote:
> >> Hi All,
> >>
> >> This is v3 of a series to add support for swapping out small-sized THP without
> >> needing to first split the large folio via __split_huge_page(). It closely
> >> follows the approach already used by PMD-sized THP.
> >>
> >> "Small-sized THP" is an upcoming feature that enables performance improvements
> >> by allocating large folios for anonymous memory, where the large folio size is
> >> smaller than the traditional PMD-size. See [3].
> >>
> >> In some circumstances I've observed a performance regression (see patch 2 for
> >> details), and this series is an attempt to fix the regression in advance of
> >> merging small-sized THP support.
> >>
> >> I've done what I thought was the smallest change possible, and as a result, this
> >> approach is only employed when the swap is backed by a non-rotating block device
> >> (just as PMD-sized THP is supported today). Discussion against the RFC concluded
> >> that this is probably sufficient.
> >>
> >> The series applies against mm-unstable (1a3c85fa684a)
> >>
> >>
> >> Changes since v2 [2]
> >> ====================
> >>
> >>  - Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
> >>    allocation. This required some refactoring to make everything work nicely
> >>    (new patches 2 and 3).
> >>  - Fix bug where nr_swap_pages would say there are pages available but the
> >>    scanner would not be able to allocate them because they were reserved for the
> >>    per-cpu allocator. We now allow stealing of order-0 entries from the high
> >>    order per-cpu clusters (in addition to exisiting stealing from order-0
> >>    per-cpu clusters).
> >>
> >> Thanks to Huang, Ying for the review feedback and suggestions!
> >>
> >>
> >> Changes since v1 [1]
> >> ====================
> >>
> >>  - patch 1:
> >>     - Use cluster_set_count() instead of cluster_set_count_flag() in
> >>       swap_alloc_cluster() since we no longer have any flag to set. I was unable
> >>       to kill cluster_set_count_flag() as proposed against v1 as other call
> >>       sites depend explicitly setting flags to 0.
> >>  - patch 2:
> >>     - Moved large_next[] array into percpu_cluster to make it per-cpu
> >>       (recommended by Huang, Ying).
> >>     - large_next[] array is dynamically allocated because PMD_ORDER is not
> >>       compile-time constant for powerpc (fixes build error).
> >>
> >>
> >> Thanks,
> >> Ryan
> >
> >> P.S. I know we agreed this is not a prerequisite for merging small-sized THP,
> >> but given Huang Ying had provided some review feedback, I wanted to progress it.
> >> All the actual prerequisites are either complete or being worked on by others.
> >>
> >
> > Hi Ryan,
> >
> > this is quite important to a phone and a must-have component, so is large-folio
> > swapin, as i explained to you in another email.
>
> Yes understood; the "prerequisites" are just the things that must be merged
> *before* small-sized THP to ensure we don't regress existing behaviour or to
> ensure that small-size THP is correct/robust when enabled. Performance
> improvements can be merged after the initial small-sized series.

I completely agree. I didn't mean small-THP swap out as a whole  should be
a prerequisite for small-THP initial patchset, just describing how important
it is to a phone :-)

And actually we have done much further than this on phones by optimizing
zsmalloc/zram and allow a large folio compressed and decompressed
as a whole,  we have seen compressing/decompressing a whole large folio
can significantly improve compression ratio and decrease CPU consumption.

so that means large folios can not only save memory but also decrease
CPU consumption.

>
> > Luckily, we are having Chuanhua Han(Cc-ed) to prepare a patchset of largefolio
> > swapin on top of your this patchset, probably a port and cleanup of our
> > do_swap_page[1] againest yours.
>
> That's great to hear - welcome aboard, Chuanhua Han! Feel free to reach out if
> you have questions.
>
> I would guess that any large swap-in changes would be independent of this
> swap-out patch though? Wouldn't you just be looking for contiguous swap entries
> in the page table to determine a suitable folio order, then swap-in each of
> those entries into the folio? And if they happen to have contiguous swap offsets
> (enabled by this swap-out series) then you potentially get a batched disk access
> benefit.

I agree. Maybe we still need to check if the number of contiguous swap entries
is one of those supported large folio sizes?

>
> That's just a guess though, perhaps you can describe your proposed approach?

we have an ugly hack if we are swapping in from the dedicated zRAM for
large folios,
we assume we have a chance to swapin as a whole, but we do also handle corner
cases in which some entries might have been zap_pte_range()-ed.

My current proposal is as below,
A1. we get the number of contiguous swap entries with PTL and find it
is a valid large folio size
A2. we allocate large folio without PTL
A3. after getting PTL again, we re-check PTEs if the situation in A1
have been changed,
if no other threads change those PTEs, we set_ptes and finish the swap-in

but we have a chance to fail in A2, so in this case we still need to
fall back to basepage.

considering the MTE thread[1] I am handling, and MTE tag life cycle is
the same with swap
entry life cycle. it seems we will still need a page-level
arch_swap_restore even after
we support large folio swap-in for the below two reasons

1. contiguous PTEs might be partially dropped by madvise(DONTNEED) etc
2. we can still fall back to basepage for swap-in if we fail to get
large folio even PTEs are all
contiguous swap entries

Of course,  if we succeed in setting all PTEs for a large folio in A3,
we can have
a folio-level arch_swap_restore.

To me, an universal folio-level arch_swap_restore seems not sensible
to handle all kinds of
complex cases.

[1] [RFC V3 PATCH] arm64: mm: swap: save and restore mte tags for large folios
https://lore.kernel.org/linux-mm/20231114014313.67232-1-v-songbaohua@oppo.com/

>
> >
> > Another concern is that swapslots can be fragmented, if we place small/large folios
> > in a swap device, since large folios always require contiguous swapslot, we can
> > result in failure of getting slots even we still have many free slots which are not
> > contiguous.
>
> This series tries to mitigate that problem by reserving a swap cluster per
> order. That works well until we run out of swap clusters; a cluster can't be
> freed until all contained swap entries are swapped back in and deallocated.
>
> But I think we should start with the simple approach first, and only solve the
> problems as they arise through real testing.

I agree.

>
>  To avoid this, [2] dynamic hugepage solution have two swap devices,
> > one for basepage, the other one for CONTPTE. we have modified the priority-based
> > selection of swap devices to choose swap devices based on small/large folios.
> > i realize this approache is super ugly and might be very hard to find a way to
> > upstream though, it seems not universal especially if you are a linux server (-_-)
> >
> > two devices are not a nice approach though it works well for a real product,
> > we might still need some decent way to address this problem while the problem
> > is for sure not a stopper of your patchset.
>
> I guess that approach works for your case because A) you only have 2 sizes, and
> B) your swap device is zRAM, which dynamically allocate RAM as it needs it.
>
> The upstream small-sized THP solution can support multiple sizes, so you would
> need a swap device per size (I think 13 is the limit at the moment - PMD size
> for 64K base page). And if your swap device is a physical block device, you
> can't dynamically parition it the way you can with zRAM. Nether of those things
> scale particularly well IMHO.

right.

>
> >
> > [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L4648
> > [2] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/swapfile.c#L1129
> >
> >>
> >> [1] https://lore.kernel.org/linux-mm/20231010142111.3997780-1-ryan.roberts@arm.com/
> >> [2] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@arm.com/
> >> [3] https://lore.kernel.org/linux-mm/15a52c3d-9584-449b-8228-1335e0753b04@arm.com/
> >>
> >>
> >> Ryan Roberts (4):
> >>   mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
> >>   mm: swap: Remove struct percpu_cluster
> >>   mm: swap: Simplify ssd behavior when scanner steals entry
> >>   mm: swap: Swap-out small-sized THP without splitting
> >>
> >>  include/linux/swap.h |  31 +++---
> >>  mm/huge_memory.c     |   3 -
> >>  mm/swapfile.c        | 232 ++++++++++++++++++++++++-------------------
> >>  mm/vmscan.c          |  10 +-
> >>  4 files changed, 149 insertions(+), 127 deletions(-)
> >

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH RFC 0/6] mm: support large folios swap-in
  2023-10-25 14:45 [PATCH v3 0/4] Swap-out small-sized THP without splitting Ryan Roberts
                   ` (4 preceding siblings ...)
  2023-11-29  7:47 ` [PATCH v3 0/4] " Barry Song
@ 2024-01-18 11:10 ` Barry Song
  2024-01-18 11:10   ` [PATCH RFC 1/6] arm64: mm: swap: support THP_SWAP on hardware with MTE Barry Song
                     ` (7 more replies)
  5 siblings, 8 replies; 116+ messages in thread
From: Barry Song @ 2024-01-18 11:10 UTC (permalink / raw)
  To: ryan.roberts, akpm, david, linux-mm
  Cc: linux-kernel, mhocko, shy828301, wangkefeng.wang, willy, xiang,
	ying.huang, yuzhao, surenb, steven.price, Barry Song

On an embedded system like Android, more than half of anon memory is actually
in swap devices such as zRAM. For example, while an app is switched to back-
ground, its most memory might be swapped-out.

Now we have mTHP features, unfortunately, if we don't support large folios
swap-in, once those large folios are swapped-out, we immediately lose the 
performance gain we can get through large folios and hardware optimization
such as CONT-PTE.

In theory, we don't need to rely on Ryan's swap out patchset[1]. That is to say,
before swap-out, if some memory were normal pages, but when swapping in, we
can also swap-in them as large folios. But this might require I/O happen at
some random places in swap devices. So we limit the large folios swap-in to
those areas which were large folios before swapping-out, aka, swaps are also
contiguous in hardware. On the other hand, in OPPO's product, we've deployed
anon large folios on millions of phones[2]. we enhanced zsmalloc and zRAM to
compress and decompress large folios as a whole, which help improve compression
ratio and decrease CPU consumption significantly. In zsmalloc and zRAM we can
save large objects whose original size are 64KiB for example. So it is also a
better choice for us to only swap-in large folios for those compressed large
objects as a large folio can be decompressed all together.

Note I am moving my previous "arm64: mm: swap: support THP_SWAP on hardware
with MTE" to this series as it might help review.

[1] [PATCH v3 0/4] Swap-out small-sized THP without splitting
https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/
[2] OnePlusOSS / android_kernel_oneplus_sm8550 
https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11

Barry Song (2):
  arm64: mm: swap: support THP_SWAP on hardware with MTE
  mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()

Chuanhua Han (4):
  mm: swap: introduce swap_nr_free() for batched swap_free()
  mm: swap: make should_try_to_free_swap() support large-folio
  mm: support large folios swapin as a whole
  mm: madvise: don't split mTHP for MADV_PAGEOUT

 arch/arm64/include/asm/pgtable.h |  21 ++----
 arch/arm64/mm/mteswap.c          |  42 ++++++++++++
 include/asm-generic/tlb.h        |  10 +++
 include/linux/huge_mm.h          |  12 ----
 include/linux/pgtable.h          |  62 ++++++++++++++++-
 include/linux/swap.h             |   6 ++
 mm/madvise.c                     |  48 ++++++++++++++
 mm/memory.c                      | 110 ++++++++++++++++++++++++++-----
 mm/page_io.c                     |   2 +-
 mm/rmap.c                        |   5 +-
 mm/swap_slots.c                  |   2 +-
 mm/swapfile.c                    |  29 ++++++++
 12 files changed, 301 insertions(+), 48 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH RFC 1/6] arm64: mm: swap: support THP_SWAP on hardware with MTE
  2024-01-18 11:10 ` [PATCH RFC 0/6] mm: support large folios swap-in Barry Song
@ 2024-01-18 11:10   ` Barry Song
  2024-01-26 23:14     ` Chris Li
  2024-01-18 11:10   ` [PATCH RFC 2/6] mm: swap: introduce swap_nr_free() for batched swap_free() Barry Song
                     ` (6 subsequent siblings)
  7 siblings, 1 reply; 116+ messages in thread
From: Barry Song @ 2024-01-18 11:10 UTC (permalink / raw)
  To: ryan.roberts, akpm, david, linux-mm
  Cc: linux-kernel, mhocko, shy828301, wangkefeng.wang, willy, xiang,
	ying.huang, yuzhao, surenb, steven.price, Barry Song

From: Barry Song <v-songbaohua@oppo.com>

Commit d0637c505f8a1 ("arm64: enable THP_SWAP for arm64") brings up
THP_SWAP on ARM64, but it doesn't enable THP_SWP on hardware with
MTE as the MTE code works with the assumption tags save/restore is
always handling a folio with only one page.

The limitation should be removed as more and more ARM64 SoCs have
this feature. Co-existence of MTE and THP_SWAP becomes more and
more important.

This patch makes MTE tags saving support large folios, then we don't
need to split large folios into base pages for swapping out on ARM64
SoCs with MTE any more.

arch_prepare_to_swap() should take folio rather than page as parameter
because we support THP swap-out as a whole. It saves tags for all
pages in a large folio.

As now we are restoring tags based-on folio, in arch_swap_restore(),
we may increase some extra loops and early-exitings while refaulting
a large folio which is still in swapcache in do_swap_page(). In case
a large folio has nr pages, do_swap_page() will only set the PTE of
the particular page which is causing the page fault.
Thus do_swap_page() runs nr times, and each time, arch_swap_restore()
will loop nr times for those subpages in the folio. So right now the
algorithmic complexity becomes O(nr^2).

Once we support mapping large folios in do_swap_page(), extra loops
and early-exitings will decrease while not being completely removed
as a large folio might get partially tagged in corner cases such as,
1. a large folio in swapcache can be partially unmapped, thus, MTE
tags for the unmapped pages will be invalidated;
2. users might use mprotect() to set MTEs on a part of a large folio.

arch_thp_swp_supported() is dropped since ARM64 MTE was the only one
who needed it.

Reviewed-by: Steven Price <steven.price@arm.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 arch/arm64/include/asm/pgtable.h | 21 +++-------------
 arch/arm64/mm/mteswap.c          | 42 ++++++++++++++++++++++++++++++++
 include/linux/huge_mm.h          | 12 ---------
 include/linux/pgtable.h          |  2 +-
 mm/page_io.c                     |  2 +-
 mm/swap_slots.c                  |  2 +-
 6 files changed, 49 insertions(+), 32 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 79ce70fbb751..9902395ca426 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -45,12 +45,6 @@
 	__flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-static inline bool arch_thp_swp_supported(void)
-{
-	return !system_supports_mte();
-}
-#define arch_thp_swp_supported arch_thp_swp_supported
-
 /*
  * Outside of a few very special situations (e.g. hibernation), we always
  * use broadcast TLB invalidation instructions, therefore a spurious page
@@ -1042,12 +1036,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 #ifdef CONFIG_ARM64_MTE
 
 #define __HAVE_ARCH_PREPARE_TO_SWAP
-static inline int arch_prepare_to_swap(struct page *page)
-{
-	if (system_supports_mte())
-		return mte_save_tags(page);
-	return 0;
-}
+#define arch_prepare_to_swap arch_prepare_to_swap
+extern int arch_prepare_to_swap(struct folio *folio);
 
 #define __HAVE_ARCH_SWAP_INVALIDATE
 static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
@@ -1063,11 +1053,8 @@ static inline void arch_swap_invalidate_area(int type)
 }
 
 #define __HAVE_ARCH_SWAP_RESTORE
-static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
-{
-	if (system_supports_mte())
-		mte_restore_tags(entry, &folio->page);
-}
+#define arch_swap_restore arch_swap_restore
+extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
 
 #endif /* CONFIG_ARM64_MTE */
 
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index a31833e3ddc5..b9ca1b35902f 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -68,6 +68,13 @@ void mte_invalidate_tags(int type, pgoff_t offset)
 	mte_free_tag_storage(tags);
 }
 
+static inline void __mte_invalidate_tags(struct page *page)
+{
+	swp_entry_t entry = page_swap_entry(page);
+
+	mte_invalidate_tags(swp_type(entry), swp_offset(entry));
+}
+
 void mte_invalidate_tags_area(int type)
 {
 	swp_entry_t entry = swp_entry(type, 0);
@@ -83,3 +90,38 @@ void mte_invalidate_tags_area(int type)
 	}
 	xa_unlock(&mte_pages);
 }
+
+int arch_prepare_to_swap(struct folio *folio)
+{
+	int err;
+	long i;
+
+	if (system_supports_mte()) {
+		long nr = folio_nr_pages(folio);
+
+		for (i = 0; i < nr; i++) {
+			err = mte_save_tags(folio_page(folio, i));
+			if (err)
+				goto out;
+		}
+	}
+	return 0;
+
+out:
+	while (i--)
+		__mte_invalidate_tags(folio_page(folio, i));
+	return err;
+}
+
+void arch_swap_restore(swp_entry_t entry, struct folio *folio)
+{
+	if (system_supports_mte()) {
+		long i, nr = folio_nr_pages(folio);
+
+		entry.val -= swp_offset(entry) & (nr - 1);
+		for (i = 0; i < nr; i++) {
+			mte_restore_tags(entry, folio_page(folio, i));
+			entry.val++;
+		}
+	}
+}
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 5adb86af35fc..67219d2309dd 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -530,16 +530,4 @@ static inline int split_folio(struct folio *folio)
 	return split_folio_to_list(folio, NULL);
 }
 
-/*
- * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
- * limitations in the implementation like arm64 MTE can override this to
- * false
- */
-#ifndef arch_thp_swp_supported
-static inline bool arch_thp_swp_supported(void)
-{
-	return true;
-}
-#endif
-
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f6d0e3513948..37fe83b0c358 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -925,7 +925,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
  * prototypes must be defined in the arch-specific asm/pgtable.h file.
  */
 #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
-static inline int arch_prepare_to_swap(struct page *page)
+static inline int arch_prepare_to_swap(struct folio *folio)
 {
 	return 0;
 }
diff --git a/mm/page_io.c b/mm/page_io.c
index ae2b49055e43..a9a7c236aecc 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 	 * Arch code may have to preserve more data than just the page
 	 * contents, e.g. memory tags.
 	 */
-	ret = arch_prepare_to_swap(&folio->page);
+	ret = arch_prepare_to_swap(folio);
 	if (ret) {
 		folio_mark_dirty(folio);
 		folio_unlock(folio);
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 0bec1f705f8e..2325adbb1f19 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -307,7 +307,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
 	entry.val = 0;
 
 	if (folio_test_large(folio)) {
-		if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
+		if (IS_ENABLED(CONFIG_THP_SWAP))
 			get_swap_pages(1, &entry, folio_nr_pages(folio));
 		goto out;
 	}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH RFC 2/6] mm: swap: introduce swap_nr_free() for batched swap_free()
  2024-01-18 11:10 ` [PATCH RFC 0/6] mm: support large folios swap-in Barry Song
  2024-01-18 11:10   ` [PATCH RFC 1/6] arm64: mm: swap: support THP_SWAP on hardware with MTE Barry Song
@ 2024-01-18 11:10   ` Barry Song
  2024-01-26 23:17     ` Chris Li
  2024-01-18 11:10   ` [PATCH RFC 3/6] mm: swap: make should_try_to_free_swap() support large-folio Barry Song
                     ` (5 subsequent siblings)
  7 siblings, 1 reply; 116+ messages in thread
From: Barry Song @ 2024-01-18 11:10 UTC (permalink / raw)
  To: ryan.roberts, akpm, david, linux-mm
  Cc: linux-kernel, mhocko, shy828301, wangkefeng.wang, willy, xiang,
	ying.huang, yuzhao, surenb, steven.price, Chuanhua Han,
	Barry Song

From: Chuanhua Han <hanchuanhua@oppo.com>

While swapping in a large folio, we need to free swaps related to the whole
folio. To avoid frequently acquiring and releasing swap locks, it is better
to introduce an API for batched free.

Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
Co-developed-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 include/linux/swap.h |  6 ++++++
 mm/swapfile.c        | 29 +++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4db00ddad261..31a4ee2dcd1c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -478,6 +478,7 @@ extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
+extern void swap_nr_free(swp_entry_t entry, int nr_pages);
 extern void swapcache_free_entries(swp_entry_t *entries, int n);
 extern int free_swap_and_cache(swp_entry_t);
 int swap_type_of(dev_t device, sector_t offset);
@@ -553,6 +554,11 @@ static inline void swap_free(swp_entry_t swp)
 {
 }
 
+void swap_nr_free(swp_entry_t entry, int nr_pages)
+{
+
+}
+
 static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
 {
 }
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 556ff7347d5f..6321bda96b77 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1335,6 +1335,35 @@ void swap_free(swp_entry_t entry)
 		__swap_entry_free(p, entry);
 }
 
+void swap_nr_free(swp_entry_t entry, int nr_pages)
+{
+	int i;
+	struct swap_cluster_info *ci;
+	struct swap_info_struct *p;
+	unsigned type = swp_type(entry);
+	unsigned long offset = swp_offset(entry);
+	DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };
+
+	VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
+
+	if (nr_pages == 1) {
+		swap_free(entry);
+		return;
+	}
+
+	p = _swap_info_get(entry);
+
+	ci = lock_cluster(p, offset);
+	for (i = 0; i < nr_pages; i++) {
+		if (__swap_entry_free_locked(p, offset + i, 1))
+			__bitmap_set(usage, i, 1);
+	}
+	unlock_cluster(ci);
+
+	for_each_clear_bit(i, usage, nr_pages)
+		free_swap_slot(swp_entry(type, offset + i));
+}
+
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
  */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH RFC 3/6] mm: swap: make should_try_to_free_swap() support large-folio
  2024-01-18 11:10 ` [PATCH RFC 0/6] mm: support large folios swap-in Barry Song
  2024-01-18 11:10   ` [PATCH RFC 1/6] arm64: mm: swap: support THP_SWAP on hardware with MTE Barry Song
  2024-01-18 11:10   ` [PATCH RFC 2/6] mm: swap: introduce swap_nr_free() for batched swap_free() Barry Song
@ 2024-01-18 11:10   ` Barry Song
  2024-01-26 23:22     ` Chris Li
  2024-01-18 11:10   ` [PATCH RFC 4/6] mm: support large folios swapin as a whole Barry Song
                     ` (4 subsequent siblings)
  7 siblings, 1 reply; 116+ messages in thread
From: Barry Song @ 2024-01-18 11:10 UTC (permalink / raw)
  To: ryan.roberts, akpm, david, linux-mm
  Cc: linux-kernel, mhocko, shy828301, wangkefeng.wang, willy, xiang,
	ying.huang, yuzhao, surenb, steven.price, Chuanhua Han,
	Barry Song

From: Chuanhua Han <hanchuanhua@oppo.com>

should_try_to_free_swap() works with an assumption that swap-in is always done
at normal page granularity, aka, folio_nr_pages = 1. To support large folio
swap-in, this patch removes the assumption.

Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
Co-developed-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7e1f4849463a..f61a48929ba7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3714,7 +3714,7 @@ static inline bool should_try_to_free_swap(struct folio *folio,
 	 * reference only in case it's likely that we'll be the exlusive user.
 	 */
 	return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
-		folio_ref_count(folio) == 2;
+		folio_ref_count(folio) == (1 + folio_nr_pages(folio));
 }
 
 static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH RFC 4/6] mm: support large folios swapin as a whole
  2024-01-18 11:10 ` [PATCH RFC 0/6] mm: support large folios swap-in Barry Song
                     ` (2 preceding siblings ...)
  2024-01-18 11:10   ` [PATCH RFC 3/6] mm: swap: make should_try_to_free_swap() support large-folio Barry Song
@ 2024-01-18 11:10   ` Barry Song
  2024-01-27 19:53     ` Chris Li
  2024-01-27 20:06     ` Chris Li
  2024-01-18 11:10   ` [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap() Barry Song
                     ` (3 subsequent siblings)
  7 siblings, 2 replies; 116+ messages in thread
From: Barry Song @ 2024-01-18 11:10 UTC (permalink / raw)
  To: ryan.roberts, akpm, david, linux-mm
  Cc: linux-kernel, mhocko, shy828301, wangkefeng.wang, willy, xiang,
	ying.huang, yuzhao, surenb, steven.price, Chuanhua Han,
	Barry Song

From: Chuanhua Han <hanchuanhua@oppo.com>

On an embedded system like Android, more than half of anon memory is actually
in swap devices such as zRAM. For example, while an app is switched to back-
ground, its most memory might be swapped-out.

Now we have mTHP features, unfortunately, if we don't support large folios
swap-in, once those large folios are swapped-out, we immediately lose the
performance gain we can get through large folios and hardware optimization
such as CONT-PTE.

This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
to those contiguous swaps which were likely swapped out from mTHP as a whole.

On the other hand, the current implementation only covers the SWAP_SYCHRONOUS
case. It doesn't support swapin_readahead as large folios yet.

Right now, we are re-faulting large folios which are still in swapcache as a
whole, this can effectively decrease extra loops and early-exitings which we
have increased in arch_swap_restore() while supporting MTE restore for folios
rather than page.

Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
Co-developed-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/memory.c | 108 +++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 94 insertions(+), 14 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f61a48929ba7..928b3f542932 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -107,6 +107,8 @@ EXPORT_SYMBOL(mem_map);
 static vm_fault_t do_fault(struct vm_fault *vmf);
 static vm_fault_t do_anonymous_page(struct vm_fault *vmf);
 static bool vmf_pte_changed(struct vm_fault *vmf);
+static struct folio *alloc_anon_folio(struct vm_fault *vmf,
+				      bool (*pte_range_check)(pte_t *, int));
 
 /*
  * Return true if the original pte was a uffd-wp pte marker (so the pte was
@@ -3784,6 +3786,34 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 	return VM_FAULT_SIGBUS;
 }
 
+static bool pte_range_swap(pte_t *pte, int nr_pages)
+{
+	int i;
+	swp_entry_t entry;
+	unsigned type;
+	pgoff_t start_offset;
+
+	entry = pte_to_swp_entry(ptep_get_lockless(pte));
+	if (non_swap_entry(entry))
+		return false;
+	start_offset = swp_offset(entry);
+	if (start_offset % nr_pages)
+		return false;
+
+	type = swp_type(entry);
+	for (i = 1; i < nr_pages; i++) {
+		entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
+		if (non_swap_entry(entry))
+			return false;
+		if (swp_offset(entry) != start_offset + i)
+			return false;
+		if (swp_type(entry) != type)
+			return false;
+	}
+
+	return true;
+}
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -3804,6 +3834,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	pte_t pte;
 	vm_fault_t ret = 0;
 	void *shadow = NULL;
+	int nr_pages = 1;
+	unsigned long start_address;
+	pte_t *start_pte;
 
 	if (!pte_unmap_same(vmf))
 		goto out;
@@ -3868,13 +3901,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
 		    __swap_count(entry) == 1) {
 			/* skip swapcache */
-			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
-						vma, vmf->address, false);
+			folio = alloc_anon_folio(vmf, pte_range_swap);
 			page = &folio->page;
 			if (folio) {
 				__folio_set_locked(folio);
 				__folio_set_swapbacked(folio);
 
+				if (folio_test_large(folio)) {
+					unsigned long start_offset;
+
+					nr_pages = folio_nr_pages(folio);
+					start_offset = swp_offset(entry) & ~(nr_pages - 1);
+					entry = swp_entry(swp_type(entry), start_offset);
+				}
+
 				if (mem_cgroup_swapin_charge_folio(folio,
 							vma->vm_mm, GFP_KERNEL,
 							entry)) {
@@ -3980,6 +4020,39 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 */
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
 			&vmf->ptl);
+
+	start_address = vmf->address;
+	start_pte = vmf->pte;
+	if (folio_test_large(folio)) {
+		unsigned long nr = folio_nr_pages(folio);
+		unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
+		pte_t *pte_t = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
+
+		/*
+		 * case 1: we are allocating large_folio, try to map it as a whole
+		 * iff the swap entries are still entirely mapped;
+		 * case 2: we hit a large folio in swapcache, and all swap entries
+		 * are still entirely mapped, try to map a large folio as a whole.
+		 * otherwise, map only the faulting page within the large folio
+		 * which is swapcache
+		 */
+		if (pte_range_swap(pte_t, nr)) {
+			start_address = addr;
+			start_pte = pte_t;
+			if (unlikely(folio == swapcache)) {
+				/*
+				 * the below has been done before swap_read_folio()
+				 * for case 1
+				 */
+				nr_pages = nr;
+				entry = pte_to_swp_entry(ptep_get(start_pte));
+				page = &folio->page;
+			}
+		} else if (nr_pages > 1) { /* ptes have changed for case 1 */
+			goto out_nomap;
+		}
+	}
+
 	if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
 		goto out_nomap;
 
@@ -4047,12 +4120,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 * We're already holding a reference on the page but haven't mapped it
 	 * yet.
 	 */
-	swap_free(entry);
+	swap_nr_free(entry, nr_pages);
 	if (should_try_to_free_swap(folio, vma, vmf->flags))
 		folio_free_swap(folio);
 
-	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
-	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
+	folio_ref_add(folio, nr_pages - 1);
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+	add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
+
 	pte = mk_pte(page, vma->vm_page_prot);
 
 	/*
@@ -4062,14 +4137,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 * exclusivity.
 	 */
 	if (!folio_test_ksm(folio) &&
-	    (exclusive || folio_ref_count(folio) == 1)) {
+	    (exclusive || folio_ref_count(folio) == nr_pages)) {
 		if (vmf->flags & FAULT_FLAG_WRITE) {
 			pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 			vmf->flags &= ~FAULT_FLAG_WRITE;
 		}
 		rmap_flags |= RMAP_EXCLUSIVE;
 	}
-	flush_icache_page(vma, page);
+	flush_icache_pages(vma, page, nr_pages);
 	if (pte_swp_soft_dirty(vmf->orig_pte))
 		pte = pte_mksoft_dirty(pte);
 	if (pte_swp_uffd_wp(vmf->orig_pte))
@@ -4081,14 +4156,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		folio_add_new_anon_rmap(folio, vma, vmf->address);
 		folio_add_lru_vma(folio, vma);
 	} else {
-		folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
+		folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
 					rmap_flags);
 	}
 
 	VM_BUG_ON(!folio_test_anon(folio) ||
 			(pte_write(pte) && !PageAnonExclusive(page)));
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
-	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
+	set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
+
+	arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
 
 	folio_unlock(folio);
 	if (folio != swapcache && swapcache) {
@@ -4105,6 +4181,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	}
 
 	if (vmf->flags & FAULT_FLAG_WRITE) {
+		if (folio_test_large(folio) && nr_pages > 1)
+			vmf->orig_pte = ptep_get(vmf->pte);
+
 		ret |= do_wp_page(vmf);
 		if (ret & VM_FAULT_ERROR)
 			ret &= VM_FAULT_ERROR;
@@ -4112,7 +4191,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	}
 
 	/* No need to invalidate - it was non-present before */
-	update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+	update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
 unlock:
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -4148,7 +4227,8 @@ static bool pte_range_none(pte_t *pte, int nr_pages)
 	return true;
 }
 
-static struct folio *alloc_anon_folio(struct vm_fault *vmf)
+static struct folio *alloc_anon_folio(struct vm_fault *vmf,
+				      bool (*pte_range_check)(pte_t *, int))
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	struct vm_area_struct *vma = vmf->vma;
@@ -4190,7 +4270,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
 	order = highest_order(orders);
 	while (orders) {
 		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
-		if (pte_range_none(pte + pte_index(addr), 1 << order))
+		if (pte_range_check(pte + pte_index(addr), 1 << order))
 			break;
 		order = next_order(&orders, order);
 	}
@@ -4269,7 +4349,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
 	/* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
-	folio = alloc_anon_folio(vmf);
+	folio = alloc_anon_folio(vmf, pte_range_none);
 	if (IS_ERR(folio))
 		return 0;
 	if (!folio)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()
  2024-01-18 11:10 ` [PATCH RFC 0/6] mm: support large folios swap-in Barry Song
                     ` (3 preceding siblings ...)
  2024-01-18 11:10   ` [PATCH RFC 4/6] mm: support large folios swapin as a whole Barry Song
@ 2024-01-18 11:10   ` Barry Song
  2024-01-18 11:54     ` David Hildenbrand
  2024-01-27 23:41     ` Chris Li
  2024-01-18 11:10   ` [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT Barry Song
                     ` (2 subsequent siblings)
  7 siblings, 2 replies; 116+ messages in thread
From: Barry Song @ 2024-01-18 11:10 UTC (permalink / raw)
  To: ryan.roberts, akpm, david, linux-mm
  Cc: linux-kernel, mhocko, shy828301, wangkefeng.wang, willy, xiang,
	ying.huang, yuzhao, surenb, steven.price, Barry Song,
	Chuanhua Han

From: Barry Song <v-songbaohua@oppo.com>

In do_swap_page(), while supporting large folio swap-in, we are using the helper
folio_add_anon_rmap_ptes. This is triggerring a WARN_ON in __folio_add_anon_rmap.
We can make the warning quiet by two ways
1. in do_swap_page, we call folio_add_new_anon_rmap() if we are sure the large
folio is new allocated one; we call folio_add_anon_rmap_ptes() if we find the
large folio in swapcache.
2. we always call folio_add_anon_rmap_ptes() in do_swap_page but weaken the
WARN_ON in __folio_add_anon_rmap() by letting the WARN_ON less sensitive.

Option 2 seems to be better for do_swap_page() as it can use unified code for
all cases.

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Tested-by: Chuanhua Han <hanchuanhua@oppo.com>
---
 mm/rmap.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index f5d43edad529..469fcfd32317 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1304,7 +1304,10 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
 		 * page.
 		 */
 		VM_WARN_ON_FOLIO(folio_test_large(folio) &&
-				 level != RMAP_LEVEL_PMD, folio);
+				 level != RMAP_LEVEL_PMD &&
+				 (!IS_ALIGNED(address, nr_pages * PAGE_SIZE) ||
+				 (folio_test_swapcache(folio) && !IS_ALIGNED(folio->index, nr_pages)) ||
+				 page != &folio->page), folio);
 		__folio_set_anon(folio, vma, address,
 				 !!(flags & RMAP_EXCLUSIVE));
 	} else if (likely(!folio_test_ksm(folio))) {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT
  2024-01-18 11:10 ` [PATCH RFC 0/6] mm: support large folios swap-in Barry Song
                     ` (4 preceding siblings ...)
  2024-01-18 11:10   ` [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap() Barry Song
@ 2024-01-18 11:10   ` Barry Song
  2024-01-29  2:15     ` Chris Li
                       ` (2 more replies)
  2024-01-18 15:25   ` [PATCH RFC 0/6] mm: support large folios swap-in Ryan Roberts
  2024-01-29  9:05   ` Huang, Ying
  7 siblings, 3 replies; 116+ messages in thread
From: Barry Song @ 2024-01-18 11:10 UTC (permalink / raw)
  To: ryan.roberts, akpm, david, linux-mm
  Cc: linux-kernel, mhocko, shy828301, wangkefeng.wang, willy, xiang,
	ying.huang, yuzhao, surenb, steven.price, Chuanhua Han,
	Barry Song

From: Chuanhua Han <hanchuanhua@oppo.com>

MADV_PAGEOUT and MADV_FREE are common cases in Android. Ryan's patchset has
supported swapping large folios out as a whole for vmscan case. This patch
extends the feature to madvise.

If madvised range covers the whole large folio, we don't split it. Otherwise,
we still need to split it.

This patch doesn't depend on ARM64's CONT-PTE, alternatively, it defines one
helper named pte_range_cont_mapped() to check if all PTEs are contiguously
mapped to a large folio.

Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
Co-developed-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 include/asm-generic/tlb.h | 10 +++++++
 include/linux/pgtable.h   | 60 +++++++++++++++++++++++++++++++++++++++
 mm/madvise.c              | 48 +++++++++++++++++++++++++++++++
 3 files changed, 118 insertions(+)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 129a3a759976..f894e22da5d6 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -608,6 +608,16 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
 		__tlb_remove_tlb_entry(tlb, ptep, address);	\
 	} while (0)
 
+#define tlb_remove_nr_tlb_entry(tlb, ptep, address, nr)			\
+	do {                                                    	\
+		int i;							\
+		tlb_flush_pte_range(tlb, address,			\
+				PAGE_SIZE * nr);			\
+		for (i = 0; i < nr; i++)				\
+			__tlb_remove_tlb_entry(tlb, ptep + i,		\
+					address + i * PAGE_SIZE);	\
+	} while (0)
+
 #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)	\
 	do {							\
 		unsigned long _sz = huge_page_size(h);		\
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 37fe83b0c358..da0c1cf447e3 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -320,6 +320,42 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
 }
 #endif
 
+#ifndef pte_range_cont_mapped
+static inline bool pte_range_cont_mapped(unsigned long start_pfn,
+					 pte_t *start_pte,
+					 unsigned long start_addr,
+					 int nr)
+{
+	int i;
+	pte_t pte_val;
+
+	for (i = 0; i < nr; i++) {
+		pte_val = ptep_get(start_pte + i);
+
+		if (pte_none(pte_val))
+			return false;
+
+		if (pte_pfn(pte_val) != (start_pfn + i))
+			return false;
+	}
+
+	return true;
+}
+#endif
+
+#ifndef pte_range_young
+static inline bool pte_range_young(pte_t *start_pte, int nr)
+{
+	int i;
+
+	for (i = 0; i < nr; i++)
+		if (pte_young(ptep_get(start_pte + i)))
+			return true;
+
+	return false;
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 					    unsigned long address,
@@ -580,6 +616,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 }
 #endif
 
+#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_RANGE_FULL
+static inline pte_t ptep_get_and_clear_range_full(struct mm_struct *mm,
+						  unsigned long start_addr,
+						  pte_t *start_pte,
+						  int nr, int full)
+{
+	int i;
+	pte_t pte;
+
+	pte = ptep_get_and_clear_full(mm, start_addr, start_pte, full);
+
+	for (i = 1; i < nr; i++)
+		ptep_get_and_clear_full(mm, start_addr + i * PAGE_SIZE,
+					start_pte + i, full);
+
+	return pte;
+}
 
 /*
  * If two threads concurrently fault at the same page, the thread that
@@ -995,6 +1048,13 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
 })
 #endif
 
+#ifndef pte_nr_addr_end
+#define pte_nr_addr_end(addr, size, end)				\
+({	unsigned long __boundary = ((addr) + size) & (~(size - 1));	\
+	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
+})
+#endif
+
 /*
  * When walking page tables, we usually want to skip any p?d_none entries;
  * and any p?d_bad entries - reporting the error before resetting to none.
diff --git a/mm/madvise.c b/mm/madvise.c
index 912155a94ed5..262460ac4b2e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -452,6 +452,54 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		if (folio_test_large(folio)) {
 			int err;
 
+			if (!folio_test_pmd_mappable(folio)) {
+				int nr_pages = folio_nr_pages(folio);
+				unsigned long folio_size = PAGE_SIZE * nr_pages;
+				unsigned long start_addr = ALIGN_DOWN(addr, nr_pages * PAGE_SIZE);;
+				unsigned long start_pfn = page_to_pfn(folio_page(folio, 0));
+				pte_t *start_pte = pte - (addr - start_addr) / PAGE_SIZE;
+				unsigned long next = pte_nr_addr_end(addr, folio_size, end);
+
+				if (!pte_range_cont_mapped(start_pfn, start_pte, start_addr, nr_pages))
+					goto split;
+
+				if (next - addr != folio_size) {
+					goto split;
+				} else {
+					/* Do not interfere with other mappings of this page */
+					if (folio_estimated_sharers(folio) != 1)
+						goto skip;
+
+					VM_BUG_ON(addr != start_addr || pte != start_pte);
+
+					if (pte_range_young(start_pte, nr_pages)) {
+						ptent = ptep_get_and_clear_range_full(mm, start_addr, start_pte,
+										      nr_pages, tlb->fullmm);
+						ptent = pte_mkold(ptent);
+
+						set_ptes(mm, start_addr, start_pte, ptent, nr_pages);
+						tlb_remove_nr_tlb_entry(tlb, start_pte, start_addr, nr_pages);
+					}
+
+					folio_clear_referenced(folio);
+					folio_test_clear_young(folio);
+					if (pageout) {
+						if (folio_isolate_lru(folio)) {
+							if (folio_test_unevictable(folio))
+								folio_putback_lru(folio);
+							else
+								list_add(&folio->lru, &folio_list);
+						}
+					} else
+						folio_deactivate(folio);
+				}
+skip:
+				pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
+				addr = next - PAGE_SIZE;
+				continue;
+
+			}
+split:
 			if (folio_estimated_sharers(folio) != 1)
 				break;
 			if (pageout_anon_only_filter && !folio_test_anon(folio))
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()
  2024-01-18 11:10   ` [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap() Barry Song
@ 2024-01-18 11:54     ` David Hildenbrand
  2024-01-23  6:49       ` Barry Song
  2024-01-27 23:41     ` Chris Li
  1 sibling, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2024-01-18 11:54 UTC (permalink / raw)
  To: Barry Song, ryan.roberts, akpm, linux-mm
  Cc: linux-kernel, mhocko, shy828301, wangkefeng.wang, willy, xiang,
	ying.huang, yuzhao, surenb, steven.price, Barry Song,
	Chuanhua Han

On 18.01.24 12:10, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> In do_swap_page(), while supporting large folio swap-in, we are using the helper
> folio_add_anon_rmap_ptes. This is triggerring a WARN_ON in __folio_add_anon_rmap.
> We can make the warning quiet by two ways
> 1. in do_swap_page, we call folio_add_new_anon_rmap() if we are sure the large
> folio is new allocated one; we call folio_add_anon_rmap_ptes() if we find the
> large folio in swapcache.
> 2. we always call folio_add_anon_rmap_ptes() in do_swap_page but weaken the
> WARN_ON in __folio_add_anon_rmap() by letting the WARN_ON less sensitive.
> 
> Option 2 seems to be better for do_swap_page() as it can use unified code for
> all cases.
> 
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> Tested-by: Chuanhua Han <hanchuanhua@oppo.com>
> ---
>   mm/rmap.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index f5d43edad529..469fcfd32317 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1304,7 +1304,10 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
>   		 * page.
>   		 */
>   		VM_WARN_ON_FOLIO(folio_test_large(folio) &&
> -				 level != RMAP_LEVEL_PMD, folio);
> +				 level != RMAP_LEVEL_PMD &&
> +				 (!IS_ALIGNED(address, nr_pages * PAGE_SIZE) ||
> +				 (folio_test_swapcache(folio) && !IS_ALIGNED(folio->index, nr_pages)) ||
> +				 page != &folio->page), folio);
>   		__folio_set_anon(folio, vma, address,
>   				 !!(flags & RMAP_EXCLUSIVE));
>   	} else if (likely(!folio_test_ksm(folio))) {


I have on my todo list to move all that !anon handling out of 
folio_add_anon_rmap_ptes(), and instead make swapin code call add 
folio_add_new_anon_rmap(), where we'll have to pass an exclusive flag 
then (-> whole new folio exclusive).

That's the cleaner approach.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 0/6] mm: support large folios swap-in
  2024-01-18 11:10 ` [PATCH RFC 0/6] mm: support large folios swap-in Barry Song
                     ` (5 preceding siblings ...)
  2024-01-18 11:10   ` [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT Barry Song
@ 2024-01-18 15:25   ` Ryan Roberts
  2024-01-18 23:54     ` Barry Song
  2024-01-29  9:05   ` Huang, Ying
  7 siblings, 1 reply; 116+ messages in thread
From: Ryan Roberts @ 2024-01-18 15:25 UTC (permalink / raw)
  To: Barry Song, akpm, david, linux-mm
  Cc: linux-kernel, mhocko, shy828301, wangkefeng.wang, willy, xiang,
	ying.huang, yuzhao, surenb, steven.price

On 18/01/2024 11:10, Barry Song wrote:
> On an embedded system like Android, more than half of anon memory is actually
> in swap devices such as zRAM. For example, while an app is switched to back-
> ground, its most memory might be swapped-out.
> 
> Now we have mTHP features, unfortunately, if we don't support large folios
> swap-in, once those large folios are swapped-out, we immediately lose the 
> performance gain we can get through large folios and hardware optimization
> such as CONT-PTE.
> 
> In theory, we don't need to rely on Ryan's swap out patchset[1]. That is to say,
> before swap-out, if some memory were normal pages, but when swapping in, we
> can also swap-in them as large folios. 

I think this could also violate MADV_NOHUGEPAGE; if the application has
requested that we do not create a THP, then we had better not; it could cause a
correctness issue in some circumstances. You would need to pay attention to this
vma flag if taking this approach.

> But this might require I/O happen at
> some random places in swap devices. So we limit the large folios swap-in to
> those areas which were large folios before swapping-out, aka, swaps are also
> contiguous in hardware. 

In fact, even this may not be sufficient; it's possible that a contiguous set of
base pages (small folios) were allocated to a virtual mapping and all swapped
out together - they would likely end up contiguous in the swap file, but should
not be swapped back in as a single folio because of this (same reasoning applies
to cluster of smaller THPs that you mistake for a larger THP, etc).

So you will need to check what THP sizes are enabled and check the VMA
suitability regardless; Perhaps you are already doing this - I haven't looked at
the code yet.

I'll aim to review the code in the next couple of weeks.

Thanks,
Ryan

> On the other hand, in OPPO's product, we've deployed
> anon large folios on millions of phones[2]. we enhanced zsmalloc and zRAM to
> compress and decompress large folios as a whole, which help improve compression
> ratio and decrease CPU consumption significantly. In zsmalloc and zRAM we can
> save large objects whose original size are 64KiB for example. So it is also a
> better choice for us to only swap-in large folios for those compressed large
> objects as a large folio can be decompressed all together.
> 
> Note I am moving my previous "arm64: mm: swap: support THP_SWAP on hardware
> with MTE" to this series as it might help review.
> 
> [1] [PATCH v3 0/4] Swap-out small-sized THP without splitting
> https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/
> [2] OnePlusOSS / android_kernel_oneplus_sm8550 
> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
> 
> Barry Song (2):
>   arm64: mm: swap: support THP_SWAP on hardware with MTE
>   mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()
> 
> Chuanhua Han (4):
>   mm: swap: introduce swap_nr_free() for batched swap_free()
>   mm: swap: make should_try_to_free_swap() support large-folio
>   mm: support large folios swapin as a whole
>   mm: madvise: don't split mTHP for MADV_PAGEOUT
> 
>  arch/arm64/include/asm/pgtable.h |  21 ++----
>  arch/arm64/mm/mteswap.c          |  42 ++++++++++++
>  include/asm-generic/tlb.h        |  10 +++
>  include/linux/huge_mm.h          |  12 ----
>  include/linux/pgtable.h          |  62 ++++++++++++++++-
>  include/linux/swap.h             |   6 ++
>  mm/madvise.c                     |  48 ++++++++++++++
>  mm/memory.c                      | 110 ++++++++++++++++++++++++++-----
>  mm/page_io.c                     |   2 +-
>  mm/rmap.c                        |   5 +-
>  mm/swap_slots.c                  |   2 +-
>  mm/swapfile.c                    |  29 ++++++++
>  12 files changed, 301 insertions(+), 48 deletions(-)
> 


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 0/6] mm: support large folios swap-in
  2024-01-18 15:25   ` [PATCH RFC 0/6] mm: support large folios swap-in Ryan Roberts
@ 2024-01-18 23:54     ` Barry Song
  2024-01-19 13:25       ` Ryan Roberts
  0 siblings, 1 reply; 116+ messages in thread
From: Barry Song @ 2024-01-18 23:54 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: akpm, david, linux-mm, linux-kernel, mhocko, shy828301,
	wangkefeng.wang, willy, xiang, ying.huang, yuzhao, surenb,
	steven.price

On Thu, Jan 18, 2024 at 11:25 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 18/01/2024 11:10, Barry Song wrote:
> > On an embedded system like Android, more than half of anon memory is actually
> > in swap devices such as zRAM. For example, while an app is switched to back-
> > ground, its most memory might be swapped-out.
> >
> > Now we have mTHP features, unfortunately, if we don't support large folios
> > swap-in, once those large folios are swapped-out, we immediately lose the
> > performance gain we can get through large folios and hardware optimization
> > such as CONT-PTE.
> >
> > In theory, we don't need to rely on Ryan's swap out patchset[1]. That is to say,
> > before swap-out, if some memory were normal pages, but when swapping in, we
> > can also swap-in them as large folios.
>
> I think this could also violate MADV_NOHUGEPAGE; if the application has
> requested that we do not create a THP, then we had better not; it could cause a
> correctness issue in some circumstances. You would need to pay attention to this
> vma flag if taking this approach.
>
> > But this might require I/O happen at
> > some random places in swap devices. So we limit the large folios swap-in to
> > those areas which were large folios before swapping-out, aka, swaps are also
> > contiguous in hardware.
>
> In fact, even this may not be sufficient; it's possible that a contiguous set of
> base pages (small folios) were allocated to a virtual mapping and all swapped
> out together - they would likely end up contiguous in the swap file, but should
> not be swapped back in as a single folio because of this (same reasoning applies
> to cluster of smaller THPs that you mistake for a larger THP, etc).
>
> So you will need to check what THP sizes are enabled and check the VMA
> suitability regardless; Perhaps you are already doing this - I haven't looked at
> the code yet.

we are actually re-using your alloc_anon_folio() by adding a parameter
to make it
support both do_anon_page and do_swap_page,

-static struct folio *alloc_anon_folio(struct vm_fault *vmf)
+static struct folio *alloc_anon_folio(struct vm_fault *vmf,
+      bool (*pte_range_check)(pte_t *, int))
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
  struct vm_area_struct *vma = vmf->vma;
@@ -4190,7 +4270,7 @@ static struct folio *alloc_anon_folio(struct
vm_fault *vmf)
  order = highest_order(orders);
  while (orders) {
  addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
- if (pte_range_none(pte + pte_index(addr), 1 << order))
+ if (pte_range_check(pte + pte_index(addr), 1 << order))
  break;
  order = next_order(&orders, order);
  }
@@ -4269,7 +4349,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
  if (unlikely(anon_vma_prepare(vma)))
  goto oom;
  /* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
- folio = alloc_anon_folio(vmf);
+ folio = alloc_anon_folio(vmf, pte_range_none);
  if (IS_ERR(folio))
  return 0;
  if (!folio)
--

I assume this has checked everything?

>
> I'll aim to review the code in the next couple of weeks.

nice, thanks!

>
> Thanks,
> Ryan
>
> > On the other hand, in OPPO's product, we've deployed
> > anon large folios on millions of phones[2]. we enhanced zsmalloc and zRAM to
> > compress and decompress large folios as a whole, which help improve compression
> > ratio and decrease CPU consumption significantly. In zsmalloc and zRAM we can
> > save large objects whose original size are 64KiB for example. So it is also a
> > better choice for us to only swap-in large folios for those compressed large
> > objects as a large folio can be decompressed all together.
> >
> > Note I am moving my previous "arm64: mm: swap: support THP_SWAP on hardware
> > with MTE" to this series as it might help review.
> >
> > [1] [PATCH v3 0/4] Swap-out small-sized THP without splitting
> > https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/
> > [2] OnePlusOSS / android_kernel_oneplus_sm8550
> > https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
> >
> > Barry Song (2):
> >   arm64: mm: swap: support THP_SWAP on hardware with MTE
> >   mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()
> >
> > Chuanhua Han (4):
> >   mm: swap: introduce swap_nr_free() for batched swap_free()
> >   mm: swap: make should_try_to_free_swap() support large-folio
> >   mm: support large folios swapin as a whole
> >   mm: madvise: don't split mTHP for MADV_PAGEOUT
> >
> >  arch/arm64/include/asm/pgtable.h |  21 ++----
> >  arch/arm64/mm/mteswap.c          |  42 ++++++++++++
> >  include/asm-generic/tlb.h        |  10 +++
> >  include/linux/huge_mm.h          |  12 ----
> >  include/linux/pgtable.h          |  62 ++++++++++++++++-
> >  include/linux/swap.h             |   6 ++
> >  mm/madvise.c                     |  48 ++++++++++++++
> >  mm/memory.c                      | 110 ++++++++++++++++++++++++++-----
> >  mm/page_io.c                     |   2 +-
> >  mm/rmap.c                        |   5 +-
> >  mm/swap_slots.c                  |   2 +-
> >  mm/swapfile.c                    |  29 ++++++++
> >  12 files changed, 301 insertions(+), 48 deletions(-)
> >
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 0/6] mm: support large folios swap-in
  2024-01-18 23:54     ` Barry Song
@ 2024-01-19 13:25       ` Ryan Roberts
  2024-01-27 14:27         ` Barry Song
  0 siblings, 1 reply; 116+ messages in thread
From: Ryan Roberts @ 2024-01-19 13:25 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, david, linux-mm, linux-kernel, mhocko, shy828301,
	wangkefeng.wang, willy, xiang, ying.huang, yuzhao, surenb,
	steven.price

On 18/01/2024 23:54, Barry Song wrote:
> On Thu, Jan 18, 2024 at 11:25 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 18/01/2024 11:10, Barry Song wrote:
>>> On an embedded system like Android, more than half of anon memory is actually
>>> in swap devices such as zRAM. For example, while an app is switched to back-
>>> ground, its most memory might be swapped-out.
>>>
>>> Now we have mTHP features, unfortunately, if we don't support large folios
>>> swap-in, once those large folios are swapped-out, we immediately lose the
>>> performance gain we can get through large folios and hardware optimization
>>> such as CONT-PTE.
>>>
>>> In theory, we don't need to rely on Ryan's swap out patchset[1]. That is to say,
>>> before swap-out, if some memory were normal pages, but when swapping in, we
>>> can also swap-in them as large folios.
>>
>> I think this could also violate MADV_NOHUGEPAGE; if the application has
>> requested that we do not create a THP, then we had better not; it could cause a
>> correctness issue in some circumstances. You would need to pay attention to this
>> vma flag if taking this approach.
>>
>>> But this might require I/O happen at
>>> some random places in swap devices. So we limit the large folios swap-in to
>>> those areas which were large folios before swapping-out, aka, swaps are also
>>> contiguous in hardware.
>>
>> In fact, even this may not be sufficient; it's possible that a contiguous set of
>> base pages (small folios) were allocated to a virtual mapping and all swapped
>> out together - they would likely end up contiguous in the swap file, but should
>> not be swapped back in as a single folio because of this (same reasoning applies
>> to cluster of smaller THPs that you mistake for a larger THP, etc).
>>
>> So you will need to check what THP sizes are enabled and check the VMA
>> suitability regardless; Perhaps you are already doing this - I haven't looked at
>> the code yet.
> 
> we are actually re-using your alloc_anon_folio() by adding a parameter
> to make it
> support both do_anon_page and do_swap_page,
> 
> -static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> +      bool (*pte_range_check)(pte_t *, int))
>  {
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>   struct vm_area_struct *vma = vmf->vma;
> @@ -4190,7 +4270,7 @@ static struct folio *alloc_anon_folio(struct
> vm_fault *vmf)
>   order = highest_order(orders);
>   while (orders) {
>   addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> - if (pte_range_none(pte + pte_index(addr), 1 << order))
> + if (pte_range_check(pte + pte_index(addr), 1 << order))
>   break;
>   order = next_order(&orders, order);
>   }
> @@ -4269,7 +4349,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>   if (unlikely(anon_vma_prepare(vma)))
>   goto oom;
>   /* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
> - folio = alloc_anon_folio(vmf);
> + folio = alloc_anon_folio(vmf, pte_range_none);
>   if (IS_ERR(folio))
>   return 0;
>   if (!folio)
> --
> 
> I assume this has checked everything?

Ahh yes, very good. In that case you can disregard what I said; its already covered.

I notice that this series appears as a reply to my series. I'm not sure what the
normal convention is, but I expect more people would see it if you posted it as
its own thread?


> 
>>
>> I'll aim to review the code in the next couple of weeks.
> 
> nice, thanks!
> 
>>
>> Thanks,
>> Ryan
>>
>>> On the other hand, in OPPO's product, we've deployed
>>> anon large folios on millions of phones[2]. we enhanced zsmalloc and zRAM to
>>> compress and decompress large folios as a whole, which help improve compression
>>> ratio and decrease CPU consumption significantly. In zsmalloc and zRAM we can
>>> save large objects whose original size are 64KiB for example. So it is also a
>>> better choice for us to only swap-in large folios for those compressed large
>>> objects as a large folio can be decompressed all together.
>>>
>>> Note I am moving my previous "arm64: mm: swap: support THP_SWAP on hardware
>>> with MTE" to this series as it might help review.
>>>
>>> [1] [PATCH v3 0/4] Swap-out small-sized THP without splitting
>>> https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/
>>> [2] OnePlusOSS / android_kernel_oneplus_sm8550
>>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
>>>
>>> Barry Song (2):
>>>   arm64: mm: swap: support THP_SWAP on hardware with MTE
>>>   mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()
>>>
>>> Chuanhua Han (4):
>>>   mm: swap: introduce swap_nr_free() for batched swap_free()
>>>   mm: swap: make should_try_to_free_swap() support large-folio
>>>   mm: support large folios swapin as a whole
>>>   mm: madvise: don't split mTHP for MADV_PAGEOUT
>>>
>>>  arch/arm64/include/asm/pgtable.h |  21 ++----
>>>  arch/arm64/mm/mteswap.c          |  42 ++++++++++++
>>>  include/asm-generic/tlb.h        |  10 +++
>>>  include/linux/huge_mm.h          |  12 ----
>>>  include/linux/pgtable.h          |  62 ++++++++++++++++-
>>>  include/linux/swap.h             |   6 ++
>>>  mm/madvise.c                     |  48 ++++++++++++++
>>>  mm/memory.c                      | 110 ++++++++++++++++++++++++++-----
>>>  mm/page_io.c                     |   2 +-
>>>  mm/rmap.c                        |   5 +-
>>>  mm/swap_slots.c                  |   2 +-
>>>  mm/swapfile.c                    |  29 ++++++++
>>>  12 files changed, 301 insertions(+), 48 deletions(-)
>>>
>>
> 
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()
  2024-01-18 11:54     ` David Hildenbrand
@ 2024-01-23  6:49       ` Barry Song
  2024-01-29  3:25         ` Chris Li
  0 siblings, 1 reply; 116+ messages in thread
From: Barry Song @ 2024-01-23  6:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: ryan.roberts, akpm, linux-mm, linux-kernel, mhocko, shy828301,
	wangkefeng.wang, willy, xiang, ying.huang, yuzhao, surenb,
	steven.price, Barry Song, Chuanhua Han

On Thu, Jan 18, 2024 at 7:54 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 18.01.24 12:10, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > In do_swap_page(), while supporting large folio swap-in, we are using the helper
> > folio_add_anon_rmap_ptes. This is triggerring a WARN_ON in __folio_add_anon_rmap.
> > We can make the warning quiet by two ways
> > 1. in do_swap_page, we call folio_add_new_anon_rmap() if we are sure the large
> > folio is new allocated one; we call folio_add_anon_rmap_ptes() if we find the
> > large folio in swapcache.
> > 2. we always call folio_add_anon_rmap_ptes() in do_swap_page but weaken the
> > WARN_ON in __folio_add_anon_rmap() by letting the WARN_ON less sensitive.
> >
> > Option 2 seems to be better for do_swap_page() as it can use unified code for
> > all cases.
> >
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > Tested-by: Chuanhua Han <hanchuanhua@oppo.com>
> > ---
> >   mm/rmap.c | 5 ++++-
> >   1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index f5d43edad529..469fcfd32317 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1304,7 +1304,10 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
> >                * page.
> >                */
> >               VM_WARN_ON_FOLIO(folio_test_large(folio) &&
> > -                              level != RMAP_LEVEL_PMD, folio);
> > +                              level != RMAP_LEVEL_PMD &&
> > +                              (!IS_ALIGNED(address, nr_pages * PAGE_SIZE) ||
> > +                              (folio_test_swapcache(folio) && !IS_ALIGNED(folio->index, nr_pages)) ||
> > +                              page != &folio->page), folio);
> >               __folio_set_anon(folio, vma, address,
> >                                !!(flags & RMAP_EXCLUSIVE));
> >       } else if (likely(!folio_test_ksm(folio))) {
>
>
> I have on my todo list to move all that !anon handling out of
> folio_add_anon_rmap_ptes(), and instead make swapin code call add
> folio_add_new_anon_rmap(), where we'll have to pass an exclusive flag
> then (-> whole new folio exclusive).
>
> That's the cleaner approach.
>

one tricky thing is that sometimes it is hard to know who is the first
one to add rmap and thus should
call folio_add_new_anon_rmap.
especially when we want to support swapin_readahead(), the one who
allocated large filio might not
be that one who firstly does rmap.
is it an acceptable way to do the below in do_swap_page?
if (!folio_test_anon(folio))
      folio_add_new_anon_rmap()
else
      folio_add_anon_rmap_ptes()

> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 1/6] arm64: mm: swap: support THP_SWAP on hardware with MTE
  2024-01-18 11:10   ` [PATCH RFC 1/6] arm64: mm: swap: support THP_SWAP on hardware with MTE Barry Song
@ 2024-01-26 23:14     ` Chris Li
  2024-02-26  2:59       ` Barry Song
  0 siblings, 1 reply; 116+ messages in thread
From: Chris Li @ 2024-01-26 23:14 UTC (permalink / raw)
  To: Barry Song
  Cc: ryan.roberts, akpm, david, linux-mm, linux-kernel, mhocko,
	shy828301, wangkefeng.wang, willy, xiang, ying.huang, yuzhao,
	surenb, steven.price, Barry Song

On Thu, Jan 18, 2024 at 3:11 AM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> Commit d0637c505f8a1 ("arm64: enable THP_SWAP for arm64") brings up
> THP_SWAP on ARM64, but it doesn't enable THP_SWP on hardware with
> MTE as the MTE code works with the assumption tags save/restore is
> always handling a folio with only one page.
>
> The limitation should be removed as more and more ARM64 SoCs have
> this feature. Co-existence of MTE and THP_SWAP becomes more and
> more important.
>
> This patch makes MTE tags saving support large folios, then we don't
> need to split large folios into base pages for swapping out on ARM64
> SoCs with MTE any more.
>
> arch_prepare_to_swap() should take folio rather than page as parameter
> because we support THP swap-out as a whole. It saves tags for all
> pages in a large folio.
>
> As now we are restoring tags based-on folio, in arch_swap_restore(),
> we may increase some extra loops and early-exitings while refaulting
> a large folio which is still in swapcache in do_swap_page(). In case
> a large folio has nr pages, do_swap_page() will only set the PTE of
> the particular page which is causing the page fault.
> Thus do_swap_page() runs nr times, and each time, arch_swap_restore()
> will loop nr times for those subpages in the folio. So right now the
> algorithmic complexity becomes O(nr^2).
>
> Once we support mapping large folios in do_swap_page(), extra loops
> and early-exitings will decrease while not being completely removed
> as a large folio might get partially tagged in corner cases such as,
> 1. a large folio in swapcache can be partially unmapped, thus, MTE
> tags for the unmapped pages will be invalidated;
> 2. users might use mprotect() to set MTEs on a part of a large folio.
>
> arch_thp_swp_supported() is dropped since ARM64 MTE was the only one
> who needed it.
>
> Reviewed-by: Steven Price <steven.price@arm.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  arch/arm64/include/asm/pgtable.h | 21 +++-------------
>  arch/arm64/mm/mteswap.c          | 42 ++++++++++++++++++++++++++++++++
>  include/linux/huge_mm.h          | 12 ---------
>  include/linux/pgtable.h          |  2 +-
>  mm/page_io.c                     |  2 +-
>  mm/swap_slots.c                  |  2 +-
>  6 files changed, 49 insertions(+), 32 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 79ce70fbb751..9902395ca426 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -45,12 +45,6 @@
>         __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> -static inline bool arch_thp_swp_supported(void)
> -{
> -       return !system_supports_mte();
> -}
> -#define arch_thp_swp_supported arch_thp_swp_supported
> -
>  /*
>   * Outside of a few very special situations (e.g. hibernation), we always
>   * use broadcast TLB invalidation instructions, therefore a spurious page
> @@ -1042,12 +1036,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>  #ifdef CONFIG_ARM64_MTE
>
>  #define __HAVE_ARCH_PREPARE_TO_SWAP
> -static inline int arch_prepare_to_swap(struct page *page)
> -{
> -       if (system_supports_mte())
> -               return mte_save_tags(page);
> -       return 0;
> -}
> +#define arch_prepare_to_swap arch_prepare_to_swap

This seems a noop, define "arch_prepare_to_swap" back to itself.
What am I missing?

I see. Answer my own question, I guess you want to allow someone to
overwrite the arch_prepare_to_swap.
Wouldn't testing against  __HAVE_ARCH_PREPARE_TO_SWAP enough to support that?

Maybe I need to understand better how you want others to extend this
code to make suggestions.
As it is, this looks strange.

> +extern int arch_prepare_to_swap(struct folio *folio);
>
>  #define __HAVE_ARCH_SWAP_INVALIDATE
>  static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> @@ -1063,11 +1053,8 @@ static inline void arch_swap_invalidate_area(int type)
>  }
>
>  #define __HAVE_ARCH_SWAP_RESTORE
> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> -{
> -       if (system_supports_mte())
> -               mte_restore_tags(entry, &folio->page);
> -}
> +#define arch_swap_restore arch_swap_restore

Same here.

> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
>
>  #endif /* CONFIG_ARM64_MTE */
>
> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> index a31833e3ddc5..b9ca1b35902f 100644
> --- a/arch/arm64/mm/mteswap.c
> +++ b/arch/arm64/mm/mteswap.c
> @@ -68,6 +68,13 @@ void mte_invalidate_tags(int type, pgoff_t offset)
>         mte_free_tag_storage(tags);
>  }
>
> +static inline void __mte_invalidate_tags(struct page *page)
> +{
> +       swp_entry_t entry = page_swap_entry(page);
> +
> +       mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> +}
> +
>  void mte_invalidate_tags_area(int type)
>  {
>         swp_entry_t entry = swp_entry(type, 0);
> @@ -83,3 +90,38 @@ void mte_invalidate_tags_area(int type)
>         }
>         xa_unlock(&mte_pages);
>  }
> +
> +int arch_prepare_to_swap(struct folio *folio)
> +{
> +       int err;
> +       long i;
> +
> +       if (system_supports_mte()) {
Very minor nitpick.

You can do
if (!system_supports_mte())
    return 0;

Here and the for loop would have less indent. The function looks flatter.

> +               long nr = folio_nr_pages(folio);
> +
> +               for (i = 0; i < nr; i++) {
> +                       err = mte_save_tags(folio_page(folio, i));
> +                       if (err)
> +                               goto out;
> +               }
> +       }
> +       return 0;
> +
> +out:
> +       while (i--)
> +               __mte_invalidate_tags(folio_page(folio, i));
> +       return err;
> +}
> +
> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> +{
> +       if (system_supports_mte()) {

Same here.

Looks good otherwise. None of the nitpicks are deal breakers.

Acked-by: Chris Li <chrisl@kernel.org>


Chris

> +               long i, nr = folio_nr_pages(folio);
> +
> +               entry.val -= swp_offset(entry) & (nr - 1);
> +               for (i = 0; i < nr; i++) {
> +                       mte_restore_tags(entry, folio_page(folio, i));
> +                       entry.val++;
> +               }
> +       }
> +}
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 5adb86af35fc..67219d2309dd 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -530,16 +530,4 @@ static inline int split_folio(struct folio *folio)
>         return split_folio_to_list(folio, NULL);
>  }
>
> -/*
> - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> - * limitations in the implementation like arm64 MTE can override this to
> - * false
> - */
> -#ifndef arch_thp_swp_supported
> -static inline bool arch_thp_swp_supported(void)
> -{
> -       return true;
> -}
> -#endif
> -
>  #endif /* _LINUX_HUGE_MM_H */
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index f6d0e3513948..37fe83b0c358 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -925,7 +925,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
>   * prototypes must be defined in the arch-specific asm/pgtable.h file.
>   */
>  #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
> -static inline int arch_prepare_to_swap(struct page *page)
> +static inline int arch_prepare_to_swap(struct folio *folio)
>  {
>         return 0;
>  }
> diff --git a/mm/page_io.c b/mm/page_io.c
> index ae2b49055e43..a9a7c236aecc 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>          * Arch code may have to preserve more data than just the page
>          * contents, e.g. memory tags.
>          */
> -       ret = arch_prepare_to_swap(&folio->page);
> +       ret = arch_prepare_to_swap(folio);
>         if (ret) {
>                 folio_mark_dirty(folio);
>                 folio_unlock(folio);
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index 0bec1f705f8e..2325adbb1f19 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -307,7 +307,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
>         entry.val = 0;
>
>         if (folio_test_large(folio)) {
> -               if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> +               if (IS_ENABLED(CONFIG_THP_SWAP))
>                         get_swap_pages(1, &entry, folio_nr_pages(folio));
>                 goto out;
>         }
> --
> 2.34.1
>
>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 2/6] mm: swap: introduce swap_nr_free() for batched swap_free()
  2024-01-18 11:10   ` [PATCH RFC 2/6] mm: swap: introduce swap_nr_free() for batched swap_free() Barry Song
@ 2024-01-26 23:17     ` Chris Li
  2024-02-26  4:47       ` Barry Song
  0 siblings, 1 reply; 116+ messages in thread
From: Chris Li @ 2024-01-26 23:17 UTC (permalink / raw)
  To: Barry Song
  Cc: ryan.roberts, akpm, david, linux-mm, linux-kernel, mhocko,
	shy828301, wangkefeng.wang, willy, xiang, ying.huang, yuzhao,
	surenb, steven.price, Chuanhua Han, Barry Song

On Thu, Jan 18, 2024 at 3:11 AM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Chuanhua Han <hanchuanhua@oppo.com>
>
> While swapping in a large folio, we need to free swaps related to the whole
> folio. To avoid frequently acquiring and releasing swap locks, it is better
> to introduce an API for batched free.
>
> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  include/linux/swap.h |  6 ++++++
>  mm/swapfile.c        | 29 +++++++++++++++++++++++++++++
>  2 files changed, 35 insertions(+)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 4db00ddad261..31a4ee2dcd1c 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -478,6 +478,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>  extern int swap_duplicate(swp_entry_t);
>  extern int swapcache_prepare(swp_entry_t);
>  extern void swap_free(swp_entry_t);
> +extern void swap_nr_free(swp_entry_t entry, int nr_pages);
>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>  extern int free_swap_and_cache(swp_entry_t);
>  int swap_type_of(dev_t device, sector_t offset);
> @@ -553,6 +554,11 @@ static inline void swap_free(swp_entry_t swp)
>  {
>  }
>
> +void swap_nr_free(swp_entry_t entry, int nr_pages)
> +{
> +
> +}
> +
>  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>  {
>  }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 556ff7347d5f..6321bda96b77 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1335,6 +1335,35 @@ void swap_free(swp_entry_t entry)
>                 __swap_entry_free(p, entry);
>  }
>
> +void swap_nr_free(swp_entry_t entry, int nr_pages)
> +{
> +       int i;
> +       struct swap_cluster_info *ci;
> +       struct swap_info_struct *p;
> +       unsigned type = swp_type(entry);
> +       unsigned long offset = swp_offset(entry);
> +       DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };
> +
> +       VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);

BUG_ON here seems a bit too developer originated. Maybe warn once and
roll back to free one by one?

How big is your typical SWAPFILE_CUSTER and nr_pages typically in arm?

I ask this question because nr_ppages > 64, that is a totally
different game, we can completely  bypass the swap cache slots.

> +
> +       if (nr_pages == 1) {
> +               swap_free(entry);
> +               return;
> +       }
> +
> +       p = _swap_info_get(entry);
> +
> +       ci = lock_cluster(p, offset);
> +       for (i = 0; i < nr_pages; i++) {
> +               if (__swap_entry_free_locked(p, offset + i, 1))
> +                       __bitmap_set(usage, i, 1);
> +       }
> +       unlock_cluster(ci);
> +
> +       for_each_clear_bit(i, usage, nr_pages)
> +               free_swap_slot(swp_entry(type, offset + i));

Notice that free_swap_slot() internal has per CPU cache batching as
well. Every free_swap_slot will get some per_cpu swap slot cache and
cache->lock. There is double batching here.
If the typical batch size here is bigger than 64 entries, we can go
directly to batching swap_entry_free and avoid the free_swap_slot()
batching altogether. Unlike free_swap_slot_entries(), here swap slots
are all from one swap device, there is no need to sort and group the
swap slot by swap devices.

Chris


Chris

> +}
> +
>  /*
>   * Called after dropping swapcache to decrease refcnt to swap entries.
>   */
> --
> 2.34.1
>
>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 3/6] mm: swap: make should_try_to_free_swap() support large-folio
  2024-01-18 11:10   ` [PATCH RFC 3/6] mm: swap: make should_try_to_free_swap() support large-folio Barry Song
@ 2024-01-26 23:22     ` Chris Li
  0 siblings, 0 replies; 116+ messages in thread
From: Chris Li @ 2024-01-26 23:22 UTC (permalink / raw)
  To: Barry Song
  Cc: ryan.roberts, akpm, david, linux-mm, linux-kernel, mhocko,
	shy828301, wangkefeng.wang, willy, xiang, ying.huang, yuzhao,
	surenb, steven.price, Chuanhua Han, Barry Song

Acked-by: Chris Li <chrisl@kernel.org>

Chris

On Thu, Jan 18, 2024 at 3:11 AM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Chuanhua Han <hanchuanhua@oppo.com>
>
> should_try_to_free_swap() works with an assumption that swap-in is always done
> at normal page granularity, aka, folio_nr_pages = 1. To support large folio
> swap-in, this patch removes the assumption.
>
> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  mm/memory.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 7e1f4849463a..f61a48929ba7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3714,7 +3714,7 @@ static inline bool should_try_to_free_swap(struct folio *folio,
>          * reference only in case it's likely that we'll be the exlusive user.
>          */
>         return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> -               folio_ref_count(folio) == 2;
> +               folio_ref_count(folio) == (1 + folio_nr_pages(folio));
>  }
>
>  static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
> --
> 2.34.1
>
>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 0/6] mm: support large folios swap-in
  2024-01-19 13:25       ` Ryan Roberts
@ 2024-01-27 14:27         ` Barry Song
  0 siblings, 0 replies; 116+ messages in thread
From: Barry Song @ 2024-01-27 14:27 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: akpm, david, linux-mm, linux-kernel, mhocko, shy828301,
	wangkefeng.wang, willy, xiang, ying.huang, yuzhao, surenb,
	steven.price

On Fri, Jan 19, 2024 at 9:25 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 18/01/2024 23:54, Barry Song wrote:
> > On Thu, Jan 18, 2024 at 11:25 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 18/01/2024 11:10, Barry Song wrote:
> >>> On an embedded system like Android, more than half of anon memory is actually
> >>> in swap devices such as zRAM. For example, while an app is switched to back-
> >>> ground, its most memory might be swapped-out.
> >>>
> >>> Now we have mTHP features, unfortunately, if we don't support large folios
> >>> swap-in, once those large folios are swapped-out, we immediately lose the
> >>> performance gain we can get through large folios and hardware optimization
> >>> such as CONT-PTE.
> >>>
> >>> In theory, we don't need to rely on Ryan's swap out patchset[1]. That is to say,
> >>> before swap-out, if some memory were normal pages, but when swapping in, we
> >>> can also swap-in them as large folios.
> >>
> >> I think this could also violate MADV_NOHUGEPAGE; if the application has
> >> requested that we do not create a THP, then we had better not; it could cause a
> >> correctness issue in some circumstances. You would need to pay attention to this
> >> vma flag if taking this approach.
> >>
> >>> But this might require I/O happen at
> >>> some random places in swap devices. So we limit the large folios swap-in to
> >>> those areas which were large folios before swapping-out, aka, swaps are also
> >>> contiguous in hardware.
> >>
> >> In fact, even this may not be sufficient; it's possible that a contiguous set of
> >> base pages (small folios) were allocated to a virtual mapping and all swapped
> >> out together - they would likely end up contiguous in the swap file, but should
> >> not be swapped back in as a single folio because of this (same reasoning applies
> >> to cluster of smaller THPs that you mistake for a larger THP, etc).
> >>
> >> So you will need to check what THP sizes are enabled and check the VMA
> >> suitability regardless; Perhaps you are already doing this - I haven't looked at
> >> the code yet.
> >
> > we are actually re-using your alloc_anon_folio() by adding a parameter
> > to make it
> > support both do_anon_page and do_swap_page,
> >
> > -static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> > +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> > +      bool (*pte_range_check)(pte_t *, int))
> >  {
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >   struct vm_area_struct *vma = vmf->vma;
> > @@ -4190,7 +4270,7 @@ static struct folio *alloc_anon_folio(struct
> > vm_fault *vmf)
> >   order = highest_order(orders);
> >   while (orders) {
> >   addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > - if (pte_range_none(pte + pte_index(addr), 1 << order))
> > + if (pte_range_check(pte + pte_index(addr), 1 << order))
> >   break;
> >   order = next_order(&orders, order);
> >   }
> > @@ -4269,7 +4349,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >   if (unlikely(anon_vma_prepare(vma)))
> >   goto oom;
> >   /* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
> > - folio = alloc_anon_folio(vmf);
> > + folio = alloc_anon_folio(vmf, pte_range_none);
> >   if (IS_ERR(folio))
> >   return 0;
> >   if (!folio)
> > --
> >
> > I assume this has checked everything?
>
> Ahh yes, very good. In that case you can disregard what I said; its already covered.
>
> I notice that this series appears as a reply to my series. I'm not sure what the
> normal convention is, but I expect more people would see it if you posted it as
> its own thread?

yes. i was replying your series because we were using your series to
swapout large folios
without splitting. in v2,  i will send it as a new thread.

>
>
> >
> >>
> >> I'll aim to review the code in the next couple of weeks.
> >
> > nice, thanks!
> >
> >>
> >> Thanks,
> >> Ryan
> >>
> >>> On the other hand, in OPPO's product, we've deployed
> >>> anon large folios on millions of phones[2]. we enhanced zsmalloc and zRAM to
> >>> compress and decompress large folios as a whole, which help improve compression
> >>> ratio and decrease CPU consumption significantly. In zsmalloc and zRAM we can
> >>> save large objects whose original size are 64KiB for example. So it is also a
> >>> better choice for us to only swap-in large folios for those compressed large
> >>> objects as a large folio can be decompressed all together.
> >>>
> >>> Note I am moving my previous "arm64: mm: swap: support THP_SWAP on hardware
> >>> with MTE" to this series as it might help review.
> >>>
> >>> [1] [PATCH v3 0/4] Swap-out small-sized THP without splitting
> >>> https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/
> >>> [2] OnePlusOSS / android_kernel_oneplus_sm8550
> >>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
> >>>
> >>> Barry Song (2):
> >>>   arm64: mm: swap: support THP_SWAP on hardware with MTE
> >>>   mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()
> >>>
> >>> Chuanhua Han (4):
> >>>   mm: swap: introduce swap_nr_free() for batched swap_free()
> >>>   mm: swap: make should_try_to_free_swap() support large-folio
> >>>   mm: support large folios swapin as a whole
> >>>   mm: madvise: don't split mTHP for MADV_PAGEOUT
> >>>
> >>>  arch/arm64/include/asm/pgtable.h |  21 ++----
> >>>  arch/arm64/mm/mteswap.c          |  42 ++++++++++++
> >>>  include/asm-generic/tlb.h        |  10 +++
> >>>  include/linux/huge_mm.h          |  12 ----
> >>>  include/linux/pgtable.h          |  62 ++++++++++++++++-
> >>>  include/linux/swap.h             |   6 ++
> >>>  mm/madvise.c                     |  48 ++++++++++++++
> >>>  mm/memory.c                      | 110 ++++++++++++++++++++++++++-----
> >>>  mm/page_io.c                     |   2 +-
> >>>  mm/rmap.c                        |   5 +-
> >>>  mm/swap_slots.c                  |   2 +-
> >>>  mm/swapfile.c                    |  29 ++++++++
> >>>  12 files changed, 301 insertions(+), 48 deletions(-)
> >>>
> >>
> >
 Thanks
 Barry
>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 4/6] mm: support large folios swapin as a whole
  2024-01-18 11:10   ` [PATCH RFC 4/6] mm: support large folios swapin as a whole Barry Song
@ 2024-01-27 19:53     ` Chris Li
  2024-02-26  7:29       ` Barry Song
  2024-01-27 20:06     ` Chris Li
  1 sibling, 1 reply; 116+ messages in thread
From: Chris Li @ 2024-01-27 19:53 UTC (permalink / raw)
  To: Barry Song
  Cc: ryan.roberts, akpm, david, linux-mm, linux-kernel, mhocko,
	shy828301, wangkefeng.wang, willy, xiang, ying.huang, yuzhao,
	surenb, steven.price, Chuanhua Han, Barry Song

On Thu, Jan 18, 2024 at 3:12 AM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Chuanhua Han <hanchuanhua@oppo.com>
>
> On an embedded system like Android, more than half of anon memory is actually
> in swap devices such as zRAM. For example, while an app is switched to back-
> ground, its most memory might be swapped-out.
>
> Now we have mTHP features, unfortunately, if we don't support large folios
> swap-in, once those large folios are swapped-out, we immediately lose the
> performance gain we can get through large folios and hardware optimization
> such as CONT-PTE.
>
> This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> to those contiguous swaps which were likely swapped out from mTHP as a whole.
>
> On the other hand, the current implementation only covers the SWAP_SYCHRONOUS
> case. It doesn't support swapin_readahead as large folios yet.
>
> Right now, we are re-faulting large folios which are still in swapcache as a
> whole, this can effectively decrease extra loops and early-exitings which we
> have increased in arch_swap_restore() while supporting MTE restore for folios
> rather than page.
>
> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  mm/memory.c | 108 +++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 94 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index f61a48929ba7..928b3f542932 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -107,6 +107,8 @@ EXPORT_SYMBOL(mem_map);
>  static vm_fault_t do_fault(struct vm_fault *vmf);
>  static vm_fault_t do_anonymous_page(struct vm_fault *vmf);
>  static bool vmf_pte_changed(struct vm_fault *vmf);
> +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> +                                     bool (*pte_range_check)(pte_t *, int));

Instead of returning "bool", the pte_range_check() can return the
start of the swap entry of the large folio.
That will save some of the later code needed to get the start of the
large folio.

>
>  /*
>   * Return true if the original pte was a uffd-wp pte marker (so the pte was
> @@ -3784,6 +3786,34 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
>         return VM_FAULT_SIGBUS;
>  }
>
> +static bool pte_range_swap(pte_t *pte, int nr_pages)

This function name seems to suggest it will perform the range swap.
That is not what it is doing.
Suggest change to some other name reflecting that it is only a
condition test without actual swap action.
I am not very good at naming functions. Just think it out loud: e.g.
pte_range_swap_check, pte_test_range_swap. You can come up with
something better.


> +{
> +       int i;
> +       swp_entry_t entry;
> +       unsigned type;
> +       pgoff_t start_offset;
> +
> +       entry = pte_to_swp_entry(ptep_get_lockless(pte));
> +       if (non_swap_entry(entry))
> +               return false;
> +       start_offset = swp_offset(entry);
> +       if (start_offset % nr_pages)
> +               return false;

This suggests the pte argument needs to point to the beginning of the
large folio equivalent of swap entry(not sure what to call it. Let me
call it "large folio swap" here.).
We might want to unify the terms for that.
Any way, might want to document this requirement, otherwise the caller
might consider passing the current pte that generates the fault. From
the function name it is not obvious which pte should pass it.

> +
> +       type = swp_type(entry);
> +       for (i = 1; i < nr_pages; i++) {

You might want to test the last page backwards, because if the entry
is not the large folio swap, most likely it will have the last entry
invalid.  Some of the beginning swap entries might match due to batch
allocation etc. The SSD likes to group the nearby swap entry write out
together on the disk.



> +               entry = pte_to_swp_entry(ptep_get_lockless(pte + i));

> +               if (non_swap_entry(entry))
> +                       return false;
> +               if (swp_offset(entry) != start_offset + i)
> +                       return false;
> +               if (swp_type(entry) != type)
> +                       return false;
> +       }
> +
> +       return true;
> +}
> +
>  /*
>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>   * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -3804,6 +3834,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         pte_t pte;
>         vm_fault_t ret = 0;
>         void *shadow = NULL;
> +       int nr_pages = 1;
> +       unsigned long start_address;
> +       pte_t *start_pte;
>
>         if (!pte_unmap_same(vmf))
>                 goto out;
> @@ -3868,13 +3901,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>                     __swap_count(entry) == 1) {
>                         /* skip swapcache */
> -                       folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> -                                               vma, vmf->address, false);
> +                       folio = alloc_anon_folio(vmf, pte_range_swap);

This function can call pte_range_swap() twice(), one here, another one
in folio_test_large().
Consider caching the result so it does not need to walk the pte range
swap twice.

I think alloc_anon_folio should either be told what is the
size(prefered) or just figure out the right size. I don't think it
needs to pass in the checking function as function callbacks. There
are two call sites of alloc_anon_folio, they are all within this
function. The call back seems a bit overkill here. Also duplicate the
range swap walk.

>                         page = &folio->page;
>                         if (folio) {
>                                 __folio_set_locked(folio);
>                                 __folio_set_swapbacked(folio);
>
> +                               if (folio_test_large(folio)) {
> +                                       unsigned long start_offset;
> +
> +                                       nr_pages = folio_nr_pages(folio);
> +                                       start_offset = swp_offset(entry) & ~(nr_pages - 1);
Here is the first place place we roll up the start offset with folio size

> +                                       entry = swp_entry(swp_type(entry), start_offset);
> +                               }
> +
>                                 if (mem_cgroup_swapin_charge_folio(folio,
>                                                         vma->vm_mm, GFP_KERNEL,
>                                                         entry)) {
> @@ -3980,6 +4020,39 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>          */
>         vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>                         &vmf->ptl);
> +
> +       start_address = vmf->address;
> +       start_pte = vmf->pte;
> +       if (folio_test_large(folio)) {
> +               unsigned long nr = folio_nr_pages(folio);
> +               unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
> +               pte_t *pte_t = vmf->pte - (vmf->address - addr) / PAGE_SIZE;

Here is the second place we roll up the folio size.
Maybe we can cache results and avoid repetition?

> +
> +               /*
> +                * case 1: we are allocating large_folio, try to map it as a whole
> +                * iff the swap entries are still entirely mapped;
> +                * case 2: we hit a large folio in swapcache, and all swap entries
> +                * are still entirely mapped, try to map a large folio as a whole.
> +                * otherwise, map only the faulting page within the large folio
> +                * which is swapcache
> +                */

One question I have in mind is that the swap device is locked. We
can't change the swap slot allocations.
It does not stop the pte entry getting changed right? Then we can have
someone in the user pace racing to change the PTE vs we checking the
pte there.

> +               if (pte_range_swap(pte_t, nr)) {

After this pte_range_swap() check, some of the PTE entries get changed
and now we don't have the full large page swap any more?
At least I can't conclude this possibility can't happen yet, please
enlighten me.

> +                       start_address = addr;
> +                       start_pte = pte_t;
> +                       if (unlikely(folio == swapcache)) {
> +                               /*
> +                                * the below has been done before swap_read_folio()
> +                                * for case 1
> +                                */
> +                               nr_pages = nr;
> +                               entry = pte_to_swp_entry(ptep_get(start_pte));

If we make pte_range_swap() return the entry, we can avoid refetching
the swap entry here.

> +                               page = &folio->page;
> +                       }
> +               } else if (nr_pages > 1) { /* ptes have changed for case 1 */
> +                       goto out_nomap;
> +               }
> +       }
> +
I rewrote the above to make the code indentation matching the execution flow.
I did not add any functional change. Just rearrange the code to be a
bit more streamlined. Get rid of the "else if goto".
               if (!pte_range_swap(pte_t, nr)) {
                        if (nr_pages > 1)  /* ptes have changed for case 1 */
                                goto out_nomap;
                        goto check_pte;
                }

                start_address = addr;
                start_pte = pte_t;
                if (unlikely(folio == swapcache)) {
                        /*
                         * the below has been done before swap_read_folio()
                         * for case 1
                         */
                        nr_pages = nr;
                        entry = pte_to_swp_entry(ptep_get(start_pte));
                        page = &folio->page;
                }
        }

check_pte:

>         if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>                 goto out_nomap;
>
> @@ -4047,12 +4120,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>          * We're already holding a reference on the page but haven't mapped it
>          * yet.
>          */
> -       swap_free(entry);
> +       swap_nr_free(entry, nr_pages);
>         if (should_try_to_free_swap(folio, vma, vmf->flags))
>                 folio_free_swap(folio);
>
> -       inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> -       dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> +       folio_ref_add(folio, nr_pages - 1);
> +       add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> +       add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> +
>         pte = mk_pte(page, vma->vm_page_prot);
>
>         /*
> @@ -4062,14 +4137,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>          * exclusivity.
>          */
>         if (!folio_test_ksm(folio) &&
> -           (exclusive || folio_ref_count(folio) == 1)) {
> +           (exclusive || folio_ref_count(folio) == nr_pages)) {
>                 if (vmf->flags & FAULT_FLAG_WRITE) {
>                         pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>                         vmf->flags &= ~FAULT_FLAG_WRITE;
>                 }
>                 rmap_flags |= RMAP_EXCLUSIVE;
>         }
> -       flush_icache_page(vma, page);
> +       flush_icache_pages(vma, page, nr_pages);
>         if (pte_swp_soft_dirty(vmf->orig_pte))
>                 pte = pte_mksoft_dirty(pte);
>         if (pte_swp_uffd_wp(vmf->orig_pte))
> @@ -4081,14 +4156,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                 folio_add_new_anon_rmap(folio, vma, vmf->address);
>                 folio_add_lru_vma(folio, vma);
>         } else {
> -               folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> +               folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
>                                         rmap_flags);
>         }
>
>         VM_BUG_ON(!folio_test_anon(folio) ||
>                         (pte_write(pte) && !PageAnonExclusive(page)));
> -       set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> -       arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> +       set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> +
> +       arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
>
>         folio_unlock(folio);
>         if (folio != swapcache && swapcache) {
> @@ -4105,6 +4181,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         }
>
>         if (vmf->flags & FAULT_FLAG_WRITE) {
> +               if (folio_test_large(folio) && nr_pages > 1)
> +                       vmf->orig_pte = ptep_get(vmf->pte);
> +
>                 ret |= do_wp_page(vmf);
>                 if (ret & VM_FAULT_ERROR)
>                         ret &= VM_FAULT_ERROR;
> @@ -4112,7 +4191,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         }
>
>         /* No need to invalidate - it was non-present before */
> -       update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> +       update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
>  unlock:
>         if (vmf->pte)
>                 pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -4148,7 +4227,8 @@ static bool pte_range_none(pte_t *pte, int nr_pages)
>         return true;
>  }
>
> -static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> +                                     bool (*pte_range_check)(pte_t *, int))
>  {
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>         struct vm_area_struct *vma = vmf->vma;
> @@ -4190,7 +4270,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)

About this patch context we have the following comments in the source code.
        /*
         * Find the highest order where the aligned range is completely
         * pte_none(). Note that all remaining orders will be completely
         * pte_none().
         */
>         order = highest_order(orders);
>         while (orders) {
>                 addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> -               if (pte_range_none(pte + pte_index(addr), 1 << order))
> +               if (pte_range_check(pte + pte_index(addr), 1 << order))

Again, I don't think we need to pass in the pte_range_check() as call
back functions.
There are only two call sites, all within this file. This will totally
invalide the above comments about pte_none(). In the worst case, just
make it accept one argument: it is checking swap range or none range
or not. Depending on the argument, do check none or swap range.
We should make it blend in with alloc_anon_folio better. My gut
feeling is that there should be a better way to make the range check
blend in with alloc_anon_folio better. e.g. Maybe store some of the
large swap context in the vmf and pass to different places etc. I need
to spend more time thinking about it to come up with happier
solutions.

Chris

>                         break;
>                 order = next_order(&orders, order);
>         }
> @@ -4269,7 +4349,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>         if (unlikely(anon_vma_prepare(vma)))
>                 goto oom;
>         /* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
> -       folio = alloc_anon_folio(vmf);
> +       folio = alloc_anon_folio(vmf, pte_range_none);
>         if (IS_ERR(folio))
>                 return 0;
>         if (!folio)
> --
> 2.34.1
>
>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 4/6] mm: support large folios swapin as a whole
  2024-01-18 11:10   ` [PATCH RFC 4/6] mm: support large folios swapin as a whole Barry Song
  2024-01-27 19:53     ` Chris Li
@ 2024-01-27 20:06     ` Chris Li
  2024-02-26  7:31       ` Barry Song
  1 sibling, 1 reply; 116+ messages in thread
From: Chris Li @ 2024-01-27 20:06 UTC (permalink / raw)
  To: Barry Song
  Cc: ryan.roberts, akpm, david, linux-mm, linux-kernel, mhocko,
	shy828301, wangkefeng.wang, willy, xiang, ying.huang, yuzhao,
	surenb, steven.price, Chuanhua Han, Barry Song

On Thu, Jan 18, 2024 at 3:12 AM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Chuanhua Han <hanchuanhua@oppo.com>
>
> On an embedded system like Android, more than half of anon memory is actually
> in swap devices such as zRAM. For example, while an app is switched to back-
> ground, its most memory might be swapped-out.
>
> Now we have mTHP features, unfortunately, if we don't support large folios
> swap-in, once those large folios are swapped-out, we immediately lose the
> performance gain we can get through large folios and hardware optimization
> such as CONT-PTE.
>
> This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> to those contiguous swaps which were likely swapped out from mTHP as a whole.
>
> On the other hand, the current implementation only covers the SWAP_SYCHRONOUS
> case. It doesn't support swapin_readahead as large folios yet.
>
> Right now, we are re-faulting large folios which are still in swapcache as a
> whole, this can effectively decrease extra loops and early-exitings which we
> have increased in arch_swap_restore() while supporting MTE restore for folios
> rather than page.
>
> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  mm/memory.c | 108 +++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 94 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index f61a48929ba7..928b3f542932 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -107,6 +107,8 @@ EXPORT_SYMBOL(mem_map);
>  static vm_fault_t do_fault(struct vm_fault *vmf);
>  static vm_fault_t do_anonymous_page(struct vm_fault *vmf);
>  static bool vmf_pte_changed(struct vm_fault *vmf);
> +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> +                                     bool (*pte_range_check)(pte_t *, int));
>
>  /*
>   * Return true if the original pte was a uffd-wp pte marker (so the pte was
> @@ -3784,6 +3786,34 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
>         return VM_FAULT_SIGBUS;
>  }
>
> +static bool pte_range_swap(pte_t *pte, int nr_pages)
> +{
> +       int i;
> +       swp_entry_t entry;
> +       unsigned type;
> +       pgoff_t start_offset;
> +
> +       entry = pte_to_swp_entry(ptep_get_lockless(pte));
> +       if (non_swap_entry(entry))
> +               return false;
> +       start_offset = swp_offset(entry);
> +       if (start_offset % nr_pages)
> +               return false;
> +
> +       type = swp_type(entry);
> +       for (i = 1; i < nr_pages; i++) {
> +               entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
> +               if (non_swap_entry(entry))
> +                       return false;
> +               if (swp_offset(entry) != start_offset + i)
> +                       return false;
> +               if (swp_type(entry) != type)
> +                       return false;
> +       }
> +
> +       return true;
> +}
> +
>  /*
>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>   * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -3804,6 +3834,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         pte_t pte;
>         vm_fault_t ret = 0;
>         void *shadow = NULL;
> +       int nr_pages = 1;
> +       unsigned long start_address;
> +       pte_t *start_pte;
>
>         if (!pte_unmap_same(vmf))
>                 goto out;
> @@ -3868,13 +3901,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>                     __swap_count(entry) == 1) {
>                         /* skip swapcache */
> -                       folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> -                                               vma, vmf->address, false);
> +                       folio = alloc_anon_folio(vmf, pte_range_swap);
>                         page = &folio->page;
>                         if (folio) {
>                                 __folio_set_locked(folio);
>                                 __folio_set_swapbacked(folio);
>
> +                               if (folio_test_large(folio)) {
> +                                       unsigned long start_offset;
> +
> +                                       nr_pages = folio_nr_pages(folio);
> +                                       start_offset = swp_offset(entry) & ~(nr_pages - 1);
> +                                       entry = swp_entry(swp_type(entry), start_offset);
> +                               }
> +
>                                 if (mem_cgroup_swapin_charge_folio(folio,
>                                                         vma->vm_mm, GFP_KERNEL,
>                                                         entry)) {
> @@ -3980,6 +4020,39 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>          */
>         vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>                         &vmf->ptl);
> +
> +       start_address = vmf->address;
> +       start_pte = vmf->pte;
> +       if (folio_test_large(folio)) {
> +               unsigned long nr = folio_nr_pages(folio);
> +               unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
> +               pte_t *pte_t = vmf->pte - (vmf->address - addr) / PAGE_SIZE;

I forgot about one comment here.
Please change the variable name other than "pte_t", it is a bit
strange to use the typedef name as variable name here.

Chris

> +
> +               /*
> +                * case 1: we are allocating large_folio, try to map it as a whole
> +                * iff the swap entries are still entirely mapped;
> +                * case 2: we hit a large folio in swapcache, and all swap entries
> +                * are still entirely mapped, try to map a large folio as a whole.
> +                * otherwise, map only the faulting page within the large folio
> +                * which is swapcache
> +                */
> +               if (pte_range_swap(pte_t, nr)) {
> +                       start_address = addr;
> +                       start_pte = pte_t;
> +                       if (unlikely(folio == swapcache)) {
> +                               /*
> +                                * the below has been done before swap_read_folio()
> +                                * for case 1
> +                                */
> +                               nr_pages = nr;
> +                               entry = pte_to_swp_entry(ptep_get(start_pte));
> +                               page = &folio->page;
> +                       }
> +               } else if (nr_pages > 1) { /* ptes have changed for case 1 */
> +                       goto out_nomap;
> +               }
> +       }
> +
>         if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>                 goto out_nomap;
>
> @@ -4047,12 +4120,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>          * We're already holding a reference on the page but haven't mapped it
>          * yet.
>          */
> -       swap_free(entry);
> +       swap_nr_free(entry, nr_pages);
>         if (should_try_to_free_swap(folio, vma, vmf->flags))
>                 folio_free_swap(folio);
>
> -       inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> -       dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> +       folio_ref_add(folio, nr_pages - 1);
> +       add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> +       add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> +
>         pte = mk_pte(page, vma->vm_page_prot);
>
>         /*
> @@ -4062,14 +4137,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>          * exclusivity.
>          */
>         if (!folio_test_ksm(folio) &&
> -           (exclusive || folio_ref_count(folio) == 1)) {
> +           (exclusive || folio_ref_count(folio) == nr_pages)) {
>                 if (vmf->flags & FAULT_FLAG_WRITE) {
>                         pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>                         vmf->flags &= ~FAULT_FLAG_WRITE;
>                 }
>                 rmap_flags |= RMAP_EXCLUSIVE;
>         }
> -       flush_icache_page(vma, page);
> +       flush_icache_pages(vma, page, nr_pages);
>         if (pte_swp_soft_dirty(vmf->orig_pte))
>                 pte = pte_mksoft_dirty(pte);
>         if (pte_swp_uffd_wp(vmf->orig_pte))
> @@ -4081,14 +4156,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                 folio_add_new_anon_rmap(folio, vma, vmf->address);
>                 folio_add_lru_vma(folio, vma);
>         } else {
> -               folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> +               folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
>                                         rmap_flags);
>         }
>
>         VM_BUG_ON(!folio_test_anon(folio) ||
>                         (pte_write(pte) && !PageAnonExclusive(page)));
> -       set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> -       arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> +       set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> +
> +       arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
>
>         folio_unlock(folio);
>         if (folio != swapcache && swapcache) {
> @@ -4105,6 +4181,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         }
>
>         if (vmf->flags & FAULT_FLAG_WRITE) {
> +               if (folio_test_large(folio) && nr_pages > 1)
> +                       vmf->orig_pte = ptep_get(vmf->pte);
> +
>                 ret |= do_wp_page(vmf);
>                 if (ret & VM_FAULT_ERROR)
>                         ret &= VM_FAULT_ERROR;
> @@ -4112,7 +4191,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         }
>
>         /* No need to invalidate - it was non-present before */
> -       update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> +       update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
>  unlock:
>         if (vmf->pte)
>                 pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -4148,7 +4227,8 @@ static bool pte_range_none(pte_t *pte, int nr_pages)
>         return true;
>  }
>
> -static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> +                                     bool (*pte_range_check)(pte_t *, int))
>  {
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>         struct vm_area_struct *vma = vmf->vma;
> @@ -4190,7 +4270,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>         order = highest_order(orders);
>         while (orders) {
>                 addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> -               if (pte_range_none(pte + pte_index(addr), 1 << order))
> +               if (pte_range_check(pte + pte_index(addr), 1 << order))
>                         break;
>                 order = next_order(&orders, order);
>         }
> @@ -4269,7 +4349,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>         if (unlikely(anon_vma_prepare(vma)))
>                 goto oom;
>         /* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
> -       folio = alloc_anon_folio(vmf);
> +       folio = alloc_anon_folio(vmf, pte_range_none);
>         if (IS_ERR(folio))
>                 return 0;
>         if (!folio)
> --
> 2.34.1
>
>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()
  2024-01-18 11:10   ` [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap() Barry Song
  2024-01-18 11:54     ` David Hildenbrand
@ 2024-01-27 23:41     ` Chris Li
  1 sibling, 0 replies; 116+ messages in thread
From: Chris Li @ 2024-01-27 23:41 UTC (permalink / raw)
  To: Barry Song
  Cc: ryan.roberts, akpm, david, linux-mm, linux-kernel, mhocko,
	shy828301, wangkefeng.wang, willy, xiang, ying.huang, yuzhao,
	surenb, steven.price, Barry Song, Chuanhua Han

On Thu, Jan 18, 2024 at 3:12 AM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> In do_swap_page(), while supporting large folio swap-in, we are using the helper
> folio_add_anon_rmap_ptes. This is triggerring a WARN_ON in __folio_add_anon_rmap.
> We can make the warning quiet by two ways
> 1. in do_swap_page, we call folio_add_new_anon_rmap() if we are sure the large
> folio is new allocated one; we call folio_add_anon_rmap_ptes() if we find the
> large folio in swapcache.
> 2. we always call folio_add_anon_rmap_ptes() in do_swap_page but weaken the
> WARN_ON in __folio_add_anon_rmap() by letting the WARN_ON less sensitive.
>
> Option 2 seems to be better for do_swap_page() as it can use unified code for
> all cases.
>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> Tested-by: Chuanhua Han <hanchuanhua@oppo.com>
> ---
>  mm/rmap.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index f5d43edad529..469fcfd32317 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1304,7 +1304,10 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
>                  * page.
>                  */
>                 VM_WARN_ON_FOLIO(folio_test_large(folio) &&
> -                                level != RMAP_LEVEL_PMD, folio);
> +                                level != RMAP_LEVEL_PMD &&
> +                                (!IS_ALIGNED(address, nr_pages * PAGE_SIZE) ||
Some minor nitpick here.
There are two leading "(" in this and next line. This is the first "("
> +                                (folio_test_swapcache(folio) && !IS_ALIGNED(folio->index, nr_pages)) ||
 Second "("  here.

These two "(" are NOT at the same nested level. They should not have
the same indentation.
On my first glance, I misread the scope of the "||" due to the same
level indentation.
We can do one of the two
1) add more indentation on the second "(" to reflect the nesting level.

> +                                page != &folio->page), folio);

Also moving the folio to the next line, because the multiline
expression is huge and complex. Make it obvious the ending "folio" is
not part of the testing condition.

2) Move the multiline test condition to a checking function. Inside
the function it can return early when the shortcut condition is met.
That will also help the readability of this warning condition.

Chris

>                 __folio_set_anon(folio, vma, address,
>                                  !!(flags & RMAP_EXCLUSIVE));
>         } else if (likely(!folio_test_ksm(folio))) {
> --
> 2.34.1
>
>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT
  2024-01-18 11:10   ` [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT Barry Song
@ 2024-01-29  2:15     ` Chris Li
  2024-02-26  6:39       ` Barry Song
  2024-02-27 12:22     ` Ryan Roberts
  2024-02-27 14:40     ` Ryan Roberts
  2 siblings, 1 reply; 116+ messages in thread
From: Chris Li @ 2024-01-29  2:15 UTC (permalink / raw)
  To: Barry Song
  Cc: ryan.roberts, akpm, david, linux-mm, linux-kernel, mhocko,
	shy828301, wangkefeng.wang, willy, xiang, ying.huang, yuzhao,
	surenb, steven.price, Chuanhua Han, Barry Song

On Thu, Jan 18, 2024 at 3:12 AM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Chuanhua Han <hanchuanhua@oppo.com>
>
> MADV_PAGEOUT and MADV_FREE are common cases in Android. Ryan's patchset has
> supported swapping large folios out as a whole for vmscan case. This patch
> extends the feature to madvise.
>
> If madvised range covers the whole large folio, we don't split it. Otherwise,
> we still need to split it.
>
> This patch doesn't depend on ARM64's CONT-PTE, alternatively, it defines one
> helper named pte_range_cont_mapped() to check if all PTEs are contiguously
> mapped to a large folio.
>
> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  include/asm-generic/tlb.h | 10 +++++++
>  include/linux/pgtable.h   | 60 +++++++++++++++++++++++++++++++++++++++
>  mm/madvise.c              | 48 +++++++++++++++++++++++++++++++
>  3 files changed, 118 insertions(+)
>
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 129a3a759976..f894e22da5d6 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -608,6 +608,16 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
>                 __tlb_remove_tlb_entry(tlb, ptep, address);     \
>         } while (0)
>
> +#define tlb_remove_nr_tlb_entry(tlb, ptep, address, nr)                        \
> +       do {                                                            \
> +               int i;                                                  \
> +               tlb_flush_pte_range(tlb, address,                       \
> +                               PAGE_SIZE * nr);                        \
> +               for (i = 0; i < nr; i++)                                \
> +                       __tlb_remove_tlb_entry(tlb, ptep + i,           \
> +                                       address + i * PAGE_SIZE);       \
> +       } while (0)
> +
>  #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)       \
>         do {                                                    \
>                 unsigned long _sz = huge_page_size(h);          \
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 37fe83b0c358..da0c1cf447e3 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -320,6 +320,42 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
>  }
>  #endif
>
> +#ifndef pte_range_cont_mapped
> +static inline bool pte_range_cont_mapped(unsigned long start_pfn,
> +                                        pte_t *start_pte,
> +                                        unsigned long start_addr,
> +                                        int nr)
> +{
> +       int i;
> +       pte_t pte_val;
> +
> +       for (i = 0; i < nr; i++) {
> +               pte_val = ptep_get(start_pte + i);
> +
> +               if (pte_none(pte_val))
> +                       return false;

Hmm, the following check pte_pfn == start_pfn + i should have covered
the pte none case?

I think the pte_none means it can't have a valid pfn. So this check
can be skipped?

> +
> +               if (pte_pfn(pte_val) != (start_pfn + i))
> +                       return false;
> +       }
> +
> +       return true;
> +}
> +#endif
> +
> +#ifndef pte_range_young
> +static inline bool pte_range_young(pte_t *start_pte, int nr)
> +{
> +       int i;
> +
> +       for (i = 0; i < nr; i++)
> +               if (pte_young(ptep_get(start_pte + i)))
> +                       return true;
> +
> +       return false;
> +}
> +#endif
> +
>  #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>  static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>                                             unsigned long address,
> @@ -580,6 +616,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>  }
>  #endif
>
> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_RANGE_FULL
> +static inline pte_t ptep_get_and_clear_range_full(struct mm_struct *mm,
> +                                                 unsigned long start_addr,
> +                                                 pte_t *start_pte,
> +                                                 int nr, int full)
> +{
> +       int i;
> +       pte_t pte;
> +
> +       pte = ptep_get_and_clear_full(mm, start_addr, start_pte, full);
> +
> +       for (i = 1; i < nr; i++)
> +               ptep_get_and_clear_full(mm, start_addr + i * PAGE_SIZE,
> +                                       start_pte + i, full);
> +
> +       return pte;
> +}
>
>  /*
>   * If two threads concurrently fault at the same page, the thread that
> @@ -995,6 +1048,13 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>  })
>  #endif
>
> +#ifndef pte_nr_addr_end
> +#define pte_nr_addr_end(addr, size, end)                               \
> +({     unsigned long __boundary = ((addr) + size) & (~(size - 1));     \
> +       (__boundary - 1 < (end) - 1)? __boundary: (end);                \
> +})
> +#endif
> +
>  /*
>   * When walking page tables, we usually want to skip any p?d_none entries;
>   * and any p?d_bad entries - reporting the error before resetting to none.
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 912155a94ed5..262460ac4b2e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -452,6 +452,54 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>                 if (folio_test_large(folio)) {
>                         int err;
>
> +                       if (!folio_test_pmd_mappable(folio)) {

This session of code indent into the right too much.
You can do:

if (folio_test_pmd_mappable(folio))
         goto split;

to make the code flatter.

> +                               int nr_pages = folio_nr_pages(folio);
> +                               unsigned long folio_size = PAGE_SIZE * nr_pages;
> +                               unsigned long start_addr = ALIGN_DOWN(addr, nr_pages * PAGE_SIZE);;
> +                               unsigned long start_pfn = page_to_pfn(folio_page(folio, 0));
> +                               pte_t *start_pte = pte - (addr - start_addr) / PAGE_SIZE;
> +                               unsigned long next = pte_nr_addr_end(addr, folio_size, end);
> +
> +                               if (!pte_range_cont_mapped(start_pfn, start_pte, start_addr, nr_pages))
> +                                       goto split;
> +
> +                               if (next - addr != folio_size) {

Nitpick: One line statement does not need {

> +                                       goto split;
> +                               } else {

When the previous if statement already "goto split", there is no need
for the else. You can save one level of indentation.



> +                                       /* Do not interfere with other mappings of this page */
> +                                       if (folio_estimated_sharers(folio) != 1)
> +                                               goto skip;
> +
> +                                       VM_BUG_ON(addr != start_addr || pte != start_pte);
> +
> +                                       if (pte_range_young(start_pte, nr_pages)) {
> +                                               ptent = ptep_get_and_clear_range_full(mm, start_addr, start_pte,
> +                                                                                     nr_pages, tlb->fullmm);
> +                                               ptent = pte_mkold(ptent);
> +
> +                                               set_ptes(mm, start_addr, start_pte, ptent, nr_pages);
> +                                               tlb_remove_nr_tlb_entry(tlb, start_pte, start_addr, nr_pages);
> +                                       }
> +
> +                                       folio_clear_referenced(folio);
> +                                       folio_test_clear_young(folio);
> +                                       if (pageout) {
> +                                               if (folio_isolate_lru(folio)) {
> +                                                       if (folio_test_unevictable(folio))
> +                                                               folio_putback_lru(folio);
> +                                                       else
> +                                                               list_add(&folio->lru, &folio_list);
> +                                               }
> +                                       } else
> +                                               folio_deactivate(folio);

I notice this section is very similar to the earlier statements inside
the same function.
"if (pmd_trans_huge(*pmd)) {"

Wondering if there is some way to unify the two a bit somehow.

Also notice if you test the else condition first,

If (!pageout) {
    folio_deactivate(folio);
    goto skip;
}

You can save one level of indentation.
Not your fault, I notice the section inside (pmd_trans_huge(*pmd))
does exactly the same thing.

Chris


> +                               }
> +skip:
> +                               pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
> +                               addr = next - PAGE_SIZE;
> +                               continue;
> +
> +                       }
> +split:
>                         if (folio_estimated_sharers(folio) != 1)
>                                 break;
>                         if (pageout_anon_only_filter && !folio_test_anon(folio))
> --
> 2.34.1
>
>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()
  2024-01-23  6:49       ` Barry Song
@ 2024-01-29  3:25         ` Chris Li
  2024-01-29 10:06           ` David Hildenbrand
  0 siblings, 1 reply; 116+ messages in thread
From: Chris Li @ 2024-01-29  3:25 UTC (permalink / raw)
  To: Barry Song
  Cc: David Hildenbrand, ryan.roberts, akpm, linux-mm, linux-kernel,
	mhocko, shy828301, wangkefeng.wang, willy, xiang, ying.huang,
	yuzhao, surenb, steven.price, Barry Song, Chuanhua Han

Hi David and Barry,

On Mon, Jan 22, 2024 at 10:49 PM Barry Song <21cnbao@gmail.com> wrote:
>
> >
> >
> > I have on my todo list to move all that !anon handling out of
> > folio_add_anon_rmap_ptes(), and instead make swapin code call add
> > folio_add_new_anon_rmap(), where we'll have to pass an exclusive flag
> > then (-> whole new folio exclusive).
> >
> > That's the cleaner approach.
> >
>
> one tricky thing is that sometimes it is hard to know who is the first
> one to add rmap and thus should
> call folio_add_new_anon_rmap.
> especially when we want to support swapin_readahead(), the one who
> allocated large filio might not
> be that one who firstly does rmap.

I think Barry has a point. Two tasks might race to swap in the folio
then race to perform the rmap.
folio_add_new_anon_rmap() should only call a folio that is absolutely
"new", not shared. The sharing in swap cache disqualifies that
condition.

> is it an acceptable way to do the below in do_swap_page?
> if (!folio_test_anon(folio))
>       folio_add_new_anon_rmap()
> else
>       folio_add_anon_rmap_ptes()

I am curious to know the answer as well.

BTW, that test might have a race as well. By the time the task got
!anon result, this result might get changed by another task. We need
to make sure in the caller context this race can't happen. Otherwise
we can't do the above safely.

Chris.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 0/6] mm: support large folios swap-in
  2024-01-18 11:10 ` [PATCH RFC 0/6] mm: support large folios swap-in Barry Song
                     ` (6 preceding siblings ...)
  2024-01-18 15:25   ` [PATCH RFC 0/6] mm: support large folios swap-in Ryan Roberts
@ 2024-01-29  9:05   ` Huang, Ying
  7 siblings, 0 replies; 116+ messages in thread
From: Huang, Ying @ 2024-01-29  9:05 UTC (permalink / raw)
  To: Barry Song
  Cc: ryan.roberts, akpm, david, linux-mm, linux-kernel, mhocko,
	shy828301, wangkefeng.wang, willy, xiang, yuzhao, surenb,
	steven.price

Barry Song <21cnbao@gmail.com> writes:

> On an embedded system like Android, more than half of anon memory is actually
> in swap devices such as zRAM. For example, while an app is switched to back-
> ground, its most memory might be swapped-out.
>
> Now we have mTHP features, unfortunately, if we don't support large folios
> swap-in, once those large folios are swapped-out, we immediately lose the 
> performance gain we can get through large folios and hardware optimization
> such as CONT-PTE.
>
> In theory, we don't need to rely on Ryan's swap out patchset[1]. That is to say,
> before swap-out, if some memory were normal pages, but when swapping in, we
> can also swap-in them as large folios. But this might require I/O happen at
> some random places in swap devices. So we limit the large folios swap-in to
> those areas which were large folios before swapping-out, aka, swaps are also
> contiguous in hardware. On the other hand, in OPPO's product, we've deployed
> anon large folios on millions of phones[2]. we enhanced zsmalloc and zRAM to
> compress and decompress large folios as a whole, which help improve compression
> ratio and decrease CPU consumption significantly. In zsmalloc and zRAM we can
> save large objects whose original size are 64KiB for example. So it is also a
> better choice for us to only swap-in large folios for those compressed large
> objects as a large folio can be decompressed all together.

Another possibility is to combine large folios swap-in with VMA based
swap-in readahead.  If we will swap-in readahead several pages based on
VMA, we can swap-in a large folio instead.

I think that it is similar as allocating large file folios for file
readahead (TBH, I haven't check file large folios allocation code).

--
Best Regards,
Huang, Ying

> Note I am moving my previous "arm64: mm: swap: support THP_SWAP on hardware
> with MTE" to this series as it might help review.
>
> [1] [PATCH v3 0/4] Swap-out small-sized THP without splitting
> https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/
> [2] OnePlusOSS / android_kernel_oneplus_sm8550 
> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
>
> Barry Song (2):
>   arm64: mm: swap: support THP_SWAP on hardware with MTE
>   mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()
>
> Chuanhua Han (4):
>   mm: swap: introduce swap_nr_free() for batched swap_free()
>   mm: swap: make should_try_to_free_swap() support large-folio
>   mm: support large folios swapin as a whole
>   mm: madvise: don't split mTHP for MADV_PAGEOUT
>
>  arch/arm64/include/asm/pgtable.h |  21 ++----
>  arch/arm64/mm/mteswap.c          |  42 ++++++++++++
>  include/asm-generic/tlb.h        |  10 +++
>  include/linux/huge_mm.h          |  12 ----
>  include/linux/pgtable.h          |  62 ++++++++++++++++-
>  include/linux/swap.h             |   6 ++
>  mm/madvise.c                     |  48 ++++++++++++++
>  mm/memory.c                      | 110 ++++++++++++++++++++++++++-----
>  mm/page_io.c                     |   2 +-
>  mm/rmap.c                        |   5 +-
>  mm/swap_slots.c                  |   2 +-
>  mm/swapfile.c                    |  29 ++++++++
>  12 files changed, 301 insertions(+), 48 deletions(-)

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()
  2024-01-29  3:25         ` Chris Li
@ 2024-01-29 10:06           ` David Hildenbrand
  2024-01-29 16:31             ` Chris Li
  2024-04-06 23:27             ` Barry Song
  0 siblings, 2 replies; 116+ messages in thread
From: David Hildenbrand @ 2024-01-29 10:06 UTC (permalink / raw)
  To: Chris Li, Barry Song
  Cc: ryan.roberts, akpm, linux-mm, linux-kernel, mhocko, shy828301,
	wangkefeng.wang, willy, xiang, ying.huang, yuzhao, surenb,
	steven.price, Barry Song, Chuanhua Han

On 29.01.24 04:25, Chris Li wrote:
> Hi David and Barry,
> 
> On Mon, Jan 22, 2024 at 10:49 PM Barry Song <21cnbao@gmail.com> wrote:
>>
>>>
>>>
>>> I have on my todo list to move all that !anon handling out of
>>> folio_add_anon_rmap_ptes(), and instead make swapin code call add
>>> folio_add_new_anon_rmap(), where we'll have to pass an exclusive flag
>>> then (-> whole new folio exclusive).
>>>
>>> That's the cleaner approach.
>>>
>>
>> one tricky thing is that sometimes it is hard to know who is the first
>> one to add rmap and thus should
>> call folio_add_new_anon_rmap.
>> especially when we want to support swapin_readahead(), the one who
>> allocated large filio might not
>> be that one who firstly does rmap.
> 
> I think Barry has a point. Two tasks might race to swap in the folio
> then race to perform the rmap.
> folio_add_new_anon_rmap() should only call a folio that is absolutely
> "new", not shared. The sharing in swap cache disqualifies that
> condition.

We have to hold the folio lock. So only one task at a time might do the
folio_add_anon_rmap_ptes() right now, and the 
folio_add_new_shared_anon_rmap() in the future [below].

Also observe how folio_add_anon_rmap_ptes() states that one must hold 
the page lock, because otherwise this would all be completely racy.

 From the pte swp exclusive flags, we know for sure whether we are 
dealing with exclusive vs. shared. I think patch #6 does not properly 
check that all entries are actually the same in that regard (all 
exclusive vs all shared). That likely needs fixing.

[I have converting per-page PageAnonExclusive flags to a single 
per-folio flag on my todo list. I suspect that we'll keep the 
per-swp-pte exlusive bits, but the question is rather what we can 
actually make work, because swap and migration just make it much more 
complicated. Anyhow, future work]

> 
>> is it an acceptable way to do the below in do_swap_page?
>> if (!folio_test_anon(folio))
>>        folio_add_new_anon_rmap()
>> else
>>        folio_add_anon_rmap_ptes()
> 
> I am curious to know the answer as well.


Yes, the end code should likely be something like:

/* ksm created a completely new copy */
if (unlikely(folio != swapcache && swapcache)) {
	folio_add_new_anon_rmap(folio, vma, vmf->address);
	folio_add_lru_vma(folio, vma);
} else if (folio_test_anon(folio)) {
	folio_add_anon_rmap_ptes(rmap_flags)
} else {
	folio_add_new_anon_rmap(rmap_flags)
}

Maybe we want to avoid teaching all existing folio_add_new_anon_rmap() 
callers about a new flag, and just have a new 
folio_add_new_shared_anon_rmap() instead. TBD.

> 
> BTW, that test might have a race as well. By the time the task got
> !anon result, this result might get changed by another task. We need
> to make sure in the caller context this race can't happen. Otherwise
> we can't do the above safely.
Again, folio lock. Observe the folio_lock_or_retry() call that covers 
our existing folio_add_new_anon_rmap/folio_add_anon_rmap_pte calls.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()
  2024-01-29 10:06           ` David Hildenbrand
@ 2024-01-29 16:31             ` Chris Li
  2024-02-26  5:05               ` Barry Song
  2024-04-06 23:27             ` Barry Song
  1 sibling, 1 reply; 116+ messages in thread
From: Chris Li @ 2024-01-29 16:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Barry Song, ryan.roberts, akpm, linux-mm, linux-kernel, mhocko,
	shy828301, wangkefeng.wang, willy, xiang, ying.huang, yuzhao,
	surenb, steven.price, Barry Song, Chuanhua Han

On Mon, Jan 29, 2024 at 2:07 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 29.01.24 04:25, Chris Li wrote:
> > Hi David and Barry,
> >
> > On Mon, Jan 22, 2024 at 10:49 PM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >>>
> >>>
> >>> I have on my todo list to move all that !anon handling out of
> >>> folio_add_anon_rmap_ptes(), and instead make swapin code call add
> >>> folio_add_new_anon_rmap(), where we'll have to pass an exclusive flag
> >>> then (-> whole new folio exclusive).
> >>>
> >>> That's the cleaner approach.
> >>>
> >>
> >> one tricky thing is that sometimes it is hard to know who is the first
> >> one to add rmap and thus should
> >> call folio_add_new_anon_rmap.
> >> especially when we want to support swapin_readahead(), the one who
> >> allocated large filio might not
> >> be that one who firstly does rmap.
> >
> > I think Barry has a point. Two tasks might race to swap in the folio
> > then race to perform the rmap.
> > folio_add_new_anon_rmap() should only call a folio that is absolutely
> > "new", not shared. The sharing in swap cache disqualifies that
> > condition.
>
> We have to hold the folio lock. So only one task at a time might do the
> folio_add_anon_rmap_ptes() right now, and the
> folio_add_new_shared_anon_rmap() in the future [below].
>

Ah, I see. The folio_lock() is the answer I am looking for.

> Also observe how folio_add_anon_rmap_ptes() states that one must hold
> the page lock, because otherwise this would all be completely racy.
>
>  From the pte swp exclusive flags, we know for sure whether we are
> dealing with exclusive vs. shared. I think patch #6 does not properly
> check that all entries are actually the same in that regard (all
> exclusive vs all shared). That likely needs fixing.
>
> [I have converting per-page PageAnonExclusive flags to a single
> per-folio flag on my todo list. I suspect that we'll keep the
> per-swp-pte exlusive bits, but the question is rather what we can
> actually make work, because swap and migration just make it much more
> complicated. Anyhow, future work]
>
> >
> >> is it an acceptable way to do the below in do_swap_page?
> >> if (!folio_test_anon(folio))
> >>        folio_add_new_anon_rmap()
> >> else
> >>        folio_add_anon_rmap_ptes()
> >
> > I am curious to know the answer as well.
>
>
> Yes, the end code should likely be something like:
>
> /* ksm created a completely new copy */
> if (unlikely(folio != swapcache && swapcache)) {
>         folio_add_new_anon_rmap(folio, vma, vmf->address);
>         folio_add_lru_vma(folio, vma);
> } else if (folio_test_anon(folio)) {
>         folio_add_anon_rmap_ptes(rmap_flags)
> } else {
>         folio_add_new_anon_rmap(rmap_flags)
> }
>
> Maybe we want to avoid teaching all existing folio_add_new_anon_rmap()
> callers about a new flag, and just have a new
> folio_add_new_shared_anon_rmap() instead. TBD.

There is more than one caller needed to perform that dance around
folio_test_anon() then decide which function to call. It would be nice
to have a wrapper function folio_add_new_shared_anon_rmap() to
abstract this behavior.


>
> >
> > BTW, that test might have a race as well. By the time the task got
> > !anon result, this result might get changed by another task. We need
> > to make sure in the caller context this race can't happen. Otherwise
> > we can't do the above safely.
> Again, folio lock. Observe the folio_lock_or_retry() call that covers
> our existing folio_add_new_anon_rmap/folio_add_anon_rmap_pte calls.

Ack. Thanks for the explanation.

Chris

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2023-10-25 14:45 ` [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting Ryan Roberts
  2023-10-30  8:18   ` Huang, Ying
  2023-11-02  7:40   ` Barry Song
@ 2024-02-05  9:51   ` Barry Song
  2024-02-05 12:14     ` Ryan Roberts
                       ` (2 more replies)
  2024-02-22  7:05   ` Barry Song
  3 siblings, 3 replies; 116+ messages in thread
From: Barry Song @ 2024-02-05  9:51 UTC (permalink / raw)
  To: ryan.roberts
  Cc: akpm, david, linux-kernel, linux-mm, mhocko, shy828301,
	wangkefeng.wang, willy, xiang, ying.huang, yuzhao, chrisl,
	surenb, hanchuanhua

+Chris, Suren and Chuanhua

Hi Ryan,

> +	/*
> +	 * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
> +	 * so indicate that we are scanning to synchronise with swapoff.
> +	 */
> +	si->flags += SWP_SCANNING;
> +	ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
> +	si->flags -= SWP_SCANNING;

nobody is using this scan_base afterwards. it seems a bit weird to
pass a pointer.

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  					if (!can_split_folio(folio, NULL))
>  						goto activate_locked;
>  					/*
> -					 * Split folios without a PMD map right
> -					 * away. Chances are some or all of the
> -					 * tail pages can be freed without IO.
> +					 * Split PMD-mappable folios without a
> +					 * PMD map right away. Chances are some
> +					 * or all of the tail pages can be freed
> +					 * without IO.
>  					 */
> -					if (!folio_entire_mapcount(folio) &&
> +					if (folio_test_pmd_mappable(folio) &&
> +					    !folio_entire_mapcount(folio) &&
>  					    split_folio_to_list(folio,
>  								folio_list))
>  						goto activate_locked;
> --

Chuanhua and I ran this patchset for a couple of days and found a race
between reclamation and split_folio. this might cause applications get
wrong data 0 while swapping-in.

in case one thread(T1) is reclaiming a large folio by some means, still
another thread is calling madvise MADV_PGOUT(T2). and at the same time,
we have two threads T3 and T4 to swap-in in parallel. T1 doesn't split
and T2 does split as below,

static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
                                unsigned long addr, unsigned long end, 
                                struct mm_walk *walk)
{

                /*   
                 * Creating a THP page is expensive so split it only if we
                 * are sure it's worth. Split it if we are only owner.
                 */
                if (folio_test_large(folio)) {
                        int err; 

                        if (folio_estimated_sharers(folio) != 1)
                                break;
                        if (pageout_anon_only_filter && !folio_test_anon(folio))
                                break;
                        if (!folio_trylock(folio))
                                break;
                        folio_get(folio);
                        arch_leave_lazy_mmu_mode();
                        pte_unmap_unlock(start_pte, ptl);
                        start_pte = NULL;
                        err = split_folio(folio);
                        folio_unlock(folio);
                        folio_put(folio);
                        if (err)
                                break;
                        start_pte = pte =
                                pte_offset_map_lock(mm, pmd, addr, &ptl);
                        if (!start_pte)
                                break;
                        arch_enter_lazy_mmu_mode();
                        pte--;
                        addr -= PAGE_SIZE;
                        continue;
                }    

        return 0;
}



if T3 and T4 swap-in same page, and they both do swap_read_folio(). the
first one of T3 and T4 who gets PTL will set pte, and the 2nd one will
check pte_same() and find pte has been changed by another thread, thus
goto out_nomap in do_swap_page.
vm_fault_t do_swap_page(struct vm_fault *vmf)
{
        if (!folio) {
                if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
                    __swap_count(entry) == 1) {
                        /* skip swapcache */
                        folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
                                                vma, vmf->address, false);
                        page = &folio->page;
                        if (folio) {
                                __folio_set_locked(folio);
                                __folio_set_swapbacked(folio);
                         
                                /* To provide entry to swap_read_folio() */
                                folio->swap = entry;
                                swap_read_folio(folio, true, NULL);
                                folio->private = NULL;
                        }
                } else {
                }
        
        
        /*
         * Back out if somebody else already faulted in this pte.
         */
        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
                        &vmf->ptl);
        if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
                goto out_nomap;

        swap_free(entry);
        pte = mk_pte(page, vma->vm_page_prot);

        set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
        return ret;
}


while T1 and T2 is working in parallel, T2 will split folio. this can
run into race with T1's reclamation without splitting. T2 will split
a large folio into a couple of normal pages and reclaim them.

If T3 finishes swap_read_folio and gets PTL earlier than T4, it calls
set_pte and swap_free. this will cause zRAM to free the slot. then
t4 will get zero data in swap_read_folio() as the below zRAM code
will fill zero for freed slots, 

static int zram_read_from_zspool(struct zram *zram, struct page *page,
                                 u32 index)
{
        ...

        handle = zram_get_handle(zram, index);
        if (!handle || zram_test_flag(zram, index, ZRAM_SAME)) {
                unsigned long value;
                void *mem;

                value = handle ? zram_get_element(zram, index) : 0; 
                mem = kmap_local_page(page);
                zram_fill_page(mem, PAGE_SIZE, value);
                kunmap_local(mem);
                return 0;
        }
}

usually, after t3 frees swap and does set_pte, t4's pte_same becomes
false, it won't set pte again. So filling zero data into freed slot
by zRAM driver is not a problem at all. but the race is that T1 and
T2 might do set swap to ptes twice as t1 doesn't split but t2 splits
(splitted normal folios are also added into reclaim_list), thus, the
corrupted zero data will get a chance to be set into PTE by t4 as t4
reads the new PTE which is set secondly and has the same swap entry
as its orig_pte after T3 has swapped-in and free the swap entry.

we have worked around this problem by preventing T4 from splitting
large folios and letting it goto skip the large folios entirely in
MADV PAGEOUT once we detect a concurrent reclamation for this large
folio.

so my understanding is changing vmscan isn't sufficient to support
large folio swap-out without splitting. we have to adjust madvise
as well. we will have a fix for this problem in
[PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT
https://lore.kernel.org/linux-mm/20240118111036.72641-7-21cnbao@gmail.com/

But i feel this patch should be a part of your swap-out patchset rather
than the swap-in series of Chuanhua and me :-)

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2024-02-05  9:51   ` Barry Song
@ 2024-02-05 12:14     ` Ryan Roberts
  2024-02-18 23:40       ` Barry Song
  2024-02-27 12:28     ` Ryan Roberts
  2024-02-27 13:37     ` Ryan Roberts
  2 siblings, 1 reply; 116+ messages in thread
From: Ryan Roberts @ 2024-02-05 12:14 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, david, linux-kernel, linux-mm, mhocko, shy828301,
	wangkefeng.wang, willy, xiang, ying.huang, yuzhao, chrisl,
	surenb, hanchuanhua

On 05/02/2024 09:51, Barry Song wrote:
> +Chris, Suren and Chuanhua
> 
> Hi Ryan,
> 
>> +	/*
>> +	 * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
>> +	 * so indicate that we are scanning to synchronise with swapoff.
>> +	 */
>> +	si->flags += SWP_SCANNING;
>> +	ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
>> +	si->flags -= SWP_SCANNING;
> 
> nobody is using this scan_base afterwards. it seems a bit weird to
> pass a pointer.
> 
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>  					if (!can_split_folio(folio, NULL))
>>  						goto activate_locked;
>>  					/*
>> -					 * Split folios without a PMD map right
>> -					 * away. Chances are some or all of the
>> -					 * tail pages can be freed without IO.
>> +					 * Split PMD-mappable folios without a
>> +					 * PMD map right away. Chances are some
>> +					 * or all of the tail pages can be freed
>> +					 * without IO.
>>  					 */
>> -					if (!folio_entire_mapcount(folio) &&
>> +					if (folio_test_pmd_mappable(folio) &&
>> +					    !folio_entire_mapcount(folio) &&
>>  					    split_folio_to_list(folio,
>>  								folio_list))
>>  						goto activate_locked;
>> --
> 
> Chuanhua and I ran this patchset for a couple of days and found a race
> between reclamation and split_folio. this might cause applications get
> wrong data 0 while swapping-in.
> 
> in case one thread(T1) is reclaiming a large folio by some means, still
> another thread is calling madvise MADV_PGOUT(T2). and at the same time,
> we have two threads T3 and T4 to swap-in in parallel. T1 doesn't split
> and T2 does split as below,
> 
> static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>                                 unsigned long addr, unsigned long end, 
>                                 struct mm_walk *walk)
> {
> 
>                 /*   
>                  * Creating a THP page is expensive so split it only if we
>                  * are sure it's worth. Split it if we are only owner.
>                  */
>                 if (folio_test_large(folio)) {
>                         int err; 
> 
>                         if (folio_estimated_sharers(folio) != 1)
>                                 break;
>                         if (pageout_anon_only_filter && !folio_test_anon(folio))
>                                 break;
>                         if (!folio_trylock(folio))
>                                 break;
>                         folio_get(folio);
>                         arch_leave_lazy_mmu_mode();
>                         pte_unmap_unlock(start_pte, ptl);
>                         start_pte = NULL;
>                         err = split_folio(folio);
>                         folio_unlock(folio);
>                         folio_put(folio);
>                         if (err)
>                                 break;
>                         start_pte = pte =
>                                 pte_offset_map_lock(mm, pmd, addr, &ptl);
>                         if (!start_pte)
>                                 break;
>                         arch_enter_lazy_mmu_mode();
>                         pte--;
>                         addr -= PAGE_SIZE;
>                         continue;
>                 }    
> 
>         return 0;
> }
> 
> 
> 
> if T3 and T4 swap-in same page, and they both do swap_read_folio(). the
> first one of T3 and T4 who gets PTL will set pte, and the 2nd one will
> check pte_same() and find pte has been changed by another thread, thus
> goto out_nomap in do_swap_page.
> vm_fault_t do_swap_page(struct vm_fault *vmf)
> {
>         if (!folio) {
>                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>                     __swap_count(entry) == 1) {
>                         /* skip swapcache */
>                         folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
>                                                 vma, vmf->address, false);
>                         page = &folio->page;
>                         if (folio) {
>                                 __folio_set_locked(folio);
>                                 __folio_set_swapbacked(folio);
>                          
>                                 /* To provide entry to swap_read_folio() */
>                                 folio->swap = entry;
>                                 swap_read_folio(folio, true, NULL);
>                                 folio->private = NULL;
>                         }
>                 } else {
>                 }
>         
>         
>         /*
>          * Back out if somebody else already faulted in this pte.
>          */
>         vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>                         &vmf->ptl);
>         if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>                 goto out_nomap;
> 
>         swap_free(entry);
>         pte = mk_pte(page, vma->vm_page_prot);
> 
>         set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>         return ret;
> }
> 
> 
> while T1 and T2 is working in parallel, T2 will split folio. this can
> run into race with T1's reclamation without splitting. T2 will split
> a large folio into a couple of normal pages and reclaim them.
> 
> If T3 finishes swap_read_folio and gets PTL earlier than T4, it calls
> set_pte and swap_free. this will cause zRAM to free the slot. then
> t4 will get zero data in swap_read_folio() as the below zRAM code
> will fill zero for freed slots, 
> 
> static int zram_read_from_zspool(struct zram *zram, struct page *page,
>                                  u32 index)
> {
>         ...
> 
>         handle = zram_get_handle(zram, index);
>         if (!handle || zram_test_flag(zram, index, ZRAM_SAME)) {
>                 unsigned long value;
>                 void *mem;
> 
>                 value = handle ? zram_get_element(zram, index) : 0; 
>                 mem = kmap_local_page(page);
>                 zram_fill_page(mem, PAGE_SIZE, value);
>                 kunmap_local(mem);
>                 return 0;
>         }
> }
> 
> usually, after t3 frees swap and does set_pte, t4's pte_same becomes
> false, it won't set pte again. So filling zero data into freed slot
> by zRAM driver is not a problem at all. but the race is that T1 and
> T2 might do set swap to ptes twice as t1 doesn't split but t2 splits
> (splitted normal folios are also added into reclaim_list), thus, the
> corrupted zero data will get a chance to be set into PTE by t4 as t4
> reads the new PTE which is set secondly and has the same swap entry
> as its orig_pte after T3 has swapped-in and free the swap entry.
> 
> we have worked around this problem by preventing T4 from splitting
> large folios and letting it goto skip the large folios entirely in
> MADV PAGEOUT once we detect a concurrent reclamation for this large
> folio.
> 
> so my understanding is changing vmscan isn't sufficient to support
> large folio swap-out without splitting. we have to adjust madvise
> as well. we will have a fix for this problem in
> [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT
> https://lore.kernel.org/linux-mm/20240118111036.72641-7-21cnbao@gmail.com/
> 
> But i feel this patch should be a part of your swap-out patchset rather
> than the swap-in series of Chuanhua and me :-)

Hi Barry, Chuanhua,

Thanks for the very detailed bug report! I'm going to have to take some time to
get my head around the details. But yes, I agree the fix needs to be part of the
swap-out series.

Sorry I haven't progressed this series as I had hoped. I've been concentrating
on getting the contpte series upstream. I'm hoping I will find some time to move
this series along by the tail end of Feb (hoping to get it in shape for v6.10).
Hopefully that doesn't cause you any big problems?

Thanks,
Ryan

> 
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2024-02-05 12:14     ` Ryan Roberts
@ 2024-02-18 23:40       ` Barry Song
  2024-02-20 20:03         ` Ryan Roberts
  2024-03-05  9:00         ` Ryan Roberts
  0 siblings, 2 replies; 116+ messages in thread
From: Barry Song @ 2024-02-18 23:40 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: akpm, david, linux-kernel, linux-mm, mhocko, shy828301,
	wangkefeng.wang, willy, xiang, ying.huang, yuzhao, chrisl,
	surenb, hanchuanhua

On Tue, Feb 6, 2024 at 1:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 05/02/2024 09:51, Barry Song wrote:
> > +Chris, Suren and Chuanhua
> >
> > Hi Ryan,
> >
> >> +    /*
> >> +     * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
> >> +     * so indicate that we are scanning to synchronise with swapoff.
> >> +     */
> >> +    si->flags += SWP_SCANNING;
> >> +    ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
> >> +    si->flags -= SWP_SCANNING;
> >
> > nobody is using this scan_base afterwards. it seems a bit weird to
> > pass a pointer.
> >
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>                                      if (!can_split_folio(folio, NULL))
> >>                                              goto activate_locked;
> >>                                      /*
> >> -                                     * Split folios without a PMD map right
> >> -                                     * away. Chances are some or all of the
> >> -                                     * tail pages can be freed without IO.
> >> +                                     * Split PMD-mappable folios without a
> >> +                                     * PMD map right away. Chances are some
> >> +                                     * or all of the tail pages can be freed
> >> +                                     * without IO.
> >>                                       */
> >> -                                    if (!folio_entire_mapcount(folio) &&
> >> +                                    if (folio_test_pmd_mappable(folio) &&
> >> +                                        !folio_entire_mapcount(folio) &&
> >>                                          split_folio_to_list(folio,
> >>                                                              folio_list))
> >>                                              goto activate_locked;
> >> --
> >
> > Chuanhua and I ran this patchset for a couple of days and found a race
> > between reclamation and split_folio. this might cause applications get
> > wrong data 0 while swapping-in.
> >
> > in case one thread(T1) is reclaiming a large folio by some means, still
> > another thread is calling madvise MADV_PGOUT(T2). and at the same time,
> > we have two threads T3 and T4 to swap-in in parallel. T1 doesn't split
> > and T2 does split as below,
> >
> > static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >                                 unsigned long addr, unsigned long end,
> >                                 struct mm_walk *walk)
> > {
> >
> >                 /*
> >                  * Creating a THP page is expensive so split it only if we
> >                  * are sure it's worth. Split it if we are only owner.
> >                  */
> >                 if (folio_test_large(folio)) {
> >                         int err;
> >
> >                         if (folio_estimated_sharers(folio) != 1)
> >                                 break;
> >                         if (pageout_anon_only_filter && !folio_test_anon(folio))
> >                                 break;
> >                         if (!folio_trylock(folio))
> >                                 break;
> >                         folio_get(folio);
> >                         arch_leave_lazy_mmu_mode();
> >                         pte_unmap_unlock(start_pte, ptl);
> >                         start_pte = NULL;
> >                         err = split_folio(folio);
> >                         folio_unlock(folio);
> >                         folio_put(folio);
> >                         if (err)
> >                                 break;
> >                         start_pte = pte =
> >                                 pte_offset_map_lock(mm, pmd, addr, &ptl);
> >                         if (!start_pte)
> >                                 break;
> >                         arch_enter_lazy_mmu_mode();
> >                         pte--;
> >                         addr -= PAGE_SIZE;
> >                         continue;
> >                 }
> >
> >         return 0;
> > }
> >
> >
> >
> > if T3 and T4 swap-in same page, and they both do swap_read_folio(). the
> > first one of T3 and T4 who gets PTL will set pte, and the 2nd one will
> > check pte_same() and find pte has been changed by another thread, thus
> > goto out_nomap in do_swap_page.
> > vm_fault_t do_swap_page(struct vm_fault *vmf)
> > {
> >         if (!folio) {
> >                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >                     __swap_count(entry) == 1) {
> >                         /* skip swapcache */
> >                         folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> >                                                 vma, vmf->address, false);
> >                         page = &folio->page;
> >                         if (folio) {
> >                                 __folio_set_locked(folio);
> >                                 __folio_set_swapbacked(folio);
> >
> >                                 /* To provide entry to swap_read_folio() */
> >                                 folio->swap = entry;
> >                                 swap_read_folio(folio, true, NULL);
> >                                 folio->private = NULL;
> >                         }
> >                 } else {
> >                 }
> >
> >
> >         /*
> >          * Back out if somebody else already faulted in this pte.
> >          */
> >         vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >                         &vmf->ptl);
> >         if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >                 goto out_nomap;
> >
> >         swap_free(entry);
> >         pte = mk_pte(page, vma->vm_page_prot);
> >
> >         set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> >         return ret;
> > }
> >
> >
> > while T1 and T2 is working in parallel, T2 will split folio. this can
> > run into race with T1's reclamation without splitting. T2 will split
> > a large folio into a couple of normal pages and reclaim them.
> >
> > If T3 finishes swap_read_folio and gets PTL earlier than T4, it calls
> > set_pte and swap_free. this will cause zRAM to free the slot. then
> > t4 will get zero data in swap_read_folio() as the below zRAM code
> > will fill zero for freed slots,
> >
> > static int zram_read_from_zspool(struct zram *zram, struct page *page,
> >                                  u32 index)
> > {
> >         ...
> >
> >         handle = zram_get_handle(zram, index);
> >         if (!handle || zram_test_flag(zram, index, ZRAM_SAME)) {
> >                 unsigned long value;
> >                 void *mem;
> >
> >                 value = handle ? zram_get_element(zram, index) : 0;
> >                 mem = kmap_local_page(page);
> >                 zram_fill_page(mem, PAGE_SIZE, value);
> >                 kunmap_local(mem);
> >                 return 0;
> >         }
> > }
> >
> > usually, after t3 frees swap and does set_pte, t4's pte_same becomes
> > false, it won't set pte again. So filling zero data into freed slot
> > by zRAM driver is not a problem at all. but the race is that T1 and
> > T2 might do set swap to ptes twice as t1 doesn't split but t2 splits
> > (splitted normal folios are also added into reclaim_list), thus, the
> > corrupted zero data will get a chance to be set into PTE by t4 as t4
> > reads the new PTE which is set secondly and has the same swap entry
> > as its orig_pte after T3 has swapped-in and free the swap entry.
> >
> > we have worked around this problem by preventing T4 from splitting
> > large folios and letting it goto skip the large folios entirely in
> > MADV PAGEOUT once we detect a concurrent reclamation for this large
> > folio.
> >
> > so my understanding is changing vmscan isn't sufficient to support
> > large folio swap-out without splitting. we have to adjust madvise
> > as well. we will have a fix for this problem in
> > [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT
> > https://lore.kernel.org/linux-mm/20240118111036.72641-7-21cnbao@gmail.com/
> >
> > But i feel this patch should be a part of your swap-out patchset rather
> > than the swap-in series of Chuanhua and me :-)
>
> Hi Barry, Chuanhua,
>
> Thanks for the very detailed bug report! I'm going to have to take some time to
> get my head around the details. But yes, I agree the fix needs to be part of the
> swap-out series.
>

Hi Ryan,
I am running into some races especially while enabling large folio swap-out and
swap-in both. some of them, i am still struggling with the detailed
timing how they
are happening.
but the below change can help remove those bugs which cause corrupted data.

index da2aab219c40..ef9cfbc84760 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1953,6 +1953,16 @@ static unsigned int shrink_folio_list(struct
list_head *folio_list,

                        if (folio_test_pmd_mappable(folio))
                                flags |= TTU_SPLIT_HUGE_PMD;
+                       /*
+                        * make try_to_unmap_one hold ptl from the very first
+                        * beginning if we are reclaiming a folio with multi-
+                        * ptes. otherwise, we may only reclaim a part of the
+                        * folio from the middle.
+                        * for example, a parallel thread might temporarily
+                        * set pte to none for various purposes.
+                        */
+                       else if (folio_test_large(folio))
+                               flags |= TTU_SYNC;

                        try_to_unmap(folio, flags);
                        if (folio_mapped(folio)) {


While we are swapping-out a large folio, it has many ptes, we change those ptes
to swap entries in try_to_unmap_one(). "while (page_vma_mapped_walk(&pvmw))"
will iterate all ptes within the large folio. but it will only begin
to acquire ptl when
it meets a valid pte as below /* xxxxxxx */

static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
{
        pte_t ptent;

        if (pvmw->flags & PVMW_SYNC) {
                /* Use the stricter lookup */
                pvmw->pte = pte_offset_map_lock(pvmw->vma->vm_mm, pvmw->pmd,
                                                pvmw->address, &pvmw->ptl);
                *ptlp = pvmw->ptl;
                return !!pvmw->pte;
        }

       ...
        pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
                                          pvmw->address, ptlp);
        if (!pvmw->pte)
                return false;

        ptent = ptep_get(pvmw->pte);

        if (pvmw->flags & PVMW_MIGRATION) {
                if (!is_swap_pte(ptent))
                        return false;
        } else if (is_swap_pte(ptent)) {
                swp_entry_t entry;
                ...
                entry = pte_to_swp_entry(ptent);
                if (!is_device_private_entry(entry) &&
                    !is_device_exclusive_entry(entry))
                        return false;
        } else if (!pte_present(ptent)) {
                return false;
        }
        pvmw->ptl = *ptlp;
        spin_lock(pvmw->ptl);   /* xxxxxxx */
        return true;
}


for various reasons,  for example, break-before-make for clearing access flags
etc. pte can be set to none. since page_vma_mapped_walk() doesn't hold ptl
from the beginning,  it might only begin to set swap entries from the middle of
a large folio.

For example, in case a large folio has 16 ptes, and 0,1,2 are somehow zero
in the intermediate stage of a break-before-make, ptl will be held
from the 3rd pte,
and swap entries will be set from 3rd pte as well. it seems not good as we are
trying to swap out a large folio, but we are swapping out a part of them.

I am still struggling with all the timing of races, but using PVMW_SYNC to
explicitly ask for ptl from the first pte seems a good thing for large folio
regardless of those races. it can avoid try_to_unmap_one reading intermediate
pte and further make the wrong decision since reclaiming pte-mapped large
folios is atomic with just one pte.

> Sorry I haven't progressed this series as I had hoped. I've been concentrating
> on getting the contpte series upstream. I'm hoping I will find some time to move
> this series along by the tail end of Feb (hoping to get it in shape for v6.10).
> Hopefully that doesn't cause you any big problems?

no worries. Anyway, we are already using your code to run various tests.

>
> Thanks,
> Ryan

Thanks
Barry

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2024-02-18 23:40       ` Barry Song
@ 2024-02-20 20:03         ` Ryan Roberts
  2024-03-05  9:00         ` Ryan Roberts
  1 sibling, 0 replies; 116+ messages in thread
From: Ryan Roberts @ 2024-02-20 20:03 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, david, linux-kernel, linux-mm, mhocko, shy828301,
	wangkefeng.wang, willy, xiang, ying.huang, yuzhao, chrisl,
	surenb, hanchuanhua

On 18/02/2024 23:40, Barry Song wrote:
> On Tue, Feb 6, 2024 at 1:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 05/02/2024 09:51, Barry Song wrote:
>>> +Chris, Suren and Chuanhua
>>>
>>> Hi Ryan,
>>>
>>>> +    /*
>>>> +     * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
>>>> +     * so indicate that we are scanning to synchronise with swapoff.
>>>> +     */
>>>> +    si->flags += SWP_SCANNING;
>>>> +    ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
>>>> +    si->flags -= SWP_SCANNING;
>>>
>>> nobody is using this scan_base afterwards. it seems a bit weird to
>>> pass a pointer.
>>>
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>>                                      if (!can_split_folio(folio, NULL))
>>>>                                              goto activate_locked;
>>>>                                      /*
>>>> -                                     * Split folios without a PMD map right
>>>> -                                     * away. Chances are some or all of the
>>>> -                                     * tail pages can be freed without IO.
>>>> +                                     * Split PMD-mappable folios without a
>>>> +                                     * PMD map right away. Chances are some
>>>> +                                     * or all of the tail pages can be freed
>>>> +                                     * without IO.
>>>>                                       */
>>>> -                                    if (!folio_entire_mapcount(folio) &&
>>>> +                                    if (folio_test_pmd_mappable(folio) &&
>>>> +                                        !folio_entire_mapcount(folio) &&
>>>>                                          split_folio_to_list(folio,
>>>>                                                              folio_list))
>>>>                                              goto activate_locked;
>>>> --
>>>
>>> Chuanhua and I ran this patchset for a couple of days and found a race
>>> between reclamation and split_folio. this might cause applications get
>>> wrong data 0 while swapping-in.
>>>
>>> in case one thread(T1) is reclaiming a large folio by some means, still
>>> another thread is calling madvise MADV_PGOUT(T2). and at the same time,
>>> we have two threads T3 and T4 to swap-in in parallel. T1 doesn't split
>>> and T2 does split as below,
>>>
>>> static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>                                 unsigned long addr, unsigned long end,
>>>                                 struct mm_walk *walk)
>>> {
>>>
>>>                 /*
>>>                  * Creating a THP page is expensive so split it only if we
>>>                  * are sure it's worth. Split it if we are only owner.
>>>                  */
>>>                 if (folio_test_large(folio)) {
>>>                         int err;
>>>
>>>                         if (folio_estimated_sharers(folio) != 1)
>>>                                 break;
>>>                         if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>                                 break;
>>>                         if (!folio_trylock(folio))
>>>                                 break;
>>>                         folio_get(folio);
>>>                         arch_leave_lazy_mmu_mode();
>>>                         pte_unmap_unlock(start_pte, ptl);
>>>                         start_pte = NULL;
>>>                         err = split_folio(folio);
>>>                         folio_unlock(folio);
>>>                         folio_put(folio);
>>>                         if (err)
>>>                                 break;
>>>                         start_pte = pte =
>>>                                 pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>                         if (!start_pte)
>>>                                 break;
>>>                         arch_enter_lazy_mmu_mode();
>>>                         pte--;
>>>                         addr -= PAGE_SIZE;
>>>                         continue;
>>>                 }
>>>
>>>         return 0;
>>> }
>>>
>>>
>>>
>>> if T3 and T4 swap-in same page, and they both do swap_read_folio(). the
>>> first one of T3 and T4 who gets PTL will set pte, and the 2nd one will
>>> check pte_same() and find pte has been changed by another thread, thus
>>> goto out_nomap in do_swap_page.
>>> vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> {
>>>         if (!folio) {
>>>                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>>>                     __swap_count(entry) == 1) {
>>>                         /* skip swapcache */
>>>                         folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
>>>                                                 vma, vmf->address, false);
>>>                         page = &folio->page;
>>>                         if (folio) {
>>>                                 __folio_set_locked(folio);
>>>                                 __folio_set_swapbacked(folio);
>>>
>>>                                 /* To provide entry to swap_read_folio() */
>>>                                 folio->swap = entry;
>>>                                 swap_read_folio(folio, true, NULL);
>>>                                 folio->private = NULL;
>>>                         }
>>>                 } else {
>>>                 }
>>>
>>>
>>>         /*
>>>          * Back out if somebody else already faulted in this pte.
>>>          */
>>>         vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>>>                         &vmf->ptl);
>>>         if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>>>                 goto out_nomap;
>>>
>>>         swap_free(entry);
>>>         pte = mk_pte(page, vma->vm_page_prot);
>>>
>>>         set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>>>         return ret;
>>> }
>>>
>>>
>>> while T1 and T2 is working in parallel, T2 will split folio. this can
>>> run into race with T1's reclamation without splitting. T2 will split
>>> a large folio into a couple of normal pages and reclaim them.
>>>
>>> If T3 finishes swap_read_folio and gets PTL earlier than T4, it calls
>>> set_pte and swap_free. this will cause zRAM to free the slot. then
>>> t4 will get zero data in swap_read_folio() as the below zRAM code
>>> will fill zero for freed slots,
>>>
>>> static int zram_read_from_zspool(struct zram *zram, struct page *page,
>>>                                  u32 index)
>>> {
>>>         ...
>>>
>>>         handle = zram_get_handle(zram, index);
>>>         if (!handle || zram_test_flag(zram, index, ZRAM_SAME)) {
>>>                 unsigned long value;
>>>                 void *mem;
>>>
>>>                 value = handle ? zram_get_element(zram, index) : 0;
>>>                 mem = kmap_local_page(page);
>>>                 zram_fill_page(mem, PAGE_SIZE, value);
>>>                 kunmap_local(mem);
>>>                 return 0;
>>>         }
>>> }
>>>
>>> usually, after t3 frees swap and does set_pte, t4's pte_same becomes
>>> false, it won't set pte again. So filling zero data into freed slot
>>> by zRAM driver is not a problem at all. but the race is that T1 and
>>> T2 might do set swap to ptes twice as t1 doesn't split but t2 splits
>>> (splitted normal folios are also added into reclaim_list), thus, the
>>> corrupted zero data will get a chance to be set into PTE by t4 as t4
>>> reads the new PTE which is set secondly and has the same swap entry
>>> as its orig_pte after T3 has swapped-in and free the swap entry.
>>>
>>> we have worked around this problem by preventing T4 from splitting
>>> large folios and letting it goto skip the large folios entirely in
>>> MADV PAGEOUT once we detect a concurrent reclamation for this large
>>> folio.
>>>
>>> so my understanding is changing vmscan isn't sufficient to support
>>> large folio swap-out without splitting. we have to adjust madvise
>>> as well. we will have a fix for this problem in
>>> [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT
>>> https://lore.kernel.org/linux-mm/20240118111036.72641-7-21cnbao@gmail.com/
>>>
>>> But i feel this patch should be a part of your swap-out patchset rather
>>> than the swap-in series of Chuanhua and me :-)
>>
>> Hi Barry, Chuanhua,
>>
>> Thanks for the very detailed bug report! I'm going to have to take some time to
>> get my head around the details. But yes, I agree the fix needs to be part of the
>> swap-out series.
>>
> 
> Hi Ryan,
> I am running into some races especially while enabling large folio swap-out and
> swap-in both. some of them, i am still struggling with the detailed
> timing how they
> are happening.
> but the below change can help remove those bugs which cause corrupted data.

Thanks for the report! I'm out of office this week, but this is top of my todo
list starting next week, so hopefully will knock these into shape and repost
very soon.

Thanks,
Ryan

> 
> index da2aab219c40..ef9cfbc84760 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1953,6 +1953,16 @@ static unsigned int shrink_folio_list(struct
> list_head *folio_list,
> 
>                         if (folio_test_pmd_mappable(folio))
>                                 flags |= TTU_SPLIT_HUGE_PMD;
> +                       /*
> +                        * make try_to_unmap_one hold ptl from the very first
> +                        * beginning if we are reclaiming a folio with multi-
> +                        * ptes. otherwise, we may only reclaim a part of the
> +                        * folio from the middle.
> +                        * for example, a parallel thread might temporarily
> +                        * set pte to none for various purposes.
> +                        */
> +                       else if (folio_test_large(folio))
> +                               flags |= TTU_SYNC;
> 
>                         try_to_unmap(folio, flags);
>                         if (folio_mapped(folio)) {
> 
> 
> While we are swapping-out a large folio, it has many ptes, we change those ptes
> to swap entries in try_to_unmap_one(). "while (page_vma_mapped_walk(&pvmw))"
> will iterate all ptes within the large folio. but it will only begin
> to acquire ptl when
> it meets a valid pte as below /* xxxxxxx */
> 
> static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
> {
>         pte_t ptent;
> 
>         if (pvmw->flags & PVMW_SYNC) {
>                 /* Use the stricter lookup */
>                 pvmw->pte = pte_offset_map_lock(pvmw->vma->vm_mm, pvmw->pmd,
>                                                 pvmw->address, &pvmw->ptl);
>                 *ptlp = pvmw->ptl;
>                 return !!pvmw->pte;
>         }
> 
>        ...
>         pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
>                                           pvmw->address, ptlp);
>         if (!pvmw->pte)
>                 return false;
> 
>         ptent = ptep_get(pvmw->pte);
> 
>         if (pvmw->flags & PVMW_MIGRATION) {
>                 if (!is_swap_pte(ptent))
>                         return false;
>         } else if (is_swap_pte(ptent)) {
>                 swp_entry_t entry;
>                 ...
>                 entry = pte_to_swp_entry(ptent);
>                 if (!is_device_private_entry(entry) &&
>                     !is_device_exclusive_entry(entry))
>                         return false;
>         } else if (!pte_present(ptent)) {
>                 return false;
>         }
>         pvmw->ptl = *ptlp;
>         spin_lock(pvmw->ptl);   /* xxxxxxx */
>         return true;
> }
> 
> 
> for various reasons,  for example, break-before-make for clearing access flags
> etc. pte can be set to none. since page_vma_mapped_walk() doesn't hold ptl
> from the beginning,  it might only begin to set swap entries from the middle of
> a large folio.
> 
> For example, in case a large folio has 16 ptes, and 0,1,2 are somehow zero
> in the intermediate stage of a break-before-make, ptl will be held
> from the 3rd pte,
> and swap entries will be set from 3rd pte as well. it seems not good as we are
> trying to swap out a large folio, but we are swapping out a part of them.
> 
> I am still struggling with all the timing of races, but using PVMW_SYNC to
> explicitly ask for ptl from the first pte seems a good thing for large folio
> regardless of those races. it can avoid try_to_unmap_one reading intermediate
> pte and further make the wrong decision since reclaiming pte-mapped large
> folios is atomic with just one pte.
> 
>> Sorry I haven't progressed this series as I had hoped. I've been concentrating
>> on getting the contpte series upstream. I'm hoping I will find some time to move
>> this series along by the tail end of Feb (hoping to get it in shape for v6.10).
>> Hopefully that doesn't cause you any big problems?
> 
> no worries. Anyway, we are already using your code to run various tests.
> 
>>
>> Thanks,
>> Ryan
> 
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2023-10-25 14:45 ` [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting Ryan Roberts
                     ` (2 preceding siblings ...)
  2024-02-05  9:51   ` Barry Song
@ 2024-02-22  7:05   ` Barry Song
  2024-02-22 10:09     ` David Hildenbrand
  3 siblings, 1 reply; 116+ messages in thread
From: Barry Song @ 2024-02-22  7:05 UTC (permalink / raw)
  To: ryan.roberts
  Cc: akpm, david, linux-kernel, linux-mm, mhocko, shy828301,
	wangkefeng.wang, willy, xiang, ying.huang, yuzhao, chrisl,
	surenb, hanchuanhua

Hi Ryan,

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2cc0cb41fb32..ea19710aa4cd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  					if (!can_split_folio(folio, NULL))
>  						goto activate_locked;
>  					/*
> -					 * Split folios without a PMD map right
> -					 * away. Chances are some or all of the
> -					 * tail pages can be freed without IO.
> +					 * Split PMD-mappable folios without a
> +					 * PMD map right away. Chances are some
> +					 * or all of the tail pages can be freed
> +					 * without IO.
>  					 */
> -					if (!folio_entire_mapcount(folio) &&
> +					if (folio_test_pmd_mappable(folio) &&
> +					    !folio_entire_mapcount(folio) &&
>  					    split_folio_to_list(folio,
>  								folio_list))
>  						goto activate_locked;

I ran a test to investigate what would happen while reclaiming a partially
unmapped large folio. for example, for 64KiB large folios, MADV_DONTNEED
4KB~64KB, and keep the first subpage 0~4KiB.
 
My test wants to address my three concerns,
a. whether we will have leak on swap slots
b. whether we will have redundant I/O
c. whether we will cause races on swapcache

what i have done is printing folio->_nr_pages_mapped and dumping 16 swap_map[]
at some specific stage
1. just after add_to_swap   (swap slots are allocated)
2. before and after try_to_unmap   (ptes are set to swap_entry)
3. before and after pageout (also add printk in zram driver to dump all I/O write)
4. before and after remove_mapping

The below is the dumped info for a particular large folio,

1. after add_to_swap
[   27.267357] vmscan: After add_to_swap shrink_folio_list 1947 mapnr:1
[   27.267650] vmscan: offset:101b0 swp_map 40-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40

as you can see,
_nr_pages_mapped is 1 and all 16 swap_map are SWAP_HAS_CACHE (0x40)


2. before and after try_to_unmap
[   27.268067] vmscan: before try to unmap shrink_folio_list 1991 mapnr:1
[   27.268372] try_to_unmap_one address:ffff731f0000 pte:e8000103cd0b43 pte_p:ffff0000c36a8f80
[   27.268854] vmscan: after try to unmap shrink_folio_list 1997 mapnr:0
[   27.269180] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40

as you can see, one pte is set to swp_entry, and _nr_pages_mapped becomes
0 from 1. The 1st swp_map becomes 0x41, SWAP_HAS_CACHE + 1

3. before and after pageout
[   27.269602] vmscan: before pageout shrink_folio_list 2065 mapnr:0
[   27.269880] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
[   27.270691] zram: zram_write_page page:fffffc00030f3400 index:101b0
[   27.271061] zram: zram_write_page page:fffffc00030f3440 index:101b1
[   27.271416] zram: zram_write_page page:fffffc00030f3480 index:101b2
[   27.271751] zram: zram_write_page page:fffffc00030f34c0 index:101b3
[   27.272046] zram: zram_write_page page:fffffc00030f3500 index:101b4
[   27.272384] zram: zram_write_page page:fffffc00030f3540 index:101b5
[   27.272746] zram: zram_write_page page:fffffc00030f3580 index:101b6
[   27.273042] zram: zram_write_page page:fffffc00030f35c0 index:101b7
[   27.273339] zram: zram_write_page page:fffffc00030f3600 index:101b8
[   27.273676] zram: zram_write_page page:fffffc00030f3640 index:101b9
[   27.274044] zram: zram_write_page page:fffffc00030f3680 index:101ba
[   27.274554] zram: zram_write_page page:fffffc00030f36c0 index:101bb
[   27.274870] zram: zram_write_page page:fffffc00030f3700 index:101bc
[   27.275166] zram: zram_write_page page:fffffc00030f3740 index:101bd
[   27.275463] zram: zram_write_page page:fffffc00030f3780 index:101be
[   27.275760] zram: zram_write_page page:fffffc00030f37c0 index:101bf
[   27.276102] vmscan: after pageout and before needs_release shrink_folio_list 2124 mapnr:0

as you can see, obviously, we have done redundant I/O - 16 zram_write_page though
4~64KiB has been zap_pte_range before, we still write them to zRAM.

4. before and after remove_mapping
[   27.276428] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
[   27.277485] vmscan: after remove_mapping shrink_folio_list 2169 mapnr:0 offset:0
[   27.277802] vmscan: offset:101b0 01-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00

as you can see, swp_map 1-15 becomes 0 and only the first swp_map is 1.
all SWAP_HAS_CACHE has been removed. This is perfect and there is no swap
slot leak at all!

Thus, only two concerns are left for me,
1. as we don't split anyway, we have done 15 unnecessary I/O if a large folio
is partially unmapped.
2. large folio is added as a whole as a swapcache covering the range whose
part has been zapped. I am not quite sure if this will cause some problems
while some concurrent do_anon_page, swapin and swapout occurs between 3 and
4 on zapped subpage1~subpage15. still struggling.. my brain is exploding... 

To me, it seems safer to split or do some other similar optimization if we find a
large folio has partial map and unmap.

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2024-02-22  7:05   ` Barry Song
@ 2024-02-22 10:09     ` David Hildenbrand
  2024-02-23  9:46       ` Barry Song
  0 siblings, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2024-02-22 10:09 UTC (permalink / raw)
  To: Barry Song, ryan.roberts
  Cc: akpm, linux-kernel, linux-mm, mhocko, shy828301, wangkefeng.wang,
	willy, xiang, ying.huang, yuzhao, chrisl, surenb, hanchuanhua

On 22.02.24 08:05, Barry Song wrote:
> Hi Ryan,
> 
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 2cc0cb41fb32..ea19710aa4cd 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>   					if (!can_split_folio(folio, NULL))
>>   						goto activate_locked;
>>   					/*
>> -					 * Split folios without a PMD map right
>> -					 * away. Chances are some or all of the
>> -					 * tail pages can be freed without IO.
>> +					 * Split PMD-mappable folios without a
>> +					 * PMD map right away. Chances are some
>> +					 * or all of the tail pages can be freed
>> +					 * without IO.
>>   					 */
>> -					if (!folio_entire_mapcount(folio) &&
>> +					if (folio_test_pmd_mappable(folio) &&
>> +					    !folio_entire_mapcount(folio) &&
>>   					    split_folio_to_list(folio,
>>   								folio_list))
>>   						goto activate_locked;
> 
> I ran a test to investigate what would happen while reclaiming a partially
> unmapped large folio. for example, for 64KiB large folios, MADV_DONTNEED
> 4KB~64KB, and keep the first subpage 0~4KiB.

IOW, something that already happens with ordinary THP already IIRC.

>   
> My test wants to address my three concerns,
> a. whether we will have leak on swap slots
> b. whether we will have redundant I/O
> c. whether we will cause races on swapcache
> 
> what i have done is printing folio->_nr_pages_mapped and dumping 16 swap_map[]
> at some specific stage
> 1. just after add_to_swap   (swap slots are allocated)
> 2. before and after try_to_unmap   (ptes are set to swap_entry)
> 3. before and after pageout (also add printk in zram driver to dump all I/O write)
> 4. before and after remove_mapping
> 
> The below is the dumped info for a particular large folio,
> 
> 1. after add_to_swap
> [   27.267357] vmscan: After add_to_swap shrink_folio_list 1947 mapnr:1
> [   27.267650] vmscan: offset:101b0 swp_map 40-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> 
> as you can see,
> _nr_pages_mapped is 1 and all 16 swap_map are SWAP_HAS_CACHE (0x40)
> 
> 
> 2. before and after try_to_unmap
> [   27.268067] vmscan: before try to unmap shrink_folio_list 1991 mapnr:1
> [   27.268372] try_to_unmap_one address:ffff731f0000 pte:e8000103cd0b43 pte_p:ffff0000c36a8f80
> [   27.268854] vmscan: after try to unmap shrink_folio_list 1997 mapnr:0
> [   27.269180] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> 
> as you can see, one pte is set to swp_entry, and _nr_pages_mapped becomes
> 0 from 1. The 1st swp_map becomes 0x41, SWAP_HAS_CACHE + 1
> 
> 3. before and after pageout
> [   27.269602] vmscan: before pageout shrink_folio_list 2065 mapnr:0
> [   27.269880] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> [   27.270691] zram: zram_write_page page:fffffc00030f3400 index:101b0
> [   27.271061] zram: zram_write_page page:fffffc00030f3440 index:101b1
> [   27.271416] zram: zram_write_page page:fffffc00030f3480 index:101b2
> [   27.271751] zram: zram_write_page page:fffffc00030f34c0 index:101b3
> [   27.272046] zram: zram_write_page page:fffffc00030f3500 index:101b4
> [   27.272384] zram: zram_write_page page:fffffc00030f3540 index:101b5
> [   27.272746] zram: zram_write_page page:fffffc00030f3580 index:101b6
> [   27.273042] zram: zram_write_page page:fffffc00030f35c0 index:101b7
> [   27.273339] zram: zram_write_page page:fffffc00030f3600 index:101b8
> [   27.273676] zram: zram_write_page page:fffffc00030f3640 index:101b9
> [   27.274044] zram: zram_write_page page:fffffc00030f3680 index:101ba
> [   27.274554] zram: zram_write_page page:fffffc00030f36c0 index:101bb
> [   27.274870] zram: zram_write_page page:fffffc00030f3700 index:101bc
> [   27.275166] zram: zram_write_page page:fffffc00030f3740 index:101bd
> [   27.275463] zram: zram_write_page page:fffffc00030f3780 index:101be
> [   27.275760] zram: zram_write_page page:fffffc00030f37c0 index:101bf
> [   27.276102] vmscan: after pageout and before needs_release shrink_folio_list 2124 mapnr:0
> 
> as you can see, obviously, we have done redundant I/O - 16 zram_write_page though
> 4~64KiB has been zap_pte_range before, we still write them to zRAM.
> 
> 4. before and after remove_mapping
> [   27.276428] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> [   27.277485] vmscan: after remove_mapping shrink_folio_list 2169 mapnr:0 offset:0
> [   27.277802] vmscan: offset:101b0 01-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
> 
> as you can see, swp_map 1-15 becomes 0 and only the first swp_map is 1.
> all SWAP_HAS_CACHE has been removed. This is perfect and there is no swap
> slot leak at all!
> 
> Thus, only two concerns are left for me,
> 1. as we don't split anyway, we have done 15 unnecessary I/O if a large folio
> is partially unmapped.
> 2. large folio is added as a whole as a swapcache covering the range whose
> part has been zapped. I am not quite sure if this will cause some problems
> while some concurrent do_anon_page, swapin and swapout occurs between 3 and
> 4 on zapped subpage1~subpage15. still struggling.. my brain is exploding...

Just noting: I was running into something different in the past with 
THP. And it's effectively the same scenario, just swapout and 
MADV_DONTNEED reversed.

Imagine you swapped out a THP and the THP it still is in the swapcache.

Then you unmap/zap some PTEs, freeing up the swap slots.

In zap_pte_range(), we'll call free_swap_and_cache(). There, we run into 
the "!swap_page_trans_huge_swapped(p, entry)", and we won't be calling 
__try_to_reclaim_swap().

So we won't split the large folio that is in the swapcache and it will 
continue consuming "more memory" than intended until fully evicted.

> 
> To me, it seems safer to split or do some other similar optimization if we find a
> large folio has partial map and unmap.

I'm hoping that we can avoid any new direct users of _nr_pages_mapped if 
possible.

If we find that the folio is on the deferred split list, we might as 
well just split it right away, before swapping it out. That might be a 
reasonable optimization for the case you describe.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2023-10-25 14:45 ` [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
@ 2024-02-22 10:19   ` David Hildenbrand
  2024-02-22 10:20     ` David Hildenbrand
  0 siblings, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2024-02-22 10:19 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: linux-kernel, linux-mm

On 25.10.23 16:45, Ryan Roberts wrote:
> As preparation for supporting small-sized THP in the swap-out path,
> without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE,
> which, when present, always implies PMD-sized THP, which is the same as
> the cluster size.
> 
> The only use of the flag was to determine whether a swap entry refers to
> a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
> Instead of relying on the flag, we now pass in nr_pages, which
> originates from the folio's number of pages. This allows the logic to
> work for folios of any order.
> 
> The one snag is that one of the swap_page_trans_huge_swapped() call
> sites does not have the folio. But it was only being called there to
> avoid bothering to call __try_to_reclaim_swap() in some cases.
> __try_to_reclaim_swap() gets the folio and (via some other functions)
> calls swap_page_trans_huge_swapped(). So I've removed the problematic
> call site and believe the new logic should be equivalent.

That is the  __try_to_reclaim_swap() -> folio_free_swap() -> 
folio_swapped() -> swap_page_trans_huge_swapped() call chain I assume.

The "difference" is that you will now (1) get another temporary 
reference on the folio and (2) (try)lock the folio every time you 
discard a single PTE of a (possibly) large THP.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-02-22 10:19   ` David Hildenbrand
@ 2024-02-22 10:20     ` David Hildenbrand
  2024-02-26 17:41       ` Ryan Roberts
  0 siblings, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2024-02-22 10:20 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: linux-kernel, linux-mm

On 22.02.24 11:19, David Hildenbrand wrote:
> On 25.10.23 16:45, Ryan Roberts wrote:
>> As preparation for supporting small-sized THP in the swap-out path,
>> without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE,
>> which, when present, always implies PMD-sized THP, which is the same as
>> the cluster size.
>>
>> The only use of the flag was to determine whether a swap entry refers to
>> a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
>> Instead of relying on the flag, we now pass in nr_pages, which
>> originates from the folio's number of pages. This allows the logic to
>> work for folios of any order.
>>
>> The one snag is that one of the swap_page_trans_huge_swapped() call
>> sites does not have the folio. But it was only being called there to
>> avoid bothering to call __try_to_reclaim_swap() in some cases.
>> __try_to_reclaim_swap() gets the folio and (via some other functions)
>> calls swap_page_trans_huge_swapped(). So I've removed the problematic
>> call site and believe the new logic should be equivalent.
> 
> That is the  __try_to_reclaim_swap() -> folio_free_swap() ->
> folio_swapped() -> swap_page_trans_huge_swapped() call chain I assume.
> 
> The "difference" is that you will now (1) get another temporary
> reference on the folio and (2) (try)lock the folio every time you
> discard a single PTE of a (possibly) large THP.
> 

Thinking about it, your change will not only affect THP, but any call to 
free_swap_and_cache().

Likely that's not what we want. :/

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2024-02-22 10:09     ` David Hildenbrand
@ 2024-02-23  9:46       ` Barry Song
  2024-02-27 12:05         ` Ryan Roberts
  0 siblings, 1 reply; 116+ messages in thread
From: Barry Song @ 2024-02-23  9:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: ryan.roberts, akpm, linux-kernel, linux-mm, mhocko, shy828301,
	wangkefeng.wang, willy, xiang, ying.huang, yuzhao, chrisl,
	surenb, hanchuanhua

On Thu, Feb 22, 2024 at 11:09 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 22.02.24 08:05, Barry Song wrote:
> > Hi Ryan,
> >
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 2cc0cb41fb32..ea19710aa4cd 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>                                      if (!can_split_folio(folio, NULL))
> >>                                              goto activate_locked;
> >>                                      /*
> >> -                                     * Split folios without a PMD map right
> >> -                                     * away. Chances are some or all of the
> >> -                                     * tail pages can be freed without IO.
> >> +                                     * Split PMD-mappable folios without a
> >> +                                     * PMD map right away. Chances are some
> >> +                                     * or all of the tail pages can be freed
> >> +                                     * without IO.
> >>                                       */
> >> -                                    if (!folio_entire_mapcount(folio) &&
> >> +                                    if (folio_test_pmd_mappable(folio) &&
> >> +                                        !folio_entire_mapcount(folio) &&
> >>                                          split_folio_to_list(folio,
> >>                                                              folio_list))
> >>                                              goto activate_locked;
> >
> > I ran a test to investigate what would happen while reclaiming a partially
> > unmapped large folio. for example, for 64KiB large folios, MADV_DONTNEED
> > 4KB~64KB, and keep the first subpage 0~4KiB.
>
> IOW, something that already happens with ordinary THP already IIRC.
>
> >
> > My test wants to address my three concerns,
> > a. whether we will have leak on swap slots
> > b. whether we will have redundant I/O
> > c. whether we will cause races on swapcache
> >
> > what i have done is printing folio->_nr_pages_mapped and dumping 16 swap_map[]
> > at some specific stage
> > 1. just after add_to_swap   (swap slots are allocated)
> > 2. before and after try_to_unmap   (ptes are set to swap_entry)
> > 3. before and after pageout (also add printk in zram driver to dump all I/O write)
> > 4. before and after remove_mapping
> >
> > The below is the dumped info for a particular large folio,
> >
> > 1. after add_to_swap
> > [   27.267357] vmscan: After add_to_swap shrink_folio_list 1947 mapnr:1
> > [   27.267650] vmscan: offset:101b0 swp_map 40-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> >
> > as you can see,
> > _nr_pages_mapped is 1 and all 16 swap_map are SWAP_HAS_CACHE (0x40)
> >
> >
> > 2. before and after try_to_unmap
> > [   27.268067] vmscan: before try to unmap shrink_folio_list 1991 mapnr:1
> > [   27.268372] try_to_unmap_one address:ffff731f0000 pte:e8000103cd0b43 pte_p:ffff0000c36a8f80
> > [   27.268854] vmscan: after try to unmap shrink_folio_list 1997 mapnr:0
> > [   27.269180] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> >
> > as you can see, one pte is set to swp_entry, and _nr_pages_mapped becomes
> > 0 from 1. The 1st swp_map becomes 0x41, SWAP_HAS_CACHE + 1
> >
> > 3. before and after pageout
> > [   27.269602] vmscan: before pageout shrink_folio_list 2065 mapnr:0
> > [   27.269880] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> > [   27.270691] zram: zram_write_page page:fffffc00030f3400 index:101b0
> > [   27.271061] zram: zram_write_page page:fffffc00030f3440 index:101b1
> > [   27.271416] zram: zram_write_page page:fffffc00030f3480 index:101b2
> > [   27.271751] zram: zram_write_page page:fffffc00030f34c0 index:101b3
> > [   27.272046] zram: zram_write_page page:fffffc00030f3500 index:101b4
> > [   27.272384] zram: zram_write_page page:fffffc00030f3540 index:101b5
> > [   27.272746] zram: zram_write_page page:fffffc00030f3580 index:101b6
> > [   27.273042] zram: zram_write_page page:fffffc00030f35c0 index:101b7
> > [   27.273339] zram: zram_write_page page:fffffc00030f3600 index:101b8
> > [   27.273676] zram: zram_write_page page:fffffc00030f3640 index:101b9
> > [   27.274044] zram: zram_write_page page:fffffc00030f3680 index:101ba
> > [   27.274554] zram: zram_write_page page:fffffc00030f36c0 index:101bb
> > [   27.274870] zram: zram_write_page page:fffffc00030f3700 index:101bc
> > [   27.275166] zram: zram_write_page page:fffffc00030f3740 index:101bd
> > [   27.275463] zram: zram_write_page page:fffffc00030f3780 index:101be
> > [   27.275760] zram: zram_write_page page:fffffc00030f37c0 index:101bf
> > [   27.276102] vmscan: after pageout and before needs_release shrink_folio_list 2124 mapnr:0
> >
> > as you can see, obviously, we have done redundant I/O - 16 zram_write_page though
> > 4~64KiB has been zap_pte_range before, we still write them to zRAM.
> >
> > 4. before and after remove_mapping
> > [   27.276428] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> > [   27.277485] vmscan: after remove_mapping shrink_folio_list 2169 mapnr:0 offset:0
> > [   27.277802] vmscan: offset:101b0 01-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
> >
> > as you can see, swp_map 1-15 becomes 0 and only the first swp_map is 1.
> > all SWAP_HAS_CACHE has been removed. This is perfect and there is no swap
> > slot leak at all!
> >
> > Thus, only two concerns are left for me,
> > 1. as we don't split anyway, we have done 15 unnecessary I/O if a large folio
> > is partially unmapped.
> > 2. large folio is added as a whole as a swapcache covering the range whose
> > part has been zapped. I am not quite sure if this will cause some problems
> > while some concurrent do_anon_page, swapin and swapout occurs between 3 and
> > 4 on zapped subpage1~subpage15. still struggling.. my brain is exploding...
>
> Just noting: I was running into something different in the past with
> THP. And it's effectively the same scenario, just swapout and
> MADV_DONTNEED reversed.
>
> Imagine you swapped out a THP and the THP it still is in the swapcache.
>
> Then you unmap/zap some PTEs, freeing up the swap slots.
>
> In zap_pte_range(), we'll call free_swap_and_cache(). There, we run into
> the "!swap_page_trans_huge_swapped(p, entry)", and we won't be calling
> __try_to_reclaim_swap().

I guess you mean swap_page_trans_huge_swapped(p, entry)  not
!swap_page_trans_huge_swapped(p, entry) ?

at that time, swap_page_trans_huge_swapped should be true as there are still
some entries whose swap_map=0x41 or above (SWAP_HAS_CACHE and
swap_count >= 1)

static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
                                         swp_entry_t entry,
                                         unsigned int nr_pages)
{
        ...
        for (i = 0; i < nr_pages; i++) {
                if (swap_count(map[offset + i])) {
                        ret = true;
                        break;
                }
        }
unlock_out:
        unlock_cluster_or_swap_info(si, ci);
        return ret;
}
So this will stop the swap free even for those ptes which have been
zapped?

Another case I have reported[1] is that while reclaiming a large folio,
in try_to_unmap_one, we are calling  page_vma_mapped_walk().
as it only begins to hold PTL after it hits a valid pte, a paralel
break-before-make might make 0nd, 1st and beginning PTEs zero,
try_to_unmap_one can read intermediate ptes value, thus we can run
into this below case.  afte try_to_unmap_one,
pte 0   -  untouched, present pte
pte 1   - untouched, present pte
pte 2  - swap entries
pte 3 - swap entries
...
pte n - swap entries

or

pte 0   -  untouched, present pte
pte 1  - swap entries
pte 2  - swap entries
pte 3  - swap entries
...
pte n - swap entries

etc.

Thus, after try_to_unmap, the folio is still mapped with some ptes becoming
swap entries, some PTEs are still present. it might be staying in swapcache
for a long time with a broken CONT-PTE.

I also hate that and hope for a SYNC way to let large folio hold PTL from the
0nd pte, thus, it won't get intermediate PTEs from other break-before-make.

This also doesn't increase PTL contention as my test shows we will always
get PTL for a large folio after skipping zero, one or two PTEs in
try_to_unmap_one.
but skipping 1 or 2 is really bad and sad, breaking a large folio from the same
whole to nr_pages different segments.

[1] https://lore.kernel.org/linux-mm/CAGsJ_4wo7BiJWSKb1K_WyAai30KmfckMQ3-mCJPXZ892CtXpyQ@mail.gmail.com/

>
> So we won't split the large folio that is in the swapcache and it will
> continue consuming "more memory" than intended until fully evicted.
>
> >
> > To me, it seems safer to split or do some other similar optimization if we find a
> > large folio has partial map and unmap.
>
> I'm hoping that we can avoid any new direct users of _nr_pages_mapped if
> possible.
>

Is _nr_pages_mapped < nr_pages a reasonable case to split as we
have known the folio has at least some subpages zapped?

> If we find that the folio is on the deferred split list, we might as
> well just split it right away, before swapping it out. That might be a
> reasonable optimization for the case you describe.

i tried to change Ryan's code as below

@@ -1905,11 +1922,12 @@ static unsigned int shrink_folio_list(struct
list_head *folio_list,
                                         * PMD map right away. Chances are some
                                         * or all of the tail pages can be freed
                                         * without IO.
+                                        * Similarly, split PTE-mapped folios if
+                                        * they have been already
deferred_split.
                                         */
-                                       if (folio_test_pmd_mappable(folio) &&
-                                           !folio_entire_mapcount(folio) &&
-                                           split_folio_to_list(folio,
-                                                               folio_list))
+                                       if
(((folio_test_pmd_mappable(folio) && !folio_entire_mapcount(folio)) ||
+
(!folio_test_pmd_mappable(folio) &&
!list_empty(&folio->_deferred_list)))
+                                           &&
split_folio_to_list(folio, folio_list))
                                                goto activate_locked;
                                }
                                if (!add_to_swap(folio)) {

It seems to work as expected. only one I/O is left for a large folio
with 16 PTEs
but 15 of them have been zapped before.

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 1/6] arm64: mm: swap: support THP_SWAP on hardware with MTE
  2024-01-26 23:14     ` Chris Li
@ 2024-02-26  2:59       ` Barry Song
  0 siblings, 0 replies; 116+ messages in thread
From: Barry Song @ 2024-02-26  2:59 UTC (permalink / raw)
  To: Chris Li
  Cc: ryan.roberts, akpm, david, linux-mm, linux-kernel, mhocko,
	shy828301, wangkefeng.wang, willy, xiang, ying.huang, yuzhao,
	surenb, steven.price, Barry Song

Hi Chris,

Thanks for reviewing. sorry for the late reply as I’ve been getting a lot to do
recently.

On Sat, Jan 27, 2024 at 12:14 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Thu, Jan 18, 2024 at 3:11 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > Commit d0637c505f8a1 ("arm64: enable THP_SWAP for arm64") brings up
> > THP_SWAP on ARM64, but it doesn't enable THP_SWP on hardware with
> > MTE as the MTE code works with the assumption tags save/restore is
> > always handling a folio with only one page.
> >
> > The limitation should be removed as more and more ARM64 SoCs have
> > this feature. Co-existence of MTE and THP_SWAP becomes more and
> > more important.
> >
> > This patch makes MTE tags saving support large folios, then we don't
> > need to split large folios into base pages for swapping out on ARM64
> > SoCs with MTE any more.
> >
> > arch_prepare_to_swap() should take folio rather than page as parameter
> > because we support THP swap-out as a whole. It saves tags for all
> > pages in a large folio.
> >
> > As now we are restoring tags based-on folio, in arch_swap_restore(),
> > we may increase some extra loops and early-exitings while refaulting
> > a large folio which is still in swapcache in do_swap_page(). In case
> > a large folio has nr pages, do_swap_page() will only set the PTE of
> > the particular page which is causing the page fault.
> > Thus do_swap_page() runs nr times, and each time, arch_swap_restore()
> > will loop nr times for those subpages in the folio. So right now the
> > algorithmic complexity becomes O(nr^2).
> >
> > Once we support mapping large folios in do_swap_page(), extra loops
> > and early-exitings will decrease while not being completely removed
> > as a large folio might get partially tagged in corner cases such as,
> > 1. a large folio in swapcache can be partially unmapped, thus, MTE
> > tags for the unmapped pages will be invalidated;
> > 2. users might use mprotect() to set MTEs on a part of a large folio.
> >
> > arch_thp_swp_supported() is dropped since ARM64 MTE was the only one
> > who needed it.
> >
> > Reviewed-by: Steven Price <steven.price@arm.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  arch/arm64/include/asm/pgtable.h | 21 +++-------------
> >  arch/arm64/mm/mteswap.c          | 42 ++++++++++++++++++++++++++++++++
> >  include/linux/huge_mm.h          | 12 ---------
> >  include/linux/pgtable.h          |  2 +-
> >  mm/page_io.c                     |  2 +-
> >  mm/swap_slots.c                  |  2 +-
> >  6 files changed, 49 insertions(+), 32 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 79ce70fbb751..9902395ca426 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -45,12 +45,6 @@
> >         __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > -static inline bool arch_thp_swp_supported(void)
> > -{
> > -       return !system_supports_mte();
> > -}
> > -#define arch_thp_swp_supported arch_thp_swp_supported
> > -
> >  /*
> >   * Outside of a few very special situations (e.g. hibernation), we always
> >   * use broadcast TLB invalidation instructions, therefore a spurious page
> > @@ -1042,12 +1036,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> >  #ifdef CONFIG_ARM64_MTE
> >
> >  #define __HAVE_ARCH_PREPARE_TO_SWAP
> > -static inline int arch_prepare_to_swap(struct page *page)
> > -{
> > -       if (system_supports_mte())
> > -               return mte_save_tags(page);
> > -       return 0;
> > -}
> > +#define arch_prepare_to_swap arch_prepare_to_swap
>
> This seems a noop, define "arch_prepare_to_swap" back to itself.
> What am I missing?
>
> I see. Answer my own question, I guess you want to allow someone to
> overwrite the arch_prepare_to_swap.
> Wouldn't testing against  __HAVE_ARCH_PREPARE_TO_SWAP enough to support that?

you are right. i was blindly copying my previous code

static inline bool arch_thp_swp_supported(void)
{
        return !system_supports_mte();
}
#define arch_thp_swp_supported arch_thp_swp_supported

for arch_thp_swp_supported, there isn't a similar MACRO. so we are depending
on arch_thp_swp_supported as a MACRO for itself in include/linux/huge_mm.h.

/*
 * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
 * limitations in the implementation like arm64 MTE can override this to
 * false
 */
#ifndef arch_thp_swp_supported
static inline bool arch_thp_swp_supported(void)
{
        return true;
}
#endif

Now the case is different, we do have  __HAVE_ARCH_PREPARE_TO_SWAP
instead.

>
> Maybe I need to understand better how you want others to extend this
> code to make suggestions.
> As it is, this looks strange.
>
> > +extern int arch_prepare_to_swap(struct folio *folio);
> >
> >  #define __HAVE_ARCH_SWAP_INVALIDATE
> >  static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> > @@ -1063,11 +1053,8 @@ static inline void arch_swap_invalidate_area(int type)
> >  }
> >
> >  #define __HAVE_ARCH_SWAP_RESTORE
> > -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > -{
> > -       if (system_supports_mte())
> > -               mte_restore_tags(entry, &folio->page);
> > -}
> > +#define arch_swap_restore arch_swap_restore
>
> Same here.

you are right, again.

>
> > +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
> >
> >  #endif /* CONFIG_ARM64_MTE */
> >
> > diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> > index a31833e3ddc5..b9ca1b35902f 100644
> > --- a/arch/arm64/mm/mteswap.c
> > +++ b/arch/arm64/mm/mteswap.c
> > @@ -68,6 +68,13 @@ void mte_invalidate_tags(int type, pgoff_t offset)
> >         mte_free_tag_storage(tags);
> >  }
> >
> > +static inline void __mte_invalidate_tags(struct page *page)
> > +{
> > +       swp_entry_t entry = page_swap_entry(page);
> > +
> > +       mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> > +}
> > +
> >  void mte_invalidate_tags_area(int type)
> >  {
> >         swp_entry_t entry = swp_entry(type, 0);
> > @@ -83,3 +90,38 @@ void mte_invalidate_tags_area(int type)
> >         }
> >         xa_unlock(&mte_pages);
> >  }
> > +
> > +int arch_prepare_to_swap(struct folio *folio)
> > +{
> > +       int err;
> > +       long i;
> > +
> > +       if (system_supports_mte()) {
> Very minor nitpick.
>
> You can do
> if (!system_supports_mte())
>     return 0;
>
> Here and the for loop would have less indent. The function looks flatter.

I agree.

>
> > +               long nr = folio_nr_pages(folio);
> > +
> > +               for (i = 0; i < nr; i++) {
> > +                       err = mte_save_tags(folio_page(folio, i));
> > +                       if (err)
> > +                               goto out;
> > +               }
> > +       }
> > +       return 0;
> > +
> > +out:
> > +       while (i--)
> > +               __mte_invalidate_tags(folio_page(folio, i));
> > +       return err;
> > +}
> > +
> > +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > +{
> > +       if (system_supports_mte()) {
>
> Same here.
>
> Looks good otherwise. None of the nitpicks are deal breakers.
>
> Acked-by: Chris Li <chrisl@kernel.org>

Thanks!

>
>
> Chris
>
> > +               long i, nr = folio_nr_pages(folio);
> > +
> > +               entry.val -= swp_offset(entry) & (nr - 1);
> > +               for (i = 0; i < nr; i++) {
> > +                       mte_restore_tags(entry, folio_page(folio, i));
> > +                       entry.val++;
> > +               }
> > +       }
> > +}
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 5adb86af35fc..67219d2309dd 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -530,16 +530,4 @@ static inline int split_folio(struct folio *folio)
> >         return split_folio_to_list(folio, NULL);
> >  }
> >
> > -/*
> > - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> > - * limitations in the implementation like arm64 MTE can override this to
> > - * false
> > - */
> > -#ifndef arch_thp_swp_supported
> > -static inline bool arch_thp_swp_supported(void)
> > -{
> > -       return true;
> > -}
> > -#endif
> > -
> >  #endif /* _LINUX_HUGE_MM_H */
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index f6d0e3513948..37fe83b0c358 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -925,7 +925,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
> >   * prototypes must be defined in the arch-specific asm/pgtable.h file.
> >   */
> >  #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
> > -static inline int arch_prepare_to_swap(struct page *page)
> > +static inline int arch_prepare_to_swap(struct folio *folio)
> >  {
> >         return 0;
> >  }
> > diff --git a/mm/page_io.c b/mm/page_io.c
> > index ae2b49055e43..a9a7c236aecc 100644
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> > @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> >          * Arch code may have to preserve more data than just the page
> >          * contents, e.g. memory tags.
> >          */
> > -       ret = arch_prepare_to_swap(&folio->page);
> > +       ret = arch_prepare_to_swap(folio);
> >         if (ret) {
> >                 folio_mark_dirty(folio);
> >                 folio_unlock(folio);
> > diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> > index 0bec1f705f8e..2325adbb1f19 100644
> > --- a/mm/swap_slots.c
> > +++ b/mm/swap_slots.c
> > @@ -307,7 +307,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
> >         entry.val = 0;
> >
> >         if (folio_test_large(folio)) {
> > -               if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> > +               if (IS_ENABLED(CONFIG_THP_SWAP))
> >                         get_swap_pages(1, &entry, folio_nr_pages(folio));
> >                 goto out;
> >         }
> > --
> > 2.34.1
> >

Best Regards,
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 2/6] mm: swap: introduce swap_nr_free() for batched swap_free()
  2024-01-26 23:17     ` Chris Li
@ 2024-02-26  4:47       ` Barry Song
  0 siblings, 0 replies; 116+ messages in thread
From: Barry Song @ 2024-02-26  4:47 UTC (permalink / raw)
  To: Chris Li
  Cc: ryan.roberts, akpm, david, linux-mm, linux-kernel, mhocko,
	shy828301, wangkefeng.wang, willy, xiang, ying.huang, yuzhao,
	surenb, steven.price, Chuanhua Han, Barry Song

Hi Chris,

Thanks!

On Sat, Jan 27, 2024 at 12:17 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Thu, Jan 18, 2024 at 3:11 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >
> > While swapping in a large folio, we need to free swaps related to the whole
> > folio. To avoid frequently acquiring and releasing swap locks, it is better
> > to introduce an API for batched free.
> >
> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  include/linux/swap.h |  6 ++++++
> >  mm/swapfile.c        | 29 +++++++++++++++++++++++++++++
> >  2 files changed, 35 insertions(+)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 4db00ddad261..31a4ee2dcd1c 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -478,6 +478,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >  extern int swap_duplicate(swp_entry_t);
> >  extern int swapcache_prepare(swp_entry_t);
> >  extern void swap_free(swp_entry_t);
> > +extern void swap_nr_free(swp_entry_t entry, int nr_pages);
> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >  extern int free_swap_and_cache(swp_entry_t);
> >  int swap_type_of(dev_t device, sector_t offset);
> > @@ -553,6 +554,11 @@ static inline void swap_free(swp_entry_t swp)
> >  {
> >  }
> >
> > +void swap_nr_free(swp_entry_t entry, int nr_pages)
> > +{
> > +
> > +}
> > +
> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >  {
> >  }
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 556ff7347d5f..6321bda96b77 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1335,6 +1335,35 @@ void swap_free(swp_entry_t entry)
> >                 __swap_entry_free(p, entry);
> >  }
> >
> > +void swap_nr_free(swp_entry_t entry, int nr_pages)
> > +{
> > +       int i;
> > +       struct swap_cluster_info *ci;
> > +       struct swap_info_struct *p;
> > +       unsigned type = swp_type(entry);
> > +       unsigned long offset = swp_offset(entry);
> > +       DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };
> > +
> > +       VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
>
> BUG_ON here seems a bit too developer originated. Maybe warn once and
> roll back to free one by one?

The function is used only for the case we are quite sure we are freeing
some contiguous swap entries within a cluster. if it is not the case,
we will need an array of entries[]. will people be more comfortable to
have a WARN_ON instead? but the problem is if that really happens,
it is a bug, WARN isn't enough.

>
> How big is your typical SWAPFILE_CUSTER and nr_pages typically in arm?

My case is SWAPFILE_CLUSTER  = HPAGE_PMD_NR = 2MB/4KB = 512.

>
> I ask this question because nr_ppages > 64, that is a totally
> different game, we can completely  bypass the swap cache slots.
>

I agree we have a chance to bypass slot cache if nr_pages is bigger than
SWAP_SLOTS_CACHE_SIZE. on the other hand, even when nr_pages <
64, we still have a good chance to optimize free_swap_slot() by batching
as there are many spin_lock and sort() for each single entry.


> > +
> > +       if (nr_pages == 1) {
> > +               swap_free(entry);
> > +               return;
> > +       }
> > +
> > +       p = _swap_info_get(entry);
> > +
> > +       ci = lock_cluster(p, offset);
> > +       for (i = 0; i < nr_pages; i++) {
> > +               if (__swap_entry_free_locked(p, offset + i, 1))
> > +                       __bitmap_set(usage, i, 1);
> > +       }
> > +       unlock_cluster(ci);
> > +
> > +       for_each_clear_bit(i, usage, nr_pages)
> > +               free_swap_slot(swp_entry(type, offset + i));
>
> Notice that free_swap_slot() internal has per CPU cache batching as
> well. Every free_swap_slot will get some per_cpu swap slot cache and
> cache->lock. There is double batching here.
> If the typical batch size here is bigger than 64 entries, we can go
> directly to batching swap_entry_free and avoid the free_swap_slot()
> batching altogether. Unlike free_swap_slot_entries(), here swap slots
> are all from one swap device, there is no need to sort and group the
> swap slot by swap devices.

I agree.  you are completely right!
However, to make the patchset smaller at the beginning, I prefer
these optimizations to be deferred as a separate patchset after this one.

>
> Chris
>
>
> Chris
>
> > +}
> > +
> >  /*
> >   * Called after dropping swapcache to decrease refcnt to swap entries.
> >   */
> > --
> > 2.34.1

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()
  2024-01-29 16:31             ` Chris Li
@ 2024-02-26  5:05               ` Barry Song
  0 siblings, 0 replies; 116+ messages in thread
From: Barry Song @ 2024-02-26  5:05 UTC (permalink / raw)
  To: Chris Li
  Cc: David Hildenbrand, ryan.roberts, akpm, linux-mm, linux-kernel,
	mhocko, shy828301, wangkefeng.wang, willy, xiang, ying.huang,
	yuzhao, surenb, steven.price, Barry Song, Chuanhua Han

On Tue, Jan 30, 2024 at 5:32 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Mon, Jan 29, 2024 at 2:07 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 29.01.24 04:25, Chris Li wrote:
> > > Hi David and Barry,
> > >
> > > On Mon, Jan 22, 2024 at 10:49 PM Barry Song <21cnbao@gmail.com> wrote:
> > >>
> > >>>
> > >>>
> > >>> I have on my todo list to move all that !anon handling out of
> > >>> folio_add_anon_rmap_ptes(), and instead make swapin code call add
> > >>> folio_add_new_anon_rmap(), where we'll have to pass an exclusive flag
> > >>> then (-> whole new folio exclusive).
> > >>>
> > >>> That's the cleaner approach.
> > >>>
> > >>
> > >> one tricky thing is that sometimes it is hard to know who is the first
> > >> one to add rmap and thus should
> > >> call folio_add_new_anon_rmap.
> > >> especially when we want to support swapin_readahead(), the one who
> > >> allocated large filio might not
> > >> be that one who firstly does rmap.
> > >
> > > I think Barry has a point. Two tasks might race to swap in the folio
> > > then race to perform the rmap.
> > > folio_add_new_anon_rmap() should only call a folio that is absolutely
> > > "new", not shared. The sharing in swap cache disqualifies that
> > > condition.
> >
> > We have to hold the folio lock. So only one task at a time might do the
> > folio_add_anon_rmap_ptes() right now, and the
> > folio_add_new_shared_anon_rmap() in the future [below].
> >
>
> Ah, I see. The folio_lock() is the answer I am looking for.
>
> > Also observe how folio_add_anon_rmap_ptes() states that one must hold
> > the page lock, because otherwise this would all be completely racy.
> >
> >  From the pte swp exclusive flags, we know for sure whether we are
> > dealing with exclusive vs. shared. I think patch #6 does not properly
> > check that all entries are actually the same in that regard (all
> > exclusive vs all shared). That likely needs fixing.
> >
> > [I have converting per-page PageAnonExclusive flags to a single
> > per-folio flag on my todo list. I suspect that we'll keep the
> > per-swp-pte exlusive bits, but the question is rather what we can
> > actually make work, because swap and migration just make it much more
> > complicated. Anyhow, future work]
> >
> > >
> > >> is it an acceptable way to do the below in do_swap_page?
> > >> if (!folio_test_anon(folio))
> > >>        folio_add_new_anon_rmap()
> > >> else
> > >>        folio_add_anon_rmap_ptes()
> > >
> > > I am curious to know the answer as well.
> >
> >
> > Yes, the end code should likely be something like:
> >
> > /* ksm created a completely new copy */
> > if (unlikely(folio != swapcache && swapcache)) {
> >         folio_add_new_anon_rmap(folio, vma, vmf->address);
> >         folio_add_lru_vma(folio, vma);
> > } else if (folio_test_anon(folio)) {
> >         folio_add_anon_rmap_ptes(rmap_flags)
> > } else {
> >         folio_add_new_anon_rmap(rmap_flags)
> > }
> >
> > Maybe we want to avoid teaching all existing folio_add_new_anon_rmap()
> > callers about a new flag, and just have a new
> > folio_add_new_shared_anon_rmap() instead. TBD.

if we have to add a wrapper like folio_add_new_shared_anon_rmap()
to avoid "if (folio_test_anon(folio))" and "else" everywhere, why not
we just do it in folio_add_anon_rmap_ptes() ?

folio_add_anon_rmap_ptes()
{
      if (!folio_test_anon(folio))
               return folio_add_new_anon_rmap();
}

Anyway, I am going to change the patch 4/6 to if(folio_test_anon)/else first
and drop this 5/6.
we may figure out if we need a wrapper later.

>
> There is more than one caller needed to perform that dance around
> folio_test_anon() then decide which function to call. It would be nice
> to have a wrapper function folio_add_new_shared_anon_rmap() to
> abstract this behavior.
>
>
> >
> > >
> > > BTW, that test might have a race as well. By the time the task got
> > > !anon result, this result might get changed by another task. We need
> > > to make sure in the caller context this race can't happen. Otherwise
> > > we can't do the above safely.
> > Again, folio lock. Observe the folio_lock_or_retry() call that covers
> > our existing folio_add_new_anon_rmap/folio_add_anon_rmap_pte calls.
>
> Ack. Thanks for the explanation.
>
> Chris

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT
  2024-01-29  2:15     ` Chris Li
@ 2024-02-26  6:39       ` Barry Song
  0 siblings, 0 replies; 116+ messages in thread
From: Barry Song @ 2024-02-26  6:39 UTC (permalink / raw)
  To: Chris Li
  Cc: ryan.roberts, akpm, david, linux-mm, linux-kernel, mhocko,
	shy828301, wangkefeng.wang, willy, xiang, ying.huang, yuzhao,
	surenb, steven.price, Chuanhua Han, Barry Song

On Mon, Jan 29, 2024 at 3:15 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Thu, Jan 18, 2024 at 3:12 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >
> > MADV_PAGEOUT and MADV_FREE are common cases in Android. Ryan's patchset has
> > supported swapping large folios out as a whole for vmscan case. This patch
> > extends the feature to madvise.
> >
> > If madvised range covers the whole large folio, we don't split it. Otherwise,
> > we still need to split it.
> >
> > This patch doesn't depend on ARM64's CONT-PTE, alternatively, it defines one
> > helper named pte_range_cont_mapped() to check if all PTEs are contiguously
> > mapped to a large folio.
> >
> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  include/asm-generic/tlb.h | 10 +++++++
> >  include/linux/pgtable.h   | 60 +++++++++++++++++++++++++++++++++++++++
> >  mm/madvise.c              | 48 +++++++++++++++++++++++++++++++
> >  3 files changed, 118 insertions(+)
> >
> > diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> > index 129a3a759976..f894e22da5d6 100644
> > --- a/include/asm-generic/tlb.h
> > +++ b/include/asm-generic/tlb.h
> > @@ -608,6 +608,16 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
> >                 __tlb_remove_tlb_entry(tlb, ptep, address);     \
> >         } while (0)
> >
> > +#define tlb_remove_nr_tlb_entry(tlb, ptep, address, nr)                        \
> > +       do {                                                            \
> > +               int i;                                                  \
> > +               tlb_flush_pte_range(tlb, address,                       \
> > +                               PAGE_SIZE * nr);                        \
> > +               for (i = 0; i < nr; i++)                                \
> > +                       __tlb_remove_tlb_entry(tlb, ptep + i,           \
> > +                                       address + i * PAGE_SIZE);       \
> > +       } while (0)
> > +
> >  #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)       \
> >         do {                                                    \
> >                 unsigned long _sz = huge_page_size(h);          \
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index 37fe83b0c358..da0c1cf447e3 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -320,6 +320,42 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
> >  }
> >  #endif
> >
> > +#ifndef pte_range_cont_mapped
> > +static inline bool pte_range_cont_mapped(unsigned long start_pfn,
> > +                                        pte_t *start_pte,
> > +                                        unsigned long start_addr,
> > +                                        int nr)
> > +{
> > +       int i;
> > +       pte_t pte_val;
> > +
> > +       for (i = 0; i < nr; i++) {
> > +               pte_val = ptep_get(start_pte + i);
> > +
> > +               if (pte_none(pte_val))
> > +                       return false;
>
> Hmm, the following check pte_pfn == start_pfn + i should have covered
> the pte none case?
>
> I think the pte_none means it can't have a valid pfn. So this check
> can be skipped?

yes. check pte_pfn == start_pfn + i should have covered the pte none
case. but leaving pte_none there seems to make the code more
readable.  i guess we need to check pte_present() too, a small chance is
swp_offset can equal pte_pfn after some shifting? in case, a PTE
within the large folio range has been a swap entry?

I am still thinking about if we have some cheaper way to check if a folio
is still entirely mapped. maybe sth like if
(list_empty(&folio->_deferred_list))?

>
> > +
> > +               if (pte_pfn(pte_val) != (start_pfn + i))
> > +                       return false;
> > +       }
> > +
> > +       return true;
> > +}
> > +#endif
> > +
> > +#ifndef pte_range_young
> > +static inline bool pte_range_young(pte_t *start_pte, int nr)
> > +{
> > +       int i;
> > +
> > +       for (i = 0; i < nr; i++)
> > +               if (pte_young(ptep_get(start_pte + i)))
> > +                       return true;
> > +
> > +       return false;
> > +}
> > +#endif
> > +
> >  #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> >  static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> >                                             unsigned long address,
> > @@ -580,6 +616,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> >  }
> >  #endif
> >
> > +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_RANGE_FULL
> > +static inline pte_t ptep_get_and_clear_range_full(struct mm_struct *mm,
> > +                                                 unsigned long start_addr,
> > +                                                 pte_t *start_pte,
> > +                                                 int nr, int full)
> > +{
> > +       int i;
> > +       pte_t pte;
> > +
> > +       pte = ptep_get_and_clear_full(mm, start_addr, start_pte, full);
> > +
> > +       for (i = 1; i < nr; i++)
> > +               ptep_get_and_clear_full(mm, start_addr + i * PAGE_SIZE,
> > +                                       start_pte + i, full);
> > +
> > +       return pte;
> > +}
> >
> >  /*
> >   * If two threads concurrently fault at the same page, the thread that
> > @@ -995,6 +1048,13 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> >  })
> >  #endif
> >
> > +#ifndef pte_nr_addr_end
> > +#define pte_nr_addr_end(addr, size, end)                               \
> > +({     unsigned long __boundary = ((addr) + size) & (~(size - 1));     \
> > +       (__boundary - 1 < (end) - 1)? __boundary: (end);                \
> > +})
> > +#endif
> > +
> >  /*
> >   * When walking page tables, we usually want to skip any p?d_none entries;
> >   * and any p?d_bad entries - reporting the error before resetting to none.
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 912155a94ed5..262460ac4b2e 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -452,6 +452,54 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >                 if (folio_test_large(folio)) {
> >                         int err;
> >
> > +                       if (!folio_test_pmd_mappable(folio)) {
>
> This session of code indent into the right too much.
> You can do:
>
> if (folio_test_pmd_mappable(folio))
>          goto split;
>
> to make the code flatter.

I guess we don't need  "if (!folio_test_pmd_mappable(folio))" at all
as the pmd case has been
handled at the first beginning of  madvise_cold_or_pageout_pte_range().

>
> > +                               int nr_pages = folio_nr_pages(folio);
> > +                               unsigned long folio_size = PAGE_SIZE * nr_pages;
> > +                               unsigned long start_addr = ALIGN_DOWN(addr, nr_pages * PAGE_SIZE);;
> > +                               unsigned long start_pfn = page_to_pfn(folio_page(folio, 0));
> > +                               pte_t *start_pte = pte - (addr - start_addr) / PAGE_SIZE;
> > +                               unsigned long next = pte_nr_addr_end(addr, folio_size, end);
> > +
> > +                               if (!pte_range_cont_mapped(start_pfn, start_pte, start_addr, nr_pages))
> > +                                       goto split;
> > +
> > +                               if (next - addr != folio_size) {
>
> Nitpick: One line statement does not need {
>
> > +                                       goto split;
> > +                               } else {
>
> When the previous if statement already "goto split", there is no need
> for the else. You can save one level of indentation.

right!

>
>
>
> > +                                       /* Do not interfere with other mappings of this page */
> > +                                       if (folio_estimated_sharers(folio) != 1)
> > +                                               goto skip;
> > +
> > +                                       VM_BUG_ON(addr != start_addr || pte != start_pte);
> > +
> > +                                       if (pte_range_young(start_pte, nr_pages)) {
> > +                                               ptent = ptep_get_and_clear_range_full(mm, start_addr, start_pte,
> > +                                                                                     nr_pages, tlb->fullmm);
> > +                                               ptent = pte_mkold(ptent);
> > +
> > +                                               set_ptes(mm, start_addr, start_pte, ptent, nr_pages);
> > +                                               tlb_remove_nr_tlb_entry(tlb, start_pte, start_addr, nr_pages);
> > +                                       }
> > +
> > +                                       folio_clear_referenced(folio);
> > +                                       folio_test_clear_young(folio);
> > +                                       if (pageout) {
> > +                                               if (folio_isolate_lru(folio)) {
> > +                                                       if (folio_test_unevictable(folio))
> > +                                                               folio_putback_lru(folio);
> > +                                                       else
> > +                                                               list_add(&folio->lru, &folio_list);
> > +                                               }
> > +                                       } else
> > +                                               folio_deactivate(folio);
>
> I notice this section is very similar to the earlier statements inside
> the same function.
> "if (pmd_trans_huge(*pmd)) {"
>
> Wondering if there is some way to unify the two a bit somehow.

we have duplicated the code three times - pmd, pte-mapped large, normal folio.
I am quite sure if we can extract a common function.

>
> Also notice if you test the else condition first,
>
> If (!pageout) {
>     folio_deactivate(folio);
>     goto skip;
> }
>
> You can save one level of indentation.
> Not your fault, I notice the section inside (pmd_trans_huge(*pmd))
> does exactly the same thing.
>

can address this issue once we have a common func.

> Chris
>
>
> > +                               }
> > +skip:
> > +                               pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
> > +                               addr = next - PAGE_SIZE;
> > +                               continue;
> > +
> > +                       }
> > +split:
> >                         if (folio_estimated_sharers(folio) != 1)
> >                                 break;
> >                         if (pageout_anon_only_filter && !folio_test_anon(folio))
> > --
> > 2.34.1
> >
> >

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 4/6] mm: support large folios swapin as a whole
  2024-01-27 19:53     ` Chris Li
@ 2024-02-26  7:29       ` Barry Song
  0 siblings, 0 replies; 116+ messages in thread
From: Barry Song @ 2024-02-26  7:29 UTC (permalink / raw)
  To: Chris Li
  Cc: ryan.roberts, akpm, david, linux-mm, linux-kernel, mhocko,
	shy828301, wangkefeng.wang, willy, xiang, ying.huang, yuzhao,
	surenb, steven.price, Chuanhua Han, Barry Song

On Sun, Jan 28, 2024 at 8:53 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Thu, Jan 18, 2024 at 3:12 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >
> > On an embedded system like Android, more than half of anon memory is actually
> > in swap devices such as zRAM. For example, while an app is switched to back-
> > ground, its most memory might be swapped-out.
> >
> > Now we have mTHP features, unfortunately, if we don't support large folios
> > swap-in, once those large folios are swapped-out, we immediately lose the
> > performance gain we can get through large folios and hardware optimization
> > such as CONT-PTE.
> >
> > This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> > to those contiguous swaps which were likely swapped out from mTHP as a whole.
> >
> > On the other hand, the current implementation only covers the SWAP_SYCHRONOUS
> > case. It doesn't support swapin_readahead as large folios yet.
> >
> > Right now, we are re-faulting large folios which are still in swapcache as a
> > whole, this can effectively decrease extra loops and early-exitings which we
> > have increased in arch_swap_restore() while supporting MTE restore for folios
> > rather than page.
> >
> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  mm/memory.c | 108 +++++++++++++++++++++++++++++++++++++++++++++-------
> >  1 file changed, 94 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index f61a48929ba7..928b3f542932 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -107,6 +107,8 @@ EXPORT_SYMBOL(mem_map);
> >  static vm_fault_t do_fault(struct vm_fault *vmf);
> >  static vm_fault_t do_anonymous_page(struct vm_fault *vmf);
> >  static bool vmf_pte_changed(struct vm_fault *vmf);
> > +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> > +                                     bool (*pte_range_check)(pte_t *, int));
>
> Instead of returning "bool", the pte_range_check() can return the
> start of the swap entry of the large folio.
> That will save some of the later code needed to get the start of the
> large folio.

I am trying to reuse alloc_anon_folio() for both do_anon_page and
do_swap_page. Unfortunately, this func returns a folio, no more
place to return a swap entry unless we add a parameter. Getting
start swap is quite cheap on the other hand.

>
> >
> >  /*
> >   * Return true if the original pte was a uffd-wp pte marker (so the pte was
> > @@ -3784,6 +3786,34 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> >         return VM_FAULT_SIGBUS;
> >  }
> >
> > +static bool pte_range_swap(pte_t *pte, int nr_pages)
>
> This function name seems to suggest it will perform the range swap.
> That is not what it is doing.
> Suggest change to some other name reflecting that it is only a
> condition test without actual swap action.
> I am not very good at naming functions. Just think it out loud: e.g.
> pte_range_swap_check, pte_test_range_swap. You can come up with
> something better.

Ryan has a function named pte_range_none, which is checking the whole
range is pte_none. Maybe we can have an is_pte_range_contig_swap
which includes both swap and contiguous as we only need contiguous
swap entries.

>
>
> > +{
> > +       int i;
> > +       swp_entry_t entry;
> > +       unsigned type;
> > +       pgoff_t start_offset;
> > +
> > +       entry = pte_to_swp_entry(ptep_get_lockless(pte));
> > +       if (non_swap_entry(entry))
> > +               return false;
> > +       start_offset = swp_offset(entry);
> > +       if (start_offset % nr_pages)
> > +               return false;
>
> This suggests the pte argument needs to point to the beginning of the
> large folio equivalent of swap entry(not sure what to call it. Let me
> call it "large folio swap" here.).
> We might want to unify the terms for that.
> Any way, might want to document this requirement, otherwise the caller
> might consider passing the current pte that generates the fault. From
> the function name it is not obvious which pte should pass it.

ok, Ryan's swap-out will allocate swap entries whose start offset is
aligned with nr_pages. will add some doc to describe the first entry.

>
> > +
> > +       type = swp_type(entry);
> > +       for (i = 1; i < nr_pages; i++) {
>
> You might want to test the last page backwards, because if the entry
> is not the large folio swap, most likely it will have the last entry
> invalid.  Some of the beginning swap entries might match due to batch
> allocation etc. The SSD likes to group the nearby swap entry write out
> together on the disk.

I am not sure I got your point. This is checking all pages within
the range of a large folio, Ryan's patch allocates swap entries
all together as a whole for a large folio while swapping out.

@@ -1073,14 +1133,13 @@ int get_swap_pages(int n_goal, swp_entry_t
swp_entries[], int entry_size)
         spin_unlock(&si->lock);
         goto nextsi;
  }
-        if (size == SWAPFILE_CLUSTER) {
-               if (si->flags & SWP_BLKDEV)
-                      n_ret = swap_alloc_cluster(si, swp_entries);
+               if (size > 1) {
+                      n_ret = swap_alloc_large(si, swp_entries, size);
                } else
                       n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
     n_goal, swp_entries);


>
>
>
> > +               entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
>
> > +               if (non_swap_entry(entry))
> > +                       return false;
> > +               if (swp_offset(entry) != start_offset + i)
> > +                       return false;
> > +               if (swp_type(entry) != type)
> > +                       return false;
> > +       }
> > +
> > +       return true;
> > +}
> > +
> >  /*
> >   * We enter with non-exclusive mmap_lock (to exclude vma changes,
> >   * but allow concurrent faults), and pte mapped but not yet locked.
> > @@ -3804,6 +3834,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >         pte_t pte;
> >         vm_fault_t ret = 0;
> >         void *shadow = NULL;
> > +       int nr_pages = 1;
> > +       unsigned long start_address;
> > +       pte_t *start_pte;
> >
> >         if (!pte_unmap_same(vmf))
> >                 goto out;
> > @@ -3868,13 +3901,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >                     __swap_count(entry) == 1) {
> >                         /* skip swapcache */
> > -                       folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > -                                               vma, vmf->address, false);
> > +                       folio = alloc_anon_folio(vmf, pte_range_swap);
>
> This function can call pte_range_swap() twice(), one here, another one
> in folio_test_large().
> Consider caching the result so it does not need to walk the pte range
> swap twice.
>
> I think alloc_anon_folio should either be told what is the
> size(prefered) or just figure out the right size. I don't think it
> needs to pass in the checking function as function callbacks. There
> are two call sites of alloc_anon_folio, they are all within this
> function. The call back seems a bit overkill here. Also duplicate the
> range swap walk.

alloc_anon_folio is reusing the one for do_anon_page. in both
cases, scanning PTEs to figure out the proper size is done.
The other call site is within do_anonymous_page().

>
> >                         page = &folio->page;
> >                         if (folio) {
> >                                 __folio_set_locked(folio);
> >                                 __folio_set_swapbacked(folio);
> >
> > +                               if (folio_test_large(folio)) {
> > +                                       unsigned long start_offset;
> > +
> > +                                       nr_pages = folio_nr_pages(folio);
> > +                                       start_offset = swp_offset(entry) & ~(nr_pages - 1);
> Here is the first place place we roll up the start offset with folio size
>
> > +                                       entry = swp_entry(swp_type(entry), start_offset);
> > +                               }
> > +
> >                                 if (mem_cgroup_swapin_charge_folio(folio,
> >                                                         vma->vm_mm, GFP_KERNEL,
> >                                                         entry)) {
> > @@ -3980,6 +4020,39 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >          */
> >         vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >                         &vmf->ptl);
> > +
> > +       start_address = vmf->address;
> > +       start_pte = vmf->pte;
> > +       if (folio_test_large(folio)) {
> > +               unsigned long nr = folio_nr_pages(folio);
> > +               unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
> > +               pte_t *pte_t = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
>
> Here is the second place we roll up the folio size.
> Maybe we can cache results and avoid repetition?

We have two paths getting into large folios
1. we allocate a new large folio
2. we find a large folio in swapcache

We have rolled up the folio size for case 1 before, but here we need to
take care of case 2 as well. so that is why we need both. let me think
if we can have some way to remove some redundant code for case 1.

>
> > +
> > +               /*
> > +                * case 1: we are allocating large_folio, try to map it as a whole
> > +                * iff the swap entries are still entirely mapped;
> > +                * case 2: we hit a large folio in swapcache, and all swap entries
> > +                * are still entirely mapped, try to map a large folio as a whole.
> > +                * otherwise, map only the faulting page within the large folio
> > +                * which is swapcache
> > +                */
>
> One question I have in mind is that the swap device is locked. We
> can't change the swap slot allocations.
> It does not stop the pte entry getting changed right? Then we can have
> someone in the user pace racing to change the PTE vs we checking the
> pte there.
>
> > +               if (pte_range_swap(pte_t, nr)) {
>
> After this pte_range_swap() check, some of the PTE entries get changed
> and now we don't have the full large page swap any more?
> At least I can't conclude this possibility can't happen yet, please
> enlighten me.

This check is under PTL. no one else can change it as they have to
hold PTL to change pte.
        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
                        &vmf->ptl);


>
> > +                       start_address = addr;
> > +                       start_pte = pte_t;
> > +                       if (unlikely(folio == swapcache)) {
> > +                               /*
> > +                                * the below has been done before swap_read_folio()
> > +                                * for case 1
> > +                                */
> > +                               nr_pages = nr;
> > +                               entry = pte_to_swp_entry(ptep_get(start_pte));
>
> If we make pte_range_swap() return the entry, we can avoid refetching
> the swap entry here.

we will have to add a parameter swp_entry_t *first_entry to return
the entry. The difficulty is we will have to add this parameter in
alloc_anon_folio() as well, that's a bit overkill for that function.


>
> > +                               page = &folio->page;
> > +                       }
> > +               } else if (nr_pages > 1) { /* ptes have changed for case 1 */
> > +                       goto out_nomap;
> > +               }
> > +       }
> > +
> I rewrote the above to make the code indentation matching the execution flow.
> I did not add any functional change. Just rearrange the code to be a
> bit more streamlined. Get rid of the "else if goto".
>                if (!pte_range_swap(pte_t, nr)) {
>                         if (nr_pages > 1)  /* ptes have changed for case 1 */
>                                 goto out_nomap;
>                         goto check_pte;
>                 }
>
>                 start_address = addr;
>                 start_pte = pte_t;
>                 if (unlikely(folio == swapcache)) {
>                         /*
>                          * the below has been done before swap_read_folio()
>                          * for case 1
>                          */
>                         nr_pages = nr;
>                         entry = pte_to_swp_entry(ptep_get(start_pte));
>                         page = &folio->page;
>                 }
>         }

looks good to me.

>
> check_pte:
>
> >         if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >                 goto out_nomap;
> >
> > @@ -4047,12 +4120,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >          * We're already holding a reference on the page but haven't mapped it
> >          * yet.
> >          */
> > -       swap_free(entry);
> > +       swap_nr_free(entry, nr_pages);
> >         if (should_try_to_free_swap(folio, vma, vmf->flags))
> >                 folio_free_swap(folio);
> >
> > -       inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> > -       dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> > +       folio_ref_add(folio, nr_pages - 1);
> > +       add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > +       add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> > +
> >         pte = mk_pte(page, vma->vm_page_prot);
> >
> >         /*
> > @@ -4062,14 +4137,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >          * exclusivity.
> >          */
> >         if (!folio_test_ksm(folio) &&
> > -           (exclusive || folio_ref_count(folio) == 1)) {
> > +           (exclusive || folio_ref_count(folio) == nr_pages)) {
> >                 if (vmf->flags & FAULT_FLAG_WRITE) {
> >                         pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> >                         vmf->flags &= ~FAULT_FLAG_WRITE;
> >                 }
> >                 rmap_flags |= RMAP_EXCLUSIVE;
> >         }
> > -       flush_icache_page(vma, page);
> > +       flush_icache_pages(vma, page, nr_pages);
> >         if (pte_swp_soft_dirty(vmf->orig_pte))
> >                 pte = pte_mksoft_dirty(pte);
> >         if (pte_swp_uffd_wp(vmf->orig_pte))
> > @@ -4081,14 +4156,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >                 folio_add_new_anon_rmap(folio, vma, vmf->address);
> >                 folio_add_lru_vma(folio, vma);
> >         } else {
> > -               folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> > +               folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> >                                         rmap_flags);
> >         }
> >
> >         VM_BUG_ON(!folio_test_anon(folio) ||
> >                         (pte_write(pte) && !PageAnonExclusive(page)));
> > -       set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> > -       arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> > +       set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> > +
> > +       arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
> >
> >         folio_unlock(folio);
> >         if (folio != swapcache && swapcache) {
> > @@ -4105,6 +4181,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >         }
> >
> >         if (vmf->flags & FAULT_FLAG_WRITE) {
> > +               if (folio_test_large(folio) && nr_pages > 1)
> > +                       vmf->orig_pte = ptep_get(vmf->pte);
> > +
> >                 ret |= do_wp_page(vmf);
> >                 if (ret & VM_FAULT_ERROR)
> >                         ret &= VM_FAULT_ERROR;
> > @@ -4112,7 +4191,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >         }
> >
> >         /* No need to invalidate - it was non-present before */
> > -       update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> > +       update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> >  unlock:
> >         if (vmf->pte)
> >                 pte_unmap_unlock(vmf->pte, vmf->ptl);
> > @@ -4148,7 +4227,8 @@ static bool pte_range_none(pte_t *pte, int nr_pages)
> >         return true;
> >  }
> >
> > -static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> > +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> > +                                     bool (*pte_range_check)(pte_t *, int))
> >  {
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >         struct vm_area_struct *vma = vmf->vma;
> > @@ -4190,7 +4270,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>
> About this patch context we have the following comments in the source code.
>         /*
>          * Find the highest order where the aligned range is completely
>          * pte_none(). Note that all remaining orders will be completely
>          * pte_none().
>          */
> >         order = highest_order(orders);
> >         while (orders) {
> >                 addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > -               if (pte_range_none(pte + pte_index(addr), 1 << order))
> > +               if (pte_range_check(pte + pte_index(addr), 1 << order))
>
> Again, I don't think we need to pass in the pte_range_check() as call
> back functions.
> There are only two call sites, all within this file. This will totally
> invalide the above comments about pte_none(). In the worst case, just
> make it accept one argument: it is checking swap range or none range
> or not. Depending on the argument, do check none or swap range.
> We should make it blend in with alloc_anon_folio better. My gut
> feeling is that there should be a better way to make the range check
> blend in with alloc_anon_folio better. e.g. Maybe store some of the
> large swap context in the vmf and pass to different places etc. I need
> to spend more time thinking about it to come up with happier
> solutions.

could pass a type to hint pte_range_none or pte_range_swap.
i'd like to avoid changing any global variable like vmf, as people
will have to cross two or more functions to understand what is
going on though the second function might be able to use the
changed vmf value in the first function. but it really makes the
code have more couples.

>
> Chris
>
> >                         break;
> >                 order = next_order(&orders, order);
> >         }
> > @@ -4269,7 +4349,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >         if (unlikely(anon_vma_prepare(vma)))
> >                 goto oom;
> >         /* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
> > -       folio = alloc_anon_folio(vmf);
> > +       folio = alloc_anon_folio(vmf, pte_range_none);
> >         if (IS_ERR(folio))
> >                 return 0;
> >         if (!folio)
> > --
> > 2.34.1
> >

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 4/6] mm: support large folios swapin as a whole
  2024-01-27 20:06     ` Chris Li
@ 2024-02-26  7:31       ` Barry Song
  0 siblings, 0 replies; 116+ messages in thread
From: Barry Song @ 2024-02-26  7:31 UTC (permalink / raw)
  To: Chris Li
  Cc: ryan.roberts, akpm, david, linux-mm, linux-kernel, mhocko,
	shy828301, wangkefeng.wang, willy, xiang, ying.huang, yuzhao,
	surenb, steven.price, Chuanhua Han, Barry Song

On Sun, Jan 28, 2024 at 9:06 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Thu, Jan 18, 2024 at 3:12 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >
> > On an embedded system like Android, more than half of anon memory is actually
> > in swap devices such as zRAM. For example, while an app is switched to back-
> > ground, its most memory might be swapped-out.
> >
> > Now we have mTHP features, unfortunately, if we don't support large folios
> > swap-in, once those large folios are swapped-out, we immediately lose the
> > performance gain we can get through large folios and hardware optimization
> > such as CONT-PTE.
> >
> > This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> > to those contiguous swaps which were likely swapped out from mTHP as a whole.
> >
> > On the other hand, the current implementation only covers the SWAP_SYCHRONOUS
> > case. It doesn't support swapin_readahead as large folios yet.
> >
> > Right now, we are re-faulting large folios which are still in swapcache as a
> > whole, this can effectively decrease extra loops and early-exitings which we
> > have increased in arch_swap_restore() while supporting MTE restore for folios
> > rather than page.
> >
> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  mm/memory.c | 108 +++++++++++++++++++++++++++++++++++++++++++++-------
> >  1 file changed, 94 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index f61a48929ba7..928b3f542932 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -107,6 +107,8 @@ EXPORT_SYMBOL(mem_map);
> >  static vm_fault_t do_fault(struct vm_fault *vmf);
> >  static vm_fault_t do_anonymous_page(struct vm_fault *vmf);
> >  static bool vmf_pte_changed(struct vm_fault *vmf);
> > +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> > +                                     bool (*pte_range_check)(pte_t *, int));
> >
> >  /*
> >   * Return true if the original pte was a uffd-wp pte marker (so the pte was
> > @@ -3784,6 +3786,34 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> >         return VM_FAULT_SIGBUS;
> >  }
> >
> > +static bool pte_range_swap(pte_t *pte, int nr_pages)
> > +{
> > +       int i;
> > +       swp_entry_t entry;
> > +       unsigned type;
> > +       pgoff_t start_offset;
> > +
> > +       entry = pte_to_swp_entry(ptep_get_lockless(pte));
> > +       if (non_swap_entry(entry))
> > +               return false;
> > +       start_offset = swp_offset(entry);
> > +       if (start_offset % nr_pages)
> > +               return false;
> > +
> > +       type = swp_type(entry);
> > +       for (i = 1; i < nr_pages; i++) {
> > +               entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
> > +               if (non_swap_entry(entry))
> > +                       return false;
> > +               if (swp_offset(entry) != start_offset + i)
> > +                       return false;
> > +               if (swp_type(entry) != type)
> > +                       return false;
> > +       }
> > +
> > +       return true;
> > +}
> > +
> >  /*
> >   * We enter with non-exclusive mmap_lock (to exclude vma changes,
> >   * but allow concurrent faults), and pte mapped but not yet locked.
> > @@ -3804,6 +3834,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >         pte_t pte;
> >         vm_fault_t ret = 0;
> >         void *shadow = NULL;
> > +       int nr_pages = 1;
> > +       unsigned long start_address;
> > +       pte_t *start_pte;
> >
> >         if (!pte_unmap_same(vmf))
> >                 goto out;
> > @@ -3868,13 +3901,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >                     __swap_count(entry) == 1) {
> >                         /* skip swapcache */
> > -                       folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > -                                               vma, vmf->address, false);
> > +                       folio = alloc_anon_folio(vmf, pte_range_swap);
> >                         page = &folio->page;
> >                         if (folio) {
> >                                 __folio_set_locked(folio);
> >                                 __folio_set_swapbacked(folio);
> >
> > +                               if (folio_test_large(folio)) {
> > +                                       unsigned long start_offset;
> > +
> > +                                       nr_pages = folio_nr_pages(folio);
> > +                                       start_offset = swp_offset(entry) & ~(nr_pages - 1);
> > +                                       entry = swp_entry(swp_type(entry), start_offset);
> > +                               }
> > +
> >                                 if (mem_cgroup_swapin_charge_folio(folio,
> >                                                         vma->vm_mm, GFP_KERNEL,
> >                                                         entry)) {
> > @@ -3980,6 +4020,39 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >          */
> >         vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >                         &vmf->ptl);
> > +
> > +       start_address = vmf->address;
> > +       start_pte = vmf->pte;
> > +       if (folio_test_large(folio)) {
> > +               unsigned long nr = folio_nr_pages(folio);
> > +               unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
> > +               pte_t *pte_t = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
>
> I forgot about one comment here.
> Please change the variable name other than "pte_t", it is a bit
> strange to use the typedef name as variable name here.
>

make sense!

> Chris

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-02-22 10:20     ` David Hildenbrand
@ 2024-02-26 17:41       ` Ryan Roberts
  2024-02-27 17:10         ` Ryan Roberts
  0 siblings, 1 reply; 116+ messages in thread
From: Ryan Roberts @ 2024-02-26 17:41 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: linux-kernel, linux-mm

On 22/02/2024 10:20, David Hildenbrand wrote:
> On 22.02.24 11:19, David Hildenbrand wrote:
>> On 25.10.23 16:45, Ryan Roberts wrote:
>>> As preparation for supporting small-sized THP in the swap-out path,
>>> without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE,
>>> which, when present, always implies PMD-sized THP, which is the same as
>>> the cluster size.
>>>
>>> The only use of the flag was to determine whether a swap entry refers to
>>> a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
>>> Instead of relying on the flag, we now pass in nr_pages, which
>>> originates from the folio's number of pages. This allows the logic to
>>> work for folios of any order.
>>>
>>> The one snag is that one of the swap_page_trans_huge_swapped() call
>>> sites does not have the folio. But it was only being called there to
>>> avoid bothering to call __try_to_reclaim_swap() in some cases.
>>> __try_to_reclaim_swap() gets the folio and (via some other functions)
>>> calls swap_page_trans_huge_swapped(). So I've removed the problematic
>>> call site and believe the new logic should be equivalent.
>>
>> That is the  __try_to_reclaim_swap() -> folio_free_swap() ->
>> folio_swapped() -> swap_page_trans_huge_swapped() call chain I assume.
>>
>> The "difference" is that you will now (1) get another temporary
>> reference on the folio and (2) (try)lock the folio every time you
>> discard a single PTE of a (possibly) large THP.
>>
> 
> Thinking about it, your change will not only affect THP, but any call to
> free_swap_and_cache().
> 
> Likely that's not what we want. :/
> 

Is folio_trylock() really that expensive given the code path is already locking
multiple spinlocks, and I don't think we would expect the folio lock to be very
contended?

I guess filemap_get_folio() could be a bit more expensive, but again, is this
really a deal-breaker?


I'm just trying to refamiliarize myself with this series, but I think I ended up
allocating a cluster per cpu per order. So one potential solution would be to
turn the flag into a size and store it in the cluster info. (In fact I think I
was doing that in an early version of this series - will have to look at why I
got rid of that). Then we could avoid needing to figure out nr_pages from the folio.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2024-02-23  9:46       ` Barry Song
@ 2024-02-27 12:05         ` Ryan Roberts
  2024-02-28  1:23           ` Barry Song
  0 siblings, 1 reply; 116+ messages in thread
From: Ryan Roberts @ 2024-02-27 12:05 UTC (permalink / raw)
  To: Barry Song, David Hildenbrand
  Cc: akpm, linux-kernel, linux-mm, mhocko, shy828301, wangkefeng.wang,
	willy, xiang, ying.huang, yuzhao, chrisl, surenb, hanchuanhua

On 23/02/2024 09:46, Barry Song wrote:
> On Thu, Feb 22, 2024 at 11:09 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 22.02.24 08:05, Barry Song wrote:
>>> Hi Ryan,
>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 2cc0cb41fb32..ea19710aa4cd 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>>                                      if (!can_split_folio(folio, NULL))
>>>>                                              goto activate_locked;
>>>>                                      /*
>>>> -                                     * Split folios without a PMD map right
>>>> -                                     * away. Chances are some or all of the
>>>> -                                     * tail pages can be freed without IO.
>>>> +                                     * Split PMD-mappable folios without a
>>>> +                                     * PMD map right away. Chances are some
>>>> +                                     * or all of the tail pages can be freed
>>>> +                                     * without IO.
>>>>                                       */
>>>> -                                    if (!folio_entire_mapcount(folio) &&
>>>> +                                    if (folio_test_pmd_mappable(folio) &&
>>>> +                                        !folio_entire_mapcount(folio) &&
>>>>                                          split_folio_to_list(folio,
>>>>                                                              folio_list))
>>>>                                              goto activate_locked;
>>>
>>> I ran a test to investigate what would happen while reclaiming a partially
>>> unmapped large folio. for example, for 64KiB large folios, MADV_DONTNEED
>>> 4KB~64KB, and keep the first subpage 0~4KiB.
>>
>> IOW, something that already happens with ordinary THP already IIRC.
>>
>>>
>>> My test wants to address my three concerns,
>>> a. whether we will have leak on swap slots
>>> b. whether we will have redundant I/O
>>> c. whether we will cause races on swapcache
>>>
>>> what i have done is printing folio->_nr_pages_mapped and dumping 16 swap_map[]
>>> at some specific stage
>>> 1. just after add_to_swap   (swap slots are allocated)
>>> 2. before and after try_to_unmap   (ptes are set to swap_entry)
>>> 3. before and after pageout (also add printk in zram driver to dump all I/O write)
>>> 4. before and after remove_mapping
>>>
>>> The below is the dumped info for a particular large folio,
>>>
>>> 1. after add_to_swap
>>> [   27.267357] vmscan: After add_to_swap shrink_folio_list 1947 mapnr:1
>>> [   27.267650] vmscan: offset:101b0 swp_map 40-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
>>>
>>> as you can see,
>>> _nr_pages_mapped is 1 and all 16 swap_map are SWAP_HAS_CACHE (0x40)
>>>
>>>
>>> 2. before and after try_to_unmap
>>> [   27.268067] vmscan: before try to unmap shrink_folio_list 1991 mapnr:1
>>> [   27.268372] try_to_unmap_one address:ffff731f0000 pte:e8000103cd0b43 pte_p:ffff0000c36a8f80
>>> [   27.268854] vmscan: after try to unmap shrink_folio_list 1997 mapnr:0
>>> [   27.269180] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
>>>
>>> as you can see, one pte is set to swp_entry, and _nr_pages_mapped becomes
>>> 0 from 1. The 1st swp_map becomes 0x41, SWAP_HAS_CACHE + 1
>>>
>>> 3. before and after pageout
>>> [   27.269602] vmscan: before pageout shrink_folio_list 2065 mapnr:0
>>> [   27.269880] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
>>> [   27.270691] zram: zram_write_page page:fffffc00030f3400 index:101b0
>>> [   27.271061] zram: zram_write_page page:fffffc00030f3440 index:101b1
>>> [   27.271416] zram: zram_write_page page:fffffc00030f3480 index:101b2
>>> [   27.271751] zram: zram_write_page page:fffffc00030f34c0 index:101b3
>>> [   27.272046] zram: zram_write_page page:fffffc00030f3500 index:101b4
>>> [   27.272384] zram: zram_write_page page:fffffc00030f3540 index:101b5
>>> [   27.272746] zram: zram_write_page page:fffffc00030f3580 index:101b6
>>> [   27.273042] zram: zram_write_page page:fffffc00030f35c0 index:101b7
>>> [   27.273339] zram: zram_write_page page:fffffc00030f3600 index:101b8
>>> [   27.273676] zram: zram_write_page page:fffffc00030f3640 index:101b9
>>> [   27.274044] zram: zram_write_page page:fffffc00030f3680 index:101ba
>>> [   27.274554] zram: zram_write_page page:fffffc00030f36c0 index:101bb
>>> [   27.274870] zram: zram_write_page page:fffffc00030f3700 index:101bc
>>> [   27.275166] zram: zram_write_page page:fffffc00030f3740 index:101bd
>>> [   27.275463] zram: zram_write_page page:fffffc00030f3780 index:101be
>>> [   27.275760] zram: zram_write_page page:fffffc00030f37c0 index:101bf
>>> [   27.276102] vmscan: after pageout and before needs_release shrink_folio_list 2124 mapnr:0
>>>
>>> as you can see, obviously, we have done redundant I/O - 16 zram_write_page though
>>> 4~64KiB has been zap_pte_range before, we still write them to zRAM.
>>>
>>> 4. before and after remove_mapping
>>> [   27.276428] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
>>> [   27.277485] vmscan: after remove_mapping shrink_folio_list 2169 mapnr:0 offset:0
>>> [   27.277802] vmscan: offset:101b0 01-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
>>>
>>> as you can see, swp_map 1-15 becomes 0 and only the first swp_map is 1.
>>> all SWAP_HAS_CACHE has been removed. This is perfect and there is no swap
>>> slot leak at all!
>>>
>>> Thus, only two concerns are left for me,
>>> 1. as we don't split anyway, we have done 15 unnecessary I/O if a large folio
>>> is partially unmapped.

So the cost of this is increased IO and swap storage, correct? Is this a big
problem in practice? i.e. do you see a lot of partially mapped large folios in
your workload? (I agree the proposed fix below is simple, so I think we should
do it anyway - I'm just interested in the scale of the problem).

>>> 2. large folio is added as a whole as a swapcache covering the range whose
>>> part has been zapped. I am not quite sure if this will cause some problems
>>> while some concurrent do_anon_page, swapin and swapout occurs between 3 and
>>> 4 on zapped subpage1~subpage15. still struggling.. my brain is exploding...

Yes mine too. I would only expect the ptes that map the folio will get replaced
with swap entries? So I would expect it to be safe. Although I understand the
concern with the extra swap consumption.

[...]
>>>
>>> To me, it seems safer to split or do some other similar optimization if we find a
>>> large folio has partial map and unmap.
>>
>> I'm hoping that we can avoid any new direct users of _nr_pages_mapped if
>> possible.
>>
> 
> Is _nr_pages_mapped < nr_pages a reasonable case to split as we
> have known the folio has at least some subpages zapped?

I'm not sure we need this - the folio's presence on the split list will tell us
everything we need to know I think?

> 
>> If we find that the folio is on the deferred split list, we might as
>> well just split it right away, before swapping it out. That might be a
>> reasonable optimization for the case you describe.

Yes, agreed. I think there is still chance of a race though; Some other thread
could be munmapping in parallel. But in that case, I think we just end up with
the increased IO and swap storage? That's not the end of the world if its a
corner case.

> 
> i tried to change Ryan's code as below
> 
> @@ -1905,11 +1922,12 @@ static unsigned int shrink_folio_list(struct
> list_head *folio_list,
>                                          * PMD map right away. Chances are some
>                                          * or all of the tail pages can be freed
>                                          * without IO.
> +                                        * Similarly, split PTE-mapped folios if
> +                                        * they have been already
> deferred_split.
>                                          */
> -                                       if (folio_test_pmd_mappable(folio) &&
> -                                           !folio_entire_mapcount(folio) &&
> -                                           split_folio_to_list(folio,
> -                                                               folio_list))
> +                                       if
> (((folio_test_pmd_mappable(folio) && !folio_entire_mapcount(folio)) ||
> +
> (!folio_test_pmd_mappable(folio) &&
> !list_empty(&folio->_deferred_list)))

I'm not sure we need the different tests for pmd_mappable vs !pmd_mappable. I
think presence on the deferred list is a sufficient indicator that there are
unmapped subpages?

I'll incorporate this into my next version.

> +                                           &&
> split_folio_to_list(folio, folio_list))
>                                                 goto activate_locked;
>                                 }
>                                 if (!add_to_swap(folio)) {
> 
> It seems to work as expected. only one I/O is left for a large folio
> with 16 PTEs
> but 15 of them have been zapped before.
> 
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
> 
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT
  2024-01-18 11:10   ` [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT Barry Song
  2024-01-29  2:15     ` Chris Li
@ 2024-02-27 12:22     ` Ryan Roberts
  2024-02-27 22:39       ` Barry Song
  2024-02-27 14:40     ` Ryan Roberts
  2 siblings, 1 reply; 116+ messages in thread
From: Ryan Roberts @ 2024-02-27 12:22 UTC (permalink / raw)
  To: Barry Song, akpm, david, linux-mm
  Cc: linux-kernel, mhocko, shy828301, wangkefeng.wang, willy, xiang,
	ying.huang, yuzhao, surenb, steven.price, Chuanhua Han,
	Barry Song

Hi Barry,

I've scanned through this patch as part of trying to understand the races you
have reported (It's going to take me a while to fully understand it all :) ). In
the meantime I have a few comments on this patch...

On 18/01/2024 11:10, Barry Song wrote:
> From: Chuanhua Han <hanchuanhua@oppo.com>
> 
> MADV_PAGEOUT and MADV_FREE are common cases in Android. Ryan's patchset has
> supported swapping large folios out as a whole for vmscan case. This patch
> extends the feature to madvise.
> 
> If madvised range covers the whole large folio, we don't split it. Otherwise,
> we still need to split it.
> 
> This patch doesn't depend on ARM64's CONT-PTE, alternatively, it defines one
> helper named pte_range_cont_mapped() to check if all PTEs are contiguously
> mapped to a large folio.
> 
> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  include/asm-generic/tlb.h | 10 +++++++
>  include/linux/pgtable.h   | 60 +++++++++++++++++++++++++++++++++++++++
>  mm/madvise.c              | 48 +++++++++++++++++++++++++++++++
>  3 files changed, 118 insertions(+)
> 
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 129a3a759976..f894e22da5d6 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -608,6 +608,16 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
>  		__tlb_remove_tlb_entry(tlb, ptep, address);	\
>  	} while (0)
>  
> +#define tlb_remove_nr_tlb_entry(tlb, ptep, address, nr)			\
> +	do {                                                    	\
> +		int i;							\
> +		tlb_flush_pte_range(tlb, address,			\
> +				PAGE_SIZE * nr);			\
> +		for (i = 0; i < nr; i++)				\
> +			__tlb_remove_tlb_entry(tlb, ptep + i,		\
> +					address + i * PAGE_SIZE);	\
> +	} while (0)

David has recently added tlb_remove_tlb_entries() which does the same thing.

> +
>  #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)	\
>  	do {							\
>  		unsigned long _sz = huge_page_size(h);		\
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 37fe83b0c358..da0c1cf447e3 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -320,6 +320,42 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
>  }
>  #endif
>  
> +#ifndef pte_range_cont_mapped
> +static inline bool pte_range_cont_mapped(unsigned long start_pfn,
> +					 pte_t *start_pte,
> +					 unsigned long start_addr,
> +					 int nr)
> +{
> +	int i;
> +	pte_t pte_val;
> +
> +	for (i = 0; i < nr; i++) {
> +		pte_val = ptep_get(start_pte + i);
> +
> +		if (pte_none(pte_val))
> +			return false;
> +
> +		if (pte_pfn(pte_val) != (start_pfn + i))
> +			return false;
> +	}
> +
> +	return true;
> +}
> +#endif

David has recently added folio_pte_batch() which does a similar thing (as
discussed in other context).

> +
> +#ifndef pte_range_young
> +static inline bool pte_range_young(pte_t *start_pte, int nr)
> +{
> +	int i;
> +
> +	for (i = 0; i < nr; i++)
> +		if (pte_young(ptep_get(start_pte + i)))
> +			return true;
> +
> +	return false;
> +}
> +#endif

I wonder if this should come from folio_pte_batch()?

> +
>  #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>  static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>  					    unsigned long address,
> @@ -580,6 +616,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>  }
>  #endif
>  
> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_RANGE_FULL
> +static inline pte_t ptep_get_and_clear_range_full(struct mm_struct *mm,
> +						  unsigned long start_addr,
> +						  pte_t *start_pte,
> +						  int nr, int full)
> +{
> +	int i;
> +	pte_t pte;
> +
> +	pte = ptep_get_and_clear_full(mm, start_addr, start_pte, full);
> +
> +	for (i = 1; i < nr; i++)
> +		ptep_get_and_clear_full(mm, start_addr + i * PAGE_SIZE,
> +					start_pte + i, full);
> +
> +	return pte;
> +}

David has recently added get_and_clear_full_ptes(). Your version isn't gathering
access/dirty, which may be ok for your case, but not ok in general.

>  
>  /*
>   * If two threads concurrently fault at the same page, the thread that
> @@ -995,6 +1048,13 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>  })
>  #endif
>  
> +#ifndef pte_nr_addr_end
> +#define pte_nr_addr_end(addr, size, end)				\
> +({	unsigned long __boundary = ((addr) + size) & (~(size - 1));	\
> +	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
> +})
> +#endif
> +
>  /*
>   * When walking page tables, we usually want to skip any p?d_none entries;
>   * and any p?d_bad entries - reporting the error before resetting to none.
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 912155a94ed5..262460ac4b2e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -452,6 +452,54 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  		if (folio_test_large(folio)) {
>  			int err;
>  
> +			if (!folio_test_pmd_mappable(folio)) {
> +				int nr_pages = folio_nr_pages(folio);
> +				unsigned long folio_size = PAGE_SIZE * nr_pages;
> +				unsigned long start_addr = ALIGN_DOWN(addr, nr_pages * PAGE_SIZE);;

I doubt it is correct to align down here. Couldn't you be going outside the
bounds that the user supplied?

nit: you've defined folio_size, why not use it here?
nit: double semi-colon.

> +				unsigned long start_pfn = page_to_pfn(folio_page(folio, 0));
> +				pte_t *start_pte = pte - (addr - start_addr) / PAGE_SIZE;

I think start_pte could be off the start of the pgtable and into random memory
in some corner cases (and outside the protection of the PTL)? You're assuming
that the folio is fully and contigously mapped and correctly aligned. mremap
(and other things) could break that assumption.

> +				unsigned long next = pte_nr_addr_end(addr, folio_size, end);
> +
> +				if (!pte_range_cont_mapped(start_pfn, start_pte, start_addr, nr_pages))
> +					goto split;
> +
> +				if (next - addr != folio_size) {
> +					goto split;
> +				} else {
> +					/* Do not interfere with other mappings of this page */
> +					if (folio_estimated_sharers(folio) != 1)
> +						goto skip;
> +
> +					VM_BUG_ON(addr != start_addr || pte != start_pte);
> +
> +					if (pte_range_young(start_pte, nr_pages)) {
> +						ptent = ptep_get_and_clear_range_full(mm, start_addr, start_pte,
> +										      nr_pages, tlb->fullmm);
> +						ptent = pte_mkold(ptent);
> +
> +						set_ptes(mm, start_addr, start_pte, ptent, nr_pages);
> +						tlb_remove_nr_tlb_entry(tlb, start_pte, start_addr, nr_pages);
> +					}
> +
> +					folio_clear_referenced(folio);
> +					folio_test_clear_young(folio);
> +					if (pageout) {
> +						if (folio_isolate_lru(folio)) {
> +							if (folio_test_unevictable(folio))
> +								folio_putback_lru(folio);
> +							else
> +								list_add(&folio->lru, &folio_list);
> +						}
> +					} else
> +						folio_deactivate(folio);
> +				}
> +skip:
> +				pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
> +				addr = next - PAGE_SIZE;
> +				continue;
> +
> +			}
> +split:
>  			if (folio_estimated_sharers(folio) != 1)
>  				break;
>  			if (pageout_anon_only_filter && !folio_test_anon(folio))


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2024-02-05  9:51   ` Barry Song
  2024-02-05 12:14     ` Ryan Roberts
@ 2024-02-27 12:28     ` Ryan Roberts
  2024-02-27 13:37     ` Ryan Roberts
  2 siblings, 0 replies; 116+ messages in thread
From: Ryan Roberts @ 2024-02-27 12:28 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, david, linux-kernel, linux-mm, mhocko, shy828301,
	wangkefeng.wang, willy, xiang, ying.huang, yuzhao, chrisl,
	surenb, hanchuanhua

On 05/02/2024 09:51, Barry Song wrote:
> +Chris, Suren and Chuanhua
> 
> Hi Ryan,
> 
>> +	/*
>> +	 * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
>> +	 * so indicate that we are scanning to synchronise with swapoff.
>> +	 */
>> +	si->flags += SWP_SCANNING;
>> +	ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
>> +	si->flags -= SWP_SCANNING;
> 
> nobody is using this scan_base afterwards. it seems a bit weird to
> pass a pointer.
> 
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>  					if (!can_split_folio(folio, NULL))
>>  						goto activate_locked;
>>  					/*
>> -					 * Split folios without a PMD map right
>> -					 * away. Chances are some or all of the
>> -					 * tail pages can be freed without IO.
>> +					 * Split PMD-mappable folios without a
>> +					 * PMD map right away. Chances are some
>> +					 * or all of the tail pages can be freed
>> +					 * without IO.
>>  					 */
>> -					if (!folio_entire_mapcount(folio) &&
>> +					if (folio_test_pmd_mappable(folio) &&
>> +					    !folio_entire_mapcount(folio) &&
>>  					    split_folio_to_list(folio,
>>  								folio_list))
>>  						goto activate_locked;
>> --
> 
> Chuanhua and I ran this patchset for a couple of days and found a race
> between reclamation and split_folio. this might cause applications get
> wrong data 0 while swapping-in.

I can't claim to fully understand the problem yet (thanks for all the details -
I'll keep reading it and looking at the code until I do), but I guess this
problem should exist today for PMD-mappable folios? We already skip splitting
those folios if they are pmd-mapped. Or does the problem only apply to
pte-mapped folios?


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2024-02-05  9:51   ` Barry Song
  2024-02-05 12:14     ` Ryan Roberts
  2024-02-27 12:28     ` Ryan Roberts
@ 2024-02-27 13:37     ` Ryan Roberts
  2024-02-28  2:46       ` Barry Song
  2 siblings, 1 reply; 116+ messages in thread
From: Ryan Roberts @ 2024-02-27 13:37 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, david, linux-kernel, linux-mm, mhocko, shy828301,
	wangkefeng.wang, willy, xiang, ying.huang, yuzhao, chrisl,
	surenb, hanchuanhua

On 05/02/2024 09:51, Barry Song wrote:
> +Chris, Suren and Chuanhua
> 
> Hi Ryan,
> 
>> +	/*
>> +	 * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
>> +	 * so indicate that we are scanning to synchronise with swapoff.
>> +	 */
>> +	si->flags += SWP_SCANNING;
>> +	ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
>> +	si->flags -= SWP_SCANNING;
> 
> nobody is using this scan_base afterwards. it seems a bit weird to
> pass a pointer.
> 
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>  					if (!can_split_folio(folio, NULL))
>>  						goto activate_locked;
>>  					/*
>> -					 * Split folios without a PMD map right
>> -					 * away. Chances are some or all of the
>> -					 * tail pages can be freed without IO.
>> +					 * Split PMD-mappable folios without a
>> +					 * PMD map right away. Chances are some
>> +					 * or all of the tail pages can be freed
>> +					 * without IO.
>>  					 */
>> -					if (!folio_entire_mapcount(folio) &&
>> +					if (folio_test_pmd_mappable(folio) &&
>> +					    !folio_entire_mapcount(folio) &&
>>  					    split_folio_to_list(folio,
>>  								folio_list))
>>  						goto activate_locked;
>> --
> 
> Chuanhua and I ran this patchset for a couple of days and found a race
> between reclamation and split_folio. this might cause applications get
> wrong data 0 while swapping-in.
> 
> in case one thread(T1) is reclaiming a large folio by some means, still
> another thread is calling madvise MADV_PGOUT(T2). and at the same time,
> we have two threads T3 and T4 to swap-in in parallel. T1 doesn't split
> and T2 does split as below,

Hi Barry,

Do you have a test case you can share that provokes this problem? And is this a
separate problem to the race you solved with TTU_SYNC or is this solving the
same problem?

Thanks,
Ryan


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT
  2024-01-18 11:10   ` [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT Barry Song
  2024-01-29  2:15     ` Chris Li
  2024-02-27 12:22     ` Ryan Roberts
@ 2024-02-27 14:40     ` Ryan Roberts
  2024-02-27 18:57       ` Barry Song
  2 siblings, 1 reply; 116+ messages in thread
From: Ryan Roberts @ 2024-02-27 14:40 UTC (permalink / raw)
  To: Barry Song, akpm, david, linux-mm
  Cc: linux-kernel, mhocko, shy828301, wangkefeng.wang, willy, xiang,
	ying.huang, yuzhao, surenb, steven.price, Chuanhua Han,
	Barry Song

On 18/01/2024 11:10, Barry Song wrote:
> From: Chuanhua Han <hanchuanhua@oppo.com>
> 
> MADV_PAGEOUT and MADV_FREE are common cases in Android. Ryan's patchset has
> supported swapping large folios out as a whole for vmscan case. This patch
> extends the feature to madvise.
> 
> If madvised range covers the whole large folio, we don't split it. Otherwise,
> we still need to split it.
> 
> This patch doesn't depend on ARM64's CONT-PTE, alternatively, it defines one
> helper named pte_range_cont_mapped() to check if all PTEs are contiguously
> mapped to a large folio.

I'm going to rework this patch and integrate it into my series if that's ok with
you?


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-02-26 17:41       ` Ryan Roberts
@ 2024-02-27 17:10         ` Ryan Roberts
  2024-02-27 19:17           ` David Hildenbrand
  0 siblings, 1 reply; 116+ messages in thread
From: Ryan Roberts @ 2024-02-27 17:10 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: linux-kernel, linux-mm

Hi David,

On 26/02/2024 17:41, Ryan Roberts wrote:
> On 22/02/2024 10:20, David Hildenbrand wrote:
>> On 22.02.24 11:19, David Hildenbrand wrote:
>>> On 25.10.23 16:45, Ryan Roberts wrote:
>>>> As preparation for supporting small-sized THP in the swap-out path,
>>>> without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE,
>>>> which, when present, always implies PMD-sized THP, which is the same as
>>>> the cluster size.
>>>>
>>>> The only use of the flag was to determine whether a swap entry refers to
>>>> a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
>>>> Instead of relying on the flag, we now pass in nr_pages, which
>>>> originates from the folio's number of pages. This allows the logic to
>>>> work for folios of any order.
>>>>
>>>> The one snag is that one of the swap_page_trans_huge_swapped() call
>>>> sites does not have the folio. But it was only being called there to
>>>> avoid bothering to call __try_to_reclaim_swap() in some cases.
>>>> __try_to_reclaim_swap() gets the folio and (via some other functions)
>>>> calls swap_page_trans_huge_swapped(). So I've removed the problematic
>>>> call site and believe the new logic should be equivalent.
>>>
>>> That is the  __try_to_reclaim_swap() -> folio_free_swap() ->
>>> folio_swapped() -> swap_page_trans_huge_swapped() call chain I assume.
>>>
>>> The "difference" is that you will now (1) get another temporary
>>> reference on the folio and (2) (try)lock the folio every time you
>>> discard a single PTE of a (possibly) large THP.
>>>
>>
>> Thinking about it, your change will not only affect THP, but any call to
>> free_swap_and_cache().
>>
>> Likely that's not what we want. :/
>>
> 
> Is folio_trylock() really that expensive given the code path is already locking
> multiple spinlocks, and I don't think we would expect the folio lock to be very
> contended?
> 
> I guess filemap_get_folio() could be a bit more expensive, but again, is this
> really a deal-breaker?
> 
> 
> I'm just trying to refamiliarize myself with this series, but I think I ended up
> allocating a cluster per cpu per order. So one potential solution would be to
> turn the flag into a size and store it in the cluster info. (In fact I think I
> was doing that in an early version of this series - will have to look at why I
> got rid of that). Then we could avoid needing to figure out nr_pages from the folio.

I ran some microbenchmarks to see if these extra operations cause a performance
issue - it all looks OK to me.

I modified your "pte-mapped-folio-benchmarks" to add a "munmap-swapped-forked"
mode, which prepares the 1G memory mapping by first paging it out with
MADV_PAGEOUT, then it forks a child (and keeps that child alive) so that the
swap slots have 2 references, then it measures the duration of munmap() in the
parent on the entire range. The idea is that free_swap_and_cache() is called for
each PTE during munmap(). Prior to my change, swap_page_trans_huge_swapped()
will return true, due to the child's references, and __try_to_reclaim_swap() is
not called. After my change, we no longer have this short cut.

In both cases the results are within 1% (confirmed across multiple runs of 20
seconds each):

mm-stable: Average: 0.004997
 + change: Average: 0.005037

(these numbers are for Ampere Altra. I also tested on M2 VM - no regression
their either).

Do you still have a concern about this change?

An alternative is to store the folio size in the cluster, but that won't be
accurate if the folio is later split or if an entry within the cluster is later
stolen for an order-0 entry. I think it would still work though; it just means
that you might get a false positive in those circumstances, which means taking
the "slow" path. But this is a rare event.

Regardless, I prefer not to do this, since it adds complexity and doesn't
benefit performance.

Thanks,
Ryan


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT
  2024-02-27 14:40     ` Ryan Roberts
@ 2024-02-27 18:57       ` Barry Song
  2024-02-28  3:49         ` Barry Song
  0 siblings, 1 reply; 116+ messages in thread
From: Barry Song @ 2024-02-27 18:57 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: akpm, david, linux-mm, linux-kernel, mhocko, shy828301,
	wangkefeng.wang, willy, xiang, ying.huang, yuzhao, surenb,
	steven.price, Chuanhua Han, Barry Song

On Wed, Feb 28, 2024 at 3:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 18/01/2024 11:10, Barry Song wrote:
> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >
> > MADV_PAGEOUT and MADV_FREE are common cases in Android. Ryan's patchset has
> > supported swapping large folios out as a whole for vmscan case. This patch
> > extends the feature to madvise.
> >
> > If madvised range covers the whole large folio, we don't split it. Otherwise,
> > we still need to split it.
> >
> > This patch doesn't depend on ARM64's CONT-PTE, alternatively, it defines one
> > helper named pte_range_cont_mapped() to check if all PTEs are contiguously
> > mapped to a large folio.
>
> I'm going to rework this patch and integrate it into my series if that's ok with
> you?

This is perfect. Please integrate it into your swap-out series which is the
perfect place for this MADV_PAGEOUT.

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-02-27 17:10         ` Ryan Roberts
@ 2024-02-27 19:17           ` David Hildenbrand
  2024-02-28  9:37             ` Ryan Roberts
  0 siblings, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2024-02-27 19:17 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: linux-kernel, linux-mm

On 27.02.24 18:10, Ryan Roberts wrote:
> Hi David,
> 
> On 26/02/2024 17:41, Ryan Roberts wrote:
>> On 22/02/2024 10:20, David Hildenbrand wrote:
>>> On 22.02.24 11:19, David Hildenbrand wrote:
>>>> On 25.10.23 16:45, Ryan Roberts wrote:
>>>>> As preparation for supporting small-sized THP in the swap-out path,
>>>>> without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE,
>>>>> which, when present, always implies PMD-sized THP, which is the same as
>>>>> the cluster size.
>>>>>
>>>>> The only use of the flag was to determine whether a swap entry refers to
>>>>> a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
>>>>> Instead of relying on the flag, we now pass in nr_pages, which
>>>>> originates from the folio's number of pages. This allows the logic to
>>>>> work for folios of any order.
>>>>>
>>>>> The one snag is that one of the swap_page_trans_huge_swapped() call
>>>>> sites does not have the folio. But it was only being called there to
>>>>> avoid bothering to call __try_to_reclaim_swap() in some cases.
>>>>> __try_to_reclaim_swap() gets the folio and (via some other functions)
>>>>> calls swap_page_trans_huge_swapped(). So I've removed the problematic
>>>>> call site and believe the new logic should be equivalent.
>>>>
>>>> That is the  __try_to_reclaim_swap() -> folio_free_swap() ->
>>>> folio_swapped() -> swap_page_trans_huge_swapped() call chain I assume.
>>>>
>>>> The "difference" is that you will now (1) get another temporary
>>>> reference on the folio and (2) (try)lock the folio every time you
>>>> discard a single PTE of a (possibly) large THP.
>>>>
>>>
>>> Thinking about it, your change will not only affect THP, but any call to
>>> free_swap_and_cache().
>>>
>>> Likely that's not what we want. :/
>>>
>>
>> Is folio_trylock() really that expensive given the code path is already locking
>> multiple spinlocks, and I don't think we would expect the folio lock to be very
>> contended?
>>
>> I guess filemap_get_folio() could be a bit more expensive, but again, is this
>> really a deal-breaker?
>>
>>
>> I'm just trying to refamiliarize myself with this series, but I think I ended up
>> allocating a cluster per cpu per order. So one potential solution would be to
>> turn the flag into a size and store it in the cluster info. (In fact I think I
>> was doing that in an early version of this series - will have to look at why I
>> got rid of that). Then we could avoid needing to figure out nr_pages from the folio.
> 
> I ran some microbenchmarks to see if these extra operations cause a performance
> issue - it all looks OK to me.

Sorry, I'm drowning in reviews right now. I was hoping to get some of my own
stuff figured out today ... maybe tomorrow.

> 
> I modified your "pte-mapped-folio-benchmarks" to add a "munmap-swapped-forked"
> mode, which prepares the 1G memory mapping by first paging it out with
> MADV_PAGEOUT, then it forks a child (and keeps that child alive) so that the
> swap slots have 2 references, then it measures the duration of munmap() in the
> parent on the entire range. The idea is that free_swap_and_cache() is called for
> each PTE during munmap(). Prior to my change, swap_page_trans_huge_swapped()
> will return true, due to the child's references, and __try_to_reclaim_swap() is
> not called. After my change, we no longer have this short cut.
> 
> In both cases the results are within 1% (confirmed across multiple runs of 20
> seconds each):
> 
> mm-stable: Average: 0.004997
>   + change: Average: 0.005037
> 
> (these numbers are for Ampere Altra. I also tested on M2 VM - no regression
> their either).
> 
> Do you still have a concern about this change?

The main concern I had was not about overhead due to atomic operations in the
non-concurrent case that you are measuring.

We might now unnecessarily be incrementing the folio refcount and taking
the folio lock. That will affects large folios in the swapcache now IIUC.
Small folios should be unaffected.

The side effects of that can be:
* Code checking for additional folio reference could now detect some and
   back out. (the "mapcount + swapcache*folio_nr_pages != folio_refcount"
   stuff)
* Code that might really benefit from trylocking the folio might fail to
   do so.

For example, splitting a large folio might now fail more often simply
because some process zaps a swap entry and the additional reference+page
lock were optimized out previously.

How relevant is it? Relevant enough that someone decided to put that
optimization in? I don't know :)

Arguably, zapping a present PTE also leaves the refcount elevated for a while
until the mapcount was freed. But here, it could be avoided.

Digging a bit, it was introduced in:

commit e07098294adfd03d582af7626752255e3d170393
Author: Huang Ying <ying.huang@intel.com>
Date:   Wed Sep 6 16:22:16 2017 -0700

     mm, THP, swap: support to reclaim swap space for THP swapped out
     
     The normal swap slot reclaiming can be done when the swap count reaches
     SWAP_HAS_CACHE.  But for the swap slot which is backing a THP, all swap
     slots backing one THP must be reclaimed together, because the swap slot
     may be used again when the THP is swapped out again later.  So the swap
     slots backing one THP can be reclaimed together when the swap count for
     all swap slots for the THP reached SWAP_HAS_CACHE.  In the patch, the
     functions to check whether the swap count for all swap slots backing one
     THP reached SWAP_HAS_CACHE are implemented and used when checking
     whether a swap slot can be reclaimed.
     
     To make it easier to determine whether a swap slot is backing a THP, a
     new swap cluster flag named CLUSTER_FLAG_HUGE is added to mark a swap
     cluster which is backing a THP (Transparent Huge Page).  Because THP
     swap in as a whole isn't supported now.  After deleting the THP from the
     swap cache (for example, swapping out finished), the CLUSTER_FLAG_HUGE
     flag will be cleared.  So that, the normal pages inside THP can be
     swapped in individually.


With your change, if we have a swapped out THP with 512 entries and exit(), we
would now 512 times in a row grab a folio reference and trylock the folio. In the
past, we would have done that at most once.

That doesn't feel quite right TBH ... so I'm wondering if there are any low-hanging
fruits to avoid that.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT
  2024-02-27 12:22     ` Ryan Roberts
@ 2024-02-27 22:39       ` Barry Song
  0 siblings, 0 replies; 116+ messages in thread
From: Barry Song @ 2024-02-27 22:39 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: akpm, david, linux-mm, linux-kernel, mhocko, shy828301,
	wangkefeng.wang, willy, xiang, ying.huang, yuzhao, surenb,
	steven.price, Chuanhua Han, Barry Song

On Wed, Feb 28, 2024 at 1:22 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi Barry,
>
> I've scanned through this patch as part of trying to understand the races you
> have reported (It's going to take me a while to fully understand it all :) ). In
> the meantime I have a few comments on this patch...
>
> On 18/01/2024 11:10, Barry Song wrote:
> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >
> > MADV_PAGEOUT and MADV_FREE are common cases in Android. Ryan's patchset has
> > supported swapping large folios out as a whole for vmscan case. This patch
> > extends the feature to madvise.
> >
> > If madvised range covers the whole large folio, we don't split it. Otherwise,
> > we still need to split it.
> >
> > This patch doesn't depend on ARM64's CONT-PTE, alternatively, it defines one
> > helper named pte_range_cont_mapped() to check if all PTEs are contiguously
> > mapped to a large folio.
> >
> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  include/asm-generic/tlb.h | 10 +++++++
> >  include/linux/pgtable.h   | 60 +++++++++++++++++++++++++++++++++++++++
> >  mm/madvise.c              | 48 +++++++++++++++++++++++++++++++
> >  3 files changed, 118 insertions(+)
> >
> > diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> > index 129a3a759976..f894e22da5d6 100644
> > --- a/include/asm-generic/tlb.h
> > +++ b/include/asm-generic/tlb.h
> > @@ -608,6 +608,16 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
> >               __tlb_remove_tlb_entry(tlb, ptep, address);     \
> >       } while (0)
> >
> > +#define tlb_remove_nr_tlb_entry(tlb, ptep, address, nr)                      \
> > +     do {                                                            \
> > +             int i;                                                  \
> > +             tlb_flush_pte_range(tlb, address,                       \
> > +                             PAGE_SIZE * nr);                        \
> > +             for (i = 0; i < nr; i++)                                \
> > +                     __tlb_remove_tlb_entry(tlb, ptep + i,           \
> > +                                     address + i * PAGE_SIZE);       \
> > +     } while (0)
>
> David has recently added tlb_remove_tlb_entries() which does the same thing.

cool. While sending the patchset, we were not depending on other work.
Nice to know David's work can help this case.

>
> > +
> >  #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)     \
> >       do {                                                    \
> >               unsigned long _sz = huge_page_size(h);          \
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index 37fe83b0c358..da0c1cf447e3 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -320,6 +320,42 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
> >  }
> >  #endif
> >
> > +#ifndef pte_range_cont_mapped
> > +static inline bool pte_range_cont_mapped(unsigned long start_pfn,
> > +                                      pte_t *start_pte,
> > +                                      unsigned long start_addr,
> > +                                      int nr)
> > +{
> > +     int i;
> > +     pte_t pte_val;
> > +
> > +     for (i = 0; i < nr; i++) {
> > +             pte_val = ptep_get(start_pte + i);
> > +
> > +             if (pte_none(pte_val))
> > +                     return false;
> > +
> > +             if (pte_pfn(pte_val) != (start_pfn + i))
> > +                     return false;
> > +     }
> > +
> > +     return true;
> > +}
> > +#endif
>
> David has recently added folio_pte_batch() which does a similar thing (as
> discussed in other context).

yes.

>
> > +
> > +#ifndef pte_range_young
> > +static inline bool pte_range_young(pte_t *start_pte, int nr)
> > +{
> > +     int i;
> > +
> > +     for (i = 0; i < nr; i++)
> > +             if (pte_young(ptep_get(start_pte + i)))
> > +                     return true;
> > +
> > +     return false;
> > +}
> > +#endif
>
> I wonder if this should come from folio_pte_batch()?

not quite sure folio_pte_batch can return young. but i guess
you already have a batched function to check if a large folio
is young?

>
> > +
> >  #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> >  static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> >                                           unsigned long address,
> > @@ -580,6 +616,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> >  }
> >  #endif
> >
> > +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_RANGE_FULL
> > +static inline pte_t ptep_get_and_clear_range_full(struct mm_struct *mm,
> > +                                               unsigned long start_addr,
> > +                                               pte_t *start_pte,
> > +                                               int nr, int full)
> > +{
> > +     int i;
> > +     pte_t pte;
> > +
> > +     pte = ptep_get_and_clear_full(mm, start_addr, start_pte, full);
> > +
> > +     for (i = 1; i < nr; i++)
> > +             ptep_get_and_clear_full(mm, start_addr + i * PAGE_SIZE,
> > +                                     start_pte + i, full);
> > +
> > +     return pte;
> > +}
>
> David has recently added get_and_clear_full_ptes(). Your version isn't gathering
> access/dirty, which may be ok for your case, but not ok in general.

ok. glad to know we can use get_and_clear_full_ptes().

>
> >
> >  /*
> >   * If two threads concurrently fault at the same page, the thread that
> > @@ -995,6 +1048,13 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> >  })
> >  #endif
> >
> > +#ifndef pte_nr_addr_end
> > +#define pte_nr_addr_end(addr, size, end)                             \
> > +({   unsigned long __boundary = ((addr) + size) & (~(size - 1));     \
> > +     (__boundary - 1 < (end) - 1)? __boundary: (end);                \
> > +})
> > +#endif
> > +
> >  /*
> >   * When walking page tables, we usually want to skip any p?d_none entries;
> >   * and any p?d_bad entries - reporting the error before resetting to none.
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 912155a94ed5..262460ac4b2e 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -452,6 +452,54 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >               if (folio_test_large(folio)) {
> >                       int err;
> >
> > +                     if (!folio_test_pmd_mappable(folio)) {
> > +                             int nr_pages = folio_nr_pages(folio);
> > +                             unsigned long folio_size = PAGE_SIZE * nr_pages;
> > +                             unsigned long start_addr = ALIGN_DOWN(addr, nr_pages * PAGE_SIZE);;
>
> I doubt it is correct to align down here. Couldn't you be going outside the
> bounds that the user supplied?

Yes, it can. This is ugly and suspicious but does not cause problems
if the large folio's virtadd is aligned , but it is wrong if virtual address is
not aligned as explained below.

>
> nit: you've defined folio_size, why not use it here?
> nit: double semi-colon.
>
> > +                             unsigned long start_pfn = page_to_pfn(folio_page(folio, 0));
> > +                             pte_t *start_pte = pte - (addr - start_addr) / PAGE_SIZE;
>
> I think start_pte could be off the start of the pgtable and into random memory

> in some corner cases (and outside the protection of the PTL)? You're assuming
> that the folio is fully and contigously mapped and correctly aligned. mremap
> (and other things) could break that assumption.

actually we don't run under the assumption folio is fully and
contiguously mapped.
but the code does assume a large folio's virtual address is aligned with
nr_pages * PAGE_SIZE.

OTOH,  we have  if (next - addr != folio_size) to split folios if
users just want to partially
reclaim a large folio, but I do agree we should move if (next - addr
!= folio_size)
before pte_range_cont_mapped().

as long as the virt addr is aligned, pte_range_cont_mapped() won't
cause a problem
for the code even before if (next - addr != folio_size) (but ugly and
suspicious) as it is
still under the protection of PTL since we don't cross a PMD for a
pte-mapped large
folio.

but you are really right, we have cases like mremap which can remap an aligned
large folio to an unaligned address. I actually placed a trace point
in kernel, running
lots of phones, didn't find this case was happening. so i feel mremap
is really rare.
Is it possible to split large folios and avoid complexity  instead if
we are remapping
to an unaligned address?

And, the code is really completely wrong if the large folio is
unaligned. we have to
remove the assumption if that is really happening. So shouldn't do ALIGN_DOWN.

>
> > +                             unsigned long next = pte_nr_addr_end(addr, folio_size, end);
> > +
> > +                             if (!pte_range_cont_mapped(start_pfn, start_pte, start_addr, nr_pages))
> > +                                     goto split;
> > +
> > +                             if (next - addr != folio_size) {
> > +                                     goto split;
> > +                             } else {
> > +                                     /* Do not interfere with other mappings of this page */
> > +                                     if (folio_estimated_sharers(folio) != 1)
> > +                                             goto skip;
> > +
> > +                                     VM_BUG_ON(addr != start_addr || pte != start_pte);
> > +
> > +                                     if (pte_range_young(start_pte, nr_pages)) {
> > +                                             ptent = ptep_get_and_clear_range_full(mm, start_addr, start_pte,
> > +                                                                                   nr_pages, tlb->fullmm);
> > +                                             ptent = pte_mkold(ptent);
> > +
> > +                                             set_ptes(mm, start_addr, start_pte, ptent, nr_pages);
> > +                                             tlb_remove_nr_tlb_entry(tlb, start_pte, start_addr, nr_pages);
> > +                                     }
> > +
> > +                                     folio_clear_referenced(folio);
> > +                                     folio_test_clear_young(folio);
> > +                                     if (pageout) {
> > +                                             if (folio_isolate_lru(folio)) {
> > +                                                     if (folio_test_unevictable(folio))
> > +                                                             folio_putback_lru(folio);
> > +                                                     else
> > +                                                             list_add(&folio->lru, &folio_list);
> > +                                             }
> > +                                     } else
> > +                                             folio_deactivate(folio);
> > +                             }
> > +skip:
> > +                             pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
> > +                             addr = next - PAGE_SIZE;
> > +                             continue;
> > +
> > +                     }
> > +split:
> >                       if (folio_estimated_sharers(folio) != 1)
> >                               break;
> >                       if (pageout_anon_only_filter && !folio_test_anon(folio))

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2024-02-27 12:05         ` Ryan Roberts
@ 2024-02-28  1:23           ` Barry Song
  2024-02-28  9:34             ` David Hildenbrand
  2024-02-28 15:57             ` Ryan Roberts
  0 siblings, 2 replies; 116+ messages in thread
From: Barry Song @ 2024-02-28  1:23 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: David Hildenbrand, akpm, linux-kernel, linux-mm, mhocko,
	shy828301, wangkefeng.wang, willy, xiang, ying.huang, yuzhao,
	chrisl, surenb, hanchuanhua

On Wed, Feb 28, 2024 at 1:06 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 23/02/2024 09:46, Barry Song wrote:
> > On Thu, Feb 22, 2024 at 11:09 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 22.02.24 08:05, Barry Song wrote:
> >>> Hi Ryan,
> >>>
> >>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>>> index 2cc0cb41fb32..ea19710aa4cd 100644
> >>>> --- a/mm/vmscan.c
> >>>> +++ b/mm/vmscan.c
> >>>> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>>>                                      if (!can_split_folio(folio, NULL))
> >>>>                                              goto activate_locked;
> >>>>                                      /*
> >>>> -                                     * Split folios without a PMD map right
> >>>> -                                     * away. Chances are some or all of the
> >>>> -                                     * tail pages can be freed without IO.
> >>>> +                                     * Split PMD-mappable folios without a
> >>>> +                                     * PMD map right away. Chances are some
> >>>> +                                     * or all of the tail pages can be freed
> >>>> +                                     * without IO.
> >>>>                                       */
> >>>> -                                    if (!folio_entire_mapcount(folio) &&
> >>>> +                                    if (folio_test_pmd_mappable(folio) &&
> >>>> +                                        !folio_entire_mapcount(folio) &&
> >>>>                                          split_folio_to_list(folio,
> >>>>                                                              folio_list))
> >>>>                                              goto activate_locked;
> >>>
> >>> I ran a test to investigate what would happen while reclaiming a partially
> >>> unmapped large folio. for example, for 64KiB large folios, MADV_DONTNEED
> >>> 4KB~64KB, and keep the first subpage 0~4KiB.
> >>
> >> IOW, something that already happens with ordinary THP already IIRC.
> >>
> >>>
> >>> My test wants to address my three concerns,
> >>> a. whether we will have leak on swap slots
> >>> b. whether we will have redundant I/O
> >>> c. whether we will cause races on swapcache
> >>>
> >>> what i have done is printing folio->_nr_pages_mapped and dumping 16 swap_map[]
> >>> at some specific stage
> >>> 1. just after add_to_swap   (swap slots are allocated)
> >>> 2. before and after try_to_unmap   (ptes are set to swap_entry)
> >>> 3. before and after pageout (also add printk in zram driver to dump all I/O write)
> >>> 4. before and after remove_mapping
> >>>
> >>> The below is the dumped info for a particular large folio,
> >>>
> >>> 1. after add_to_swap
> >>> [   27.267357] vmscan: After add_to_swap shrink_folio_list 1947 mapnr:1
> >>> [   27.267650] vmscan: offset:101b0 swp_map 40-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> >>>
> >>> as you can see,
> >>> _nr_pages_mapped is 1 and all 16 swap_map are SWAP_HAS_CACHE (0x40)
> >>>
> >>>
> >>> 2. before and after try_to_unmap
> >>> [   27.268067] vmscan: before try to unmap shrink_folio_list 1991 mapnr:1
> >>> [   27.268372] try_to_unmap_one address:ffff731f0000 pte:e8000103cd0b43 pte_p:ffff0000c36a8f80
> >>> [   27.268854] vmscan: after try to unmap shrink_folio_list 1997 mapnr:0
> >>> [   27.269180] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> >>>
> >>> as you can see, one pte is set to swp_entry, and _nr_pages_mapped becomes
> >>> 0 from 1. The 1st swp_map becomes 0x41, SWAP_HAS_CACHE + 1
> >>>
> >>> 3. before and after pageout
> >>> [   27.269602] vmscan: before pageout shrink_folio_list 2065 mapnr:0
> >>> [   27.269880] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> >>> [   27.270691] zram: zram_write_page page:fffffc00030f3400 index:101b0
> >>> [   27.271061] zram: zram_write_page page:fffffc00030f3440 index:101b1
> >>> [   27.271416] zram: zram_write_page page:fffffc00030f3480 index:101b2
> >>> [   27.271751] zram: zram_write_page page:fffffc00030f34c0 index:101b3
> >>> [   27.272046] zram: zram_write_page page:fffffc00030f3500 index:101b4
> >>> [   27.272384] zram: zram_write_page page:fffffc00030f3540 index:101b5
> >>> [   27.272746] zram: zram_write_page page:fffffc00030f3580 index:101b6
> >>> [   27.273042] zram: zram_write_page page:fffffc00030f35c0 index:101b7
> >>> [   27.273339] zram: zram_write_page page:fffffc00030f3600 index:101b8
> >>> [   27.273676] zram: zram_write_page page:fffffc00030f3640 index:101b9
> >>> [   27.274044] zram: zram_write_page page:fffffc00030f3680 index:101ba
> >>> [   27.274554] zram: zram_write_page page:fffffc00030f36c0 index:101bb
> >>> [   27.274870] zram: zram_write_page page:fffffc00030f3700 index:101bc
> >>> [   27.275166] zram: zram_write_page page:fffffc00030f3740 index:101bd
> >>> [   27.275463] zram: zram_write_page page:fffffc00030f3780 index:101be
> >>> [   27.275760] zram: zram_write_page page:fffffc00030f37c0 index:101bf
> >>> [   27.276102] vmscan: after pageout and before needs_release shrink_folio_list 2124 mapnr:0
> >>>
> >>> as you can see, obviously, we have done redundant I/O - 16 zram_write_page though
> >>> 4~64KiB has been zap_pte_range before, we still write them to zRAM.
> >>>
> >>> 4. before and after remove_mapping
> >>> [   27.276428] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
> >>> [   27.277485] vmscan: after remove_mapping shrink_folio_list 2169 mapnr:0 offset:0
> >>> [   27.277802] vmscan: offset:101b0 01-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
> >>>
> >>> as you can see, swp_map 1-15 becomes 0 and only the first swp_map is 1.
> >>> all SWAP_HAS_CACHE has been removed. This is perfect and there is no swap
> >>> slot leak at all!
> >>>
> >>> Thus, only two concerns are left for me,
> >>> 1. as we don't split anyway, we have done 15 unnecessary I/O if a large folio
> >>> is partially unmapped.
>
> So the cost of this is increased IO and swap storage, correct? Is this a big
> problem in practice? i.e. do you see a lot of partially mapped large folios in
> your workload? (I agree the proposed fix below is simple, so I think we should
> do it anyway - I'm just interested in the scale of the problem).
>
> >>> 2. large folio is added as a whole as a swapcache covering the range whose
> >>> part has been zapped. I am not quite sure if this will cause some problems
> >>> while some concurrent do_anon_page, swapin and swapout occurs between 3 and
> >>> 4 on zapped subpage1~subpage15. still struggling.. my brain is exploding...
>
> Yes mine too. I would only expect the ptes that map the folio will get replaced
> with swap entries? So I would expect it to be safe. Although I understand the
> concern with the extra swap consumption.

yes. it should still be safe. just more I/O and more swap spaces. but they will
be removed while remove_mapping happens if try_to_unmap_one makes
the folio unmapped.

but with the potential possibility even mapped PTEs can be skipped by
try_to_unmap_one (reported intermediate PTEs issue - PTL is held till
a valid PTE, some PTEs might be skipped by try_to_unmap without being
set to swap entries), we could have the possibility folio_mapped() is still true
after try_to_unmap_one. so we can't get to __remove_mapping() for a long
time. but it still doesn't cause a crash.

>
> [...]
> >>>
> >>> To me, it seems safer to split or do some other similar optimization if we find a
> >>> large folio has partial map and unmap.
> >>
> >> I'm hoping that we can avoid any new direct users of _nr_pages_mapped if
> >> possible.
> >>
> >
> > Is _nr_pages_mapped < nr_pages a reasonable case to split as we
> > have known the folio has at least some subpages zapped?
>
> I'm not sure we need this - the folio's presence on the split list will tell us
> everything we need to know I think?

I agree, this is just one question to David, not my proposal.  if
deferred_list is sufficient,
I prefer we use deferred_list.

I actually don't quite understand why David dislikes _nr_pages_mapped being used
though I do think _nr_pages_mapped cannot precisely reflect how a
folio is mapped
by multi-processes. but _nr_pages_mapped < nr_pages seems be safe to
tell the folio
is partially unmapped :-)

>
> >
> >> If we find that the folio is on the deferred split list, we might as
> >> well just split it right away, before swapping it out. That might be a
> >> reasonable optimization for the case you describe.
>
> Yes, agreed. I think there is still chance of a race though; Some other thread
> could be munmapping in parallel. But in that case, I think we just end up with
> the increased IO and swap storage? That's not the end of the world if its a
> corner case.

I agree. btw, do we need a spinlock ds_queue->split_queue_lock for checking
the list? deferred_split_folio(), for itself, has no spinlock while checking
 if (!list_empty(&folio->_deferred_list)), but why? the read and write
need to be exclusive.....

void deferred_split_folio(struct folio *folio)
{
        ...

        if (!list_empty(&folio->_deferred_list))
                return;

        spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
        if (list_empty(&folio->_deferred_list)) {
                count_vm_event(THP_DEFERRED_SPLIT_PAGE);
                list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
                ds_queue->split_queue_len++;
#ifdef CONFIG_MEMCG
                if (memcg)
                        set_shrinker_bit(memcg, folio_nid(folio),
                                         deferred_split_shrinker->id);
#endif
        }
        spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
}

>
> >
> > i tried to change Ryan's code as below
> >
> > @@ -1905,11 +1922,12 @@ static unsigned int shrink_folio_list(struct
> > list_head *folio_list,
> >                                          * PMD map right away. Chances are some
> >                                          * or all of the tail pages can be freed
> >                                          * without IO.
> > +                                        * Similarly, split PTE-mapped folios if
> > +                                        * they have been already
> > deferred_split.
> >                                          */
> > -                                       if (folio_test_pmd_mappable(folio) &&
> > -                                           !folio_entire_mapcount(folio) &&
> > -                                           split_folio_to_list(folio,
> > -                                                               folio_list))
> > +                                       if
> > (((folio_test_pmd_mappable(folio) && !folio_entire_mapcount(folio)) ||
> > +
> > (!folio_test_pmd_mappable(folio) &&
> > !list_empty(&folio->_deferred_list)))
>
> I'm not sure we need the different tests for pmd_mappable vs !pmd_mappable. I
> think presence on the deferred list is a sufficient indicator that there are
> unmapped subpages?

I don't think there are fundamental differences for pmd and pte. i was
testing pte-mapped folio at that time, so kept the behavior of pmd as is.

>
> I'll incorporate this into my next version.

Great!

>
> > +                                           &&
> > split_folio_to_list(folio, folio_list))
> >                                                 goto activate_locked;
> >                                 }
> >                                 if (!add_to_swap(folio)) {
> >
> > It seems to work as expected. only one I/O is left for a large folio
> > with 16 PTEs
> > but 15 of them have been zapped before.
> >
> >>
> >> --
> >> Cheers,
> >>
> >> David / dhildenb
> >>
> >

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2024-02-27 13:37     ` Ryan Roberts
@ 2024-02-28  2:46       ` Barry Song
  0 siblings, 0 replies; 116+ messages in thread
From: Barry Song @ 2024-02-28  2:46 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: akpm, david, linux-kernel, linux-mm, mhocko, shy828301,
	wangkefeng.wang, willy, xiang, ying.huang, yuzhao, chrisl,
	surenb, hanchuanhua

On Wed, Feb 28, 2024 at 2:37 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 05/02/2024 09:51, Barry Song wrote:
> > +Chris, Suren and Chuanhua
> >
> > Hi Ryan,
> >
> >> +    /*
> >> +     * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
> >> +     * so indicate that we are scanning to synchronise with swapoff.
> >> +     */
> >> +    si->flags += SWP_SCANNING;
> >> +    ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
> >> +    si->flags -= SWP_SCANNING;
> >
> > nobody is using this scan_base afterwards. it seems a bit weird to
> > pass a pointer.
> >
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>                                      if (!can_split_folio(folio, NULL))
> >>                                              goto activate_locked;
> >>                                      /*
> >> -                                     * Split folios without a PMD map right
> >> -                                     * away. Chances are some or all of the
> >> -                                     * tail pages can be freed without IO.
> >> +                                     * Split PMD-mappable folios without a
> >> +                                     * PMD map right away. Chances are some
> >> +                                     * or all of the tail pages can be freed
> >> +                                     * without IO.
> >>                                       */
> >> -                                    if (!folio_entire_mapcount(folio) &&
> >> +                                    if (folio_test_pmd_mappable(folio) &&
> >> +                                        !folio_entire_mapcount(folio) &&
> >>                                          split_folio_to_list(folio,
> >>                                                              folio_list))
> >>                                              goto activate_locked;
> >> --
> >
> > Chuanhua and I ran this patchset for a couple of days and found a race
> > between reclamation and split_folio. this might cause applications get
> > wrong data 0 while swapping-in.
> >
> > in case one thread(T1) is reclaiming a large folio by some means, still
> > another thread is calling madvise MADV_PGOUT(T2). and at the same time,
> > we have two threads T3 and T4 to swap-in in parallel. T1 doesn't split
> > and T2 does split as below,
>

Hi Ryan,

> Hi Barry,
>
> Do you have a test case you can share that provokes this problem? And is this a
> separate problem to the race you solved with TTU_SYNC or is this solving the
> same problem?

They are the same.

After sending you the report about the races, I spent some time and
finally figured
out what was happening, why corrupted data came while swapping in. it
is absolutely
not your fault, but TTU_SYNC does somehow resolve my problem though it is not
the root cause. this corrupted data only can reproduce after applying
patch 4[1] of
swap-in series,
[1]  [PATCH RFC 4/6] mm: support large folios swapin as a whole
https://lore.kernel.org/linux-mm/20240118111036.72641-5-21cnbao@gmail.com/

In case we have a large folio with 16 PTEs as below, and after
add_to_swap(), they get
swapoffset 0x10000,  their PTEs are all present as they are still mapped.
PTE          pte_stat
PTE0        present
PTE1        present
PTE2        present
PTE3        present
...
PTE15       present

then we get to try_to_unmap_one, as try_to_unmap_one doesn't hold PTL
from PTE0, while it scans PTEs, we might have
PTE          pte_stat
PTE0        none (someone is writing PTE0 for various reasons)
PTE1        present
PTE2        present
PTE3        present
...
PTE15       present

We hold PTL from PTE1.

after try_to_unmap_one, PTEs become

PTE          pte_stat
PTE0        present (someone finished the write of PTE0)
PTE1        swap 0x10001
PTE2        swap 0x10002
PTE3        swap 0x10003
...
...
PTE15      swap 0x1000F

Thus, after try_to_unmap_one, the large folio is still mapped. so its swapcache
will still be there.

Now a parallel thread runs MADV_PAGEOUT, and it finds this large folio
is not completely mapped, so it splits the folio into 16 small folios but their
swap offsets are kept.

Now in swapcache, we have 16 folios with contiguous swap offsets.
MADV_PAGEOUT will reclaim these 16 folios, after new 16 try_to_unmap_one,

PTE          pte_stat
PTE0        swap 0x10000  SWAP_HAS_CACHE
PTE1        swap 0x10001  SWAP_HAS_CACHE
PTE2        swap 0x10002  SWAP_HAS_CACHE
PTE3        swap 0x10003  SWAP_HAS_CACHE
...
PTE15        swap 0x1000F  SWAP_HAS_CACHE

From this time, we can have various different cases for these 16 PTEs.
for example,

PTE          pte_stat
PTE0        swap 0x10000  SWAP_HAS_CACHE = 0 -> become false due to
finished pageout and remove_mapping
PTE1        swap 0x10001  SWAP_HAS_CACHE = 0 -> become false due to
finished pageout and remove_mapping
PTE2        swap 0x10002  SWAP_HAS_CACHE = 0 -> become false due to
concurrent swapin and swapout
PTE3        swap 0x10003  SWAP_HAS_CACHE = 1
...
PTE13        swap 0x1000D  SWAP_HAS_CACHE = 1
PTE14        swap 0x1000E  SWAP_HAS_CACHE = 1
PTE15        swap 0x1000F  SWAP_HAS_CACHE = 1

but all of them have swp_count = 1 and different SWAP_HAS_CACHE. some of these
small folios might be in swapcache, some others might not be in.

then we do_swap_page at one PTE whose SWAP_HAS_CACHE=0 and
swap_count=1 (the folio is not in swapcache, thus has been written to swap),
we do this check:

static bool pte_range_swap(pte_t *pte, int nr_pages)
{
        int i;
        swp_entry_t entry;
        unsigned type;
        pgoff_t start_offset;

        entry = pte_to_swp_entry(ptep_get_lockless(pte));
        if (non_swap_entry(entry))
                return false;
        start_offset = swp_offset(entry);
        if (start_offset % nr_pages)
                return false;

        type = swp_type(entry);
        for (i = 1; i < nr_pages; i++) {
                entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
                if (non_swap_entry(entry))
                        return false;
                if (swp_offset(entry) != start_offset + i)
                        return false;
                if (swp_type(entry) != type)
                        return false;
        }

        return true;
}

as those swp entries are contiguous, we will call swap_read_folio().
For those folios which are still in swapcache and haven't been written,
we get zero-filled data from zRAM.

So the root cause is that pte_range_swap should check
all 16 swap_map have the same SWAP_HAS_CACHE as
false.

static bool is_pte_range_contig_swap(pte_t *pte, int nr_pages)
{
       ...
       count = si->swap_map[start_offset];
       for (i = 1; i < nr_pages; i++) {
               entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
               if (non_swap_entry(entry))
                       return false;
               if (swp_offset(entry) != start_offset + i)
                       return false;
               if (swp_type(entry) != type)
                       return false;
               /* fallback to small folios if SWAP_HAS_CACHE isn't same */
               if (si->swap_map[start_offset + i] != count)
                       return false;
       }

       return true;
}

but somehow TTU_SYNC "resolves" it by giving no chance to
MADV_PAGEOUT to split this folio as the large folio are either
entirely written by swap entries, or entirely keep present PTEs.

Though the bug is within the swap-in series, I am still a big fan of
TTU_SYNC for large folio reclamation for at least three reasons,

1. We remove some possibility that large folios fail to be reclaimed, improving
reclamation efficiency.

2. We avoid many strange cases and potential folio_split during reclamation.
without TTU_SYNC, folios can be splitted later, or partially being set to swap
entries while partially being still present

3. we don't increase PTL contention. My test shows try_to_unmap_one
will always get PTL after it sometimes skips one or two PTEs because
intermediate break-before-makes are short. Of course, most time try_to_unmap_one
will get PTL from PTE0.

>
> Thanks,
> Ryan
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT
  2024-02-27 18:57       ` Barry Song
@ 2024-02-28  3:49         ` Barry Song
  0 siblings, 0 replies; 116+ messages in thread
From: Barry Song @ 2024-02-28  3:49 UTC (permalink / raw)
  To: 21cnbao, ryan.roberts
  Cc: akpm, david, hanchuanhua, linux-kernel, linux-mm, mhocko,
	shy828301, steven.price, surenb, v-songbaohua, wangkefeng.wang,
	willy, xiang, ying.huang, yuzhao, Chris Li, Minchan Kim,
	SeongJae Park, Johannes Weiner

>> I'm going to rework this patch and integrate it into my series if that's ok with
>> you?
> 
> This is perfect. Please integrate it into your swap-out series which is the
> perfect place for this MADV_PAGEOUT.

BTW, Ryan, while you integrate this into your swap-put series, can you also
add the below one which is addressing one comment of Chris,

From: Barry Song <v-songbaohua@oppo.com>
Date: Tue, 27 Feb 2024 22:03:59 +1300
Subject: [PATCH] mm: madvise: extract common function
 folio_deactivate_or_add_to_reclaim_list

For madvise_cold_or_pageout_pte_range, both pmd-mapped and pte-mapped
normal folios are duplicating the same code right now, and we might
have more such as pte-mapped large folios to use it. It is better
to extract a common function.

Cc: Chris Li <chrisl@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/madvise.c | 52 ++++++++++++++++++++--------------------------------
 1 file changed, 20 insertions(+), 32 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 44a498c94158..1812457144ea 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -321,6 +321,24 @@ static inline bool can_do_file_pageout(struct vm_area_struct *vma)
 	       file_permission(vma->vm_file, MAY_WRITE) == 0;
 }
 
+static inline void folio_deactivate_or_add_to_reclaim_list(struct folio *folio, bool pageout,
+				struct list_head *folio_list)
+{
+	folio_clear_referenced(folio);
+	folio_test_clear_young(folio);
+
+	if (folio_test_active(folio))
+		folio_set_workingset(folio);
+	if (!pageout)
+		return folio_deactivate(folio);
+	if (folio_isolate_lru(folio)) {
+		if (folio_test_unevictable(folio))
+			folio_putback_lru(folio);
+		else
+			list_add(&folio->lru, folio_list);
+	}
+}
+
 static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 				unsigned long addr, unsigned long end,
 				struct mm_walk *walk)
@@ -394,19 +412,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
 		}
 
-		folio_clear_referenced(folio);
-		folio_test_clear_young(folio);
-		if (folio_test_active(folio))
-			folio_set_workingset(folio);
-		if (pageout) {
-			if (folio_isolate_lru(folio)) {
-				if (folio_test_unevictable(folio))
-					folio_putback_lru(folio);
-				else
-					list_add(&folio->lru, &folio_list);
-			}
-		} else
-			folio_deactivate(folio);
+		folio_deactivate_or_add_to_reclaim_list(folio, pageout, &folio_list);
 huge_unlock:
 		spin_unlock(ptl);
 		if (pageout)
@@ -498,25 +504,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			tlb_remove_tlb_entry(tlb, pte, addr);
 		}
 
-		/*
-		 * We are deactivating a folio for accelerating reclaiming.
-		 * VM couldn't reclaim the folio unless we clear PG_young.
-		 * As a side effect, it makes confuse idle-page tracking
-		 * because they will miss recent referenced history.
-		 */
-		folio_clear_referenced(folio);
-		folio_test_clear_young(folio);
-		if (folio_test_active(folio))
-			folio_set_workingset(folio);
-		if (pageout) {
-			if (folio_isolate_lru(folio)) {
-				if (folio_test_unevictable(folio))
-					folio_putback_lru(folio);
-				else
-					list_add(&folio->lru, &folio_list);
-			}
-		} else
-			folio_deactivate(folio);
+		folio_deactivate_or_add_to_reclaim_list(folio, pageout, &folio_list);
 	}
 
 	if (start_pte) {
-- 
2.34.1

Thanks
Barry

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2024-02-28  1:23           ` Barry Song
@ 2024-02-28  9:34             ` David Hildenbrand
  2024-02-28 23:18               ` Barry Song
  2024-02-28 15:57             ` Ryan Roberts
  1 sibling, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2024-02-28  9:34 UTC (permalink / raw)
  To: Barry Song, Ryan Roberts
  Cc: akpm, linux-kernel, linux-mm, mhocko, shy828301, wangkefeng.wang,
	willy, xiang, ying.huang, yuzhao, chrisl, surenb, hanchuanhua

>>>>>
>>>>> To me, it seems safer to split or do some other similar optimization if we find a
>>>>> large folio has partial map and unmap.
>>>>
>>>> I'm hoping that we can avoid any new direct users of _nr_pages_mapped if
>>>> possible.
>>>>
>>>
>>> Is _nr_pages_mapped < nr_pages a reasonable case to split as we
>>> have known the folio has at least some subpages zapped?
>>
>> I'm not sure we need this - the folio's presence on the split list will tell us
>> everything we need to know I think?
> 
> I agree, this is just one question to David, not my proposal.  if
> deferred_list is sufficient,
> I prefer we use deferred_list.
> 
> I actually don't quite understand why David dislikes _nr_pages_mapped being used
> though I do think _nr_pages_mapped cannot precisely reflect how a
> folio is mapped
> by multi-processes. but _nr_pages_mapped < nr_pages seems be safe to
> tell the folio
> is partially unmapped :-)

I'm hoping we can get rid of _nr_pages_mapped in some kernel configs in 
the future (that's what I am working on). So the less we depend on it 
the better.

With the total mapcount patch I'll revive shortly, _nr_pages_mapped will 
only be used inside rmap code. I'm hoping we won't have to introduce 
other users that will be harder to get rid of.

So please, if avoidable, no usage of _nr_pages_mapped outside of core 
rmap code.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-02-27 19:17           ` David Hildenbrand
@ 2024-02-28  9:37             ` Ryan Roberts
  2024-02-28 12:12               ` David Hildenbrand
  2024-02-28 13:33               ` Matthew Wilcox
  0 siblings, 2 replies; 116+ messages in thread
From: Ryan Roberts @ 2024-02-28  9:37 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: linux-kernel, linux-mm

Hi David, Huang,


On 27/02/2024 19:17, David Hildenbrand wrote:
> On 27.02.24 18:10, Ryan Roberts wrote:
>> Hi David,
>>
>> On 26/02/2024 17:41, Ryan Roberts wrote:
>>> On 22/02/2024 10:20, David Hildenbrand wrote:
>>>> On 22.02.24 11:19, David Hildenbrand wrote:
>>>>> On 25.10.23 16:45, Ryan Roberts wrote:
>>>>>> As preparation for supporting small-sized THP in the swap-out path,
>>>>>> without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE,
>>>>>> which, when present, always implies PMD-sized THP, which is the same as
>>>>>> the cluster size.
>>>>>>
>>>>>> The only use of the flag was to determine whether a swap entry refers to
>>>>>> a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
>>>>>> Instead of relying on the flag, we now pass in nr_pages, which
>>>>>> originates from the folio's number of pages. This allows the logic to
>>>>>> work for folios of any order.
>>>>>>
>>>>>> The one snag is that one of the swap_page_trans_huge_swapped() call
>>>>>> sites does not have the folio. But it was only being called there to
>>>>>> avoid bothering to call __try_to_reclaim_swap() in some cases.
>>>>>> __try_to_reclaim_swap() gets the folio and (via some other functions)
>>>>>> calls swap_page_trans_huge_swapped(). So I've removed the problematic
>>>>>> call site and believe the new logic should be equivalent.
>>>>>
>>>>> That is the  __try_to_reclaim_swap() -> folio_free_swap() ->
>>>>> folio_swapped() -> swap_page_trans_huge_swapped() call chain I assume.
>>>>>
>>>>> The "difference" is that you will now (1) get another temporary
>>>>> reference on the folio and (2) (try)lock the folio every time you
>>>>> discard a single PTE of a (possibly) large THP.
>>>>>
>>>>
>>>> Thinking about it, your change will not only affect THP, but any call to
>>>> free_swap_and_cache().
>>>>
>>>> Likely that's not what we want. :/
>>>>
>>>
>>> Is folio_trylock() really that expensive given the code path is already locking
>>> multiple spinlocks, and I don't think we would expect the folio lock to be very
>>> contended?
>>>
>>> I guess filemap_get_folio() could be a bit more expensive, but again, is this
>>> really a deal-breaker?
>>>
>>>
>>> I'm just trying to refamiliarize myself with this series, but I think I ended up
>>> allocating a cluster per cpu per order. So one potential solution would be to
>>> turn the flag into a size and store it in the cluster info. (In fact I think I
>>> was doing that in an early version of this series - will have to look at why I
>>> got rid of that). Then we could avoid needing to figure out nr_pages from the
>>> folio.
>>
>> I ran some microbenchmarks to see if these extra operations cause a performance
>> issue - it all looks OK to me.
> 
> Sorry, I'm drowning in reviews right now. I was hoping to get some of my own
> stuff figured out today ... maybe tomorrow.

No need to apologise - as always I appreciate whatever time you can spare.

> 
>>
>> I modified your "pte-mapped-folio-benchmarks" to add a "munmap-swapped-forked"
>> mode, which prepares the 1G memory mapping by first paging it out with
>> MADV_PAGEOUT, then it forks a child (and keeps that child alive) so that the
>> swap slots have 2 references, then it measures the duration of munmap() in the
>> parent on the entire range. The idea is that free_swap_and_cache() is called for
>> each PTE during munmap(). Prior to my change, swap_page_trans_huge_swapped()
>> will return true, due to the child's references, and __try_to_reclaim_swap() is
>> not called. After my change, we no longer have this short cut.
>>
>> In both cases the results are within 1% (confirmed across multiple runs of 20
>> seconds each):
>>
>> mm-stable: Average: 0.004997
>>   + change: Average: 0.005037
>>
>> (these numbers are for Ampere Altra. I also tested on M2 VM - no regression
>> their either).
>>
>> Do you still have a concern about this change?
> 
> The main concern I had was not about overhead due to atomic operations in the
> non-concurrent case that you are measuring.
> 
> We might now unnecessarily be incrementing the folio refcount and taking
> the folio lock. That will affects large folios in the swapcache now IIUC.
> Small folios should be unaffected.

Yes I think you are right, because `count == SWAP_HAS_CACHE` is already checking
the small page is not swapped. So my perf tests weren't actually doing what I
thought they were.

> 
> The side effects of that can be:
> * Code checking for additional folio reference could now detect some and
>   back out. (the "mapcount + swapcache*folio_nr_pages != folio_refcount"
>   stuff)
> * Code that might really benefit from trylocking the folio might fail to
>   do so.
> 
> For example, splitting a large folio might now fail more often simply
> because some process zaps a swap entry and the additional reference+page
> lock were optimized out previously.

Understood. Of course this is the type of fuzzy reasoning that is very dificult
to test objectively :). But it makes sense and I suppose I'll have to come up
with an alternative approach (see below).

> 
> How relevant is it? Relevant enough that someone decided to put that
> optimization in? I don't know :)

I'll have one last go at convincing you: Huang Ying (original author) commented
"I believe this should be OK.  Better to compare the performance too." at [1].
That implies to me that perhaps the optimization wasn't in response to a
specific problem after all. Do you have any thoughts, Huang?

[1]
https://lore.kernel.org/linux-mm/87v8bdfvtj.fsf@yhuang6-desk2.ccr.corp.intel.com/

> 
> Arguably, zapping a present PTE also leaves the refcount elevated for a while
> until the mapcount was freed. But here, it could be avoided.
> 
> Digging a bit, it was introduced in:
> 
> commit e07098294adfd03d582af7626752255e3d170393
> Author: Huang Ying <ying.huang@intel.com>
> Date:   Wed Sep 6 16:22:16 2017 -0700
> 
>     mm, THP, swap: support to reclaim swap space for THP swapped out
>         The normal swap slot reclaiming can be done when the swap count reaches
>     SWAP_HAS_CACHE.  But for the swap slot which is backing a THP, all swap
>     slots backing one THP must be reclaimed together, because the swap slot
>     may be used again when the THP is swapped out again later.  So the swap
>     slots backing one THP can be reclaimed together when the swap count for
>     all swap slots for the THP reached SWAP_HAS_CACHE.  In the patch, the
>     functions to check whether the swap count for all swap slots backing one
>     THP reached SWAP_HAS_CACHE are implemented and used when checking
>     whether a swap slot can be reclaimed.
>         To make it easier to determine whether a swap slot is backing a THP, a
>     new swap cluster flag named CLUSTER_FLAG_HUGE is added to mark a swap
>     cluster which is backing a THP (Transparent Huge Page).  Because THP
>     swap in as a whole isn't supported now.  After deleting the THP from the
>     swap cache (for example, swapping out finished), the CLUSTER_FLAG_HUGE
>     flag will be cleared.  So that, the normal pages inside THP can be
>     swapped in individually.

Thanks. I did this same archaeology, but found nothing pointing to the rationale
for this optimization, so decided that if its undocumented, then it probably
wasn't critical.

> 
> 
> With your change, if we have a swapped out THP with 512 entries and exit(), we
> would now 512 times in a row grab a folio reference and trylock the folio. In the
> past, we would have done that at most once.
> 
> That doesn't feel quite right TBH ... so I'm wondering if there are any low-hanging
> fruits to avoid that.
> 

OK so if we really do need to keep this optimization, here are some ideas:

Fundamentally, we would like to be able to figure out the size of the swap slot
from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
PMD_SIZE, it always uses a full cluster, so can easily add a flag to the cluster
to mark it as PMD_SIZE.

Going forwards, we want to support all sizes (power-of-2). Most of the time, a
cluster will contain only one size of THPs, but this is not the case when a THP
in the swapcache gets split or when an order-0 slot gets stolen. We expect these
cases to be rare.

1) Keep the size of the smallest swap entry in the cluster header. Most of the
time it will be the full size of the swap entry, but sometimes it will cover
only a portion. In the latter case you may see a false negative for
swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
There is one wrinkle: currently the HUGE flag is cleared in put_swap_folio(). We
wouldn't want to do the equivalent in the new scheme (i.e. set the whole cluster
to order-0). I think that is safe, but haven't completely convinced myself yet.

2) allocate 4 bits per (small) swap slot to hold the order. This will give
precise information and is conceptually simpler to understand, but will cost
more memory (half as much as the initial swap_map[] again).

I still prefer to avoid this at all if we can (and would like to hear Huang's
thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
prototyping.

Thanks,
Ryan


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-02-28  9:37             ` Ryan Roberts
@ 2024-02-28 12:12               ` David Hildenbrand
  2024-02-28 14:57                 ` Ryan Roberts
  2024-03-04 16:03                 ` Ryan Roberts
  2024-02-28 13:33               ` Matthew Wilcox
  1 sibling, 2 replies; 116+ messages in thread
From: David Hildenbrand @ 2024-02-28 12:12 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: linux-kernel, linux-mm

>> How relevant is it? Relevant enough that someone decided to put that
>> optimization in? I don't know :)
> 
> I'll have one last go at convincing you: Huang Ying (original author) commented
> "I believe this should be OK.  Better to compare the performance too." at [1].
> That implies to me that perhaps the optimization wasn't in response to a
> specific problem after all. Do you have any thoughts, Huang?

Might make sense to include that in the patch description!

> OK so if we really do need to keep this optimization, here are some ideas:
> 
> Fundamentally, we would like to be able to figure out the size of the swap slot
> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the cluster
> to mark it as PMD_SIZE.
> 
> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
> cluster will contain only one size of THPs, but this is not the case when a THP
> in the swapcache gets split or when an order-0 slot gets stolen. We expect these
> cases to be rare.
> 
> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
> time it will be the full size of the swap entry, but sometimes it will cover
> only a portion. In the latter case you may see a false negative for
> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
> There is one wrinkle: currently the HUGE flag is cleared in put_swap_folio(). We
> wouldn't want to do the equivalent in the new scheme (i.e. set the whole cluster
> to order-0). I think that is safe, but haven't completely convinced myself yet.
> 
> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
> precise information and is conceptually simpler to understand, but will cost
> more memory (half as much as the initial swap_map[] again).
> 
> I still prefer to avoid this at all if we can (and would like to hear Huang's
> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
> prototyping.

Taking a step back: what about we simply batch unmapping of swap entries?

That is, if we're unmapping a PTE range, we'll collect swap entries 
(under PT lock) that reference consecutive swap offsets in the same swap 
file.

There, we can then first decrement all the swap counts, and then try 
minimizing how often we actually have to try reclaiming swap space 
(lookup folio, see it's a large folio that we cannot reclaim or could 
reclaim, ...).

Might need some fine-tuning in swap code to "advance" to the next entry 
to try freeing up, but we certainly can do better than what we would do 
right now.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-02-28  9:37             ` Ryan Roberts
  2024-02-28 12:12               ` David Hildenbrand
@ 2024-02-28 13:33               ` Matthew Wilcox
  2024-02-28 14:24                 ` Ryan Roberts
  1 sibling, 1 reply; 116+ messages in thread
From: Matthew Wilcox @ 2024-02-28 13:33 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: David Hildenbrand, Andrew Morton, Huang Ying, Gao Xiang, Yu Zhao,
	Yang Shi, Michal Hocko, Kefeng Wang, linux-kernel, linux-mm

On Wed, Feb 28, 2024 at 09:37:06AM +0000, Ryan Roberts wrote:
> Fundamentally, we would like to be able to figure out the size of the swap slot
> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the cluster
> to mark it as PMD_SIZE.
> 
> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
> cluster will contain only one size of THPs, but this is not the case when a THP
> in the swapcache gets split or when an order-0 slot gets stolen. We expect these
> cases to be rare.
> 
> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
> time it will be the full size of the swap entry, but sometimes it will cover
> only a portion. In the latter case you may see a false negative for
> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
> There is one wrinkle: currently the HUGE flag is cleared in put_swap_folio(). We
> wouldn't want to do the equivalent in the new scheme (i.e. set the whole cluster
> to order-0). I think that is safe, but haven't completely convinced myself yet.
> 
> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
> precise information and is conceptually simpler to understand, but will cost
> more memory (half as much as the initial swap_map[] again).
> 
> I still prefer to avoid this at all if we can (and would like to hear Huang's
> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
> prototyping.

I can't quite bring myself to look up the encoding of swap entries
but as long as we're willing to restrict ourselves to naturally aligning
the clusters, there's an encoding (which I believe I invented) that lets
us encode arbitrary power-of-two sizes with a single bit.

I describe it here:
https://kernelnewbies.org/MatthewWilcox/NaturallyAlignedOrder

Let me know if it's not clear.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-02-28 13:33               ` Matthew Wilcox
@ 2024-02-28 14:24                 ` Ryan Roberts
  2024-02-28 14:59                   ` Ryan Roberts
  0 siblings, 1 reply; 116+ messages in thread
From: Ryan Roberts @ 2024-02-28 14:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Hildenbrand, Andrew Morton, Huang Ying, Gao Xiang, Yu Zhao,
	Yang Shi, Michal Hocko, Kefeng Wang, linux-kernel, linux-mm

On 28/02/2024 13:33, Matthew Wilcox wrote:
> On Wed, Feb 28, 2024 at 09:37:06AM +0000, Ryan Roberts wrote:
>> Fundamentally, we would like to be able to figure out the size of the swap slot
>> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
>> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the cluster
>> to mark it as PMD_SIZE.
>>
>> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
>> cluster will contain only one size of THPs, but this is not the case when a THP
>> in the swapcache gets split or when an order-0 slot gets stolen. We expect these
>> cases to be rare.
>>
>> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
>> time it will be the full size of the swap entry, but sometimes it will cover
>> only a portion. In the latter case you may see a false negative for
>> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
>> There is one wrinkle: currently the HUGE flag is cleared in put_swap_folio(). We
>> wouldn't want to do the equivalent in the new scheme (i.e. set the whole cluster
>> to order-0). I think that is safe, but haven't completely convinced myself yet.
>>
>> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
>> precise information and is conceptually simpler to understand, but will cost
>> more memory (half as much as the initial swap_map[] again).
>>
>> I still prefer to avoid this at all if we can (and would like to hear Huang's
>> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
>> prototyping.
> 
> I can't quite bring myself to look up the encoding of swap entries
> but as long as we're willing to restrict ourselves to naturally aligning
> the clusters, there's an encoding (which I believe I invented) that lets
> us encode arbitrary power-of-two sizes with a single bit.
> 
> I describe it here:
> https://kernelnewbies.org/MatthewWilcox/NaturallyAlignedOrder
> 
> Let me know if it's not clear.

Ahh yes, I'm familiar with this encoding scheme from other settings. Although
I've previously thought of it as having a bit to indicate whether the scheme is
enabled or not, and if it is enabled then the encoded PFN is:

PFNe = PFNd | (1 << (log2(n) - 1))

Where n is the power-of-2 page count.

Same thing, I think.

I think we would have to steal a bit from the offset to make this work, and it
looks like the size of that is bottlnecked on the arch's swp_entry PTE
representation. Looks like there is a MIPS config that only has 17 bits for
offset to begin with, so I doubt we would be able to spare a bit here? Although
it looks possible that there are some unused low bits that could be used...


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-02-28 12:12               ` David Hildenbrand
@ 2024-02-28 14:57                 ` Ryan Roberts
  2024-02-28 15:12                   ` David Hildenbrand
  2024-03-04 16:03                 ` Ryan Roberts
  1 sibling, 1 reply; 116+ messages in thread
From: Ryan Roberts @ 2024-02-28 14:57 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: linux-kernel, linux-mm

On 28/02/2024 12:12, David Hildenbrand wrote:
>>> How relevant is it? Relevant enough that someone decided to put that
>>> optimization in? I don't know :)
>>
>> I'll have one last go at convincing you: Huang Ying (original author) commented
>> "I believe this should be OK.  Better to compare the performance too." at [1].
>> That implies to me that perhaps the optimization wasn't in response to a
>> specific problem after all. Do you have any thoughts, Huang?
> 
> Might make sense to include that in the patch description!
> 
>> OK so if we really do need to keep this optimization, here are some ideas:
>>
>> Fundamentally, we would like to be able to figure out the size of the swap slot
>> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
>> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the cluster
>> to mark it as PMD_SIZE.
>>
>> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
>> cluster will contain only one size of THPs, but this is not the case when a THP
>> in the swapcache gets split or when an order-0 slot gets stolen. We expect these
>> cases to be rare.
>>
>> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
>> time it will be the full size of the swap entry, but sometimes it will cover
>> only a portion. In the latter case you may see a false negative for
>> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
>> There is one wrinkle: currently the HUGE flag is cleared in put_swap_folio(). We
>> wouldn't want to do the equivalent in the new scheme (i.e. set the whole cluster
>> to order-0). I think that is safe, but haven't completely convinced myself yet.
>>
>> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
>> precise information and is conceptually simpler to understand, but will cost
>> more memory (half as much as the initial swap_map[] again).
>>
>> I still prefer to avoid this at all if we can (and would like to hear Huang's
>> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
>> prototyping.
> 
> Taking a step back: what about we simply batch unmapping of swap entries?
> 
> That is, if we're unmapping a PTE range, we'll collect swap entries (under PT
> lock) that reference consecutive swap offsets in the same swap file.

Yes in principle, but there are 4 places where free_swap_and_cache() is called,
and only 2 of those are really amenable to batching (zap_pte_range() and
madvise_free_pte_range()). So the other two users will still take the "slow"
path. Maybe those 2 callsites are the only ones that really matter? I can
certainly have a stab at this approach.

> 
> There, we can then first decrement all the swap counts, and then try minimizing
> how often we actually have to try reclaiming swap space (lookup folio, see it's
> a large folio that we cannot reclaim or could reclaim, ...).
> 
> Might need some fine-tuning in swap code to "advance" to the next entry to try
> freeing up, but we certainly can do better than what we would do right now.

I'm not sure I've understood this. Isn't advancing just a matter of:

entry = swp_entry(swp_type(entry), swp_offset(entry) + 1);



^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-02-28 14:24                 ` Ryan Roberts
@ 2024-02-28 14:59                   ` Ryan Roberts
  0 siblings, 0 replies; 116+ messages in thread
From: Ryan Roberts @ 2024-02-28 14:59 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Hildenbrand, Andrew Morton, Huang Ying, Gao Xiang, Yu Zhao,
	Yang Shi, Michal Hocko, Kefeng Wang, linux-kernel, linux-mm

On 28/02/2024 14:24, Ryan Roberts wrote:
> On 28/02/2024 13:33, Matthew Wilcox wrote:
>> On Wed, Feb 28, 2024 at 09:37:06AM +0000, Ryan Roberts wrote:
>>> Fundamentally, we would like to be able to figure out the size of the swap slot
>>> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
>>> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the cluster
>>> to mark it as PMD_SIZE.
>>>
>>> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
>>> cluster will contain only one size of THPs, but this is not the case when a THP
>>> in the swapcache gets split or when an order-0 slot gets stolen. We expect these
>>> cases to be rare.
>>>
>>> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
>>> time it will be the full size of the swap entry, but sometimes it will cover
>>> only a portion. In the latter case you may see a false negative for
>>> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
>>> There is one wrinkle: currently the HUGE flag is cleared in put_swap_folio(). We
>>> wouldn't want to do the equivalent in the new scheme (i.e. set the whole cluster
>>> to order-0). I think that is safe, but haven't completely convinced myself yet.
>>>
>>> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
>>> precise information and is conceptually simpler to understand, but will cost
>>> more memory (half as much as the initial swap_map[] again).
>>>
>>> I still prefer to avoid this at all if we can (and would like to hear Huang's
>>> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
>>> prototyping.
>>
>> I can't quite bring myself to look up the encoding of swap entries
>> but as long as we're willing to restrict ourselves to naturally aligning
>> the clusters, there's an encoding (which I believe I invented) that lets
>> us encode arbitrary power-of-two sizes with a single bit.
>>
>> I describe it here:
>> https://kernelnewbies.org/MatthewWilcox/NaturallyAlignedOrder
>>
>> Let me know if it's not clear.
> 
> Ahh yes, I'm familiar with this encoding scheme from other settings. Although
> I've previously thought of it as having a bit to indicate whether the scheme is
> enabled or not, and if it is enabled then the encoded PFN is:
> 
> PFNe = PFNd | (1 << (log2(n) - 1))
> 
> Where n is the power-of-2 page count.
> 
> Same thing, I think.
> 
> I think we would have to steal a bit from the offset to make this work, and it
> looks like the size of that is bottlnecked on the arch's swp_entry PTE
> representation. Looks like there is a MIPS config that only has 17 bits for
> offset to begin with, so I doubt we would be able to spare a bit here? Although
> it looks possible that there are some unused low bits that could be used...
> 

I think the other problem with this is that it won't tell us which slot in the
"swap slot block" each entry is targetting?


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-02-28 14:57                 ` Ryan Roberts
@ 2024-02-28 15:12                   ` David Hildenbrand
  2024-02-28 15:18                     ` Ryan Roberts
  2024-03-01 16:27                     ` Ryan Roberts
  0 siblings, 2 replies; 116+ messages in thread
From: David Hildenbrand @ 2024-02-28 15:12 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: linux-kernel, linux-mm

On 28.02.24 15:57, Ryan Roberts wrote:
> On 28/02/2024 12:12, David Hildenbrand wrote:
>>>> How relevant is it? Relevant enough that someone decided to put that
>>>> optimization in? I don't know :)
>>>
>>> I'll have one last go at convincing you: Huang Ying (original author) commented
>>> "I believe this should be OK.  Better to compare the performance too." at [1].
>>> That implies to me that perhaps the optimization wasn't in response to a
>>> specific problem after all. Do you have any thoughts, Huang?
>>
>> Might make sense to include that in the patch description!
>>
>>> OK so if we really do need to keep this optimization, here are some ideas:
>>>
>>> Fundamentally, we would like to be able to figure out the size of the swap slot
>>> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
>>> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the cluster
>>> to mark it as PMD_SIZE.
>>>
>>> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
>>> cluster will contain only one size of THPs, but this is not the case when a THP
>>> in the swapcache gets split or when an order-0 slot gets stolen. We expect these
>>> cases to be rare.
>>>
>>> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
>>> time it will be the full size of the swap entry, but sometimes it will cover
>>> only a portion. In the latter case you may see a false negative for
>>> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
>>> There is one wrinkle: currently the HUGE flag is cleared in put_swap_folio(). We
>>> wouldn't want to do the equivalent in the new scheme (i.e. set the whole cluster
>>> to order-0). I think that is safe, but haven't completely convinced myself yet.
>>>
>>> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
>>> precise information and is conceptually simpler to understand, but will cost
>>> more memory (half as much as the initial swap_map[] again).
>>>
>>> I still prefer to avoid this at all if we can (and would like to hear Huang's
>>> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
>>> prototyping.
>>
>> Taking a step back: what about we simply batch unmapping of swap entries?
>>
>> That is, if we're unmapping a PTE range, we'll collect swap entries (under PT
>> lock) that reference consecutive swap offsets in the same swap file.
> 
> Yes in principle, but there are 4 places where free_swap_and_cache() is called,
> and only 2 of those are really amenable to batching (zap_pte_range() and
> madvise_free_pte_range()). So the other two users will still take the "slow"
> path. Maybe those 2 callsites are the only ones that really matter? I can
> certainly have a stab at this approach.

We can ignore the s390x one. That s390x code should only apply to KVM 
guest memory where ordinary THP are not even supported. (and nobody uses 
mTHP there yet).

Long story short: the VM can hint that some memory pages are now unused 
and the hypervisor can reclaim them. That's what that callback does (zap 
guest-provided guest memory). No need to worry about any batching for now.

Then, there is the shmem one in shmem_free_swap(). I really don't know 
how shmem handles THP+swapout.

But looking at shmem_writepage(), we split any large folios before 
moving them to the swapcache, so likely we don't care at all, because 
THP don't apply.

> 
>>
>> There, we can then first decrement all the swap counts, and then try minimizing
>> how often we actually have to try reclaiming swap space (lookup folio, see it's
>> a large folio that we cannot reclaim or could reclaim, ...).
>>
>> Might need some fine-tuning in swap code to "advance" to the next entry to try
>> freeing up, but we certainly can do better than what we would do right now.
> 
> I'm not sure I've understood this. Isn't advancing just a matter of:
> 
> entry = swp_entry(swp_type(entry), swp_offset(entry) + 1);

I was talking about the advancing swapslot processing after decrementing 
the swapcounts.

Assume you decremented 512 swapcounts and some of them went to 0. AFAIU, 
you'd have to start with the first swapslot that has now a swapcount=0 
one and try to reclaim swap.

Assume you get a small folio, then you'll have to proceed with the next 
swap slot and try to reclaim swap.

Assume you get a large folio, then you can skip more swapslots 
(depending on offset into the folio etc).

If you get what I mean. :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-02-28 15:12                   ` David Hildenbrand
@ 2024-02-28 15:18                     ` Ryan Roberts
  2024-03-01 16:27                     ` Ryan Roberts
  1 sibling, 0 replies; 116+ messages in thread
From: Ryan Roberts @ 2024-02-28 15:18 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: linux-kernel, linux-mm

On 28/02/2024 15:12, David Hildenbrand wrote:
> On 28.02.24 15:57, Ryan Roberts wrote:
>> On 28/02/2024 12:12, David Hildenbrand wrote:
>>>>> How relevant is it? Relevant enough that someone decided to put that
>>>>> optimization in? I don't know :)
>>>>
>>>> I'll have one last go at convincing you: Huang Ying (original author) commented
>>>> "I believe this should be OK.  Better to compare the performance too." at [1].
>>>> That implies to me that perhaps the optimization wasn't in response to a
>>>> specific problem after all. Do you have any thoughts, Huang?
>>>
>>> Might make sense to include that in the patch description!
>>>
>>>> OK so if we really do need to keep this optimization, here are some ideas:
>>>>
>>>> Fundamentally, we would like to be able to figure out the size of the swap slot
>>>> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
>>>> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the
>>>> cluster
>>>> to mark it as PMD_SIZE.
>>>>
>>>> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
>>>> cluster will contain only one size of THPs, but this is not the case when a THP
>>>> in the swapcache gets split or when an order-0 slot gets stolen. We expect
>>>> these
>>>> cases to be rare.
>>>>
>>>> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
>>>> time it will be the full size of the swap entry, but sometimes it will cover
>>>> only a portion. In the latter case you may see a false negative for
>>>> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
>>>> There is one wrinkle: currently the HUGE flag is cleared in
>>>> put_swap_folio(). We
>>>> wouldn't want to do the equivalent in the new scheme (i.e. set the whole
>>>> cluster
>>>> to order-0). I think that is safe, but haven't completely convinced myself yet.
>>>>
>>>> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
>>>> precise information and is conceptually simpler to understand, but will cost
>>>> more memory (half as much as the initial swap_map[] again).
>>>>
>>>> I still prefer to avoid this at all if we can (and would like to hear Huang's
>>>> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
>>>> prototyping.
>>>
>>> Taking a step back: what about we simply batch unmapping of swap entries?
>>>
>>> That is, if we're unmapping a PTE range, we'll collect swap entries (under PT
>>> lock) that reference consecutive swap offsets in the same swap file.
>>
>> Yes in principle, but there are 4 places where free_swap_and_cache() is called,
>> and only 2 of those are really amenable to batching (zap_pte_range() and
>> madvise_free_pte_range()). So the other two users will still take the "slow"
>> path. Maybe those 2 callsites are the only ones that really matter? I can
>> certainly have a stab at this approach.
> 
> We can ignore the s390x one. That s390x code should only apply to KVM guest
> memory where ordinary THP are not even supported. (and nobody uses mTHP there yet).
> 
> Long story short: the VM can hint that some memory pages are now unused and the
> hypervisor can reclaim them. That's what that callback does (zap guest-provided
> guest memory). No need to worry about any batching for now.

OK good.

> 
> Then, there is the shmem one in shmem_free_swap(). I really don't know how shmem
> handles THP+swapout.
> 
> But looking at shmem_writepage(), we split any large folios before moving them
> to the swapcache, so likely we don't care at all, because THP don't apply.

Excellent.

> 
>>
>>>
>>> There, we can then first decrement all the swap counts, and then try minimizing
>>> how often we actually have to try reclaiming swap space (lookup folio, see it's
>>> a large folio that we cannot reclaim or could reclaim, ...).
>>>
>>> Might need some fine-tuning in swap code to "advance" to the next entry to try
>>> freeing up, but we certainly can do better than what we would do right now.
>>
>> I'm not sure I've understood this. Isn't advancing just a matter of:
>>
>> entry = swp_entry(swp_type(entry), swp_offset(entry) + 1);
> 
> I was talking about the advancing swapslot processing after decrementing the
> swapcounts.
> 
> Assume you decremented 512 swapcounts and some of them went to 0. AFAIU, you'd
> have to start with the first swapslot that has now a swapcount=0 one and try to
> reclaim swap.
> 
> Assume you get a small folio, then you'll have to proceed with the next swap
> slot and try to reclaim swap.
> 
> Assume you get a large folio, then you can skip more swapslots (depending on
> offset into the folio etc).
> 
> If you get what I mean. :)

Ahh gottya. I'll have a play.



^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2024-02-28  1:23           ` Barry Song
  2024-02-28  9:34             ` David Hildenbrand
@ 2024-02-28 15:57             ` Ryan Roberts
  1 sibling, 0 replies; 116+ messages in thread
From: Ryan Roberts @ 2024-02-28 15:57 UTC (permalink / raw)
  To: Barry Song
  Cc: David Hildenbrand, akpm, linux-kernel, linux-mm, mhocko,
	shy828301, wangkefeng.wang, willy, xiang, ying.huang, yuzhao,
	chrisl, surenb, hanchuanhua

On 28/02/2024 01:23, Barry Song wrote:
> On Wed, Feb 28, 2024 at 1:06 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 23/02/2024 09:46, Barry Song wrote:
>>> On Thu, Feb 22, 2024 at 11:09 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 22.02.24 08:05, Barry Song wrote:
>>>>> Hi Ryan,
>>>>>
>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>> index 2cc0cb41fb32..ea19710aa4cd 100644
>>>>>> --- a/mm/vmscan.c
>>>>>> +++ b/mm/vmscan.c
>>>>>> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>>>>>                                      if (!can_split_folio(folio, NULL))
>>>>>>                                              goto activate_locked;
>>>>>>                                      /*
>>>>>> -                                     * Split folios without a PMD map right
>>>>>> -                                     * away. Chances are some or all of the
>>>>>> -                                     * tail pages can be freed without IO.
>>>>>> +                                     * Split PMD-mappable folios without a
>>>>>> +                                     * PMD map right away. Chances are some
>>>>>> +                                     * or all of the tail pages can be freed
>>>>>> +                                     * without IO.
>>>>>>                                       */
>>>>>> -                                    if (!folio_entire_mapcount(folio) &&
>>>>>> +                                    if (folio_test_pmd_mappable(folio) &&
>>>>>> +                                        !folio_entire_mapcount(folio) &&
>>>>>>                                          split_folio_to_list(folio,
>>>>>>                                                              folio_list))
>>>>>>                                              goto activate_locked;
>>>>>
>>>>> I ran a test to investigate what would happen while reclaiming a partially
>>>>> unmapped large folio. for example, for 64KiB large folios, MADV_DONTNEED
>>>>> 4KB~64KB, and keep the first subpage 0~4KiB.
>>>>
>>>> IOW, something that already happens with ordinary THP already IIRC.
>>>>
>>>>>
>>>>> My test wants to address my three concerns,
>>>>> a. whether we will have leak on swap slots
>>>>> b. whether we will have redundant I/O
>>>>> c. whether we will cause races on swapcache
>>>>>
>>>>> what i have done is printing folio->_nr_pages_mapped and dumping 16 swap_map[]
>>>>> at some specific stage
>>>>> 1. just after add_to_swap   (swap slots are allocated)
>>>>> 2. before and after try_to_unmap   (ptes are set to swap_entry)
>>>>> 3. before and after pageout (also add printk in zram driver to dump all I/O write)
>>>>> 4. before and after remove_mapping
>>>>>
>>>>> The below is the dumped info for a particular large folio,
>>>>>
>>>>> 1. after add_to_swap
>>>>> [   27.267357] vmscan: After add_to_swap shrink_folio_list 1947 mapnr:1
>>>>> [   27.267650] vmscan: offset:101b0 swp_map 40-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
>>>>>
>>>>> as you can see,
>>>>> _nr_pages_mapped is 1 and all 16 swap_map are SWAP_HAS_CACHE (0x40)
>>>>>
>>>>>
>>>>> 2. before and after try_to_unmap
>>>>> [   27.268067] vmscan: before try to unmap shrink_folio_list 1991 mapnr:1
>>>>> [   27.268372] try_to_unmap_one address:ffff731f0000 pte:e8000103cd0b43 pte_p:ffff0000c36a8f80
>>>>> [   27.268854] vmscan: after try to unmap shrink_folio_list 1997 mapnr:0
>>>>> [   27.269180] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
>>>>>
>>>>> as you can see, one pte is set to swp_entry, and _nr_pages_mapped becomes
>>>>> 0 from 1. The 1st swp_map becomes 0x41, SWAP_HAS_CACHE + 1
>>>>>
>>>>> 3. before and after pageout
>>>>> [   27.269602] vmscan: before pageout shrink_folio_list 2065 mapnr:0
>>>>> [   27.269880] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
>>>>> [   27.270691] zram: zram_write_page page:fffffc00030f3400 index:101b0
>>>>> [   27.271061] zram: zram_write_page page:fffffc00030f3440 index:101b1
>>>>> [   27.271416] zram: zram_write_page page:fffffc00030f3480 index:101b2
>>>>> [   27.271751] zram: zram_write_page page:fffffc00030f34c0 index:101b3
>>>>> [   27.272046] zram: zram_write_page page:fffffc00030f3500 index:101b4
>>>>> [   27.272384] zram: zram_write_page page:fffffc00030f3540 index:101b5
>>>>> [   27.272746] zram: zram_write_page page:fffffc00030f3580 index:101b6
>>>>> [   27.273042] zram: zram_write_page page:fffffc00030f35c0 index:101b7
>>>>> [   27.273339] zram: zram_write_page page:fffffc00030f3600 index:101b8
>>>>> [   27.273676] zram: zram_write_page page:fffffc00030f3640 index:101b9
>>>>> [   27.274044] zram: zram_write_page page:fffffc00030f3680 index:101ba
>>>>> [   27.274554] zram: zram_write_page page:fffffc00030f36c0 index:101bb
>>>>> [   27.274870] zram: zram_write_page page:fffffc00030f3700 index:101bc
>>>>> [   27.275166] zram: zram_write_page page:fffffc00030f3740 index:101bd
>>>>> [   27.275463] zram: zram_write_page page:fffffc00030f3780 index:101be
>>>>> [   27.275760] zram: zram_write_page page:fffffc00030f37c0 index:101bf
>>>>> [   27.276102] vmscan: after pageout and before needs_release shrink_folio_list 2124 mapnr:0
>>>>>
>>>>> as you can see, obviously, we have done redundant I/O - 16 zram_write_page though
>>>>> 4~64KiB has been zap_pte_range before, we still write them to zRAM.
>>>>>
>>>>> 4. before and after remove_mapping
>>>>> [   27.276428] vmscan: offset:101b0 swp_map 41-40-40-40-40-40-40-40-40-40-40-40-40-40-40-40
>>>>> [   27.277485] vmscan: after remove_mapping shrink_folio_list 2169 mapnr:0 offset:0
>>>>> [   27.277802] vmscan: offset:101b0 01-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
>>>>>
>>>>> as you can see, swp_map 1-15 becomes 0 and only the first swp_map is 1.
>>>>> all SWAP_HAS_CACHE has been removed. This is perfect and there is no swap
>>>>> slot leak at all!
>>>>>
>>>>> Thus, only two concerns are left for me,
>>>>> 1. as we don't split anyway, we have done 15 unnecessary I/O if a large folio
>>>>> is partially unmapped.
>>
>> So the cost of this is increased IO and swap storage, correct? Is this a big
>> problem in practice? i.e. do you see a lot of partially mapped large folios in
>> your workload? (I agree the proposed fix below is simple, so I think we should
>> do it anyway - I'm just interested in the scale of the problem).
>>
>>>>> 2. large folio is added as a whole as a swapcache covering the range whose
>>>>> part has been zapped. I am not quite sure if this will cause some problems
>>>>> while some concurrent do_anon_page, swapin and swapout occurs between 3 and
>>>>> 4 on zapped subpage1~subpage15. still struggling.. my brain is exploding...
>>
>> Yes mine too. I would only expect the ptes that map the folio will get replaced
>> with swap entries? So I would expect it to be safe. Although I understand the
>> concern with the extra swap consumption.
> 
> yes. it should still be safe. just more I/O and more swap spaces. but they will
> be removed while remove_mapping happens if try_to_unmap_one makes
> the folio unmapped.
> 
> but with the potential possibility even mapped PTEs can be skipped by
> try_to_unmap_one (reported intermediate PTEs issue - PTL is held till
> a valid PTE, some PTEs might be skipped by try_to_unmap without being
> set to swap entries), we could have the possibility folio_mapped() is still true
> after try_to_unmap_one. so we can't get to __remove_mapping() for a long
> time. but it still doesn't cause a crash.
> 
>>
>> [...]
>>>>>
>>>>> To me, it seems safer to split or do some other similar optimization if we find a
>>>>> large folio has partial map and unmap.
>>>>
>>>> I'm hoping that we can avoid any new direct users of _nr_pages_mapped if
>>>> possible.
>>>>
>>>
>>> Is _nr_pages_mapped < nr_pages a reasonable case to split as we
>>> have known the folio has at least some subpages zapped?
>>
>> I'm not sure we need this - the folio's presence on the split list will tell us
>> everything we need to know I think?
> 
> I agree, this is just one question to David, not my proposal.  if
> deferred_list is sufficient,
> I prefer we use deferred_list.
> 
> I actually don't quite understand why David dislikes _nr_pages_mapped being used
> though I do think _nr_pages_mapped cannot precisely reflect how a
> folio is mapped
> by multi-processes. but _nr_pages_mapped < nr_pages seems be safe to
> tell the folio
> is partially unmapped :-)
> 
>>
>>>
>>>> If we find that the folio is on the deferred split list, we might as
>>>> well just split it right away, before swapping it out. That might be a
>>>> reasonable optimization for the case you describe.
>>
>> Yes, agreed. I think there is still chance of a race though; Some other thread
>> could be munmapping in parallel. But in that case, I think we just end up with
>> the increased IO and swap storage? That's not the end of the world if its a
>> corner case.
> 
> I agree. btw, do we need a spinlock ds_queue->split_queue_lock for checking
> the list? deferred_split_folio(), for itself, has no spinlock while checking
>  if (!list_empty(&folio->_deferred_list)), but why? the read and write
> need to be exclusive.....

I don't think so. It's safe to check if the folio is on the queue like this; but
if it isn't then you need to recheck under the lock, as is done here. So for us,
I think we can also do this safely. It is certainly preferable to avoid taking
the lock.

The original change says this:

Before acquire split_queue_lock, check and bail out early if the THP
head page is in the queue already. The checking without holding
split_queue_lock could race with deferred_split_scan, but it doesn't
impact the correctness here.

> 
> void deferred_split_folio(struct folio *folio)
> {
>         ...
> 
>         if (!list_empty(&folio->_deferred_list))
>                 return;
> 
>         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>         if (list_empty(&folio->_deferred_list)) {
>                 count_vm_event(THP_DEFERRED_SPLIT_PAGE);
>                 list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
>                 ds_queue->split_queue_len++;
> #ifdef CONFIG_MEMCG
>                 if (memcg)
>                         set_shrinker_bit(memcg, folio_nid(folio),
>                                          deferred_split_shrinker->id);
> #endif
>         }
>         spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> }
> 
>>
>>>
>>> i tried to change Ryan's code as below
>>>
>>> @@ -1905,11 +1922,12 @@ static unsigned int shrink_folio_list(struct
>>> list_head *folio_list,
>>>                                          * PMD map right away. Chances are some
>>>                                          * or all of the tail pages can be freed
>>>                                          * without IO.
>>> +                                        * Similarly, split PTE-mapped folios if
>>> +                                        * they have been already
>>> deferred_split.
>>>                                          */
>>> -                                       if (folio_test_pmd_mappable(folio) &&
>>> -                                           !folio_entire_mapcount(folio) &&
>>> -                                           split_folio_to_list(folio,
>>> -                                                               folio_list))
>>> +                                       if
>>> (((folio_test_pmd_mappable(folio) && !folio_entire_mapcount(folio)) ||
>>> +
>>> (!folio_test_pmd_mappable(folio) &&
>>> !list_empty(&folio->_deferred_list)))
>>
>> I'm not sure we need the different tests for pmd_mappable vs !pmd_mappable. I
>> think presence on the deferred list is a sufficient indicator that there are
>> unmapped subpages?
> 
> I don't think there are fundamental differences for pmd and pte. i was
> testing pte-mapped folio at that time, so kept the behavior of pmd as is.
> 
>>
>> I'll incorporate this into my next version.
> 
> Great!
> 
>>
>>> +                                           &&
>>> split_folio_to_list(folio, folio_list))
>>>                                                 goto activate_locked;
>>>                                 }
>>>                                 if (!add_to_swap(folio)) {
>>>
>>> It seems to work as expected. only one I/O is left for a large folio
>>> with 16 PTEs
>>> but 15 of them have been zapped before.
>>>
>>>>
>>>> --
>>>> Cheers,
>>>>
>>>> David / dhildenb
>>>>
>>>
> 
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2024-02-28  9:34             ` David Hildenbrand
@ 2024-02-28 23:18               ` Barry Song
  0 siblings, 0 replies; 116+ messages in thread
From: Barry Song @ 2024-02-28 23:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Ryan Roberts, akpm, linux-kernel, linux-mm, mhocko, shy828301,
	wangkefeng.wang, willy, xiang, ying.huang, yuzhao, chrisl,
	surenb, hanchuanhua

On Wed, Feb 28, 2024 at 10:34 PM David Hildenbrand <david@redhat.com> wrote:
>
> >>>>>
> >>>>> To me, it seems safer to split or do some other similar optimization if we find a
> >>>>> large folio has partial map and unmap.
> >>>>
> >>>> I'm hoping that we can avoid any new direct users of _nr_pages_mapped if
> >>>> possible.
> >>>>
> >>>
> >>> Is _nr_pages_mapped < nr_pages a reasonable case to split as we
> >>> have known the folio has at least some subpages zapped?
> >>
> >> I'm not sure we need this - the folio's presence on the split list will tell us
> >> everything we need to know I think?
> >
> > I agree, this is just one question to David, not my proposal.  if
> > deferred_list is sufficient,
> > I prefer we use deferred_list.
> >
> > I actually don't quite understand why David dislikes _nr_pages_mapped being used
> > though I do think _nr_pages_mapped cannot precisely reflect how a
> > folio is mapped
> > by multi-processes. but _nr_pages_mapped < nr_pages seems be safe to
> > tell the folio
> > is partially unmapped :-)
>
> I'm hoping we can get rid of _nr_pages_mapped in some kernel configs in
> the future (that's what I am working on). So the less we depend on it
> the better.
>
> With the total mapcount patch I'll revive shortly, _nr_pages_mapped will
> only be used inside rmap code. I'm hoping we won't have to introduce
> other users that will be harder to get rid of.
>
> So please, if avoidable, no usage of _nr_pages_mapped outside of core
> rmap code.

Thanks for clarification on the plan. good to use deferred_list in this
swap-out case.

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-02-28 15:12                   ` David Hildenbrand
  2024-02-28 15:18                     ` Ryan Roberts
@ 2024-03-01 16:27                     ` Ryan Roberts
  2024-03-01 16:31                       ` Matthew Wilcox
                                         ` (2 more replies)
  1 sibling, 3 replies; 116+ messages in thread
From: Ryan Roberts @ 2024-03-01 16:27 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: linux-kernel, linux-mm

On 28/02/2024 15:12, David Hildenbrand wrote:
> On 28.02.24 15:57, Ryan Roberts wrote:
>> On 28/02/2024 12:12, David Hildenbrand wrote:
>>>>> How relevant is it? Relevant enough that someone decided to put that
>>>>> optimization in? I don't know :)
>>>>
>>>> I'll have one last go at convincing you: Huang Ying (original author) commented
>>>> "I believe this should be OK.  Better to compare the performance too." at [1].
>>>> That implies to me that perhaps the optimization wasn't in response to a
>>>> specific problem after all. Do you have any thoughts, Huang?
>>>
>>> Might make sense to include that in the patch description!
>>>
>>>> OK so if we really do need to keep this optimization, here are some ideas:
>>>>
>>>> Fundamentally, we would like to be able to figure out the size of the swap slot
>>>> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
>>>> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the
>>>> cluster
>>>> to mark it as PMD_SIZE.
>>>>
>>>> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
>>>> cluster will contain only one size of THPs, but this is not the case when a THP
>>>> in the swapcache gets split or when an order-0 slot gets stolen. We expect
>>>> these
>>>> cases to be rare.
>>>>
>>>> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
>>>> time it will be the full size of the swap entry, but sometimes it will cover
>>>> only a portion. In the latter case you may see a false negative for
>>>> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
>>>> There is one wrinkle: currently the HUGE flag is cleared in
>>>> put_swap_folio(). We
>>>> wouldn't want to do the equivalent in the new scheme (i.e. set the whole
>>>> cluster
>>>> to order-0). I think that is safe, but haven't completely convinced myself yet.
>>>>
>>>> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
>>>> precise information and is conceptually simpler to understand, but will cost
>>>> more memory (half as much as the initial swap_map[] again).
>>>>
>>>> I still prefer to avoid this at all if we can (and would like to hear Huang's
>>>> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
>>>> prototyping.
>>>
>>> Taking a step back: what about we simply batch unmapping of swap entries?
>>>
>>> That is, if we're unmapping a PTE range, we'll collect swap entries (under PT
>>> lock) that reference consecutive swap offsets in the same swap file.
>>
>> Yes in principle, but there are 4 places where free_swap_and_cache() is called,
>> and only 2 of those are really amenable to batching (zap_pte_range() and
>> madvise_free_pte_range()). So the other two users will still take the "slow"
>> path. Maybe those 2 callsites are the only ones that really matter? I can
>> certainly have a stab at this approach.
> 
> We can ignore the s390x one. That s390x code should only apply to KVM guest
> memory where ordinary THP are not even supported. (and nobody uses mTHP there yet).
> 
> Long story short: the VM can hint that some memory pages are now unused and the
> hypervisor can reclaim them. That's what that callback does (zap guest-provided
> guest memory). No need to worry about any batching for now.
> 
> Then, there is the shmem one in shmem_free_swap(). I really don't know how shmem
> handles THP+swapout.
> 
> But looking at shmem_writepage(), we split any large folios before moving them
> to the swapcache, so likely we don't care at all, because THP don't apply.
> 
>>
>>>
>>> There, we can then first decrement all the swap counts, and then try minimizing
>>> how often we actually have to try reclaiming swap space (lookup folio, see it's
>>> a large folio that we cannot reclaim or could reclaim, ...).
>>>
>>> Might need some fine-tuning in swap code to "advance" to the next entry to try
>>> freeing up, but we certainly can do better than what we would do right now.
>>
>> I'm not sure I've understood this. Isn't advancing just a matter of:
>>
>> entry = swp_entry(swp_type(entry), swp_offset(entry) + 1);
> 
> I was talking about the advancing swapslot processing after decrementing the
> swapcounts.
> 
> Assume you decremented 512 swapcounts and some of them went to 0. AFAIU, you'd
> have to start with the first swapslot that has now a swapcount=0 one and try to
> reclaim swap.
> 
> Assume you get a small folio, then you'll have to proceed with the next swap
> slot and try to reclaim swap.
> 
> Assume you get a large folio, then you can skip more swapslots (depending on
> offset into the folio etc).
> 
> If you get what I mean. :)
> 

I've implemented the batching as David suggested, and I'm pretty confident it's
correct. The only problem is that during testing I can't provoke the code to
take the path. I've been pouring through the code but struggling to figure out
under what situation you would expect the swap entry passed to
free_swap_and_cache() to still have a cached folio? Does anyone have any idea?

This is the original (unbatched) function, after my change, which caused David's
concern that we would end up calling __try_to_reclaim_swap() far too much:

int free_swap_and_cache(swp_entry_t entry)
{
	struct swap_info_struct *p;
	unsigned char count;

	if (non_swap_entry(entry))
		return 1;

	p = _swap_info_get(entry);
	if (p) {
		count = __swap_entry_free(p, entry);
		if (count == SWAP_HAS_CACHE)
			__try_to_reclaim_swap(p, swp_offset(entry),
					      TTRS_UNMAPPED | TTRS_FULL);
	}
	return p != NULL;
}

The trouble is, whenever its called, count is always 0, so
__try_to_reclaim_swap() never gets called.

My test case is allocating 1G anon memory, then doing madvise(MADV_PAGEOUT) over
it. Then doing either a munmap() or madvise(MADV_FREE), both of which cause this
function to be called for every PTE, but count is always 0 after
__swap_entry_free() so __try_to_reclaim_swap() is never called. I've tried for
order-0 as well as PTE- and PMD-mapped 2M THP.

I'm guessing the swapcache was already reclaimed as part of MADV_PAGEOUT? I'm
using a block ram device as my backing store - I think this does synchronous IO
so perhaps if I have a real block device with async IO I might have more luck?
Just a guess...

Or perhaps this code path is a corner case? In which case, perhaps its not worth
adding the batching optimization after all?

Thanks,
Ryan


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-03-01 16:27                     ` Ryan Roberts
@ 2024-03-01 16:31                       ` Matthew Wilcox
  2024-03-01 16:44                         ` Ryan Roberts
  2024-03-01 16:31                       ` Ryan Roberts
  2024-03-01 16:32                       ` David Hildenbrand
  2 siblings, 1 reply; 116+ messages in thread
From: Matthew Wilcox @ 2024-03-01 16:31 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: David Hildenbrand, Andrew Morton, Huang Ying, Gao Xiang, Yu Zhao,
	Yang Shi, Michal Hocko, Kefeng Wang, linux-kernel, linux-mm

On Fri, Mar 01, 2024 at 04:27:32PM +0000, Ryan Roberts wrote:
> I've implemented the batching as David suggested, and I'm pretty confident it's
> correct. The only problem is that during testing I can't provoke the code to
> take the path. I've been pouring through the code but struggling to figure out
> under what situation you would expect the swap entry passed to
> free_swap_and_cache() to still have a cached folio? Does anyone have any idea?
> 
> This is the original (unbatched) function, after my change, which caused David's
> concern that we would end up calling __try_to_reclaim_swap() far too much:
> 
> int free_swap_and_cache(swp_entry_t entry)
> {
> 	struct swap_info_struct *p;
> 	unsigned char count;
> 
> 	if (non_swap_entry(entry))
> 		return 1;
> 
> 	p = _swap_info_get(entry);
> 	if (p) {
> 		count = __swap_entry_free(p, entry);
> 		if (count == SWAP_HAS_CACHE)
> 			__try_to_reclaim_swap(p, swp_offset(entry),
> 					      TTRS_UNMAPPED | TTRS_FULL);
> 	}
> 	return p != NULL;
> }
> 
> The trouble is, whenever its called, count is always 0, so
> __try_to_reclaim_swap() never gets called.
> 
> My test case is allocating 1G anon memory, then doing madvise(MADV_PAGEOUT) over
> it. Then doing either a munmap() or madvise(MADV_FREE), both of which cause this
> function to be called for every PTE, but count is always 0 after
> __swap_entry_free() so __try_to_reclaim_swap() is never called. I've tried for
> order-0 as well as PTE- and PMD-mapped 2M THP.

I think you have to page it back in again, then it will have an entry in
the swap cache.  Maybe.  I know little about anon memory ;-)

If that doesn't work, perhaps use tmpfs, and use some memory pressure to
force that to swap?

> I'm guessing the swapcache was already reclaimed as part of MADV_PAGEOUT? I'm
> using a block ram device as my backing store - I think this does synchronous IO
> so perhaps if I have a real block device with async IO I might have more luck?
> Just a guess...
> 
> Or perhaps this code path is a corner case? In which case, perhaps its not worth
> adding the batching optimization after all?
> 
> Thanks,
> Ryan
> 

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-03-01 16:27                     ` Ryan Roberts
  2024-03-01 16:31                       ` Matthew Wilcox
@ 2024-03-01 16:31                       ` Ryan Roberts
  2024-03-01 16:32                       ` David Hildenbrand
  2 siblings, 0 replies; 116+ messages in thread
From: Ryan Roberts @ 2024-03-01 16:31 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: linux-kernel, linux-mm

On 01/03/2024 16:27, Ryan Roberts wrote:
> On 28/02/2024 15:12, David Hildenbrand wrote:
>> On 28.02.24 15:57, Ryan Roberts wrote:
>>> On 28/02/2024 12:12, David Hildenbrand wrote:
>>>>>> How relevant is it? Relevant enough that someone decided to put that
>>>>>> optimization in? I don't know :)
>>>>>
>>>>> I'll have one last go at convincing you: Huang Ying (original author) commented
>>>>> "I believe this should be OK.  Better to compare the performance too." at [1].
>>>>> That implies to me that perhaps the optimization wasn't in response to a
>>>>> specific problem after all. Do you have any thoughts, Huang?
>>>>
>>>> Might make sense to include that in the patch description!
>>>>
>>>>> OK so if we really do need to keep this optimization, here are some ideas:
>>>>>
>>>>> Fundamentally, we would like to be able to figure out the size of the swap slot
>>>>> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
>>>>> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the
>>>>> cluster
>>>>> to mark it as PMD_SIZE.
>>>>>
>>>>> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
>>>>> cluster will contain only one size of THPs, but this is not the case when a THP
>>>>> in the swapcache gets split or when an order-0 slot gets stolen. We expect
>>>>> these
>>>>> cases to be rare.
>>>>>
>>>>> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
>>>>> time it will be the full size of the swap entry, but sometimes it will cover
>>>>> only a portion. In the latter case you may see a false negative for
>>>>> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
>>>>> There is one wrinkle: currently the HUGE flag is cleared in
>>>>> put_swap_folio(). We
>>>>> wouldn't want to do the equivalent in the new scheme (i.e. set the whole
>>>>> cluster
>>>>> to order-0). I think that is safe, but haven't completely convinced myself yet.
>>>>>
>>>>> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
>>>>> precise information and is conceptually simpler to understand, but will cost
>>>>> more memory (half as much as the initial swap_map[] again).
>>>>>
>>>>> I still prefer to avoid this at all if we can (and would like to hear Huang's
>>>>> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
>>>>> prototyping.
>>>>
>>>> Taking a step back: what about we simply batch unmapping of swap entries?
>>>>
>>>> That is, if we're unmapping a PTE range, we'll collect swap entries (under PT
>>>> lock) that reference consecutive swap offsets in the same swap file.
>>>
>>> Yes in principle, but there are 4 places where free_swap_and_cache() is called,
>>> and only 2 of those are really amenable to batching (zap_pte_range() and
>>> madvise_free_pte_range()). So the other two users will still take the "slow"
>>> path. Maybe those 2 callsites are the only ones that really matter? I can
>>> certainly have a stab at this approach.
>>
>> We can ignore the s390x one. That s390x code should only apply to KVM guest
>> memory where ordinary THP are not even supported. (and nobody uses mTHP there yet).
>>
>> Long story short: the VM can hint that some memory pages are now unused and the
>> hypervisor can reclaim them. That's what that callback does (zap guest-provided
>> guest memory). No need to worry about any batching for now.
>>
>> Then, there is the shmem one in shmem_free_swap(). I really don't know how shmem
>> handles THP+swapout.
>>
>> But looking at shmem_writepage(), we split any large folios before moving them
>> to the swapcache, so likely we don't care at all, because THP don't apply.
>>
>>>
>>>>
>>>> There, we can then first decrement all the swap counts, and then try minimizing
>>>> how often we actually have to try reclaiming swap space (lookup folio, see it's
>>>> a large folio that we cannot reclaim or could reclaim, ...).
>>>>
>>>> Might need some fine-tuning in swap code to "advance" to the next entry to try
>>>> freeing up, but we certainly can do better than what we would do right now.
>>>
>>> I'm not sure I've understood this. Isn't advancing just a matter of:
>>>
>>> entry = swp_entry(swp_type(entry), swp_offset(entry) + 1);
>>
>> I was talking about the advancing swapslot processing after decrementing the
>> swapcounts.
>>
>> Assume you decremented 512 swapcounts and some of them went to 0. AFAIU, you'd
>> have to start with the first swapslot that has now a swapcount=0 one and try to
>> reclaim swap.
>>
>> Assume you get a small folio, then you'll have to proceed with the next swap
>> slot and try to reclaim swap.
>>
>> Assume you get a large folio, then you can skip more swapslots (depending on
>> offset into the folio etc).
>>
>> If you get what I mean. :)
>>
> 
> I've implemented the batching as David suggested, and I'm pretty confident it's
> correct. The only problem is that during testing I can't provoke the code to
> take the path. I've been pouring through the code but struggling to figure out
> under what situation you would expect the swap entry passed to
> free_swap_and_cache() to still have a cached folio? Does anyone have any idea?
> 
> This is the original (unbatched) function, after my change, which caused David's
> concern that we would end up calling __try_to_reclaim_swap() far too much:
> 
> int free_swap_and_cache(swp_entry_t entry)
> {
> 	struct swap_info_struct *p;
> 	unsigned char count;
> 
> 	if (non_swap_entry(entry))
> 		return 1;
> 
> 	p = _swap_info_get(entry);
> 	if (p) {
> 		count = __swap_entry_free(p, entry);
> 		if (count == SWAP_HAS_CACHE)
> 			__try_to_reclaim_swap(p, swp_offset(entry),
> 					      TTRS_UNMAPPED | TTRS_FULL);
> 	}
> 	return p != NULL;
> }
> 
> The trouble is, whenever its called, count is always 0, so
> __try_to_reclaim_swap() never gets called.
> 
> My test case is allocating 1G anon memory, then doing madvise(MADV_PAGEOUT) over
> it. Then doing either a munmap() or madvise(MADV_FREE), both of which cause this
> function to be called for every PTE, but count is always 0 after
> __swap_entry_free() so __try_to_reclaim_swap() is never called. I've tried for
> order-0 as well as PTE- and PMD-mapped 2M THP.
> 
> I'm guessing the swapcache was already reclaimed as part of MADV_PAGEOUT? I'm
> using a block ram device as my backing store - I think this does synchronous IO
> so perhaps if I have a real block device with async IO I might have more luck?

Ahh I just switched to SSD as swap device and now its getting called. I guess
that's the reason. Sorry for the noise.

> Just a guess...
> 
> Or perhaps this code path is a corner case? In which case, perhaps its not worth
> adding the batching optimization after all?
> 
> Thanks,
> Ryan
> 


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-03-01 16:27                     ` Ryan Roberts
  2024-03-01 16:31                       ` Matthew Wilcox
  2024-03-01 16:31                       ` Ryan Roberts
@ 2024-03-01 16:32                       ` David Hildenbrand
  2 siblings, 0 replies; 116+ messages in thread
From: David Hildenbrand @ 2024-03-01 16:32 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: linux-kernel, linux-mm

On 01.03.24 17:27, Ryan Roberts wrote:
> On 28/02/2024 15:12, David Hildenbrand wrote:
>> On 28.02.24 15:57, Ryan Roberts wrote:
>>> On 28/02/2024 12:12, David Hildenbrand wrote:
>>>>>> How relevant is it? Relevant enough that someone decided to put that
>>>>>> optimization in? I don't know :)
>>>>>
>>>>> I'll have one last go at convincing you: Huang Ying (original author) commented
>>>>> "I believe this should be OK.  Better to compare the performance too." at [1].
>>>>> That implies to me that perhaps the optimization wasn't in response to a
>>>>> specific problem after all. Do you have any thoughts, Huang?
>>>>
>>>> Might make sense to include that in the patch description!
>>>>
>>>>> OK so if we really do need to keep this optimization, here are some ideas:
>>>>>
>>>>> Fundamentally, we would like to be able to figure out the size of the swap slot
>>>>> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
>>>>> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the
>>>>> cluster
>>>>> to mark it as PMD_SIZE.
>>>>>
>>>>> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
>>>>> cluster will contain only one size of THPs, but this is not the case when a THP
>>>>> in the swapcache gets split or when an order-0 slot gets stolen. We expect
>>>>> these
>>>>> cases to be rare.
>>>>>
>>>>> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
>>>>> time it will be the full size of the swap entry, but sometimes it will cover
>>>>> only a portion. In the latter case you may see a false negative for
>>>>> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
>>>>> There is one wrinkle: currently the HUGE flag is cleared in
>>>>> put_swap_folio(). We
>>>>> wouldn't want to do the equivalent in the new scheme (i.e. set the whole
>>>>> cluster
>>>>> to order-0). I think that is safe, but haven't completely convinced myself yet.
>>>>>
>>>>> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
>>>>> precise information and is conceptually simpler to understand, but will cost
>>>>> more memory (half as much as the initial swap_map[] again).
>>>>>
>>>>> I still prefer to avoid this at all if we can (and would like to hear Huang's
>>>>> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
>>>>> prototyping.
>>>>
>>>> Taking a step back: what about we simply batch unmapping of swap entries?
>>>>
>>>> That is, if we're unmapping a PTE range, we'll collect swap entries (under PT
>>>> lock) that reference consecutive swap offsets in the same swap file.
>>>
>>> Yes in principle, but there are 4 places where free_swap_and_cache() is called,
>>> and only 2 of those are really amenable to batching (zap_pte_range() and
>>> madvise_free_pte_range()). So the other two users will still take the "slow"
>>> path. Maybe those 2 callsites are the only ones that really matter? I can
>>> certainly have a stab at this approach.
>>
>> We can ignore the s390x one. That s390x code should only apply to KVM guest
>> memory where ordinary THP are not even supported. (and nobody uses mTHP there yet).
>>
>> Long story short: the VM can hint that some memory pages are now unused and the
>> hypervisor can reclaim them. That's what that callback does (zap guest-provided
>> guest memory). No need to worry about any batching for now.
>>
>> Then, there is the shmem one in shmem_free_swap(). I really don't know how shmem
>> handles THP+swapout.
>>
>> But looking at shmem_writepage(), we split any large folios before moving them
>> to the swapcache, so likely we don't care at all, because THP don't apply.
>>
>>>
>>>>
>>>> There, we can then first decrement all the swap counts, and then try minimizing
>>>> how often we actually have to try reclaiming swap space (lookup folio, see it's
>>>> a large folio that we cannot reclaim or could reclaim, ...).
>>>>
>>>> Might need some fine-tuning in swap code to "advance" to the next entry to try
>>>> freeing up, but we certainly can do better than what we would do right now.
>>>
>>> I'm not sure I've understood this. Isn't advancing just a matter of:
>>>
>>> entry = swp_entry(swp_type(entry), swp_offset(entry) + 1);
>>
>> I was talking about the advancing swapslot processing after decrementing the
>> swapcounts.
>>
>> Assume you decremented 512 swapcounts and some of them went to 0. AFAIU, you'd
>> have to start with the first swapslot that has now a swapcount=0 one and try to
>> reclaim swap.
>>
>> Assume you get a small folio, then you'll have to proceed with the next swap
>> slot and try to reclaim swap.
>>
>> Assume you get a large folio, then you can skip more swapslots (depending on
>> offset into the folio etc).
>>
>> If you get what I mean. :)
>>
> 
> I've implemented the batching as David suggested, and I'm pretty confident it's
> correct. The only problem is that during testing I can't provoke the code to
> take the path. I've been pouring through the code but struggling to figure out
> under what situation you would expect the swap entry passed to
> free_swap_and_cache() to still have a cached folio? Does anyone have any idea?
> 
> This is the original (unbatched) function, after my change, which caused David's
> concern that we would end up calling __try_to_reclaim_swap() far too much:
> 
> int free_swap_and_cache(swp_entry_t entry)
> {
> 	struct swap_info_struct *p;
> 	unsigned char count;
> 
> 	if (non_swap_entry(entry))
> 		return 1;
> 
> 	p = _swap_info_get(entry);
> 	if (p) {
> 		count = __swap_entry_free(p, entry);
> 		if (count == SWAP_HAS_CACHE)
> 			__try_to_reclaim_swap(p, swp_offset(entry),
> 					      TTRS_UNMAPPED | TTRS_FULL);
> 	}
> 	return p != NULL;
> }
> 
> The trouble is, whenever its called, count is always 0, so
> __try_to_reclaim_swap() never gets called.
> 
> My test case is allocating 1G anon memory, then doing madvise(MADV_PAGEOUT) over
> it. Then doing either a munmap() or madvise(MADV_FREE), both of which cause this
> function to be called for every PTE, but count is always 0 after
> __swap_entry_free() so __try_to_reclaim_swap() is never called. I've tried for
> order-0 as well as PTE- and PMD-mapped 2M THP.
> 
> I'm guessing the swapcache was already reclaimed as part of MADV_PAGEOUT? I'm
> using a block ram device as my backing store - I think this does synchronous IO
> so perhaps if I have a real block device with async IO I might have more luck?
> Just a guess...
> 
> Or perhaps this code path is a corner case? In which case, perhaps its not worth
> adding the batching optimization after all?

I had to disable zswap in the past and was able to trigger this reliably 
with an ordinary swap backend (e.g., proper disk).

Whenever you involve swap-to-ram, you might just get it reclaimed 
immediately.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-03-01 16:31                       ` Matthew Wilcox
@ 2024-03-01 16:44                         ` Ryan Roberts
  2024-03-01 17:00                           ` David Hildenbrand
  2024-03-01 17:06                           ` Ryan Roberts
  0 siblings, 2 replies; 116+ messages in thread
From: Ryan Roberts @ 2024-03-01 16:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Hildenbrand, Andrew Morton, Huang Ying, Gao Xiang, Yu Zhao,
	Yang Shi, Michal Hocko, Kefeng Wang, linux-kernel, linux-mm

On 01/03/2024 16:31, Matthew Wilcox wrote:
> On Fri, Mar 01, 2024 at 04:27:32PM +0000, Ryan Roberts wrote:
>> I've implemented the batching as David suggested, and I'm pretty confident it's
>> correct. The only problem is that during testing I can't provoke the code to
>> take the path. I've been pouring through the code but struggling to figure out
>> under what situation you would expect the swap entry passed to
>> free_swap_and_cache() to still have a cached folio? Does anyone have any idea?
>>
>> This is the original (unbatched) function, after my change, which caused David's
>> concern that we would end up calling __try_to_reclaim_swap() far too much:
>>
>> int free_swap_and_cache(swp_entry_t entry)
>> {
>> 	struct swap_info_struct *p;
>> 	unsigned char count;
>>
>> 	if (non_swap_entry(entry))
>> 		return 1;
>>
>> 	p = _swap_info_get(entry);
>> 	if (p) {
>> 		count = __swap_entry_free(p, entry);
>> 		if (count == SWAP_HAS_CACHE)
>> 			__try_to_reclaim_swap(p, swp_offset(entry),
>> 					      TTRS_UNMAPPED | TTRS_FULL);
>> 	}
>> 	return p != NULL;
>> }
>>
>> The trouble is, whenever its called, count is always 0, so
>> __try_to_reclaim_swap() never gets called.
>>
>> My test case is allocating 1G anon memory, then doing madvise(MADV_PAGEOUT) over
>> it. Then doing either a munmap() or madvise(MADV_FREE), both of which cause this
>> function to be called for every PTE, but count is always 0 after
>> __swap_entry_free() so __try_to_reclaim_swap() is never called. I've tried for
>> order-0 as well as PTE- and PMD-mapped 2M THP.
> 
> I think you have to page it back in again, then it will have an entry in
> the swap cache.  Maybe.  I know little about anon memory ;-)

Ahh, I was under the impression that the original folio is put into the swap
cache at swap out, then (I guess) its removed once the IO is complete? I'm sure
I'm miles out... what exactly is the lifecycle of a folio going through swap out?

I guess I can try forking after swap out, then fault it back in in the child and
exit. Then do the munmap in the parent. I guess that could force it? Thanks for
the tip - I'll have a play.

> 
> If that doesn't work, perhaps use tmpfs, and use some memory pressure to
> force that to swap?
> 
>> I'm guessing the swapcache was already reclaimed as part of MADV_PAGEOUT? I'm
>> using a block ram device as my backing store - I think this does synchronous IO
>> so perhaps if I have a real block device with async IO I might have more luck?
>> Just a guess...
>>
>> Or perhaps this code path is a corner case? In which case, perhaps its not worth
>> adding the batching optimization after all?
>>
>> Thanks,
>> Ryan
>>


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-03-01 16:44                         ` Ryan Roberts
@ 2024-03-01 17:00                           ` David Hildenbrand
  2024-03-01 17:14                             ` Ryan Roberts
  2024-03-01 17:06                           ` Ryan Roberts
  1 sibling, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2024-03-01 17:00 UTC (permalink / raw)
  To: Ryan Roberts, Matthew Wilcox
  Cc: Andrew Morton, Huang Ying, Gao Xiang, Yu Zhao, Yang Shi,
	Michal Hocko, Kefeng Wang, linux-kernel, linux-mm

On 01.03.24 17:44, Ryan Roberts wrote:
> On 01/03/2024 16:31, Matthew Wilcox wrote:
>> On Fri, Mar 01, 2024 at 04:27:32PM +0000, Ryan Roberts wrote:
>>> I've implemented the batching as David suggested, and I'm pretty confident it's
>>> correct. The only problem is that during testing I can't provoke the code to
>>> take the path. I've been pouring through the code but struggling to figure out
>>> under what situation you would expect the swap entry passed to
>>> free_swap_and_cache() to still have a cached folio? Does anyone have any idea?
>>>
>>> This is the original (unbatched) function, after my change, which caused David's
>>> concern that we would end up calling __try_to_reclaim_swap() far too much:
>>>
>>> int free_swap_and_cache(swp_entry_t entry)
>>> {
>>> 	struct swap_info_struct *p;
>>> 	unsigned char count;
>>>
>>> 	if (non_swap_entry(entry))
>>> 		return 1;
>>>
>>> 	p = _swap_info_get(entry);
>>> 	if (p) {
>>> 		count = __swap_entry_free(p, entry);
>>> 		if (count == SWAP_HAS_CACHE)
>>> 			__try_to_reclaim_swap(p, swp_offset(entry),
>>> 					      TTRS_UNMAPPED | TTRS_FULL);
>>> 	}
>>> 	return p != NULL;
>>> }
>>>
>>> The trouble is, whenever its called, count is always 0, so
>>> __try_to_reclaim_swap() never gets called.
>>>
>>> My test case is allocating 1G anon memory, then doing madvise(MADV_PAGEOUT) over
>>> it. Then doing either a munmap() or madvise(MADV_FREE), both of which cause this
>>> function to be called for every PTE, but count is always 0 after
>>> __swap_entry_free() so __try_to_reclaim_swap() is never called. I've tried for
>>> order-0 as well as PTE- and PMD-mapped 2M THP.
>>
>> I think you have to page it back in again, then it will have an entry in
>> the swap cache.  Maybe.  I know little about anon memory ;-)
> 
> Ahh, I was under the impression that the original folio is put into the swap
> cache at swap out, then (I guess) its removed once the IO is complete? I'm sure
> I'm miles out... what exactly is the lifecycle of a folio going through swap out?

I thought with most (disk) backends you will add it to the swapcache and 
leave it there until there is actual memory pressure. Only then, under 
memory pressure, you'd actually reclaim the folio.

You can fault it back in from the swapcache without having to go to disk.

That's how you can today end up with a THP in the swapcache: during 
swapin from disk (after the folio was reclaimed) you'd currently only 
get order-0 folios.

At least that was my assumption with my MADV_PAGEOUT testing so far :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-03-01 16:44                         ` Ryan Roberts
  2024-03-01 17:00                           ` David Hildenbrand
@ 2024-03-01 17:06                           ` Ryan Roberts
  2024-03-04  4:52                             ` Barry Song
  1 sibling, 1 reply; 116+ messages in thread
From: Ryan Roberts @ 2024-03-01 17:06 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Hildenbrand, Andrew Morton, Huang Ying, Gao Xiang, Yu Zhao,
	Yang Shi, Michal Hocko, Kefeng Wang, linux-kernel, linux-mm

On 01/03/2024 16:44, Ryan Roberts wrote:
> On 01/03/2024 16:31, Matthew Wilcox wrote:
>> On Fri, Mar 01, 2024 at 04:27:32PM +0000, Ryan Roberts wrote:
>>> I've implemented the batching as David suggested, and I'm pretty confident it's
>>> correct. The only problem is that during testing I can't provoke the code to
>>> take the path. I've been pouring through the code but struggling to figure out
>>> under what situation you would expect the swap entry passed to
>>> free_swap_and_cache() to still have a cached folio? Does anyone have any idea?
>>>
>>> This is the original (unbatched) function, after my change, which caused David's
>>> concern that we would end up calling __try_to_reclaim_swap() far too much:
>>>
>>> int free_swap_and_cache(swp_entry_t entry)
>>> {
>>> 	struct swap_info_struct *p;
>>> 	unsigned char count;
>>>
>>> 	if (non_swap_entry(entry))
>>> 		return 1;
>>>
>>> 	p = _swap_info_get(entry);
>>> 	if (p) {
>>> 		count = __swap_entry_free(p, entry);
>>> 		if (count == SWAP_HAS_CACHE)
>>> 			__try_to_reclaim_swap(p, swp_offset(entry),
>>> 					      TTRS_UNMAPPED | TTRS_FULL);
>>> 	}
>>> 	return p != NULL;
>>> }
>>>
>>> The trouble is, whenever its called, count is always 0, so
>>> __try_to_reclaim_swap() never gets called.
>>>
>>> My test case is allocating 1G anon memory, then doing madvise(MADV_PAGEOUT) over
>>> it. Then doing either a munmap() or madvise(MADV_FREE), both of which cause this
>>> function to be called for every PTE, but count is always 0 after
>>> __swap_entry_free() so __try_to_reclaim_swap() is never called. I've tried for
>>> order-0 as well as PTE- and PMD-mapped 2M THP.
>>
>> I think you have to page it back in again, then it will have an entry in
>> the swap cache.  Maybe.  I know little about anon memory ;-)
> 
> Ahh, I was under the impression that the original folio is put into the swap
> cache at swap out, then (I guess) its removed once the IO is complete? I'm sure
> I'm miles out... what exactly is the lifecycle of a folio going through swap out?
> 
> I guess I can try forking after swap out, then fault it back in in the child and
> exit. Then do the munmap in the parent. I guess that could force it? Thanks for
> the tip - I'll have a play.

That has sort of solved it, the only problem now is that all the folios in the
swap cache are small (because I don't have Barry's large swap-in series). So
really I need to figure out how to avoid removing the folio from the cache in
the first place...

> 
>>
>> If that doesn't work, perhaps use tmpfs, and use some memory pressure to
>> force that to swap?
>>
>>> I'm guessing the swapcache was already reclaimed as part of MADV_PAGEOUT? I'm
>>> using a block ram device as my backing store - I think this does synchronous IO
>>> so perhaps if I have a real block device with async IO I might have more luck?
>>> Just a guess...
>>>
>>> Or perhaps this code path is a corner case? In which case, perhaps its not worth
>>> adding the batching optimization after all?
>>>
>>> Thanks,
>>> Ryan
>>>
> 


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-03-01 17:00                           ` David Hildenbrand
@ 2024-03-01 17:14                             ` Ryan Roberts
  2024-03-01 17:18                               ` David Hildenbrand
  0 siblings, 1 reply; 116+ messages in thread
From: Ryan Roberts @ 2024-03-01 17:14 UTC (permalink / raw)
  To: David Hildenbrand, Matthew Wilcox
  Cc: Andrew Morton, Huang Ying, Gao Xiang, Yu Zhao, Yang Shi,
	Michal Hocko, Kefeng Wang, linux-kernel, linux-mm

On 01/03/2024 17:00, David Hildenbrand wrote:
> On 01.03.24 17:44, Ryan Roberts wrote:
>> On 01/03/2024 16:31, Matthew Wilcox wrote:
>>> On Fri, Mar 01, 2024 at 04:27:32PM +0000, Ryan Roberts wrote:
>>>> I've implemented the batching as David suggested, and I'm pretty confident it's
>>>> correct. The only problem is that during testing I can't provoke the code to
>>>> take the path. I've been pouring through the code but struggling to figure out
>>>> under what situation you would expect the swap entry passed to
>>>> free_swap_and_cache() to still have a cached folio? Does anyone have any idea?
>>>>
>>>> This is the original (unbatched) function, after my change, which caused
>>>> David's
>>>> concern that we would end up calling __try_to_reclaim_swap() far too much:
>>>>
>>>> int free_swap_and_cache(swp_entry_t entry)
>>>> {
>>>>     struct swap_info_struct *p;
>>>>     unsigned char count;
>>>>
>>>>     if (non_swap_entry(entry))
>>>>         return 1;
>>>>
>>>>     p = _swap_info_get(entry);
>>>>     if (p) {
>>>>         count = __swap_entry_free(p, entry);
>>>>         if (count == SWAP_HAS_CACHE)
>>>>             __try_to_reclaim_swap(p, swp_offset(entry),
>>>>                           TTRS_UNMAPPED | TTRS_FULL);
>>>>     }
>>>>     return p != NULL;
>>>> }
>>>>
>>>> The trouble is, whenever its called, count is always 0, so
>>>> __try_to_reclaim_swap() never gets called.
>>>>
>>>> My test case is allocating 1G anon memory, then doing madvise(MADV_PAGEOUT)
>>>> over
>>>> it. Then doing either a munmap() or madvise(MADV_FREE), both of which cause
>>>> this
>>>> function to be called for every PTE, but count is always 0 after
>>>> __swap_entry_free() so __try_to_reclaim_swap() is never called. I've tried for
>>>> order-0 as well as PTE- and PMD-mapped 2M THP.
>>>
>>> I think you have to page it back in again, then it will have an entry in
>>> the swap cache.  Maybe.  I know little about anon memory ;-)
>>
>> Ahh, I was under the impression that the original folio is put into the swap
>> cache at swap out, then (I guess) its removed once the IO is complete? I'm sure
>> I'm miles out... what exactly is the lifecycle of a folio going through swap out?
> 
> I thought with most (disk) backends you will add it to the swapcache and leave
> it there until there is actual memory pressure. Only then, under memory
> pressure, you'd actually reclaim the folio.

OK, my problem is that I'm using a VM, whose disk shows up as rotating media, so
the swap subsystem refuses to swap out THPs to that and they get split. To solve
that, (and to speed up testing) I moved to the block ram disk, which convinces
swap to swap-out THPs. But that causes the folios to be removed from the swap
cache (I assumed because its syncrhonous, but maybe there is a flag somewhere to
affect that behavior?) And I can't convince QEMU to emulate an SSD to the guest
under macos. Perhaps the easiest thing is to hack it to ignore the rotating
media flag.

> 
> You can fault it back in from the swapcache without having to go to disk.
> 
> That's how you can today end up with a THP in the swapcache: during swapin from
> disk (after the folio was reclaimed) you'd currently only get order-0 folios.
> 
> At least that was my assumption with my MADV_PAGEOUT testing so far :)
> 


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-03-01 17:14                             ` Ryan Roberts
@ 2024-03-01 17:18                               ` David Hildenbrand
  0 siblings, 0 replies; 116+ messages in thread
From: David Hildenbrand @ 2024-03-01 17:18 UTC (permalink / raw)
  To: Ryan Roberts, Matthew Wilcox
  Cc: Andrew Morton, Huang Ying, Gao Xiang, Yu Zhao, Yang Shi,
	Michal Hocko, Kefeng Wang, linux-kernel, linux-mm

On 01.03.24 18:14, Ryan Roberts wrote:
> On 01/03/2024 17:00, David Hildenbrand wrote:
>> On 01.03.24 17:44, Ryan Roberts wrote:
>>> On 01/03/2024 16:31, Matthew Wilcox wrote:
>>>> On Fri, Mar 01, 2024 at 04:27:32PM +0000, Ryan Roberts wrote:
>>>>> I've implemented the batching as David suggested, and I'm pretty confident it's
>>>>> correct. The only problem is that during testing I can't provoke the code to
>>>>> take the path. I've been pouring through the code but struggling to figure out
>>>>> under what situation you would expect the swap entry passed to
>>>>> free_swap_and_cache() to still have a cached folio? Does anyone have any idea?
>>>>>
>>>>> This is the original (unbatched) function, after my change, which caused
>>>>> David's
>>>>> concern that we would end up calling __try_to_reclaim_swap() far too much:
>>>>>
>>>>> int free_swap_and_cache(swp_entry_t entry)
>>>>> {
>>>>>      struct swap_info_struct *p;
>>>>>      unsigned char count;
>>>>>
>>>>>      if (non_swap_entry(entry))
>>>>>          return 1;
>>>>>
>>>>>      p = _swap_info_get(entry);
>>>>>      if (p) {
>>>>>          count = __swap_entry_free(p, entry);
>>>>>          if (count == SWAP_HAS_CACHE)
>>>>>              __try_to_reclaim_swap(p, swp_offset(entry),
>>>>>                            TTRS_UNMAPPED | TTRS_FULL);
>>>>>      }
>>>>>      return p != NULL;
>>>>> }
>>>>>
>>>>> The trouble is, whenever its called, count is always 0, so
>>>>> __try_to_reclaim_swap() never gets called.
>>>>>
>>>>> My test case is allocating 1G anon memory, then doing madvise(MADV_PAGEOUT)
>>>>> over
>>>>> it. Then doing either a munmap() or madvise(MADV_FREE), both of which cause
>>>>> this
>>>>> function to be called for every PTE, but count is always 0 after
>>>>> __swap_entry_free() so __try_to_reclaim_swap() is never called. I've tried for
>>>>> order-0 as well as PTE- and PMD-mapped 2M THP.
>>>>
>>>> I think you have to page it back in again, then it will have an entry in
>>>> the swap cache.  Maybe.  I know little about anon memory ;-)
>>>
>>> Ahh, I was under the impression that the original folio is put into the swap
>>> cache at swap out, then (I guess) its removed once the IO is complete? I'm sure
>>> I'm miles out... what exactly is the lifecycle of a folio going through swap out?
>>
>> I thought with most (disk) backends you will add it to the swapcache and leave
>> it there until there is actual memory pressure. Only then, under memory
>> pressure, you'd actually reclaim the folio.
> 
> OK, my problem is that I'm using a VM, whose disk shows up as rotating media, so
> the swap subsystem refuses to swap out THPs to that and they get split. To solve
> that, (and to speed up testing) I moved to the block ram disk, which convinces
> swap to swap-out THPs. But that causes the folios to be removed from the swap
> cache (I assumed because its syncrhonous, but maybe there is a flag somewhere to
> affect that behavior?) And I can't convince QEMU to emulate an SSD to the guest
> under macos. Perhaps the easiest thing is to hack it to ignore the rotating
> media flag.

I'm trying to remember how I triggered it in the past, I thought cow.c 
selftest was able to do that.

What certainly works is taking a reference on the page using vmsplice() 
and then doing the MADV_PAGEOUT. But there has to be a better way :)

I'll dig on Monday!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-03-01 17:06                           ` Ryan Roberts
@ 2024-03-04  4:52                             ` Barry Song
  2024-03-04  5:42                               ` Barry Song
  0 siblings, 1 reply; 116+ messages in thread
From: Barry Song @ 2024-03-04  4:52 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Matthew Wilcox, David Hildenbrand, Andrew Morton, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	linux-kernel, linux-mm

On Sat, Mar 2, 2024 at 6:08 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 01/03/2024 16:44, Ryan Roberts wrote:
> > On 01/03/2024 16:31, Matthew Wilcox wrote:
> >> On Fri, Mar 01, 2024 at 04:27:32PM +0000, Ryan Roberts wrote:
> >>> I've implemented the batching as David suggested, and I'm pretty confident it's
> >>> correct. The only problem is that during testing I can't provoke the code to
> >>> take the path. I've been pouring through the code but struggling to figure out
> >>> under what situation you would expect the swap entry passed to
> >>> free_swap_and_cache() to still have a cached folio? Does anyone have any idea?
> >>>
> >>> This is the original (unbatched) function, after my change, which caused David's
> >>> concern that we would end up calling __try_to_reclaim_swap() far too much:
> >>>
> >>> int free_swap_and_cache(swp_entry_t entry)
> >>> {
> >>>     struct swap_info_struct *p;
> >>>     unsigned char count;
> >>>
> >>>     if (non_swap_entry(entry))
> >>>             return 1;
> >>>
> >>>     p = _swap_info_get(entry);
> >>>     if (p) {
> >>>             count = __swap_entry_free(p, entry);
> >>>             if (count == SWAP_HAS_CACHE)
> >>>                     __try_to_reclaim_swap(p, swp_offset(entry),
> >>>                                           TTRS_UNMAPPED | TTRS_FULL);
> >>>     }
> >>>     return p != NULL;
> >>> }
> >>>
> >>> The trouble is, whenever its called, count is always 0, so
> >>> __try_to_reclaim_swap() never gets called.
> >>>
> >>> My test case is allocating 1G anon memory, then doing madvise(MADV_PAGEOUT) over
> >>> it. Then doing either a munmap() or madvise(MADV_FREE), both of which cause this
> >>> function to be called for every PTE, but count is always 0 after
> >>> __swap_entry_free() so __try_to_reclaim_swap() is never called. I've tried for
> >>> order-0 as well as PTE- and PMD-mapped 2M THP.
> >>
> >> I think you have to page it back in again, then it will have an entry in
> >> the swap cache.  Maybe.  I know little about anon memory ;-)
> >
> > Ahh, I was under the impression that the original folio is put into the swap
> > cache at swap out, then (I guess) its removed once the IO is complete? I'm sure
> > I'm miles out... what exactly is the lifecycle of a folio going through swap out?
> >
> > I guess I can try forking after swap out, then fault it back in in the child and
> > exit. Then do the munmap in the parent. I guess that could force it? Thanks for
> > the tip - I'll have a play.
>
> That has sort of solved it, the only problem now is that all the folios in the
> swap cache are small (because I don't have Barry's large swap-in series). So
> really I need to figure out how to avoid removing the folio from the cache in
> the first place...

I am quite sure we have a chance to hit a large swapcache even using zRAM -
a sync swapfile and even during swap-out.

I have a test case as below,
1. two threads to run MADV_PAGEOUT
2. two threads to read data being swapped-out

in do_swap_page, from time to time, I can get a large swapcache.

We have a short time window after add_to_swap() and before
__removing_mapping() of
vmscan,  a large folio is still in swapcache.

So Ryan, I guess you can trigger this by adding one more thread of
MADV_DONTNEED to do zap_pte_range?


>
> >
> >>
> >> If that doesn't work, perhaps use tmpfs, and use some memory pressure to
> >> force that to swap?
> >>
> >>> I'm guessing the swapcache was already reclaimed as part of MADV_PAGEOUT? I'm
> >>> using a block ram device as my backing store - I think this does synchronous IO
> >>> so perhaps if I have a real block device with async IO I might have more luck?
> >>> Just a guess...
> >>>
> >>> Or perhaps this code path is a corner case? In which case, perhaps its not worth
> >>> adding the batching optimization after all?
> >>>
> >>> Thanks,
> >>> Ryan
> >>>
> >

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-03-04  4:52                             ` Barry Song
@ 2024-03-04  5:42                               ` Barry Song
  2024-03-05  7:41                                 ` Ryan Roberts
  0 siblings, 1 reply; 116+ messages in thread
From: Barry Song @ 2024-03-04  5:42 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Matthew Wilcox, David Hildenbrand, Andrew Morton, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	linux-kernel, linux-mm

On Mon, Mar 4, 2024 at 5:52 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, Mar 2, 2024 at 6:08 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> > On 01/03/2024 16:44, Ryan Roberts wrote:
> > > On 01/03/2024 16:31, Matthew Wilcox wrote:
> > >> On Fri, Mar 01, 2024 at 04:27:32PM +0000, Ryan Roberts wrote:
> > >>> I've implemented the batching as David suggested, and I'm pretty confident it's
> > >>> correct. The only problem is that during testing I can't provoke the code to
> > >>> take the path. I've been pouring through the code but struggling to figure out
> > >>> under what situation you would expect the swap entry passed to
> > >>> free_swap_and_cache() to still have a cached folio? Does anyone have any idea?
> > >>>
> > >>> This is the original (unbatched) function, after my change, which caused David's
> > >>> concern that we would end up calling __try_to_reclaim_swap() far too much:
> > >>>
> > >>> int free_swap_and_cache(swp_entry_t entry)
> > >>> {
> > >>>     struct swap_info_struct *p;
> > >>>     unsigned char count;
> > >>>
> > >>>     if (non_swap_entry(entry))
> > >>>             return 1;
> > >>>
> > >>>     p = _swap_info_get(entry);
> > >>>     if (p) {
> > >>>             count = __swap_entry_free(p, entry);
> > >>>             if (count == SWAP_HAS_CACHE)
> > >>>                     __try_to_reclaim_swap(p, swp_offset(entry),
> > >>>                                           TTRS_UNMAPPED | TTRS_FULL);
> > >>>     }
> > >>>     return p != NULL;
> > >>> }
> > >>>
> > >>> The trouble is, whenever its called, count is always 0, so
> > >>> __try_to_reclaim_swap() never gets called.
> > >>>
> > >>> My test case is allocating 1G anon memory, then doing madvise(MADV_PAGEOUT) over
> > >>> it. Then doing either a munmap() or madvise(MADV_FREE), both of which cause this
> > >>> function to be called for every PTE, but count is always 0 after
> > >>> __swap_entry_free() so __try_to_reclaim_swap() is never called. I've tried for
> > >>> order-0 as well as PTE- and PMD-mapped 2M THP.
> > >>
> > >> I think you have to page it back in again, then it will have an entry in
> > >> the swap cache.  Maybe.  I know little about anon memory ;-)
> > >
> > > Ahh, I was under the impression that the original folio is put into the swap
> > > cache at swap out, then (I guess) its removed once the IO is complete? I'm sure
> > > I'm miles out... what exactly is the lifecycle of a folio going through swap out?
> > >
> > > I guess I can try forking after swap out, then fault it back in in the child and
> > > exit. Then do the munmap in the parent. I guess that could force it? Thanks for
> > > the tip - I'll have a play.
> >
> > That has sort of solved it, the only problem now is that all the folios in the
> > swap cache are small (because I don't have Barry's large swap-in series). So
> > really I need to figure out how to avoid removing the folio from the cache in
> > the first place...
>
> I am quite sure we have a chance to hit a large swapcache even using zRAM -
> a sync swapfile and even during swap-out.
>
> I have a test case as below,
> 1. two threads to run MADV_PAGEOUT
> 2. two threads to read data being swapped-out
>
> in do_swap_page, from time to time, I can get a large swapcache.
>
> We have a short time window after add_to_swap() and before
> __removing_mapping() of
> vmscan,  a large folio is still in swapcache.
>
> So Ryan, I guess you can trigger this by adding one more thread of
> MADV_DONTNEED to do zap_pte_range?

Ryan, I have modified my test case to have 4 threads:
1. MADV_PAGEOUT
2. MADV_DONTNEED
3. write data
4. read data

and git push the code here so that you can get it,
https://github.com/BarrySong666/swaptest/blob/main/swptest.c

I can reproduce the issue in zap_pte_range() in just a couple of minutes.

>
>
> >
> > >
> > >>
> > >> If that doesn't work, perhaps use tmpfs, and use some memory pressure to
> > >> force that to swap?
> > >>
> > >>> I'm guessing the swapcache was already reclaimed as part of MADV_PAGEOUT? I'm
> > >>> using a block ram device as my backing store - I think this does synchronous IO
> > >>> so perhaps if I have a real block device with async IO I might have more luck?
> > >>> Just a guess...
> > >>>
> > >>> Or perhaps this code path is a corner case? In which case, perhaps its not worth
> > >>> adding the batching optimization after all?
> > >>>
> > >>> Thanks,
> > >>> Ryan
> > >>>
> > >

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-02-28 12:12               ` David Hildenbrand
  2024-02-28 14:57                 ` Ryan Roberts
@ 2024-03-04 16:03                 ` Ryan Roberts
  2024-03-04 17:30                   ` David Hildenbrand
  1 sibling, 1 reply; 116+ messages in thread
From: Ryan Roberts @ 2024-03-04 16:03 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: linux-kernel, linux-mm

On 28/02/2024 12:12, David Hildenbrand wrote:
>>> How relevant is it? Relevant enough that someone decided to put that
>>> optimization in? I don't know :)
>>
>> I'll have one last go at convincing you: Huang Ying (original author) commented
>> "I believe this should be OK.  Better to compare the performance too." at [1].
>> That implies to me that perhaps the optimization wasn't in response to a
>> specific problem after all. Do you have any thoughts, Huang?
> 
> Might make sense to include that in the patch description!
> 
>> OK so if we really do need to keep this optimization, here are some ideas:
>>
>> Fundamentally, we would like to be able to figure out the size of the swap slot
>> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
>> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the cluster
>> to mark it as PMD_SIZE.
>>
>> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
>> cluster will contain only one size of THPs, but this is not the case when a THP
>> in the swapcache gets split or when an order-0 slot gets stolen. We expect these
>> cases to be rare.
>>
>> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
>> time it will be the full size of the swap entry, but sometimes it will cover
>> only a portion. In the latter case you may see a false negative for
>> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
>> There is one wrinkle: currently the HUGE flag is cleared in put_swap_folio(). We
>> wouldn't want to do the equivalent in the new scheme (i.e. set the whole cluster
>> to order-0). I think that is safe, but haven't completely convinced myself yet.
>>
>> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
>> precise information and is conceptually simpler to understand, but will cost
>> more memory (half as much as the initial swap_map[] again).
>>
>> I still prefer to avoid this at all if we can (and would like to hear Huang's
>> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
>> prototyping.
> 
> Taking a step back: what about we simply batch unmapping of swap entries?
> 
> That is, if we're unmapping a PTE range, we'll collect swap entries (under PT
> lock) that reference consecutive swap offsets in the same swap file.
> 
> There, we can then first decrement all the swap counts, and then try minimizing
> how often we actually have to try reclaiming swap space (lookup folio, see it's
> a large folio that we cannot reclaim or could reclaim, ...).
> 
> Might need some fine-tuning in swap code to "advance" to the next entry to try
> freeing up, but we certainly can do better than what we would do right now.
> 

Hi,

I'm struggling to convince myself that free_swap_and_cache() can't race with
with swapoff(). Can anyone explain that this is safe?

I *think* they are both serialized by the PTL, since all callers of
free_swap_and_cache() (except shmem) have the PTL, and swapoff() calls
try_to_unuse() early on, which takes the PTL as it iterates over every vma in
every mm. It looks like shmem is handled specially by a call to shmem_unuse(),
but I can't see the exact serialization mechanism.

I've implemented a batching function, as David suggested above, but I'm trying
to convince myself that it is safe for it to access si->swap_map[] without a
lock (i.e. that swapoff() can't concurrently free it). But I think
free_swap_and_cache() as it already exists depends on being able to access the
si without an explicit lock, so I'm assuming the same mechanism will protect my
new changes. But I want to be sure I understand the mechanism...


This is the existing free_swap_and_cache(). I think _swap_info_get() would break
if this could race with swapoff(), and __swap_entry_free() looks up the cluster
from an array, which would also be freed by swapoff if racing:

int free_swap_and_cache(swp_entry_t entry)
{
	struct swap_info_struct *p;
	unsigned char count;

	if (non_swap_entry(entry))
		return 1;

	p = _swap_info_get(entry);
	if (p) {
		count = __swap_entry_free(p, entry);
		if (count == SWAP_HAS_CACHE)
			__try_to_reclaim_swap(p, swp_offset(entry),
					      TTRS_UNMAPPED | TTRS_FULL);
	}
	return p != NULL;
}


This is my new function. I want to be sure that it's safe to do the
READ_ONCE(si->swap_info[...]):

void free_swap_and_cache_nr(swp_entry_t entry, int nr)
{
	unsigned long end = swp_offset(entry) + nr;
	unsigned type = swp_type(entry);
	struct swap_info_struct *si;
	unsigned long offset;

	if (non_swap_entry(entry))
		return;

	si = _swap_info_get(entry);
	if (!si || end > si->max)
		return;

	/*
	 * First free all entries in the range.
	 */
	for (offset = swp_offset(entry); offset < end; offset++) {
		VM_WARN_ON(data_race(!si->swap_map[offset]));
		__swap_entry_free(si, swp_entry(type, offset));
	}

	/*
	 * Now go back over the range trying to reclaim the swap cache. This is
	 * more efficient for large folios because we will only try to reclaim
	 * the swap once per folio in the common case. If we do
	 * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the
	 * latter will get a reference and lock the folio for every individual
	 * page but will only succeed once the swap slot for every subpage is
	 * zero.
	 */
	for (offset = swp_offset(entry); offset < end; offset += nr) {
		nr = 1;
		if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) { << HERE
			/*
			 * Folios are always naturally aligned in swap so
			 * advance forward to the next boundary. Zero means no
			 * folio was found for the swap entry, so advance by 1
			 * in this case. Negative value means folio was found
			 * but could not be reclaimed. Here we can still advance
			 * to the next boundary.
			 */
			nr = __try_to_reclaim_swap(si, offset,
					      TTRS_UNMAPPED | TTRS_FULL);
			if (nr == 0)
				nr = 1;
			else if (nr < 0)
				nr = -nr;
			nr = ALIGN(offset + 1, nr) - offset;
		}
	}
}

Thanks,
Ryan


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-03-04 16:03                 ` Ryan Roberts
@ 2024-03-04 17:30                   ` David Hildenbrand
  2024-03-04 18:38                     ` Ryan Roberts
  0 siblings, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2024-03-04 17:30 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: linux-kernel, linux-mm

On 04.03.24 17:03, Ryan Roberts wrote:
> On 28/02/2024 12:12, David Hildenbrand wrote:
>>>> How relevant is it? Relevant enough that someone decided to put that
>>>> optimization in? I don't know :)
>>>
>>> I'll have one last go at convincing you: Huang Ying (original author) commented
>>> "I believe this should be OK.  Better to compare the performance too." at [1].
>>> That implies to me that perhaps the optimization wasn't in response to a
>>> specific problem after all. Do you have any thoughts, Huang?
>>
>> Might make sense to include that in the patch description!
>>
>>> OK so if we really do need to keep this optimization, here are some ideas:
>>>
>>> Fundamentally, we would like to be able to figure out the size of the swap slot
>>> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
>>> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the cluster
>>> to mark it as PMD_SIZE.
>>>
>>> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
>>> cluster will contain only one size of THPs, but this is not the case when a THP
>>> in the swapcache gets split or when an order-0 slot gets stolen. We expect these
>>> cases to be rare.
>>>
>>> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
>>> time it will be the full size of the swap entry, but sometimes it will cover
>>> only a portion. In the latter case you may see a false negative for
>>> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
>>> There is one wrinkle: currently the HUGE flag is cleared in put_swap_folio(). We
>>> wouldn't want to do the equivalent in the new scheme (i.e. set the whole cluster
>>> to order-0). I think that is safe, but haven't completely convinced myself yet.
>>>
>>> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
>>> precise information and is conceptually simpler to understand, but will cost
>>> more memory (half as much as the initial swap_map[] again).
>>>
>>> I still prefer to avoid this at all if we can (and would like to hear Huang's
>>> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
>>> prototyping.
>>
>> Taking a step back: what about we simply batch unmapping of swap entries?
>>
>> That is, if we're unmapping a PTE range, we'll collect swap entries (under PT
>> lock) that reference consecutive swap offsets in the same swap file.
>>
>> There, we can then first decrement all the swap counts, and then try minimizing
>> how often we actually have to try reclaiming swap space (lookup folio, see it's
>> a large folio that we cannot reclaim or could reclaim, ...).
>>
>> Might need some fine-tuning in swap code to "advance" to the next entry to try
>> freeing up, but we certainly can do better than what we would do right now.
>>
> 
> Hi,
> 
> I'm struggling to convince myself that free_swap_and_cache() can't race with
> with swapoff(). Can anyone explain that this is safe?
> 
> I *think* they are both serialized by the PTL, since all callers of
> free_swap_and_cache() (except shmem) have the PTL, and swapoff() calls
> try_to_unuse() early on, which takes the PTL as it iterates over every vma in
> every mm. It looks like shmem is handled specially by a call to shmem_unuse(),
> but I can't see the exact serialization mechanism.

As get_swap_device() documents:

"if there aren't some other ways to prevent swapoff, such as the folio 
in swap cache is locked, page table lock is held, etc., the swap entry 
may become invalid because of swapoff"

PTL it is, in theory. But I'm afraid that's half the story.

> 
> I've implemented a batching function, as David suggested above, but I'm trying
> to convince myself that it is safe for it to access si->swap_map[] without a
> lock (i.e. that swapoff() can't concurrently free it). But I think
> free_swap_and_cache() as it already exists depends on being able to access the
> si without an explicit lock, so I'm assuming the same mechanism will protect my
> new changes. But I want to be sure I understand the mechanism...

Very valid concern.

> 
> 
> This is the existing free_swap_and_cache(). I think _swap_info_get() would break
> if this could race with swapoff(), and __swap_entry_free() looks up the cluster
> from an array, which would also be freed by swapoff if racing:
> 
> int free_swap_and_cache(swp_entry_t entry)
> {
> 	struct swap_info_struct *p;
> 	unsigned char count;
> 
> 	if (non_swap_entry(entry))
> 		return 1;
> 
> 	p = _swap_info_get(entry);
> 	if (p) {
> 		count = __swap_entry_free(p, entry);

If count dropped to 0 and

> 		if (count == SWAP_HAS_CACHE)


count is now SWAP_HAS_CACHE, there is in fact no swap entry anymore. We 
removed it. That one would have to be reclaimed asynchronously.

The existing code we would call swap_page_trans_huge_swapped() with the 
SI it obtained via _swap_info_get().

I also don't see what should be left protecting the SI. It's not locked 
anymore, the swapcounts are at 0. We don't hold the folio lock.

try_to_unuse() will stop as soon as si->inuse_pages is at 0. Hm ...

Would performing the overall operation under lock_cluster_or_swap_info 
help? Not so sure :(

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-03-04 17:30                   ` David Hildenbrand
@ 2024-03-04 18:38                     ` Ryan Roberts
  2024-03-04 20:50                       ` David Hildenbrand
  0 siblings, 1 reply; 116+ messages in thread
From: Ryan Roberts @ 2024-03-04 18:38 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: linux-kernel, linux-mm

On 04/03/2024 17:30, David Hildenbrand wrote:
> On 04.03.24 17:03, Ryan Roberts wrote:
>> On 28/02/2024 12:12, David Hildenbrand wrote:
>>>>> How relevant is it? Relevant enough that someone decided to put that
>>>>> optimization in? I don't know :)
>>>>
>>>> I'll have one last go at convincing you: Huang Ying (original author) commented
>>>> "I believe this should be OK.  Better to compare the performance too." at [1].
>>>> That implies to me that perhaps the optimization wasn't in response to a
>>>> specific problem after all. Do you have any thoughts, Huang?
>>>
>>> Might make sense to include that in the patch description!
>>>
>>>> OK so if we really do need to keep this optimization, here are some ideas:
>>>>
>>>> Fundamentally, we would like to be able to figure out the size of the swap slot
>>>> from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and PMD_SIZE. For
>>>> PMD_SIZE, it always uses a full cluster, so can easily add a flag to the
>>>> cluster
>>>> to mark it as PMD_SIZE.
>>>>
>>>> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
>>>> cluster will contain only one size of THPs, but this is not the case when a THP
>>>> in the swapcache gets split or when an order-0 slot gets stolen. We expect
>>>> these
>>>> cases to be rare.
>>>>
>>>> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
>>>> time it will be the full size of the swap entry, but sometimes it will cover
>>>> only a portion. In the latter case you may see a false negative for
>>>> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
>>>> There is one wrinkle: currently the HUGE flag is cleared in
>>>> put_swap_folio(). We
>>>> wouldn't want to do the equivalent in the new scheme (i.e. set the whole
>>>> cluster
>>>> to order-0). I think that is safe, but haven't completely convinced myself yet.
>>>>
>>>> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
>>>> precise information and is conceptually simpler to understand, but will cost
>>>> more memory (half as much as the initial swap_map[] again).
>>>>
>>>> I still prefer to avoid this at all if we can (and would like to hear Huang's
>>>> thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do some
>>>> prototyping.
>>>
>>> Taking a step back: what about we simply batch unmapping of swap entries?
>>>
>>> That is, if we're unmapping a PTE range, we'll collect swap entries (under PT
>>> lock) that reference consecutive swap offsets in the same swap file.
>>>
>>> There, we can then first decrement all the swap counts, and then try minimizing
>>> how often we actually have to try reclaiming swap space (lookup folio, see it's
>>> a large folio that we cannot reclaim or could reclaim, ...).
>>>
>>> Might need some fine-tuning in swap code to "advance" to the next entry to try
>>> freeing up, but we certainly can do better than what we would do right now.
>>>
>>
>> Hi,
>>
>> I'm struggling to convince myself that free_swap_and_cache() can't race with
>> with swapoff(). Can anyone explain that this is safe?
>>
>> I *think* they are both serialized by the PTL, since all callers of
>> free_swap_and_cache() (except shmem) have the PTL, and swapoff() calls
>> try_to_unuse() early on, which takes the PTL as it iterates over every vma in
>> every mm. It looks like shmem is handled specially by a call to shmem_unuse(),
>> but I can't see the exact serialization mechanism.
> 
> As get_swap_device() documents:
> 
> "if there aren't some other ways to prevent swapoff, such as the folio in swap
> cache is locked, page table lock is held, etc., the swap entry may become
> invalid because of swapoff"
> 
> PTL it is, in theory. But I'm afraid that's half the story.

Ahh I didn't notice that comment - thanks!

> 
>>
>> I've implemented a batching function, as David suggested above, but I'm trying
>> to convince myself that it is safe for it to access si->swap_map[] without a
>> lock (i.e. that swapoff() can't concurrently free it). But I think
>> free_swap_and_cache() as it already exists depends on being able to access the
>> si without an explicit lock, so I'm assuming the same mechanism will protect my
>> new changes. But I want to be sure I understand the mechanism...
> 
> Very valid concern.
> 
>>
>>
>> This is the existing free_swap_and_cache(). I think _swap_info_get() would break
>> if this could race with swapoff(), and __swap_entry_free() looks up the cluster
>> from an array, which would also be freed by swapoff if racing:
>>
>> int free_swap_and_cache(swp_entry_t entry)
>> {
>>     struct swap_info_struct *p;
>>     unsigned char count;
>>
>>     if (non_swap_entry(entry))
>>         return 1;
>>
>>     p = _swap_info_get(entry);
>>     if (p) {
>>         count = __swap_entry_free(p, entry);
> 
> If count dropped to 0 and
> 
>>         if (count == SWAP_HAS_CACHE)
> 
> 
> count is now SWAP_HAS_CACHE, there is in fact no swap entry anymore. We removed
> it. That one would have to be reclaimed asynchronously.
> 
> The existing code we would call swap_page_trans_huge_swapped() with the SI it
> obtained via _swap_info_get().
> 
> I also don't see what should be left protecting the SI. It's not locked anymore,
> the swapcounts are at 0. We don't hold the folio lock.
> 
> try_to_unuse() will stop as soon as si->inuse_pages is at 0. Hm ...

But, assuming the caller of free_swap_and_cache() acquires the PTL first, I
think this all works out ok? While free_swap_and_cache() is running,
try_to_unuse() will wait for the PTL. Or if try_to_unuse() runs first, then
free_swap_and_cache() will never be called because the swap entry will have been
removed from the PTE?

That just leaves shmem... I suspected there might be some serialization between
shmem_unuse() (called from try_to_unuse()) and the shmem free_swap_and_cache()
callsites, but I can't see it. Hmm...

> 
> Would performing the overall operation under lock_cluster_or_swap_info help? Not
> so sure :(

No - that function relies on being able to access the cluster from the array in
the swap_info and lock it. And I think that array has the same lifetime as
swap_map, so same problem. You'd need get_swap_device()/put_swap_device() and a
bunch of refactoring for the internals not to take the locks, I guess. I think
its doable, just not sure if neccessary...



^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-03-04 18:38                     ` Ryan Roberts
@ 2024-03-04 20:50                       ` David Hildenbrand
  2024-03-04 21:55                         ` Ryan Roberts
  0 siblings, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2024-03-04 20:50 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: linux-kernel, linux-mm

>>>
>>> This is the existing free_swap_and_cache(). I think _swap_info_get() would break
>>> if this could race with swapoff(), and __swap_entry_free() looks up the cluster
>>> from an array, which would also be freed by swapoff if racing:
>>>
>>> int free_swap_and_cache(swp_entry_t entry)
>>> {
>>>      struct swap_info_struct *p;
>>>      unsigned char count;
>>>
>>>      if (non_swap_entry(entry))
>>>          return 1;
>>>
>>>      p = _swap_info_get(entry);
>>>      if (p) {
>>>          count = __swap_entry_free(p, entry);
>>
>> If count dropped to 0 and
>>
>>>          if (count == SWAP_HAS_CACHE)
>>
>>
>> count is now SWAP_HAS_CACHE, there is in fact no swap entry anymore. We removed
>> it. That one would have to be reclaimed asynchronously.
>>
>> The existing code we would call swap_page_trans_huge_swapped() with the SI it
>> obtained via _swap_info_get().
>>
>> I also don't see what should be left protecting the SI. It's not locked anymore,
>> the swapcounts are at 0. We don't hold the folio lock.
>>
>> try_to_unuse() will stop as soon as si->inuse_pages is at 0. Hm ...
> 
> But, assuming the caller of free_swap_and_cache() acquires the PTL first, I
> think this all works out ok? While free_swap_and_cache() is running,
> try_to_unuse() will wait for the PTL. Or if try_to_unuse() runs first, then
> free_swap_and_cache() will never be called because the swap entry will have been
> removed from the PTE?

But can't try_to_unuse() run, detect !si->inuse_pages and not even 
bother about scanning any further page tables?

But my head hurts from digging through that code.

Let me try again:

__swap_entry_free() might be the last user and result in "count == 
SWAP_HAS_CACHE".

swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0.


So the question is: could someone reclaim the folio and turn 
si->inuse_pages==0, before we completed swap_page_trans_huge_swapped().

Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are 
still references by swap entries.

Process 1 still references subpage 0 via swap entry.
Process 2 still references subpage 1 via swap entry.

Process 1 quits. Calls free_swap_and_cache().
-> count == SWAP_HAS_CACHE
[then, preempted in the hypervisor etc.]

Process 2 quits. Calls free_swap_and_cache().
-> count == SWAP_HAS_CACHE

Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls 
__try_to_reclaim_swap().

__try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->put_swap_folio()->
free_swap_slot()->swapcache_free_entries()->swap_entry_free()->swap_range_free()->
...
WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);


What stops swapoff to succeed after process 2 reclaimed the swap cache 
but before process 1 finished its call to swap_page_trans_huge_swapped()?



> 
> That just leaves shmem... I suspected there might be some serialization between
> shmem_unuse() (called from try_to_unuse()) and the shmem free_swap_and_cache()
> callsites, but I can't see it. Hmm...
> 
>>
>> Would performing the overall operation under lock_cluster_or_swap_info help? Not
>> so sure :(
> 
> No - that function relies on being able to access the cluster from the array in
> the swap_info and lock it. And I think that array has the same lifetime as
> swap_map, so same problem. You'd need get_swap_device()/put_swap_device() and a
> bunch of refactoring for the internals not to take the locks, I guess. I think
> its doable, just not sure if neccessary...

Agreed.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-03-04 20:50                       ` David Hildenbrand
@ 2024-03-04 21:55                         ` Ryan Roberts
  2024-03-04 22:02                           ` David Hildenbrand
  0 siblings, 1 reply; 116+ messages in thread
From: Ryan Roberts @ 2024-03-04 21:55 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: linux-kernel, linux-mm

On 04/03/2024 20:50, David Hildenbrand wrote:
>>>>
>>>> This is the existing free_swap_and_cache(). I think _swap_info_get() would
>>>> break
>>>> if this could race with swapoff(), and __swap_entry_free() looks up the cluster
>>>> from an array, which would also be freed by swapoff if racing:
>>>>
>>>> int free_swap_and_cache(swp_entry_t entry)
>>>> {
>>>>      struct swap_info_struct *p;
>>>>      unsigned char count;
>>>>
>>>>      if (non_swap_entry(entry))
>>>>          return 1;
>>>>
>>>>      p = _swap_info_get(entry);
>>>>      if (p) {
>>>>          count = __swap_entry_free(p, entry);
>>>
>>> If count dropped to 0 and
>>>
>>>>          if (count == SWAP_HAS_CACHE)
>>>
>>>
>>> count is now SWAP_HAS_CACHE, there is in fact no swap entry anymore. We removed
>>> it. That one would have to be reclaimed asynchronously.
>>>
>>> The existing code we would call swap_page_trans_huge_swapped() with the SI it
>>> obtained via _swap_info_get().
>>>
>>> I also don't see what should be left protecting the SI. It's not locked anymore,
>>> the swapcounts are at 0. We don't hold the folio lock.
>>>
>>> try_to_unuse() will stop as soon as si->inuse_pages is at 0. Hm ...
>>
>> But, assuming the caller of free_swap_and_cache() acquires the PTL first, I
>> think this all works out ok? While free_swap_and_cache() is running,
>> try_to_unuse() will wait for the PTL. Or if try_to_unuse() runs first, then
>> free_swap_and_cache() will never be called because the swap entry will have been
>> removed from the PTE?
> 
> But can't try_to_unuse() run, detect !si->inuse_pages and not even bother about
> scanning any further page tables?
> 
> But my head hurts from digging through that code.

Yep, glad I'm not the only one that gets headaches from swapfile.c.

> 
> Let me try again:
> 
> __swap_entry_free() might be the last user and result in "count == SWAP_HAS_CACHE".
> 
> swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0.
> 
> 
> So the question is: could someone reclaim the folio and turn si->inuse_pages==0,
> before we completed swap_page_trans_huge_swapped().
> 
> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are still
> references by swap entries.
> 
> Process 1 still references subpage 0 via swap entry.
> Process 2 still references subpage 1 via swap entry.
> 
> Process 1 quits. Calls free_swap_and_cache().
> -> count == SWAP_HAS_CACHE
> [then, preempted in the hypervisor etc.]
> 
> Process 2 quits. Calls free_swap_and_cache().
> -> count == SWAP_HAS_CACHE
> 
> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
> __try_to_reclaim_swap().
> 
> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->put_swap_folio()->
> free_swap_slot()->swapcache_free_entries()->swap_entry_free()->swap_range_free()->
> ...
> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
> 
> 
> What stops swapoff to succeed after process 2 reclaimed the swap cache but
> before process 1 finished its call to swap_page_trans_huge_swapped()?

Assuming you are talking about anonymous memory, process 1 has the PTL while
it's executing free_swap_and_cache(). try_to_unuse() iterates over every vma in
every mm, and it swaps-in a page for every PTE that holds a swap entry for the
device being swapoff'ed. It takes the PTL while converting the swap entry to
present PTE - see unuse_pte(). Process 1 must have beaten try_to_unuse() to the
particular pte, because if try_to_unuse() got there first, it would have
converted it from a swap entry to present pte and process 1 would never even
have called free_swap_and_cache(). So try_to_unuse() will eventually wait on the
PTL until process 1 has released it after free_swap_and_cache() completes. Am I
missing something? Because that part feels pretty clear to me.

Its the shmem case that I'm struggling to explain.

> 
> 
> 
>>
>> That just leaves shmem... I suspected there might be some serialization between
>> shmem_unuse() (called from try_to_unuse()) and the shmem free_swap_and_cache()
>> callsites, but I can't see it. Hmm...
>>
>>>
>>> Would performing the overall operation under lock_cluster_or_swap_info help? Not
>>> so sure :(
>>
>> No - that function relies on being able to access the cluster from the array in
>> the swap_info and lock it. And I think that array has the same lifetime as
>> swap_map, so same problem. You'd need get_swap_device()/put_swap_device() and a
>> bunch of refactoring for the internals not to take the locks, I guess. I think
>> its doable, just not sure if neccessary...
> 
> Agreed.
> 


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-03-04 21:55                         ` Ryan Roberts
@ 2024-03-04 22:02                           ` David Hildenbrand
  2024-03-04 22:34                             ` Ryan Roberts
  0 siblings, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2024-03-04 22:02 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: linux-kernel, linux-mm

On 04.03.24 22:55, Ryan Roberts wrote:
> On 04/03/2024 20:50, David Hildenbrand wrote:
>>>>>
>>>>> This is the existing free_swap_and_cache(). I think _swap_info_get() would
>>>>> break
>>>>> if this could race with swapoff(), and __swap_entry_free() looks up the cluster
>>>>> from an array, which would also be freed by swapoff if racing:
>>>>>
>>>>> int free_swap_and_cache(swp_entry_t entry)
>>>>> {
>>>>>       struct swap_info_struct *p;
>>>>>       unsigned char count;
>>>>>
>>>>>       if (non_swap_entry(entry))
>>>>>           return 1;
>>>>>
>>>>>       p = _swap_info_get(entry);
>>>>>       if (p) {
>>>>>           count = __swap_entry_free(p, entry);
>>>>
>>>> If count dropped to 0 and
>>>>
>>>>>           if (count == SWAP_HAS_CACHE)
>>>>
>>>>
>>>> count is now SWAP_HAS_CACHE, there is in fact no swap entry anymore. We removed
>>>> it. That one would have to be reclaimed asynchronously.
>>>>
>>>> The existing code we would call swap_page_trans_huge_swapped() with the SI it
>>>> obtained via _swap_info_get().
>>>>
>>>> I also don't see what should be left protecting the SI. It's not locked anymore,
>>>> the swapcounts are at 0. We don't hold the folio lock.
>>>>
>>>> try_to_unuse() will stop as soon as si->inuse_pages is at 0. Hm ...
>>>
>>> But, assuming the caller of free_swap_and_cache() acquires the PTL first, I
>>> think this all works out ok? While free_swap_and_cache() is running,
>>> try_to_unuse() will wait for the PTL. Or if try_to_unuse() runs first, then
>>> free_swap_and_cache() will never be called because the swap entry will have been
>>> removed from the PTE?
>>
>> But can't try_to_unuse() run, detect !si->inuse_pages and not even bother about
>> scanning any further page tables?
>>
>> But my head hurts from digging through that code.
> 
> Yep, glad I'm not the only one that gets headaches from swapfile.c.
> 
>>
>> Let me try again:
>>
>> __swap_entry_free() might be the last user and result in "count == SWAP_HAS_CACHE".
>>
>> swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0.
>>
>>
>> So the question is: could someone reclaim the folio and turn si->inuse_pages==0,
>> before we completed swap_page_trans_huge_swapped().
>>
>> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are still
>> references by swap entries.
>>
>> Process 1 still references subpage 0 via swap entry.
>> Process 2 still references subpage 1 via swap entry.
>>
>> Process 1 quits. Calls free_swap_and_cache().
>> -> count == SWAP_HAS_CACHE
>> [then, preempted in the hypervisor etc.]
>>
>> Process 2 quits. Calls free_swap_and_cache().
>> -> count == SWAP_HAS_CACHE
>>
>> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
>> __try_to_reclaim_swap().
>>
>> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->put_swap_folio()->
>> free_swap_slot()->swapcache_free_entries()->swap_entry_free()->swap_range_free()->
>> ...
>> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>>
>>
>> What stops swapoff to succeed after process 2 reclaimed the swap cache but
>> before process 1 finished its call to swap_page_trans_huge_swapped()?
> 
> Assuming you are talking about anonymous memory, process 1 has the PTL while
> it's executing free_swap_and_cache(). try_to_unuse() iterates over every vma in
> every mm, and it swaps-in a page for every PTE that holds a swap entry for the
> device being swapoff'ed. It takes the PTL while converting the swap entry to
> present PTE - see unuse_pte(). Process 1 must have beaten try_to_unuse() to the
> particular pte, because if try_to_unuse() got there first, it would have
> converted it from a swap entry to present pte and process 1 would never even
> have called free_swap_and_cache(). So try_to_unuse() will eventually wait on the
> PTL until process 1 has released it after free_swap_and_cache() completes. Am I
> missing something? Because that part feels pretty clear to me.

Why should try_to_unuse() do *anything* if it already finds
si->inuse_pages == 0 because we (p1 } p2) just freed the swapentries and 
process 2 managed to free the last remaining swapcache entry?

I'm probably missing something important :)

try_to_unuse() really starts with

	if (!READ_ONCE(si->inuse_pages))
		goto success;

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-03-04 22:02                           ` David Hildenbrand
@ 2024-03-04 22:34                             ` Ryan Roberts
  2024-03-05  6:11                               ` Huang, Ying
  0 siblings, 1 reply; 116+ messages in thread
From: Ryan Roberts @ 2024-03-04 22:34 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	Hugh Dickins
  Cc: linux-kernel, linux-mm

+ Hugh

On 04/03/2024 22:02, David Hildenbrand wrote:
> On 04.03.24 22:55, Ryan Roberts wrote:
>> On 04/03/2024 20:50, David Hildenbrand wrote:
>>>>>>
>>>>>> This is the existing free_swap_and_cache(). I think _swap_info_get() would
>>>>>> break
>>>>>> if this could race with swapoff(), and __swap_entry_free() looks up the
>>>>>> cluster
>>>>>> from an array, which would also be freed by swapoff if racing:
>>>>>>
>>>>>> int free_swap_and_cache(swp_entry_t entry)
>>>>>> {
>>>>>>       struct swap_info_struct *p;
>>>>>>       unsigned char count;
>>>>>>
>>>>>>       if (non_swap_entry(entry))
>>>>>>           return 1;
>>>>>>
>>>>>>       p = _swap_info_get(entry);
>>>>>>       if (p) {
>>>>>>           count = __swap_entry_free(p, entry);
>>>>>
>>>>> If count dropped to 0 and
>>>>>
>>>>>>           if (count == SWAP_HAS_CACHE)
>>>>>
>>>>>
>>>>> count is now SWAP_HAS_CACHE, there is in fact no swap entry anymore. We
>>>>> removed
>>>>> it. That one would have to be reclaimed asynchronously.
>>>>>
>>>>> The existing code we would call swap_page_trans_huge_swapped() with the SI it
>>>>> obtained via _swap_info_get().
>>>>>
>>>>> I also don't see what should be left protecting the SI. It's not locked
>>>>> anymore,
>>>>> the swapcounts are at 0. We don't hold the folio lock.
>>>>>
>>>>> try_to_unuse() will stop as soon as si->inuse_pages is at 0. Hm ...
>>>>
>>>> But, assuming the caller of free_swap_and_cache() acquires the PTL first, I
>>>> think this all works out ok? While free_swap_and_cache() is running,
>>>> try_to_unuse() will wait for the PTL. Or if try_to_unuse() runs first, then
>>>> free_swap_and_cache() will never be called because the swap entry will have
>>>> been
>>>> removed from the PTE?
>>>
>>> But can't try_to_unuse() run, detect !si->inuse_pages and not even bother about
>>> scanning any further page tables?
>>>
>>> But my head hurts from digging through that code.
>>
>> Yep, glad I'm not the only one that gets headaches from swapfile.c.
>>
>>>
>>> Let me try again:
>>>
>>> __swap_entry_free() might be the last user and result in "count ==
>>> SWAP_HAS_CACHE".
>>>
>>> swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0.
>>>
>>>
>>> So the question is: could someone reclaim the folio and turn si->inuse_pages==0,
>>> before we completed swap_page_trans_huge_swapped().
>>>
>>> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are still
>>> references by swap entries.
>>>
>>> Process 1 still references subpage 0 via swap entry.
>>> Process 2 still references subpage 1 via swap entry.
>>>
>>> Process 1 quits. Calls free_swap_and_cache().
>>> -> count == SWAP_HAS_CACHE
>>> [then, preempted in the hypervisor etc.]
>>>
>>> Process 2 quits. Calls free_swap_and_cache().
>>> -> count == SWAP_HAS_CACHE
>>>
>>> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
>>> __try_to_reclaim_swap().
>>>
>>> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->put_swap_folio()->
>>> free_swap_slot()->swapcache_free_entries()->swap_entry_free()->swap_range_free()->
>>> ...
>>> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>>>
>>>
>>> What stops swapoff to succeed after process 2 reclaimed the swap cache but
>>> before process 1 finished its call to swap_page_trans_huge_swapped()?
>>
>> Assuming you are talking about anonymous memory, process 1 has the PTL while
>> it's executing free_swap_and_cache(). try_to_unuse() iterates over every vma in
>> every mm, and it swaps-in a page for every PTE that holds a swap entry for the
>> device being swapoff'ed. It takes the PTL while converting the swap entry to
>> present PTE - see unuse_pte(). Process 1 must have beaten try_to_unuse() to the
>> particular pte, because if try_to_unuse() got there first, it would have
>> converted it from a swap entry to present pte and process 1 would never even
>> have called free_swap_and_cache(). So try_to_unuse() will eventually wait on the
>> PTL until process 1 has released it after free_swap_and_cache() completes. Am I
>> missing something? Because that part feels pretty clear to me.
> 
> Why should try_to_unuse() do *anything* if it already finds
> si->inuse_pages == 0 because we (p1 } p2) just freed the swapentries and process
> 2 managed to free the last remaining swapcache entry?

Yeah ok. For some reason I thought unuse_mm() was iterating over all mms and so
the `while (READ_ONCE(si->inuse_pages))` was only evaluated after iterating over
every mm. Oops.

So yes, I agree with you; I think this is broken. And I'm a bit worried this
could be a can of worms; By the same logic, I think folio_free_swap(),
swp_swapcount() and probably others are broken in the same way.

I wonder if we are missing something here? I've added Hugh - I see he has a lot
of commits in this area, perhaps he has some advice?

Thanks,
Ryan


> 
> I'm probably missing something important :)
> 
> try_to_unuse() really starts with
> 
>     if (!READ_ONCE(si->inuse_pages))
>         goto success;
> 


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-03-04 22:34                             ` Ryan Roberts
@ 2024-03-05  6:11                               ` Huang, Ying
  2024-03-05  8:35                                 ` David Hildenbrand
  0 siblings, 1 reply; 116+ messages in thread
From: Huang, Ying @ 2024-03-05  6:11 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, Hugh Dickins,
	linux-kernel, linux-mm

Ryan Roberts <ryan.roberts@arm.com> writes:

> + Hugh
>
> On 04/03/2024 22:02, David Hildenbrand wrote:
>> On 04.03.24 22:55, Ryan Roberts wrote:
>>> On 04/03/2024 20:50, David Hildenbrand wrote:
>>>>>>>
>>>>>>> This is the existing free_swap_and_cache(). I think _swap_info_get() would
>>>>>>> break
>>>>>>> if this could race with swapoff(), and __swap_entry_free() looks up the
>>>>>>> cluster
>>>>>>> from an array, which would also be freed by swapoff if racing:
>>>>>>>
>>>>>>> int free_swap_and_cache(swp_entry_t entry)
>>>>>>> {
>>>>>>>       struct swap_info_struct *p;
>>>>>>>       unsigned char count;
>>>>>>>
>>>>>>>       if (non_swap_entry(entry))
>>>>>>>           return 1;
>>>>>>>
>>>>>>>       p = _swap_info_get(entry);
>>>>>>>       if (p) {
>>>>>>>           count = __swap_entry_free(p, entry);
>>>>>>
>>>>>> If count dropped to 0 and
>>>>>>
>>>>>>>           if (count == SWAP_HAS_CACHE)
>>>>>>
>>>>>>
>>>>>> count is now SWAP_HAS_CACHE, there is in fact no swap entry anymore. We
>>>>>> removed
>>>>>> it. That one would have to be reclaimed asynchronously.
>>>>>>
>>>>>> The existing code we would call swap_page_trans_huge_swapped() with the SI it
>>>>>> obtained via _swap_info_get().
>>>>>>
>>>>>> I also don't see what should be left protecting the SI. It's not locked
>>>>>> anymore,
>>>>>> the swapcounts are at 0. We don't hold the folio lock.
>>>>>>
>>>>>> try_to_unuse() will stop as soon as si->inuse_pages is at 0. Hm ...
>>>>>
>>>>> But, assuming the caller of free_swap_and_cache() acquires the PTL first, I
>>>>> think this all works out ok? While free_swap_and_cache() is running,
>>>>> try_to_unuse() will wait for the PTL. Or if try_to_unuse() runs first, then
>>>>> free_swap_and_cache() will never be called because the swap entry will have
>>>>> been
>>>>> removed from the PTE?
>>>>
>>>> But can't try_to_unuse() run, detect !si->inuse_pages and not even bother about
>>>> scanning any further page tables?
>>>>
>>>> But my head hurts from digging through that code.
>>>
>>> Yep, glad I'm not the only one that gets headaches from swapfile.c.
>>>
>>>>
>>>> Let me try again:
>>>>
>>>> __swap_entry_free() might be the last user and result in "count ==
>>>> SWAP_HAS_CACHE".
>>>>
>>>> swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0.
>>>>
>>>>
>>>> So the question is: could someone reclaim the folio and turn si->inuse_pages==0,
>>>> before we completed swap_page_trans_huge_swapped().
>>>>
>>>> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are still
>>>> references by swap entries.
>>>>
>>>> Process 1 still references subpage 0 via swap entry.
>>>> Process 2 still references subpage 1 via swap entry.
>>>>
>>>> Process 1 quits. Calls free_swap_and_cache().
>>>> -> count == SWAP_HAS_CACHE
>>>> [then, preempted in the hypervisor etc.]
>>>>
>>>> Process 2 quits. Calls free_swap_and_cache().
>>>> -> count == SWAP_HAS_CACHE
>>>>
>>>> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
>>>> __try_to_reclaim_swap().
>>>>
>>>> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->put_swap_folio()->
>>>> free_swap_slot()->swapcache_free_entries()->swap_entry_free()->swap_range_free()->
>>>> ...
>>>> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>>>>
>>>>
>>>> What stops swapoff to succeed after process 2 reclaimed the swap cache but
>>>> before process 1 finished its call to swap_page_trans_huge_swapped()?
>>>
>>> Assuming you are talking about anonymous memory, process 1 has the PTL while
>>> it's executing free_swap_and_cache(). try_to_unuse() iterates over every vma in
>>> every mm, and it swaps-in a page for every PTE that holds a swap entry for the
>>> device being swapoff'ed. It takes the PTL while converting the swap entry to
>>> present PTE - see unuse_pte(). Process 1 must have beaten try_to_unuse() to the
>>> particular pte, because if try_to_unuse() got there first, it would have
>>> converted it from a swap entry to present pte and process 1 would never even
>>> have called free_swap_and_cache(). So try_to_unuse() will eventually wait on the
>>> PTL until process 1 has released it after free_swap_and_cache() completes. Am I
>>> missing something? Because that part feels pretty clear to me.
>> 
>> Why should try_to_unuse() do *anything* if it already finds
>> si->inuse_pages == 0 because we (p1 } p2) just freed the swapentries and process
>> 2 managed to free the last remaining swapcache entry?
>
> Yeah ok. For some reason I thought unuse_mm() was iterating over all mms and so
> the `while (READ_ONCE(si->inuse_pages))` was only evaluated after iterating over
> every mm. Oops.
>
> So yes, I agree with you; I think this is broken. And I'm a bit worried this
> could be a can of worms; By the same logic, I think folio_free_swap(),
> swp_swapcount() and probably others are broken in the same way.

Don't worry too much :-), we have get_swap_device() at least.  We can
insert it anywhere we want because it's quite lightweight.  And, because
swapoff() is so rare, the race is theoretical only.

For this specific case, I had thought that PTL is enough.  But after
looking at this more, I found a race here too.  Until
__swap_entry_free() return, we are OK, nobody can reduce the swap count
because we held the PTL.  But, after that, even if its return value is
SWAP_HAS_CACHE (that is, in swap cache), parallel swap_unuse() or
__try_to_reclaim_swap() may remove the folio from swap cache, so free
the swap entry.  So, swapoff() can proceed to free the data structures
in parallel.

To fix the race, we can add get/put_swap_device() in
free_swap_and_cache().

For other places, we can check whether get/put_swap_device() has been
called in callers, and the swap reference we held has been decreased
(e.g., swap count protected by PTL, SWAP_HAS_CACHE protected by folio
lock).

> I wonder if we are missing something here? I've added Hugh - I see he has a lot
> of commits in this area, perhaps he has some advice?
>
> Thanks,
> Ryan
>
>
>> 
>> I'm probably missing something important :)
>> 
>> try_to_unuse() really starts with
>> 
>>     if (!READ_ONCE(si->inuse_pages))
>>         goto success;
>> 

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-03-04  5:42                               ` Barry Song
@ 2024-03-05  7:41                                 ` Ryan Roberts
  0 siblings, 0 replies; 116+ messages in thread
From: Ryan Roberts @ 2024-03-05  7:41 UTC (permalink / raw)
  To: Barry Song
  Cc: Matthew Wilcox, David Hildenbrand, Andrew Morton, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
	linux-kernel, linux-mm

On 04/03/2024 05:42, Barry Song wrote:
> On Mon, Mar 4, 2024 at 5:52 PM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Sat, Mar 2, 2024 at 6:08 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> On 01/03/2024 16:44, Ryan Roberts wrote:
>>>> On 01/03/2024 16:31, Matthew Wilcox wrote:
>>>>> On Fri, Mar 01, 2024 at 04:27:32PM +0000, Ryan Roberts wrote:
>>>>>> I've implemented the batching as David suggested, and I'm pretty confident it's
>>>>>> correct. The only problem is that during testing I can't provoke the code to
>>>>>> take the path. I've been pouring through the code but struggling to figure out
>>>>>> under what situation you would expect the swap entry passed to
>>>>>> free_swap_and_cache() to still have a cached folio? Does anyone have any idea?
>>>>>>
>>>>>> This is the original (unbatched) function, after my change, which caused David's
>>>>>> concern that we would end up calling __try_to_reclaim_swap() far too much:
>>>>>>
>>>>>> int free_swap_and_cache(swp_entry_t entry)
>>>>>> {
>>>>>>     struct swap_info_struct *p;
>>>>>>     unsigned char count;
>>>>>>
>>>>>>     if (non_swap_entry(entry))
>>>>>>             return 1;
>>>>>>
>>>>>>     p = _swap_info_get(entry);
>>>>>>     if (p) {
>>>>>>             count = __swap_entry_free(p, entry);
>>>>>>             if (count == SWAP_HAS_CACHE)
>>>>>>                     __try_to_reclaim_swap(p, swp_offset(entry),
>>>>>>                                           TTRS_UNMAPPED | TTRS_FULL);
>>>>>>     }
>>>>>>     return p != NULL;
>>>>>> }
>>>>>>
>>>>>> The trouble is, whenever its called, count is always 0, so
>>>>>> __try_to_reclaim_swap() never gets called.
>>>>>>
>>>>>> My test case is allocating 1G anon memory, then doing madvise(MADV_PAGEOUT) over
>>>>>> it. Then doing either a munmap() or madvise(MADV_FREE), both of which cause this
>>>>>> function to be called for every PTE, but count is always 0 after
>>>>>> __swap_entry_free() so __try_to_reclaim_swap() is never called. I've tried for
>>>>>> order-0 as well as PTE- and PMD-mapped 2M THP.
>>>>>
>>>>> I think you have to page it back in again, then it will have an entry in
>>>>> the swap cache.  Maybe.  I know little about anon memory ;-)
>>>>
>>>> Ahh, I was under the impression that the original folio is put into the swap
>>>> cache at swap out, then (I guess) its removed once the IO is complete? I'm sure
>>>> I'm miles out... what exactly is the lifecycle of a folio going through swap out?
>>>>
>>>> I guess I can try forking after swap out, then fault it back in in the child and
>>>> exit. Then do the munmap in the parent. I guess that could force it? Thanks for
>>>> the tip - I'll have a play.
>>>
>>> That has sort of solved it, the only problem now is that all the folios in the
>>> swap cache are small (because I don't have Barry's large swap-in series). So
>>> really I need to figure out how to avoid removing the folio from the cache in
>>> the first place...
>>
>> I am quite sure we have a chance to hit a large swapcache even using zRAM -
>> a sync swapfile and even during swap-out.
>>
>> I have a test case as below,
>> 1. two threads to run MADV_PAGEOUT
>> 2. two threads to read data being swapped-out
>>
>> in do_swap_page, from time to time, I can get a large swapcache.
>>
>> We have a short time window after add_to_swap() and before
>> __removing_mapping() of
>> vmscan,  a large folio is still in swapcache.
>>
>> So Ryan, I guess you can trigger this by adding one more thread of
>> MADV_DONTNEED to do zap_pte_range?
> 
> Ryan, I have modified my test case to have 4 threads:
> 1. MADV_PAGEOUT
> 2. MADV_DONTNEED
> 3. write data
> 4. read data
> 
> and git push the code here so that you can get it,
> https://github.com/BarrySong666/swaptest/blob/main/swptest.c

Thanks for this, Barry!


> 
> I can reproduce the issue in zap_pte_range() in just a couple of minutes.
> 
>>
>>
>>>
>>>>
>>>>>
>>>>> If that doesn't work, perhaps use tmpfs, and use some memory pressure to
>>>>> force that to swap?
>>>>>
>>>>>> I'm guessing the swapcache was already reclaimed as part of MADV_PAGEOUT? I'm
>>>>>> using a block ram device as my backing store - I think this does synchronous IO
>>>>>> so perhaps if I have a real block device with async IO I might have more luck?
>>>>>> Just a guess...
>>>>>>
>>>>>> Or perhaps this code path is a corner case? In which case, perhaps its not worth
>>>>>> adding the batching optimization after all?
>>>>>>
>>>>>> Thanks,
>>>>>> Ryan
>>>>>>
>>>>
> 
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-03-05  6:11                               ` Huang, Ying
@ 2024-03-05  8:35                                 ` David Hildenbrand
  2024-03-05  8:46                                   ` Ryan Roberts
  0 siblings, 1 reply; 116+ messages in thread
From: David Hildenbrand @ 2024-03-05  8:35 UTC (permalink / raw)
  To: Huang, Ying, Ryan Roberts
  Cc: Andrew Morton, Matthew Wilcox, Gao Xiang, Yu Zhao, Yang Shi,
	Michal Hocko, Kefeng Wang, Hugh Dickins, linux-kernel, linux-mm

On 05.03.24 07:11, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> + Hugh
>>
>> On 04/03/2024 22:02, David Hildenbrand wrote:
>>> On 04.03.24 22:55, Ryan Roberts wrote:
>>>> On 04/03/2024 20:50, David Hildenbrand wrote:
>>>>>>>>
>>>>>>>> This is the existing free_swap_and_cache(). I think _swap_info_get() would
>>>>>>>> break
>>>>>>>> if this could race with swapoff(), and __swap_entry_free() looks up the
>>>>>>>> cluster
>>>>>>>> from an array, which would also be freed by swapoff if racing:
>>>>>>>>
>>>>>>>> int free_swap_and_cache(swp_entry_t entry)
>>>>>>>> {
>>>>>>>>        struct swap_info_struct *p;
>>>>>>>>        unsigned char count;
>>>>>>>>
>>>>>>>>        if (non_swap_entry(entry))
>>>>>>>>            return 1;
>>>>>>>>
>>>>>>>>        p = _swap_info_get(entry);
>>>>>>>>        if (p) {
>>>>>>>>            count = __swap_entry_free(p, entry);
>>>>>>>
>>>>>>> If count dropped to 0 and
>>>>>>>
>>>>>>>>            if (count == SWAP_HAS_CACHE)
>>>>>>>
>>>>>>>
>>>>>>> count is now SWAP_HAS_CACHE, there is in fact no swap entry anymore. We
>>>>>>> removed
>>>>>>> it. That one would have to be reclaimed asynchronously.
>>>>>>>
>>>>>>> The existing code we would call swap_page_trans_huge_swapped() with the SI it
>>>>>>> obtained via _swap_info_get().
>>>>>>>
>>>>>>> I also don't see what should be left protecting the SI. It's not locked
>>>>>>> anymore,
>>>>>>> the swapcounts are at 0. We don't hold the folio lock.
>>>>>>>
>>>>>>> try_to_unuse() will stop as soon as si->inuse_pages is at 0. Hm ...
>>>>>>
>>>>>> But, assuming the caller of free_swap_and_cache() acquires the PTL first, I
>>>>>> think this all works out ok? While free_swap_and_cache() is running,
>>>>>> try_to_unuse() will wait for the PTL. Or if try_to_unuse() runs first, then
>>>>>> free_swap_and_cache() will never be called because the swap entry will have
>>>>>> been
>>>>>> removed from the PTE?
>>>>>
>>>>> But can't try_to_unuse() run, detect !si->inuse_pages and not even bother about
>>>>> scanning any further page tables?
>>>>>
>>>>> But my head hurts from digging through that code.
>>>>
>>>> Yep, glad I'm not the only one that gets headaches from swapfile.c.
>>>>
>>>>>
>>>>> Let me try again:
>>>>>
>>>>> __swap_entry_free() might be the last user and result in "count ==
>>>>> SWAP_HAS_CACHE".
>>>>>
>>>>> swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0.
>>>>>
>>>>>
>>>>> So the question is: could someone reclaim the folio and turn si->inuse_pages==0,
>>>>> before we completed swap_page_trans_huge_swapped().
>>>>>
>>>>> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are still
>>>>> references by swap entries.
>>>>>
>>>>> Process 1 still references subpage 0 via swap entry.
>>>>> Process 2 still references subpage 1 via swap entry.
>>>>>
>>>>> Process 1 quits. Calls free_swap_and_cache().
>>>>> -> count == SWAP_HAS_CACHE
>>>>> [then, preempted in the hypervisor etc.]
>>>>>
>>>>> Process 2 quits. Calls free_swap_and_cache().
>>>>> -> count == SWAP_HAS_CACHE
>>>>>
>>>>> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
>>>>> __try_to_reclaim_swap().
>>>>>
>>>>> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->put_swap_folio()->
>>>>> free_swap_slot()->swapcache_free_entries()->swap_entry_free()->swap_range_free()->
>>>>> ...
>>>>> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>>>>>
>>>>>
>>>>> What stops swapoff to succeed after process 2 reclaimed the swap cache but
>>>>> before process 1 finished its call to swap_page_trans_huge_swapped()?
>>>>
>>>> Assuming you are talking about anonymous memory, process 1 has the PTL while
>>>> it's executing free_swap_and_cache(). try_to_unuse() iterates over every vma in
>>>> every mm, and it swaps-in a page for every PTE that holds a swap entry for the
>>>> device being swapoff'ed. It takes the PTL while converting the swap entry to
>>>> present PTE - see unuse_pte(). Process 1 must have beaten try_to_unuse() to the
>>>> particular pte, because if try_to_unuse() got there first, it would have
>>>> converted it from a swap entry to present pte and process 1 would never even
>>>> have called free_swap_and_cache(). So try_to_unuse() will eventually wait on the
>>>> PTL until process 1 has released it after free_swap_and_cache() completes. Am I
>>>> missing something? Because that part feels pretty clear to me.
>>>
>>> Why should try_to_unuse() do *anything* if it already finds
>>> si->inuse_pages == 0 because we (p1 } p2) just freed the swapentries and process
>>> 2 managed to free the last remaining swapcache entry?
>>
>> Yeah ok. For some reason I thought unuse_mm() was iterating over all mms and so
>> the `while (READ_ONCE(si->inuse_pages))` was only evaluated after iterating over
>> every mm. Oops.
>>
>> So yes, I agree with you; I think this is broken. And I'm a bit worried this
>> could be a can of worms; By the same logic, I think folio_free_swap(),
>> swp_swapcount() and probably others are broken in the same way.
> 
> Don't worry too much :-), we have get_swap_device() at least.  We can
> insert it anywhere we want because it's quite lightweight.  And, because
> swapoff() is so rare, the race is theoretical only.
> 
> For this specific case, I had thought that PTL is enough.  But after
> looking at this more, I found a race here too.  Until
> __swap_entry_free() return, we are OK, nobody can reduce the swap count
> because we held the PTL.  But, after that, even if its return value is
> SWAP_HAS_CACHE (that is, in swap cache), parallel swap_unuse() or
> __try_to_reclaim_swap() may remove the folio from swap cache, so free
> the swap entry.  So, swapoff() can proceed to free the data structures
> in parallel.
> 
> To fix the race, we can add get/put_swap_device() in
> free_swap_and_cache().
> 
> For other places, we can check whether get/put_swap_device() has been
> called in callers, and the swap reference we held has been decreased
> (e.g., swap count protected by PTL, SWAP_HAS_CACHE protected by folio
> lock).

Yes, sounds reasonable. We should likely update the documentation of 
get_swap_device(), that after decrementing the refcount, the SI might 
become stale and should not be touched without a prior get_swap_device().

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2024-03-05  8:35                                 ` David Hildenbrand
@ 2024-03-05  8:46                                   ` Ryan Roberts
  0 siblings, 0 replies; 116+ messages in thread
From: Ryan Roberts @ 2024-03-05  8:46 UTC (permalink / raw)
  To: David Hildenbrand, Huang, Ying
  Cc: Andrew Morton, Matthew Wilcox, Gao Xiang, Yu Zhao, Yang Shi,
	Michal Hocko, Kefeng Wang, Hugh Dickins, linux-kernel, linux-mm

On 05/03/2024 08:35, David Hildenbrand wrote:
> On 05.03.24 07:11, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>
>>> + Hugh
>>>
>>> On 04/03/2024 22:02, David Hildenbrand wrote:
>>>> On 04.03.24 22:55, Ryan Roberts wrote:
>>>>> On 04/03/2024 20:50, David Hildenbrand wrote:
>>>>>>>>>
>>>>>>>>> This is the existing free_swap_and_cache(). I think _swap_info_get() would
>>>>>>>>> break
>>>>>>>>> if this could race with swapoff(), and __swap_entry_free() looks up the
>>>>>>>>> cluster
>>>>>>>>> from an array, which would also be freed by swapoff if racing:
>>>>>>>>>
>>>>>>>>> int free_swap_and_cache(swp_entry_t entry)
>>>>>>>>> {
>>>>>>>>>        struct swap_info_struct *p;
>>>>>>>>>        unsigned char count;
>>>>>>>>>
>>>>>>>>>        if (non_swap_entry(entry))
>>>>>>>>>            return 1;
>>>>>>>>>
>>>>>>>>>        p = _swap_info_get(entry);
>>>>>>>>>        if (p) {
>>>>>>>>>            count = __swap_entry_free(p, entry);
>>>>>>>>
>>>>>>>> If count dropped to 0 and
>>>>>>>>
>>>>>>>>>            if (count == SWAP_HAS_CACHE)
>>>>>>>>
>>>>>>>>
>>>>>>>> count is now SWAP_HAS_CACHE, there is in fact no swap entry anymore. We
>>>>>>>> removed
>>>>>>>> it. That one would have to be reclaimed asynchronously.
>>>>>>>>
>>>>>>>> The existing code we would call swap_page_trans_huge_swapped() with the
>>>>>>>> SI it
>>>>>>>> obtained via _swap_info_get().
>>>>>>>>
>>>>>>>> I also don't see what should be left protecting the SI. It's not locked
>>>>>>>> anymore,
>>>>>>>> the swapcounts are at 0. We don't hold the folio lock.
>>>>>>>>
>>>>>>>> try_to_unuse() will stop as soon as si->inuse_pages is at 0. Hm ...
>>>>>>>
>>>>>>> But, assuming the caller of free_swap_and_cache() acquires the PTL first, I
>>>>>>> think this all works out ok? While free_swap_and_cache() is running,
>>>>>>> try_to_unuse() will wait for the PTL. Or if try_to_unuse() runs first, then
>>>>>>> free_swap_and_cache() will never be called because the swap entry will have
>>>>>>> been
>>>>>>> removed from the PTE?
>>>>>>
>>>>>> But can't try_to_unuse() run, detect !si->inuse_pages and not even bother
>>>>>> about
>>>>>> scanning any further page tables?
>>>>>>
>>>>>> But my head hurts from digging through that code.
>>>>>
>>>>> Yep, glad I'm not the only one that gets headaches from swapfile.c.
>>>>>
>>>>>>
>>>>>> Let me try again:
>>>>>>
>>>>>> __swap_entry_free() might be the last user and result in "count ==
>>>>>> SWAP_HAS_CACHE".
>>>>>>
>>>>>> swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0.
>>>>>>
>>>>>>
>>>>>> So the question is: could someone reclaim the folio and turn
>>>>>> si->inuse_pages==0,
>>>>>> before we completed swap_page_trans_huge_swapped().
>>>>>>
>>>>>> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are
>>>>>> still
>>>>>> references by swap entries.
>>>>>>
>>>>>> Process 1 still references subpage 0 via swap entry.
>>>>>> Process 2 still references subpage 1 via swap entry.
>>>>>>
>>>>>> Process 1 quits. Calls free_swap_and_cache().
>>>>>> -> count == SWAP_HAS_CACHE
>>>>>> [then, preempted in the hypervisor etc.]
>>>>>>
>>>>>> Process 2 quits. Calls free_swap_and_cache().
>>>>>> -> count == SWAP_HAS_CACHE
>>>>>>
>>>>>> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
>>>>>> __try_to_reclaim_swap().
>>>>>>
>>>>>> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->put_swap_folio()->
>>>>>> free_swap_slot()->swapcache_free_entries()->swap_entry_free()->swap_range_free()->
>>>>>> ...
>>>>>> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>>>>>>
>>>>>>
>>>>>> What stops swapoff to succeed after process 2 reclaimed the swap cache but
>>>>>> before process 1 finished its call to swap_page_trans_huge_swapped()?
>>>>>
>>>>> Assuming you are talking about anonymous memory, process 1 has the PTL while
>>>>> it's executing free_swap_and_cache(). try_to_unuse() iterates over every
>>>>> vma in
>>>>> every mm, and it swaps-in a page for every PTE that holds a swap entry for the
>>>>> device being swapoff'ed. It takes the PTL while converting the swap entry to
>>>>> present PTE - see unuse_pte(). Process 1 must have beaten try_to_unuse() to
>>>>> the
>>>>> particular pte, because if try_to_unuse() got there first, it would have
>>>>> converted it from a swap entry to present pte and process 1 would never even
>>>>> have called free_swap_and_cache(). So try_to_unuse() will eventually wait
>>>>> on the
>>>>> PTL until process 1 has released it after free_swap_and_cache() completes.
>>>>> Am I
>>>>> missing something? Because that part feels pretty clear to me.
>>>>
>>>> Why should try_to_unuse() do *anything* if it already finds
>>>> si->inuse_pages == 0 because we (p1 } p2) just freed the swapentries and
>>>> process
>>>> 2 managed to free the last remaining swapcache entry?
>>>
>>> Yeah ok. For some reason I thought unuse_mm() was iterating over all mms and so
>>> the `while (READ_ONCE(si->inuse_pages))` was only evaluated after iterating over
>>> every mm. Oops.
>>>
>>> So yes, I agree with you; I think this is broken. And I'm a bit worried this
>>> could be a can of worms; By the same logic, I think folio_free_swap(),
>>> swp_swapcount() and probably others are broken in the same way.
>>
>> Don't worry too much :-), we have get_swap_device() at least.  We can
>> insert it anywhere we want because it's quite lightweight.  And, because
>> swapoff() is so rare, the race is theoretical only.

Thanks for the response!

>>
>> For this specific case, I had thought that PTL is enough.  But after
>> looking at this more, I found a race here too.  Until
>> __swap_entry_free() return, we are OK, nobody can reduce the swap count
>> because we held the PTL.  

Even that is not true for the shmem case: As far as I can see, shmem doesn't
have the PTL or any other synchronizing lock when it calls
free_swap_and_cache(). I don't think that changes anything solution-wise though.

>> But, after that, even if its return value is
>> SWAP_HAS_CACHE (that is, in swap cache), parallel swap_unuse() or
>> __try_to_reclaim_swap() may remove the folio from swap cache, so free
>> the swap entry.  So, swapoff() can proceed to free the data structures
>> in parallel.
>>
>> To fix the race, we can add get/put_swap_device() in
>> free_swap_and_cache().
>>
>> For other places, we can check whether get/put_swap_device() has been
>> called in callers, and the swap reference we held has been decreased
>> (e.g., swap count protected by PTL, SWAP_HAS_CACHE protected by folio
>> lock).
> 
> Yes, sounds reasonable. We should likely update the documentation of
> get_swap_device(), that after decrementing the refcount, the SI might become
> stale and should not be touched without a prior get_swap_device().

Yep agreed. If nobody else is planning to do it, I'll try to create a test case
that provokes the problem then put a patch together to fix it.



^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2024-02-18 23:40       ` Barry Song
  2024-02-20 20:03         ` Ryan Roberts
@ 2024-03-05  9:00         ` Ryan Roberts
  2024-03-05  9:54           ` Barry Song
  1 sibling, 1 reply; 116+ messages in thread
From: Ryan Roberts @ 2024-03-05  9:00 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, david, linux-kernel, linux-mm, mhocko, shy828301,
	wangkefeng.wang, willy, xiang, ying.huang, yuzhao, chrisl,
	surenb, hanchuanhua

Hi Barry,

On 18/02/2024 23:40, Barry Song wrote:
> On Tue, Feb 6, 2024 at 1:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 05/02/2024 09:51, Barry Song wrote:
>>> +Chris, Suren and Chuanhua
>>>
>>> Hi Ryan,
[...]
>>
> 
> Hi Ryan,
> I am running into some races especially while enabling large folio swap-out and
> swap-in both. some of them, i am still struggling with the detailed
> timing how they
> are happening.
> but the below change can help remove those bugs which cause corrupted data.

I'm getting quite confused with all the emails flying around on this topic. Here
you were reporting a data corruption bug and your suggested fix below is the one
you have now posted at [1]. But in the thread at [1] we concluded that it is not
fixing a functional correctness issue, but is just an optimization in some
corner cases. So does the corruption issue still manifest? Did you manage to
root cause it? Is it a problem with my swap-out series or your swap-in series,
or pre-existing?

[1] https://lore.kernel.org/linux-mm/20240304103757.235352-1-21cnbao@gmail.com/

Thanks,
Ryan

> 
> index da2aab219c40..ef9cfbc84760 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1953,6 +1953,16 @@ static unsigned int shrink_folio_list(struct
> list_head *folio_list,
> 
>                         if (folio_test_pmd_mappable(folio))
>                                 flags |= TTU_SPLIT_HUGE_PMD;
> +                       /*
> +                        * make try_to_unmap_one hold ptl from the very first
> +                        * beginning if we are reclaiming a folio with multi-
> +                        * ptes. otherwise, we may only reclaim a part of the
> +                        * folio from the middle.
> +                        * for example, a parallel thread might temporarily
> +                        * set pte to none for various purposes.
> +                        */
> +                       else if (folio_test_large(folio))
> +                               flags |= TTU_SYNC;
> 
>                         try_to_unmap(folio, flags);
>                         if (folio_mapped(folio)) {
> 
> 
> While we are swapping-out a large folio, it has many ptes, we change those ptes
> to swap entries in try_to_unmap_one(). "while (page_vma_mapped_walk(&pvmw))"
> will iterate all ptes within the large folio. but it will only begin
> to acquire ptl when
> it meets a valid pte as below /* xxxxxxx */
> 
> static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
> {
>         pte_t ptent;
> 
>         if (pvmw->flags & PVMW_SYNC) {
>                 /* Use the stricter lookup */
>                 pvmw->pte = pte_offset_map_lock(pvmw->vma->vm_mm, pvmw->pmd,
>                                                 pvmw->address, &pvmw->ptl);
>                 *ptlp = pvmw->ptl;
>                 return !!pvmw->pte;
>         }
> 
>        ...
>         pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
>                                           pvmw->address, ptlp);
>         if (!pvmw->pte)
>                 return false;
> 
>         ptent = ptep_get(pvmw->pte);
> 
>         if (pvmw->flags & PVMW_MIGRATION) {
>                 if (!is_swap_pte(ptent))
>                         return false;
>         } else if (is_swap_pte(ptent)) {
>                 swp_entry_t entry;
>                 ...
>                 entry = pte_to_swp_entry(ptent);
>                 if (!is_device_private_entry(entry) &&
>                     !is_device_exclusive_entry(entry))
>                         return false;
>         } else if (!pte_present(ptent)) {
>                 return false;
>         }
>         pvmw->ptl = *ptlp;
>         spin_lock(pvmw->ptl);   /* xxxxxxx */
>         return true;
> }
> 
> 
> for various reasons,  for example, break-before-make for clearing access flags
> etc. pte can be set to none. since page_vma_mapped_walk() doesn't hold ptl
> from the beginning,  it might only begin to set swap entries from the middle of
> a large folio.
> 
> For example, in case a large folio has 16 ptes, and 0,1,2 are somehow zero
> in the intermediate stage of a break-before-make, ptl will be held
> from the 3rd pte,
> and swap entries will be set from 3rd pte as well. it seems not good as we are
> trying to swap out a large folio, but we are swapping out a part of them.
> 
> I am still struggling with all the timing of races, but using PVMW_SYNC to
> explicitly ask for ptl from the first pte seems a good thing for large folio
> regardless of those races. it can avoid try_to_unmap_one reading intermediate
> pte and further make the wrong decision since reclaiming pte-mapped large
> folios is atomic with just one pte.
> 
>> Sorry I haven't progressed this series as I had hoped. I've been concentrating
>> on getting the contpte series upstream. I'm hoping I will find some time to move
>> this series along by the tail end of Feb (hoping to get it in shape for v6.10).
>> Hopefully that doesn't cause you any big problems?
> 
> no worries. Anyway, we are already using your code to run various tests.
> 
>>
>> Thanks,
>> Ryan
> 
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2024-03-05  9:00         ` Ryan Roberts
@ 2024-03-05  9:54           ` Barry Song
  2024-03-05 10:44             ` Ryan Roberts
  0 siblings, 1 reply; 116+ messages in thread
From: Barry Song @ 2024-03-05  9:54 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: akpm, david, linux-kernel, linux-mm, mhocko, shy828301,
	wangkefeng.wang, willy, xiang, ying.huang, yuzhao, chrisl,
	surenb, hanchuanhua

On Tue, Mar 5, 2024 at 10:00 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi Barry,
>
> On 18/02/2024 23:40, Barry Song wrote:
> > On Tue, Feb 6, 2024 at 1:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 05/02/2024 09:51, Barry Song wrote:
> >>> +Chris, Suren and Chuanhua
> >>>
> >>> Hi Ryan,
> [...]
> >>
> >
> > Hi Ryan,
> > I am running into some races especially while enabling large folio swap-out and
> > swap-in both. some of them, i am still struggling with the detailed
> > timing how they
> > are happening.
> > but the below change can help remove those bugs which cause corrupted data.
>
> I'm getting quite confused with all the emails flying around on this topic. Here
> you were reporting a data corruption bug and your suggested fix below is the one
> you have now posted at [1]. But in the thread at [1] we concluded that it is not
> fixing a functional correctness issue, but is just an optimization in some
> corner cases. So does the corruption issue still manifest? Did you manage to
> root cause it? Is it a problem with my swap-out series or your swap-in series,
> or pre-existing?

Hi Ryan,

It is not a problem of your swap-out series, but a problem of my swap-in
series. The bug in swap-in series is triggered by the skipped PTEs in the
thread[1], but my swap-in code should still be able to cope with this situation
and survive it -  a large folio might be partially but not completely unmapped
after try_to_unmap_one(). I actually replied to you and explained all
the details here[2], but guess you missed it :-)

[1] https://lore.kernel.org/linux-mm/20240304103757.235352-1-21cnbao@gmail.com/
[2] https://lore.kernel.org/linux-mm/CAGsJ_4zdh5kOG7QP4UDaE-wmLFiTEJC2PX-_LxtOj=QrZSvkCA@mail.gmail.com/

apology this makes you confused.

>
> [1] https://lore.kernel.org/linux-mm/20240304103757.235352-1-21cnbao@gmail.com/
>
> Thanks,
> Ryan
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
  2024-03-05  9:54           ` Barry Song
@ 2024-03-05 10:44             ` Ryan Roberts
  0 siblings, 0 replies; 116+ messages in thread
From: Ryan Roberts @ 2024-03-05 10:44 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, david, linux-kernel, linux-mm, mhocko, shy828301,
	wangkefeng.wang, willy, xiang, ying.huang, yuzhao, chrisl,
	surenb, hanchuanhua

On 05/03/2024 09:54, Barry Song wrote:
> On Tue, Mar 5, 2024 at 10:00 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Hi Barry,
>>
>> On 18/02/2024 23:40, Barry Song wrote:
>>> On Tue, Feb 6, 2024 at 1:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 05/02/2024 09:51, Barry Song wrote:
>>>>> +Chris, Suren and Chuanhua
>>>>>
>>>>> Hi Ryan,
>> [...]
>>>>
>>>
>>> Hi Ryan,
>>> I am running into some races especially while enabling large folio swap-out and
>>> swap-in both. some of them, i am still struggling with the detailed
>>> timing how they
>>> are happening.
>>> but the below change can help remove those bugs which cause corrupted data.
>>
>> I'm getting quite confused with all the emails flying around on this topic. Here
>> you were reporting a data corruption bug and your suggested fix below is the one
>> you have now posted at [1]. But in the thread at [1] we concluded that it is not
>> fixing a functional correctness issue, but is just an optimization in some
>> corner cases. So does the corruption issue still manifest? Did you manage to
>> root cause it? Is it a problem with my swap-out series or your swap-in series,
>> or pre-existing?
> 
> Hi Ryan,
> 
> It is not a problem of your swap-out series, but a problem of my swap-in
> series. The bug in swap-in series is triggered by the skipped PTEs in the
> thread[1], but my swap-in code should still be able to cope with this situation
> and survive it -  a large folio might be partially but not completely unmapped
> after try_to_unmap_one(). 

Ahh, understood, thanks!

> I actually replied to you and explained all
> the details here[2], but guess you missed it :-)

I did read that mail, but the first line "They are the same" made me think this
was solving a functional problem. And I still have a very shaky understanding of
parts of the code that I haven't directly worked on, so sometimes some of the
details go over my head - I'll get there eventually!

> 
> [1] https://lore.kernel.org/linux-mm/20240304103757.235352-1-21cnbao@gmail.com/
> [2] https://lore.kernel.org/linux-mm/CAGsJ_4zdh5kOG7QP4UDaE-wmLFiTEJC2PX-_LxtOj=QrZSvkCA@mail.gmail.com/
> 
> apology this makes you confused.

No need to apologise - I appreciate your taking the time to write it all down in
detail. It helps me to learn these areas of the code.

> 
>>
>> [1] https://lore.kernel.org/linux-mm/20240304103757.235352-1-21cnbao@gmail.com/
>>
>> Thanks,
>> Ryan
>>
> 
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()
  2024-01-29 10:06           ` David Hildenbrand
  2024-01-29 16:31             ` Chris Li
@ 2024-04-06 23:27             ` Barry Song
  1 sibling, 0 replies; 116+ messages in thread
From: Barry Song @ 2024-04-06 23:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Chris Li, ryan.roberts, akpm, linux-mm, linux-kernel, mhocko,
	shy828301, wangkefeng.wang, willy, xiang, ying.huang, yuzhao,
	surenb, steven.price, Barry Song, Chuanhua Han

On Mon, Jan 29, 2024 at 11:07 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 29.01.24 04:25, Chris Li wrote:
> > Hi David and Barry,
> >
> > On Mon, Jan 22, 2024 at 10:49 PM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >>>
> >>>
> >>> I have on my todo list to move all that !anon handling out of
> >>> folio_add_anon_rmap_ptes(), and instead make swapin code call add
> >>> folio_add_new_anon_rmap(), where we'll have to pass an exclusive flag
> >>> then (-> whole new folio exclusive).
> >>>
> >>> That's the cleaner approach.
> >>>
> >>
> >> one tricky thing is that sometimes it is hard to know who is the first
> >> one to add rmap and thus should
> >> call folio_add_new_anon_rmap.
> >> especially when we want to support swapin_readahead(), the one who
> >> allocated large filio might not
> >> be that one who firstly does rmap.
> >
> > I think Barry has a point. Two tasks might race to swap in the folio
> > then race to perform the rmap.
> > folio_add_new_anon_rmap() should only call a folio that is absolutely
> > "new", not shared. The sharing in swap cache disqualifies that
> > condition.
>
> We have to hold the folio lock. So only one task at a time might do the
> folio_add_anon_rmap_ptes() right now, and the
> folio_add_new_shared_anon_rmap() in the future [below].
>
> Also observe how folio_add_anon_rmap_ptes() states that one must hold
> the page lock, because otherwise this would all be completely racy.
>
>  From the pte swp exclusive flags, we know for sure whether we are
> dealing with exclusive vs. shared. I think patch #6 does not properly
> check that all entries are actually the same in that regard (all
> exclusive vs all shared). That likely needs fixing.
>
> [I have converting per-page PageAnonExclusive flags to a single
> per-folio flag on my todo list. I suspect that we'll keep the
> per-swp-pte exlusive bits, but the question is rather what we can
> actually make work, because swap and migration just make it much more
> complicated. Anyhow, future work]
>
> >
> >> is it an acceptable way to do the below in do_swap_page?
> >> if (!folio_test_anon(folio))
> >>        folio_add_new_anon_rmap()
> >> else
> >>        folio_add_anon_rmap_ptes()
> >
> > I am curious to know the answer as well.
>
>
> Yes, the end code should likely be something like:
>
> /* ksm created a completely new copy */
> if (unlikely(folio != swapcache && swapcache)) {
>         folio_add_new_anon_rmap(folio, vma, vmf->address);
>         folio_add_lru_vma(folio, vma);
> } else if (folio_test_anon(folio)) {
>         folio_add_anon_rmap_ptes(rmap_flags)
> } else {
>         folio_add_new_anon_rmap(rmap_flags)
> }
>
> Maybe we want to avoid teaching all existing folio_add_new_anon_rmap()
> callers about a new flag, and just have a new
> folio_add_new_shared_anon_rmap() instead. TBD.

right.

We need to clarify that the new anon_folio might not necessarily be exclusive.
Unlike folio_add_new_anon_rmap, which assumes the new folio is exclusive,
folio_add_anon_rmap_ptes is capable of handling both exclusive and
non-exclusive new anon folios.

The code would be like:

 if (unlikely(folio != swapcache && swapcache)) {
         folio_add_new_anon_rmap(folio, vma, vmf->address);
         folio_add_lru_vma(folio, vma);
 } else if (!folio_test_anon(folio)) {
         folio_add_anon_rmap_ptes(rmap_flags);
 } else {
         if (exclusive)
                 folio_add_new_anon_rmap();
         else
                 folio_add_new_shared_anon_rmap();
 }

It appears a bit lengthy?

>
> >
> > BTW, that test might have a race as well. By the time the task got
> > !anon result, this result might get changed by another task. We need
> > to make sure in the caller context this race can't happen. Otherwise
> > we can't do the above safely.
> Again, folio lock. Observe the folio_lock_or_retry() call that covers
> our existing folio_add_new_anon_rmap/folio_add_anon_rmap_pte calls.
>
> --
> Cheers,
>
> David / dhildenb

Thanks
Barry

^ permalink raw reply	[flat|nested] 116+ messages in thread

end of thread, other threads:[~2024-04-06 23:27 UTC | newest]

Thread overview: 116+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-25 14:45 [PATCH v3 0/4] Swap-out small-sized THP without splitting Ryan Roberts
2023-10-25 14:45 ` [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
2024-02-22 10:19   ` David Hildenbrand
2024-02-22 10:20     ` David Hildenbrand
2024-02-26 17:41       ` Ryan Roberts
2024-02-27 17:10         ` Ryan Roberts
2024-02-27 19:17           ` David Hildenbrand
2024-02-28  9:37             ` Ryan Roberts
2024-02-28 12:12               ` David Hildenbrand
2024-02-28 14:57                 ` Ryan Roberts
2024-02-28 15:12                   ` David Hildenbrand
2024-02-28 15:18                     ` Ryan Roberts
2024-03-01 16:27                     ` Ryan Roberts
2024-03-01 16:31                       ` Matthew Wilcox
2024-03-01 16:44                         ` Ryan Roberts
2024-03-01 17:00                           ` David Hildenbrand
2024-03-01 17:14                             ` Ryan Roberts
2024-03-01 17:18                               ` David Hildenbrand
2024-03-01 17:06                           ` Ryan Roberts
2024-03-04  4:52                             ` Barry Song
2024-03-04  5:42                               ` Barry Song
2024-03-05  7:41                                 ` Ryan Roberts
2024-03-01 16:31                       ` Ryan Roberts
2024-03-01 16:32                       ` David Hildenbrand
2024-03-04 16:03                 ` Ryan Roberts
2024-03-04 17:30                   ` David Hildenbrand
2024-03-04 18:38                     ` Ryan Roberts
2024-03-04 20:50                       ` David Hildenbrand
2024-03-04 21:55                         ` Ryan Roberts
2024-03-04 22:02                           ` David Hildenbrand
2024-03-04 22:34                             ` Ryan Roberts
2024-03-05  6:11                               ` Huang, Ying
2024-03-05  8:35                                 ` David Hildenbrand
2024-03-05  8:46                                   ` Ryan Roberts
2024-02-28 13:33               ` Matthew Wilcox
2024-02-28 14:24                 ` Ryan Roberts
2024-02-28 14:59                   ` Ryan Roberts
2023-10-25 14:45 ` [PATCH v3 2/4] mm: swap: Remove struct percpu_cluster Ryan Roberts
2023-10-25 14:45 ` [PATCH v3 3/4] mm: swap: Simplify ssd behavior when scanner steals entry Ryan Roberts
2023-10-25 14:45 ` [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting Ryan Roberts
2023-10-30  8:18   ` Huang, Ying
2023-10-30 13:59     ` Ryan Roberts
2023-10-31  8:12       ` Huang, Ying
2023-11-03 11:42         ` Ryan Roberts
2023-11-02  7:40   ` Barry Song
2023-11-02 10:21     ` Ryan Roberts
2023-11-02 22:36       ` Barry Song
2023-11-03 11:31         ` Ryan Roberts
2023-11-03 13:57           ` Steven Price
2023-11-04  9:34             ` Barry Song
2023-11-06 10:12               ` Steven Price
2023-11-06 21:39                 ` Barry Song
2023-11-08 11:51                   ` Steven Price
2023-11-07 12:46               ` Ryan Roberts
2023-11-07 18:05                 ` Barry Song
2023-11-08 11:23                   ` Barry Song
2023-11-08 20:20                     ` Ryan Roberts
2023-11-08 21:04                       ` Barry Song
2023-11-04  5:49           ` Barry Song
2024-02-05  9:51   ` Barry Song
2024-02-05 12:14     ` Ryan Roberts
2024-02-18 23:40       ` Barry Song
2024-02-20 20:03         ` Ryan Roberts
2024-03-05  9:00         ` Ryan Roberts
2024-03-05  9:54           ` Barry Song
2024-03-05 10:44             ` Ryan Roberts
2024-02-27 12:28     ` Ryan Roberts
2024-02-27 13:37     ` Ryan Roberts
2024-02-28  2:46       ` Barry Song
2024-02-22  7:05   ` Barry Song
2024-02-22 10:09     ` David Hildenbrand
2024-02-23  9:46       ` Barry Song
2024-02-27 12:05         ` Ryan Roberts
2024-02-28  1:23           ` Barry Song
2024-02-28  9:34             ` David Hildenbrand
2024-02-28 23:18               ` Barry Song
2024-02-28 15:57             ` Ryan Roberts
2023-11-29  7:47 ` [PATCH v3 0/4] " Barry Song
2023-11-29 12:06   ` Ryan Roberts
2023-11-29 20:38     ` Barry Song
2024-01-18 11:10 ` [PATCH RFC 0/6] mm: support large folios swap-in Barry Song
2024-01-18 11:10   ` [PATCH RFC 1/6] arm64: mm: swap: support THP_SWAP on hardware with MTE Barry Song
2024-01-26 23:14     ` Chris Li
2024-02-26  2:59       ` Barry Song
2024-01-18 11:10   ` [PATCH RFC 2/6] mm: swap: introduce swap_nr_free() for batched swap_free() Barry Song
2024-01-26 23:17     ` Chris Li
2024-02-26  4:47       ` Barry Song
2024-01-18 11:10   ` [PATCH RFC 3/6] mm: swap: make should_try_to_free_swap() support large-folio Barry Song
2024-01-26 23:22     ` Chris Li
2024-01-18 11:10   ` [PATCH RFC 4/6] mm: support large folios swapin as a whole Barry Song
2024-01-27 19:53     ` Chris Li
2024-02-26  7:29       ` Barry Song
2024-01-27 20:06     ` Chris Li
2024-02-26  7:31       ` Barry Song
2024-01-18 11:10   ` [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap() Barry Song
2024-01-18 11:54     ` David Hildenbrand
2024-01-23  6:49       ` Barry Song
2024-01-29  3:25         ` Chris Li
2024-01-29 10:06           ` David Hildenbrand
2024-01-29 16:31             ` Chris Li
2024-02-26  5:05               ` Barry Song
2024-04-06 23:27             ` Barry Song
2024-01-27 23:41     ` Chris Li
2024-01-18 11:10   ` [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT Barry Song
2024-01-29  2:15     ` Chris Li
2024-02-26  6:39       ` Barry Song
2024-02-27 12:22     ` Ryan Roberts
2024-02-27 22:39       ` Barry Song
2024-02-27 14:40     ` Ryan Roberts
2024-02-27 18:57       ` Barry Song
2024-02-28  3:49         ` Barry Song
2024-01-18 15:25   ` [PATCH RFC 0/6] mm: support large folios swap-in Ryan Roberts
2024-01-18 23:54     ` Barry Song
2024-01-19 13:25       ` Ryan Roberts
2024-01-27 14:27         ` Barry Song
2024-01-29  9:05   ` Huang, Ying

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).