linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/2] Swap-out small-sized THP without splitting
@ 2023-10-17 16:13 Ryan Roberts
  2023-10-17 16:13 ` [PATCH v2 1/2] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
  2023-10-17 16:13 ` [PATCH v2 2/2] mm: swap: Swap-out small-sized THP without splitting Ryan Roberts
  0 siblings, 2 replies; 8+ messages in thread
From: Ryan Roberts @ 2023-10-17 16:13 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: Ryan Roberts, linux-kernel, linux-mm

Hi All,

This is v2 of a series to add support for swapping out small-sized THP without
needing to first split the large folio via __split_huge_page(). It closely
follows the approach already used by PMD-sized THP.

"Small-sized THP" is an upcoming feature that enables performance improvements
by allocating large folios for anonymous memory, where the large folio size is
smaller than the traditional PMD-size. See [2].

In some circumstances I've observed a performance regression (see patch 2 for
details), and this series is an attempt to fix the regression in advance of
merging small-sized THP support.

I've done what I thought was the smallest change possible, and as a result, this
approach is only employed when the swap is backed by a non-rotating block device
(just as PMD-sized THP is supported today). Discussion against the RFC concluded
that this is probably sufficient.

The series applies against mm-unstable (3fb06e6d0a6f)


Changes since v1 [1]
====================

 - patch 1:
    - Use cluster_set_count() instead of cluster_set_count_flag() in
      swap_alloc_cluster() since we no longer have any flag to set. I was unable
      to kill cluster_set_count_flag() as proposed against v1 as other call
      sites depend explicitly setting flags to 0.
 - patch 2:
    - Moved large_next[] array into percpu_cluster to make it per-cpu
      (recommended by Huang, Ying).
    - large_next[] array is dynamically allocated because PMD_ORDER is not
      compile-time constant for powerpc (fixes build error).


Thanks,
Ryan


[1] https://lore.kernel.org/linux-mm/20231010142111.3997780-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/15a52c3d-9584-449b-8228-1335e0753b04@arm.com/


Ryan Roberts (2):
  mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  mm: swap: Swap-out small-sized THP without splitting

 include/linux/swap.h |  16 +++---
 mm/huge_memory.c     |   3 --
 mm/swapfile.c        | 119 ++++++++++++++++++++++++-------------------
 mm/vmscan.c          |  10 ++--
 4 files changed, 78 insertions(+), 70 deletions(-)

--
2.25.1



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2 1/2] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2023-10-17 16:13 [PATCH v2 0/2] Swap-out small-sized THP without splitting Ryan Roberts
@ 2023-10-17 16:13 ` Ryan Roberts
  2023-10-17 16:13 ` [PATCH v2 2/2] mm: swap: Swap-out small-sized THP without splitting Ryan Roberts
  1 sibling, 0 replies; 8+ messages in thread
From: Ryan Roberts @ 2023-10-17 16:13 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: Ryan Roberts, linux-kernel, linux-mm

As preparation for supporting small-sized THP in the swap-out path,
without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE,
which, when present, always implies PMD-sized THP, which is the same as
the cluster size.

The only use of the flag was to determine whether a swap entry refers to
a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
Instead of relying on the flag, we now pass in nr_pages, which
originates from the folio's number of pages. This allows the logic to
work for folios of any order.

The one snag is that one of the swap_page_trans_huge_swapped() call
sites does not have the folio. But it was only being called there to
avoid bothering to call __try_to_reclaim_swap() in some cases.
__try_to_reclaim_swap() gets the folio and (via some other functions)
calls swap_page_trans_huge_swapped(). So I've removed the problematic
call site and believe the new logic should be equivalent.

Removing CLUSTER_FLAG_HUGE also means we can remove split_swap_cluster()
which used to be called during folio splitting, since
split_swap_cluster()'s only job was to remove the flag.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/swap.h | 10 ----------
 mm/huge_memory.c     |  3 ---
 mm/swapfile.c        | 47 ++++++++------------------------------------
 3 files changed, 8 insertions(+), 52 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 19f30a29e1f1..a073366a227c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -259,7 +259,6 @@ struct swap_cluster_info {
 };
 #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
 #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
-#define CLUSTER_FLAG_HUGE 4 /* This cluster is backing a transparent huge page */
 
 /*
  * We assign a cluster to each CPU, so each CPU can allocate swap entry from
@@ -595,15 +594,6 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
 }
 #endif /* CONFIG_SWAP */
 
-#ifdef CONFIG_THP_SWAP
-extern int split_swap_cluster(swp_entry_t entry);
-#else
-static inline int split_swap_cluster(swp_entry_t entry)
-{
-	return 0;
-}
-#endif
-
 #ifdef CONFIG_MEMCG
 static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c9cbcbf6697e..46b3fb943207 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2597,9 +2597,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		shmem_uncharge(head->mapping->host, nr_dropped);
 	remap_page(folio, nr);
 
-	if (folio_test_swapcache(folio))
-		split_swap_cluster(folio->swap);
-
 	for (i = 0; i < nr; i++) {
 		struct page *subpage = head + i;
 		if (subpage == page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e52f486834eb..b83ad77e04c0 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -342,18 +342,6 @@ static inline void cluster_set_null(struct swap_cluster_info *info)
 	info->data = 0;
 }
 
-static inline bool cluster_is_huge(struct swap_cluster_info *info)
-{
-	if (IS_ENABLED(CONFIG_THP_SWAP))
-		return info->flags & CLUSTER_FLAG_HUGE;
-	return false;
-}
-
-static inline void cluster_clear_huge(struct swap_cluster_info *info)
-{
-	info->flags &= ~CLUSTER_FLAG_HUGE;
-}
-
 static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
 						     unsigned long offset)
 {
@@ -1021,7 +1009,7 @@ static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
 	offset = idx * SWAPFILE_CLUSTER;
 	ci = lock_cluster(si, offset);
 	alloc_cluster(si, idx);
-	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, CLUSTER_FLAG_HUGE);
+	cluster_set_count(ci, SWAPFILE_CLUSTER);
 
 	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
 	unlock_cluster(ci);
@@ -1354,7 +1342,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 
 	ci = lock_cluster_or_swap_info(si, offset);
 	if (size == SWAPFILE_CLUSTER) {
-		VM_BUG_ON(!cluster_is_huge(ci));
 		map = si->swap_map + offset;
 		for (i = 0; i < SWAPFILE_CLUSTER; i++) {
 			val = map[i];
@@ -1362,7 +1349,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 			if (val == SWAP_HAS_CACHE)
 				free_entries++;
 		}
-		cluster_clear_huge(ci);
 		if (free_entries == SWAPFILE_CLUSTER) {
 			unlock_cluster_or_swap_info(si, ci);
 			spin_lock(&si->lock);
@@ -1384,23 +1370,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	unlock_cluster_or_swap_info(si, ci);
 }
 
-#ifdef CONFIG_THP_SWAP
-int split_swap_cluster(swp_entry_t entry)
-{
-	struct swap_info_struct *si;
-	struct swap_cluster_info *ci;
-	unsigned long offset = swp_offset(entry);
-
-	si = _swap_info_get(entry);
-	if (!si)
-		return -EBUSY;
-	ci = lock_cluster(si, offset);
-	cluster_clear_huge(ci);
-	unlock_cluster(ci);
-	return 0;
-}
-#endif
-
 static int swp_entry_cmp(const void *ent1, const void *ent2)
 {
 	const swp_entry_t *e1 = ent1, *e2 = ent2;
@@ -1508,22 +1477,23 @@ int swp_swapcount(swp_entry_t entry)
 }
 
 static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
-					 swp_entry_t entry)
+					 swp_entry_t entry,
+					 unsigned int nr_pages)
 {
 	struct swap_cluster_info *ci;
 	unsigned char *map = si->swap_map;
 	unsigned long roffset = swp_offset(entry);
-	unsigned long offset = round_down(roffset, SWAPFILE_CLUSTER);
+	unsigned long offset = round_down(roffset, nr_pages);
 	int i;
 	bool ret = false;
 
 	ci = lock_cluster_or_swap_info(si, offset);
-	if (!ci || !cluster_is_huge(ci)) {
+	if (!ci || nr_pages == 1) {
 		if (swap_count(map[roffset]))
 			ret = true;
 		goto unlock_out;
 	}
-	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+	for (i = 0; i < nr_pages; i++) {
 		if (swap_count(map[offset + i])) {
 			ret = true;
 			break;
@@ -1545,7 +1515,7 @@ static bool folio_swapped(struct folio *folio)
 	if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio)))
 		return swap_swapcount(si, entry) != 0;
 
-	return swap_page_trans_huge_swapped(si, entry);
+	return swap_page_trans_huge_swapped(si, entry, folio_nr_pages(folio));
 }
 
 /**
@@ -1606,8 +1576,7 @@ int free_swap_and_cache(swp_entry_t entry)
 	p = _swap_info_get(entry);
 	if (p) {
 		count = __swap_entry_free(p, entry);
-		if (count == SWAP_HAS_CACHE &&
-		    !swap_page_trans_huge_swapped(p, entry))
+		if (count == SWAP_HAS_CACHE)
 			__try_to_reclaim_swap(p, swp_offset(entry),
 					      TTRS_UNMAPPED | TTRS_FULL);
 	}
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v2 2/2] mm: swap: Swap-out small-sized THP without splitting
  2023-10-17 16:13 [PATCH v2 0/2] Swap-out small-sized THP without splitting Ryan Roberts
  2023-10-17 16:13 ` [PATCH v2 1/2] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
@ 2023-10-17 16:13 ` Ryan Roberts
  2023-10-18  6:55   ` Huang, Ying
  1 sibling, 1 reply; 8+ messages in thread
From: Ryan Roberts @ 2023-10-17 16:13 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
  Cc: Ryan Roberts, linux-kernel, linux-mm

The upcoming anonymous small-sized THP feature enables performance
improvements by allocating large folios for anonymous memory. However
I've observed that on an arm64 system running a parallel workload (e.g.
kernel compilation) across many cores, under high memory pressure, the
speed regresses. This is due to bottlenecking on the increased number of
TLBIs added due to all the extra folio splitting.

Therefore, solve this regression by adding support for swapping out
small-sized THP without needing to split the folio, just like is already
done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
enabled, and when the swap backing store is a non-rotating block device.
These are the same constraints as for the existing PMD-sized THP
swap-out support.

Note that no attempt is made to swap-in THP here - this is still done
page-by-page, like for PMD-sized THP.

The main change here is to improve the swap entry allocator so that it
can allocate any power-of-2 number of contiguous entries between [4, (1
<< PMD_ORDER)] (THP cannot support order-1 folios). This is done by
allocating a cluster for each distinct order and allocating sequentially
from it until the cluster is full. This ensures that we don't need to
search the map and we get no fragmentation due to alignment padding for
different orders in the cluster. If there is no current cluster for a
given order, we attempt to allocate a free cluster from the list. If
there are no free clusters, we fail the allocation and the caller falls
back to splitting the folio and allocates individual entries (as per
existing PMD-sized THP fallback).

The per-order current clusters are maintained per-cpu using the existing
percpu_cluster infrastructure. This is done to avoid interleving pages
from different tasks, which would prevent IO being batched. This is
already done for the order-0 allocations so we follow the same pattern.

As far as I can tell, this should not cause any extra fragmentation
concerns, given how similar it is to the existing PMD-sized THP
allocation mechanism. There could be up to (PMD_ORDER-2) * nr_cpus
clusters in concurrent use though, which in a pathalogical case (cluster
set aside for every order for every cpu and only one huge entry
allocated from it) would tie up ~12MiB of unused swap entries for these
high orders (assuming PMD_ORDER=9). In practice, the number of orders in
use will be small and the amount of swap space reserved is very small
compared to a typical swap file.

Note that PMD_ORDER is not compile-time constant on powerpc, so we have
to allocate the large_next[] array at runtime.

I've run the tests on Ampere Altra (arm64), set up with a 35G block ram
device as the swap device and from inside a memcg limited to 40G memory.
I've then run `usemem` from vm-scalability with 70 processes (each has
its own core), each allocating and writing 1G of memory. I've repeated
everything 5 times and taken the mean and stdev:

Mean Performance Improvement vs 4K/baseline

| alloc size |            baseline |       + this series |
|            |  v6.6-rc4+anonfolio |                     |
|:-----------|--------------------:|--------------------:|
| 4K Page    |                0.0% |                1.1% |
| 64K THP    |              -44.1% |                0.9% |
| 2M THP     |               56.0% |               56.4% |

So with this change, the regression for 64K swap performance goes away.
Both 4K and 64K benhcmarks are now bottlenecked on TLBI performance from
try_to_unmap_flush_dirty(), on arm64 at least. When using fewer cpus in
the test, I see upto 2x performance of 64K THP swapping compared to 4K.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/swap.h |  6 ++++
 mm/swapfile.c        | 74 +++++++++++++++++++++++++++++++++++---------
 mm/vmscan.c          | 10 +++---
 3 files changed, 71 insertions(+), 19 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a073366a227c..35cbbe6509a9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -268,6 +268,12 @@ struct swap_cluster_info {
 struct percpu_cluster {
 	struct swap_cluster_info index; /* Current cluster index */
 	unsigned int next; /* Likely next allocation offset */
+	unsigned int large_next[];	/*
+					 * next free offset within current
+					 * allocation cluster for large folios,
+					 * or UINT_MAX if no current cluster.
+					 * Index is (order - 1).
+					 */
 };

 struct swap_cluster_list {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b83ad77e04c0..625964e53c22 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -987,35 +987,70 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	return n_ret;
 }

-static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
+static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
+			    unsigned int nr_pages)
 {
+	int order_idx;
 	unsigned long idx;
 	struct swap_cluster_info *ci;
+	struct percpu_cluster *cluster;
 	unsigned long offset;

 	/*
 	 * Should not even be attempting cluster allocations when huge
 	 * page swap is disabled.  Warn and fail the allocation.
 	 */
-	if (!IS_ENABLED(CONFIG_THP_SWAP)) {
+	if (!IS_ENABLED(CONFIG_THP_SWAP) ||
+	    nr_pages < 4 || nr_pages > SWAPFILE_CLUSTER ||
+	    !is_power_of_2(nr_pages)) {
 		VM_WARN_ON_ONCE(1);
 		return 0;
 	}

-	if (cluster_list_empty(&si->free_clusters))
+	/*
+	 * Not using clusters so unable to allocate large entries.
+	 */
+	if (!si->cluster_info)
 		return 0;

-	idx = cluster_list_first(&si->free_clusters);
-	offset = idx * SWAPFILE_CLUSTER;
-	ci = lock_cluster(si, offset);
-	alloc_cluster(si, idx);
-	cluster_set_count(ci, SWAPFILE_CLUSTER);
+	order_idx = ilog2(nr_pages) - 2;
+	cluster = this_cpu_ptr(si->percpu_cluster);
+	offset = cluster->large_next[order_idx];
+
+	if (offset == UINT_MAX) {
+		if (cluster_list_empty(&si->free_clusters))
+			return 0;
+
+		idx = cluster_list_first(&si->free_clusters);
+		offset = idx * SWAPFILE_CLUSTER;

-	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
+		ci = lock_cluster(si, offset);
+		alloc_cluster(si, idx);
+		cluster_set_count(ci, SWAPFILE_CLUSTER);
+
+		/*
+		 * If scan_swap_map_slots() can't find a free cluster, it will
+		 * check si->swap_map directly. To make sure this standby
+		 * cluster isn't taken by scan_swap_map_slots(), mark the swap
+		 * entries bad (occupied). (same approach as discard).
+		 */
+		memset(si->swap_map + offset + nr_pages, SWAP_MAP_BAD,
+			SWAPFILE_CLUSTER - nr_pages);
+	} else {
+		idx = offset / SWAPFILE_CLUSTER;
+		ci = lock_cluster(si, offset);
+	}
+
+	memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
 	unlock_cluster(ci);
-	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
+	swap_range_alloc(si, offset, nr_pages);
 	*slot = swp_entry(si->type, offset);

+	offset += nr_pages;
+	if (idx != offset / SWAPFILE_CLUSTER)
+		offset = UINT_MAX;
+	cluster->large_next[order_idx] = offset;
+
 	return 1;
 }

@@ -1041,7 +1076,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 	int node;

 	/* Only single cluster request supported */
-	WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
+	WARN_ON_ONCE(n_goal > 1 && size > 1);

 	spin_lock(&swap_avail_lock);

@@ -1078,14 +1113,14 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 			spin_unlock(&si->lock);
 			goto nextsi;
 		}
-		if (size == SWAPFILE_CLUSTER) {
+		if (size > 1) {
 			if (si->flags & SWP_BLKDEV)
-				n_ret = swap_alloc_cluster(si, swp_entries);
+				n_ret = swap_alloc_large(si, swp_entries, size);
 		} else
 			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
 						    n_goal, swp_entries);
 		spin_unlock(&si->lock);
-		if (n_ret || size == SWAPFILE_CLUSTER)
+		if (n_ret || size > 1)
 			goto check_out;
 		cond_resched();

@@ -3046,6 +3081,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	if (p->bdev && bdev_nonrot(p->bdev)) {
 		int cpu;
 		unsigned long ci, nr_cluster;
+		int nr_order;
+		int i;

 		p->flags |= SWP_SOLIDSTATE;
 		p->cluster_next_cpu = alloc_percpu(unsigned int);
@@ -3073,7 +3110,12 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		for (ci = 0; ci < nr_cluster; ci++)
 			spin_lock_init(&((cluster_info + ci)->lock));

-		p->percpu_cluster = alloc_percpu(struct percpu_cluster);
+		nr_order = IS_ENABLED(CONFIG_THP_SWAP) ? PMD_ORDER - 1 : 0;
+		p->percpu_cluster = __alloc_percpu(
+					struct_size(p->percpu_cluster,
+						    large_next,
+						    nr_order),
+					__alignof__(struct percpu_cluster));
 		if (!p->percpu_cluster) {
 			error = -ENOMEM;
 			goto bad_swap_unlock_inode;
@@ -3082,6 +3124,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 			struct percpu_cluster *cluster;
 			cluster = per_cpu_ptr(p->percpu_cluster, cpu);
 			cluster_set_null(&cluster->index);
+			for (i = 0; i < nr_order; i++)
+				cluster->large_next[i] = UINT_MAX;
 		}
 	} else {
 		atomic_inc(&nr_rotate_swap);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c16e2b1ea8ae..5984d2ae4547 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 					if (!can_split_folio(folio, NULL))
 						goto activate_locked;
 					/*
-					 * Split folios without a PMD map right
-					 * away. Chances are some or all of the
-					 * tail pages can be freed without IO.
+					 * Split PMD-mappable folios without a
+					 * PMD map right away. Chances are some
+					 * or all of the tail pages can be freed
+					 * without IO.
 					 */
-					if (!folio_entire_mapcount(folio) &&
+					if (folio_test_pmd_mappable(folio) &&
+					    !folio_entire_mapcount(folio) &&
 					    split_folio_to_list(folio,
 								folio_list))
 						goto activate_locked;
--
2.25.1



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 2/2] mm: swap: Swap-out small-sized THP without splitting
  2023-10-17 16:13 ` [PATCH v2 2/2] mm: swap: Swap-out small-sized THP without splitting Ryan Roberts
@ 2023-10-18  6:55   ` Huang, Ying
  2023-10-18 14:07     ` Ryan Roberts
  0 siblings, 1 reply; 8+ messages in thread
From: Huang, Ying @ 2023-10-18  6:55 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, linux-kernel,
	linux-mm

Ryan Roberts <ryan.roberts@arm.com> writes:

> The upcoming anonymous small-sized THP feature enables performance
> improvements by allocating large folios for anonymous memory. However
> I've observed that on an arm64 system running a parallel workload (e.g.
> kernel compilation) across many cores, under high memory pressure, the
> speed regresses. This is due to bottlenecking on the increased number of
> TLBIs added due to all the extra folio splitting.
>
> Therefore, solve this regression by adding support for swapping out
> small-sized THP without needing to split the folio, just like is already
> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
> enabled, and when the swap backing store is a non-rotating block device.
> These are the same constraints as for the existing PMD-sized THP
> swap-out support.
>
> Note that no attempt is made to swap-in THP here - this is still done
> page-by-page, like for PMD-sized THP.
>
> The main change here is to improve the swap entry allocator so that it
> can allocate any power-of-2 number of contiguous entries between [4, (1
> << PMD_ORDER)] (THP cannot support order-1 folios). This is done by
> allocating a cluster for each distinct order and allocating sequentially
> from it until the cluster is full. This ensures that we don't need to
> search the map and we get no fragmentation due to alignment padding for
> different orders in the cluster. If there is no current cluster for a
> given order, we attempt to allocate a free cluster from the list. If
> there are no free clusters, we fail the allocation and the caller falls
> back to splitting the folio and allocates individual entries (as per
> existing PMD-sized THP fallback).
>
> The per-order current clusters are maintained per-cpu using the existing
> percpu_cluster infrastructure. This is done to avoid interleving pages
> from different tasks, which would prevent IO being batched. This is
> already done for the order-0 allocations so we follow the same pattern.
>
> As far as I can tell, this should not cause any extra fragmentation
> concerns, given how similar it is to the existing PMD-sized THP
> allocation mechanism. There could be up to (PMD_ORDER-2) * nr_cpus
> clusters in concurrent use though, which in a pathalogical case (cluster
> set aside for every order for every cpu and only one huge entry
> allocated from it) would tie up ~12MiB of unused swap entries for these
> high orders (assuming PMD_ORDER=9). In practice, the number of orders in
> use will be small and the amount of swap space reserved is very small
> compared to a typical swap file.
>
> Note that PMD_ORDER is not compile-time constant on powerpc, so we have
> to allocate the large_next[] array at runtime.
>
> I've run the tests on Ampere Altra (arm64), set up with a 35G block ram
> device as the swap device and from inside a memcg limited to 40G memory.
> I've then run `usemem` from vm-scalability with 70 processes (each has
> its own core), each allocating and writing 1G of memory. I've repeated
> everything 5 times and taken the mean and stdev:
>
> Mean Performance Improvement vs 4K/baseline
>
> | alloc size |            baseline |       + this series |
> |            |  v6.6-rc4+anonfolio |                     |
> |:-----------|--------------------:|--------------------:|
> | 4K Page    |                0.0% |                1.1% |
> | 64K THP    |              -44.1% |                0.9% |
> | 2M THP     |               56.0% |               56.4% |
>
> So with this change, the regression for 64K swap performance goes away.
> Both 4K and 64K benhcmarks are now bottlenecked on TLBI performance from
> try_to_unmap_flush_dirty(), on arm64 at least. When using fewer cpus in
> the test, I see upto 2x performance of 64K THP swapping compared to 4K.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  include/linux/swap.h |  6 ++++
>  mm/swapfile.c        | 74 +++++++++++++++++++++++++++++++++++---------
>  mm/vmscan.c          | 10 +++---
>  3 files changed, 71 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a073366a227c..35cbbe6509a9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -268,6 +268,12 @@ struct swap_cluster_info {
>  struct percpu_cluster {
>  	struct swap_cluster_info index; /* Current cluster index */
>  	unsigned int next; /* Likely next allocation offset */
> +	unsigned int large_next[];	/*
> +					 * next free offset within current
> +					 * allocation cluster for large folios,
> +					 * or UINT_MAX if no current cluster.
> +					 * Index is (order - 1).
> +					 */
>  };
>
>  struct swap_cluster_list {
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index b83ad77e04c0..625964e53c22 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -987,35 +987,70 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  	return n_ret;
>  }
>
> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
> +			    unsigned int nr_pages)

This looks hacky.  IMO, we should put the allocation logic inside
percpu_cluster framework.  If percpu_cluster framework doesn't work for
you, just refactor it firstly.

>  {
> +	int order_idx;
>  	unsigned long idx;
>  	struct swap_cluster_info *ci;
> +	struct percpu_cluster *cluster;
>  	unsigned long offset;
>
>  	/*
>  	 * Should not even be attempting cluster allocations when huge
>  	 * page swap is disabled.  Warn and fail the allocation.
>  	 */
> -	if (!IS_ENABLED(CONFIG_THP_SWAP)) {
> +	if (!IS_ENABLED(CONFIG_THP_SWAP) ||
> +	    nr_pages < 4 || nr_pages > SWAPFILE_CLUSTER ||
> +	    !is_power_of_2(nr_pages)) {
>  		VM_WARN_ON_ONCE(1);
>  		return 0;
>  	}
>
> -	if (cluster_list_empty(&si->free_clusters))
> +	/*
> +	 * Not using clusters so unable to allocate large entries.
> +	 */
> +	if (!si->cluster_info)
>  		return 0;
>
> -	idx = cluster_list_first(&si->free_clusters);
> -	offset = idx * SWAPFILE_CLUSTER;
> -	ci = lock_cluster(si, offset);
> -	alloc_cluster(si, idx);
> -	cluster_set_count(ci, SWAPFILE_CLUSTER);
> +	order_idx = ilog2(nr_pages) - 2;
> +	cluster = this_cpu_ptr(si->percpu_cluster);
> +	offset = cluster->large_next[order_idx];
> +
> +	if (offset == UINT_MAX) {
> +		if (cluster_list_empty(&si->free_clusters))
> +			return 0;
> +
> +		idx = cluster_list_first(&si->free_clusters);
> +		offset = idx * SWAPFILE_CLUSTER;
>
> -	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
> +		ci = lock_cluster(si, offset);
> +		alloc_cluster(si, idx);
> +		cluster_set_count(ci, SWAPFILE_CLUSTER);
> +
> +		/*
> +		 * If scan_swap_map_slots() can't find a free cluster, it will
> +		 * check si->swap_map directly. To make sure this standby
> +		 * cluster isn't taken by scan_swap_map_slots(), mark the swap
> +		 * entries bad (occupied). (same approach as discard).
> +		 */
> +		memset(si->swap_map + offset + nr_pages, SWAP_MAP_BAD,
> +			SWAPFILE_CLUSTER - nr_pages);

There's an issue with this solution.  If the free space of swap device
runs low, it's possible that

- some cluster are put in the percpu_cluster of some CPUs
  the swap entries there are marked as used

- no free swap entries elsewhere

- nr_swap_pages isn't 0

So, we will still scan LRU, but swap allocation fails, although there's
still free swap space.

I think that we should follow the method we used for the original
percpu_cluster.  That is, if all free swap entries are in
percpu_cluster, we will start to allocate from percpu_cluster.

> +	} else {
> +		idx = offset / SWAPFILE_CLUSTER;
> +		ci = lock_cluster(si, offset);
> +	}
> +
> +	memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
>  	unlock_cluster(ci);
> -	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
> +	swap_range_alloc(si, offset, nr_pages);
>  	*slot = swp_entry(si->type, offset);
>
> +	offset += nr_pages;
> +	if (idx != offset / SWAPFILE_CLUSTER)
> +		offset = UINT_MAX;
> +	cluster->large_next[order_idx] = offset;
> +
>  	return 1;
>  }
>

[snip]

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 2/2] mm: swap: Swap-out small-sized THP without splitting
  2023-10-18  6:55   ` Huang, Ying
@ 2023-10-18 14:07     ` Ryan Roberts
  2023-10-19  5:49       ` Huang, Ying
  0 siblings, 1 reply; 8+ messages in thread
From: Ryan Roberts @ 2023-10-18 14:07 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, linux-kernel,
	linux-mm

On 18/10/2023 07:55, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> The upcoming anonymous small-sized THP feature enables performance
>> improvements by allocating large folios for anonymous memory. However
>> I've observed that on an arm64 system running a parallel workload (e.g.
>> kernel compilation) across many cores, under high memory pressure, the
>> speed regresses. This is due to bottlenecking on the increased number of
>> TLBIs added due to all the extra folio splitting.
>>
>> Therefore, solve this regression by adding support for swapping out
>> small-sized THP without needing to split the folio, just like is already
>> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
>> enabled, and when the swap backing store is a non-rotating block device.
>> These are the same constraints as for the existing PMD-sized THP
>> swap-out support.
>>
>> Note that no attempt is made to swap-in THP here - this is still done
>> page-by-page, like for PMD-sized THP.
>>
>> The main change here is to improve the swap entry allocator so that it
>> can allocate any power-of-2 number of contiguous entries between [4, (1
>> << PMD_ORDER)] (THP cannot support order-1 folios). This is done by
>> allocating a cluster for each distinct order and allocating sequentially
>> from it until the cluster is full. This ensures that we don't need to
>> search the map and we get no fragmentation due to alignment padding for
>> different orders in the cluster. If there is no current cluster for a
>> given order, we attempt to allocate a free cluster from the list. If
>> there are no free clusters, we fail the allocation and the caller falls
>> back to splitting the folio and allocates individual entries (as per
>> existing PMD-sized THP fallback).
>>
>> The per-order current clusters are maintained per-cpu using the existing
>> percpu_cluster infrastructure. This is done to avoid interleving pages
>> from different tasks, which would prevent IO being batched. This is
>> already done for the order-0 allocations so we follow the same pattern.
>>
>> As far as I can tell, this should not cause any extra fragmentation
>> concerns, given how similar it is to the existing PMD-sized THP
>> allocation mechanism. There could be up to (PMD_ORDER-2) * nr_cpus
>> clusters in concurrent use though, which in a pathalogical case (cluster
>> set aside for every order for every cpu and only one huge entry
>> allocated from it) would tie up ~12MiB of unused swap entries for these
>> high orders (assuming PMD_ORDER=9). In practice, the number of orders in
>> use will be small and the amount of swap space reserved is very small
>> compared to a typical swap file.
>>
>> Note that PMD_ORDER is not compile-time constant on powerpc, so we have
>> to allocate the large_next[] array at runtime.
>>
>> I've run the tests on Ampere Altra (arm64), set up with a 35G block ram
>> device as the swap device and from inside a memcg limited to 40G memory.
>> I've then run `usemem` from vm-scalability with 70 processes (each has
>> its own core), each allocating and writing 1G of memory. I've repeated
>> everything 5 times and taken the mean and stdev:
>>
>> Mean Performance Improvement vs 4K/baseline
>>
>> | alloc size |            baseline |       + this series |
>> |            |  v6.6-rc4+anonfolio |                     |
>> |:-----------|--------------------:|--------------------:|
>> | 4K Page    |                0.0% |                1.1% |
>> | 64K THP    |              -44.1% |                0.9% |
>> | 2M THP     |               56.0% |               56.4% |
>>
>> So with this change, the regression for 64K swap performance goes away.
>> Both 4K and 64K benhcmarks are now bottlenecked on TLBI performance from
>> try_to_unmap_flush_dirty(), on arm64 at least. When using fewer cpus in
>> the test, I see upto 2x performance of 64K THP swapping compared to 4K.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  include/linux/swap.h |  6 ++++
>>  mm/swapfile.c        | 74 +++++++++++++++++++++++++++++++++++---------
>>  mm/vmscan.c          | 10 +++---
>>  3 files changed, 71 insertions(+), 19 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index a073366a227c..35cbbe6509a9 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -268,6 +268,12 @@ struct swap_cluster_info {
>>  struct percpu_cluster {
>>  	struct swap_cluster_info index; /* Current cluster index */
>>  	unsigned int next; /* Likely next allocation offset */
>> +	unsigned int large_next[];	/*
>> +					 * next free offset within current
>> +					 * allocation cluster for large folios,
>> +					 * or UINT_MAX if no current cluster.
>> +					 * Index is (order - 1).
>> +					 */
>>  };
>>
>>  struct swap_cluster_list {
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index b83ad77e04c0..625964e53c22 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -987,35 +987,70 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  	return n_ret;
>>  }
>>
>> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
>> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
>> +			    unsigned int nr_pages)
> 
> This looks hacky.  IMO, we should put the allocation logic inside
> percpu_cluster framework.  If percpu_cluster framework doesn't work for
> you, just refactor it firstly.

I'm not sure I really understand what you are suggesting - could you elaborate?
What "framework"? I only see a per-cpu data structure and
scan_swap_map_try_ssd_cluster(), which is very much geared towards order-0
allocations.

Are you suggesting you want to allocate large entries (> order-0) from the same
cluster that is used for small (order-0) entries? The problem with this approach
is that there may not be enough space left in the current cluster for the large
entry that you are trying to allocate. Then you would need to take a new cluster
from the free list and presumably leave the previous cluster with unused entries
(which will now only be accessible by scanning). That unused space could be
considerable.

That's why I am currently reserving a "current cluster" per order; that way, all
allocations are the same order, they are all naturally aligned and there is no
waste.

Perhaps I could implement "stealing" between cpus so that a cpu trying to
allocate a specific order, but which doesn't have a current cluster for that
order and the free list is empty, could allocate from another cpu's current
cluster? I don't think it's a good idea to mix different orders in the same
cluster though.

I guess if really low, I could remove a current cluster from a cpu and allow it
to be scanned for order-0 allocations at least?

Any opinions gratefully received!


> 
>>  {
>> +	int order_idx;
>>  	unsigned long idx;
>>  	struct swap_cluster_info *ci;
>> +	struct percpu_cluster *cluster;
>>  	unsigned long offset;
>>
>>  	/*
>>  	 * Should not even be attempting cluster allocations when huge
>>  	 * page swap is disabled.  Warn and fail the allocation.
>>  	 */
>> -	if (!IS_ENABLED(CONFIG_THP_SWAP)) {
>> +	if (!IS_ENABLED(CONFIG_THP_SWAP) ||
>> +	    nr_pages < 4 || nr_pages > SWAPFILE_CLUSTER ||
>> +	    !is_power_of_2(nr_pages)) {
>>  		VM_WARN_ON_ONCE(1);
>>  		return 0;
>>  	}
>>
>> -	if (cluster_list_empty(&si->free_clusters))
>> +	/*
>> +	 * Not using clusters so unable to allocate large entries.
>> +	 */
>> +	if (!si->cluster_info)
>>  		return 0;
>>
>> -	idx = cluster_list_first(&si->free_clusters);
>> -	offset = idx * SWAPFILE_CLUSTER;
>> -	ci = lock_cluster(si, offset);
>> -	alloc_cluster(si, idx);
>> -	cluster_set_count(ci, SWAPFILE_CLUSTER);
>> +	order_idx = ilog2(nr_pages) - 2;
>> +	cluster = this_cpu_ptr(si->percpu_cluster);
>> +	offset = cluster->large_next[order_idx];
>> +
>> +	if (offset == UINT_MAX) {
>> +		if (cluster_list_empty(&si->free_clusters))
>> +			return 0;
>> +
>> +		idx = cluster_list_first(&si->free_clusters);
>> +		offset = idx * SWAPFILE_CLUSTER;
>>
>> -	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
>> +		ci = lock_cluster(si, offset);
>> +		alloc_cluster(si, idx);
>> +		cluster_set_count(ci, SWAPFILE_CLUSTER);
>> +
>> +		/*
>> +		 * If scan_swap_map_slots() can't find a free cluster, it will
>> +		 * check si->swap_map directly. To make sure this standby
>> +		 * cluster isn't taken by scan_swap_map_slots(), mark the swap
>> +		 * entries bad (occupied). (same approach as discard).
>> +		 */
>> +		memset(si->swap_map + offset + nr_pages, SWAP_MAP_BAD,
>> +			SWAPFILE_CLUSTER - nr_pages);
> 
> There's an issue with this solution.  If the free space of swap device
> runs low, it's possible that
> 
> - some cluster are put in the percpu_cluster of some CPUs
>   the swap entries there are marked as used
> 
> - no free swap entries elsewhere
> 
> - nr_swap_pages isn't 0
> 
> So, we will still scan LRU, but swap allocation fails, although there's
> still free swap space.

Ahh yes, good spot.

> 
> I think that we should follow the method we used for the original
> percpu_cluster.  That is, if all free swap entries are in
> percpu_cluster, we will start to allocate from percpu_cluster.

As i suggested above, I think I could do this by removing a cpu's current
cluster for a given order from the percpu_cluster and making it generally
available for scanning. Does that work for you?

> 
>> +	} else {
>> +		idx = offset / SWAPFILE_CLUSTER;
>> +		ci = lock_cluster(si, offset);
>> +	}
>> +
>> +	memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
>>  	unlock_cluster(ci);
>> -	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
>> +	swap_range_alloc(si, offset, nr_pages);
>>  	*slot = swp_entry(si->type, offset);
>>
>> +	offset += nr_pages;
>> +	if (idx != offset / SWAPFILE_CLUSTER)
>> +		offset = UINT_MAX;
>> +	cluster->large_next[order_idx] = offset;
>> +
>>  	return 1;
>>  }
>>
> 
> [snip]
> 
> --
> Best Regards,
> Huang, Ying



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 2/2] mm: swap: Swap-out small-sized THP without splitting
  2023-10-18 14:07     ` Ryan Roberts
@ 2023-10-19  5:49       ` Huang, Ying
  2023-10-19 14:25         ` Ryan Roberts
  0 siblings, 1 reply; 8+ messages in thread
From: Huang, Ying @ 2023-10-19  5:49 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, linux-kernel,
	linux-mm, Tim Chen

Ryan Roberts <ryan.roberts@arm.com> writes:

> On 18/10/2023 07:55, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>

[snip]

>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index a073366a227c..35cbbe6509a9 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -268,6 +268,12 @@ struct swap_cluster_info {
>>>  struct percpu_cluster {
>>>  	struct swap_cluster_info index; /* Current cluster index */
>>>  	unsigned int next; /* Likely next allocation offset */
>>> +	unsigned int large_next[];	/*
>>> +					 * next free offset within current
>>> +					 * allocation cluster for large folios,
>>> +					 * or UINT_MAX if no current cluster.
>>> +					 * Index is (order - 1).
>>> +					 */
>>>  };
>>>
>>>  struct swap_cluster_list {
>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>> index b83ad77e04c0..625964e53c22 100644
>>> --- a/mm/swapfile.c
>>> +++ b/mm/swapfile.c
>>> @@ -987,35 +987,70 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>  	return n_ret;
>>>  }
>>>
>>> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
>>> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
>>> +			    unsigned int nr_pages)
>> 
>> This looks hacky.  IMO, we should put the allocation logic inside
>> percpu_cluster framework.  If percpu_cluster framework doesn't work for
>> you, just refactor it firstly.
>
> I'm not sure I really understand what you are suggesting - could you elaborate?
> What "framework"? I only see a per-cpu data structure and
> scan_swap_map_try_ssd_cluster(), which is very much geared towards order-0
> allocations.

I suggest to share as much code as possible between order-0 and order >
0 swap entry allocation.  I think that we can make
scan_swap_map_try_ssd_cluster() works for order > 0 swap entry allocation.

> Are you suggesting you want to allocate large entries (> order-0) from the same
> cluster that is used for small (order-0) entries? The problem with this approach
> is that there may not be enough space left in the current cluster for the large
> entry that you are trying to allocate. Then you would need to take a new cluster
> from the free list and presumably leave the previous cluster with unused entries
> (which will now only be accessible by scanning). That unused space could be
> considerable.
>
> That's why I am currently reserving a "current cluster" per order; that way, all
> allocations are the same order, they are all naturally aligned and there is no
> waste.

I am fine to use one swap cluster per order per CPU.  I just think that
we should share code.

> Perhaps I could implement "stealing" between cpus so that a cpu trying to
> allocate a specific order, but which doesn't have a current cluster for that
> order and the free list is empty, could allocate from another cpu's current
> cluster? I don't think it's a good idea to mix different orders in the same
> cluster though.

I think we can start from a simple solution, that is, just fall back to
split the large folio.  Then, we can optimize it step by step.

> I guess if really low, I could remove a current cluster from a cpu and allow it
> to be scanned for order-0 allocations at least?

Better to have same behavior between order- and order > 0.  Perhaps
begin with the current solution (allow swap entries in per-CPU cluster
to be stolen).  Then we can optimize based on this.

Not directly related to this patchset.  Maybe we can replace
swap_slots_cache with per-CPU cluster in the future.  This will reduce
the code complexity.

> Any opinions gratefully received!

Thanks!

>> 
>>>  {
>>> +	int order_idx;
>>>  	unsigned long idx;
>>>  	struct swap_cluster_info *ci;
>>> +	struct percpu_cluster *cluster;
>>>  	unsigned long offset;
>>>
>>>  	/*
>>>  	 * Should not even be attempting cluster allocations when huge
>>>  	 * page swap is disabled.  Warn and fail the allocation.
>>>  	 */
>>> -	if (!IS_ENABLED(CONFIG_THP_SWAP)) {
>>> +	if (!IS_ENABLED(CONFIG_THP_SWAP) ||
>>> +	    nr_pages < 4 || nr_pages > SWAPFILE_CLUSTER ||
>>> +	    !is_power_of_2(nr_pages)) {
>>>  		VM_WARN_ON_ONCE(1);
>>>  		return 0;
>>>  	}
>>>
>>> -	if (cluster_list_empty(&si->free_clusters))
>>> +	/*
>>> +	 * Not using clusters so unable to allocate large entries.
>>> +	 */
>>> +	if (!si->cluster_info)
>>>  		return 0;
>>>
>>> -	idx = cluster_list_first(&si->free_clusters);
>>> -	offset = idx * SWAPFILE_CLUSTER;
>>> -	ci = lock_cluster(si, offset);
>>> -	alloc_cluster(si, idx);
>>> -	cluster_set_count(ci, SWAPFILE_CLUSTER);
>>> +	order_idx = ilog2(nr_pages) - 2;
>>> +	cluster = this_cpu_ptr(si->percpu_cluster);
>>> +	offset = cluster->large_next[order_idx];
>>> +
>>> +	if (offset == UINT_MAX) {
>>> +		if (cluster_list_empty(&si->free_clusters))
>>> +			return 0;
>>> +
>>> +		idx = cluster_list_first(&si->free_clusters);
>>> +		offset = idx * SWAPFILE_CLUSTER;
>>>
>>> -	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
>>> +		ci = lock_cluster(si, offset);
>>> +		alloc_cluster(si, idx);
>>> +		cluster_set_count(ci, SWAPFILE_CLUSTER);
>>> +
>>> +		/*
>>> +		 * If scan_swap_map_slots() can't find a free cluster, it will
>>> +		 * check si->swap_map directly. To make sure this standby
>>> +		 * cluster isn't taken by scan_swap_map_slots(), mark the swap
>>> +		 * entries bad (occupied). (same approach as discard).
>>> +		 */
>>> +		memset(si->swap_map + offset + nr_pages, SWAP_MAP_BAD,
>>> +			SWAPFILE_CLUSTER - nr_pages);
>> 
>> There's an issue with this solution.  If the free space of swap device
>> runs low, it's possible that
>> 
>> - some cluster are put in the percpu_cluster of some CPUs
>>   the swap entries there are marked as used
>> 
>> - no free swap entries elsewhere
>> 
>> - nr_swap_pages isn't 0
>> 
>> So, we will still scan LRU, but swap allocation fails, although there's
>> still free swap space.
>
> Ahh yes, good spot.
>
>> 
>> I think that we should follow the method we used for the original
>> percpu_cluster.  That is, if all free swap entries are in
>> percpu_cluster, we will start to allocate from percpu_cluster.
>
> As i suggested above, I think I could do this by removing a cpu's current
> cluster for a given order from the percpu_cluster and making it generally
> available for scanning. Does that work for you?

replied above.

>> 
>>> +	} else {
>>> +		idx = offset / SWAPFILE_CLUSTER;
>>> +		ci = lock_cluster(si, offset);
>>> +	}
>>> +
>>> +	memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
>>>  	unlock_cluster(ci);
>>> -	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
>>> +	swap_range_alloc(si, offset, nr_pages);
>>>  	*slot = swp_entry(si->type, offset);
>>>
>>> +	offset += nr_pages;
>>> +	if (idx != offset / SWAPFILE_CLUSTER)
>>> +		offset = UINT_MAX;
>>> +	cluster->large_next[order_idx] = offset;
>>> +
>>>  	return 1;
>>>  }
>>>
>> 
>> [snip]

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 2/2] mm: swap: Swap-out small-sized THP without splitting
  2023-10-19  5:49       ` Huang, Ying
@ 2023-10-19 14:25         ` Ryan Roberts
  2023-10-20  3:12           ` Huang, Ying
  0 siblings, 1 reply; 8+ messages in thread
From: Ryan Roberts @ 2023-10-19 14:25 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, linux-kernel,
	linux-mm, Tim Chen

On 19/10/2023 06:49, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> On 18/10/2023 07:55, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
> 
> [snip]
> 
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index a073366a227c..35cbbe6509a9 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -268,6 +268,12 @@ struct swap_cluster_info {
>>>>  struct percpu_cluster {
>>>>  	struct swap_cluster_info index; /* Current cluster index */
>>>>  	unsigned int next; /* Likely next allocation offset */
>>>> +	unsigned int large_next[];	/*
>>>> +					 * next free offset within current
>>>> +					 * allocation cluster for large folios,
>>>> +					 * or UINT_MAX if no current cluster.
>>>> +					 * Index is (order - 1).
>>>> +					 */
>>>>  };
>>>>
>>>>  struct swap_cluster_list {
>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>> index b83ad77e04c0..625964e53c22 100644
>>>> --- a/mm/swapfile.c
>>>> +++ b/mm/swapfile.c
>>>> @@ -987,35 +987,70 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>>  	return n_ret;
>>>>  }
>>>>
>>>> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
>>>> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
>>>> +			    unsigned int nr_pages)
>>>
>>> This looks hacky.  IMO, we should put the allocation logic inside
>>> percpu_cluster framework.  If percpu_cluster framework doesn't work for
>>> you, just refactor it firstly.
>>
>> I'm not sure I really understand what you are suggesting - could you elaborate?
>> What "framework"? I only see a per-cpu data structure and
>> scan_swap_map_try_ssd_cluster(), which is very much geared towards order-0
>> allocations.
> 
> I suggest to share as much code as possible between order-0 and order >
> 0 swap entry allocation.  I think that we can make
> scan_swap_map_try_ssd_cluster() works for order > 0 swap entry allocation.
> 

[...]

>>>> +		/*
>>>> +		 * If scan_swap_map_slots() can't find a free cluster, it will
>>>> +		 * check si->swap_map directly. To make sure this standby
>>>> +		 * cluster isn't taken by scan_swap_map_slots(), mark the swap
>>>> +		 * entries bad (occupied). (same approach as discard).
>>>> +		 */
>>>> +		memset(si->swap_map + offset + nr_pages, SWAP_MAP_BAD,
>>>> +			SWAPFILE_CLUSTER - nr_pages);
>>>
>>> There's an issue with this solution.  If the free space of swap device
>>> runs low, it's possible that
>>>
>>> - some cluster are put in the percpu_cluster of some CPUs
>>>   the swap entries there are marked as used
>>>
>>> - no free swap entries elsewhere
>>>
>>> - nr_swap_pages isn't 0
>>>
>>> So, we will still scan LRU, but swap allocation fails, although there's
>>> still free swap space.

I'd like to decide how best to solve this problem before I can figure out how
much code sharing I can do for the order-0 vs order > 0 allocators.

I see a couple of potential options:

1) Manipulate nr_swap_pages to include the entries that are marked SWAP_MAP_BAD,
so when reserving a new per-order/per-cpu cluster, subtract SWAPFILE_CLUSTER,
and then add nr_pages for each allocation from that cluster.

2) Don't mark the entries in the reserved cluster as SWAP_MAP_BAD, which would
allow the scanner to steal (order-0) entries. The scanner could set a flag in
the cluster info to mark it as having been allocated from by the scanner, so the
next attempt to allocate a high order from it would cause discarding it as the
cpu's current cluster and trying to get a fresh cluster from the free list.

While option 2 is a bit more complex, I prefer it as a solution. What do you think?


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 2/2] mm: swap: Swap-out small-sized THP without splitting
  2023-10-19 14:25         ` Ryan Roberts
@ 2023-10-20  3:12           ` Huang, Ying
  0 siblings, 0 replies; 8+ messages in thread
From: Huang, Ying @ 2023-10-20  3:12 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, linux-kernel,
	linux-mm, Tim Chen

Ryan Roberts <ryan.roberts@arm.com> writes:

> On 19/10/2023 06:49, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> On 18/10/2023 07:55, Huang, Ying wrote:
>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>
>> 
>> [snip]
>> 
>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>>> index a073366a227c..35cbbe6509a9 100644
>>>>> --- a/include/linux/swap.h
>>>>> +++ b/include/linux/swap.h
>>>>> @@ -268,6 +268,12 @@ struct swap_cluster_info {
>>>>>  struct percpu_cluster {
>>>>>  	struct swap_cluster_info index; /* Current cluster index */
>>>>>  	unsigned int next; /* Likely next allocation offset */
>>>>> +	unsigned int large_next[];	/*
>>>>> +					 * next free offset within current
>>>>> +					 * allocation cluster for large folios,
>>>>> +					 * or UINT_MAX if no current cluster.
>>>>> +					 * Index is (order - 1).
>>>>> +					 */
>>>>>  };
>>>>>
>>>>>  struct swap_cluster_list {
>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>>> index b83ad77e04c0..625964e53c22 100644
>>>>> --- a/mm/swapfile.c
>>>>> +++ b/mm/swapfile.c
>>>>> @@ -987,35 +987,70 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>>>  	return n_ret;
>>>>>  }
>>>>>
>>>>> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
>>>>> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
>>>>> +			    unsigned int nr_pages)
>>>>
>>>> This looks hacky.  IMO, we should put the allocation logic inside
>>>> percpu_cluster framework.  If percpu_cluster framework doesn't work for
>>>> you, just refactor it firstly.
>>>
>>> I'm not sure I really understand what you are suggesting - could you elaborate?
>>> What "framework"? I only see a per-cpu data structure and
>>> scan_swap_map_try_ssd_cluster(), which is very much geared towards order-0
>>> allocations.
>> 
>> I suggest to share as much code as possible between order-0 and order >
>> 0 swap entry allocation.  I think that we can make
>> scan_swap_map_try_ssd_cluster() works for order > 0 swap entry allocation.
>> 
>
> [...]
>
>>>>> +		/*
>>>>> +		 * If scan_swap_map_slots() can't find a free cluster, it will
>>>>> +		 * check si->swap_map directly. To make sure this standby
>>>>> +		 * cluster isn't taken by scan_swap_map_slots(), mark the swap
>>>>> +		 * entries bad (occupied). (same approach as discard).
>>>>> +		 */
>>>>> +		memset(si->swap_map + offset + nr_pages, SWAP_MAP_BAD,
>>>>> +			SWAPFILE_CLUSTER - nr_pages);
>>>>
>>>> There's an issue with this solution.  If the free space of swap device
>>>> runs low, it's possible that
>>>>
>>>> - some cluster are put in the percpu_cluster of some CPUs
>>>>   the swap entries there are marked as used
>>>>
>>>> - no free swap entries elsewhere
>>>>
>>>> - nr_swap_pages isn't 0
>>>>
>>>> So, we will still scan LRU, but swap allocation fails, although there's
>>>> still free swap space.
>
> I'd like to decide how best to solve this problem before I can figure out how
> much code sharing I can do for the order-0 vs order > 0 allocators.
>
> I see a couple of potential options:
>
> 1) Manipulate nr_swap_pages to include the entries that are marked SWAP_MAP_BAD,
> so when reserving a new per-order/per-cpu cluster, subtract SWAPFILE_CLUSTER,
> and then add nr_pages for each allocation from that cluster.
>
> 2) Don't mark the entries in the reserved cluster as SWAP_MAP_BAD, which would
> allow the scanner to steal (order-0) entries. The scanner could set a flag in
> the cluster info to mark it as having been allocated from by the scanner, so the
> next attempt to allocate a high order from it would cause discarding it as the
> cpu's current cluster and trying to get a fresh cluster from the free list.
>
> While option 2 is a bit more complex, I prefer it as a solution. What do you think?

I think that this is a good choice to start with.  We may build more
optimization on top of it if necessary.

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-10-20  3:14 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-17 16:13 [PATCH v2 0/2] Swap-out small-sized THP without splitting Ryan Roberts
2023-10-17 16:13 ` [PATCH v2 1/2] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
2023-10-17 16:13 ` [PATCH v2 2/2] mm: swap: Swap-out small-sized THP without splitting Ryan Roberts
2023-10-18  6:55   ` Huang, Ying
2023-10-18 14:07     ` Ryan Roberts
2023-10-19  5:49       ` Huang, Ying
2023-10-19 14:25         ` Ryan Roberts
2023-10-20  3:12           ` Huang, Ying

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).