* [RFC PATCH v1 0/2] Swap-out small-sized THP without splitting
@ 2023-10-10 14:21 Ryan Roberts
  2023-10-10 14:21 ` [RFC PATCH v1 1/2] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Ryan Roberts @ 2023-10-10 14:21 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko
  Cc: Ryan Roberts, linux-kernel, linux-mm

Hi All,

This is an RFC for a small series to add support for swapping out small-sized
THP without needing to first split the large folio via __split_huge_page(). It
closely follows the approach already used by PMD-sized THP.

"Small-sized THP" is an upcoming feature that enables performance improvements
by allocating large folios for anonymous memory, where the large folio size is
smaller than the traditional PMD-size. See [1].

In some circumstances I've observed a performance regression (see patch 2 for
details), and this series is an attempt to fix the regression in advance of
merging small-sized THP support.

I've done what I thought was the smallest change possible, and as a result, this
approach is only employed when the swap is backed by a non-rotating block device
(just as PMD-sized THP is supported today). However, I have a few questions on
whether we should consider relaxing those requirements in certain circumstances:


1) block-backed vs file-backed
==============================

The code only attempts to allocate a contiguous set of entries if swap is backed
by a block device (i.e. not file-backed). The original commit, f0eea189e8e9
("mm, THP, swap: don't allocate huge cluster for file backed swap device"),
stated "It's hard to write a whole transparent huge page (THP) to a file backed
swap device". But didn't state why. Does this imply there is a size limit at
which it becomes hard? And does that therefore imply that for "small enough"
sizes we should now allow use with file-back swap?

This original commit was subsequently fixed with commit 41663430588c ("mm, THP,
swap: fix allocating cluster for swapfile by mistake"), which said the original
commit was using the wrong flag to determine if it was a block device and
therefore in some cases was actually doing large allocations for a file-backed
swap device, and this was causing file-system corruption. But that implies some
sort of correctness issue to me, rather than the performance issue I inferred
from the original commit.

If anyone can offer an explanation, that would be helpful in determining if we
should allow some large sizes for file-backed swap.


2) rotating vs non-rotating
===========================

I notice that the clustered approach is only used for non-rotating swap. That
implies that for rotating media, we will always fail a large allocation and
fall back to splitting THPs to single pages, which in turn implies that the
regression I'm fixing here may still be present on rotating media? Or perhaps
rotating disks are so slow that the cost of writing the data out dominates the
cost of splitting?

I considered that the free swap entry search algorithm used in this case could
potentially be modified to look for (small) contiguous runs of entries; up to
~16 pages (order-4) could be checked by doing 2x 64-bit reads of the swap_map
instead of single-byte reads.

I haven't looked into this idea in detail, but wonder if anybody thinks it is
worth the effort? Or perhaps it would end up causing bad fragmentation.
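
For illustration, the core of that check could look something like this
minimal, untested sketch (function name made up; assumes the usual swap_map
convention where 0 means a free entry, and a suitably aligned offset):

/*
 * Check whether 16 swap entries starting at a 16-entry-aligned offset
 * are all free, using two 64-bit loads of the byte-per-entry swap_map
 * instead of 16 single-byte reads.
 */
static inline bool swap_map_16_free(const unsigned char *swap_map,
				    unsigned long offset)
{
	const u64 *map = (const u64 *)(swap_map + offset);

	return map[0] == 0 && map[1] == 0;
}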


Finally on testing, I've run the mm selftests and see no regressions, but I
don't think there is anything in there specifically aimed towards swap? Are
there any functional or performance tests that I should run? It would certainly
be good to confirm I haven't regressed PMD-size THP swap performance.

Thanks,
Ryan

[1] https://lore.kernel.org/linux-mm/15a52c3d-9584-449b-8228-1335e0753b04@arm.com/

Ryan Roberts (2):
  mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  mm: swap: Swap-out small-sized THP without splitting

 include/linux/swap.h |  17 +++----
 mm/huge_memory.c     |   3 --
 mm/swapfile.c        | 105 ++++++++++++++++++++++---------------------
 mm/vmscan.c          |  10 +++--
 4 files changed, 66 insertions(+), 69 deletions(-)

--
2.25.1




* [RFC PATCH v1 1/2] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2023-10-10 14:21 [RFC PATCH v1 0/2] Swap-out small-sized THP without splitting Ryan Roberts
@ 2023-10-10 14:21 ` Ryan Roberts
  2023-10-11  7:43   ` Huang, Ying
  2023-10-11  8:17   ` Kefeng Wang
  2023-10-10 14:21 ` [RFC PATCH v1 2/2] mm: swap: Swap-out small-sized THP without splitting Ryan Roberts
  2023-10-11  6:37 ` [RFC PATCH v1 0/2] " Huang, Ying
  2 siblings, 2 replies; 17+ messages in thread
From: Ryan Roberts @ 2023-10-10 14:21 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko
  Cc: Ryan Roberts, linux-kernel, linux-mm

As preparation for supporting small-sized THP in the swap-out path,
without first needing to split to order-0, remove CLUSTER_FLAG_HUGE,
which, when present, always implies a PMD-sized THP, whose size is the
same as the cluster size.

The only use of the flag was to determine whether a swap entry refers to
a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
Instead of relying on the flag, we now pass in nr_pages, which
originates from the folio's number of pages. This allows the logic to
work for folios of any order.

The one snag is that one of the swap_page_trans_huge_swapped() call
sites does not have the folio. But it was only being called there to
avoid bothering to call __try_to_reclaim_swap() in some cases.
__try_to_reclaim_swap() gets the folio and (via some other functions)
calls swap_page_trans_huge_swapped(). So I've removed the problematic
call site and believe the new logic should be equivalent.

Removing CLUSTER_FLAG_HUGE also means we can remove split_swap_cluster()
which used to be called during folio splitting, since
split_swap_cluster()'s only job was to remove the flag.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/swap.h | 10 ----------
 mm/huge_memory.c     |  3 ---
 mm/swapfile.c        | 47 ++++++++------------------------------------
 3 files changed, 8 insertions(+), 52 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 19f30a29e1f1..a073366a227c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -259,7 +259,6 @@ struct swap_cluster_info {
 };
 #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
 #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
-#define CLUSTER_FLAG_HUGE 4 /* This cluster is backing a transparent huge page */

 /*
  * We assign a cluster to each CPU, so each CPU can allocate swap entry from
@@ -595,15 +594,6 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
 }
 #endif /* CONFIG_SWAP */

-#ifdef CONFIG_THP_SWAP
-extern int split_swap_cluster(swp_entry_t entry);
-#else
-static inline int split_swap_cluster(swp_entry_t entry)
-{
-	return 0;
-}
-#endif
-
 #ifdef CONFIG_MEMCG
 static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c9cbcbf6697e..46b3fb943207 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2597,9 +2597,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		shmem_uncharge(head->mapping->host, nr_dropped);
 	remap_page(folio, nr);

-	if (folio_test_swapcache(folio))
-		split_swap_cluster(folio->swap);
-
 	for (i = 0; i < nr; i++) {
 		struct page *subpage = head + i;
 		if (subpage == page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e52f486834eb..c668838fa660 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -342,18 +342,6 @@ static inline void cluster_set_null(struct swap_cluster_info *info)
 	info->data = 0;
 }

-static inline bool cluster_is_huge(struct swap_cluster_info *info)
-{
-	if (IS_ENABLED(CONFIG_THP_SWAP))
-		return info->flags & CLUSTER_FLAG_HUGE;
-	return false;
-}
-
-static inline void cluster_clear_huge(struct swap_cluster_info *info)
-{
-	info->flags &= ~CLUSTER_FLAG_HUGE;
-}
-
 static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
 						     unsigned long offset)
 {
@@ -1021,7 +1009,7 @@ static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
 	offset = idx * SWAPFILE_CLUSTER;
 	ci = lock_cluster(si, offset);
 	alloc_cluster(si, idx);
-	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, CLUSTER_FLAG_HUGE);
+	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, 0);

 	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
 	unlock_cluster(ci);
@@ -1354,7 +1342,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)

 	ci = lock_cluster_or_swap_info(si, offset);
 	if (size == SWAPFILE_CLUSTER) {
-		VM_BUG_ON(!cluster_is_huge(ci));
 		map = si->swap_map + offset;
 		for (i = 0; i < SWAPFILE_CLUSTER; i++) {
 			val = map[i];
@@ -1362,7 +1349,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 			if (val == SWAP_HAS_CACHE)
 				free_entries++;
 		}
-		cluster_clear_huge(ci);
 		if (free_entries == SWAPFILE_CLUSTER) {
 			unlock_cluster_or_swap_info(si, ci);
 			spin_lock(&si->lock);
@@ -1384,23 +1370,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	unlock_cluster_or_swap_info(si, ci);
 }

-#ifdef CONFIG_THP_SWAP
-int split_swap_cluster(swp_entry_t entry)
-{
-	struct swap_info_struct *si;
-	struct swap_cluster_info *ci;
-	unsigned long offset = swp_offset(entry);
-
-	si = _swap_info_get(entry);
-	if (!si)
-		return -EBUSY;
-	ci = lock_cluster(si, offset);
-	cluster_clear_huge(ci);
-	unlock_cluster(ci);
-	return 0;
-}
-#endif
-
 static int swp_entry_cmp(const void *ent1, const void *ent2)
 {
 	const swp_entry_t *e1 = ent1, *e2 = ent2;
@@ -1508,22 +1477,23 @@ int swp_swapcount(swp_entry_t entry)
 }

 static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
-					 swp_entry_t entry)
+					 swp_entry_t entry,
+					 unsigned int nr_pages)
 {
 	struct swap_cluster_info *ci;
 	unsigned char *map = si->swap_map;
 	unsigned long roffset = swp_offset(entry);
-	unsigned long offset = round_down(roffset, SWAPFILE_CLUSTER);
+	unsigned long offset = round_down(roffset, nr_pages);
 	int i;
 	bool ret = false;

 	ci = lock_cluster_or_swap_info(si, offset);
-	if (!ci || !cluster_is_huge(ci)) {
+	if (!ci || nr_pages == 1) {
 		if (swap_count(map[roffset]))
 			ret = true;
 		goto unlock_out;
 	}
-	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+	for (i = 0; i < nr_pages; i++) {
 		if (swap_count(map[offset + i])) {
 			ret = true;
 			break;
@@ -1545,7 +1515,7 @@ static bool folio_swapped(struct folio *folio)
 	if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio)))
 		return swap_swapcount(si, entry) != 0;

-	return swap_page_trans_huge_swapped(si, entry);
+	return swap_page_trans_huge_swapped(si, entry, folio_nr_pages(folio));
 }

 /**
@@ -1606,8 +1576,7 @@ int free_swap_and_cache(swp_entry_t entry)
 	p = _swap_info_get(entry);
 	if (p) {
 		count = __swap_entry_free(p, entry);
-		if (count == SWAP_HAS_CACHE &&
-		    !swap_page_trans_huge_swapped(p, entry))
+		if (count == SWAP_HAS_CACHE)
 			__try_to_reclaim_swap(p, swp_offset(entry),
 					      TTRS_UNMAPPED | TTRS_FULL);
 	}
--
2.25.1




* [RFC PATCH v1 2/2] mm: swap: Swap-out small-sized THP without splitting
  2023-10-10 14:21 [RFC PATCH v1 0/2] Swap-out small-sized THP without splitting Ryan Roberts
  2023-10-10 14:21 ` [RFC PATCH v1 1/2] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
@ 2023-10-10 14:21 ` Ryan Roberts
  2023-10-11  7:44   ` Ryan Roberts
  2023-10-11  8:25   ` Huang, Ying
  2023-10-11  6:37 ` [RFC PATCH v1 0/2] " Huang, Ying
  2 siblings, 2 replies; 17+ messages in thread
From: Ryan Roberts @ 2023-10-10 14:21 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko
  Cc: Ryan Roberts, linux-kernel, linux-mm

The upcoming anonymous small-sized THP feature enables performance
improvements by allocating large folios for anonymous memory. However,
I've observed that on an arm64 system running a parallel workload (e.g.
kernel compilation) across many cores, under high memory pressure, the
speed regresses. This is due to bottlenecking on the increased number of
TLBIs caused by all the extra folio splitting.

Therefore, solve this regression by adding support for swapping out
small-sized THP without needing to split the folio, just like is already
done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
enabled, and when the swap backing store is a non-rotating block device
- these are the same constraints as for the existing PMD-sized THP
swap-out support.

Note that no attempt is made to swap-in THP here - this is still done
page-by-page, like for PMD-sized THP.

The main change here is to improve the swap entry allocator so that it
can allocate any power-of-2 number of contiguous entries between [4, (1
<< PMD_ORDER)]. This is done by allocating a cluster for each distinct
order and allocating sequentially from it until the cluster is full.
This ensures that we don't need to search the map and we get no
fragmentation due to alignment padding for different orders in the
cluster. If there is no current cluster for a given order, we attempt to
allocate a free cluster from the list. If there are no free clusters, we
fail the allocation and the caller falls back to splitting the folio and
allocating individual entries (as per the existing PMD-sized THP fallback).

As far as I can tell, this should not cause any extra fragmentation
concerns, given how similar it is to the existing PMD-sized THP
allocation mechanism. There will be up to (PMD_ORDER-1) clusters in
concurrent use though; in practice, the number of orders in use will be
small.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/swap.h |  7 ++++++
 mm/swapfile.c        | 60 +++++++++++++++++++++++++++++++++-----------
 mm/vmscan.c          | 10 +++++---
 3 files changed, 59 insertions(+), 18 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a073366a227c..fc55b760aeff 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -320,6 +320,13 @@ struct swap_info_struct {
 					 */
 	struct work_struct discard_work; /* discard worker */
 	struct swap_cluster_list discard_clusters; /* discard clusters list */
+	unsigned int large_next[PMD_ORDER]; /*
+					     * next free offset within current
+					     * allocation cluster for large
+					     * folios, or UINT_MAX if no current
+					     * cluster. Index is (order - 1).
+					     * Only when cluster_info is used.
+					     */
 	struct plist_node avail_lists[]; /*
 					   * entries in swap_avail_heads, one
 					   * entry per node.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index c668838fa660..f8093dedc866 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -987,8 +987,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	return n_ret;
 }

-static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
+static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
+			    unsigned int nr_pages)
 {
+	int order;
 	unsigned long idx;
 	struct swap_cluster_info *ci;
 	unsigned long offset;
@@ -1002,20 +1004,47 @@ static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
 		return 0;
 	}

-	if (cluster_list_empty(&si->free_clusters))
-		return 0;
+	VM_WARN_ON(nr_pages < 2);
+	VM_WARN_ON(nr_pages > SWAPFILE_CLUSTER);
+	VM_WARN_ON(!is_power_of_2(nr_pages));

-	idx = cluster_list_first(&si->free_clusters);
-	offset = idx * SWAPFILE_CLUSTER;
-	ci = lock_cluster(si, offset);
-	alloc_cluster(si, idx);
-	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, 0);
+	order = ilog2(nr_pages);
+	offset = si->large_next[order - 1];
+
+	if (offset == UINT_MAX) {
+		if (cluster_list_empty(&si->free_clusters))
+			return 0;

-	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
+		idx = cluster_list_first(&si->free_clusters);
+		offset = idx * SWAPFILE_CLUSTER;
+
+		ci = lock_cluster(si, offset);
+		alloc_cluster(si, idx);
+		cluster_set_count_flag(ci, SWAPFILE_CLUSTER, 0);
+
+		/*
+		 * If scan_swap_map_slots() can't find a free cluster, it will
+		 * check si->swap_map directly. To make sure this standby
+		 * cluster isn't taken by scan_swap_map_slots(), mark the swap
+		 * entries bad (occupied). (same approach as discard).
+		 */
+		memset(si->swap_map + offset + nr_pages, SWAP_MAP_BAD,
+			SWAPFILE_CLUSTER - nr_pages);
+	} else {
+		idx = offset / SWAPFILE_CLUSTER;
+		ci = lock_cluster(si, offset);
+	}
+
+	memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
 	unlock_cluster(ci);
-	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
+	swap_range_alloc(si, offset, nr_pages);
 	*slot = swp_entry(si->type, offset);

+	offset += nr_pages;
+	if (idx != offset / SWAPFILE_CLUSTER)
+		offset = UINT_MAX;
+	si->large_next[order - 1] = offset;
+
 	return 1;
 }

@@ -1041,7 +1070,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 	int node;

 	/* Only single cluster request supported */
-	WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
+	WARN_ON_ONCE(n_goal > 1 && size > 1);

 	spin_lock(&swap_avail_lock);

@@ -1078,14 +1107,14 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 			spin_unlock(&si->lock);
 			goto nextsi;
 		}
-		if (size == SWAPFILE_CLUSTER) {
+		if (size > 1) {
 			if (si->flags & SWP_BLKDEV)
-				n_ret = swap_alloc_cluster(si, swp_entries);
+				n_ret = swap_alloc_large(si, swp_entries, size);
 		} else
 			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
 						    n_goal, swp_entries);
 		spin_unlock(&si->lock);
-		if (n_ret || size == SWAPFILE_CLUSTER)
+		if (n_ret || size > 1)
 			goto check_out;
 		cond_resched();

@@ -2725,6 +2754,9 @@ static struct swap_info_struct *alloc_swap_info(void)
 	spin_lock_init(&p->cont_lock);
 	init_completion(&p->comp);

+	for (i = 0; i < ARRAY_SIZE(p->large_next); i++)
+		p->large_next[i] = UINT_MAX;
+
 	return p;
 }

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c16e2b1ea8ae..5984d2ae4547 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 					if (!can_split_folio(folio, NULL))
 						goto activate_locked;
 					/*
-					 * Split folios without a PMD map right
-					 * away. Chances are some or all of the
-					 * tail pages can be freed without IO.
+					 * Split PMD-mappable folios without a
+					 * PMD map right away. Chances are some
+					 * or all of the tail pages can be freed
+					 * without IO.
 					 */
-					if (!folio_entire_mapcount(folio) &&
+					if (folio_test_pmd_mappable(folio) &&
+					    !folio_entire_mapcount(folio) &&
 					    split_folio_to_list(folio,
 								folio_list))
 						goto activate_locked;
--
2.25.1




* Re: [RFC PATCH v1 0/2] Swap-out small-sized THP without splitting
  2023-10-10 14:21 [RFC PATCH v1 0/2] Swap-out small-sized THP without splitting Ryan Roberts
  2023-10-10 14:21 ` [RFC PATCH v1 1/2] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
  2023-10-10 14:21 ` [RFC PATCH v1 2/2] mm: swap: Swap-out small-sized THP without splitting Ryan Roberts
@ 2023-10-11  6:37 ` Huang, Ying
  2023-10-11  7:42   ` Ryan Roberts
  2023-10-13 16:31   ` Ryan Roberts
  2 siblings, 2 replies; 17+ messages in thread
From: Huang, Ying @ 2023-10-11  6:37 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, linux-kernel, linux-mm

Ryan Roberts <ryan.roberts@arm.com> writes:

> Hi All,
>
> This is an RFC for a small series to add support for swapping out small-sized
> THP without needing to first split the large folio via __split_huge_page(). It
> closely follows the approach already used by PMD-sized THP.
>
> "Small-sized THP" is an upcoming feature that enables performance improvements
> by allocating large folios for anonymous memory, where the large folio size is
> smaller than the traditional PMD-size. See [1].
>
> In some circumstances I've observed a performance regression (see patch 2 for
> details), and this series is an attempt to fix the regression in advance of
> merging small-sized THP support.
>
> I've done what I thought was the smallest change possible, and as a result, this
> approach is only employed when the swap is backed by a non-rotating block device
> (just as PMD-sized THP is supported today). However, I have a few questions on
> whether we should consider relaxing those requirements in certain circumstances:
>
>
> 1) block-backed vs file-backed
> ==============================
>
> The code only attempts to allocate a contiguous set of entries if swap is backed
> by a block device (i.e. not file-backed). The original commit, f0eea189e8e9
> ("mm, THP, swap: don't allocate huge cluster for file backed swap device"),
> stated "It's hard to write a whole transparent huge page (THP) to a file backed
> swap device". But didn't state why. Does this imply there is a size limit at
> which it becomes hard? And does that therefore imply that for "small enough"
> sizes we should now allow use with file-back swap?
>
> This original commit was subsequently fixed with commit 41663430588c ("mm, THP,
> swap: fix allocating cluster for swapfile by mistake"), which said the original
> commit was using the wrong flag to determine if it was a block device and
> therefore in some cases was actually doing large allocations for a file-backed
> swap device, and this was causing file-system corruption. But that implies some
> sort of correctness issue to me, rather than the performance issue I inferred
> from the original commit.
>
> If anyone can offer an explanation, that would be helpful in determining if we
> should allow some large sizes for file-backed swap.

Swap uses 'swap extents' (swap_info_struct.swap_extent_root) to map from
swap offset to storage block number.  For block-backed swap, the mapping
is purely linear, so you can use an arbitrarily large page size.  But for
file-backed swap, only PAGE_SIZE alignment is guaranteed.
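
Roughly, as a simplified sketch (illustrative structure and function names,
not the exact mm/swapfile.c code):

/*
 * Each swap extent maps a contiguous run of swap offsets to a
 * contiguous run of backing blocks.
 */
struct swap_extent_sketch {
	unsigned long start_page;	/* first swap offset covered */
	unsigned long nr_pages;		/* number of swap offsets covered */
	unsigned long long start_block;	/* backing block of start_page */
};

/*
 * A block device is described by a single extent covering the whole
 * device, so offset -> block is purely linear and any run of swap
 * offsets is physically contiguous.  A swapfile needs one extent per
 * contiguous run of file blocks, so adjacent swap offsets can land in
 * different extents and hence on discontiguous blocks; only PAGE_SIZE
 * granularity is guaranteed.
 */
static unsigned long long
extent_offset_to_block(const struct swap_extent_sketch *se,
		       unsigned long offset)
{
	return se->start_block + (offset - se->start_page);
}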

> 2) rotating vs non-rotating
> ===========================
>
> I notice that the clustered approach is only used for non-rotating swap. That
> implies that for rotating media, we will always fail a large allocation, and
> fall back to splitting THPs to single pages. Which implies that the regression
> I'm fixing here may still be present on rotating media? Or perhaps rotating disk
> is so slow that the cost of writing the data out dominates the cost of
> splitting?
>
> I considered that potentially the free swap entry search algorithm that is used
> in this case could be modified to look for (small) contiguous runs of entries;
> Up to ~16 pages (order-4) could be done by doing 2x 64bit reads from map instead
> of single byte.
>
> I haven't looked into this idea in detail, but wonder if anybody thinks it is
> worth the effort? Or perhaps it would end up causing bad fragmentation.

I doubt anybody will use rotating storage to back swap now.

> Finally on testing, I've run the mm selftests and see no regressions, but I
> don't think there is anything in there specifically aimed towards swap? Are
> there any functional or performance tests that I should run? It would certainly
> be good to confirm I haven't regressed PMD-size THP swap performance.

I have used the swap sub test case of vm-scalability to test this.

https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/

--
Best Regards,
Huang, Ying



* Re: [RFC PATCH v1 0/2] Swap-out small-sized THP without splitting
  2023-10-11  6:37 ` [RFC PATCH v1 0/2] " Huang, Ying
@ 2023-10-11  7:42   ` Ryan Roberts
  2023-10-13 16:31   ` Ryan Roberts
  1 sibling, 0 replies; 17+ messages in thread
From: Ryan Roberts @ 2023-10-11  7:42 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, linux-kernel, linux-mm

On 11/10/2023 07:37, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> Hi All,
>>
>> This is an RFC for a small series to add support for swapping out small-sized
>> THP without needing to first split the large folio via __split_huge_page(). It
>> closely follows the approach already used by PMD-sized THP.
>>
>> "Small-sized THP" is an upcoming feature that enables performance improvements
>> by allocating large folios for anonymous memory, where the large folio size is
>> smaller than the traditional PMD-size. See [1].
>>
>> In some circumstances I've observed a performance regression (see patch 2 for
>> details), and this series is an attempt to fix the regression in advance of
>> merging small-sized THP support.
>>
>> I've done what I thought was the smallest change possible, and as a result, this
>> approach is only employed when the swap is backed by a non-rotating block device
>> (just as PMD-sized THP is supported today). However, I have a few questions on
>> whether we should consider relaxing those requirements in certain circumstances:
>>
>>
>> 1) block-backed vs file-backed
>> ==============================
>>
>> The code only attempts to allocate a contiguous set of entries if swap is backed
>> by a block device (i.e. not file-backed). The original commit, f0eea189e8e9
>> ("mm, THP, swap: don't allocate huge cluster for file backed swap device"),
>> stated "It's hard to write a whole transparent huge page (THP) to a file backed
>> swap device". But didn't state why. Does this imply there is a size limit at
>> which it becomes hard? And does that therefore imply that for "small enough"
>> sizes we should now allow use with file-back swap?
>>
>> This original commit was subsequently fixed with commit 41663430588c ("mm, THP,
>> swap: fix allocating cluster for swapfile by mistake"), which said the original
>> commit was using the wrong flag to determine if it was a block device and
>> therefore in some cases was actually doing large allocations for a file-backed
>> swap device, and this was causing file-system corruption. But that implies some
>> sort of correctness issue to me, rather than the performance issue I inferred
>> from the original commit.
>>
>> If anyone can offer an explanation, that would be helpful in determining if we
>> should allow some large sizes for file-backed swap.
> 
> swap use 'swap extent' (swap_info_struct.swap_extent_root) to map from
> swap offset to storage block number.  For block-backed swap, the mapping
> is pure linear.  So, you can use arbitrary large page size.  But for
> file-backed swap, only PAGE_SIZE alignment is guaranteed.

Ahh, I see, so it's a correctness issue then. Thanks!


> 
>> 2) rotating vs non-rotating
>> ===========================
>>
>> I notice that the clustered approach is only used for non-rotating swap. That
>> implies that for rotating media, we will always fail a large allocation, and
>> fall back to splitting THPs to single pages. Which implies that the regression
>> I'm fixing here may still be present on rotating media? Or perhaps rotating disk
>> is so slow that the cost of writing the data out dominates the cost of
>> splitting?
>>
>> I considered that potentially the free swap entry search algorithm that is used
>> in this case could be modified to look for (small) contiguous runs of entries;
>> Up to ~16 pages (order-4) could be done by doing 2x 64bit reads from map instead
>> of single byte.
>>
>> I haven't looked into this idea in detail, but wonder if anybody thinks it is
>> worth the effort? Or perhaps it would end up causing bad fragmentation.
> 
> I doubt anybody will use rotating storage to back swap now.

I'm often using a QEMU VM for testing, with an Ubuntu install. The disk
enumerates as rotating storage and the swap device is file-backed. But I guess
the former issue, at least, is down to me setting up QEMU with the incorrect
options.

> 
>> Finally on testing, I've run the mm selftests and see no regressions, but I
>> don't think there is anything in there specifically aimed towards swap? Are
>> there any functional or performance tests that I should run? It would certainly
>> be good to confirm I haven't regressed PMD-size THP swap performance.
> 
> I have used swap sub test case of vm-scalbility to test.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/

Great - I shall take a look!

> 
> --
> Best Regards,
> Huang, Ying




* Re: [RFC PATCH v1 1/2] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2023-10-10 14:21 ` [RFC PATCH v1 1/2] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
@ 2023-10-11  7:43   ` Huang, Ying
  2023-10-11  8:17   ` Kefeng Wang
  1 sibling, 0 replies; 17+ messages in thread
From: Huang, Ying @ 2023-10-11  7:43 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, linux-kernel, linux-mm

Ryan Roberts <ryan.roberts@arm.com> writes:

> As preparation for supporting small-sized THP in the swap-out path,
> without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE,
> which, when present, always implies PMD-sized THP, which is the same as
> the cluster size.
>
> The only use of the flag was to determine whether a swap entry refers to
> a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
> Instead of relying on the flag, we now pass in nr_pages, which
> originates from the folio's number of pages. This allows the logic to
> work for folios of any order.
>
> The one snag is that one of the swap_page_trans_huge_swapped() call
> sites does not have the folio. But it was only being called there to
> avoid bothering to call __try_to_reclaim_swap() in some cases.
> __try_to_reclaim_swap() gets the folio and (via some other functions)
> calls swap_page_trans_huge_swapped(). So I've removed the problematic
> call site and believe the new logic should be equivalent.

I believe this should be OK.  It would be better to compare the performance too.

> Removing CLUSTER_FLAG_HUGE also means we can remove split_swap_cluster()
> which used to be called during folio splitting, since
> split_swap_cluster()'s only job was to remove the flag.
>

--
Best Regards,
Huang, Ying



* Re: [RFC PATCH v1 2/2] mm: swap: Swap-out small-sized THP without splitting
  2023-10-10 14:21 ` [RFC PATCH v1 2/2] mm: swap: Swap-out small-sized THP without splitting Ryan Roberts
@ 2023-10-11  7:44   ` Ryan Roberts
  2023-10-11  8:25   ` Huang, Ying
  1 sibling, 0 replies; 17+ messages in thread
From: Ryan Roberts @ 2023-10-11  7:44 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
	Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko
  Cc: linux-kernel, linux-mm

On 10/10/2023 15:21, Ryan Roberts wrote:
> The upcoming anonymous small-sized THP feature enables performance
> improvements by allocating large folios for anonymous memory. However
> I've observed that on an arm64 system running a parallel workload (e.g.
> kernel compilation) across many cores, under high memory pressure, the
> speed regresses. This is due to bottlenecking on the increased number of
> TLBIs added due to all the extra folio splitting.
> 
> Therefore, solve this regression by adding support for swapping out
> small-sized THP without needing to split the folio, just like is already
> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
> enabled, and when the swap backing store is a non-rotating block device
> - these are the same constraints as for the existing PMD-sized THP
> swap-out support.
> 
> Note that no attempt is made to swap-in THP here - this is still done
> page-by-page, like for PMD-sized THP.
> 
> The main change here is to improve the swap entry allocator so that it
> can allocate any power-of-2 number of contiguous entries between [4, (1
> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
> order and allocating sequentially from it until the cluster is full.
> This ensures that we don't need to search the map and we get no
> fragmentation due to alignment padding for different orders in the
> cluster. If there is no current cluster for a given order, we attempt to
> allocate a free cluster from the list. If there are no free clusters, we
> fail the allocation and the caller falls back to splitting the folio and
> allocates individual entries (as per existing PMD-sized THP fallback).
> 
> As far as I can tell, this should not cause any extra fragmentation
> concerns, given how similar it is to the existing PMD-sized THP
> allocation mechanism. There will be up to (PMD_ORDER-1) clusters in
> concurrent use though. In practice, the number of orders in use will be
> small though.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  include/linux/swap.h |  7 ++++++
>  mm/swapfile.c        | 60 +++++++++++++++++++++++++++++++++-----------
>  mm/vmscan.c          | 10 +++++---
>  3 files changed, 59 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a073366a227c..fc55b760aeff 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -320,6 +320,13 @@ struct swap_info_struct {
>  					 */
>  	struct work_struct discard_work; /* discard worker */
>  	struct swap_cluster_list discard_clusters; /* discard clusters list */
> +	unsigned int large_next[PMD_ORDER]; /*

Oh dear, I've tripped over this twice in a week now... PMD_ORDER is not a
compile-time const on powerpc, so this results in a build failure. The array
would have to be allocated at init time, I guess.
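
Something like this minimal, untested sketch, perhaps (assuming large_next
becomes a pointer in swap_info_struct and is sized at runtime):

static int alloc_large_next(struct swap_info_struct *p)
{
	int i;

	/* PMD_ORDER is only evaluated at runtime here. */
	p->large_next = kmalloc_array(PMD_ORDER, sizeof(*p->large_next),
				      GFP_KERNEL);
	if (!p->large_next)
		return -ENOMEM;

	for (i = 0; i < PMD_ORDER; i++)
		p->large_next[i] = UINT_MAX;

	return 0;
}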


> +					     * next free offset within current
> +					     * allocation cluster for large
> +					     * folios, or UINT_MAX if no current
> +					     * cluster. Index is (order - 1).
> +					     * Only when cluster_info is used.
> +					     */
>  	struct plist_node avail_lists[]; /*
>  					   * entries in swap_avail_heads, one
>  					   * entry per node.
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index c668838fa660..f8093dedc866 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -987,8 +987,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  	return n_ret;
>  }
> 
> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
> +			    unsigned int nr_pages)
>  {
> +	int order;
>  	unsigned long idx;
>  	struct swap_cluster_info *ci;
>  	unsigned long offset;
> @@ -1002,20 +1004,47 @@ static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
>  		return 0;
>  	}
> 
> -	if (cluster_list_empty(&si->free_clusters))
> -		return 0;
> +	VM_WARN_ON(nr_pages < 2);
> +	VM_WARN_ON(nr_pages > SWAPFILE_CLUSTER);
> +	VM_WARN_ON(!is_power_of_2(nr_pages));
> 
> -	idx = cluster_list_first(&si->free_clusters);
> -	offset = idx * SWAPFILE_CLUSTER;
> -	ci = lock_cluster(si, offset);
> -	alloc_cluster(si, idx);
> -	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, 0);
> +	order = ilog2(nr_pages);
> +	offset = si->large_next[order - 1];
> +
> +	if (offset == UINT_MAX) {
> +		if (cluster_list_empty(&si->free_clusters))
> +			return 0;
> 
> -	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
> +		idx = cluster_list_first(&si->free_clusters);
> +		offset = idx * SWAPFILE_CLUSTER;
> +
> +		ci = lock_cluster(si, offset);
> +		alloc_cluster(si, idx);
> +		cluster_set_count_flag(ci, SWAPFILE_CLUSTER, 0);
> +
> +		/*
> +		 * If scan_swap_map_slots() can't find a free cluster, it will
> +		 * check si->swap_map directly. To make sure this standby
> +		 * cluster isn't taken by scan_swap_map_slots(), mark the swap
> +		 * entries bad (occupied). (same approach as discard).
> +		 */
> +		memset(si->swap_map + offset + nr_pages, SWAP_MAP_BAD,
> +			SWAPFILE_CLUSTER - nr_pages);
> +	} else {
> +		idx = offset / SWAPFILE_CLUSTER;
> +		ci = lock_cluster(si, offset);
> +	}
> +
> +	memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
>  	unlock_cluster(ci);
> -	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
> +	swap_range_alloc(si, offset, nr_pages);
>  	*slot = swp_entry(si->type, offset);
> 
> +	offset += nr_pages;
> +	if (idx != offset / SWAPFILE_CLUSTER)
> +		offset = UINT_MAX;
> +	si->large_next[order - 1] = offset;
> +
>  	return 1;
>  }
> 
> @@ -1041,7 +1070,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>  	int node;
> 
>  	/* Only single cluster request supported */
> -	WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
> +	WARN_ON_ONCE(n_goal > 1 && size > 1);
> 
>  	spin_lock(&swap_avail_lock);
> 
> @@ -1078,14 +1107,14 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>  			spin_unlock(&si->lock);
>  			goto nextsi;
>  		}
> -		if (size == SWAPFILE_CLUSTER) {
> +		if (size > 1) {
>  			if (si->flags & SWP_BLKDEV)
> -				n_ret = swap_alloc_cluster(si, swp_entries);
> +				n_ret = swap_alloc_large(si, swp_entries, size);
>  		} else
>  			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
>  						    n_goal, swp_entries);
>  		spin_unlock(&si->lock);
> -		if (n_ret || size == SWAPFILE_CLUSTER)
> +		if (n_ret || size > 1)
>  			goto check_out;
>  		cond_resched();
> 
> @@ -2725,6 +2754,9 @@ static struct swap_info_struct *alloc_swap_info(void)
>  	spin_lock_init(&p->cont_lock);
>  	init_completion(&p->comp);
> 
> +	for (i = 0; i < ARRAY_SIZE(p->large_next); i++)
> +		p->large_next[i] = UINT_MAX;
> +
>  	return p;
>  }
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c16e2b1ea8ae..5984d2ae4547 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  					if (!can_split_folio(folio, NULL))
>  						goto activate_locked;
>  					/*
> -					 * Split folios without a PMD map right
> -					 * away. Chances are some or all of the
> -					 * tail pages can be freed without IO.
> +					 * Split PMD-mappable folios without a
> +					 * PMD map right away. Chances are some
> +					 * or all of the tail pages can be freed
> +					 * without IO.
>  					 */
> -					if (!folio_entire_mapcount(folio) &&
> +					if (folio_test_pmd_mappable(folio) &&
> +					    !folio_entire_mapcount(folio) &&
>  					    split_folio_to_list(folio,
>  								folio_list))
>  						goto activate_locked;
> --
> 2.25.1
> 




* Re: [RFC PATCH v1 1/2] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2023-10-10 14:21 ` [RFC PATCH v1 1/2] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
  2023-10-11  7:43   ` Huang, Ying
@ 2023-10-11  8:17   ` Kefeng Wang
  2023-10-11 10:15     ` Ryan Roberts
  2023-10-11 10:16     ` Ryan Roberts
  1 sibling, 2 replies; 17+ messages in thread
From: Kefeng Wang @ 2023-10-11  8:17 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, David Hildenbrand, Matthew Wilcox,
	Huang Ying, Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko
  Cc: linux-kernel, linux-mm



On 2023/10/10 22:21, Ryan Roberts wrote:
> As preparation for supporting small-sized THP in the swap-out path,
> without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE,
> which, when present, always implies PMD-sized THP, which is the same as
> the cluster size.
> 
> The only use of the flag was to determine whether a swap entry refers to
> a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
> Instead of relying on the flag, we now pass in nr_pages, which
> originates from the folio's number of pages. This allows the logic to
> work for folios of any order.
> 
> The one snag is that one of the swap_page_trans_huge_swapped() call
> sites does not have the folio. But it was only being called there to
> avoid bothering to call __try_to_reclaim_swap() in some cases.
> __try_to_reclaim_swap() gets the folio and (via some other functions)
> calls swap_page_trans_huge_swapped(). So I've removed the problematic
> call site and believe the new logic should be equivalent.
> 
> Removing CLUSTER_FLAG_HUGE also means we can remove split_swap_cluster()
> which used to be called during folio splitting, since
> split_swap_cluster()'s only job was to remove the flag.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   include/linux/swap.h | 10 ----------
>   mm/huge_memory.c     |  3 ---
>   mm/swapfile.c        | 47 ++++++++------------------------------------
>   3 files changed, 8 insertions(+), 52 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 19f30a29e1f1..a073366a227c 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -259,7 +259,6 @@ struct swap_cluster_info {
>   };
>   #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
>   #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
> -#define CLUSTER_FLAG_HUGE 4 /* This cluster is backing a transparent huge page */
> 
>   /*
>    * We assign a cluster to each CPU, so each CPU can allocate swap entry from
> @@ -595,15 +594,6 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
>   }
>   #endif /* CONFIG_SWAP */
> 
> -#ifdef CONFIG_THP_SWAP
> -extern int split_swap_cluster(swp_entry_t entry);
> -#else
> -static inline int split_swap_cluster(swp_entry_t entry)
> -{
> -	return 0;
> -}
> -#endif
> -
>   #ifdef CONFIG_MEMCG
>   static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
>   {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c9cbcbf6697e..46b3fb943207 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2597,9 +2597,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>   		shmem_uncharge(head->mapping->host, nr_dropped);
>   	remap_page(folio, nr);
> 
> -	if (folio_test_swapcache(folio))
> -		split_swap_cluster(folio->swap);
> -
>   	for (i = 0; i < nr; i++) {
>   		struct page *subpage = head + i;
>   		if (subpage == page)
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index e52f486834eb..c668838fa660 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -342,18 +342,6 @@ static inline void cluster_set_null(struct swap_cluster_info *info)
>   	info->data = 0;
>   }
> 
> -static inline bool cluster_is_huge(struct swap_cluster_info *info)
> -{
> -	if (IS_ENABLED(CONFIG_THP_SWAP))
> -		return info->flags & CLUSTER_FLAG_HUGE;
> -	return false;
> -}
> -
> -static inline void cluster_clear_huge(struct swap_cluster_info *info)
> -{
> -	info->flags &= ~CLUSTER_FLAG_HUGE;
> -}
> -
>   static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
>   						     unsigned long offset)
>   {
> @@ -1021,7 +1009,7 @@ static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
>   	offset = idx * SWAPFILE_CLUSTER;
>   	ci = lock_cluster(si, offset);
>   	alloc_cluster(si, idx);
> -	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, CLUSTER_FLAG_HUGE);
> +	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, 0);

Maybe just use cluster_set_count() and kill cluster_set_count_flag().




* Re: [RFC PATCH v1 2/2] mm: swap: Swap-out small-sized THP without splitting
  2023-10-10 14:21 ` [RFC PATCH v1 2/2] mm: swap: Swap-out small-sized THP without splitting Ryan Roberts
  2023-10-11  7:44   ` Ryan Roberts
@ 2023-10-11  8:25   ` Huang, Ying
  2023-10-11 10:36     ` Ryan Roberts
  1 sibling, 1 reply; 17+ messages in thread
From: Huang, Ying @ 2023-10-11  8:25 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, linux-kernel, linux-mm

Ryan Roberts <ryan.roberts@arm.com> writes:

> The upcoming anonymous small-sized THP feature enables performance
> improvements by allocating large folios for anonymous memory. However
> I've observed that on an arm64 system running a parallel workload (e.g.
> kernel compilation) across many cores, under high memory pressure, the
> speed regresses. This is due to bottlenecking on the increased number of
> TLBIs added due to all the extra folio splitting.
>
> Therefore, solve this regression by adding support for swapping out
> small-sized THP without needing to split the folio, just like is already
> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
> enabled, and when the swap backing store is a non-rotating block device
> - these are the same constraints as for the existing PMD-sized THP
> swap-out support.
>
> Note that no attempt is made to swap-in THP here - this is still done
> page-by-page, like for PMD-sized THP.
>
> The main change here is to improve the swap entry allocator so that it
> can allocate any power-of-2 number of contiguous entries between [4, (1
> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
> order and allocating sequentially from it until the cluster is full.
> This ensures that we don't need to search the map and we get no
> fragmentation due to alignment padding for different orders in the
> cluster. If there is no current cluster for a given order, we attempt to
> allocate a free cluster from the list. If there are no free clusters, we
> fail the allocation and the caller falls back to splitting the folio and
> allocates individual entries (as per existing PMD-sized THP fallback).
>
> As far as I can tell, this should not cause any extra fragmentation
> concerns, given how similar it is to the existing PMD-sized THP
> allocation mechanism. There will be up to (PMD_ORDER-1) clusters in
> concurrent use though. In practice, the number of orders in use will be
> small though.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  include/linux/swap.h |  7 ++++++
>  mm/swapfile.c        | 60 +++++++++++++++++++++++++++++++++-----------
>  mm/vmscan.c          | 10 +++++---
>  3 files changed, 59 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a073366a227c..fc55b760aeff 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -320,6 +320,13 @@ struct swap_info_struct {
>  					 */
>  	struct work_struct discard_work; /* discard worker */
>  	struct swap_cluster_list discard_clusters; /* discard clusters list */
> +	unsigned int large_next[PMD_ORDER]; /*
> +					     * next free offset within current
> +					     * allocation cluster for large
> +					     * folios, or UINT_MAX if no current
> +					     * cluster. Index is (order - 1).
> +					     * Only when cluster_info is used.
> +					     */

I think that it is better to make this per-CPU.  That is, extend the
percpu_cluster mechanism.  Otherwise, we may have a scalability issue.

And this should be enclosed in CONFIG_THP_SWAP.
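
Roughly along these lines, perhaps (illustrative sketch only; NR_LARGE_ORDERS
is a made-up name for the number of supported large orders):

struct percpu_cluster {
	struct swap_cluster_info index;	/* current cluster index */
	unsigned int next;		/* likely next allocation offset */
#ifdef CONFIG_THP_SWAP
	unsigned int large_next[NR_LARGE_ORDERS]; /* per-order next offset */
#endif
};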

>  	struct plist_node avail_lists[]; /*
>  					   * entries in swap_avail_heads, one
>  					   * entry per node.

--
Best Regards,
Huang, Ying



* Re: [RFC PATCH v1 1/2] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2023-10-11  8:17   ` Kefeng Wang
@ 2023-10-11 10:15     ` Ryan Roberts
  2023-10-11 10:16     ` Ryan Roberts
  1 sibling, 0 replies; 17+ messages in thread
From: Ryan Roberts @ 2023-10-11 10:15 UTC (permalink / raw)
  To: Kefeng Wang, Andrew Morton, David Hildenbrand, Matthew Wilcox,
	Huang Ying, Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko
  Cc: linux-kernel, linux-mm

On 11/10/2023 09:17, Kefeng Wang wrote:
> 
> 
> On 2023/10/10 22:21, Ryan Roberts wrote:
>> As preparation for supporting small-sized THP in the swap-out path,
>> without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE,
>> which, when present, always implies PMD-sized THP, which is the same as
>> the cluster size.
>>
>> The only use of the flag was to determine whether a swap entry refers to
>> a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
>> Instead of relying on the flag, we now pass in nr_pages, which
>> originates from the folio's number of pages. This allows the logic to
>> work for folios of any order.
>>
>> The one snag is that one of the swap_page_trans_huge_swapped() call
>> sites does not have the folio. But it was only being called there to
>> avoid bothering to call __try_to_reclaim_swap() in some cases.
>> __try_to_reclaim_swap() gets the folio and (via some other functions)
>> calls swap_page_trans_huge_swapped(). So I've removed the problematic
>> call site and believe the new logic should be equivalent.
>>
>> Removing CLUSTER_FLAG_HUGE also means we can remove split_swap_cluster()
>> which used to be called during folio splitting, since
>> split_swap_cluster()'s only job was to remove the flag.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>   include/linux/swap.h | 10 ----------
>>   mm/huge_memory.c     |  3 ---
>>   mm/swapfile.c        | 47 ++++++++------------------------------------
>>   3 files changed, 8 insertions(+), 52 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 19f30a29e1f1..a073366a227c 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -259,7 +259,6 @@ struct swap_cluster_info {
>>   };
>>   #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
>>   #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
>> -#define CLUSTER_FLAG_HUGE 4 /* This cluster is backing a transparent huge
>> page */
>>
>>   /*
>>    * We assign a cluster to each CPU, so each CPU can allocate swap entry from
>> @@ -595,15 +594,6 @@ static inline int add_swap_extent(struct swap_info_struct
>> *sis,
>>   }
>>   #endif /* CONFIG_SWAP */
>>
>> -#ifdef CONFIG_THP_SWAP
>> -extern int split_swap_cluster(swp_entry_t entry);
>> -#else
>> -static inline int split_swap_cluster(swp_entry_t entry)
>> -{
>> -    return 0;
>> -}
>> -#endif
>> -
>>   #ifdef CONFIG_MEMCG
>>   static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
>>   {
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index c9cbcbf6697e..46b3fb943207 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2597,9 +2597,6 @@ static void __split_huge_page(struct page *page, struct
>> list_head *list,
>>           shmem_uncharge(head->mapping->host, nr_dropped);
>>       remap_page(folio, nr);
>>
>> -    if (folio_test_swapcache(folio))
>> -        split_swap_cluster(folio->swap);
>> -
>>       for (i = 0; i < nr; i++) {
>>           struct page *subpage = head + i;
>>           if (subpage == page)
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index e52f486834eb..c668838fa660 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -342,18 +342,6 @@ static inline void cluster_set_null(struct
>> swap_cluster_info *info)
>>       info->data = 0;
>>   }
>>
>> -static inline bool cluster_is_huge(struct swap_cluster_info *info)
>> -{
>> -    if (IS_ENABLED(CONFIG_THP_SWAP))
>> -        return info->flags & CLUSTER_FLAG_HUGE;
>> -    return false;
>> -}
>> -
>> -static inline void cluster_clear_huge(struct swap_cluster_info *info)
>> -{
>> -    info->flags &= ~CLUSTER_FLAG_HUGE;
>> -}
>> -
>>   static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct
>> *si,
>>                                unsigned long offset)
>>   {
>> @@ -1021,7 +1009,7 @@ static int swap_alloc_cluster(struct swap_info_struct
>> *si, swp_entry_t *slot)
>>       offset = idx * SWAPFILE_CLUSTER;
>>       ci = lock_cluster(si, offset);
>>       alloc_cluster(si, idx);
>> -    cluster_set_count_flag(ci, SWAPFILE_CLUSTER, CLUSTER_FLAG_HUGE);
>> +    cluster_set_count_flag(ci, SWAPFILE_CLUSTER, 0);
> 
> Maybe just use cluster_set_count() and kill cluster_set_count_flag().

Yep good point. I'll do this in the next version - thanks!





* Re: [RFC PATCH v1 1/2] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  2023-10-11  8:17   ` Kefeng Wang
  2023-10-11 10:15     ` Ryan Roberts
@ 2023-10-11 10:16     ` Ryan Roberts
  1 sibling, 0 replies; 17+ messages in thread
From: Ryan Roberts @ 2023-10-11 10:16 UTC (permalink / raw)
  To: Kefeng Wang, Andrew Morton, David Hildenbrand, Matthew Wilcox,
	Huang Ying, Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko
  Cc: linux-kernel, linux-mm

On 11/10/2023 09:17, Kefeng Wang wrote:
> 
> 
> On 2023/10/10 22:21, Ryan Roberts wrote:
>> As preparation for supporting small-sized THP in the swap-out path,
>> without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE,
>> which, when present, always implies PMD-sized THP, which is the same as
>> the cluster size.
>>
>> The only use of the flag was to determine whether a swap entry refers to
>> a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
>> Instead of relying on the flag, we now pass in nr_pages, which
>> originates from the folio's number of pages. This allows the logic to
>> work for folios of any order.
>>
>> The one snag is that one of the swap_page_trans_huge_swapped() call
>> sites does not have the folio. But it was only being called there to
>> avoid bothering to call __try_to_reclaim_swap() in some cases.
>> __try_to_reclaim_swap() gets the folio and (via some other functions)
>> calls swap_page_trans_huge_swapped(). So I've removed the problematic
>> call site and believe the new logic should be equivalent.
>>
>> Removing CLUSTER_FLAG_HUGE also means we can remove split_swap_cluster()
>> which used to be called during folio splitting, since
>> split_swap_cluster()'s only job was to remove the flag.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>   include/linux/swap.h | 10 ----------
>>   mm/huge_memory.c     |  3 ---
>>   mm/swapfile.c        | 47 ++++++++------------------------------------
>>   3 files changed, 8 insertions(+), 52 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 19f30a29e1f1..a073366a227c 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -259,7 +259,6 @@ struct swap_cluster_info {
>>   };
>>   #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
>>   #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
>> -#define CLUSTER_FLAG_HUGE 4 /* This cluster is backing a transparent huge
>> page */
>>
>>   /*
>>    * We assign a cluster to each CPU, so each CPU can allocate swap entry from
>> @@ -595,15 +594,6 @@ static inline int add_swap_extent(struct swap_info_struct
>> *sis,
>>   }
>>   #endif /* CONFIG_SWAP */
>>
>> -#ifdef CONFIG_THP_SWAP
>> -extern int split_swap_cluster(swp_entry_t entry);
>> -#else
>> -static inline int split_swap_cluster(swp_entry_t entry)
>> -{
>> -    return 0;
>> -}
>> -#endif
>> -
>>   #ifdef CONFIG_MEMCG
>>   static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
>>   {
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index c9cbcbf6697e..46b3fb943207 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2597,9 +2597,6 @@ static void __split_huge_page(struct page *page, struct
>> list_head *list,
>>           shmem_uncharge(head->mapping->host, nr_dropped);
>>       remap_page(folio, nr);
>>
>> -    if (folio_test_swapcache(folio))
>> -        split_swap_cluster(folio->swap);
>> -
>>       for (i = 0; i < nr; i++) {
>>           struct page *subpage = head + i;
>>           if (subpage == page)
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index e52f486834eb..c668838fa660 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -342,18 +342,6 @@ static inline void cluster_set_null(struct
>> swap_cluster_info *info)
>>       info->data = 0;
>>   }
>>
>> -static inline bool cluster_is_huge(struct swap_cluster_info *info)
>> -{
>> -    if (IS_ENABLED(CONFIG_THP_SWAP))
>> -        return info->flags & CLUSTER_FLAG_HUGE;
>> -    return false;
>> -}
>> -
>> -static inline void cluster_clear_huge(struct swap_cluster_info *info)
>> -{
>> -    info->flags &= ~CLUSTER_FLAG_HUGE;
>> -}
>> -
>>   static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct
>> *si,
>>                                unsigned long offset)
>>   {
>> @@ -1021,7 +1009,7 @@ static int swap_alloc_cluster(struct swap_info_struct
>> *si, swp_entry_t *slot)
>>       offset = idx * SWAPFILE_CLUSTER;
>>       ci = lock_cluster(si, offset);
>>       alloc_cluster(si, idx);
>> -    cluster_set_count_flag(ci, SWAPFILE_CLUSTER, CLUSTER_FLAG_HUGE);
>> +    cluster_set_count_flag(ci, SWAPFILE_CLUSTER, 0);
> 
> Maybe just use cluster_set_count() and kill cluster_set_count_flag().

Yes, good point - I'll do this in the next version. Thanks!
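
To make it concrete, what I have in mind for the next version is roughly the
below (untested sketch; cluster_set_count() is the existing helper that sets
the count without touching the flags):

    /* in swap_alloc_cluster(), since the flag argument is now always 0: */
    cluster_set_count(ci, SWAPFILE_CLUSTER);

and then cluster_set_count_flag() can be deleted, as you suggest, since this
was its last remaining use in the new scheme.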




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v1 2/2] mm: swap: Swap-out small-sized THP without splitting
  2023-10-11  8:25   ` Huang, Ying
@ 2023-10-11 10:36     ` Ryan Roberts
  2023-10-11 17:14       ` Ryan Roberts
  0 siblings, 1 reply; 17+ messages in thread
From: Ryan Roberts @ 2023-10-11 10:36 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, linux-kernel, linux-mm

On 11/10/2023 09:25, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> The upcoming anonymous small-sized THP feature enables performance
>> improvements by allocating large folios for anonymous memory. However
>> I've observed that on an arm64 system running a parallel workload (e.g.
>> kernel compilation) across many cores, under high memory pressure, the
>> speed regresses. This is due to bottlenecking on the increased number of
>> TLBIs added due to all the extra folio splitting.
>>
>> Therefore, solve this regression by adding support for swapping out
>> small-sized THP without needing to split the folio, just like is already
>> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
>> enabled, and when the swap backing store is a non-rotating block device
>> - these are the same constraints as for the existing PMD-sized THP
>> swap-out support.
>>
>> Note that no attempt is made to swap-in THP here - this is still done
>> page-by-page, like for PMD-sized THP.
>>
>> The main change here is to improve the swap entry allocator so that it
>> can allocate any power-of-2 number of contiguous entries between [4, (1
>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>> order and allocating sequentially from it until the cluster is full.
>> This ensures that we don't need to search the map and we get no
>> fragmentation due to alignment padding for different orders in the
>> cluster. If there is no current cluster for a given order, we attempt to
>> allocate a free cluster from the list. If there are no free clusters, we
>> fail the allocation and the caller falls back to splitting the folio and
>> allocates individual entries (as per existing PMD-sized THP fallback).
>>
>> As far as I can tell, this should not cause any extra fragmentation
>> concerns, given how similar it is to the existing PMD-sized THP
>> allocation mechanism. There will be up to (PMD_ORDER-1) clusters in
>> concurrent use though. In practice, the number of orders in use will be
>> small though.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  include/linux/swap.h |  7 ++++++
>>  mm/swapfile.c        | 60 +++++++++++++++++++++++++++++++++-----------
>>  mm/vmscan.c          | 10 +++++---
>>  3 files changed, 59 insertions(+), 18 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index a073366a227c..fc55b760aeff 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -320,6 +320,13 @@ struct swap_info_struct {
>>  					 */
>>  	struct work_struct discard_work; /* discard worker */
>>  	struct swap_cluster_list discard_clusters; /* discard clusters list */
>> +	unsigned int large_next[PMD_ORDER]; /*
>> +					     * next free offset within current
>> +					     * allocation cluster for large
>> +					     * folios, or UINT_MAX if no current
>> +					     * cluster. Index is (order - 1).
>> +					     * Only when cluster_info is used.
>> +					     */
> 
> I think that it is better to make this per-CPU.  That is, extend the
> percpu_cluster mechanism.  Otherwise, we may have scalability issue.

Is your concern that the swap_info spinlock will get too contended as it's
currently written? From briefly looking at percpu_cluster, it looks like that
spinlock is always held when accessing the per-cpu structures - presumably
that's what's disabling preemption and making sure the thread is not migrated?
So I'm not sure what the benefit is currently? Surely you want to just disable
preemption but not hold the lock? I'm sure I've missed something crucial...
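
To make sure I've understood, the shape I would have naively expected is
something like the below - just a sketch to illustrate the question, not a
claim that the shared-state update really can be separated out like this:

    struct percpu_cluster *pcp;

    /* touch only this CPU's cluster state, with preemption disabled... */
    pcp = get_cpu_ptr(si->percpu_cluster);
    /* ... pick the next offset(s) from pcp ... */
    put_cpu_ptr(si->percpu_cluster);

    /* ... and take si->lock only for the updates to shared state */
    spin_lock(&si->lock);
    /* ... update the shared counters etc ... */
    spin_unlock(&si->lock);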

> 
> And this should be enclosed in CONFIG_THP_SWAP.

Yes, I'll fix this in the next version.

Thanks for the review!

> 
>>  	struct plist_node avail_lists[]; /*
>>  					   * entries in swap_avail_heads, one
>>  					   * entry per node.
> 
> --
> Best Regards,
> Huang, Ying



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v1 2/2] mm: swap: Swap-out small-sized THP without splitting
  2023-10-11 10:36     ` Ryan Roberts
@ 2023-10-11 17:14       ` Ryan Roberts
  2023-10-16  6:17         ` Huang, Ying
  0 siblings, 1 reply; 17+ messages in thread
From: Ryan Roberts @ 2023-10-11 17:14 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, linux-kernel, linux-mm

On 11/10/2023 11:36, Ryan Roberts wrote:
> On 11/10/2023 09:25, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>
>>> The upcoming anonymous small-sized THP feature enables performance
>>> improvements by allocating large folios for anonymous memory. However
>>> I've observed that on an arm64 system running a parallel workload (e.g.
>>> kernel compilation) across many cores, under high memory pressure, the
>>> speed regresses. This is due to bottlenecking on the increased number of
>>> TLBIs added due to all the extra folio splitting.
>>>
>>> Therefore, solve this regression by adding support for swapping out
>>> small-sized THP without needing to split the folio, just like is already
>>> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
>>> enabled, and when the swap backing store is a non-rotating block device
>>> - these are the same constraints as for the existing PMD-sized THP
>>> swap-out support.
>>>
>>> Note that no attempt is made to swap-in THP here - this is still done
>>> page-by-page, like for PMD-sized THP.
>>>
>>> The main change here is to improve the swap entry allocator so that it
>>> can allocate any power-of-2 number of contiguous entries between [4, (1
>>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>>> order and allocating sequentially from it until the cluster is full.
>>> This ensures that we don't need to search the map and we get no
>>> fragmentation due to alignment padding for different orders in the
>>> cluster. If there is no current cluster for a given order, we attempt to
>>> allocate a free cluster from the list. If there are no free clusters, we
>>> fail the allocation and the caller falls back to splitting the folio and
>>> allocates individual entries (as per existing PMD-sized THP fallback).
>>>
>>> As far as I can tell, this should not cause any extra fragmentation
>>> concerns, given how similar it is to the existing PMD-sized THP
>>> allocation mechanism. There will be up to (PMD_ORDER-1) clusters in
>>> concurrent use though. In practice, the number of orders in use will be
>>> small though.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>  include/linux/swap.h |  7 ++++++
>>>  mm/swapfile.c        | 60 +++++++++++++++++++++++++++++++++-----------
>>>  mm/vmscan.c          | 10 +++++---
>>>  3 files changed, 59 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index a073366a227c..fc55b760aeff 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -320,6 +320,13 @@ struct swap_info_struct {
>>>  					 */
>>>  	struct work_struct discard_work; /* discard worker */
>>>  	struct swap_cluster_list discard_clusters; /* discard clusters list */
>>> +	unsigned int large_next[PMD_ORDER]; /*
>>> +					     * next free offset within current
>>> +					     * allocation cluster for large
>>> +					     * folios, or UINT_MAX if no current
>>> +					     * cluster. Index is (order - 1).
>>> +					     * Only when cluster_info is used.
>>> +					     */
>>
>> I think that it is better to make this per-CPU.  That is, extend the
>> percpu_cluster mechanism.  Otherwise, we may have scalability issue.
> 
> Is your concern that the swap_info spinlock will get too contended as its
> currently written? From briefly looking at percpu_cluster, it looks like that
> spinlock is always held when accessing the per-cpu structures - presumably
> that's what's disabling preemption and making sure the thread is not migrated?
> So I'm not sure what the benefit is currently? Surely you want to just disable
> preemption but not hold the lock? I'm sure I've missed something crucial...

I looked a bit further at how to implement what you are suggesting.
get_swap_pages() currently takes the swap_info lock, which it needs in order to
check and update some other parts of the swap_info - I'm not sure that part can
be removed. swap_alloc_large() (my new function) is not doing an awful lot of
work, so I'm not convinced that you would save much by releasing the lock for
that part. In contrast, there is a lot more going on in scan_swap_map_slots(),
so there is more benefit to releasing the lock and using the percpu stuff -
correct me if I've misunderstood.
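
For reference, the allocation step in swap_alloc_large() boils down to
something like the below - a simplified sketch rather than the actual patch
code, with cluster_take_free() as a made-up stand-in for pulling a cluster off
the free list:

    unsigned int nr = 1 << order;
    unsigned int *next = &si->large_next[order - 1];

    if (*next == UINT_MAX) {
        unsigned long idx;

        if (!cluster_take_free(si, &idx))
            return false;   /* caller splits the folio instead */
        *next = idx * SWAPFILE_CLUSTER;
    }

    offset = *next;
    *next += nr;
    /* orders are powers of 2, so nr always divides SWAPFILE_CLUSTER */
    if (*next % SWAPFILE_CLUSTER == 0)
        *next = UINT_MAX;   /* current cluster for this order is full */

So it really is just a handful of arithmetic operations under the lock.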

As an alternative approach, perhaps it makes more sense to beef up the caching
layer in swap_slots.c to handle large folios too? Then you avoid taking the
swap_info lock at all most of the time, like you currently do for single entry
allocations.

What do you think?

> 
>>
>> And this should be enclosed in CONFIG_THP_SWAP.
> 
> Yes, I'll fix this in the next version.
> 
> Thanks for the review!
> 
>>
>>>  	struct plist_node avail_lists[]; /*
>>>  					   * entries in swap_avail_heads, one
>>>  					   * entry per node.
>>
>> --
>> Best Regards,
>> Huang, Ying
> 



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v1 0/2] Swap-out small-sized THP without splitting
  2023-10-11  6:37 ` [RFC PATCH v1 0/2] " Huang, Ying
  2023-10-11  7:42   ` Ryan Roberts
@ 2023-10-13 16:31   ` Ryan Roberts
  1 sibling, 0 replies; 17+ messages in thread
From: Ryan Roberts @ 2023-10-13 16:31 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, linux-kernel, linux-mm

On 11/10/2023 07:37, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
> [...]
> 
>> Finally on testing, I've run the mm selftests and see no regressions, but I
>> don't think there is anything in there specifically aimed towards swap? Are
>> there any functional or performance tests that I should run? It would certainly
>> be good to confirm I haven't regressed PMD-size THP swap performance.
> 
> I have used the swap sub-test case of vm-scalability to test.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/

I ended up using `usemem`, which is the core of this test suite, but I deviated
from the pre-canned test case so that I could use anonymous memory and get
numbers for small-sized THP (this is a very useful tool - thanks for pointing it
out!).

I've run the tests on an Ampere Altra, set up with a 35G block ram device as
the swap device, and from inside a memcg limited to 40G of memory. I then ran
`usemem` with 70 processes (each on its own core), each allocating and writing
1G of memory. I repeated everything 5 times and took the mean and stdev:


Mean Performance Improvement vs 4K/baseline

| alloc size |            baseline |    remove-huge-flag | swap-file-small-thp |
|            |  v6.6-rc4+anonfolio |           + patch 1 |           + patch 2 |
|:-----------|--------------------:|--------------------:|--------------------:|
| 4K Page    |                0.0% |                2.3% |                9.1% |
| 64K THP    |              -44.1% |              -46.3% |               30.6% |
| 2M THP     |               56.0% |               54.2% |               60.1% |


Standard Deviation as Percentage of Mean

| alloc size |            baseline |    remove-huge-flag | swap-file-small-thp |
|            |  v6.6-rc4+anonfolio |           + patch 1 |           + patch 2 |
|:-----------|--------------------:|--------------------:|--------------------:|
| 4K Page    |                3.4% |                7.1% |                1.7% |
| 64K THP    |                1.9% |                5.6% |                7.7% |
| 2M THP     |                1.9% |                2.1% |                3.2% |


I don't see any meaningful performance cost to removing the HUGE flag, so
hopefully this gives us confidence to move forward with patch 1.

You can indeed see the performance regression in the baseline when THP is
configured to allocate small-sized THP only (in this case 64K). And you can see
the regression is fixed by patch 2, which avoids splitting the THP and thus
avoids the extra TLBIs. This correlates with what I saw in the kernel
compilation workload.

Huang Ying, based on these results, do you still want me to pursue a per-cpu
solution to avoid potential contention on the swap_info lock? If so, I proposed
in the thread against patch 2 to do this in the swap_slots layer rather than in
swapfile.c directly (I'm not sure how your original proposal would actually
work?). But based on these results, it's not obvious to me that there is a
definite problem here, and it might be simpler to avoid the complexity.

Thanks,
Ryan

> 
> --
> Best Regards,
> Huang, Ying



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v1 2/2] mm: swap: Swap-out small-sized THP without splitting
  2023-10-11 17:14       ` Ryan Roberts
@ 2023-10-16  6:17         ` Huang, Ying
  2023-10-16 12:10           ` Ryan Roberts
  0 siblings, 1 reply; 17+ messages in thread
From: Huang, Ying @ 2023-10-16  6:17 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, linux-kernel, linux-mm

Ryan Roberts <ryan.roberts@arm.com> writes:

> On 11/10/2023 11:36, Ryan Roberts wrote:
>> On 11/10/2023 09:25, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> The upcoming anonymous small-sized THP feature enables performance
>>>> improvements by allocating large folios for anonymous memory. However
>>>> I've observed that on an arm64 system running a parallel workload (e.g.
>>>> kernel compilation) across many cores, under high memory pressure, the
>>>> speed regresses. This is due to bottlenecking on the increased number of
>>>> TLBIs added due to all the extra folio splitting.
>>>>
>>>> Therefore, solve this regression by adding support for swapping out
>>>> small-sized THP without needing to split the folio, just like is already
>>>> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
>>>> enabled, and when the swap backing store is a non-rotating block device
>>>> - these are the same constraints as for the existing PMD-sized THP
>>>> swap-out support.
>>>>
>>>> Note that no attempt is made to swap-in THP here - this is still done
>>>> page-by-page, like for PMD-sized THP.
>>>>
>>>> The main change here is to improve the swap entry allocator so that it
>>>> can allocate any power-of-2 number of contiguous entries between [4, (1
>>>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>>>> order and allocating sequentially from it until the cluster is full.
>>>> This ensures that we don't need to search the map and we get no
>>>> fragmentation due to alignment padding for different orders in the
>>>> cluster. If there is no current cluster for a given order, we attempt to
>>>> allocate a free cluster from the list. If there are no free clusters, we
>>>> fail the allocation and the caller falls back to splitting the folio and
>>>> allocates individual entries (as per existing PMD-sized THP fallback).
>>>>
>>>> As far as I can tell, this should not cause any extra fragmentation
>>>> concerns, given how similar it is to the existing PMD-sized THP
>>>> allocation mechanism. There will be up to (PMD_ORDER-1) clusters in
>>>> concurrent use though. In practice, the number of orders in use will be
>>>> small though.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>  include/linux/swap.h |  7 ++++++
>>>>  mm/swapfile.c        | 60 +++++++++++++++++++++++++++++++++-----------
>>>>  mm/vmscan.c          | 10 +++++---
>>>>  3 files changed, 59 insertions(+), 18 deletions(-)
>>>>
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index a073366a227c..fc55b760aeff 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -320,6 +320,13 @@ struct swap_info_struct {
>>>>  					 */
>>>>  	struct work_struct discard_work; /* discard worker */
>>>>  	struct swap_cluster_list discard_clusters; /* discard clusters list */
>>>> +	unsigned int large_next[PMD_ORDER]; /*
>>>> +					     * next free offset within current
>>>> +					     * allocation cluster for large
>>>> +					     * folios, or UINT_MAX if no current
>>>> +					     * cluster. Index is (order - 1).
>>>> +					     * Only when cluster_info is used.
>>>> +					     */
>>>
>>> I think that it is better to make this per-CPU.  That is, extend the
>>> percpu_cluster mechanism.  Otherwise, we may have scalability issue.
>> 
>> Is your concern that the swap_info spinlock will get too contended as its
>> currently written? From briefly looking at percpu_cluster, it looks like that
>> spinlock is always held when accessing the per-cpu structures - presumably
>> that's what's disabling preemption and making sure the thread is not migrated?
>> So I'm not sure what the benefit is currently? Surely you want to just disable
>> preemption but not hold the lock? I'm sure I've missed something crucial...
>
> I looked a bit further at how to implement what you are suggesting.
> get_swap_pages() is currently taking the swap_info lock which it needs to check
> and update some other parts of the swap_info - I'm not sure that part can be
> removed. swap_alloc_large() (my new function) is not doing an awful lot of work,
> so I'm not convinced that you would save too much by releasing the lock for that
> part. In contrast there is a lot more going on in scan_swap_map_slots() so there
> is more benefit to releasing the lock and using the percpu stuff - correct me if
> I've missunderstood.
>
> As an alternative approach, perhaps it makes more sense to beef up the caching
> layer in swap_slots.c to handle large folios too? Then you avoid taking the
> swap_info lock at all most of the time, like you currently do for single entry
> allocations.
>
> What do you think?

Sorry for the late reply.

percpu_cluster was introduced in commit ebc2a1a69111 ("swap: make cluster
allocation per-cpu").  Please check the changelog for why it was
introduced.  Sorry about my incorrect memory regarding scalability:
percpu_cluster was introduced mainly for disk performance rather than
scalability.

--
Best Regards,
Huang, Ying

>> 
>>>
>>> And this should be enclosed in CONFIG_THP_SWAP.
>> 
>> Yes, I'll fix this in the next version.
>> 
>> Thanks for the review!
>> 
>>>
>>>>  	struct plist_node avail_lists[]; /*
>>>>  					   * entries in swap_avail_heads, one
>>>>  					   * entry per node.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v1 2/2] mm: swap: Swap-out small-sized THP without splitting
  2023-10-16  6:17         ` Huang, Ying
@ 2023-10-16 12:10           ` Ryan Roberts
  2023-10-17  5:44             ` Huang, Ying
  0 siblings, 1 reply; 17+ messages in thread
From: Ryan Roberts @ 2023-10-16 12:10 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, linux-kernel, linux-mm

On 16/10/2023 07:17, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> On 11/10/2023 11:36, Ryan Roberts wrote:
>>> On 11/10/2023 09:25, Huang, Ying wrote:
>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>
>>>>> The upcoming anonymous small-sized THP feature enables performance
>>>>> improvements by allocating large folios for anonymous memory. However
>>>>> I've observed that on an arm64 system running a parallel workload (e.g.
>>>>> kernel compilation) across many cores, under high memory pressure, the
>>>>> speed regresses. This is due to bottlenecking on the increased number of
>>>>> TLBIs added due to all the extra folio splitting.
>>>>>
>>>>> Therefore, solve this regression by adding support for swapping out
>>>>> small-sized THP without needing to split the folio, just like is already
>>>>> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
>>>>> enabled, and when the swap backing store is a non-rotating block device
>>>>> - these are the same constraints as for the existing PMD-sized THP
>>>>> swap-out support.
>>>>>
>>>>> Note that no attempt is made to swap-in THP here - this is still done
>>>>> page-by-page, like for PMD-sized THP.
>>>>>
>>>>> The main change here is to improve the swap entry allocator so that it
>>>>> can allocate any power-of-2 number of contiguous entries between [4, (1
>>>>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>>>>> order and allocating sequentially from it until the cluster is full.
>>>>> This ensures that we don't need to search the map and we get no
>>>>> fragmentation due to alignment padding for different orders in the
>>>>> cluster. If there is no current cluster for a given order, we attempt to
>>>>> allocate a free cluster from the list. If there are no free clusters, we
>>>>> fail the allocation and the caller falls back to splitting the folio and
>>>>> allocates individual entries (as per existing PMD-sized THP fallback).
>>>>>
>>>>> As far as I can tell, this should not cause any extra fragmentation
>>>>> concerns, given how similar it is to the existing PMD-sized THP
>>>>> allocation mechanism. There will be up to (PMD_ORDER-1) clusters in
>>>>> concurrent use though. In practice, the number of orders in use will be
>>>>> small though.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>> ---
>>>>>  include/linux/swap.h |  7 ++++++
>>>>>  mm/swapfile.c        | 60 +++++++++++++++++++++++++++++++++-----------
>>>>>  mm/vmscan.c          | 10 +++++---
>>>>>  3 files changed, 59 insertions(+), 18 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>>> index a073366a227c..fc55b760aeff 100644
>>>>> --- a/include/linux/swap.h
>>>>> +++ b/include/linux/swap.h
>>>>> @@ -320,6 +320,13 @@ struct swap_info_struct {
>>>>>  					 */
>>>>>  	struct work_struct discard_work; /* discard worker */
>>>>>  	struct swap_cluster_list discard_clusters; /* discard clusters list */
>>>>> +	unsigned int large_next[PMD_ORDER]; /*
>>>>> +					     * next free offset within current
>>>>> +					     * allocation cluster for large
>>>>> +					     * folios, or UINT_MAX if no current
>>>>> +					     * cluster. Index is (order - 1).
>>>>> +					     * Only when cluster_info is used.
>>>>> +					     */
>>>>
>>>> I think that it is better to make this per-CPU.  That is, extend the
>>>> percpu_cluster mechanism.  Otherwise, we may have scalability issue.
>>>
>>> Is your concern that the swap_info spinlock will get too contended as its
>>> currently written? From briefly looking at percpu_cluster, it looks like that
>>> spinlock is always held when accessing the per-cpu structures - presumably
>>> that's what's disabling preemption and making sure the thread is not migrated?
>>> So I'm not sure what the benefit is currently? Surely you want to just disable
>>> preemption but not hold the lock? I'm sure I've missed something crucial...
>>
>> I looked a bit further at how to implement what you are suggesting.
>> get_swap_pages() is currently taking the swap_info lock which it needs to check
>> and update some other parts of the swap_info - I'm not sure that part can be
>> removed. swap_alloc_large() (my new function) is not doing an awful lot of work,
>> so I'm not convinced that you would save too much by releasing the lock for that
>> part. In contrast there is a lot more going on in scan_swap_map_slots() so there
>> is more benefit to releasing the lock and using the percpu stuff - correct me if
>> I've missunderstood.
>>
>> As an alternative approach, perhaps it makes more sense to beef up the caching
>> layer in swap_slots.c to handle large folios too? Then you avoid taking the
>> swap_info lock at all most of the time, like you currently do for single entry
>> allocations.
>>
>> What do you think?
> 
> Sorry for late reply.
> 
> percpu_cluster is introduced in commit ebc2a1a69111 ("swap: make cluster
> allocation per-cpu").  Please check the changelog for why it's
> introduced.  Sorry about my incorrect memory about scalability.
> percpu_cluster is introduced for disk performance mainly instead of
> scalability.

Thanks for the pointer. I'm not sure if you are still suggesting that I make my
small-sized THP allocation mechanism per-cpu though?

I anticipate that, by virtue of allocating multiple contiguous swap entries for
a small-sized THP, we already get a lot of the benefit that percpu_cluster gives
order-0 allocations (although obviously it only gives contiguity matching the
size of the THP rather than a full cluster). The downside of making this
per-cpu would be keeping more free clusters tied up in the per-cpu caches,
potentially causing a need to scan for free entries more often.
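
If we did go per-cpu, I imagine the state would end up looking something like
the below sketch - i.e. moving large_next[] from swap_info_struct into the
existing percpu_cluster (existing fields paraphrased, exact layout invented for
illustration):

    struct percpu_cluster {
        struct swap_cluster_info index;     /* current order-0 cluster */
        unsigned int next;                  /* next order-0 allocation offset */
        unsigned int large_next[PMD_ORDER]; /* as per patch 2, but per-CPU */
    };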

What's your view?

Thanks,
Ryan



> 
> --
> Best Regards,
> Huang, Ying
> 
>>>
>>>>
>>>> And this should be enclosed in CONFIG_THP_SWAP.
>>>
>>> Yes, I'll fix this in the next version.
>>>
>>> Thanks for the review!
>>>
>>>>
>>>>>  	struct plist_node avail_lists[]; /*
>>>>>  					   * entries in swap_avail_heads, one
>>>>>  					   * entry per node.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v1 2/2] mm: swap: Swap-out small-sized THP without splitting
  2023-10-16 12:10           ` Ryan Roberts
@ 2023-10-17  5:44             ` Huang, Ying
  0 siblings, 0 replies; 17+ messages in thread
From: Huang, Ying @ 2023-10-17  5:44 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang,
	Yu Zhao, Yang Shi, Michal Hocko, linux-kernel, linux-mm

Ryan Roberts <ryan.roberts@arm.com> writes:

> On 16/10/2023 07:17, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> On 11/10/2023 11:36, Ryan Roberts wrote:
>>>> On 11/10/2023 09:25, Huang, Ying wrote:
>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>>
>>>>>> The upcoming anonymous small-sized THP feature enables performance
>>>>>> improvements by allocating large folios for anonymous memory. However
>>>>>> I've observed that on an arm64 system running a parallel workload (e.g.
>>>>>> kernel compilation) across many cores, under high memory pressure, the
>>>>>> speed regresses. This is due to bottlenecking on the increased number of
>>>>>> TLBIs added due to all the extra folio splitting.
>>>>>>
>>>>>> Therefore, solve this regression by adding support for swapping out
>>>>>> small-sized THP without needing to split the folio, just like is already
>>>>>> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is
>>>>>> enabled, and when the swap backing store is a non-rotating block device
>>>>>> - these are the same constraints as for the existing PMD-sized THP
>>>>>> swap-out support.
>>>>>>
>>>>>> Note that no attempt is made to swap-in THP here - this is still done
>>>>>> page-by-page, like for PMD-sized THP.
>>>>>>
>>>>>> The main change here is to improve the swap entry allocator so that it
>>>>>> can allocate any power-of-2 number of contiguous entries between [4, (1
>>>>>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>>>>>> order and allocating sequentially from it until the cluster is full.
>>>>>> This ensures that we don't need to search the map and we get no
>>>>>> fragmentation due to alignment padding for different orders in the
>>>>>> cluster. If there is no current cluster for a given order, we attempt to
>>>>>> allocate a free cluster from the list. If there are no free clusters, we
>>>>>> fail the allocation and the caller falls back to splitting the folio and
>>>>>> allocates individual entries (as per existing PMD-sized THP fallback).
>>>>>>
>>>>>> As far as I can tell, this should not cause any extra fragmentation
>>>>>> concerns, given how similar it is to the existing PMD-sized THP
>>>>>> allocation mechanism. There will be up to (PMD_ORDER-1) clusters in
>>>>>> concurrent use though. In practice, the number of orders in use will be
>>>>>> small though.
>>>>>>
>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>> ---
>>>>>>  include/linux/swap.h |  7 ++++++
>>>>>>  mm/swapfile.c        | 60 +++++++++++++++++++++++++++++++++-----------
>>>>>>  mm/vmscan.c          | 10 +++++---
>>>>>>  3 files changed, 59 insertions(+), 18 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>>>> index a073366a227c..fc55b760aeff 100644
>>>>>> --- a/include/linux/swap.h
>>>>>> +++ b/include/linux/swap.h
>>>>>> @@ -320,6 +320,13 @@ struct swap_info_struct {
>>>>>>  					 */
>>>>>>  	struct work_struct discard_work; /* discard worker */
>>>>>>  	struct swap_cluster_list discard_clusters; /* discard clusters list */
>>>>>> +	unsigned int large_next[PMD_ORDER]; /*
>>>>>> +					     * next free offset within current
>>>>>> +					     * allocation cluster for large
>>>>>> +					     * folios, or UINT_MAX if no current
>>>>>> +					     * cluster. Index is (order - 1).
>>>>>> +					     * Only when cluster_info is used.
>>>>>> +					     */
>>>>>
>>>>> I think that it is better to make this per-CPU.  That is, extend the
>>>>> percpu_cluster mechanism.  Otherwise, we may have scalability issue.
>>>>
>>>> Is your concern that the swap_info spinlock will get too contended as its
>>>> currently written? From briefly looking at percpu_cluster, it looks like that
>>>> spinlock is always held when accessing the per-cpu structures - presumably
>>>> that's what's disabling preemption and making sure the thread is not migrated?
>>>> So I'm not sure what the benefit is currently? Surely you want to just disable
>>>> preemption but not hold the lock? I'm sure I've missed something crucial...
>>>
>>> I looked a bit further at how to implement what you are suggesting.
>>> get_swap_pages() is currently taking the swap_info lock which it needs to check
>>> and update some other parts of the swap_info - I'm not sure that part can be
>>> removed. swap_alloc_large() (my new function) is not doing an awful lot of work,
>>> so I'm not convinced that you would save too much by releasing the lock for that
>>> part. In contrast there is a lot more going on in scan_swap_map_slots() so there
>>> is more benefit to releasing the lock and using the percpu stuff - correct me if
>>> I've missunderstood.
>>>
>>> As an alternative approach, perhaps it makes more sense to beef up the caching
>>> layer in swap_slots.c to handle large folios too? Then you avoid taking the
>>> swap_info lock at all most of the time, like you currently do for single entry
>>> allocations.
>>>
>>> What do you think?
>> 
>> Sorry for late reply.
>> 
>> percpu_cluster is introduced in commit ebc2a1a69111 ("swap: make cluster
>> allocation per-cpu").  Please check the changelog for why it's
>> introduced.  Sorry about my incorrect memory about scalability.
>> percpu_cluster is introduced for disk performance mainly instead of
>> scalability.
>
> Thanks for the pointer. I'm not sure if you are still suggesting that I make my
> small-sized THP allocation mechanism per-cpu though?

Yes.  I think that the reason we introduced percpu_cluster still
applies now.

> I anticipate that by virtue of allocating multiple contiguous swap entries for a
> small-sized THP we already get a lot of the benefits that percpu_cluster gives
> order-0 allocations. (Although obviously it will only give contiguity matching
> the size of the THP rather than a full cluster).

I think that you will still introduce "interleaved disk access" when
multiple CPUs allocate from and write to the swap device simultaneously.
Right?  Yes, a 16KB block is better than a 4KB block, but I don't think
it solves the problem.

> The downside of making this percpu would be keeping more free clusters
> tied up in the percpu caches, potentially causing a need to scan for
> free entries more often.

Yes.  We may waste several MB of swap space per CPU.  Is this a
practical issue, given that swap device capacities keep getting larger?

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2023-10-17  5:46 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-10 14:21 [RFC PATCH v1 0/2] Swap-out small-sized THP without splitting Ryan Roberts
2023-10-10 14:21 ` [RFC PATCH v1 1/2] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
2023-10-11  7:43   ` Huang, Ying
2023-10-11  8:17   ` Kefeng Wang
2023-10-11 10:15     ` Ryan Roberts
2023-10-11 10:16     ` Ryan Roberts
2023-10-10 14:21 ` [RFC PATCH v1 2/2] mm: swap: Swap-out small-sized THP without splitting Ryan Roberts
2023-10-11  7:44   ` Ryan Roberts
2023-10-11  8:25   ` Huang, Ying
2023-10-11 10:36     ` Ryan Roberts
2023-10-11 17:14       ` Ryan Roberts
2023-10-16  6:17         ` Huang, Ying
2023-10-16 12:10           ` Ryan Roberts
2023-10-17  5:44             ` Huang, Ying
2023-10-11  6:37 ` [RFC PATCH v1 0/2] " Huang, Ying
2023-10-11  7:42   ` Ryan Roberts
2023-10-13 16:31   ` Ryan Roberts

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).