From: Ryan Roberts <>
To: Andrew Morton <>,
	David Hildenbrand <>,
	Matthew Wilcox <>,
	Huang Ying <>, Gao Xiang <>,
	Yu Zhao <>, Yang Shi <>,
	Michal Hocko <>
Cc: Ryan Roberts <>,,
Subject: [RFC PATCH v1 0/2] Swap-out small-sized THP without splitting
Date: Tue, 10 Oct 2023 15:21:09 +0100

Hi All,

This is an RFC for a small series to add support for swapping out small-sized
THP without needing to first split the large folio via __split_huge_page(). It
closely follows the approach already used by PMD-sized THP.

"Small-sized THP" is an upcoming feature that enables performance improvements
by allocating large folios for anonymous memory, where the large folio size is
smaller than the traditional PMD-size. See [1].

In some circumstances I've observed a performance regression (see patch 2 for
details), and this series is an attempt to fix the regression in advance of
merging small-sized THP support.

I've done what I thought was the smallest change possible, and as a result, this
approach is only employed when the swap is backed by a non-rotating block device
(just as PMD-sized THP is supported today). However, I have a few questions on
whether we should consider relaxing those requirements in certain circumstances:

1) block-backed vs file-backed

The code only attempts to allocate a contiguous set of entries if swap is backed
by a block device (i.e. not file-backed). The original commit, f0eea189e8e9
("mm, THP, swap: don't allocate huge cluster for file backed swap device"),
stated "It's hard to write a whole transparent huge page (THP) to a file backed
swap device", but didn't state why. Does this imply there is a size limit at
which it becomes hard? And does that therefore imply that for "small enough"
sizes we should now allow use with file-backed swap?

This original commit was subsequently fixed with commit 41663430588c ("mm, THP,
swap: fix allocating cluster for swapfile by mistake"), which said the original
commit was using the wrong flag to determine if it was a block device and
therefore in some cases was actually doing large allocations for a file-backed
swap device, and this was causing file-system corruption. But that implies some
sort of correctness issue to me, rather than the performance issue I inferred
from the original commit.

If anyone can offer an explanation, that would be helpful in determining if we
should allow some large sizes for file-backed swap.

2) rotating vs non-rotating

I notice that the clustered approach is only used for non-rotating swap. That
implies that for rotating media, we will always fail a large allocation, and
fall back to splitting THPs to single pages, which implies that the regression
I'm fixing here may still be present on rotating media. Or perhaps rotating disk
is so slow that the cost of writing the data out dominates the cost of
splitting?

I considered that the free swap entry search algorithm used in this case could
potentially be modified to look for (small) contiguous runs of entries; up to
~16 pages (order-4) could be checked with 2x 64-bit reads from the map instead
of single-byte reads.
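
To make the 2x 64-bit read idea concrete, here is a rough userspace sketch, not
kernel code. It assumes the map holds one byte per swap entry with 0 meaning
"free", and it only considers naturally aligned 16-entry runs. The helper names
run16_is_free() and find_free_run16() are hypothetical and exist only for this
illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 if the 16 one-byte map entries starting at
 * map[offset] are all zero (assumed here to mean "free"), else 0.
 * Two 64-bit loads replace sixteen single-byte checks. memcpy() is used so
 * the loads are safe regardless of alignment.
 */
static int run16_is_free(const unsigned char *map, size_t offset)
{
	uint64_t lo, hi;

	memcpy(&lo, map + offset, sizeof(lo));
	memcpy(&hi, map + offset + 8, sizeof(hi));

	/* Any non-zero byte in either word means some entry is in use. */
	return (lo | hi) == 0;
}

/*
 * Hypothetical helper: scan the map in 16-entry (order-4) steps and return
 * the offset of the first fully free, naturally aligned run, or -1 if none.
 */
static long find_free_run16(const unsigned char *map, size_t nr_entries)
{
	size_t off;

	for (off = 0; off + 16 <= nr_entries; off += 16) {
		if (run16_is_free(map, off))
			return (long)off;
	}
	return -1;
}
```

Restricting the scan to aligned runs keeps the sketch simple; whether that
restriction helps or hurts fragmentation is exactly the open question below.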

I haven't looked into this idea in detail, but wonder if anybody thinks it is
worth the effort? Or perhaps it would end up causing bad fragmentation.

Finally on testing, I've run the mm selftests and see no regressions, but I
don't think there is anything in there specifically aimed towards swap? Are
there any functional or performance tests that I should run? It would certainly
be good to confirm I haven't regressed PMD-size THP swap performance.



Ryan Roberts (2):
  mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  mm: swap: Swap-out small-sized THP without splitting

 include/linux/swap.h |  17 +++----
 mm/huge_memory.c     |   3 --
 mm/swapfile.c        | 105 ++++++++++++++++++++++---------------------
 mm/vmscan.c          |  10 +++--
 4 files changed, 66 insertions(+), 69 deletions(-)

