From: Zach O'Keefe <>
To: Alex Shi <>,
	David Hildenbrand <>,
	David Rientjes <>,
	Matthew Wilcox <>,
	Michal Hocko <>,
	Pasha Tatashin <>,
	Peter Xu <>,
	Rongwei Wang <>,
	SeongJae Park <>, Song Liu <>,
	Vlastimil Babka <>, Yang Shi <>,
	Zi Yan <>,
Cc: Andrea Arcangeli <>,
	Andrew Morton <>,
	Arnd Bergmann <>,
	Axel Rasmussen <>,
	Chris Kennelly <>,
	Chris Zankel <>, Helge Deller <>,
	Hugh Dickins <>,
	Ivan Kokshaysky <>,
	"James E.J. Bottomley" <>,
	Jens Axboe <>,
	"Kirill A. Shutemov" <>,
	Matt Turner <>,
	Max Filippov <>,
	Miaohe Lin <>,
	Minchan Kim <>,
	Patrick Xia <>,
	Pavel Begunkov <>,
	Thomas Bogendoerfer <>,
	Souptick Joarder <>
Subject: Re: [RFC] mm: userspace hugepage collapse: file/shmem semantics
Date: Thu, 14 Jul 2022 11:55:01 -0700
Message-ID: <>
In-Reply-To: <>

Hey All,

There are still a couple of interface topics (capabilities for
process_madvise(2), errnos) to iron out, but for the most part the behavior and
semantics of MADV_COLLAPSE on anonymous memory seem to be settled. Thanks to
everyone for the time and effort that went into that.

Looking forward, I'd like to align on the semantics of file/shmem so we can seal
MADV_COLLAPSE behavior. This is what I'd propose for an initial man-page-like
description of MADV_COLLAPSE for madvise(2), to paint a full-picture view:

Perform a best-effort synchronous collapse of the native pages mapped by the
memory range into Transparent Hugepages (THPs). MADV_COLLAPSE operates on the
current state of memory for the specified process and makes no persistent
changes or guarantees on how pages will be mapped, constructed, or faulted in
the future. However, for file/shmem memory, other mappings of this file extent
may be queued and processed later by khugepaged to attempt to update their
pagetables to map the hugepage by a PMD.
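
As a minimal sketch of how a caller might issue the request described above:
try_collapse() is a hypothetical helper name, and the fallback value 25 for
MADV_COLLAPSE is an assumption for libc headers that do not define it yet (it
is the value mainline Linux uses). Since the collapse is best-effort, the
helper only reports the errno and leaves any retry policy to the caller.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumption: fallback for headers that predate MADV_COLLAPSE. */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/*
 * Hypothetical helper: ask the kernel to synchronously collapse
 * [addr, addr + len) into THPs. Returns 0 on success, otherwise the
 * errno reported by madvise(2) (for example EINVAL on kernels that do
 * not know the advice).
 */
static int try_collapse(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);

    if (madvise(addr, len, MADV_COLLAPSE) == 0)
        return 0;
    return errno;
}
```

A nonzero return does not mean the memory is unusable; the range simply stays
mapped however it was before the call.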

If the ranges provided span multiple VMAs, the semantics of the collapse over
each VMA are independent from the others. This implies a hugepage cannot cross a
VMA boundary. If collapse of a given hugepage-aligned/sized region fails, the
operation may continue to attempt collapsing the remainder of the specified
memory.
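
To illustrate what "hugepage-aligned/sized region" means for a caller-supplied
range, here is a rough sketch of the arithmetic. It assumes 2 MiB PMD-sized
hugepages (as on x86-64 with 4 KiB base pages); SKETCH_HPAGE_SIZE,
struct hrange and hugepage_regions() are illustrative names, not kernel or
libc interfaces. The same rounding would apply separately to the part of the
range that falls in each VMA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 2 MiB PMD-sized hugepages. */
#define SKETCH_HPAGE_SIZE ((uintptr_t)2 << 20)

struct hrange {
    uintptr_t start; /* inclusive, hugepage aligned */
    uintptr_t end;   /* exclusive, hugepage aligned */
};

/*
 * Shrink [addr, addr + len) inward to hugepage boundaries: round the
 * start up and the end down. Returns the number of hugepage-sized
 * regions that fit, or 0 if the range covers no complete region.
 */
static size_t hugepage_regions(uintptr_t addr, size_t len, struct hrange *out)
{
    uintptr_t mask = ~(SKETCH_HPAGE_SIZE - 1);
    uintptr_t start = (addr + SKETCH_HPAGE_SIZE - 1) & mask;
    uintptr_t end = (addr + len) & mask;

    assert(out != NULL);

    if (end <= start) {
        out->start = 0;
        out->end = 0;
        return 0;
    }
    out->start = start;
    out->end = end;
    return (size_t)((end - start) / SKETCH_HPAGE_SIZE);
}
```

Only the regions counted here are candidates for collapse; the unaligned head
and tail of the range are left as they are.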

All non-resident pages covered by the range will first be swapped/faulted-in,
before being copied onto a freshly allocated hugepage. If the native pages
compose the same PTE-mapped hugepage, and are suitably aligned, the collapse
may happen in-place. Unmapped pages will have their data directly initialized
to 0 in the new hugepage. However, for every eligible hugepage aligned/sized
region to-be collapsed, at least one page must currently be backed by memory.

MADV_COLLAPSE is independent of any THP sysfs setting, both in terms of
determining THP eligibility, and allocation semantics. The VMA must not be
VM_PFNMAP, nor can it be stack memory or DAX-backed. The process must not have
PR_SET_THP_DISABLE set. For file-backed memory, the file must either be (1) not
open for write, and the mapping must be executable, or (2) the backing
filesystem must support large pages. Allocation for the new hugepage may enter
direct reclaim and/or compaction, regardless of VMA flags.  When the system has
multiple NUMA nodes, the hugepage will be allocated from the node providing the
most native pages.
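
The eligibility rules in the paragraph above can be written down as a small
predicate. This is only a sketch of the proposed semantics, not kernel code:
struct collapse_candidate and its field names are hypothetical stand-ins for
the VMA, process and file state the text refers to. Note that no THP sysfs
setting appears among the inputs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the state the proposed rules look at. */
struct collapse_candidate {
    bool is_pfnmap;                /* VMA is VM_PFNMAP */
    bool is_stack;                 /* VMA is stack memory */
    bool is_dax;                   /* VMA is DAX-backed */
    bool thp_disabled_by_prctl;    /* process has PR_SET_THP_DISABLE set */
    bool file_backed;              /* false for anonymous memory */
    bool file_open_for_write;      /* file is open for write somewhere */
    bool mapping_executable;       /* mapping is executable */
    bool fs_supports_large_folios; /* backing filesystem supports large pages */
};

/* Returns true if the proposed text would consider the memory eligible. */
static bool collapse_eligible(const struct collapse_candidate *c)
{
    assert(c != NULL);

    if (c->is_pfnmap || c->is_stack || c->is_dax)
        return false;
    if (c->thp_disabled_by_prctl)
        return false;
    if (!c->file_backed)
        return true;

    /* File-backed: condition (1) or condition (2) from the text. */
    return (!c->file_open_for_write && c->mapping_executable) ||
           c->fs_supports_large_folios;
}
```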

If all hugepage-sized/aligned regions covered by the provided range were either
successfully collapsed, or were already PMD-mapped THPs, this operation will be
deemed successful. On successful return, all hugepage-aligned/sized memory
regions provided will be mapped by PMDs. Note that this doesn’t guarantee
anything about other possible mappings of the memory. Note that many failures
might have occurred, since the operation may continue to collapse in the event
collapse of a single hugepage-sized/aligned region fails.
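
The success rule above amounts to aggregating per-region outcomes without
stopping at the first failure. A sketch, assuming a hypothetical
enum region_result that records what happened to each
hugepage-aligned/sized region (the names are illustrative, not the kernel's
scan result codes):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-region outcomes. */
enum region_result {
    REGION_COLLAPSED,
    REGION_ALREADY_PMD_MAPPED,
    REGION_FAILED,
};

/*
 * Returns 0 if every region was collapsed or was already a PMD-mapped
 * THP, and -1 otherwise. Every entry is examined, so *nr_failed (if
 * non-NULL) may be greater than one when the call fails.
 */
static int overall_status(const enum region_result *results, size_t n,
                          size_t *nr_failed)
{
    size_t i, failed = 0;

    assert(results != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (results[i] == REGION_FAILED)
            failed++;
    }
    if (nr_failed)
        *nr_failed = failed;
    return failed ? -1 : 0;
}
```

A caller that sees a failure therefore cannot assume only one region was left
uncollapsed.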

MADV_COLLAPSE is only available if the kernel was configured with
CONFIG_TRANSPARENT_HUGEPAGE, and file/shmem support additionally requires

** Might change with HugeTLB high-granularity mappings[1].

There are a few new items of note here:

1) PMD-mapped on success

MADV_COLLAPSE ultimately wants memory mapped by PMDs, and so I propose we
should always try to actually do the page table updates. For file/shmem, this
means two things: (a) adding support to handle compound pages (both pte-mapped
hugepages and non-HPAGE_PMD_ORDER compound pages), and (b) doing a final PMD
install before returning, and not relying on subsequent fault. This makes the
semantics of file/shmem the same as anonymous. I call out (a), since there was
an existing debate about this, and so I want to ensure we are aligned[1]. Note
that (b), along with presenting a consistent interface to users, also has
real-world use cases where relying on fault is difficult (for example,
shmem + UFFDIO_REGISTER_MODE_MINOR-managed memory). Also note that for (b), I'm
proposing to only do the synchronous PMD install for the memory range provided
- the page table collapse of other mappings of the memory can be deferred until
later (by khugepaged).
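
One way userspace could check the "PMD-mapped on success" property is to look
at the AnonHugePages, ShmemPmdMapped and FilePmdMapped lines of
/proc/<pid>/smaps after the call. The sketch below only covers the parsing
step, and smaps_sum_kb() is a hypothetical helper name; reading the file and
narrowing the text down to the VMA of interest is left to the caller.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sum, in kB, the value of every "<field>: <n> kB" line found in
 * smaps-formatted text. Returns 0 if the field does not appear.
 */
static unsigned long smaps_sum_kb(const char *text, const char *field)
{
    size_t flen;
    unsigned long total = 0;
    const char *line = text;

    assert(text != NULL && field != NULL);
    flen = strlen(field);

    while (line && *line) {
        /* Match the field name at the start of the line, then ':'. */
        if (strncmp(line, field, flen) == 0 && line[flen] == ':')
            total += strtoul(line + flen + 1, NULL, 10);

        /* Advance to the character after the next newline, if any. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return total;
}
```

For a shmem mapping, for example, a nonzero ShmemPmdMapped total after a
successful MADV_COLLAPSE would indicate the PMD install happened without
waiting for a later fault.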

2) folio timing && file non-writable, executable mapping

I just want to align on some timing due to ongoing folio work. Currently, the
requirement to be able to collapse file/shmem memory is that the file not be
opened for write anywhere, and that the mapping is executable, but we'd
eventually like to support filesystems that claim
mapping_large_folio_support()[2]. Is it acceptable that future MADV_COLLAPSE
works for either mapping_large_folio_support() or the old conditions?
Alternatively, should MADV_COLLAPSE only support mapping_large_folio_support()
filesystems from the onset? (I believe shmem and xfs are the only current

3) (shmem) sysfs settings and huge= tmpfs mount

Should we ignore /sys/kernel/mm/transparent_hugepage/shmem_enabled, similar to
how we ignore /sys/kernel/mm/transparent_hugepage/enabled for anon/file? Does
that include "deny"? This choice is (partially) coupled with tmpfs huge= mount
option. I think today, things work if we ignore this. However, I don't want to
back us into a corner if we ever want to allow MADV_COLLAPSE to work on
writeable shmem mappings one day (or any other incompatibility I'm unaware of).
One option: if in (2) we choose to allow the old conditions, we could ignore
shmem_enabled in the non-writable, executable case, and otherwise defer to
"if the filesystem supports it", where we would then respect huge=.

I think those are the important points. Am I missing anything?

Thanks again everyone for taking the time to read and discuss,


