From: "Zach O'Keefe" <zokeefe@google.com>
To: Alex Shi <alex.shi@linux.alibaba.com>,
	David Hildenbrand <david@redhat.com>,
	 David Rientjes <rientjes@google.com>,
	Matthew Wilcox <willy@infradead.org>,
	 Michal Hocko <mhocko@suse.com>, Peter Xu <peterx@redhat.com>,
	Song Liu <songliubraving@fb.com>,  Yang Shi <shy828301@gmail.com>,
	linux-mm@kvack.org, rongwei.wang@linux.alibaba.com
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	 Hugh Dickins <hughd@google.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	 Minchan Kim <minchan@kernel.org>, SeongJae Park <sj@kernel.org>,
	 Pasha Tatashin <pasha.tatashin@soleen.com>
Subject: [RFC] mm: MADV_COLLAPSE semantics
Date: Mon, 23 May 2022 17:18:32 -0700
Message-ID: <CAAa6QmQxay1_=Pmt8oCX2-Va18t44FV-Vs-WsQt_6+qBks4nZA@mail.gmail.com>

Hey All,

I'm sending this out before the v6 of "mm: userspace hugepage
collapse" for the purposes of aligning on and finalizing the semantics
of the proposed MADV_COLLAPSE madvise(2) mode.

Background:

So far, thanks to everyone's input, we've aligned on:
- MADV_COLLAPSE specifies its own hugepage allocation semantics (it
allows direct reclaim/compaction).
- MADV_COLLAPSE ignores khugepaged heuristics
(/sys/kernel/mm/transparent_hugepage/khugepaged/max_pte_* and
young/referenced page requirements).

In terms of THP _eligibility_, in v5 it was proposed that
MADV_COLLAPSE follow existing THP eligibility semantics
(/sys/kernel/mm/transparent_hugepage/enabled + the VMA flags of the
VMA being collapsed)[1].

However, Rongwei Wang kindly pointed out that the usability of
process_madvise(MADV_COLLAPSE) on a system in "madvise" THP mode was
limited[2]. I agreed to include process_madvise(2) support for
MADV_[NO]HUGEPAGE in v6, but following a discussion with David H., I
think that was a mistake. Namely, as David kindly pointed out, there
exist programs that don't work with THP and have good reason to
disable it. The example provided was postcopy live migration in QEMU,
which explicitly disables THP right before faulting in any pages.

Idea: MADV_COLLAPSE should respect VM_NOHUGEPAGE and "never" THP mode,
but otherwise would attempt to collapse.

Why? If someone(*), somewhere told us not to use THPs, then don't
override that decision. Otherwise, this is an explicit, safe(**)
request made on behalf of ourselves, or by a CAP_SYS_ADMIN process,
and shouldn't be blocked by interfaces meant to guide the
"transparent" part of THPs.
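
To make the difference from v5 concrete, here is a rough userspace
model of the two eligibility checks. It is only a sketch: the enum,
the SKETCH_* flag values and both function names are made up for
illustration and are not the kernel's definitions, and the v5 check is
a simplified reading of "sysfs enabled + VMA flags".

```c
#include <stdbool.h>

/* Illustrative stand-ins only; not the kernel's types or flag values. */
enum thp_mode { THP_NEVER, THP_MADVISE, THP_ALWAYS };

#define SKETCH_VM_HUGEPAGE   (1UL << 0)
#define SKETCH_VM_NOHUGEPAGE (1UL << 1)

/* v5-style (simplified): follow the existing THP eligibility rules. */
static bool collapse_allowed_v5(enum thp_mode mode, unsigned long vm_flags)
{
	if (vm_flags & SKETCH_VM_NOHUGEPAGE)
		return false;

	switch (mode) {
	case THP_ALWAYS:
		return true;
	case THP_MADVISE:
		/* Only VMAs explicitly marked with MADV_HUGEPAGE qualify. */
		return (vm_flags & SKETCH_VM_HUGEPAGE) != 0;
	default:
		return false;
	}
}

/* Proposed: only an explicit "no" (sysfs "never" or VM_NOHUGEPAGE) blocks. */
static bool collapse_allowed_proposed(enum thp_mode mode,
				      unsigned long vm_flags)
{
	if (mode == THP_NEVER)
		return false;
	if (vm_flags & SKETCH_VM_NOHUGEPAGE)
		return false;
	return true;
}
```

The only case where the two models disagree is "madvise" mode with a
VMA that has neither flag set: the v5-style check refuses, the
proposed check allows the collapse.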

Other options considered:

I considered variations of setting VM_HUGEPAGE only if calling on
behalf of self or if VM_NOHUGEPAGE is not set. However, I didn't like
this because there isn't a way to undo the operation: if we supported
process_madvise(MADV_NOHUGEPAGE), we would have to let the application
clear VM_NOHUGEPAGE itself, because outside processes can't/shouldn't.
That would require some *new* madvise mode like MADV_CLEARHUGEPAGE
(which would fail if called on behalf of another process with
VM_NOHUGEPAGE set) to clear VM_[NO]HUGEPAGE.

A possible downside to the proposed approach is that, if in "madvise"
THP mode and collapsing a VMA not marked VM_HUGEPAGE, it's now the
caller's responsibility to monitor and recollapse this memory back
into THPs. However, in practice this likely means an explicit
MADV_DONTNEED (please let me know if there are other important cases
here), and presumably it's the caller's job to do the monitoring anyway.

Thanks again for taking the time to read / provide input here. I think
this is the last point to clear up before releasing a v6 that should
hopefully have all the functionality we need.

Best,
Zach

---

(*) If we could verify that "never" THP mode was used _only_ for
debugging, then I'd actually opt to ignore "never" in MADV_COLLAPSE.
It's the last dependency MADV_COLLAPSE has on the sysfs THP interface and
would provide a convenient way to test/debug MADV_COLLAPSE with
khugepaged / at-fault disabled.
(**) I suppose there could exist applications that see THP "madvise"
mode, never call MADV_HUGEPAGE, and so assume THPs will never be
found.

[1] https://lore.kernel.org/linux-mm/20220504214437.2850685-1-zokeefe@google.com/
[2] https://lore.kernel.org/linux-mm/502a3ced-f3c6-7117-3b24-d80d204d66ee@linux.alibaba.com/

