* [RFC] mm: MADV_COLLAPSE semantics
@ 2022-05-24  0:18 Zach O'Keefe
  2022-05-24 13:26 ` Peter Xu
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Zach O'Keefe @ 2022-05-24  0:18 UTC (permalink / raw)
  To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Peter Xu, Song Liu, Yang Shi, linux-mm,
	rongwei.wang
  Cc: Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Kirill A. Shutemov, Minchan Kim, SeongJae Park, Pasha Tatashin

Hey All,

I'm sending this out before the v6 of "mm: userspace hugepage
collapse" for the purposes of aligning on and finalizing the semantics
of the proposed MADV_COLLAPSE madvise(2) mode.

Background:

So far, thanks to everyone's input, we've aligned on:
- MADV_COLLAPSE specifies its own hugepage allocation semantics (it
allows direct reclaim/compaction).
- MADV_COLLAPSE ignores khugepaged heuristics
(/sys/kernel/mm/transparent_hugepage/khugepaged/max_pte_* and
young/referenced page requirements).

In terms of THP _eligibility_, in v5 it was proposed that
MADV_COLLAPSE follow existing THP eligibility semantics
(/sys/kernel/mm/transparent_hugepage/enabled + the VMA flags of the
VMA being collapsed)[1].

However, Rongwei Wang kindly pointed out that the usability of
process_madvise(MADV_COLLAPSE) on a system in "madvise" THP mode was
limited. I agreed to include process_madvise(2) support for
MADV_[NO]HUGEPAGE in v6, but following a discussion with David H., I
think that was a mistake.  Namely, as David kindly pointed out, there
exist programs that don't work with THP and have good reason to
disable it. The example provided was postcopy live migration in QEMU,
which explicitly disables THP right before faulting in any pages.

Idea: MADV_COLLAPSE should respect VM_NOHUGEPAGE and "never" THP mode,
but otherwise would attempt to collapse.
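
To make the proposed rule concrete, here is a minimal userspace model of
the check. It is a sketch, not kernel code: the enum, the MODEL_VM_* flags
and both function names are hypothetical, and the "existing" rule is a
simplification that looks only at the sysfs "enabled" mode and the VMA
flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model types: illustrative names, not kernel definitions. */
enum thp_mode { THP_NEVER, THP_MADVISE, THP_ALWAYS };

#define MODEL_VM_HUGEPAGE   (1u << 0)
#define MODEL_VM_NOHUGEPAGE (1u << 1)

/* Simplified existing eligibility: sysfs "enabled" mode + VMA flags. */
bool thp_eligible_existing(enum thp_mode mode, unsigned int vma_flags)
{
	if (vma_flags & MODEL_VM_NOHUGEPAGE)
		return false;
	switch (mode) {
	case THP_ALWAYS:
		return true;
	case THP_MADVISE:
		return (vma_flags & MODEL_VM_HUGEPAGE) != 0;
	default:
		return false;
	}
}

/* Proposed rule: only an explicit "no" blocks MADV_COLLAPSE. */
bool madv_collapse_allowed(enum thp_mode mode, unsigned int vma_flags)
{
	/* The two hints are mutually exclusive on a VMA. */
	assert((vma_flags & (MODEL_VM_HUGEPAGE | MODEL_VM_NOHUGEPAGE)) !=
	       (MODEL_VM_HUGEPAGE | MODEL_VM_NOHUGEPAGE));

	if (mode == THP_NEVER)
		return false;
	return (vma_flags & MODEL_VM_NOHUGEPAGE) == 0;
}
```

Under this model the two rules differ in exactly one case: "madvise" mode
with a VMA that carries neither flag.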

Why? If someone(*), somewhere told us not to use THPs, then don't
override that decision. Otherwise, this is an explicit, safe(**)
request made on behalf of ourselves, or by a CAP_SYS_ADMIN process,
and shouldn't be blocked by interfaces meant to guide the
"transparent" part of THPs.

Other options considered:

I considered variations of setting VM_HUGEPAGE only if calling on
behalf of self or if VM_NOHUGEPAGE is not set. However, I didn't like
this because there isn't a way to undo the operation: If we supported
process_madvise(MADV_NOHUGEPAGE), we would have to let the application
clear VM_NOHUGEPAGE because outside processes can't/shouldn't. It
would have to require some *new* madvise mode like MADV_CLEARHUGEPAGE
(that would fail if calling on behalf of another process and
VM_NOHUGEPAGE set) to clear VM_[NO]HUGEPAGE.

A possible downside to the proposed approach is that, if in "madvise"
THP mode and collapsing a VMA not marked VM_HUGEPAGE, it's now the
caller's responsibility to monitor this memory and recollapse it into
THPs if it is later split. However, in practice a split likely means an
explicit MADV_DONTNEED (please let me know if there are other important
cases here), and presumably it's the caller's job to do the monitoring
anyway.
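
For illustration, a rough sketch of what the caller side could look like,
assuming MADV_COLLAPSE is exposed through madvise(2) as proposed. The
fallback value 25 for MADV_COLLAPSE, the 2 MiB hugepage size, and the
helper names try_collapse() and collapse_demo() are assumptions made for
this example only.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* assumption: for headers lacking the definition */
#endif

#define HPAGE_SIZE (2UL << 20)	/* assumption: 2 MiB PMD-sized THP */

/* Requests a synchronous collapse; returns 0 or the errno from madvise(2). */
int try_collapse(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_COLLAPSE) == 0)
		return 0;
	return errno;
}

/* Maps anonymous memory, faults it in, then asks for one hugepage. */
int collapse_demo(void)
{
	size_t len = 2 * HPAGE_SIZE;	/* room to find an aligned hugepage */
	char *map, *aligned;
	int ret;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return errno;

	/* Round up to the next hugepage boundary inside the mapping. */
	aligned = (char *)(((uintptr_t)map + HPAGE_SIZE - 1) &
			   ~(uintptr_t)(HPAGE_SIZE - 1));
	assert((uintptr_t)aligned % HPAGE_SIZE == 0);

	memset(aligned, 1, HPAGE_SIZE);	/* fault in the small pages */
	ret = try_collapse(aligned, HPAGE_SIZE);

	munmap(map, len);
	return ret;
}
```

A nonzero return is not necessarily a bug in the caller: the kernel may
not know the advice value, or the collapse may be refused or fail, so the
caller should treat the request as best-effort.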

Thanks again for taking the time to read / provide input here. I think
this is the last point to clear up before releasing a v6 that should
hopefully have all the functionality we need.

Best,
Zach

---

(*) If we could verify that "never" THP mode was used _only_ for
debugging, then I'd actually opt to ignore "never" in MADV_COLLAPSE.
It's the last dependency MADV_COLLAPSE has on the sysfs THP interface and
would provide a convenient way to test/debug MADV_COLLAPSE with
khugepaged / at-fault disabled.
(**) I suppose there could exist applications that see THP "madvise"
mode, never call MADV_HUGEPAGE, and so assume THPs will never be
found.

[1] https://lore.kernel.org/linux-mm/20220504214437.2850685-1-zokeefe@google.com/
[2] https://lore.kernel.org/linux-mm/502a3ced-f3c6-7117-3b24-d80d204d66ee@linux.alibaba.com/



* Re: [RFC] mm: MADV_COLLAPSE semantics
  2022-05-24  0:18 [RFC] mm: MADV_COLLAPSE semantics Zach O'Keefe
@ 2022-05-24 13:26 ` Peter Xu
  2022-05-24 17:08   ` Zach O'Keefe
  2022-05-24 20:02 ` Yang Shi
  2022-05-25  8:24 ` Michal Hocko
  2 siblings, 1 reply; 23+ messages in thread
From: Peter Xu @ 2022-05-24 13:26 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Song Liu, Yang Shi, linux-mm, rongwei.wang,
	Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Kirill A. Shutemov, Minchan Kim, SeongJae Park, Pasha Tatashin

On Mon, May 23, 2022 at 05:18:32PM -0700, Zach O'Keefe wrote:
> (*) If we could verify that "never" THP mode was used _only_ for
> debugging, then I'd actually opt to ignore "never" in MADV_COLLAPSE.

Some real time users may have used thp=never to make sure there's no
pgtable uncertainty in all cases (and pages will always be mlocked for the
RT apps, so pre-faulted).

Debatably it's the same as TRANSPARENT_HUGEPAGE=n, but the user might want
to use the same kernel for other purposes where thp could still be wanted?
I've no solid clue.  It's just that as long as we have the knob taking
"never" as an option then people may be using it, I'm afraid.

"no" is indeed stronger than "yes" in many cases, at least for thp it's
always like that: thp=never will guarantee no thp globally, while
thp=always will only provide thp when it's still possible.  The same goes
for MADV_[NO]HUGEPAGE, but just for vmas.  From that POV I think your current
plan looks reasonable on respecting "no"s more than "yes"s for both layers.

Thanks,

-- 
Peter Xu




* Re: [RFC] mm: MADV_COLLAPSE semantics
  2022-05-24 13:26 ` Peter Xu
@ 2022-05-24 17:08   ` Zach O'Keefe
  0 siblings, 0 replies; 23+ messages in thread
From: Zach O'Keefe @ 2022-05-24 17:08 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Song Liu, Yang Shi, linux-mm, rongwei.wang,
	Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Kirill A. Shutemov, Minchan Kim, SeongJae Park, Pasha Tatashin

On Tue, May 24, 2022 at 6:26 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, May 23, 2022 at 05:18:32PM -0700, Zach O'Keefe wrote:
> > (*) If we could verify that "never" THP mode was used _only_ for
> > debugging, then I'd actually opt to ignore "never" in MADV_COLLAPSE.
>
> Some real time users may have used thp=never to make sure there's no
> pgtable uncertainty in all cases (and pages will always be mlocked for the
> RT apps, so pre-faulted).
>

Thanks for the great example here!

> Debattably it's the same as TRANSPARENT_HUGEPAGE=n but the user might want
> to use the same kernel with other purpose where thp could still be wanted?
> I've no solid clue.  It's just that as long as we have the knob taking
> "never" as an option then people may be using it, I'm afraid.
>
> "no" is indeed stronger than "yes" in many cases, at least for thp it's
> always like that: thp=never will guarantee no thp globally, while
> thp=always will only provide thp when it's still possible.  The same to
> MADV_[NO]HUGEPAGE but just for vmas.  From that POV I think your current
> plan looks reasonable on respecting "no"s more than "yes"s for both layers.
>

This makes sense to me. Best to be safe / follow existing "strong no"
convention.

Again, thanks for taking the time to read and provide feedback - very
much appreciated.

Best,
Zach

> Thanks,
>
> --
> Peter Xu
>



* Re: [RFC] mm: MADV_COLLAPSE semantics
  2022-05-24  0:18 [RFC] mm: MADV_COLLAPSE semantics Zach O'Keefe
  2022-05-24 13:26 ` Peter Xu
@ 2022-05-24 20:02 ` Yang Shi
  2022-05-25  8:24 ` Michal Hocko
  2 siblings, 0 replies; 23+ messages in thread
From: Yang Shi @ 2022-05-24 20:02 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Michal Hocko, Peter Xu, Song Liu, Linux MM, Rongwei Wang,
	Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Kirill A. Shutemov, Minchan Kim, SeongJae Park, Pasha Tatashin

On Mon, May 23, 2022 at 5:19 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> Hey All,
>
> I'm sending this out before the v6 of "mm: userspace hugepage
> collapse" for the purposes of aligning on and finalizing the semantics
> of the proposed MADV_COLLAPSE madvise(2) mode.
>
> Background:
>
> So far, thanks to everyone's input, we've aligned on:
> - MADV_COLLAPSE specifies its own hugepage allocation semantics (it
> allows direct reclaim/compaction).
> - MADV_COLLAPSE ignores khugepaged heuristics
> (/sys/kernel/mm/transparent_hugepage/khugepaged/max_pte_* and
> young/referenced page requirements).
>
> In terms of THP _eligibility_, in v5 it was proposed that
> MADV_COLLAPSE follow existing THP eligibility semantics
> (/sys/kernel/mm/transparent_hugepage/enabled + the VMA flags of the
> VMA being collapsed)[1].
>
> However, Rongwei Wang kindly pointed out that the useability of
> process_madvise(MADV_COLLAPSE) on a system in "madvise" THP mode was
> limited. I agreed to include process_madvise(2) support for
> MADV_[NO]HUGEPAGE in v6, but following a discussion with David H., I
> think that was a mistake.  Namely, as David kindly pointed out, there
> exist programs that don't
> work with THP and have good reason to disable it. The example
> provided was postcopy life migration in QEMU, which explicitly
> disables THP right before faulting in any pages.
>
> Idea: MADV_COLLAPSE should respect VM_NOHUGEPAGE and "never" THP mode,
> but otherwise would attempt to collapse.

I do agree to respect VM_NOHUGEPAGE and "never".

Collapsing non-madvised VMAs in "madvise" mode sounds ok to me,
but I'm not so sure.

>
> Why? If someone(*), somewhere told us not to use THPs, then don't
> override that decision. Otherwise, this is an explicit, safe(**)
> request made on behalf of ourselves, or by a CAP_SYS_ADMIN process,
> and shouldn't be blocked by interfaces meant to guide the
> "transparent" part of THPs.
>
> Other options considered:
>
> I considered variations of setting VM_HUGEPAGE only if calling on
> behalf of self or if VM_NOHUGEPAGE is not set. However, I didn't like
> this because there isn't a way to undo the operation: If we supported
> process_madvise(MADV_NOHUGEPAGE), we would have to let the application
> unclear VM_NOHUGEPAGE because outside processes can't/shouldn't. It
> would have to require some *new* madvise mode like MADV_CLEARHUGEPAGE
> (that would fail if calling on behalf of another process and
> VM_NOHUGEPAGE set) to clear VM_[NO]HUGEPAGE.
>
> A possible downside to the proposed approach is that, if in "madvise"
> THP mode and collapsing a VMA not marked VM_HUGEPAGE, it's now the
> caller's responsibility to monitor and recollapse this memory back
> into THPs. However, in practice this likely means an explicit
> MADV_DONTNEED (please let me know if there are other important cases
> here), and presumably it's the caller's job to do the monitoring anyway.

Page reclaim could also cause a THP split, and it may happen at any
time. I'm not sure how the users or callers could monitor it.

>
> Thanks again for taking the time to read / provide input here. I think
> this is the last point to clear up before releasing a v6 that should
> hopefully have all the functionality we need.
>
> Best,
> Zach
>
> ---
>
> (*) If we could verify that "never" THP mode was used _only_ for
> debugging, then I'd actually opt to ignore "never" in MADV_COLLAPSE.
> It's the last dependency MADV_COLLAPSE has on sysfs THP interface and
> would provide a convenient way to test/debug MADV_COLLAPSE with
> khugepaged / at-fault disabled.
> (**) I suppose there could exist applications that see THP "madvise"
> mode, never call MADV_HUGEPAGE, and so assume THPs will never be
> found.
>
> [1] https://lore.kernel.org/linux-mm/20220504214437.2850685-1-zokeefe@google.com/
> [2] https://lore.kernel.org/linux-mm/502a3ced-f3c6-7117-3b24-d80d204d66ee@linux.alibaba.com/



* Re: [RFC] mm: MADV_COLLAPSE semantics
  2022-05-24  0:18 [RFC] mm: MADV_COLLAPSE semantics Zach O'Keefe
  2022-05-24 13:26 ` Peter Xu
  2022-05-24 20:02 ` Yang Shi
@ 2022-05-25  8:24 ` Michal Hocko
  2022-05-25 17:32   ` Yang Shi
  2022-05-26 18:30   ` Matthew Wilcox
  2 siblings, 2 replies; 23+ messages in thread
From: Michal Hocko @ 2022-05-25  8:24 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Peter Xu, Song Liu, Yang Shi, linux-mm, rongwei.wang,
	Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Kirill A. Shutemov, Minchan Kim, SeongJae Park, Pasha Tatashin

On Mon 23-05-22 17:18:32, Zach O'Keefe wrote:
[...]
> Idea: MADV_COLLAPSE should respect VM_NOHUGEPAGE and "never" THP mode,
> but otherwise would attempt to collapse.

I do agree that {process_}madvise should fail on VM_NOHUGEPAGE. The
process has explicitly noted that THP shouldn't be used on such a VMA
and seeing THP could be observed as not complying with that contract.

I am not so sure about the global "never" policy, though. The global
policy controls _kernel_ driven THPs. As the request to collapse memory
comes from the userspace I do not think it should be limited by the
kernel policy. I also think it can be beneficial to implement
userspace-based THP policies that exclude any kernel interference; that
could be achieved by setting the global kernel policy to "never" and
implementing the whole functionality via process_madvise.
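
As a sketch of what such a userspace agent might do, assuming
process_madvise(2) accepts MADV_COLLAPSE as proposed in this thread: the
helper name collapse_remote() is hypothetical, and the fallback syscall
numbers (434, 440) and advice value (25) are assumptions for headers that
lack the definitions. Per the RFC, a caller acting on another process is
expected to hold CAP_SYS_ADMIN.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434	/* assumption: generic syscall number */
#endif
#ifndef SYS_process_madvise
#define SYS_process_madvise 440	/* assumption: generic syscall number */
#endif
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* assumption: for headers lacking the definition */
#endif

/*
 * Asks the kernel to collapse [addr, addr + len) in the target process.
 * Returns the number of bytes advised, or a negative errno value.
 */
long collapse_remote(pid_t pid, void *addr, size_t len)
{
	struct iovec iov = { .iov_base = addr, .iov_len = len };
	long ret;
	int pidfd;

	assert(len > 0);

	pidfd = (int)syscall(SYS_pidfd_open, pid, 0U);
	if (pidfd < 0)
		return -errno;

	ret = syscall(SYS_process_madvise, pidfd, &iov, 1UL,
		      MADV_COLLAPSE, 0U);
	if (ret < 0)
		ret = -errno;	/* saved before close() can change errno */

	close(pidfd);
	return ret;
}
```

Whether this call succeeds while the global policy is "never" is exactly
the question under discussion here.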
-- 
Michal Hocko
SUSE Labs



* Re: [RFC] mm: MADV_COLLAPSE semantics
  2022-05-25  8:24 ` Michal Hocko
@ 2022-05-25 17:32   ` Yang Shi
  2022-05-25 18:09     ` Zach O'Keefe
  2022-05-26  7:12     ` Michal Hocko
  2022-05-26 18:30   ` Matthew Wilcox
  1 sibling, 2 replies; 23+ messages in thread
From: Yang Shi @ 2022-05-25 17:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Zach O'Keefe, Alex Shi, David Hildenbrand, David Rientjes,
	Matthew Wilcox, Peter Xu, Song Liu, Linux MM, Rongwei Wang,
	Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Kirill A. Shutemov, Minchan Kim, SeongJae Park, Pasha Tatashin

On Wed, May 25, 2022 at 1:24 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 23-05-22 17:18:32, Zach O'Keefe wrote:
> [...]
> > Idea: MADV_COLLAPSE should respect VM_NOHUGEPAGE and "never" THP mode,
> > but otherwise would attempt to collapse.
>
> I do agree that {process_}madvise should fail on VM_NOHUGEPAGE. The
> process has explicitly noted that THP shouldn't be used on such a VMA
> and seeing THP could be observed as not complying with that contract.
>
> I am not so sure about the global "never" policy, though. The global
> policy controls _kernel_ driven THPs. As the request to collapse memory
> comes from the userspace I do not think it should be limited by the
> kernel policy. I also think it can be beneficial to implement userspace
> based THP policies and exclude any kernel interference and that could be
> achieved by global kernel "never" policy and implement the whole
> functionality by process_madvise.

I'd prefer to respect "never" for now since it is typically used to
disable THP globally even though the mappings are madvised
(MADV_HUGEPAGE). IMHO I treat MADV_COLLAPSE as a weaker MADV_HUGEPAGE
(takes effect for non-madvised mappings but does not flip VM_NOHUGEPAGE)
+ best-effort synchronous THP collapse.

We could lift the restriction in the future if it turns out that not
respecting "never" is more useful.

> --
> Michal Hocko
> SUSE Labs



* Re: [RFC] mm: MADV_COLLAPSE semantics
  2022-05-25 17:32   ` Yang Shi
@ 2022-05-25 18:09     ` Zach O'Keefe
  2022-05-26  7:12     ` Michal Hocko
  1 sibling, 0 replies; 23+ messages in thread
From: Zach O'Keefe @ 2022-05-25 18:09 UTC (permalink / raw)
  To: Yang Shi, Michal Hocko
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
	Peter Xu, Song Liu, Linux MM, Rongwei Wang, Andrea Arcangeli,
	Axel Rasmussen, Hugh Dickins, Kirill A. Shutemov, Minchan Kim,
	SeongJae Park, Pasha Tatashin

Hey Michal and Yang,

Thanks for the feedback!

On Tue, May 24, 2022 at 1:02 PM Yang Shi <shy828301@gmail.com> wrote:
> [...]
> Page reclaim could also cause the THP split. And it may happen at any
> time. I'm not sure how the users or callers could monitor it.

I don't have a good idea of what monitoring would look like, but this
is a great example that shows splitting can happen from underneath us
and we'll have to design accordingly.

Luckily in this example, the page is likely cold and therefore of less
interest to be backed by THPs.
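
One possible way for a caller to notice splits, offered only as an
assumption since nothing in this thread settles on a monitoring method, is
to poll the AnonHugePages fields of /proc/<pid>/smaps and compare the
total over time. The sketch below only does the parsing; the function name
sum_anon_huge_kb() is hypothetical, and reading the file into a buffer is
left to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sums every "AnonHugePages: <n> kB" field found in smaps-formatted text. */
long sum_anon_huge_kb(const char *smaps_text)
{
	static const char key[] = "AnonHugePages:";
	const char *p = smaps_text;
	long total = 0;

	assert(smaps_text != NULL);

	while ((p = strstr(p, key)) != NULL) {
		long kb;

		p += sizeof(key) - 1;	/* skip past the field name */
		if (sscanf(p, "%ld kB", &kb) == 1)
			total += kb;
	}
	return total;
}
```

A drop in the total between two polls would suggest that some range was
split (or unmapped) and might be a candidate for another collapse request.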

On Wed, May 25, 2022 at 10:33 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, May 25, 2022 at 1:24 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 23-05-22 17:18:32, Zach O'Keefe wrote:
> > [...]
> > > Idea: MADV_COLLAPSE should respect VM_NOHUGEPAGE and "never" THP mode,
> > > but otherwise would attempt to collapse.
> >
> > I do agree that {process_}madvise should fail on VM_NOHUGEPAGE. The
> > process has explicitly noted that THP shouldn't be used on such a VMA
> > and seeing THP could be observed as not complying with that contract.
> >
> > I am not so sure about the global "never" policy, though. The global
> > policy controls _kernel_ driven THPs. As the request to collapse memory
> > comes from the userspace I do not think it should be limited by the
> > kernel policy.

Ya, I agree this would be ideal / is the cleanest. However, Peter
mentioned a non-debug example where users wouldn't be expecting THPs
after setting "never". Though, as Peter points out, I'm not sure how
many users do this with CONFIG_TRANSPARENT_HUGEPAGE=y.

> > I also think it can be beneficial to implement userspace
> > based THP policies and exclude any kernel interference and that could be
> > achieved by global kernel "never" policy and implement the whole
> > functionality by process_madvise.

I don't have a clear picture yet, but even if we move THP collapse
policy to userspace, I imagine we'll still want an informed
application/allocator to be able to MADV_HUGEPAGE known hot memory
and fault in THPs rather than MADV_COLLAPSE after the fact. IOW, I
don't know if we'll ever want "never". When I get started on this
work, I plan on some prctl(2) interface to disable khugepaged
on processes where the userspace agent has taken responsibility for
THP utilization.

> I'd prefer to respect "never" for now since it is typically used to
> disable THP globally even though the mappings are madvised
> (MADV_HUGEPAGE). IMHO I treat MADV_COLLAPSE as weaker MADV_HUGEPAGE
> (take effect for non-madvised mappings but not flip VM_NOHUGEPAGE) +
> best-effort synchronous THP collapse.

I'm likewise in favor of respecting it until proven otherwise - even
though I agree with Michal that it would be nice to not depend on the
kernel policy / sysfs settings here.

> We could lift the restriction in the future if it turns out non
> respecting "never" is more useful.



* Re: [RFC] mm: MADV_COLLAPSE semantics
  2022-05-25 17:32   ` Yang Shi
  2022-05-25 18:09     ` Zach O'Keefe
@ 2022-05-26  7:12     ` Michal Hocko
  2022-05-26 17:39       ` Yang Shi
  1 sibling, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2022-05-26  7:12 UTC (permalink / raw)
  To: Yang Shi
  Cc: Zach O'Keefe, Alex Shi, David Hildenbrand, David Rientjes,
	Matthew Wilcox, Peter Xu, Song Liu, Linux MM, Rongwei Wang,
	Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Kirill A. Shutemov, Minchan Kim, SeongJae Park, Pasha Tatashin

On Wed 25-05-22 10:32:44, Yang Shi wrote:
> On Wed, May 25, 2022 at 1:24 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 23-05-22 17:18:32, Zach O'Keefe wrote:
> > [...]
> > > Idea: MADV_COLLAPSE should respect VM_NOHUGEPAGE and "never" THP mode,
> > > but otherwise would attempt to collapse.
> >
> > I do agree that {process_}madvise should fail on VM_NOHUGEPAGE. The
> > process has explicitly noted that THP shouldn't be used on such a VMA
> > and seeing THP could be observed as not complying with that contract.
> >
> > I am not so sure about the global "never" policy, though. The global
> > policy controls _kernel_ driven THPs. As the request to collapse memory
> > comes from the userspace I do not think it should be limited by the
> > kernel policy. I also think it can be beneficial to implement userspace
> > based THP policies and exclude any kernel interference and that could be
> > achieved by global kernel "never" policy and implement the whole
> > functionality by process_madvise.
> 
> I'd prefer to respect "never" for now since it is typically used to
> disable THP globally even though the mappings are madvised
> (MADV_HUGEPAGE). IMHO I treat MADV_COLLAPSE as weaker MADV_HUGEPAGE
> (take effect for non-madvised mappings but not flip VM_NOHUGEPAGE) +
> best-effort synchronous THP collapse.

MADV_HUGEPAGE is a way to tell the kernel what to do, and how, at some
future time.  MADV_COLLAPSE is a way to tell it what the userspace wants
at the moment of the call. So I do not really think they are directly
related in any way except they somehow control THP.

The primary question here is whether we want to support usecases which
want to completely rule out THP handling by the kernel and only rely on
the userspace. If yes, I do not see any other way than using the "never"
global policy and relying on MADV_COLLAPSE from the userspace. Or am I missing
something?

> We could lift the restriction in the future if it turns out non
> respecting "never" is more useful.

I do not think we can change the behavior in the future without risking
regressions.
-- 
Michal Hocko
SUSE Labs



* Re: [RFC] mm: MADV_COLLAPSE semantics
  2022-05-26  7:12     ` Michal Hocko
@ 2022-05-26 17:39       ` Yang Shi
  2022-05-27  9:46         ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Yang Shi @ 2022-05-26 17:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Zach O'Keefe, Alex Shi, David Hildenbrand, David Rientjes,
	Matthew Wilcox, Peter Xu, Song Liu, Linux MM, Rongwei Wang,
	Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Kirill A. Shutemov, Minchan Kim, SeongJae Park, Pasha Tatashin

On Thu, May 26, 2022 at 12:12 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 25-05-22 10:32:44, Yang Shi wrote:
> > On Wed, May 25, 2022 at 1:24 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 23-05-22 17:18:32, Zach O'Keefe wrote:
> > > [...]
> > > > Idea: MADV_COLLAPSE should respect VM_NOHUGEPAGE and "never" THP mode,
> > > > but otherwise would attempt to collapse.
> > >
> > > I do agree that {process_}madvise should fail on VM_NOHUGEPAGE. The
> > > process has explicitly noted that THP shouldn't be used on such a VMA
> > > and seeing THP could be observed as not complying with that contract.
> > >
> > > I am not so sure about the global "never" policy, though. The global
> > > policy controls _kernel_ driven THPs. As the request to collapse memory
> > > comes from the userspace I do not think it should be limited by the
> > > kernel policy. I also think it can be beneficial to implement userspace
> > > based THP policies and exclude any kernel interference and that could be
> > > achieved by global kernel "never" policy and implement the whole
> > > functionality by process_madvise.
> >
> > I'd prefer to respect "never" for now since it is typically used to
> > disable THP globally even though the mappings are madvised
> > (MADV_HUGEPAGE). IMHO I treat MADV_COLLAPSE as weaker MADV_HUGEPAGE
> > (take effect for non-madvised mappings but not flip VM_NOHUGEPAGE) +
> > best-effort synchronous THP collapse.
>
> MADV_HUGEPAGE is a way to tell the kernel what and how to do in future
> time by the kernel.  MADV_COLLAPSE is a way tell what the userspace want
> at the moment of the call. So I do not really think they are directly
> related in any way except they somehow control THP.
>
> The primary question here is whether we want to support usecases which
> want to completely rule out THP handling by the kernel and only rely on
> the userspace. If yes, I do not see other way than using never global
> policy and rely on MADV_COLLAPSE from the userspace. Or am I missing
> something?

I'm not sure whether we want to reach that eventually. But isn't
"madvise" good enough? IMHO "madvise" also means delegating the decision
to the users. The users decide whether huge pages are preferred or
not. The users could implement policies:

No - MADV_NOHUGEPAGE
Yes - MADV_HUGEPAGE

But the THP allocation is deferred to real access (page fault) or
khugepaged. So I treated MADV_COLLAPSE as a weaker MADV_HUGEPAGE +
synchronous THP allocation.
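
The "No"/"Yes" policy above maps onto two existing madvise(2) advice
values. A minimal sketch follows; the helper names set_thp_hint() and
thp_hint_demo() and the 4 MiB mapping size are assumptions made for this
example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* Applies the "Yes" or "No" hint; returns 0 or the errno from madvise(2). */
int set_thp_hint(void *addr, size_t len, bool want_thp)
{
	assert(addr != NULL && len > 0);

	if (madvise(addr, len, want_thp ? MADV_HUGEPAGE : MADV_NOHUGEPAGE) == 0)
		return 0;
	return errno;
}

/* Marks a fresh anonymous mapping "Yes", then flips it to "No". */
int thp_hint_demo(void)
{
	size_t len = 4UL << 20;	/* 4 MiB, arbitrary size for the demo */
	int yes, no;
	void *map;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return errno;

	yes = set_thp_hint(map, len, true);
	no = set_thp_hint(map, len, false);

	munmap(map, len);
	return yes ? yes : no;
}
```

Both hints only record a preference on the mapping; as noted above, the
actual THP allocation still happens later, at page fault time or in
khugepaged.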

>
> > We could lift the restriction in the future if it turns out non
> > respecting "never" is more useful.
>
> I do not think we can change the behavior in the future without risking
> regressions.

Yeah, we may get THP out of the blue. So I thought "madvise" should be good enough.

> --
> Michal Hocko
> SUSE Labs



* Re: [RFC] mm: MADV_COLLAPSE semantics
  2022-05-25  8:24 ` Michal Hocko
  2022-05-25 17:32   ` Yang Shi
@ 2022-05-26 18:30   ` Matthew Wilcox
  2022-05-27  8:56     ` Michal Hocko
  2022-05-27 18:09     ` Yang Shi
  1 sibling, 2 replies; 23+ messages in thread
From: Matthew Wilcox @ 2022-05-26 18:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Zach O'Keefe, Alex Shi, David Hildenbrand, David Rientjes,
	Peter Xu, Song Liu, Yang Shi, linux-mm, rongwei.wang,
	Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Kirill A. Shutemov, Minchan Kim, SeongJae Park, Pasha Tatashin

On Wed, May 25, 2022 at 10:24:55AM +0200, Michal Hocko wrote:
> I am not so sure about the global "never" policy, though. The global
> policy controls _kernel_ driven THPs. As the request to collapse memory
> comes from the userspace I do not think it should be limited by the
> kernel policy. I also think it can be beneficial to implement userspace
> based THP policies and exclude any kernel interference and that could be
> achieved by global kernel "never" policy and implement the whole
> functionality by process_madvise.

I'd prefer to see "never" mean "Don't run khugepaged" rather than "Do
not create THPs".  If the app explicitly asks for a THP, I think it
should get one, regardless of the sysadmin's will.

Death to tunables.  Can we just delete
/sys/kernel/mm/transparent_hugepage/shmem_enabled entirely?



* Re: [RFC] mm: MADV_COLLAPSE semantics
  2022-05-26 18:30   ` Matthew Wilcox
@ 2022-05-27  8:56     ` Michal Hocko
  2022-05-27 18:09     ` Yang Shi
  1 sibling, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2022-05-27  8:56 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Zach O'Keefe, Alex Shi, David Hildenbrand, David Rientjes,
	Peter Xu, Song Liu, Yang Shi, linux-mm, rongwei.wang,
	Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Kirill A. Shutemov, Minchan Kim, SeongJae Park, Pasha Tatashin

On Thu 26-05-22 19:30:20, Matthew Wilcox wrote:
> On Wed, May 25, 2022 at 10:24:55AM +0200, Michal Hocko wrote:
> > I am not so sure about the global "never" policy, though. The global
> > policy controls _kernel_ driven THPs. As the request to collapse memory
> > comes from the userspace I do not think it should be limited by the
> > kernel policy. I also think it can be beneficial to implement userspace
> > based THP policies and exclude any kernel interference and that could be
> > achieved by global kernel "never" policy and implement the whole
> > functionality by process_madvise.
> 
> I'd prefer to see "never" mean "Don't run khugepaged" rather than "Do
> not create THPs".  If the app explicitly asks for a THP, I think it
> should get one, regardless of the sysadmin's will.
> 
> Death to tunables.  Can we just delete
> /sys/kernel/mm/transparent_hugepage/shmem_enabled entirely?

I do agree that our existing tunables are really complex. One more
reason to not bind the new sync and userspace driven collapsing
functionality to it by any means. Let's really not spread the headache
to the userspace as well.
-- 
Michal Hocko
SUSE Labs



* Re: [RFC] mm: MADV_COLLAPSE semantics
  2022-05-26 17:39       ` Yang Shi
@ 2022-05-27  9:46         ` Michal Hocko
  2022-05-31 23:47           ` Yang Shi
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2022-05-27  9:46 UTC (permalink / raw)
  To: Yang Shi
  Cc: Zach O'Keefe, Alex Shi, David Hildenbrand, David Rientjes,
	Matthew Wilcox, Peter Xu, Song Liu, Linux MM, Rongwei Wang,
	Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Kirill A. Shutemov, Minchan Kim, SeongJae Park, Pasha Tatashin

On Thu 26-05-22 10:39:42, Yang Shi wrote:
> On Thu, May 26, 2022 at 12:12 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 25-05-22 10:32:44, Yang Shi wrote:
> > > On Wed, May 25, 2022 at 1:24 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Mon 23-05-22 17:18:32, Zach O'Keefe wrote:
> > > > [...]
> > > > > Idea: MADV_COLLAPSE should respect VM_NOHUGEPAGE and "never" THP mode,
> > > > > but otherwise would attempt to collapse.
> > > >
> > > > I do agree that {process_}madvise should fail on VM_NOHUGEPAGE. The
> > > > process has explicitly noted that THP shouldn't be used on such a VMA
> > > > and seeing THP could be observed as not complying with that contract.
> > > >
> > > > I am not so sure about the global "never" policy, though. The global
> > > > policy controls _kernel_ driven THPs. As the request to collapse memory
> > > > comes from the userspace I do not think it should be limited by the
> > > > kernel policy. I also think it can be beneficial to implement userspace
> > > > based THP policies and exclude any kernel interference and that could be
> > > > achieved by global kernel "never" policy and implement the whole
> > > > functionality by process_madvise.
> > >
> > > I'd prefer to respect "never" for now since it is typically used to
> > > disable THP globally even though the mappings are madvised
> > > (MADV_HUGEPAGE). IMHO I treat MADV_COLLAPSE as weaker MADV_HUGEPAGE
> > > (take effect for non-madvised mappings but not flip VM_NOHUGEPAGE) +
> > > best-effort synchronous THP collapse.
> >
> > MADV_HUGEPAGE is a way to tell the kernel what and how to do in future
> > time by the kernel.  MADV_COLLAPSE is a way tell what the userspace want
> > at the moment of the call. So I do not really think they are directly
> > related in any way except they somehow control THP.
> >
> > The primary question here is whether we want to support usecases which
> > want to completely rule out THP handling by the kernel and only rely on
> > the userspace. If yes, I do not see other way than using never global
> > policy and rely on MADV_COLLAPSE from the userspace. Or am I missing
> > something?
> 
> I'm not sure whether we want to reach that eventually.

My experience tells me that sooner or later somebody comes with a
usecase for that. That we are not sure is just a sign somebody will
have that idea. So either we have very good reasons to not allow that
possibility now, and ideally we also document them, or we should simply
assume it will happen.

> But isn't
> "madvise" good enough? "madvise" also means to give the delegation to
> the users IMHO. The users decide whether huge page is preferred or
> not. The users could implement policies:
> 
> No - MADV_NOHUGEPAGE
> Yes - MADV_HUGEPAGE
> 
> But the THP allocation is deferred to real access (page fault) or
> khugepaged. So I treated MADV_COLLAPSE as weaker MAD_HUGEPAGE +
> synchronous THP allocation.

I really do not see any good reason to tightly couple kernel and user
policies. Hints like MADV_{NO}HUGEPAGE are one thing and both kernel
and userspace might decide to interpret them. But binding MADV_COLLAPSE
to in-kernel THP tunables just seems like pushing ourselves into a
corner.
-- 
Michal Hocko
SUSE Labs



* Re: [RFC] mm: MADV_COLLAPSE semantics
  2022-05-26 18:30   ` Matthew Wilcox
  2022-05-27  8:56     ` Michal Hocko
@ 2022-05-27 18:09     ` Yang Shi
  2022-05-31 21:36       ` Zach O'Keefe
  1 sibling, 1 reply; 23+ messages in thread
From: Yang Shi @ 2022-05-27 18:09 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michal Hocko, Zach O'Keefe, Alex Shi, David Hildenbrand,
	David Rientjes, Peter Xu, Song Liu, Linux MM, Rongwei Wang,
	Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Kirill A. Shutemov, Minchan Kim, SeongJae Park, Pasha Tatashin

On Thu, May 26, 2022 at 11:30 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, May 25, 2022 at 10:24:55AM +0200, Michal Hocko wrote:
> > I am not so sure about the global "never" policy, though. The global
> > policy controls _kernel_ driven THPs. As the request to collapse memory
> > comes from the userspace I do not think it should be limited by the
> > kernel policy. I also think it can be beneficial to implement userspace
> > based THP policies and exclude any kernel interference and that could be
> > achieved by global kernel "never" policy and implement the whole
> > functionality by process_madvise.
>
> I'd prefer to see "never" mean "Don't run khugepaged" rather than "Do
> not create THPs".  If the app explicitly asks for a THP, I think it
> should get one, regardless of the sysadmin's will.

If we want to decouple THP allocation and khugepaged, maybe a
dedicated switch for khugepaged? Just like /sys/kernel/mm/ksm/run? Or
maybe I should not have proposed a new knob :-)

>
> Death to tunables.  Can we just delete
> /sys/kernel/mm/transparent_hugepage/shmem_enabled entirely?

It is used to control non-mount shm objects, for example memfd and SysV
shm. tmpfs has mount options that control huge page eligibility.

Consolidate to /sys/kernel/mm/transparent_hugepage/enabled? Maybe, but
shmem_enabled has a couple of special modes:
- within_size: only allocate huge pages if the page will be fully within i_size
- force: enable THP for all mount tmpfs and non-mount shm
- deny: do the opposite of force

force and deny are basically used for debugging purposes.

BTW, currently file THP (readonly fs) is actually controlled by
/sys/kernel/mm/transparent_hugepage/enabled since it can only be
created by khugepaged for now.
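
To make the knobs above concrete: each of these sysfs files lists all of
its modes and marks the active one with brackets, e.g. "always [madvise]
never". Below is a minimal sketch of reading the active mode from
userspace. The helper names (active_mode, read_mode) are made up for
illustration, and read_mode simply returns None when a knob is missing or
unreadable.

```python
import re
from pathlib import Path

# Directory holding the THP knobs discussed in this thread.
THP_DIR = Path("/sys/kernel/mm/transparent_hugepage")


def active_mode(knob_text):
    """Return the bracketed (currently selected) token of a knob string."""
    match = re.search(r"\[([^\]]+)\]", knob_text)
    if match is None:
        raise ValueError(f"no active mode found in {knob_text!r}")
    return match.group(1)


def read_mode(knob):
    """Read one knob under THP_DIR; None if it is absent or unreadable."""
    try:
        return active_mode((THP_DIR / knob).read_text())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    for knob in ("enabled", "shmem_enabled"):
        print(knob, "=", read_mode(knob))
```

Keeping the parsing separate from the file access means the same helper
works for "enabled" and "shmem_enabled" even though the latter has more
modes.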



* Re: [RFC] mm: MADV_COLLAPSE semantics
  2022-05-27 18:09     ` Yang Shi
@ 2022-05-31 21:36       ` Zach O'Keefe
  2022-05-31 23:52         ` Yang Shi
  2022-06-01  9:57         ` Michal Hocko
  0 siblings, 2 replies; 23+ messages in thread
From: Zach O'Keefe @ 2022-05-31 21:36 UTC (permalink / raw)
  To: Yang Shi, Matthew Wilcox, Michal Hocko, Peter Xu
  Cc: Alex Shi, David Hildenbrand, David Rientjes, Song Liu, Linux MM,
	Rongwei Wang, Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Kirill A. Shutemov, Minchan Kim, SeongJae Park, Pasha Tatashin

Thanks everyone for your time and for the great discussion!

For the purposes of arriving at a decision, I've tried to outline the
major points + my 2c below as:

1. Breaking userland. AFAIK, if permitting MADV_COLLAPSE in "never"
will break real, existing use cases, then Linux's policy would
necessitate that we don't do that. Is there a way we can reasonably
determine this? An affirmative answer here makes this decision easy.

2. Current uses of "never" a.k.a dev/debug. If (1) is false, then
we've asserted that *currently* "never" is only used for
development/debugging. During development of MADV_COLLAPSE, I found it
necessary to disable khugepaged via a new debugfs tunable to prevent
khugepaged collapsing memory before MADV_COLLAPSE could act. If
MADV_COLLAPSE wasn't tied to "never", it's one less debugfs tunable
we'd need. OTOH, I can still see the benefit, during debugging, of a
master "no THPs" switch. If we think we'll ever want that master
switch, then let's just keep "never" as said switch.

3. Future uses of "never". Do we want to permit a policy where
userspace *entirely* takes over THP allocation, and khugepaged and
at-fault are disabled in the kernel? If yes, then we might as well
permit "never" to allow that now. Personally, though, I can't imagine
wanting to disable faulting-in THPs in places where we know data will
be hot; but respecting "never" does back us into a corner if we ever
go that route.

4. Flexibility / separation of concerns:  All else being equal,
decoupling user MADV_COLLAPSE from kernel THP sysfs controls is more
flexible and consistent with the rest of MADV_COLLAPSE semantics.

If that's roughly accurate, and absent any other critical points,
if we can determine (1), then I'd prefer "never" to be tied to kernel
decisions, not userspace. Any strong objections?

Thanks again for your time,
Zach



* Re: [RFC] mm: MADV_COLLAPSE semantics
  2022-05-27  9:46         ` Michal Hocko
@ 2022-05-31 23:47           ` Yang Shi
  2022-06-01  9:50             ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Yang Shi @ 2022-05-31 23:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Zach O'Keefe, Alex Shi, David Hildenbrand, David Rientjes,
	Matthew Wilcox, Peter Xu, Song Liu, Linux MM, Rongwei Wang,
	Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Kirill A. Shutemov, Minchan Kim, SeongJae Park, Pasha Tatashin

On Fri, May 27, 2022 at 2:46 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 26-05-22 10:39:42, Yang Shi wrote:
> > On Thu, May 26, 2022 at 12:12 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Wed 25-05-22 10:32:44, Yang Shi wrote:
> > > > On Wed, May 25, 2022 at 1:24 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Mon 23-05-22 17:18:32, Zach O'Keefe wrote:
> > > > > [...]
> > > > > > Idea: MADV_COLLAPSE should respect VM_NOHUGEPAGE and "never" THP mode,
> > > > > > but otherwise would attempt to collapse.
> > > > >
> > > > > I do agree that {process_}madvise should fail on VM_NOHUGEPAGE. The
> > > > > process has explicitly noted that THP shouldn't be used on such a VMA
> > > > > and seeing THP could be observed as not complying with that contract.
> > > > >
> > > > > I am not so sure about the global "never" policy, though. The global
> > > > > policy controls _kernel_ driven THPs. As the request to collapse memory
> > > > > comes from the userspace I do not think it should be limited by the
> > > > > kernel policy. I also think it can be beneficial to implement userspace
> > > > > based THP policies and exclude any kernel interference and that could be
> > > > > achieved by global kernel "never" policy and implement the whole
> > > > > functionality by process_madvise.
> > > >
> > > > I'd prefer to respect "never" for now since it is typically used to
> > > > disable THP globally even though the mappings are madvised
> > > > (MADV_HUGEPAGE). IMHO I treat MADV_COLLAPSE as weaker MADV_HUGEPAGE
> > > > (take effect for non-madvised mappings but not flip VM_NOHUGEPAGE) +
> > > > best-effort synchronous THP collapse.
> > >
> > > MADV_HUGEPAGE is a way to tell the kernel what to do, and how, at some
> > > future time.  MADV_COLLAPSE is a way to tell what the userspace wants
> > > at the moment of the call. So I do not really think they are directly
> > > related in any way except they somehow control THP.
> > >
> > > The primary question here is whether we want to support usecases which
> > > want to completely rule out THP handling by the kernel and only rely on
> > > the userspace. If yes, I do not see other way than using never global
> > > policy and rely on MADV_COLLAPSE from the userspace. Or am I missing
> > > something?
> >
> > I'm not sure whether we want to reach that eventually.
>
> My experience tells me that sooner or later somebody comes with a
> usecase for that. That we are not sure is just a sign somebody will
> have that idea. So either we have very good reasons to not allow that
> possibility now and ideally we also document that or we should simply
> assume it will happen.

Yeah, it is definitely possible and nothing prevents that from happening.

>
> > But isn't
> > "madvise" good enough? "madvise" also means to give the delegation to
> > the users IMHO. The users decide whether huge page is preferred or
> > not. The users could implement policies:
> >
> > No - MADV_NOHUGEPAGE
> > Yes - MADV_HUGEPAGE
> >
> > But the THP allocation is deferred to real access (page fault) or
> > khugepaged. So I treated MADV_COLLAPSE as weaker MADV_HUGEPAGE +
> > synchronous THP allocation.
>
> I really do not see any good reason to tightly couple kernel and user
> policies. Hints like MADV_{NO}HUGEPAGE are one thing and both kernel
> and userspace might decide to interpret them. But binding MADV_COLLAPSE
> to in kernel THP tunables just seems like pushing ourselves into the
> corner.

I don't mean we should tightly couple kernel and user policies. I
think it is about how "never" is treated. AFAICT, typically sys admins
tend to expect "never" as a global switch and they don't expect any
THP allocation should happen in "never" mode even though it is
requested by the users. Maybe they should not expect so in the first
place.

> --
> Michal Hocko
> SUSE Labs



* Re: [RFC] mm: MADV_COLLAPSE semantics
  2022-05-31 21:36       ` Zach O'Keefe
@ 2022-05-31 23:52         ` Yang Shi
  2022-06-01  9:57         ` Michal Hocko
  1 sibling, 0 replies; 23+ messages in thread
From: Yang Shi @ 2022-05-31 23:52 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Matthew Wilcox, Michal Hocko, Peter Xu, Alex Shi,
	David Hildenbrand, David Rientjes, Song Liu, Linux MM,
	Rongwei Wang, Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Kirill A. Shutemov, Minchan Kim, SeongJae Park, Pasha Tatashin

On Tue, May 31, 2022 at 2:37 PM Zach O'Keefe <zokeefe@google.com> wrote:
>
> Thanks everyone for your time and for the great discussion!
>
> For the purposes of arriving at a decision, I've tried to outline the
> major points + my 2c below as:

Thanks for summing up the discussion.

>
> 1. Breaking userland. AFAIK, if permitting MADV_COLLAPSE in "never"
> will break real, existing use cases, then linux's policy would
> necessitate that we don't do that. Is there a way we can reasonably
> determine this? An affirmative answer here makes this decision easy.

I don't have an affirmative answer. It depends on the users'
expectations. Some users may expect there won't be any THP allocation
in "never" mode even though it is requested by the users. AFAICT some
sys admins may expect so since they may manage machines which may run
untrusted software. So allowing MADV_COLLAPSE in "never" doesn't break
any workload, but may break some expectations.

>
> 2. Current uses of "never" a.k.a dev/debug. If (1) is false, then
> we've asserted that *currently* "never" is only used for
> development/debugging. During development of MADV_COLLAPSE, I found it
> necessary to disable khugepaged via a new debugfs tunable to prevent
> khugepaged collapsing memory before MADV_COLLAPSE could act. If
> MADV_COLLAPSE wasn't tied to "never", it's one less debugfs tunable
> we'd need. OTOH, I can still see the benefit, during debugging, of a
> master "no THPs" switch. If we think we'll ever want that master
> switch, then let's just keep "never" as said switch.
>
> 3. Future uses of "never". Do we want to permit a policy where
> userspace *entirely* takes over THP allocation, and khugepaged and
> at-fault are disabled in the kernel? If yes, then we might as well
> permit "never" to allow that now. Personally, though, I can't imagine
> wanting to disable faulting-in THPs in places where we know data will
> be hot; but respecting "never" does back us into a corner if we ever
> go that route.
>
> 4. Flexibility / separation of concerns:  All else being equal,
> decoupling user MADV_COLLAPSE from kernel THP sysfs controls is more
> flexible and consistent with the rest of MADV_COLLAPSE semantics.
>
> If that's roughly accurate, and in lieu of any other critical points,
> if we can determine (1),  then I'd prefer "never" to be tied to kernel
> decisions, not userspace. Any strong objections?

I do not have strong objections, and I think Michal's point and yours
do make some sense for some usecases. A simple way is to allow
MADV_COLLAPSE in "never" mode, then see whether there will be any
complaints.
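
For reference, a rough sketch of how a userspace caller could probe
whether the running kernel accepts the request at all. Assumptions, none
of which were settled in this thread: the numeric value 25 for
MADV_COLLAPSE (taken from the Linux uapi headers where the advice later
landed), a 2 MiB PMD size, and the helper name try_collapse, which is
purely illustrative. A True result only means madvise() did not fail. It
does not say which THP mode the system is in.

```python
import mmap

# Assumption: value of MADV_COLLAPSE in the Linux uapi headers. Python's
# mmap module may not export the constant, hence the fallback.
MADV_COLLAPSE = getattr(mmap, "MADV_COLLAPSE", 25)

# Assumption: 2 MiB huge pages, so any 4 MiB range contains at least one
# fully aligned huge-page-sized span.
REGION_SIZE = 4 * 1024 * 1024


def try_collapse(length=REGION_SIZE):
    """Map private anonymous memory, touch it, and request a collapse.

    Returns True if madvise() accepted the request and False if the
    kernel (or this Python build) rejected or lacks it.
    """
    region = mmap.mmap(
        -1,
        length,
        flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
        prot=mmap.PROT_READ | mmap.PROT_WRITE,
    )
    try:
        # Fault in every base page so there is something to collapse.
        for offset in range(0, length, mmap.PAGESIZE):
            region[offset] = 1
        try:
            region.madvise(MADV_COLLAPSE)
        except (OSError, AttributeError):
            return False
        return True
    finally:
        region.close()


if __name__ == "__main__":
    print("MADV_COLLAPSE accepted:", try_collapse())
```

Running this once with "enabled" set to "never" and once with "always"
would be one cheap way to see which semantics a given kernel implements.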

>
> Thanks again for your time,
> Zach



* Re: [RFC] mm: MADV_COLLAPSE semantics
  2022-05-31 23:47           ` Yang Shi
@ 2022-06-01  9:50             ` Michal Hocko
  2022-06-01 17:25               ` Yang Shi
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2022-06-01  9:50 UTC (permalink / raw)
  To: Yang Shi
  Cc: Zach O'Keefe, Alex Shi, David Hildenbrand, David Rientjes,
	Matthew Wilcox, Peter Xu, Song Liu, Linux MM, Rongwei Wang,
	Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Kirill A. Shutemov, Minchan Kim, SeongJae Park, Pasha Tatashin

On Tue 31-05-22 16:47:49, Yang Shi wrote:
> On Fri, May 27, 2022 at 2:46 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > I really do not see any good reason to tightly couple kernel and user
> > policies. Hints like MADV_{NO}HUGEPAGE are one thing and both kernel
> > and userspace might decide to interpret them. But binding MADV_COLLAPSE
> > to in kernel THP tunables just seems like pushing ourselves into the
> > corner.
> 
> I don't mean we should tightly couple kernel and user policies. I
> think it is about how "never" is treated. AFAICT, typically sys admins
> tend to expect "never" as a global switch and they don't expect any
> THP allocation should happen in "never" mode even though it is
> requested by the users. Maybe they should not expect so in the first
> place.

But this is not how the knob works, right? At least shmem has its own
thing. So we do not have any global kill switch for transparent huge
pages.

-- 
Michal Hocko
SUSE Labs



* Re: [RFC] mm: MADV_COLLAPSE semantics
  2022-05-31 21:36       ` Zach O'Keefe
  2022-05-31 23:52         ` Yang Shi
@ 2022-06-01  9:57         ` Michal Hocko
  1 sibling, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2022-06-01  9:57 UTC (permalink / raw)
  To: Zach O'Keefe
  Cc: Yang Shi, Matthew Wilcox, Peter Xu, Alex Shi, David Hildenbrand,
	David Rientjes, Song Liu, Linux MM, Rongwei Wang,
	Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Kirill A. Shutemov, Minchan Kim, SeongJae Park, Pasha Tatashin

On Tue 31-05-22 14:36:50, Zach O'Keefe wrote:
> Thanks everyone for your time and for the great discussion!
> 
> For the purposes of arriving at a decision, I've tried to outline the
> major points + my 2c below as:
> 
> 1. Breaking userland. AFAIK, if permitting MADV_COLLAPSE in "never"
> will break real, existing use cases, then linux's policy would
> necessitate that we don't do that. Is there a way we can reasonably
> determine this? An affirmative answer here makes this decision easy.

As pointed out in the other reply, "never" doesn't really imply no THPs.
At least shmem doesn't obey that configuration and relies on the mount
option instead, AFAIR.

[...]
> 3. Future uses of "never". Do we want to permit a policy where
> userspace *entirely* takes over THP allocation, and khugepaged and
> at-fault are disabled in the kernel? If yes, then we might as well
> permit "never" to allow that now. Personally, though, I can't imagine
> wanting to disable faulting-in THPs in places where we know data will
> be hot; but respecting "never" does back us into a corner if we ever
> go that route.

My experience tells me that usecases to take control into the userspace
grow rather than shrink. We have people asking for memory reclaim into
the userspace and I do not really see reasons why THPs would be any
different.

If we ever really need a global THP kill switch to act on any types
of mappings then we would need to add a new knob because changing the
existing one would be hard without any regressions.
-- 
Michal Hocko
SUSE Labs



* Re: [RFC] mm: MADV_COLLAPSE semantics
  2022-06-01  9:50             ` Michal Hocko
@ 2022-06-01 17:25               ` Yang Shi
  2022-06-02  6:55                 ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Yang Shi @ 2022-06-01 17:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Zach O'Keefe, Alex Shi, David Hildenbrand, David Rientjes,
	Matthew Wilcox, Peter Xu, Song Liu, Linux MM, Rongwei Wang,
	Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Kirill A. Shutemov, Minchan Kim, SeongJae Park, Pasha Tatashin

On Wed, Jun 1, 2022 at 2:50 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 31-05-22 16:47:49, Yang Shi wrote:
> > On Fri, May 27, 2022 at 2:46 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > I really do not see any good reason to tightly couple kernel and user
> > > policies. Hints like MADV_{NO}HUGEPAGE are one thing and both kernel
> > > and userspace might decide to interpret them. But binding MADV_COLLAPSE
> > > to in kernel THP tunables just seems like pushing ourselves into the
> > > corner.
> >
> > I don't mean we should tightly couple kernel and user policies. I
> > think it is about how "never" is treated. AFAICT, typically sys admins
> > tend to expect "never" as a global switch and they don't expect any
> > THP allocation should happen in "never" mode even though it is
> > requested by the users. Maybe they should not expect so in the first
> > place.
>
> But this is not how the knob works, right? At least shmem has its own
> thing. So we do not have any global kill switch for transparent huge
> pages.

Yeah, but shmem has a "never" mode too, which has the same semantics and
is actually the default mode. Since MADV_COLLAPSE just collapses
anon memory for now, the discussion was focused on anon THP. But
shmem is the same.

And shmem even has a "deny" mode, which has stronger semantics.

>
> --
> Michal Hocko
> SUSE Labs



* Re: [RFC] mm: MADV_COLLAPSE semantics
  2022-06-01 17:25               ` Yang Shi
@ 2022-06-02  6:55                 ` Michal Hocko
  2022-06-02 16:43                   ` Yang Shi
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2022-06-02  6:55 UTC (permalink / raw)
  To: Yang Shi
  Cc: Zach O'Keefe, Alex Shi, David Hildenbrand, David Rientjes,
	Matthew Wilcox, Peter Xu, Song Liu, Linux MM, Rongwei Wang,
	Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Kirill A. Shutemov, Minchan Kim, SeongJae Park, Pasha Tatashin

On Wed 01-06-22 10:25:53, Yang Shi wrote:
> On Wed, Jun 1, 2022 at 2:50 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 31-05-22 16:47:49, Yang Shi wrote:
> > > On Fri, May 27, 2022 at 2:46 AM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > > > I really do not see any good reason to tightly couple kernel and user
> > > > policies. Hints like MADV_{NO}HUGEPAGE are one thing and both kernel
> > > > and userspace might decide to interpret them. But binding MADV_COLLAPSE
> > > > to in kernel THP tunables just seems like pushing ourselves into the
> > > > corner.
> > >
> > > I don't mean we should tightly couple kernel and user policies. I
> > > think it is about how "never" is treated. AFAICT, typically sys admins
> > > tend to expect "never" as a global switch and they don't expect any
> > > THP allocation should happen in "never" mode even though it is
> > > requested by the users. Maybe they should not expect so in the first
> > > place.
> >
> > But this is not how the knob works, right? At least shmem has its own
> > thing. So we do not have any global kill switch for transparent huge
> > pages.
> 
> Yeah, but shmem has a "never" mode too, which has the same semantics and
> is actually the default mode. Since MADV_COLLAPSE just collapses
> anon memory for now, the discussion was focused on anon THP. But
> shmem is the same.

Do you expect MADV_COLLAPSE would stick to the anonymous memory? Are we
going to get MADV_COLLAPSE_SHMEM, MADV_COLLAPSE_FOR_REAL?

-- 
Michal Hocko
SUSE Labs



* Re: [RFC] mm: MADV_COLLAPSE semantics
  2022-06-02  6:55                 ` Michal Hocko
@ 2022-06-02 16:43                   ` Yang Shi
  2022-06-03 13:26                     ` Zach O'Keefe
  0 siblings, 1 reply; 23+ messages in thread
From: Yang Shi @ 2022-06-02 16:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Zach O'Keefe, Alex Shi, David Hildenbrand, David Rientjes,
	Matthew Wilcox, Peter Xu, Song Liu, Linux MM, Rongwei Wang,
	Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Kirill A. Shutemov, Minchan Kim, SeongJae Park, Pasha Tatashin

On Wed, Jun 1, 2022 at 11:56 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 01-06-22 10:25:53, Yang Shi wrote:
> > On Wed, Jun 1, 2022 at 2:50 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 31-05-22 16:47:49, Yang Shi wrote:
> > > > On Fri, May 27, 2022 at 2:46 AM Michal Hocko <mhocko@suse.com> wrote:
> > > [...]
> > > > > I really do not see any good reason to tightly couple kernel and user
> > > > > policies. Hints like MADV_{NO}HUGEPAGE are one thing and both kernel
> > > > > and userspace might decide to interpret them. But binding MADV_COLLAPSE
> > > > > to in kernel THP tunables just seems like pushing ourselves into the
> > > > > corner.
> > > >
> > > > I don't mean we should tightly couple kernel and user policies. I
> > > > think it is about how "never" is treated. AFAICT, typically sys admins
> > > > tend to expect "never" as a global switch and they don't expect any
> > > > THP allocation should happen in "never" mode even though it is
> > > > requested by the users. Maybe they should not expect so in the first
> > > > place.
> > >
> > > But this is not how the knob works, right? At least shmem has its own
> > > thing. So we do not have any global kill switch for transparent huge
> > > pages.
> >
> > Yeah, but shmem has a "never" mode too, which has the same semantics and
> > is actually the default mode. Since MADV_COLLAPSE just collapses
> > anon memory for now, the discussion was focused on anon THP. But
> > shmem is the same.
>
> Do you expect MADV_COLLAPSE would stick to the anonymous memory? Are we
> going to get MADV_COLLAPSE_SHMEM, MADV_COLLAPSE_FOR_REAL?

No, my point is "never" mode has the same semantics for both anon and
shmem. When we were talking about whether MADV_COLLAPSE should respect
"never" or not, it means both.

>
> --
> Michal Hocko
> SUSE Labs



* Re: [RFC] mm: MADV_COLLAPSE semantics
  2022-06-02 16:43                   ` Yang Shi
@ 2022-06-03 13:26                     ` Zach O'Keefe
  2022-06-03 13:33                       ` Zach O'Keefe
  0 siblings, 1 reply; 23+ messages in thread
From: Zach O'Keefe @ 2022-06-03 13:26 UTC (permalink / raw)
  To: Yang Shi
  Cc: Michal Hocko, Alex Shi, David Hildenbrand, David Rientjes,
	Matthew Wilcox, Peter Xu, Song Liu, Linux MM, Rongwei Wang,
	Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Kirill A. Shutemov, Minchan Kim, SeongJae Park, Pasha Tatashin

On Thu, Jun 2, 2022 at 9:43 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, Jun 1, 2022 at 11:56 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 01-06-22 10:25:53, Yang Shi wrote:
> > > On Wed, Jun 1, 2022 at 2:50 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Tue 31-05-22 16:47:49, Yang Shi wrote:
> > > > > On Fri, May 27, 2022 at 2:46 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > [...]
> > > > > > I really do not see any good reason to tightly couple kernel and user
> > > > > > policies. Hints like MADV_{NO}HUGEPAGE are one thing and both kernel
> > > > > > and userspace might decide to interpret them. But binding MADV_COLLAPSE
> > > > > > to in kernel THP tunables just seems like pushing ourselves into the
> > > > > > corner.
> > > > >
> > > > > I don't mean we should tightly couple kernel and user policies. I
> > > > > think it is about how "never" is treated. AFAICT, typically sys admins
> > > > > tend to expect "never" as a global switch and they don't expect any
> > > > > THP allocation should happen in "never" mode even though it is
> > > > > requested by the users. Maybe they should not expect so in the first
> > > > > place.
> > > >
> > > > But this is not how the knob works, right? At least shmem has its own
> > > > thing. So we do not have any global kill switch for transparent huge
> > > > pages.
> > >
> > > Yeah, but shmem has a "never" mode too, which has the same semantics and
> > > is actually the default mode. Since MADV_COLLAPSE just collapses
> > > anon memory for now, the discussion was focused on anon THP. But
> > > shmem is the same.
> >
> > Do you expect MADV_COLLAPSE would stick to the anonymous memory? Are we
> > going to get MADV_COLLAPSE_SHMEM, MADV_COLLAPSE_FOR_REAL?
>
> No, my point is "never" mode has the same semantics for both anon and
> shmem. When we were talking about whether MADV_COLLAPSE should respect
> "never" or not, it means both.

Ya, this is a good point, and something I honestly overlooked
originally when posting this thread.

> >
> > --
> > Michal Hocko
> > SUSE Labs



* Re: [RFC] mm: MADV_COLLAPSE semantics
  2022-06-03 13:26                     ` Zach O'Keefe
@ 2022-06-03 13:33                       ` Zach O'Keefe
  0 siblings, 0 replies; 23+ messages in thread
From: Zach O'Keefe @ 2022-06-03 13:33 UTC (permalink / raw)
  To: David Hildenbrand, David Rientjes, Matthew Wilcox, Michal Hocko,
	Peter Xu, linux-mm, rongwei.wang, Yang Shi
  Cc: Alex Shi, Song Liu, Andrea Arcangeli, Axel Rasmussen,
	Hugh Dickins, Kirill A. Shutemov, Minchan Kim, SeongJae Park,
	Pasha Tatashin

Ok, I think I'll wrap up this thread.

Again, thanks all for taking the time to voice your thoughts and have
this great discussion.

As Yang suggested, I'll start by proposing MADV_COLLAPSE isn't tied to
any of the THP sysfs knobs and we'll see if there are any additional
issues / concerns raised.
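
To spell out the difference, here is a toy model of the two eligibility
checks as I understand the outcome of this thread. This is only a sketch
under simplifying assumptions: it models the anon "enabled" knob and the
two VMA flags and nothing else, and the function names are hypothetical,
not kernel symbols.

```python
def kernel_thp_eligible(enabled_mode, vm_hugepage, vm_nohugepage):
    """Kernel-driven THP (page fault / khugepaged), simplified.

    enabled_mode is the active value of the sysfs "enabled" knob:
    "always", "madvise" or "never".
    """
    if vm_nohugepage or enabled_mode == "never":
        return False
    if enabled_mode == "always":
        return True
    # "madvise" mode: only VMAs explicitly marked with MADV_HUGEPAGE.
    return vm_hugepage


def madv_collapse_eligible(enabled_mode, vm_hugepage, vm_nohugepage):
    """Proposed MADV_COLLAPSE check: sysfs knobs are ignored entirely.

    Only an explicit VM_NOHUGEPAGE on the VMA blocks the collapse.
    """
    return not vm_nohugepage


if __name__ == "__main__":
    for mode in ("always", "madvise", "never"):
        print(mode,
              kernel_thp_eligible(mode, False, False),
              madv_collapse_eligible(mode, False, False))
```

The only input the two checks share is VM_NOHUGEPAGE, which is the part
everyone agreed MADV_COLLAPSE should keep respecting.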

Thanks again for everyone's time!

Best,
Zach



end of thread, other threads:[~2022-06-03 13:33 UTC | newest]

Thread overview: 23+ messages
2022-05-24  0:18 [RFC] mm: MADV_COLLAPSE semantics Zach O'Keefe
2022-05-24 13:26 ` Peter Xu
2022-05-24 17:08   ` Zach O'Keefe
2022-05-24 20:02 ` Yang Shi
2022-05-25  8:24 ` Michal Hocko
2022-05-25 17:32   ` Yang Shi
2022-05-25 18:09     ` Zach O'Keefe
2022-05-26  7:12     ` Michal Hocko
2022-05-26 17:39       ` Yang Shi
2022-05-27  9:46         ` Michal Hocko
2022-05-31 23:47           ` Yang Shi
2022-06-01  9:50             ` Michal Hocko
2022-06-01 17:25               ` Yang Shi
2022-06-02  6:55                 ` Michal Hocko
2022-06-02 16:43                   ` Yang Shi
2022-06-03 13:26                     ` Zach O'Keefe
2022-06-03 13:33                       ` Zach O'Keefe
2022-05-26 18:30   ` Matthew Wilcox
2022-05-27  8:56     ` Michal Hocko
2022-05-27 18:09     ` Yang Shi
2022-05-31 21:36       ` Zach O'Keefe
2022-05-31 23:52         ` Yang Shi
2022-06-01  9:57         ` Michal Hocko
