[ Oops. different thread for me due to edited subject, so I saw this
after replying to the earlier email by David ]

On Thu, Dec 6, 2018 at 1:14 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> MADV_HUGEPAGE changes the picture because the caller expressed a need
> for THP and is willing to go extra mile to get it.

Actually, I think MADV_HUGEPAGE should just be
"TRANSPARENT_HUGEPAGE_ALWAYS but only for this vma".

So MADV_HUGEPAGE shouldn't change any behavior at all, if the kernel
was built with TRANSPARENT_HUGEPAGE_ALWAYS.

Put another way: even if you decide to run a kernel that does *not*
have that "always THP" (beause you presumably think that it's too
blunt an instrument), then MADV_HUGEPAGE says "for _this_ vma, do the
'always THP' bebavior"

I think those semantics would be a whole lot easier to explain to
people, and perhaps more imporantly, starting off from that kind of
mindset also gives good guidance to what MADV_HUGEPAGE behavior should
be: it should be sane enough that it makes sense as the _default_
behavior for the TRANSPARENT_HUGEPAGE_ALWAYS configuration.

But that also means that no, MADV_HUGEPAGE doesn't really change the
picture. All it does is says "I know that for this vma, THP really
does make sense as a default".

It doesn't say "I _have_ to have THP", exactly like
TRANSPARENT_HUGEPAGE_ALWAYS does not mean that every allocation should
strive to be THP.

>I believe that something like the below would be sensible
>         1) THP on a local node with compaction not giving up too early
>         2) THP on a remote node in NOWAIT mode - so no direct
>            compaction/reclaim (trigger kswapd/kcompactd only for
>            defrag=defer+madvise)
>         3) fallback to the base page allocation

That doesn't sound insane to me. That said, the numbers David quoted
do fairly strongly imply that local small-pages are actually preferred
to any remote THP pages.

But *that* in turn makes for other possible questions:

 - if the reason we couldn't get a local hugepage is that we're simply
out of local memory (huge *or* small), then maybe a remote hugepage is
better.

   Note that this now implies that the choice can be an issue of "did
the hugepage allocation fail due to fragmentation, or due to the node
being low of memory"

and there is the other question that I asked in the other thread
(before subject edit):

 - how local is the load to begin with?

   Relatively shortlived processes - or processes that are explicitly
bound to a node - might have different preferences than some
long-lived process where the CPU bounces around, and might have
different trade-offs for the local vs remote question too.

So just based on David's numbers, and some wild handwaving on my part,
a slightly more complex, but still very sensible default might be
something like

 1) try to do a cheap local node hugepage allocation

    Rationale: everybody agrees this is the best case.

    But if that fails:

 2) look at compacting and the local node, but not very hard.

    If there's lots of memory on the local node, but synchronous
compaction doesn't do anything easily, just fall back to small pages.

    Rationale: local memory is generally more important than THP.

    If that fails (ie local node is simply low on memory):

 3) Try to do remote THP allocation

     Rationale: Ok, we simply didn't have a lot of local memory, so
it's not just a question of fragmentation. If it *had* been
fragmentation, lots of small local pages would have been better than a
remote THP page.

     Oops, remote THP allocation failed (possibly after synchronous
remote compaction, but maybe this is where we do kcompactd).

 4) Just do any small page, and do reclaim etc. THP isn't happening,
and it's not a priority when you're starting to feel memory pressure.

In general, I really would want to avoid magic kernel command lines
(or sysfs settings, or whatever) making a huge difference in behavior.
So I really wish people would see the whole
'transparent_hugepage_flags' thing as a way for kernel developers to
try different settings, not as a way for users to tune their loads.

Our default should work as sane defaults, we shouldn't have a "ok,
let's have this sysfs tunable and let people make their own
decisions". That's a cop-out.

Btw, don't get me wrong: I'm not suggesting removing the sysfs knob.
As a debug tool, it's great, where you can ask "ok, do things work
better if you set THP-defrag to defer+madvise".

I'm just saying that we should *not* use that sysfs flag as an excuse
for "ok, if we get the default wrong, people can make their own
defaults". We should strive to do well enough that it really shouldn't
be an issue in normal situations.

                     Linus