[ Oops. different thread for me due to edited subject, so I saw this
  after replying to the earlier email by David ]

On Thu, Dec 6, 2018 at 1:14 AM Michal Hocko wrote:
>
> MADV_HUGEPAGE changes the picture because the caller expressed a need
> for THP and is willing to go extra mile to get it.

Actually, I think MADV_HUGEPAGE should just be
"TRANSPARENT_HUGEPAGE_ALWAYS but only for this vma".

So MADV_HUGEPAGE shouldn't change any behavior at all, if the kernel
was built with TRANSPARENT_HUGEPAGE_ALWAYS.

Put another way: even if you decide to run a kernel that does *not*
have that "always THP" (because you presumably think that it's too
blunt an instrument), then MADV_HUGEPAGE says "for _this_ vma, do the
'always THP' behavior"

I think those semantics would be a whole lot easier to explain to
people, and perhaps more importantly, starting off from that kind of
mindset also gives good guidance to what MADV_HUGEPAGE behavior should
be: it should be sane enough that it makes sense as the _default_
behavior for the TRANSPARENT_HUGEPAGE_ALWAYS configuration.

But that also means that no, MADV_HUGEPAGE doesn't really change the
picture. All it does is say "I know that for this vma, THP really
does make sense as a default".

It doesn't say "I _have_ to have THP", exactly like
TRANSPARENT_HUGEPAGE_ALWAYS does not mean that every allocation should
strive to be THP.

> I believe that something like the below would be sensible
> 1) THP on a local node with compaction not giving up too early
> 2) THP on a remote node in NOWAIT mode - so no direct
>    compaction/reclaim (trigger kswapd/kcompactd only for
>    defrag=defer+madvise)
> 3) fallback to the base page allocation

That doesn't sound insane to me.

That said, the numbers David quoted do fairly strongly imply that
local small-pages are actually preferred to any remote THP pages.

But *that* in turn makes for other possible questions:

 - if the reason we couldn't get a local hugepage is that we're simply
   out of local memory (huge *or* small), then maybe a remote hugepage
   is better.

   Note that this now implies that the choice can be an issue of "did
   the hugepage allocation fail due to fragmentation, or due to the
   node being low on memory"

and there is the other question that I asked in the other thread
(before subject edit):

 - how local is the load to begin with?

   Relatively shortlived processes - or processes that are explicitly
   bound to a node - might have different preferences than some
   long-lived process where the CPU bounces around, and might have
   different trade-offs for the local vs remote question too.

So just based on David's numbers, and some wild handwaving on my part,
a slightly more complex, but still very sensible default might be
something like

 1) try to do a cheap local node hugepage allocation

    Rationale: everybody agrees this is the best case.

    But if that fails:

 2) look at compacting the local node, but not very hard.

    If there's lots of memory on the local node, but synchronous
    compaction doesn't do anything easily, just fall back to small
    pages.

    Rationale: local memory is generally more important than THP.

    If that fails (ie local node is simply low on memory):

 3) Try to do remote THP allocation

    Rationale: Ok, we simply didn't have a lot of local memory, so
    it's not just a question of fragmentation. If it *had* been
    fragmentation, lots of small local pages would have been better
    than a remote THP page.

    Oops, remote THP allocation failed (possibly after synchronous
    remote compaction, but maybe this is where we do kcompactd).

 4) Just do any small page, and do reclaim etc. THP isn't happening,
    and it's not a priority when you're starting to feel memory
    pressure.

In general, I really would want to avoid magic kernel command lines
(or sysfs settings, or whatever) making a huge difference in behavior.

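
To make the ordering of the four steps above concrete, here is a rough
userspace sketch of the decision logic. It is only an illustration
under stated assumptions: the enum, the struct, its fields and
choose_allocation() are all hypothetical names made up for this
example, not kernel code or kernel APIs, and the real allocator would
discover these conditions by attempting allocations rather than reading
booleans.

```c
#include <stdbool.h>

/* Hypothetical outcome labels; none of these exist in the kernel. */
enum thp_choice {
	LOCAL_THP,	/* steps 1-2: huge page on the local node */
	LOCAL_SMALL,	/* step 2 fallback: small pages on the local node */
	REMOTE_THP,	/* step 3: huge page on a remote node */
	ANY_SMALL	/* step 4: any small page, with reclaim etc. */
};

/* Hypothetical snapshot of what the allocation attempts would find. */
struct node_state {
	bool local_huge_free;		/* cheap local hugepage available */
	bool local_compaction_easy;	/* light local compaction succeeds */
	bool local_low_on_memory;	/* local node short of memory, huge or small */
	bool remote_huge_available;	/* remote THP allocation would succeed */
};

enum thp_choice choose_allocation(const struct node_state *s)
{
	/* 1) cheap local hugepage: the best case */
	if (s->local_huge_free)
		return LOCAL_THP;

	/* 2) compact the local node, but not very hard */
	if (s->local_compaction_easy)
		return LOCAL_THP;

	/* Plenty of local memory, just fragmented: local beats THP */
	if (!s->local_low_on_memory)
		return LOCAL_SMALL;

	/* 3) local node is simply low on memory: try remote THP */
	if (s->remote_huge_available)
		return REMOTE_THP;

	/* 4) THP isn't happening: any small page, reclaim etc. */
	return ANY_SMALL;
}
```

The point the sketch tries to capture is that fragmentation and "low
on memory" lead to different fallbacks: a fragmented local node with
free memory falls back to local small pages and never considers a
remote hugepage.
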
So I really wish people would see the whole
'transparent_hugepage_flags' thing as a way for kernel developers to
try different settings, not as a way for users to tune their loads.

Our default should work as sane defaults, we shouldn't have a "ok,
let's have this sysfs tunable and let people make their own
decisions". That's a cop-out.

Btw, don't get me wrong: I'm not suggesting removing the sysfs knob.
As a debug tool, it's great, where you can ask "ok, do things work
better if you set THP-defrag to defer+madvise".

I'm just saying that we should *not* use that sysfs flag as an excuse
for "ok, if we get the default wrong, people can make their own
defaults". We should strive to do well enough that it really shouldn't
be an issue in normal situations.

                 Linus