On 7 Sep 2020, at 3:20, Michal Hocko wrote:

> On Fri 04-09-20 14:10:45, Roman Gushchin wrote:
>> On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote:
> [...]
>>> An explicit opt-in sounds much more appropriate to me as well. If we go
>>> with a specific API then I would not make it 1GB pages specific. Why
>>> cannot we have an explicit interface to "defragment" address space
>>> range into large pages and the kernel would use large pages where
>>> appropriate? Or is the additional copying prohibitively expensive?
>>
>> Can you, please, elaborate a bit more here? It seems like madvise(MADV_HUGEPAGE)
>> provides something similar to what you're describing, but there are lot
>> of details here, so I'm probably missing something.
>
> MADV_HUGEPAGE is controlling a preference for THP to be used for a
> particular address range. So it looks similar but the historical
> behavior is to control page faults as well and the behavior depends on
> the global setup.
>
> I've had in mind something much simpler. Effectively an API to invoke
> khugepaged (like) functionality synchronously from the calling context
> on the specific address range. It could be more aggressive than the
> regular khugepaged and create even 1G pages (or as large THPs as page
> tables can handle on the particular arch for that matter).
>
> As this would be an explicit call we do not have to be worried about
> the resulting latency because it would be an explicit call by the
> userspace. The default khugepaged has a harder position there because
> has no understanding of the target address space and cannot make any
> cost/benefit evaluation so it has to be more conservative.

Something like MADV_HUGEPAGE_SYNC? It would be useful, since users have
better and clearer control of getting huge pages from the kernel and
know when they will pay the cost of getting the huge pages.

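To make the current situation concrete, here is a minimal sketch of what
userspace can do today with the existing MADV_HUGEPAGE hint. It is only
an illustration under stated assumptions: the helper names
map_thp_hinted() and anon_huge_kb() are made up for this example, and
HPAGE_SIZE assumes 2 MiB PMD-sized THPs as on x86-64. It does not use
MADV_HUGEPAGE_SYNC, which is only a proposed name in this thread.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Assumption: PMD-sized THPs are 2 MiB (true on x86-64, not everywhere). */
#define HPAGE_SIZE (2UL << 20)

/*
 * Map an anonymous region, hint it with MADV_HUGEPAGE and touch it.
 * Returns a HPAGE_SIZE-aligned pointer to len usable bytes, or NULL.
 * The alignment slack is deliberately leaked to keep the sketch short.
 */
static void *map_thp_hinted(size_t len)
{
	size_t span = len + HPAGE_SIZE;
	char *raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *buf;

	if (raw == MAP_FAILED)
		return NULL;

	/* Round up so the range can hold naturally aligned huge pages. */
	buf = (char *)(((uintptr_t)raw + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1));

	/*
	 * Best effort: success here only records the preference. It says
	 * nothing about whether any huge page was or will be allocated.
	 */
	if (madvise(buf, len, MADV_HUGEPAGE) != 0)
		perror("madvise(MADV_HUGEPAGE)");

	/* Fault the range in; this is where a THP may (or may not) appear. */
	memset(buf, 0, len);
	return buf;
}

/*
 * Sum the AnonHugePages counters (in kB) of every /proc/self/smaps entry
 * that overlaps [start, start + len). Whole VMAs are counted, so this is
 * an approximation. Returns -1 if smaps cannot be read.
 */
static long anon_huge_kb(uintptr_t start, size_t len)
{
	FILE *f = fopen("/proc/self/smaps", "r");
	char line[512];
	unsigned long lo, hi;
	long kb, total = 0;
	int in_range = 0;

	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "%lx-%lx ", &lo, &hi) == 2)
			in_range = lo < start + len && hi > start;
		else if (in_range &&
			 sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
			total += kb;
	}
	fclose(f);
	return total;
}
```

A caller would do something like buf = map_thp_hinted(2 * HPAGE_SIZE) and
then anon_huge_kb((uintptr_t)buf, 2 * HPAGE_SIZE). A result of 0 is
perfectly possible even though madvise() succeeded, depending on the
global THP settings and on memory fragmentation, and the only way to
find out is this kind of after-the-fact polling.
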
I think the suggestion is more about the fact that the huge page control
options currently provided by the kernel do not have a predictable
performance outcome: MADV_HUGEPAGE is a best-effort option and does not
tell users whether the marked virtual address range is backed by huge
pages when madvise() returns. MADV_HUGEPAGE_SYNC would give users a
deterministic result on whether the huge page(s) were formed.

--
Best Regards,
Yan Zi