From: David Rientjes <rientjes@google.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
ying.huang@intel.com, Andrea Arcangeli <aarcange@redhat.com>,
s.priebe@profihost.ag, mgorman@techsingularity.net,
Linux List Kernel Mailing <linux-kernel@vger.kernel.org>,
alex.williamson@redhat.com, lkp@01.org, kirill@shutemov.name,
Andrew Morton <akpm@linux-foundation.org>,
zi.yan@cs.rutgers.edu, Vlastimil Babka <vbabka@suse.cz>
Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
Date: Tue, 4 Dec 2018 16:07:27 -0800 (PST)
Message-ID: <alpine.DEB.2.21.1812041551170.213718@chino.kir.corp.google.com>
In-Reply-To: <20181204084821.GB1286@dhcp22.suse.cz>
On Tue, 4 Dec 2018, Michal Hocko wrote:
> The thing I am really up to here is that reintroduction of
> __GFP_THISNODE, which you are pushing for, will conflate madvise mode
> resp. defrag=always with a numa placement policy because the allocation
> doesn't fall back to a remote node.
>
It isn't specific to MADV_HUGEPAGE; it is the policy for all transparent
hugepage allocations, including defrag=always. We agree that
MADV_HUGEPAGE is not exactly defined: does it mean try harder to allocate
a hugepage locally, try compaction synchronous to the fault, allow remote
fallback? It's undefined.

The original intent was for it to be used when thp is not enabled system
wide (enabled set to "madvise"), because it's possible for the rss of the
process to increase if backed by thp. That occurs either if faulting on a
hugepage aligned area or based on max_ptes_none. So we have at least three
possible policies that have evolved over time: preventing increased rss,
direct compaction, remote fallback. Certainly not something that fits
under a single madvise mode.
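To make the "hugepage aligned area" case concrete, here is a minimal
sketch of how a process opts a range into thp when enabled is set to
"madvise". It assumes 2MB hugepages (x86-64; other platforms differ).
The helper names hpage_align_up(), hpage_align_down() and
advise_hugepage() are hypothetical, made up for illustration; only
madvise() and MADV_HUGEPAGE are real interfaces.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Assumption: 2MB hugepages, as on x86-64. */
#define HPAGE_SIZE ((uintptr_t)2 << 20)

/* Round addr up to the next hugepage boundary (hypothetical helper). */
static uintptr_t hpage_align_up(uintptr_t addr)
{
	return (addr + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
}

/* Round addr down to the previous hugepage boundary (hypothetical helper). */
static uintptr_t hpage_align_down(uintptr_t addr)
{
	return addr & ~(HPAGE_SIZE - 1);
}

/*
 * Advise only the hugepage-aligned interior of [buf, buf + len).
 * Returns the number of bytes advised, 0 if no aligned hugepage fits
 * in the range, or -1 if madvise() fails (for example on a kernel
 * built without CONFIG_TRANSPARENT_HUGEPAGE).
 */
static long advise_hugepage(void *buf, size_t len)
{
	uintptr_t start = hpage_align_up((uintptr_t)buf);
	uintptr_t end = hpage_align_down((uintptr_t)buf + len);

	if (end <= start)
		return 0;
	if (madvise((void *)start, end - start, MADV_HUGEPAGE) != 0)
		return -1;
	return (long)(end - start);
}
```

A caller would typically mmap() an anonymous region somewhat larger than
it needs and pass it to advise_hugepage(). Nothing in this call says
whether the fault should compact synchronously, stay on the local node,
or fall back remotely, which is the ambiguity described above.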
> And that is a fundamental problem and the antipattern I am talking
> about. Look at it this way. All normal allocations are utilizing all the
> available memory even though they might hit a remote latency penalty. If
> you do care about NUMA placement you have an API to enforce a specific
> placement. What is so different about THP that it should behave differently? Do
> we really want to later invent an API to actually allow to utilize all
> the memory? There are certainly usecases (that triggered the discussion
> previously) that do not mind the remote latency because all other
> benefits simply outweigh it.
>
What is different about THP is that on every platform I have measured,
NUMA matters more than hugepages. Obviously, if remote hugepages were a
performance win over local pages on Broadwell, Haswell, and Rome, this
discussion would not be happening. Faulting local pages rather than
local hugepages, if possible, is easy and doesn't require reclaim.
Faulting remote pages rather than reclaiming local pages is easy in your
scenario; it's non-disruptive.

So to answer "what is so different about THP?", it's the performance data.
The NUMA locality matters more than whether the pages are huge or not. We
also have the added benefit of khugepaged being able to collapse pages
locally if fragmentation improves rather than being stuck accessing a
remote hugepage forever.
> That being said what should users who want to use all the memory do to
> use as many THPs as possible?
If those users want to accept the performance degradation of allocating
remote hugepages instead of local pages, that should likely be an
extension, either madvise or prctl. I don't believe that's necessarily
the usecase Andrea would have: he'd still prefer to compact memory
locally and avoid the swap storm rather than allocate remotely. If it is
impossible to reclaim locally for regular pages, remote hugepages may be
more beneficial than remote pages.