From: David Rientjes <rientjes@google.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>, mgorman@techsingularity.net,
	Vlastimil Babka <vbabka@suse.cz>, Michal Hocko <mhocko@kernel.org>,
	ying.huang@intel.com, s.priebe@profihost.ag,
	Linux List Kernel Mailing <linux-kernel@vger.kernel.org>,
	alex.williamson@redhat.com, lkp@01.org, kirill@shutemov.name,
	Andrew Morton <akpm@linux-foundation.org>, zi.yan@cs.rutgers.edu
Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
Date: Sun, 9 Dec 2018 16:29:13 -0800 (PST)
Message-ID: <alpine.DEB.2.21.1812091538310.215735@chino.kir.corp.google.com>
In-Reply-To: <CAHk-=wjVuLjZ1Wr52W=hNqh=_8gbzuKA+YpsVb4NBHCJsE6cyA@mail.gmail.com>

On Thu, 6 Dec 2018, Linus Torvalds wrote:

> > On Broadwell, the access latency to local small pages was +5.6%, remote
> > hugepages +16.4%, and remote small pages +19.9%.
> >
> > On Naples, the access latency to local small pages was +4.9%, intrasocket
> > hugepages +10.5%, intrasocket small pages +19.6%, intersocket small pages
> > +26.6%, and intersocket hugepages +29.2%
>
> Are those two last numbers transposed?
>
> Or why would small page accesses be *faster* than hugepages for the
> intersocket case?
>
> Of course, depending on testing, maybe the page itself was remote, but
> the page tables were random, and you happened to get a remote page
> table for the hugepage case?
>

Yes, looks like that was the case, if the page tables were from the same
node as the intersocket remote hugepage it looks like a ~0.1% increase
accessing small pages, so basically unchanged.  So this complicates the
allocation strategy somewhat; on this platform, at least, hugepages are
preferred on the same socket but there isn't a significant benefit from
getting a cross socket hugepage over small page.

The typical way this is resolved is based on the SLIT and how the kernel
defines RECLAIM_DISTANCE.  I'm not sure that we can expect the distances
between proximity domains to be defined according to this value for a
one-size-fits-all solution.  I've always thought that RECLAIM_DISTANCE
should be configurable so that initscripts can actually determine its
ideal value when using vm.zone_reclaim_mode.

> > So it *appears* from the x86 platforms that NUMA matters much more
> > significantly than hugeness, but remote hugepages are a slight win over
> > remote small pages. PPC appeared the same wrt the local node but then
> > prefers hugeness over affinity when it comes to remote pages.
>
> I do think POWER at least historically has much weaker TLB fills, but
> also very costly page table creation/teardown. Constant-time O(1)
> arguments about hash lookups are only worth so much when the constant
> time is pretty big. They've been working on it.
>
> So at least on POWER, afaik one issue is literally that hugepages made
> the hash setup and teardown situation much better.
>

I'm still working on the more elaborate test case that will generate these
results because I think I can use it at boot to determine an ideal
RECLAIM_DISTANCE.  I can also get numbers for hash vs radix MMU if you're
interested.

> One thing that might be worth looking at is whether the process itself
> is all that node-local. Maybe we could aim for a policy that says
> "prefer local memory, but if we notice that the accesses to this vma
> aren't all that local, then who cares?".
>
> IOW, the default could be something more dynamic than just "always use
> __GFP_THISNODE". It could be more along the lines of "start off using
> __GFP_THISNODE, but for longer-lived processes that bounce around
> across nodes, maybe relax it?"
>

It would allow the use of MPOL_PREFERRED for an exact preference if they
are known to not be bounced around.  This would be required for processes
that are bound to the cpus of a single node through cpuset or
sched_setaffinity() but unconstrained as far as memory is concerned.

The goal of __GFP_THISNODE being the default for thp, however, is that we
*know* we're going to be accessing it locally at least in the short term,
perhaps forever.  Any other default would assume the remotely allocated
hugepage would eventually be accessed locally, otherwise we would have
been much better off just failing the hugepage allocation and accessing
small pages.  You could make an assumption that's the case iff the process
does not fit in its local node, and I think that would be the minority of
applications.  I guess there could be some heuristic that could determine
this based on MM_ANONPAGES of Andrea's qemu and
zone->zone_pgdat->node_present_pages.  It feels like something that should
be more exactly defined, though, for the application to say that it
prefers remote hugepages over local small pages because it can't access
either locally forever anyway.

This was where I suggested a new prctl() mode so that an application can
prefer remote hugepages because it knows it's larger than the single node
and that requires no change to the binary itself because it is inherited
across fork.

The sane default, though, seems to always prefer local allocation, whether
hugepages or small pages, for the majority of workloads since that's where
the lowest access latency is.

> Honestly, I think things like vm_policy etc should not be the solution
> - yes, some people may know *exactly* what access patterns they want,
> but for most situations, I think the policy should be that defaults
> "just work".
>
> In fact, I wish even MADV_HUGEPAGE itself were to approach being a
> no-op with THP.
>

Besides the NUMA locality of the allocations, we still have the allocation
latency concern that MADV_HUGEPAGE changes.  The madvise mode has taken on
two meanings: (1) prefer to fault hugepages when the thp enabled setting
is "madvise" so other applications don't blow up their rss unexpectedly,
and (2) try synchronous compaction/reclaim at fault for thp defrag
settings of "madvise" or "defer+madvise".

 [ It was intended to take on an additional meaning through the
   now-reverted patch, which was (3) relax the NUMA locality preference. ]

My binaries that remap their text segment to be backed by transparent
hugepages and qemu both share the same preference to try hard to fault
hugepages through compaction because we don't necessarily care about
allocation latency, we care about access latency later.  Smaller binaries,
those that do not have strict NUMA locality requirements, and short-lived
allocations are not going to want to incur the performance penalty of
synchronous compaction.

So I think today's semantics for MADV_HUGEPAGE make sense, but I'd like to
explore other areas that could improve this, both for the default case and
the specialized cases:

 - prctl() mode to readily allow remote allocations rather than reclaiming
   or compacting memory locally (affects more than just hugepages if the
   system has a non-zero vm.zone_reclaim_mode),

 - better feedback loop after the first compaction attempt in the page
   allocator slowpath to determine if reclaim is actually worthwhile for
   high-order allocations, and

 - configurable vm.reclaim_distance (use RECLAIM_DISTANCE as the default)
   which can be defined since there is not a one-size-fits-all strategy
   for allocations (there's no benefit to allocating hugepages cross
   socket on Naples, for example).
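
As an illustration of the last item above, here is a minimal userspace
sketch, in Python, of how an initscript might decide which nodes fall
within a configurable reclaim distance.  Everything in it is an assumption
made for illustration: vm.reclaim_distance is only proposed in this message
and is not an existing sysctl, the helper names are hypothetical, the
default of 30 is assumed to mirror the usual RECLAIM_DISTANCE value, and
the distance table is made up rather than taken from a real SLIT.  On a
live system the rows would come from
/sys/devices/system/node/nodeN/distance.

```python
# Hypothetical sketch: "reclaim_distance" models the proposed (not existing)
# vm.reclaim_distance tunable discussed in this message.

# Assumed to mirror the kernel's usual RECLAIM_DISTANCE default.
DEFAULT_RECLAIM_DISTANCE = 30


def parse_distance_row(text):
    """Parse one nodeN/distance line such as "10 16 32 32" into ints."""
    return [int(field) for field in text.split()]


def nodes_within(distances, local_node,
                 reclaim_distance=DEFAULT_RECLAIM_DISTANCE):
    """Return the nodes whose distance from local_node does not exceed
    reclaim_distance, i.e. the nodes that would count as "near"."""
    row = distances[local_node]
    return [node for node, dist in enumerate(row) if dist <= reclaim_distance]


# Made-up table for a two-socket machine with two nodes per socket:
# 10 = local, 16 = same socket, 32 = other socket.
EXAMPLE_DISTANCES = [
    parse_distance_row("10 16 32 32"),
    parse_distance_row("16 10 32 32"),
    parse_distance_row("32 32 10 16"),
    parse_distance_row("32 32 16 10"),
]
```

With this made-up table, the default threshold keeps node 0 on its own
socket (nodes 0 and 1), while raising the threshold to 32 would treat every
node as near.  Whether a cross-socket allocation is actually worthwhile
would still have to come from measurements like the ones above, not from
the SLIT values alone.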