From: David Rientjes <firstname.lastname@example.org>
To: Andrea Arcangeli <email@example.com>
Cc: Linus Torvalds <firstname.lastname@example.org>, email@example.com,
	Vlastimil Babka <firstname.lastname@example.org>,
	Michal Hocko <email@example.com>, firstname.lastname@example.org,
	email@example.com,
	Linux List Kernel Mailing <firstname.lastname@example.org>,
	email@example.com, firstname.lastname@example.org, email@example.com,
	Andrew Morton <firstname.lastname@example.org>, email@example.com
Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
Date: Tue, 11 Dec 2018 16:37:22 -0800 (PST)
Message-ID: <alpine.DEB.firstname.lastname@example.org>
In-Reply-To: <20181210044916.GC24097@redhat.com>

On Sun, 9 Dec 2018, Andrea Arcangeli wrote:

> You didn't release the proprietary software that depends on
> __GFP_THISNODE behavior and that you're afraid is getting a
> regression.
>
> Could you at least release with an open source license the benchmark
> software that you must have used to do the above measurement to
> understand why it gives such a weird result on remote THP?
>

Hi Andrea,

As I said in response to Linus, I'm in the process of writing a more
complete benchmarking test across all of our platforms for access and
allocation latency for x86 (both Intel and AMD), POWER8/9, and arm64, and
doing so on a kernel with minimum overhead (for the allocation latency, I
want to remove things like mem cgroup overhead from the result).

> On skylake and on the threadripper I can't confirm that there isn't a
> significant benefit from cross socket hugepage over cross socket small
> page.
>
> Skylake Xeon(R) Gold 5115:
>
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
> node 0 size: 15602 MB
> node 0 free: 14077 MB
> node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39
> node 1 size: 16099 MB
> node 1 free: 15949 MB
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10
> # numactl -m 0 -C 0 ./numa-thp-bench
> random writes MADV_HUGEPAGE 10109753 usec
> random writes MADV_NOHUGEPAGE 13682041 usec
> random writes MADV_NOHUGEPAGE 13704208 usec
> random writes MADV_HUGEPAGE 10120405 usec
> # numactl -m 0 -C 10 ./numa-thp-bench
> random writes MADV_HUGEPAGE 15393923 usec
> random writes MADV_NOHUGEPAGE 19644793 usec
> random writes MADV_NOHUGEPAGE 19671287 usec
> random writes MADV_HUGEPAGE 15495281 usec
> # grep Xeon /proc/cpuinfo |head -1
> model name : Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz
>
> local 4k -> local 2m: +35%
> local 4k -> remote 2m: -11%
> remote 4k -> remote 2m: +26%
>
> threadripper 1950x:
>
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
> node 0 size: 15982 MB
> node 0 free: 14422 MB
> node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
> node 1 size: 16124 MB
> node 1 free: 5357 MB
> node distances:
> node   0   1
>   0:  10  16
>   1:  16  10
> # numactl -m 0 -C 0 /tmp/numa-thp-bench
> random writes MADV_HUGEPAGE 12902667 usec
> random writes MADV_NOHUGEPAGE 17543070 usec
> random writes MADV_NOHUGEPAGE 17568858 usec
> random writes MADV_HUGEPAGE 12896588 usec
> # numactl -m 0 -C 8 /tmp/numa-thp-bench
> random writes MADV_HUGEPAGE 19663515 usec
> random writes MADV_NOHUGEPAGE 27819864 usec
> random writes MADV_NOHUGEPAGE 27844066 usec
> random writes MADV_HUGEPAGE 19662706 usec
> # grep Threadripper /proc/cpuinfo |head -1
> model name : AMD Ryzen Threadripper 1950X 16-Core Processor
>
> local 4k -> local 2m: +35%
> local 4k -> remote 2m: -10%
> remote 4k -> remote 2m: +41%
>
> Or if you prefer reversed in terms of compute time (negative
> percentage is better in this case):
>
> local 4k -> local 2m: -26%
> local 4k -> remote 2m: +12%
> remote 4k -> remote 2m: -29%
>
> It's true that local 4k is generally a win vs remote THP when the
> workload is memory bound also for the threadripper, the threadripper
> seems even more favorable to remote THP than skylake Xeon is.
>

My results are organized slightly differently since they take local
hugepages as the baseline, which is what we optimize for: on Broadwell,
I've obtained more accurate results that show local small pages at +3.8%,
remote hugepages at +12.8% and remote small pages at +18.8%. I think we
both agree that the locality preference for workloads that fit within a
single node is local hugepage -> local small page -> remote hugepage ->
remote small page, and that has been unchanged in any of the benchmarking
results for either of us.

> The above is the host bare metal result. Now let's try guest mode on
> the threadripper. The last two lines seems more reliable (the first
> two lines also needs to fault in the guest RAM because the guest
> was fresh booted).
>
> guest backed by local 2M pages:
>
> random writes MADV_HUGEPAGE 16025855 usec
> random writes MADV_NOHUGEPAGE 21903002 usec
> random writes MADV_NOHUGEPAGE 19762767 usec
> random writes MADV_HUGEPAGE 15189231 usec
>
> guest backed by remote 2M pages:
>
> random writes MADV_HUGEPAGE 25434251 usec
> random writes MADV_NOHUGEPAGE 32404119 usec
> random writes MADV_NOHUGEPAGE 31455592 usec
> random writes MADV_HUGEPAGE 22248304 usec
>
> guest backed by local 4k pages:
>
> random writes MADV_HUGEPAGE 28945251 usec
> random writes MADV_NOHUGEPAGE 32217690 usec
> random writes MADV_NOHUGEPAGE 30664731 usec
> random writes MADV_HUGEPAGE 22981082 usec
>
> guest backed by remote 4k pages:
>
> random writes MADV_HUGEPAGE 43772939 usec
> random writes MADV_NOHUGEPAGE 52745664 usec
> random writes MADV_NOHUGEPAGE 51632065 usec
> random writes MADV_HUGEPAGE 40263194 usec
>
> I haven't yet tried the guest mode on the skylake nor
> haswell/broadwell. I can do that too but I don't expect a significant
> difference.
>
> On a threadripper guest, the remote 2m is practically identical to
> local 4k. So shutting down compaction to try to generate local 4k
> memory looks a sure loss.
>

I'm assuming your results above are with a defrag setting of "madvise" or
"defer+madvise".

> Even if we ignore the guest mode results completely, if we don't make
> assumption on the workload to be able to fit in the node, if I use
> MADV_HUGEPAGE I think I'd prefer the risk of a -10% slowdown if the
> THP page ends up in a remote node, than not getting the +41% THP
> speedup on remote memory if the pagetable ends up being remote or the
> 4k page itself ends up being remote over time.
>

I'm agreeing with you that the preference for remote hugepages over local
small pages depends on the configuration and the workload that you are
running and there are clear advantages and disadvantages to both. This is
different than what the long-standing NUMA preferences have been for thp
allocations.
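For anyone wanting to recheck the -10%, +41% and -29% figures discussed
above, a minimal sketch of the arithmetic follows. It uses the first run
of each quoted threadripper bare-metal configuration. The function names
are made up for illustration and are not part of numa-thp-bench, and the
results can differ from the quoted summary by about a point depending on
which run is taken and how it is rounded.

```python
# Timings in usec, copied from the quoted threadripper bare-metal runs.
# Lower is better. "Remote" means memory bound to node 0 (-m 0) while
# running on CPU 8, which belongs to node 1.
LOCAL_4K = 17543070   # numactl -m 0 -C 0, MADV_NOHUGEPAGE
REMOTE_2M = 19663515  # numactl -m 0 -C 8, MADV_HUGEPAGE
REMOTE_4K = 27819864  # numactl -m 0 -C 8, MADV_NOHUGEPAGE


def throughput_change(before_usec: int, after_usec: int) -> float:
    """Percent change in throughput going from 'before' to 'after'.

    Positive means the 'after' configuration finishes the same work faster.
    """
    return (before_usec / after_usec - 1.0) * 100.0


def runtime_change(before_usec: int, after_usec: int) -> float:
    """Percent change in compute time; negative means 'after' is faster."""
    return (after_usec / before_usec - 1.0) * 100.0


if __name__ == "__main__":
    print(f"local 4k  -> remote 2m: {throughput_change(LOCAL_4K, REMOTE_2M):+.1f}% throughput")
    print(f"remote 4k -> remote 2m: {throughput_change(REMOTE_4K, REMOTE_2M):+.1f}% throughput")
    print(f"remote 4k -> remote 2m: {runtime_change(REMOTE_4K, REMOTE_2M):+.1f}% compute time")
```

The two views are the same ratio inverted, which is why a +41% throughput
gain shows up as roughly -29% compute time rather than -41%.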

I think we can optimize for *both* usecases without causing an unnecessary
regression for the other, and doing so is not extremely complex.

Since it depends on the workload, specifically workloads that fit within a
single node, I think the reasonable approach would be to have a sane
default regardless of the use of MADV_HUGEPAGE or thp defrag settings and
then optimize for the minority of cases where the workload does not fit in
a single node. I'm assuming there is no debate about these larger
workloads being in the minority, although we have single machines where
this encompasses the totality of their workloads.

Regarding the role of direct reclaim in the allocator, I think we need to
work on the feedback from compaction to determine whether it's worthwhile.
That's difficult because of the point I continue to bring up:
isolate_freepages() is not necessarily always able to access this freed
memory. But for cases where we get COMPACT_SKIPPED because the order-0
watermarks are failing, reclaim *is* likely to have an impact in the
success of compaction, otherwise we fail and defer because it wasn't able
to make a hugepage available.

[ If we run compaction regardless of the order-0 watermark check and find
  a pageblock where we can likely free a hugepage because it is
  fragmented movable pages, this is a pretty good indication that reclaim
  is worthwhile iff the reclaimed memory is beyond the migration scanner. ]

Let me try to list out what I think is a reasonable design for the various
configs assuming we are able to address the reclaim concern above.
Note that this is for the majority of users where workloads do not span
multiple nodes:

 - defrag=always: always compact, obviously

 - defrag=madvise/defer+madvise:

   - MADV_HUGEPAGE: always compact locally, fallback to small pages
     locally (small pages become eligible for khugepaged to collapse
     locally later, no chance of additional access latency)

   - neither MADV_HUGEPAGE nor MADV_NOHUGEPAGE: kick kcompactd locally,
     fallback to small pages locally

 - defrag=defer: kick kcompactd locally, fallback to small pages locally

 - defrag=never: fallback to small pages locally

And that, AFAICT, has been the implementation for almost four years.

For workloads that *can* span multiple nodes, this doesn't make much
sense, as you point out and have reported in your bug. Considering the
reclaim problem separately where we thrash a node unnecessarily, if we
consider only hugepages and NUMA locality:

 - defrag=always: always compact for all allowed zones, zonelist ordered
   according to NUMA locality

 - defrag=madvise/defer+madvise:

   - MADV_HUGEPAGE: always compact for all allowed zones, try to allocate
     hugepages in zonelist order, only fallback to small pages when
     compaction fails

   - neither MADV_HUGEPAGE nor MADV_NOHUGEPAGE: kick kcompactd for all
     allowed zones, fallback to small pages locally

 - defrag=defer: kick kcompactd for all allowed zones, fallback to small
   pages locally

 - defrag=never: fallback to small pages locally

For this policy to be possible, we must clear __GFP_THISNODE. How to
determine when to do this? I think we have three options: heuristics (rss
vs zone managed pages), per-process prctl(), or global thp setting for
machine-wide behavior.

I've been suggesting a per-process prctl() that can be set and carried
across fork so that there are no changes needed to any workload and can
simply special-case the thp allocation policy to use __GFP_THISNODE, which
is the default for bare metal, and to not use it when we've said the
workload will span multiple nodes.
Depending on the size of the workload, it may choose to use this setting on certain systems and not others.
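
To make the two lists above easier to compare side by side, here is a
rough sketch that encodes them as a lookup. It is only an illustration of
the proposal as written in this message, not kernel code: ThpPolicy,
thp_policy() and the spans_nodes flag are hypothetical names (spans_nodes
stands in for whatever the heuristic, prctl() or global setting would
report), and MADV_NOHUGEPAGE regions are left out since no hugepage
allocation is attempted for them.

```python
from typing import NamedTuple


class ThpPolicy(NamedTuple):
    compaction: str     # "direct", "kcompactd", or "none"
    scope: str          # "local", "all allowed zones", or "n/a"
    gfp_thisnode: bool  # whether __GFP_THISNODE stays set for the attempt


DEFRAG_SETTINGS = ("always", "madvise", "defer+madvise", "defer", "never")


def thp_policy(defrag: str, madv_hugepage: bool, spans_nodes: bool) -> ThpPolicy:
    """Encode the two policy lists from the message (hypothetical helper).

    defrag        -- one of DEFRAG_SETTINGS
    madv_hugepage -- True if the region was marked with MADV_HUGEPAGE
    spans_nodes   -- True if the workload is said to span multiple nodes
    """
    if defrag not in DEFRAG_SETTINGS:
        raise ValueError(f"unknown defrag setting: {defrag!r}")

    if defrag == "always":
        compaction = "direct"
    elif defrag in ("madvise", "defer+madvise"):
        # The lists treat both settings the same way: compact directly
        # for MADV_HUGEPAGE regions, otherwise only kick kcompactd.
        compaction = "direct" if madv_hugepage else "kcompactd"
    elif defrag == "defer":
        compaction = "kcompactd"
    else:  # "never": fall back to small pages locally without compacting
        compaction = "none"

    if compaction == "none":
        scope = "n/a"
    else:
        # Assumption: "local" also applies to defrag=always in the
        # single-node list, which does not spell out a scope.
        scope = "all allowed zones" if spans_nodes else "local"

    # The multi-node policy requires clearing __GFP_THISNODE.
    return ThpPolicy(compaction, scope, gfp_thisnode=not spans_nodes)


if __name__ == "__main__":
    for setting in DEFRAG_SETTINGS:
        print(setting, thp_policy(setting, madv_hugepage=True, spans_nodes=False))
```

The only input that differs between the two lists is spans_nodes, which is
the point of the proposal: the defrag and madvise handling stays the same
and just the node scope and __GFP_THISNODE change.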