From: David Rientjes <email@example.com> To: Andrea Arcangeli <firstname.lastname@example.org> Cc: Mel Gorman <email@example.com>, Vlastimil Babka <firstname.lastname@example.org>, Linus Torvalds <email@example.com>, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, Linux List Kernel Mailing <email@example.com>, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, Andrew Morton <email@example.com>, firstname.lastname@example.org Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression Date: Wed, 5 Dec 2018 13:59:32 -0800 (PST) Message-ID: <alpine.DEB.email@example.com> (raw) In-Reply-To: <20181205204034.GB11899@redhat.com> On Wed, 5 Dec 2018, Andrea Arcangeli wrote: > > thpscale Percentage Faults Huge > > 4.20.0-rc4 4.20.0-rc4 > > mmots-20181130 gfpthisnode-v1r1 > > Percentage huge-3 95.14 ( 0.00%) 7.94 ( -91.65%) > > Percentage huge-5 91.28 ( 0.00%) 5.00 ( -94.52%) > > Percentage huge-7 86.87 ( 0.00%) 9.36 ( -89.22%) > > Percentage huge-12 83.36 ( 0.00%) 21.03 ( -74.78%) > > Percentage huge-18 83.04 ( 0.00%) 30.73 ( -63.00%) > > Percentage huge-24 83.74 ( 0.00%) 27.47 ( -67.20%) > > Percentage huge-30 83.66 ( 0.00%) 31.85 ( -61.93%) > > Percentage huge-32 83.89 ( 0.00%) 29.09 ( -65.32%) > > > > They're down the toilet. 3 threads are able to get 95% of the requested > > THP pages with Andrews tree as of Nov 30th. David's patch drops that to > > 8% success rate. > > This is the downside of David's patch very well exposed above. And > this will make non-NUMA system regress like above too despite they > have no issue to begin with (which is probably why nobody noticed the > trouble with __GFP_THISNODE reclaim until recently, combined with the > fact most workloads can fit in a single NUMA node). > > So we're effectively crippling down MADV_HUGEPAGE effectiveness on > non-NUMA (where it cannot help to do so) and on NUMA (as a workaround > for the false positive swapout storms) because in some workload and > system THP improvements are less significant than NUMA improvements. > For context, you're referring to the patch I posted that is similar to __GFP_COMPACT_ONLY and patch 2/2 in my series. It's not referring to the revert of the 4.20-rc commit that relaxes the __GFP_THISNODE restriction on thp faults and conflates MADV_HUGEPAGE with NUMA locality. For 4.20, I believe at minimum that patch 1/2 should be merged to restore what we have had for three years, stop piling more semantics on top of the intent (or perceived intent) of MADV_HUGEPAGE, and address the swap storm issue separately. > The higher fault latency is generally the higher cost you pay to get > the good initial THP utilization for apps that do long lived > allocations and in turn can use MADV_HUGEPAGE without downsides. The > cost of compaction pays off over time. > > Short lived allocations sensitive to the allocation latency should not > use MADV_HUGEPAGE in the first place. If you don't want high latency > you shouldn't use MADV_HUGEPAGE and khugepaged already uses > __GFP_THISNODE but it replaces memory so it has a neutral memory > footprint at it, so it's ok with regard to reclaim. > Completely agreed, and is why we want to try synchronous memory compaction to try to allocate hugepages locally in our usecases as well. We aren't particularly concerned about the allocation latency, that is secondary to the long-lived access latency regression that occurs when you do not set __GFP_THISNODE. > In my view David's workload is the outlier that uses MADV_HUGEPAGE but > pretends a low latency and NUMA local behavior as first priority. If > your workload fits in the per-socket CPU cache it doesn't matter which > node it is but it totally matters if you've 2M or 4k tlb. I'm not even > talking about KVM where THP has a multipler effect with EPT. > Hm, no, we do not mind the high allocation latency for MADV_HUGEPAGE users. We *do* care about access latency and that is due to NUMA locality. Before commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings"), *all* thp faults were done with __GFP_THISNODE and had been for at least three years. That commit conflates MADV_HUGEPAGE with a new semantic that it allows remote allocation instead of what it has done for three years: try harder synchronously to allocate hugepages locally. We obviously need to address the problem in another way and not change long-standing behavior that causes regressions. Either my patch 2/2, __GFP_COMPACT_ONLY, a new mempolicy mode, new madvise mode, prctl, etc. > Even if you make the __GFP_NORETRY change for the HPAGE_PMD_ORDER to > skip reclaim in David's patch conditional NUMA being enabled in the > host (so that it won't cripple THP utilization also on non-NUMA > systems), imagine that you go in the bios, turn off interleaving to > enable host NUMA and THP utilization unexpectedly drops significantly > for your VM. > What's needed is appropriate feedback from memory compaction to determine if reclaim is worthwhile: checking only COMPACT_DEFERRED is insufficient. We need to determine if compaction has failed due to order-0 low watermark checks or whether it simply failed to defragment memory so a hugepage could be allocated. Determining if compaction has failed due to order-0 low watermark checks is harder than it seems because the reclaimed memory may not be accessible by isolate_freepages(); we don't have the ability to only reclaim memory from the end of the zone. Otherwise, reclaim is very unlikely to help. > Rome ryzen architecture has been mentioned several times by David but > in my threadripper (not-Rome, as it's supposed to be available in 2019 > only AFIK) enabling THP made a measurable difference for me for some > workloads. As opposed if I turn off NUMA by setting up the > interleaving in the dimm I get a barely measurable slowdown. So I'm > surprised in Rome there's such a radical difference in behavior. > We can compare Naples if that's better; accessing remote hugepages over local pages of the native page size incurs a 13.8% latency increase intrasocket and 38.9% latency increase intersocket. Accessing remote hugepages over remote pages of the native page size is 2.2% better intrasocket and unchanged intersocket. > Like Mel said we need to work towards a more complete solution than > putting __GFP_THISNODE from the outside and then turning off reclaim > from the inside. Mel made examples of things that should > happen, that won't increase allocation latency and that can't happen > with __GFP_THISNODE. > > I'll try to describe again what's going on: > > 1: The allocator is being asked through __GFP_THISNODE "ignore all > remote nodes for all reclaim and compaction" from the > outside. Compaction then returns COMPACT_SKIPPED and tells the > allocator "I can generate many more huge pages if you reclaim/swapout > 2M of anon memory in this node, the only reason I failed to compact > memory is because there aren't enough 4k fragmented pages free in this > zone". The allocator then goes ahead and swaps 2M and invokes > compaction again that succeeds the order 9 allocation fine. Goto 1; > > The above keeps running in a loop at every additional page fault of > the app using MADV_HUGEPAGE until all RAM of the node is swapped out > and replaced by THP and all others nodes had 100% free memory, > potentially 100% order 9, but the allocator completely ignored all > other nodes. That is the thing that we're fixing here, because such > swap storms caused massive slowdowns. If the workload can't fit in a > single node, it's like running with only a fraction of the RAM. > > So David's patch (and __GFP_COMPACT_ONLY) to fix the above swap storm, > inside the allocator skips reclaim entirely when compaction tells "I > can generate one more HPAGE_PMD_ORDER compound page if you > reclaim/swap 2M", if __GFP_NORETRY is set (and makes sure > __GFP_NORETRY is always set for THP). And that however prevents to > generate any more THP globally the moment any node is full of > filesystem cache. > > NOTE: the filesystem cache will still be shrunk but it'll be shrunk by > 4k allocations only. So we just artificially cripple compaction with > David's patch as shown in the quoted results above. This applied to > __GFP_COMPACT_ONLY too and that's I always said there's lots of margin > for improvement there and even __GFP_COMPACT_ONLY was also a stop-gap > measure. > Right, this is all the same as __GFP_COMPACT_ONLY which implicitly set __GFP_NORETRY as part of the gfp flag itself. > So ultimately we decided that the saner behavior that gives the least > risk of regression for the short term, until we can do something > better, was the one that is already applied upstream. > > Of course David's workload regressed, but that's because it gets a > minuscle improvement from THP, maybe it's seeking across all RAM and > it's very RAM heavy bandwidth-heavy workload so 4k or 2m tlb don't > matter at all for his workload and probably he's running it on bare > metal only. > > I think the challenge here is to get David's workload optimally > without creating the above regression but we don't have a way to do it > right now in an automatic way. It's trivial however to add a mbind new > MPOL_THISNODE or MPOL_THISNODE_THP policy to force THP to set > __GFP_THISNODE and return to the swap storm behavior that needle to > say may have worked best by practically partioning the system, in fact > you may want to use __GFP_THISNODE for 4k allocations too so you > invoke reclaim from the local node before allocating RAM from the > remote nodes. > I'm not sure why we'd be changing the default behavior of three years that causes reported regressions instead of introducing a mempolicy that allows for the remote allocation. This commit from 4.0: commit 077fcf116c8c2bd7ee9487b645aa3b50368db7e1 Author: Aneesh Kumar K.V <firstname.lastname@example.org> Date: Wed Feb 11 15:27:12 2015 -0800 mm/thp: allocate transparent hugepages on local node This make sure that we try to allocate hugepages from local node if allowed by mempolicy. If we can't, we fallback to small page allocation based on mempolicy. This is based on the observation that allocating pages on local node is more beneficial than allocating hugepages on remote node. With this patch applied we may find transparent huge page allocation failures if the current node doesn't have enough freee hugepages. Before this patch such failures result in us retrying the allocation on other nodes in the numa node mask. [email@example.com: fix comment, add CONFIG_TRANSPARENT_HUGEPAGE dependency] Signed-off-by: Aneesh Kumar K.V <firstname.lastname@example.org> Acked-by: Kirill A. Shutemov <email@example.com> Acked-by: Vlastimil Babka <firstname.lastname@example.org> Cc: David Rientjes <email@example.com> Cc: Andrea Arcangeli <firstname.lastname@example.org> Signed-off-by: Andrew Morton <email@example.com> Signed-off-by: Linus Torvalds <firstname.lastname@example.org> Is still needed, and the reason why I'm requesting we maintain that behavior of nearly four years and not conflate MADV_HUGEPAGE any more than it already is. > To me it doesn't seem the requirements of David's workload are the > same as for other MADV_HUGEPAGE users, I can't image other > MADV_HUGEPAGE users not to care at all if the THP utilization drops, > for David it seems THP is a nice thing to have only, and it seems to > care about allocation latency too (which normal apps using > MADV_HUGEPAGE must not). > I have repeatedly said that allocation latency is secondary to the access latency regression that not setting __GFP_THISNODE causes with a fragmented local node on all of our platforms. > In any case David's patch is better than reverting the revert, as the > swap storms are a showstopper compared to crippling down compaction > ability to compact memory when all nodes are full of cache. > I'm going to propose a v2 of my two patch series, including the feedback from Michal in the first. I suggest the first step is to restore the behavior we have had for three years to prevent the regression I'm reporting and the kernel test robot has reported, and then look to improvements that can be made to prevent local thrashing for users who would rather allocate remote hugepages (and perhaps incur the access latency issues if their workload does not span multiple nodes) instead of trying harder to allocate locally.
next prev parent reply index Thread overview: 77+ messages / expand[flat|nested] mbox.gz Atom feed top 2018-11-27 6:25 kernel test robot 2018-11-27 17:08 ` Linus Torvalds 2018-11-27 18:17 ` Michal Hocko 2018-11-27 18:21 ` Michal Hocko 2018-11-27 19:05 ` Vlastimil Babka 2018-11-27 19:16 ` Vlastimil Babka 2018-11-27 20:57 ` Andrea Arcangeli 2018-11-27 22:50 ` Linus Torvalds 2018-11-28 6:30 ` Michal Hocko 2018-11-28 3:20 ` Huang\, Ying 2018-11-28 16:48 ` Linus Torvalds 2018-11-28 18:39 ` Andrea Arcangeli 2018-11-28 23:10 ` David Rientjes 2018-12-03 18:01 ` Linus Torvalds 2018-12-03 18:14 ` Michal Hocko 2018-12-03 18:19 ` Linus Torvalds 2018-12-03 18:30 ` Michal Hocko 2018-12-03 18:45 ` Linus Torvalds 2018-12-03 18:59 ` Michal Hocko 2018-12-03 19:23 ` Andrea Arcangeli 2018-12-03 20:26 ` David Rientjes 2018-12-03 19:28 ` Linus Torvalds 2018-12-03 20:12 ` Andrea Arcangeli 2018-12-03 20:36 ` David Rientjes 2018-12-03 22:04 ` Linus Torvalds 2018-12-03 22:27 ` Linus Torvalds 2018-12-03 22:57 ` David Rientjes 2018-12-04 9:22 ` Vlastimil Babka 2018-12-04 10:45 ` Mel Gorman 2018-12-05 0:47 ` David Rientjes 2018-12-05 9:08 ` Michal Hocko 2018-12-05 10:43 ` Mel Gorman 2018-12-05 11:43 ` Michal Hocko 2018-12-05 10:06 ` Mel Gorman 2018-12-05 20:40 ` Andrea Arcangeli 2018-12-05 21:59 ` David Rientjes [this message] 2018-12-06 0:00 ` Andrea Arcangeli 2018-12-05 22:03 ` Linus Torvalds 2018-12-05 22:12 ` David Rientjes 2018-12-05 23:36 ` Andrea Arcangeli 2018-12-05 23:51 ` Linus Torvalds 2018-12-06 0:58 ` Linus Torvalds 2018-12-06 9:14 ` MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression) Michal Hocko 2018-12-06 23:49 ` David Rientjes 2018-12-07 7:34 ` Michal Hocko 2018-12-07 4:31 ` Linus Torvalds 2018-12-07 7:49 ` Michal Hocko 2018-12-07 9:06 ` Vlastimil Babka 2018-12-07 23:15 ` David Rientjes 2018-12-06 23:43 ` [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression David Rientjes 2018-12-07 4:01 ` Linus Torvalds 2018-12-10 0:29 ` David Rientjes 2018-12-10 4:49 ` Andrea Arcangeli 2018-12-12 0:37 ` David Rientjes 2018-12-12 9:50 ` Michal Hocko 2018-12-12 17:00 ` Andrea Arcangeli 2018-12-14 11:32 ` Michal Hocko 2018-12-12 10:14 ` Vlastimil Babka 2018-12-14 21:04 ` David Rientjes 2018-12-14 21:33 ` Vlastimil Babka 2018-12-21 22:18 ` David Rientjes 2018-12-22 12:08 ` Mel Gorman 2018-12-14 23:11 ` Mel Gorman 2018-12-21 22:15 ` David Rientjes 2018-12-12 10:44 ` Andrea Arcangeli 2019-04-15 11:48 ` Michal Hocko 2018-12-06 0:18 ` David Rientjes 2018-12-06 0:54 ` Andrea Arcangeli 2018-12-06 9:23 ` Vlastimil Babka 2018-12-03 20:39 ` David Rientjes 2018-12-03 21:25 ` Michal Hocko 2018-12-03 21:53 ` David Rientjes 2018-12-04 8:48 ` Michal Hocko 2018-12-05 0:07 ` David Rientjes 2018-12-05 10:18 ` Michal Hocko 2018-12-05 19:16 ` David Rientjes 2018-11-27 7:23 kernel test robot
Reply instructions: You may reply publically to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=alpine.DEB.email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
LKML Archive on lore.kernel.org Archives are clonable: git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git # If you have public-inbox 1.1+ installed, you may # initialize and index your mirror using the following commands: public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \ firstname.lastname@example.org email@example.com public-inbox-index lkml Newsgroup available over NNTP: nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel AGPL code for this site: git clone https://public-inbox.org/ public-inbox