Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
From: Mel Gorman <mgorman@suse.de>
To: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	Andrea Argangeli <andrea@kernel.org>,
	Zi Yan <zi.yan@cs.rutgers.edu>,
	Stefan Priebe - Profihost AG <s.priebe@profihost.ag>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Stable tree <stable@vger.kernel.org>,
	Michal Hocko <mhocko@suse.com>
Subject: Re: [PATCH 1/2] mm: thp:  relax __GFP_THISNODE for MADV_HUGEPAGE mappings
Date: Fri, 5 Oct 2018 08:38:54 +0100
Message-ID: <20181005073854.GB6931@suse.de> (raw)
In-Reply-To: <alpine.DEB.2.21.1810041302330.16935@chino.kir.corp.google.com>

On Thu, Oct 04, 2018 at 01:16:32PM -0700, David Rientjes wrote:
> On Tue, 25 Sep 2018, Michal Hocko wrote:
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index da858f794eb6..149b6f4cf023 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -2046,8 +2046,36 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
> >  		nmask = policy_nodemask(gfp, pol);
> >  		if (!nmask || node_isset(hpage_node, *nmask)) {
> >  			mpol_cond_put(pol);
> > -			page = __alloc_pages_node(hpage_node,
> > -						gfp | __GFP_THISNODE, order);
> > +			/*
> > +			 * We cannot invoke reclaim if __GFP_THISNODE
> > +			 * is set. Invoking reclaim with
> > +			 * __GFP_THISNODE set, would cause THP
> > +			 * allocations to trigger heavy swapping
> > +			 * despite there may be tons of free memory
> > +			 * (including potentially plenty of THP
> > +			 * already available in the buddy) on all the
> > +			 * other NUMA nodes.
> > +			 *
> > +			 * At most we could invoke compaction when
> > +			 * __GFP_THISNODE is set (but we would need to
> > +			 * refrain from invoking reclaim even if
> > +			 * compaction returned COMPACT_SKIPPED because
> > +			 * there wasn't not enough memory to succeed
> > +			 * compaction). For now just avoid
> > +			 * __GFP_THISNODE instead of limiting the
> > +			 * allocation path to a strict and single
> > +			 * compaction invocation.
> > +			 *
> > +			 * Supposedly if direct reclaim was enabled by
> > +			 * the caller, the app prefers THP regardless
> > +			 * of the node it comes from so this would be
> > +			 * more desiderable behavior than only
> > +			 * providing THP originated from the local
> > +			 * node in such case.
> > +			 */
> > +			if (!(gfp & __GFP_DIRECT_RECLAIM))
> > +				gfp |= __GFP_THISNODE;
> > +			page = __alloc_pages_node(hpage_node, gfp, order);
> >  			goto out;
> >  		}
> >  	}
> 
> This causes, on average, a 13.9% access latency regression on Haswell, and 
> the regression would likely be more severe on Naples and Rome.
> 

That assumes that fragmentation prevents easy allocation which may very
well be the case. While it would be great that compaction or the page
allocator could be further improved to deal with fragmentation, it's
outside the scope of this patch.

> There exist libraries that allow the .text segment of processes to be 
> remapped to memory backed by transparent hugepages and use MADV_HUGEPAGE 
> to stress local compaction to defragment node local memory for hugepages 
> at startup. 

That is taking advantage of a co-incidence of the implementation.
MADV_HUGEPAGE is *advice* that huge pages be used, not what the locality
is. A hint for strong locality preferences should be separate advice
(madvise) or a separate memory policy. Doing that is outside the context
of this patch but nothing stops you introducing such a policy or madvise,
whichever you think would be best for the libraries to consume (I'm only
aware of libhugetlbfs but there might be others).

> The cost, including the statistics Mel gathered, is 
> acceptable for these processes: they are not concerned with startup cost, 
> they are concerned only with optimal access latency while they are 
> running.
> 

Then such applications at startup have the option of setting
zone_reclaim_mode during initialisation assuming a privileged helper
can be created. That would be somewhat heavy handed and a longer-term
solution would still be to create a proper memory policy of madvise flag
for those libraries.

> So while it may take longer to start the process because memory compaction 
> is attempting to allocate hugepages with __GFP_DIRECT_RECLAIM, in the 
> cases where compaction is successful, this is a very significant long-term 
> win.  In cases where compaction fails, falling back to local pages of the 
> native page size instead of remote thp is a win for the remaining time 
> this process wins: as stated, 13.9% faster for all memory accesses to the 
> process's text while it runs on Haswell.
> 

Again, I remind you that it only benefits applications that prefectly
fit into NUMA nodes. Not all applications are created with that level of
awareness and easily get thrashed if using MADV_HUGEPAGE and do not fit
into a NUMA node.

While it is unfortunate that there are specialised applications that
benefit from the current configuration, I bet there is heavier usage of
qemu affected by the bug this patch addresses than specialised
applications that both fit perfectly into NUMA nodes and are extremely
sensitive to access latencies. It's a question of causing the least harm
to the most users which is what this patch does.

If you need behaviour for more agressive reclaim or locality hints then
kindly introduce them and do not depend in MADV_HUGEPAGE accidentically
doubling up as hints about memory locality.

-- 
Mel Gorman
SUSE Labs

  parent reply index

Thread overview: 74+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-09-25 12:03 [PATCH 0/2] thp nodereclaim fixes Michal Hocko
2018-09-25 12:03 ` [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings Michal Hocko
2018-09-25 12:20   ` Mel Gorman
2018-09-25 12:30     ` Michal Hocko
2018-10-04 20:16   ` David Rientjes
2018-10-04 21:10     ` Andrea Arcangeli
2018-10-04 23:05       ` David Rientjes
2018-10-06  3:19         ` Andrea Arcangeli
2018-10-05  7:38     ` Mel Gorman [this message]
2018-10-05 20:35       ` David Rientjes
2018-10-05 23:21         ` Andrea Arcangeli
2018-10-08 20:41           ` David Rientjes
2018-10-09  9:48             ` Mel Gorman
2018-10-09 12:27               ` Michal Hocko
2018-10-09 13:00                 ` Mel Gorman
2018-10-09 14:25                   ` Michal Hocko
2018-10-09 15:16                     ` Mel Gorman
2018-10-09 23:03                     ` Andrea Arcangeli
2018-10-10 21:19                       ` David Rientjes
2018-10-15 22:30                         ` David Rientjes
2018-10-15 22:44                           ` Andrew Morton
2018-10-15 23:19                             ` Andrea Arcangeli
2018-10-22 20:54                               ` David Rientjes
2018-10-16  7:46                             ` Mel Gorman
2018-10-16 22:37                               ` Andrew Morton
2018-10-16 23:11                                 ` Andrea Arcangeli
2018-10-16 23:16                                   ` Andrew Morton
2018-10-17  7:08                                     ` Michal Hocko
2018-10-17  9:00                                 ` Mel Gorman
2018-10-22 21:04                               ` David Rientjes
2018-10-23  1:27                                 ` Zi Yan
2018-10-28 21:45                                   ` David Rientjes
2018-10-23  7:57                                 ` Mel Gorman
2018-10-23  8:38                                   ` Mel Gorman
2018-10-15 22:57                           ` Andrea Arcangeli
2018-10-22 20:45                             ` David Rientjes
2018-10-09 22:17               ` David Rientjes
2018-10-09 22:51                 ` Andrea Arcangeli
2018-10-10  7:54                   ` Vlastimil Babka
2018-10-10 21:00                   ` David Rientjes
2018-10-09 13:08             ` Vlastimil Babka
2018-10-09 22:21             ` Andrea Arcangeli
2018-10-29  5:17   ` Balbir Singh
2018-10-29  9:00     ` Michal Hocko
2018-10-29  9:42       ` Balbir Singh
2018-10-29 10:08         ` Michal Hocko
2018-10-29 10:56           ` Andrea Arcangeli
2018-09-25 12:03 ` [PATCH 2/2] mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask Michal Hocko
2018-09-26 13:30   ` Kirill A. Shutemov
2018-09-26 14:17     ` Michal Hocko
2018-09-26 14:22       ` Michal Hocko
2018-10-19  2:11         ` Andrew Morton
2018-10-19  8:06           ` Michal Hocko
2018-10-22 13:27             ` Vlastimil Babka
2018-10-24 23:17               ` Andrew Morton
2018-10-25  4:56                 ` Vlastimil Babka
2018-10-25 16:14                   ` Michal Hocko
2018-10-25 16:18                     ` Andrew Morton
2018-10-25 16:45                       ` Michal Hocko
2018-10-22 13:15         ` Vlastimil Babka
2018-10-22 13:30           ` Michal Hocko
2018-10-22 13:35             ` Vlastimil Babka
2018-10-22 13:46               ` Michal Hocko
2018-10-22 13:53                 ` Vlastimil Babka
2018-10-04 20:17     ` David Rientjes
2018-10-04 21:49       ` Zi Yan
2018-10-09 12:36       ` Michal Hocko
2018-09-26 13:08 ` linux-mm@ archive on lore.kernel.org (Was: [PATCH 0/2] thp nodereclaim fixes) Kirill A. Shutemov
2018-09-26 13:14   ` Michal Hocko
2018-09-26 22:22     ` Andrew Morton
2018-09-26 23:08       ` Mel Gorman
2018-09-27  0:47         ` Konstantin Ryabitsev
2018-09-26 15:25   ` Konstantin Ryabitsev
2018-09-27 11:30     ` Kirill A. Shutemov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20181005073854.GB6931@suse.de \
    --to=mgorman@suse.de \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=andrea@kernel.org \
    --cc=kirill@shutemov.name \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=mhocko@suse.com \
    --cc=rientjes@google.com \
    --cc=s.priebe@profihost.ag \
    --cc=stable@vger.kernel.org \
    --cc=vbabka@suse.cz \
    --cc=zi.yan@cs.rutgers.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git