From: David Rientjes <rientjes@google.com>
To: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>
Subject: Re: [patch for-5.3 0/4] revert immediate fallback to remote hugepages
Date: Sun, 24 Nov 2019 16:10:53 -0800 (PST)	[thread overview]
Message-ID: <alpine.DEB.2.21.1911241548340.192260@chino.kir.corp.google.com> (raw)
In-Reply-To: <20191113112042.GG28938@suse.de>

On Wed, 13 Nov 2019, Mel Gorman wrote:

> > > The whole point of Vlastimil's patch is to have an optimistic local
> > > node allocation first and the full gfp context one in the fallback
> > > path.  If our full gfp context doesn't really work well, then we can
> > > revisit that of course, but that should happen at the
> > > alloc_hugepage_direct_gfpmask level.
> > 
> > Since the patch reverts the precaution put into the page allocator to not 
> > attempt reclaim if the allocation order is significantly large and the 
> > return value from compaction specifies it is unlikely to succeed on its 
> > own, I believe Vlastimil's patch will cause the same regression that 
> > Andrea saw if the whole host is low on memory and/or significantly 
> > fragmented.  So the suggestion was that he test this change to make sure 
> > we aren't introducing a regression for his workload.
> 
> TLDR: I do not have evidence that Vlastimil's patch causes more swapping
> 	but more information is needed from Andrea on exactly how he's
> 	testing this. It's not clear to me what was originally tested
> 	and whether memory just had to be full or whether it had to be
> 	fragmented. If fragmented, then we have to agree on what an
> 	appropriate mechanism is for fragmenting memory. Hypothetical
> 	kernel modules that don't exist do not count.
> 
> I put together a testcase whereby a virtual machine is deployed and
> started, and I then time how long it takes to run memhog on 80% of the
> guest's physical memory. I varied how large the virtual machine is and
> ran it on a 2-socket machine so that the smaller tests would fit within
> a single node and the larger tests would span both nodes. Before each
> startup, a large file is read to fill the memory with pagecache.
> 

First, thanks very much for the follow-up and considerable amount of time 
testing and benchmarking this.
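
In case others want to reproduce this, my understanding is that the 
memhog step amounts to roughly the following; this is my approximation 
of such a harness, not Mel's actual test, and sizing to 80% of guest 
memory is left to the caller:

  /*
   * Approximation of the memhog step: fault in an anonymous mapping
   * covering "bytes" by touching one byte per 4K page.  This is my
   * sketch, not the actual harness used above.
   */
  #include <stdlib.h>
  #include <sys/mman.h>

  static void touch_memory(size_t bytes)
  {
          char *buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (buf == MAP_FAILED)
                  abort();
          for (size_t off = 0; off < bytes; off += 4096)
                  buf[off] = 1;   /* force a fault per page */
          munmap(buf, bytes);
  }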

I, like you, do not have a reliable test case that will reproduce the 
issue that Andrea initially reported over a year ago.  I believe the 
discussion that repeatedly referred to swap storms established that, with 
the __GFP_THISNODE policy, we were not thrashing simply because the local 
node was low on memory due to page cache.  How memory is filled with page 
cache will naturally affect how it can be reclaimed when compaction fails 
locally, so I don't know if this is an accurate representation of the 
initial problem.  I also don't recall details about the swapfile or 
exactly where we were contending while trying to fault local hugepages.
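
For reference, the __GFP_THISNODE behavior in question is simply that 
the hugepage fault is pinned to the faulting node, so all of the reclaim 
and compaction pressure it generates stays local.  Illustratively (this 
is a simplification, not the actual call chain):

  /*
   * Simplified illustration, not the real code: with __GFP_THISNODE
   * the THP fault can only be satisfied from the local node and there
   * is no remote fallback at this call site, so any reclaim or
   * compaction it triggers is local as well.
   */
  gfp_t gfp = GFP_TRANSHUGE | __GFP_THISNODE;
  struct page *page = __alloc_pages_node(numa_node_id(), gfp,
                                         HPAGE_PMD_ORDER);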

My concern, and it's only a concern at this point and not a regression 
report because we don't have a result from Andrea, is that this patch 
makes the second allocation in alloc_pages_vma() use the exact same 
allocation policy that Andrea reported was a problem earlier whenever 
__GFP_DIRECT_RECLAIM is set, which it will be as a result of qemu's use 
of MADV_HUGEPAGE.
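
To spell out what I mean, my reading of the patched alloc_pages_vma() 
flow is roughly the following paraphrase (not a verbatim quote of the 
patch):

  /* First pass: optimistic, local node only, minimal effort. */
  page = __alloc_pages_node(hpage_node,
                            gfp | __GFP_THISNODE | __GFP_NORETRY,
                            order);
  /*
   * Second pass: because qemu used MADV_HUGEPAGE,
   * alloc_hugepage_direct_gfpmask() set __GFP_DIRECT_RECLAIM, so we
   * fall through to the full mempolicy allocation where reclaim and
   * compaction may now run against every node.
   */
  if (!page && (gfp & __GFP_DIRECT_RECLAIM))
          page = __alloc_pages_nodemask(gfp, order,
                                        policy_node(gfp, pol, node),
                                        policy_nodemask(gfp, pol));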

Reclaim and compaction are now done over the entire system rather than 
isolated to the local node, so there are two plausible outcomes: (1) the 
remote node is not fragmented and we can easily fault a remote hugepage, 
or (2) we thrash and cause swap storms remotely as well as locally.

(1) is the outcome that Andrea is seeking based on the initial reverts: 
that much we know.  So my concern is that, if the *system* is fragmented, 
we have now introduced a much more significant swap storm that will 
result in a much more serious regression.

So my question would be: if the previous behavior, which allowed 
excessive swap and repeated calls back into compaction, was deemed 
harmful for the local node, why do we now believe it cannot be harmful 
when done for all system memory?  The patch subverts the precaution put 
into place in the page allocator specifically to avoid this excessive 
reclaim-and-recompact dance, and I believe it restores the previous bad 
behavior when remote memory is similarly fragmented.  (What prevents 
this?)
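
For reference, the precaution I'm referring to is the costly-order check 
in __alloc_pages_slowpath(); from memory it looks roughly like this 
(paraphrased, not an exact quote):

  /*
   * Paraphrase from memory of the check the fallback pass now
   * bypasses: for pageblock-or-larger orders, if compaction reports
   * it was skipped or deferred, fail the allocation rather than
   * entering reclaim, unless the caller explicitly asked to retry
   * with __GFP_RETRY_MAYFAIL.
   */
  if (order >= pageblock_order && (gfp_mask & __GFP_IO) &&
      !(gfp_mask & __GFP_RETRY_MAYFAIL)) {
          switch (compact_result) {
          case COMPACT_SKIPPED:
          case COMPACT_DEFERRED:
                  goto nopage;
          default:
                  break;
          }
  }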

Andrea was able to test this on several kernel versions with a 
fragmented local node, so I *assume* it would not be difficult to measure 
the extent to which this patch can become harmful if all memory is 
fragmented.  I'm hoping that we can quantify that potentially negative 
impact before opening users up to the possibility.  As you said, it 
behaves better on some systems and workloads and worse on others, and we 
both agree more information is needed.

I think asking Andrea to test and quantify the change on a fragmented 
system would help us make a more informed decision and avoid adding a 
potential regression to 5.5 or whichever kernel this would be merged 
into.
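
If it would help us converge on a fragmentation mechanism, one 
userspace-only approach, offered as a sketch rather than a validated 
methodology, is to fault in a large anonymous region and then punch out 
every other 4K page so the freed pages are physically scattered:

  /*
   * Sketch only, not a validated methodology: fault in a large
   * anonymous region, then free every other 4K page with
   * MADV_DONTNEED so the freed pages are scattered across pageblocks
   * and few pageblocks remain fully free.
   */
  #include <stddef.h>
  #include <sys/mman.h>

  static void fragment(char *buf, size_t bytes)
  {
          for (size_t off = 0; off < bytes; off += 4096)
                  buf[off] = 1;           /* fault everything in */
          for (size_t off = 0; off < bytes; off += 2 * 4096)
                  madvise(buf + off, 4096, MADV_DONTNEED);
  }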


Thread overview: 49+ messages
2019-09-04 19:54 [patch for-5.3 0/4] revert immediate fallback to remote hugepages David Rientjes
2019-09-04 19:54 ` [rfc 3/4] mm, page_alloc: avoid expensive reclaim when compaction may not succeed David Rientjes
2019-09-05  9:00   ` Michal Hocko
2019-09-05 11:22     ` Vlastimil Babka
2019-09-05 20:53       ` Mike Kravetz
2019-09-06 20:16         ` David Rientjes
2019-09-06 20:49       ` David Rientjes
2019-09-04 20:43 ` [patch for-5.3 0/4] revert immediate fallback to remote hugepages Linus Torvalds
2019-09-05 20:54   ` David Rientjes
2019-09-07 19:51     ` David Rientjes
2019-09-07 19:55       ` Linus Torvalds
2019-09-08  1:50         ` David Rientjes
2019-09-08 12:47           ` Vlastimil Babka
2019-09-08 20:45             ` David Rientjes
2019-09-09  8:37               ` Michal Hocko
2019-09-04 20:55 ` Andrea Arcangeli
2019-09-05 21:06   ` David Rientjes
2019-09-09 19:30     ` Michal Hocko
2019-09-25  7:08       ` Michal Hocko
2019-09-26 19:03         ` David Rientjes
2019-09-27  7:48           ` Michal Hocko
2019-09-28 20:59             ` Linus Torvalds
2019-09-30 11:28               ` Michal Hocko
2019-10-01  5:43                 ` Michal Hocko
2019-10-01  8:37                   ` Michal Hocko
2019-10-18 14:15                     ` Michal Hocko
2019-10-23 11:03                       ` Vlastimil Babka
2019-10-24 18:59                         ` David Rientjes
2019-10-29 14:14                           ` Vlastimil Babka
2019-10-29 15:15                             ` Michal Hocko
2019-10-29 21:33                               ` Andrew Morton
2019-10-29 21:45                                 ` Vlastimil Babka
2019-10-29 23:25                                 ` David Rientjes
2019-11-05 13:02                                   ` Michal Hocko
2019-11-06  1:01                                     ` David Rientjes
2019-11-06  7:35                                       ` Michal Hocko
2019-11-06 21:32                                         ` David Rientjes
2019-11-13 11:20                                           ` Mel Gorman
2019-11-25  0:10                                             ` David Rientjes [this message]
2019-11-25 11:47                                               ` Michal Hocko
2019-11-25 20:38                                                 ` David Rientjes
2019-11-25 21:34                                                   ` Vlastimil Babka
2019-10-01 13:50                   ` Vlastimil Babka
2019-10-01 20:31                     ` David Rientjes
2019-10-01 21:54                       ` Vlastimil Babka
2019-10-02 10:34                         ` Michal Hocko
2019-10-02 22:32                           ` David Rientjes
2019-10-03  8:00                             ` Vlastimil Babka
2019-10-04 12:18                               ` Michal Hocko
