From: David Rientjes <rientjes@google.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Mel Gorman <mgorman@suse.de>, Michal Hocko <mhocko@kernel.org>,
	Vlastimil Babka <vbabka@suse.cz>, Zi Yan <zi.yan@cs.rutgers.edu>,
	Stefan Priebe - Profihost AG <s.priebe@profihost.ag>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/2] Revert "mm, thp: restore node-local hugepage allocations"
Date: Tue, 28 May 2019 19:06:57 -0700 (PDT)
Message-ID: <alpine.DEB.2.21.1905281840470.86034@chino.kir.corp.google.com>
In-Reply-To: <20190524202931.GB11202@redhat.com>

On Fri, 24 May 2019, Andrea Arcangeli wrote:

> > > We are going in circles.  *Yes*, there is a potential for swap storms 
> > > today because of the poor interaction between memory compaction and 
> > > directed reclaim, but this is a result of a poor API that does not allow 
> > > userspace to specify that its workload really will span multiple sockets, 
> > > so faulting remotely is the best course of action.  The fix is not to 
> > > cause regressions for others who have implemented a userspace stack 
> > > based on the past 3+ years of long-standing behavior, nor for specialized 
> > > workloads that are known to span multiple sockets and so want some kind 
> > > of different behavior.  We need to provide a clear and stable API to 
> > > define these terms for the page allocator, independent of any global 
> > > setting of thp enabled, defrag, zone_reclaim_mode, etc.  It's workload 
> > > dependent.
> > 
> > um, who is going to do this work?
> 
> That's a good question. It's not going to be a simple patch to
> backport to -stable: it'll be intrusive, it will affect
> mm/page_alloc.c significantly, and so it will hit heavy rejects. I
> wouldn't consider it -stable material, at least in the short term;
> it will require some testing.
> 

Hi Andrea,

I'm not sure which patch you're referring to, unfortunately.  The above 
comment was referring to the APIs made available to userspace to define 
when to fault locally vs remotely and what the preference should be for 
any form of compaction or reclaim to achieve that.  Today we have global 
enabling options, global defrag settings, an enabling prctl, and madvise 
options.  The point it makes is that whether a specific workload fits 
into a single socket is workload dependent, so we are left with the 
prctl and the madvise options.  The prctl either enables thp or it 
doesn't, so it is not interesting here; the madvise is overloaded in 
four different ways (enabling, stalling at fault, collapsibility, 
defrag), so it's not surprising that continuing to overload it for 
existing users will cause undesired results.  The argument is that we 
need a clear and stable means of defining the behavior, not a change to 
the 4+ year behavior that gives those who regress no workaround.
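
For reference, a minimal sketch of the two per-process knobs as they 
exist today; both calls below are the real, current interfaces, nothing 
hypothetical:

#include <stdio.h>
#include <sys/mman.h>
#include <sys/prctl.h>

int main(void)
{
	size_t len = 512UL << 20;	/* a THP-eligible anonymous region */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * The overloaded hint: one call that opts the range into
	 * hugepage enabling, stalling at fault, khugepaged collapse,
	 * and defrag behavior all at once.
	 */
	if (madvise(buf, len, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");

	/*
	 * The existing prctl is all-or-nothing: it disables THP for
	 * the whole process (inherited across fork), so it cannot
	 * express a local vs remote placement preference.
	 */
	if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
		perror("prctl(PR_SET_THP_DISABLE)");

	return 0;
}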

> This is why applying a simple fix that avoids the swap storms (and the
> swap-less pathological THP regression for vfio device assignment GUP
> pinning) is preferable before adding an alloc_pages_multi_order (or
> equivalent), so that it'll be the allocator that decides when exactly
> to fall back from 2M to 4k depending on the NUMA distance and memory
> availability during the zonelist walk. The basic idea is to call
> alloc_pages just once (not first for 2M and then for 4k) and let
> alloc_pages decide which page "order" to return.
> 
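
To make that concrete before responding: a rough sketch of the shape 
such an interface might take, where the name and semantics are purely 
illustrative (no such kernel API exists today):

	/*
	 * Hypothetical: one allocation call for the THP fault path.
	 * The allocator walks the zonelist once and decides, based on
	 * NUMA distance and watermarks, whether to return a 2M page or
	 * fall back to 4k, instead of the caller trying order-9 first
	 * and then retrying with order 0.  On return, *order would be
	 * either HPAGE_PMD_ORDER (a hugepage was cheap enough) or 0 (a
	 * remote hugepage was judged more expensive than local 4k
	 * pages, so the walk fell back).
	 */
	struct page *alloc_pages_multi_order(gfp_t gfp, unsigned int *order);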

The commit description doesn't mention the swap storms that you're trying 
to fix; it's probably better to describe them again, and why it is not 
beneficial to swap unless an entire pageblock can become free or memory 
compaction has indicated that additional memory freeing would allow 
migration to make an entire pageblock free.  I understand that's an 
invasive code change, but merging this patch changes the 4+ year behavior 
that started here:

commit 077fcf116c8c2bd7ee9487b645aa3b50368db7e1
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date:   Wed Feb 11 15:27:12 2015 -0800

    mm/thp: allocate transparent hugepages on local node

And that commit's description describes quite well the regression we 
encounter if __GFP_THISNODE is removed here: the access latency 
regression is much more substantial than what was reported for Naples in 
your changelog.
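
To be explicit about the policy being debated, a simplified 
illustration (not the exact mm/mempolicy.c code), where the question is 
whether the order-9 attempt carries __GFP_THISNODE:

	gfp_t thp_gfp = GFP_TRANSHUGE;	/* the 2M fault attempt */

	/*
	 * The 2015 behavior this patch reverts: compact and reclaim on
	 * the local node only and never fault a remote hugepage.
	 * Dropping this flag is what allows remote THP fallback -- and
	 * what this patch changes for existing MADV_HUGEPAGE users.
	 */
	thp_gfp |= __GFP_THISNODE;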

In the interest of making forward progress, can we agree that swapping 
from the local node *never* makes sense unless we can show that an entire 
pageblock can become free or that it enables memory compaction to migrate 
memory that can make an entire pageblock free?  Are you reporting swap 
storms for the local node when one of these is true?

> > Implementing a new API doesn't help existing userspace, which is hurting
> > from the problem that this patch addresses.
> 
> Yes, we can't change all apps that may not fit in a single NUMA
> node. Currently it's unsafe to set "transparent_hugepage/defrag =
> always" or the bad behavior can then materialize outside of
> MADV_HUGEPAGE too. Those apps that use MADV_HUGEPAGE on their long
> lived allocations (i.e. guest physical memory), like qemu, are
> affected even with the default "defrag = madvise". Those apps have
> been using MADV_HUGEPAGE for more than 3 years and they are widely
> used and open source, of course.
> 

I reiterate that the 4+ year long-standing behavior of MADV_HUGEPAGE is 
overloaded; you are anticipating a specific behavior for workloads that 
do not fit in a single NUMA node, whereas other users developed over the 
past four years are anticipating a different behavior.  I'm trying to 
propose solutions that cannot cause regressions for any user, such as 
the prctl() example that is inherited across fork and can be used to 
define the behavior.  This could be a very trivial extension to 
prctl(PR_SET_THP_DISABLE) or a more elaborate addition (a rough sketch 
follows below).  It would be set by any thread that forks qemu and can 
define that the workload prefers remote hugepages because it spans more 
than one node.  Certainly we should agree that the majority of Linux 
workloads do not span more than one socket.  However, it *may* be 
possible to define this as a global thp setting, since most machines 
that run large guests are only running large guests, so the default 
machine-level policy can reflect that.
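
A rough sketch of the launcher-side opt-in I have in mind follows; the 
PR_SET_THP_POLICY and THP_POLICY_REMOTE values are hypothetical, 
invented only to illustrate the proposal, and the call will fail with 
EINVAL on any current kernel:

#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/wait.h>

/* Hypothetical request and argument -- not in any kernel today. */
#define PR_SET_THP_POLICY	60
#define THP_POLICY_REMOTE	1

int main(void)
{
	/*
	 * The launcher opts in before fork; like PR_SET_THP_DISABLE,
	 * the policy would be inherited by the qemu child, so qemu
	 * itself needs no modification.
	 */
	if (prctl(PR_SET_THP_POLICY, THP_POLICY_REMOTE, 0, 0, 0))
		perror("prctl (expected to fail on current kernels)");

	pid_t pid = fork();
	if (pid == 0) {
		execlp("qemu-system-x86_64", "qemu-system-x86_64",
		       "-m", "4096", (char *)NULL);
		perror("execlp");
		return 1;
	}
	waitpid(pid, NULL, 0);
	return 0;
}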

