From: Dave Hansen <dave.hansen@intel.com>
To: David Rientjes <rientjes@google.com>,
	Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	kbusch@kernel.org, yang.shi@linux.alibaba.com,
	ying.huang@intel.com, dan.j.williams@intel.com
Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
Date: Wed, 1 Jul 2020 09:48:22 -0700
Message-ID: <c06b4453-c533-a9ba-939a-8877fb301ad6@intel.com>
In-Reply-To: <alpine.DEB.2.22.394.2006301732010.1644114@chino.kir.corp.google.com>

On 6/30/20 5:47 PM, David Rientjes wrote:
> On Mon, 29 Jun 2020, Dave Hansen wrote:
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>
>> If a memory node has a preferred migration path to demote cold pages,
>> attempt to move those inactive pages to that migration node before
>> reclaiming. This will better utilize available memory, provide a faster
>> tier than swapping or discarding, and allow such pages to be reused
>> immediately without IO to retrieve the data.
>>
>> When handling anonymous pages, this will be considered before swap if
>> enabled. Should the demotion fail for any reason, the page reclaim
>> will proceed as if the demotion feature was not enabled.
>>
> 
> Thanks for sharing these patches and kick-starting the conversation, Dave.
> 
> Could this cause us to break a user's mbind() or allow a user to 
> circumvent their cpuset.mems?

In its current form, yes.

My current rationale for this is that while it's not as deferential as
it can be to the user/kernel ABI contract, it's good *overall* behavior.
 The auto-migration only kicks in when the data is about to go away.  So
while the user's data might be slower than they like, it is *WAY* faster
than they deserve because it should be off on the disk.

> Because we don't have a mapping of the page back to its allocation 
> context (or the process context in which it was allocated), it seems like 
> both are possible.
> 
> So let's assume that migration nodes cannot be other DRAM nodes.  
> Otherwise, memory pressure could be intentionally or unintentionally 
> induced to migrate these pages to another node.  Do we have such a 
> restriction on migration nodes?

There's nothing explicit.  On a normal, balanced system where there's a
1:1:1 relationship between CPU sockets, DRAM nodes and PMEM nodes, it's
implicit since the migration path is one deep and goes from DRAM->PMEM.

If there were some oddball system where there was a memory-only DRAM
node, it might very well end up being a migration target.
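
To make the "one deep" point concrete, here is a user-space sketch (not
kernel code) of a per-node demotion table. The names demotion_target,
next_demotion_node(), demotion_path_depth() and the four-node layout are
assumptions for illustration only; they are not taken from the patches.

```c
#define NR_NODES	4
#define NO_NODE		(-1)

/*
 * Hypothetical balanced layout: nodes 0 and 1 are DRAM, nodes 2 and 3
 * are PMEM.  Each DRAM node demotes to exactly one PMEM node, and the
 * PMEM nodes are terminal.
 */
static const int demotion_target[NR_NODES] = { 2, 3, NO_NODE, NO_NODE };

/* Return the node that 'node' demotes to, or NO_NODE if it has none. */
static int next_demotion_node(int node)
{
	if (node < 0 || node >= NR_NODES)
		return NO_NODE;
	return demotion_target[node];
}

/* Count the hops from 'node' to the end of its demotion path. */
static int demotion_path_depth(int node)
{
	int depth = 0;

	while ((node = next_demotion_node(node)) != NO_NODE) {
		depth++;
		if (depth > NR_NODES)	/* guard against a cyclic table */
			break;
	}
	return depth;
}
```

In this layout every DRAM node has a path of depth 1 and nothing in the
table itself prevents a DRAM node from appearing as a target; that
restriction comes only from how the table happens to be populated.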

>> Some places we would like to see this used:
>>
>>   1. Persistent memory being used as a slower, cheaper DRAM replacement
>>   2. Remote memory-only "expansion" NUMA nodes
>>   3. Resolving memory imbalances where one NUMA node is seeing more
>>      allocation activity than another.  This helps keep more recent
>>      allocations closer to the CPUs on the node doing the allocating.
> 
> (3) is the concerning one given the above if we are to use 
> migrate_demote_mapping() for DRAM node balancing.

Yeah, agreed.  That's the sketchiest of the three.  :)

>> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
>> +{
>> +	/*
>> +	 * 'mask' targets allocation only to the desired node in the
>> +	 * migration path, and fails fast if the allocation can not be
>> +	 * immediately satisfied.  Reclaim is already active and heroic
>> +	 * allocation efforts are unwanted.
>> +	 */
>> +	gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
>> +			__GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
>> +			__GFP_MOVABLE;
> 
> GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we 
> actually want to kick kswapd on the pmem node?

In my mental model, cold data flows from:

	DRAM -> PMEM -> swap

Kicking kswapd here ensures that while we're doing DRAM->PMEM migrations
for kinda cold data, kswapd can be working on doing the PMEM->swap part
on really cold data.
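
A user-space sketch of the flag question, under loud assumptions: the
SKETCH_* names and bit values below are placeholders, not the kernel's
definitions. The only property carried over from the discussion is that
the NOWAIT mask includes the kswapd-reclaim bit, so dropping the kswapd
kick would mean clearing that bit explicitly.

```c
typedef unsigned int sketch_gfp_t;

/* Placeholder bit values; the real kernel flags differ. */
#define SKETCH_KSWAPD_RECLAIM	(1u << 0)
#define SKETCH_NOWARN		(1u << 1)
#define SKETCH_NORETRY		(1u << 2)
#define SKETCH_THISNODE		(1u << 3)

/* Mirrors the point raised above: NOWAIT carries the kswapd bit. */
#define SKETCH_NOWAIT		SKETCH_KSWAPD_RECLAIM

/* Build a demotion allocation mask, optionally without the kswapd kick. */
static sketch_gfp_t demote_mask(int kick_kswapd)
{
	sketch_gfp_t mask = SKETCH_NOWAIT | SKETCH_NOWARN |
			    SKETCH_NORETRY | SKETCH_THISNODE;

	if (!kick_kswapd)
		mask &= ~SKETCH_KSWAPD_RECLAIM;
	return mask;
}

/* Nonzero if an allocation with 'mask' would wake kswapd on failure. */
static int wakes_kswapd(sketch_gfp_t mask)
{
	return (mask & SKETCH_KSWAPD_RECLAIM) != 0;
}
```

With demote_mask(1) the target node's kswapd gets woken, which matches
the DRAM -> PMEM -> swap flow described above; demote_mask(0) keeps the
fail-fast behavior but leaves the PMEM node's reclaim untouched.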

...
>> @@ -1229,6 +1230,30 @@ static unsigned long shrink_page_list(st
>>  			; /* try to reclaim the page below */
>>  		}
>>  
>> +		rc = migrate_demote_mapping(page);
>> +		/*
>> +		 * -ENOMEM on a THP may indicate either migration is
>> +		 * unsupported or there was not enough contiguous
>> +		 * space. Split the THP into base pages and retry the
>> +		 * head immediately. The tail pages will be considered
>> +		 * individually within the current loop's page list.
>> +		 */
>> +		if (rc == -ENOMEM && PageTransHuge(page) &&
>> +		    !split_huge_page_to_list(page, page_list))
>> +			rc = migrate_demote_mapping(page);
>> +
>> +		if (rc == MIGRATEPAGE_SUCCESS) {
>> +			unlock_page(page);
>> +			if (likely(put_page_testzero(page)))
>> +				goto free_it;
>> +			/*
>> +			 * Speculative reference will free this page,
>> +			 * so leave it off the LRU.
>> +			 */
>> +			nr_reclaimed++;
> 
> nr_reclaimed += nr_pages instead?

Oh, good catch.  I also need to go double-check that 'nr_pages' isn't
wrong elsewhere because of the split.
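
The accounting difference can be sketched in user space as follows.
struct sketch_page, sketch_nr_pages() and count_reclaimed() are
hypothetical stand-ins, and order 9 assumes a 2MB THP made of 4KB base
pages; none of this is the kernel's code.

```c
/* Hypothetical stand-in for an entry on the reclaim list. */
struct sketch_page {
	unsigned int order;	/* 0 = base page, 9 = assumed THP order */
	int migrated;		/* nonzero if demotion succeeded */
};

/* Number of base pages covered by one entry. */
static unsigned long sketch_nr_pages(const struct sketch_page *page)
{
	return 1UL << page->order;
}

/*
 * Sum reclaimed pages.  With per_base_page set, each migrated entry
 * contributes its base-page count (the "+= nr_pages" form); otherwise
 * each contributes 1 (the "++" form).
 */
static unsigned long count_reclaimed(const struct sketch_page *pages, int n,
				     int per_base_page)
{
	unsigned long nr_reclaimed = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (!pages[i].migrated)
			continue;
		nr_reclaimed += per_base_page ? sketch_nr_pages(&pages[i]) : 1;
	}
	return nr_reclaimed;
}
```

A THP that migrates without being split is undercounted by the "++"
form, while a THP that was split needs its count re-derived from the
resulting base pages, which is why 'nr_pages' deserves a second look
after the split.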
