Linux-mm Archive on lore.kernel.org
From: Dave Hansen <dave.hansen@intel.com>
To: David Rientjes <rientjes@google.com>,
	Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	kbusch@kernel.org, yang.shi@linux.alibaba.com,
	ying.huang@intel.com, dan.j.williams@intel.com
Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
Date: Wed, 1 Jul 2020 09:48:22 -0700
Message-ID: <c06b4453-c533-a9ba-939a-8877fb301ad6@intel.com>
In-Reply-To: <alpine.DEB.2.22.394.2006301732010.1644114@chino.kir.corp.google.com>

On 6/30/20 5:47 PM, David Rientjes wrote:
> On Mon, 29 Jun 2020, Dave Hansen wrote:
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>
>> If a memory node has a preferred migration path to demote cold pages,
>> attempt to move those inactive pages to that migration node before
>> reclaiming. This will better utilize available memory, provide a faster
>> tier than swapping or discarding, and allow such pages to be reused
>> immediately without IO to retrieve the data.
>>
>> When handling anonymous pages, this will be considered before swap if
>> enabled. Should the demotion fail for any reason, the page reclaim
>> will proceed as if the demotion feature was not enabled.
>>
> 
> Thanks for sharing these patches and kick-starting the conversation, Dave.
> 
> Could this cause us to break a user's mbind() or allow a user to 
> circumvent their cpuset.mems?

In its current form, yes.

My current rationale is that while this is not as deferential to the
user/kernel ABI contract as it could be, it's good *overall* behavior.
The auto-migration only kicks in when the data is about to go away.  So
while the user's data might be slower to access than they'd like, it is
*WAY* faster than they deserve, because by rights it should be out on disk.

> Because we don't have a mapping of the page back to its allocation 
> context (or the process context in which it was allocated), it seems like 
> both are possible.
> 
> So let's assume that migration nodes cannot be other DRAM nodes.  
> Otherwise, memory pressure could be intentionally or unintentionally 
> induced to migrate these pages to another node.  Do we have such a 
> restriction on migration nodes?

There's nothing explicit.  On a normal, balanced system where there's a
1:1:1 relationship between CPU sockets, DRAM nodes and PMEM nodes, it's
implicit since the migration path is one deep and goes from DRAM->PMEM.

If there were some oddball system with a memory-only DRAM node, that
node might very well end up being a migration target.
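
To make the "nothing explicit" point concrete, here is a minimal
userspace sketch (not kernel code) of a one-deep demotion table with
the kind of explicit check David is asking about.  The names
node_kinds, demotion_target and checked_demotion_target, and the
four-node layout, are all hypothetical; the series keeps its real
lookup in patch 1/8, and the posted patches have no such check.

```c
#define NR_NODES 4
#define NO_NODE  (-1)

/* Hypothetical node classification; the RFC series has no such enum. */
enum node_kind { NODE_DRAM_CPU, NODE_DRAM_MEMORY_ONLY, NODE_PMEM };

/* Assumed layout: two sockets, each with one DRAM and one PMEM node. */
static const enum node_kind node_kinds[NR_NODES] = {
	NODE_DRAM_CPU, NODE_DRAM_CPU, NODE_PMEM, NODE_PMEM,
};

/* One-deep migration path: each DRAM node demotes to one PMEM node. */
static const int demotion_target[NR_NODES] = { 2, 3, NO_NODE, NO_NODE };

/*
 * Return the demotion target for 'node', or NO_NODE if there is none
 * or if the target is not a PMEM node.  The second condition is the
 * explicit restriction that the posted patches do not have.
 */
static int checked_demotion_target(int node)
{
	int target;

	if (node < 0 || node >= NR_NODES)
		return NO_NODE;

	target = demotion_target[node];
	if (target == NO_NODE || node_kinds[target] != NODE_PMEM)
		return NO_NODE;

	return target;
}
```

With this assumed layout, nodes 0 and 1 demote to nodes 2 and 3, and
the PMEM nodes have no target.  A table entry pointing at a
NODE_DRAM_MEMORY_ONLY node would be rejected rather than silently used.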

>> Some places we would like to see this used:
>>
>>   1. Persistent memory being used as a slower, cheaper DRAM replacement
>>   2. Remote memory-only "expansion" NUMA nodes
>>   3. Resolving memory imbalances where one NUMA node is seeing more
>>      allocation activity than another.  This helps keep more recent
>>      allocations closer to the CPUs on the node doing the allocating.
> 
> (3) is the concerning one given the above if we are to use 
> migrate_demote_mapping() for DRAM node balancing.

Yeah, agreed.  That's the sketchiest of the three.  :)

>> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
>> +{
>> +	/*
>> +	 * 'mask' targets allocation only to the desired node in the
>> +	 * migration path, and fails fast if the allocation can not be
>> +	 * immediately satisfied.  Reclaim is already active and heroic
>> +	 * allocation efforts are unwanted.
>> +	 */
>> +	gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
>> +			__GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
>> +			__GFP_MOVABLE;
> 
> GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we 
> actually want to kick kswapd on the pmem node?

In my mental model, cold data flows from:

	DRAM -> PMEM -> swap

Kicking kswapd here ensures that while we're doing DRAM->PMEM migrations
for kinda cold data, kswapd can be working on doing the PMEM->swap part
on really cold data.
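
The flag relationship under discussion can be sketched in userspace C.
The DEMO_ names and bit values below are made up for illustration and
do not match the kernel's real definitions; the only thing the sketch
is meant to show is that a "no wait" mask still carries the
kswapd-wakeup bit, and that a caller who did not want the wakeup would
have to mask it off.

```c
#include <stdbool.h>

typedef unsigned int demo_gfp_t;

/* Illustrative bit positions only; NOT the kernel's values. */
#define DEMO_KSWAPD_RECLAIM  (1u << 0)  /* may wake kswapd */
#define DEMO_DIRECT_RECLAIM  (1u << 1)  /* caller may reclaim directly */
#define DEMO_NOWARN          (1u << 2)
#define DEMO_NORETRY         (1u << 3)
#define DEMO_THISNODE        (1u << 4)

/* Models the idea that the no-wait mask still allows kswapd wakeups. */
#define DEMO_NOWAIT          (DEMO_KSWAPD_RECLAIM)

/* Build a demotion allocation mask, optionally without the wakeup. */
static demo_gfp_t demote_mask(bool kick_kswapd)
{
	demo_gfp_t mask = DEMO_NOWAIT | DEMO_NOWARN | DEMO_NORETRY |
			  DEMO_THISNODE;

	if (!kick_kswapd)
		mask &= ~DEMO_KSWAPD_RECLAIM;

	return mask;
}

/* True if an allocation with 'mask' is allowed to wake kswapd. */
static bool wakes_kswapd(demo_gfp_t mask)
{
	return (mask & DEMO_KSWAPD_RECLAIM) != 0;
}
```

In the DRAM -> PMEM -> swap model above, demote_mask(true) is the
behavior being argued for: the wakeup stays in so that the PMEM node's
kswapd can push really cold data onward.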

...
>> @@ -1229,6 +1230,30 @@ static unsigned long shrink_page_list(st
>>  			; /* try to reclaim the page below */
>>  		}
>>  
>> +		rc = migrate_demote_mapping(page);
>> +		/*
>> +		 * -ENOMEM on a THP may indicate either migration is
>> +		 * unsupported or there was not enough contiguous
>> +		 * space. Split the THP into base pages and retry the
>> +		 * head immediately. The tail pages will be considered
>> +		 * individually within the current loop's page list.
>> +		 */
>> +		if (rc == -ENOMEM && PageTransHuge(page) &&
>> +		    !split_huge_page_to_list(page, page_list))
>> +			rc = migrate_demote_mapping(page);
>> +
>> +		if (rc == MIGRATEPAGE_SUCCESS) {
>> +			unlock_page(page);
>> +			if (likely(put_page_testzero(page)))
>> +				goto free_it;
>> +			/*
>> +			 * Speculative reference will free this page,
>> +			 * so leave it off the LRU.
>> +			 */
>> +			nr_reclaimed++;
> 
> nr_reclaimed += nr_pages instead?

Oh, good catch.  I also need to go double-check that 'nr_pages' isn't
wrong elsewhere because of the split.
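
The accounting issue can be seen in a toy model.  The sketch below is
plain userspace C, not the vmscan code: struct demo_page, demo_split
and count_reclaimed are hypothetical, and a THP is assumed to cover
512 base pages (the x86-64 2MB case).  It shows why incrementing by one
undercounts an unsplit THP, and why a page count taken before a split
is stale afterwards.

```c
#include <stdbool.h>

#define DEMO_THP_NR 512  /* assumed base pages per THP (x86-64, 2MB) */

struct demo_page {
	bool huge;  /* still a compound (THP) page? */
};

/* Number of base pages this entry currently covers. */
static unsigned long demo_nr_pages(const struct demo_page *page)
{
	return page->huge ? DEMO_THP_NR : 1;
}

/* Model a successful split: the head now covers one base page. */
static void demo_split(struct demo_page *page)
{
	page->huge = false;
}

/*
 * Count base pages reclaimed for one successfully demoted entry.
 * 'split_first' models the -ENOMEM retry path that splits the THP
 * and then demotes only the head page.
 */
static unsigned long count_reclaimed(struct demo_page *page, bool split_first)
{
	if (split_first && page->huge)
		demo_split(page);

	/* Recompute after the split; a value cached earlier is stale. */
	return demo_nr_pages(page);
}
```

An unsplit THP contributes 512 here, while a THP that went through the
split path contributes 1 for the head, with the tails counted later as
they are processed individually.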

