Re: [PATCH -V8 02/10] mm/numa: automatically generate node migration order

From: "Huang, Ying" <ying.huang@intel.com>
To: Zi Yan <ziy@nvidia.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>, <linux-mm@kvack.org>,
	<linux-kernel@vger.kernel.org>, Yang Shi <shy828301@gmail.com>,
	Michal Hocko <mhocko@suse.com>, Wei Xu <weixugc@google.com>,
	David Rientjes <rientjes@google.com>,
	Dan Williams <dan.j.williams@intel.com>,
	"David Hildenbrand" <david@redhat.com>,
	osalvador <osalvador@suse.de>
Subject: Re: [PATCH -V8 02/10] mm/numa: automatically generate node migration order
Date: Tue, 22 Jun 2021 09:14:29 +0800	[thread overview]
Message-ID: <87sg1an1je.fsf@yhuang6-desk2.ccr.corp.intel.com> (raw)
In-Reply-To: <2AA3D792-7F14-4297-8EDD-3B5A7B31AECA@nvidia.com> (Zi Yan's message of "Mon, 21 Jun 2021 10:50:14 -0400")

Zi Yan <ziy@nvidia.com> writes:

> On 19 Jun 2021, at 4:18, Huang, Ying wrote:
>
>> Zi Yan <ziy@nvidia.com> writes:
>>
>>> On 18 Jun 2021, at 2:15, Huang Ying wrote:

[snip]

>>>> +/*
>>>> + * When memory fills up on a node, memory contents can be
>>>> + * automatically migrated to another node instead of
>>>> + * discarded at reclaim.
>>>> + *
>>>> + * Establish a "migration path" which will start at nodes
>>>> + * with CPUs and will follow the priorities used to build the
>>>> + * page allocator zonelists.
>>>> + *
>>>> + * The difference here is that cycles must be avoided.  If
>>>> + * node0 migrates to node1, then neither node1, nor anything
>>>> + * node1 migrates to can migrate to node0.
>>>> + *
>>>> + * This function can run simultaneously with readers of
>>>> + * node_demotion[].  However, it can not run simultaneously
>>>> + * with itself.  Exclusion is provided by memory hotplug events
>>>> + * being single-threaded.
>>>> + */
>>>> +static void __set_migration_target_nodes(void)
>>>> +{
>>>> +	nodemask_t next_pass	= NODE_MASK_NONE;
>>>> +	nodemask_t this_pass	= NODE_MASK_NONE;
>>>> +	nodemask_t used_targets = NODE_MASK_NONE;
>>>> +	int node;
>>>> +
>>>> +	/*
>>>> +	 * Avoid any oddities like cycles that could occur
>>>> +	 * from changes in the topology.  This will leave
>>>> +	 * a momentary gap when migration is disabled.
>>>> +	 */
>>>> +	disable_all_migrate_targets();
>>>> +
>>>> +	/*
>>>> +	 * Ensure that the "disable" is visible across the system.
>>>> +	 * Readers will see either a combination of before+disable
>>>> +	 * state or disable+after.  They will never see before and
>>>> +	 * after state together.
>>>> +	 *
>>>> +	 * The before+after state together might have cycles and
>>>> +	 * could cause readers to do things like loop until this
>>>> +	 * function finishes.  This ensures they can only see a
>>>> +	 * single "bad" read and would, for instance, only loop
>>>> +	 * once.
>>>> +	 */
>>>> +	smp_wmb();
>>>> +
>>>> +	/*
>>>> +	 * Allocations go close to CPUs, first.  Assume that
>>>> +	 * the migration path starts at the nodes with CPUs.
>>>> +	 */
>>>> +	next_pass = node_states[N_CPU];
>>>
>>> Is there a plan of allowing user to change where the migration
>>> path starts? Or maybe one step further providing an interface
>>> to allow user to specify the demotion path. Something like
>>> /sys/devices/system/node/node*/node_demotion.
>>
>> I don't think that's necessary at least for now.  Do you know any real
>> world use case for this?
>
> In our P9+volta system, GPU memory is exposed as a NUMA node.
> For the GPU workloads with data size greater than GPU memory size,
> it will be very helpful to allow pages in GPU memory to be migrated/demoted
> to CPU memory. With your current assumption, GPU memory -> CPU memory
> demotion seems not possible, right? This should also apply to any
> system with a device memory exposed as a NUMA node and workloads running
> on the device and using CPU memory as a lower tier memory than the device
> memory.

Thanks a lot for your use case!  It appears that the demotion path
specified by users is one possible way to satisfy your requirement.  And
I think it's possible to enable that on top of this patchset.  But we
still have no specific plan to work on that at least for now.

Best Regards,
Huang, Ying