Re: [RFC PATCH] mm/mempolicy: add MPOL_PREFERRED_STRICT memory policy

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Andi Kleen <ak@linux.intel.com>,
	linux-mm@kvack.org, akpm@linux-foundation.org,
	Ben Widawsky <ben.widawsky@intel.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Feng Tang <feng.tang@intel.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Randy Dunlap <rdunlap@infradead.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	Dan Williams <dan.j.williams@intel.com>,
	Huang Ying <ying.huang@intel.com>
Subject: Re: [RFC PATCH] mm/mempolicy: add MPOL_PREFERRED_STRICT memory policy
Date: Thu, 14 Oct 2021 21:20:51 +0530
Message-ID: <249414f6-1bb7-b76c-5b5b-2b3ed8937d7b@linux.ibm.com>
In-Reply-To: <YWhFFOtyVQ8Mespc@dhcp22.suse.cz>

On 10/14/21 20:26, Michal Hocko wrote:
> On Thu 14-10-21 18:59:14, Aneesh Kumar K.V wrote:
>> On 10/14/21 17:11, Michal Hocko wrote:
>>> On Thu 14-10-21 15:58:29, Aneesh Kumar K.V wrote:
>>>> On 10/14/21 15:08, Michal Hocko wrote:
>>> [...]
>>>>> Besides that it would be really great to finish the discussion about the
>>>>> usecase before suggesting a new userspace API.
>>>>>
>>>>
>>>> Application would like to hint a preferred node for allocating memory
>>>> backing a va range and at the same time wants to avoid fallback to some set
>>>> of nodes (in the use case I am interested don't fall back to slow memory
>>>> nodes).
>>>
>>> We do have means for that, right? You can set your memory policy and
>>> then set the cpu affinity to the node you want to allocate from
>>> initially. You can migrate to a different cpu/node if this is not the
>>> preferred affinity. Why is that not usable?
>>
>> For the same reason you mentioned earlier, these nodes can be cpu less
>> nodes.
> 
> It would have been easier if you were explicit about the usecase rather
> than let other guess.
> 
>>> Also think about extensibility. Say I want to allocate from a set of
>>> nodes first before falling back to the rest of the nodemask? If you want
>>> to add a new API then think of other potential usecases.
>>>
>>
>> Describing the specific allocation details become hard with preferred node
>> being a nodemask. With the below interface
>>
>> SYSCALL_DEFINE5(preferred_mbind, unsigned long, start, unsigned long, len,
>> 		const unsigned long __user *, preferred_nmask,
>> 		const unsigned long __user *, fallback_nmask,
>> 		unsigned long, maxnode)
>> {
>>
>>
>> 1. The preferred node is the first node in the preferred node mask
>> 2. Then we try to allocate from nodes present in the preferred node mask
>> which is closer to the first node in the preferred node mask
>> 3. If the above fails, we try to allocate from nodes in the fallback node
>> mask which is closer to the first node in the preferred nodemask.
>>
>> Isn't that too complicated? Do we have a real usecase for that?
> 
> No, I think this is a suboptimal interface. AFAIU you really want to
> define a "home" node(s) rather than any policy. Home node would
> effectively override the default local node whatever policy you have as
> it makes sense whether you have MPOL_PREFERRED_MANY or MPOL_BIND.
> 

Yes. I described it as below in an earlier email:

"We could do
set_mempolicy(MPOL_PREFERRED, nodemask(nodeX))
set_mempolicy(MPOL_PREFERRED_EXTEND, nodemask(fallback nodemask for
above PREFERRED policy))"

But I agree that restricting this to a virtual address range is much
better. Now I am wondering whether a nodemask is any better than a
node id. A set of home nodes is a more confusing concept than a single
home node. What would multiple nodes mean in a home node concept?

Should we do

SYSCALL_DEFINE4(home_node_mbind, unsigned long, start, unsigned long, len,
		unsigned long, home_node, unsigned long, flags)

The flags argument is kept for future extension, if any.
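
As an illustration of why a reserved flags argument helps, here is a
userspace sketch of the argument checks such a call might perform. The
function name home_node_mbind_check() and the SKETCH_* limits are
hypothetical and not part of any patch; the only point is that rejecting
every unknown flag bit today lets bits be assigned safely later.

```c
#include <errno.h>

/* Hypothetical limits for this sketch; real values depend on the kernel config. */
#define SKETCH_PAGE_SIZE    4096UL
#define SKETCH_MAX_NUMNODES 64UL

/*
 * Validate arguments the way a home_node_mbind()-style call might.
 * Returns 0 on success or a negative errno value.
 */
static int home_node_mbind_check(unsigned long start, unsigned long len,
				 unsigned long home_node, unsigned long flags)
{
	/* No flag bits are defined yet: reject all of them for now. */
	if (flags != 0)
		return -EINVAL;

	/* The range must start on a page boundary. */
	if (start & (SKETCH_PAGE_SIZE - 1))
		return -EINVAL;

	/* Reject ranges that wrap around the address space. */
	if (start + len < start)
		return -EINVAL;

	/* home_node is a single node id, not a nodemask. */
	if (home_node >= SKETCH_MAX_NUMNODES)
		return -EINVAL;

	return 0;
}
```

With these checks, a caller passing flags == 0 keeps working if flag bits
are defined later, and a caller passing a flag the running kernel does not
know gets -EINVAL instead of silently different behaviour.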

I guess this home node will only apply to the MPOL_BIND and
MPOL_PREFERRED_MANY policies for now?

> Another potential interface would be set_nodeorder which would
> explicitly set the allocation fallback ordering. Again agnostic of the
> underlying memory policy. This would be more generic but the question is
> whether this is not too generic and whether there are usecases for that.
> 

I would suggest we wait for applications that really want a fallback
order other than the distance-based one before adding this. A
distance-based fallback order from a preferred node is well understood
from an application point of view.
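
To make the distance-based order concrete, here is a small userspace
sketch. The distance table, NR_NODES and fallback_order() are made up for
illustration (the table is in the style of SLIT distances, with 10 meaning
local); this is not how the kernel builds its zonelists, only one way to
picture "closest allowed node to the home node first".

```c
#include <stddef.h>

#define NR_NODES 4

/*
 * Hypothetical distance table. In this example nodes 0 and 1 stand for
 * DRAM nodes and nodes 2 and 3 for slower memory nodes.
 */
static const int node_distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 40, 80 },
	{ 20, 10, 80, 40 },
	{ 40, 80, 10, 20 },
	{ 80, 40, 20, 10 },
};

/*
 * Fill order[] with the nodes set in allowed_mask, sorted by increasing
 * distance from home_node. Ties keep the lower node id first.
 * Returns the number of entries written.
 */
static size_t fallback_order(int home_node, unsigned long allowed_mask,
			     int order[NR_NODES])
{
	size_t n = 0;

	for (int node = 0; node < NR_NODES; node++) {
		if (!(allowed_mask & (1UL << node)))
			continue;

		/* Insertion sort keyed on distance from the home node. */
		size_t pos = n;
		while (pos > 0 &&
		       node_distance[home_node][order[pos - 1]] >
		       node_distance[home_node][node]) {
			order[pos] = order[pos - 1];
			pos--;
		}
		order[pos] = node;
		n++;
	}
	return n;
}
```

With home node 1 and an allowed mask covering nodes 0, 1 and 3, the sketch
tries node 1 first, then node 0, then node 3. The home node does not have
to be in the mask: it only sets the point distances are measured from.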

-aneesh