* [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case @ 2022-08-01 8:42 Muchun Song 2022-08-01 9:06 ` Michal Hocko 0 siblings, 1 reply; 22+ messages in thread From: Muchun Song @ 2022-08-01 8:42 UTC (permalink / raw) To: akpm, bwidawsk, mhocko, dave.hansen, feng.tang Cc: linux-mm, linux-kernel, Muchun Song policy_nodemask() is supposed to return a nodemask representing a mempolicy for filtering nodes for page allocation, which is a hard restriction (see the user of allowed_mems_nr() in hugetlb.c). However, MPOL_PREFERRED_MANY is a preferred mode, not a hard restriction, so returning its nodes here breaks the HugeTLB user. Remove the MPOL_PREFERRED_MANY case from policy_nodemask() to fix this. The change does not affect current users of policy_nodemask(), since all of them already handle the MPOL_PREFERRED_MANY case before calling it. Found by code inspection. Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") Signed-off-by: Muchun Song <songmuchun@bytedance.com> --- mm/mempolicy.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 6c27acb6cd63..4deec7e598c6 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1845,9 +1845,6 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) cpuset_nodemask_valid_mems_allowed(&policy->nodes)) return &policy->nodes; - if (mode == MPOL_PREFERRED_MANY) - return &policy->nodes; - return NULL; } -- 2.11.0 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-01 8:42 [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case Muchun Song @ 2022-08-01 9:06 ` Michal Hocko 2022-08-01 9:26 ` Feng Tang 0 siblings, 1 reply; 22+ messages in thread From: Michal Hocko @ 2022-08-01 9:06 UTC (permalink / raw) To: Muchun Song Cc: akpm, bwidawsk, dave.hansen, feng.tang, linux-mm, linux-kernel On Mon 01-08-22 16:42:07, Muchun Song wrote: > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy > for filtering nodes for page allocation, which is a hard restriction (see the user > of allowed_mems_nr() in hugetlb.c). However, MPOL_PREFERRED_MANY is a preferred > mode not a hard restriction. Now it breaks the user of HugeTLB. Remove it from > policy_nodemask() to fix it, which will not affect current users of policy_nodemask() > since all of the users already have handled the case of MPOL_PREFERRED_MANY before > calling it. BTW, it is found by code inspection. I am not sure this is the right fix. It is quite true that policy_nodemask is a tricky function to use. It pretends to have a higher level logic but all existing users are expected to be policy aware and they special case allocation for each policy. That would mean that hugetlb should do the same. I haven't checked the actual behavior implications for hugetlb here. Is MPOL_PREFERRED_MANY even supported for hugetlb? Does this change make it work? From a quick look this just ignores MPOL_PREFERRED_MANY completely. 
> Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") > Signed-off-by: Muchun Song <songmuchun@bytedance.com> > --- > mm/mempolicy.c | 3 --- > 1 file changed, 3 deletions(-) > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 6c27acb6cd63..4deec7e598c6 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -1845,9 +1845,6 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > return &policy->nodes; > > - if (mode == MPOL_PREFERRED_MANY) > - return &policy->nodes; > - > return NULL; > } > > -- > 2.11.0 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-01 9:06 ` Michal Hocko @ 2022-08-01 9:26 ` Feng Tang 2022-08-02 3:42 ` Muchun Song 0 siblings, 1 reply; 22+ messages in thread From: Feng Tang @ 2022-08-01 9:26 UTC (permalink / raw) To: Michal Hocko Cc: Muchun Song, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel On Mon, Aug 01, 2022 at 05:06:14PM +0800, Michal Hocko wrote: > On Mon 01-08-22 16:42:07, Muchun Song wrote: > > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy > > for filtering nodes for page allocation, which is a hard restriction (see the user > > of allowed_mems_nr() in hugetlb.c). However, MPOL_PREFERRED_MANY is a preferred > > mode not a hard restriction. Now it breaks the user of HugeTLB. Remove it from > > policy_nodemask() to fix it, which will not affect current users of policy_nodemask() > > since all of the users already have handled the case of MPOL_PREFERRED_MANY before > > calling it. BTW, it is found by code inspection. > > I am not sure this is the right fix. It is quite true that > policy_nodemask is a tricky function to use. It pretends to have a > higher level logic but all existing users are expected to be policy > aware and they special case allocation for each policy. That would mean > that hugetlb should do the same. Yes, when I worked on the MPOL_PREFERRED_MANY patches, I was also confused about policy_nodemask(), as it has never been a 'strict' one; the old code was: if (unlikely(mode == MPOL_BIND) && apply_policy_zone(policy, gfp_zone(gfp)) && cpuset_nodemask_valid_mems_allowed(&policy->nodes)) return &policy->nodes; return NULL Even when MPOL_BIND's nodes are not allowed by cpuset, it will still return NULL (which means all nodes). From the semantics of allowed_mems_nr(), I think it does get changed a little by b27abaccf8e8. And to enforce the 'strict' semantics of 'allowed', we may need a stricter nodemask API for it. 
> I haven't checked the actual behavior implications for hugetlb here. Is > MPOL_PREFERRED_MANY even supported for hugetlb? Does this change make it > work? From a quick look this just ignores MPOL_PREFERRED_MANY > completely. IIRC, hugetlb will honor MPOL_PREFERRED_MANY. I can double-check and report back if otherwise. > > Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") > > Signed-off-by: Muchun Song <songmuchun@bytedance.com> > > --- > > mm/mempolicy.c | 3 --- > > 1 file changed, 3 deletions(-) > > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > > index 6c27acb6cd63..4deec7e598c6 100644 > > --- a/mm/mempolicy.c > > +++ b/mm/mempolicy.c > > @@ -1845,9 +1845,6 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) > > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > return &policy->nodes; > > > > - if (mode == MPOL_PREFERRED_MANY) > > - return &policy->nodes; I think it will make MPOL_PREFERRED_MANY not usable. Thanks, Feng > > - > > return NULL; > > } > > > > -- > > 2.11.0 > > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
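[Editor's illustration] To make the 'hard restriction' vs. 'preference' distinction discussed above concrete, here is a minimal userspace sketch. It is not kernel code: the types, the plain-bitmask nodemask, and all function names are simplified stand-ins. It models policy_nodemask() both with and without the MPOL_PREFERRED_MANY case that the patch removes, and shows how treating the preference as a hard mask shrinks the count that a model of allowed_mems_nr() reports:

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel types under discussion. */
enum mpol_mode { MPOL_DEFAULT, MPOL_BIND, MPOL_PREFERRED_MANY };

struct mempolicy {
	enum mpol_mode mode;
	unsigned long nodes;	/* bit n set => node n is in the policy */
};

/*
 * Model of policy_nodemask(): NULL means "no restriction" (all nodes).
 * With pref_is_hard set, MPOL_PREFERRED_MANY behaves as it did after
 * b27abaccf8e8 (preference treated as a hard mask); with it clear, it
 * behaves as the patch at the top of the thread proposes.
 */
static unsigned long *policy_nodemask_model(struct mempolicy *pol, int pref_is_hard)
{
	if (pol->mode == MPOL_BIND)
		return &pol->nodes;
	if (pref_is_hard && pol->mode == MPOL_PREFERRED_MANY)
		return &pol->nodes;
	return NULL;
}

/* Model of allowed_mems_nr(): sum free pages over the allowed nodes. */
static unsigned int allowed_mems_nr_model(struct mempolicy *pol, int pref_is_hard,
					  const unsigned int *free_pages, int nr_nodes)
{
	unsigned long *mask = policy_nodemask_model(pol, pref_is_hard);
	unsigned int nr = 0;
	int node;

	for (node = 0; node < nr_nodes; node++)
		if (!mask || (*mask & (1UL << node)))
			nr += free_pages[node];
	return nr;
}
```

For example, with two nodes holding 10 and 20 free pages and a policy naming only node 0, MPOL_BIND correctly restricts the count to 10, while MPOL_PREFERRED_MANY should still see all 30; the preference-as-hard-mask variant wrongly reports 10, which is the over-restriction being fixed.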
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-01 9:26 ` Feng Tang @ 2022-08-02 3:42 ` Muchun Song 2022-08-02 5:52 ` Feng Tang 0 siblings, 1 reply; 22+ messages in thread From: Muchun Song @ 2022-08-02 3:42 UTC (permalink / raw) To: Feng Tang Cc: Michal Hocko, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel On Mon, Aug 01, 2022 at 05:26:23PM +0800, Feng Tang wrote: > On Mon, Aug 01, 2022 at 05:06:14PM +0800, Michal Hocko wrote: > > On Mon 01-08-22 16:42:07, Muchun Song wrote: > > > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy > > > for filtering nodes for page allocation, which is a hard restriction (see the user > > > of allowed_mems_nr() in hugetlb.c). However, MPOL_PREFERRED_MANY is a preferred > > > mode not a hard restriction. Now it breaks the user of HugeTLB. Remove it from > > > policy_nodemask() to fix it, which will not affect current users of policy_nodemask() > > > since all of the users already have handled the case of MPOL_PREFERRED_MANY before > > > calling it. BTW, it is found by code inspection. > > > > I am not sure this is the right fix. It is quite true that > > policy_nodemask is a tricky function to use. It pretends to have a > > higher level logic but all existing users are expected to be policy > > aware and they special case allocation for each policy. That would mean > > that hugetlb should do the same. > > Yes, when I worked on the MPOL_PREFERRED_MANY patches, I was also > confused about policy_nodemask(), as it is never a 'strict' one as > the old code is: > > if (unlikely(mode == MPOL_BIND) && > apply_policy_zone(policy, gfp_zone(gfp)) && > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > return &policy->nodes; > > return NULL > > Even when the MPOL_BIND's nodes is not allowed by cpuset, it will > still return NULL (equals all nodes). > Well, I agree policy_nodemask() is really confusing because of the shortage of comments and the weird logic. 
> From the semantics of allowed_mems_nr(), I think it does get changed > a little by b27abaccf8e8. And to enforce the 'strict' semantic for > 'allowed', we may need a more strict nodemask API for it. > Maybe that is a good way to fix this, e.g. introducing a new helper that returns the strictly allowed nodemask. > > I haven't checked the actual behavior implications for hugetlb here. Is > > MPOL_PREFERRED_MANY even supported for hugetlb? Does this change make it > > work? From a quick look this just ignores MPOL_PREFERRED_MANY > > completely. > > IIRC, the hugetlb will hornor MPOL_PREFERRED_MANY. And I can double > check and report back if otherwise. > > > > Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com> > > > --- > > > mm/mempolicy.c | 3 --- > > > 1 file changed, 3 deletions(-) > > > > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > > > index 6c27acb6cd63..4deec7e598c6 100644 > > > --- a/mm/mempolicy.c > > > +++ b/mm/mempolicy.c > > > @@ -1845,9 +1845,6 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) > > > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > > return &policy->nodes; > > > > > > - if (mode == MPOL_PREFERRED_MANY) > > > - return &policy->nodes; > > I think it will make MPOL_PREFERRED_MANY not usable. > Sorry, I didn't get what you mean here. Could you explain in more detail why it is not usable? Thanks. > Thanks, > Feng > > > > - > > > return NULL; > > > } > > > > > > -- > > > 2.11.0 > > > > -- > > Michal Hocko > > SUSE Labs > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-02 3:42 ` Muchun Song @ 2022-08-02 5:52 ` Feng Tang 2022-08-02 6:40 ` Muchun Song 0 siblings, 1 reply; 22+ messages in thread From: Feng Tang @ 2022-08-02 5:52 UTC (permalink / raw) To: Muchun Song Cc: Hocko, Michal, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel On Tue, Aug 02, 2022 at 11:42:52AM +0800, Muchun Song wrote: > On Mon, Aug 01, 2022 at 05:26:23PM +0800, Feng Tang wrote: > > On Mon, Aug 01, 2022 at 05:06:14PM +0800, Michal Hocko wrote: > > > On Mon 01-08-22 16:42:07, Muchun Song wrote: > > > > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy > > > > for filtering nodes for page allocation, which is a hard restriction (see the user > > > > of allowed_mems_nr() in hugetlb.c). However, MPOL_PREFERRED_MANY is a preferred > > > > mode not a hard restriction. Now it breaks the user of HugeTLB. Remove it from > > > > policy_nodemask() to fix it, which will not affect current users of policy_nodemask() > > > > since all of the users already have handled the case of MPOL_PREFERRED_MANY before > > > > calling it. BTW, it is found by code inspection. > > > > > > I am not sure this is the right fix. It is quite true that > > > policy_nodemask is a tricky function to use. It pretends to have a > > > higher level logic but all existing users are expected to be policy > > > aware and they special case allocation for each policy. That would mean > > > that hugetlb should do the same. 
> > > > Yes, when I worked on the MPOL_PREFERRED_MANY patches, I was also > > confused about policy_nodemask(), as it is never a 'strict' one as > > the old code is: > > > > if (unlikely(mode == MPOL_BIND) && > > apply_policy_zone(policy, gfp_zone(gfp)) && > > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > return &policy->nodes; > > > > return NULL > > > > Even when the MPOL_BIND's nodes is not allowed by cpuset, it will > > still return NULL (equals all nodes). > > > > Well, I agree policy_nodemask() is really confusing because of the > shortage of comments and the weird logic. > > > From the semantics of allowed_mems_nr(), I think it does get changed > > a little by b27abaccf8e8. And to enforce the 'strict' semantic for > > 'allowed', we may need a more strict nodemask API for it. > > > > Maybe this is a good idea to fix this, e.g. introducing a new helper > to return the strict allowed nodemask. Yep. I had another thought: add one global all-zero nodemask. For an API like policy_nodemask(), there are two types of return value: * a nodemask with some bits set * NULL (means all nodes) Here a new type, a zero nodemask (a global variable), can be created to indicate that no node qualifies. > > > I haven't checked the actual behavior implications for hugetlb here. Is > > > MPOL_PREFERRED_MANY even supported for hugetlb? Does this change make it > > > work? From a quick look this just ignores MPOL_PREFERRED_MANY > > > completely. > > > > IIRC, the hugetlb will hornor MPOL_PREFERRED_MANY. And I can double > > check and report back if otherwise. 
> > > > > > Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") > > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com> > > > > --- > > > > mm/mempolicy.c | 3 --- > > > > 1 file changed, 3 deletions(-) > > > > > > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > > > > index 6c27acb6cd63..4deec7e598c6 100644 > > > > --- a/mm/mempolicy.c > > > > +++ b/mm/mempolicy.c > > > > @@ -1845,9 +1845,6 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) > > > > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > > > return &policy->nodes; > > > > > > > > - if (mode == MPOL_PREFERRED_MANY) > > > > - return &policy->nodes; > > > > I think it will make MPOL_PREFERRED_MANY not usable. > > > > Sorry, I didn't got what you mean here. Could you explain more details > about why it is not usable? I thought alloc_pages() would rely on policy_nodemask(), which was wrong, as I forgot that MPOL_PREFERRED_MANY has a dedicated function, alloc_pages_preferred_many(), to handle it. Sorry for the confusion. Thanks, Feng > Thanks. > > > Thanks, > > Feng > > > > > > - > > > > return NULL; > > > > } > > > > > > > > -- > > > > 2.11.0 > > > > > > -- > > > Michal Hocko > > > SUSE Labs > > > ^ permalink raw reply [flat|nested] 22+ messages in thread
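[Editor's illustration] The all-zero sentinel idea Feng floats above can be modeled in a short userspace sketch. All names here are hypothetical simplifications (the real kernel interface would deal in nodemask_t and cpuset state): the lookup returns a real mask, NULL for "all nodes", or the address of a dedicated zero mask for "no qualified node":

```c
#include <stddef.h>

enum mpol_mode { MPOL_DEFAULT, MPOL_BIND, MPOL_PREFERRED_MANY };

struct mempolicy {
	enum mpol_mode mode;
	unsigned long nodes;	/* policy's node bitmask */
};

/* Global sentinel meaning "no qualified node" (mirrors the proposed zero_nodes). */
static unsigned long zero_nodes;

/*
 * Strict variant of the lookup: MPOL_BIND with a cpuset-valid mask
 * restricts allocation to that mask, MPOL_BIND with an invalid mask
 * allows nothing at all, and every other policy places no restriction
 * (NULL == all nodes). Callers can now tell "all" and "none" apart.
 */
static unsigned long *allowed_nodemask_model(struct mempolicy *pol, int cpuset_valid)
{
	if (pol->mode == MPOL_BIND)
		return cpuset_valid ? &pol->nodes : &zero_nodes;
	return NULL;
}
```

The design point is that a single pointer return value now carries three distinguishable cases without callers having to scan the mask for emptiness first.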
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-02 5:52 ` Feng Tang @ 2022-08-02 6:40 ` Muchun Song 2022-08-02 7:39 ` Feng Tang 0 siblings, 1 reply; 22+ messages in thread From: Muchun Song @ 2022-08-02 6:40 UTC (permalink / raw) To: Feng Tang Cc: Hocko, Michal, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel On Tue, Aug 02, 2022 at 01:52:05PM +0800, Feng Tang wrote: > On Tue, Aug 02, 2022 at 11:42:52AM +0800, Muchun Song wrote: > > On Mon, Aug 01, 2022 at 05:26:23PM +0800, Feng Tang wrote: > > > On Mon, Aug 01, 2022 at 05:06:14PM +0800, Michal Hocko wrote: > > > > On Mon 01-08-22 16:42:07, Muchun Song wrote: > > > > > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy > > > > > for filtering nodes for page allocation, which is a hard restriction (see the user > > > > > of allowed_mems_nr() in hugetlb.c). However, MPOL_PREFERRED_MANY is a preferred > > > > > mode not a hard restriction. Now it breaks the user of HugeTLB. Remove it from > > > > > policy_nodemask() to fix it, which will not affect current users of policy_nodemask() > > > > > since all of the users already have handled the case of MPOL_PREFERRED_MANY before > > > > > calling it. BTW, it is found by code inspection. > > > > > > > > I am not sure this is the right fix. It is quite true that > > > > policy_nodemask is a tricky function to use. It pretends to have a > > > > higher level logic but all existing users are expected to be policy > > > > aware and they special case allocation for each policy. That would mean > > > > that hugetlb should do the same. 
> > > > > > Yes, when I worked on the MPOL_PREFERRED_MANY patches, I was also > > > confused about policy_nodemask(), as it is never a 'strict' one as > > > the old code is: > > > > > > if (unlikely(mode == MPOL_BIND) && > > > apply_policy_zone(policy, gfp_zone(gfp)) && > > > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > > return &policy->nodes; > > > > > > return NULL > > > > > > Even when the MPOL_BIND's nodes is not allowed by cpuset, it will > > > still return NULL (equals all nodes). > > > > > > > Well, I agree policy_nodemask() is really confusing because of the > > shortage of comments and the weird logic. > > > > > From the semantics of allowed_mems_nr(), I think it does get changed > > > a little by b27abaccf8e8. And to enforce the 'strict' semantic for > > > 'allowed', we may need a more strict nodemask API for it. > > > > > > > Maybe this is a good idea to fix this, e.g. introducing a new helper > > to return the strict allowed nodemask. > > Yep. > > I had another thought to add one global all-zero nodemask, for API like > policy_nodemask(), it has 2 types of return value: > * a nodemask with some bits set > * NULL (means all nodes) > > Here a new type of zero nodemask (a gloabl variable)can be created to > indicate no qualified node. > I see why you want to introduce a global zero nodemask. Since we already have a global nodemask array, namely node_states, instead of returning NULL for the all-nodes case, how about returning node_states[N_ONLINE] for it? And make it return NULL for the case where no nodes are allowed. Any thoughts? > > > > I haven't checked the actual behavior implications for hugetlb here. Is > > > > MPOL_PREFERRED_MANY even supported for hugetlb? Does this change make it > > > > work? From a quick look this just ignores MPOL_PREFERRED_MANY > > > > completely. > > > > > > IIRC, the hugetlb will hornor MPOL_PREFERRED_MANY. And I can double > > > check and report back if otherwise. 
> > > > > > > > Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") > > > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com> > > > > > --- > > > > > mm/mempolicy.c | 3 --- > > > > > 1 file changed, 3 deletions(-) > > > > > > > > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > > > > > index 6c27acb6cd63..4deec7e598c6 100644 > > > > > --- a/mm/mempolicy.c > > > > > +++ b/mm/mempolicy.c > > > > > @@ -1845,9 +1845,6 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) > > > > > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > > > > return &policy->nodes; > > > > > > > > > > - if (mode == MPOL_PREFERRED_MANY) > > > > > - return &policy->nodes; > > > > > > I think it will make MPOL_PREFERRED_MANY not usable. > > > > > > > Sorry, I didn't got what you mean here. Could you explain more details > > about why it is not usable? > > I thought alloc_pages() will rely on policy_nodemask(), which was wrong > as I forgot the MPOL_PREFERRED_MANY has a dedicated function > alloc_pages_preferred_many() to handle it. Sorry for the confusion. > > Thanks, > Feng > > > Thanks. > > > > > Thanks, > > > Feng > > > > > > > > - > > > > > return NULL; > > > > > } > > > > > > > > > > -- > > > > > 2.11.0 > > > > > > > > -- > > > > Michal Hocko > > > > SUSE Labs > > > > > > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-02 6:40 ` Muchun Song @ 2022-08-02 7:39 ` Feng Tang 2022-08-02 9:02 ` Michal Hocko 0 siblings, 1 reply; 22+ messages in thread From: Feng Tang @ 2022-08-02 7:39 UTC (permalink / raw) To: Muchun Song Cc: Hocko, Michal, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel On Tue, Aug 02, 2022 at 02:40:11PM +0800, Muchun Song wrote: > On Tue, Aug 02, 2022 at 01:52:05PM +0800, Feng Tang wrote: > > On Tue, Aug 02, 2022 at 11:42:52AM +0800, Muchun Song wrote: > > > On Mon, Aug 01, 2022 at 05:26:23PM +0800, Feng Tang wrote: > > > > On Mon, Aug 01, 2022 at 05:06:14PM +0800, Michal Hocko wrote: > > > > > On Mon 01-08-22 16:42:07, Muchun Song wrote: > > > > > > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy > > > > > > for filtering nodes for page allocation, which is a hard restriction (see the user > > > > > > of allowed_mems_nr() in hugetlb.c). However, MPOL_PREFERRED_MANY is a preferred > > > > > > mode not a hard restriction. Now it breaks the user of HugeTLB. Remove it from > > > > > > policy_nodemask() to fix it, which will not affect current users of policy_nodemask() > > > > > > since all of the users already have handled the case of MPOL_PREFERRED_MANY before > > > > > > calling it. BTW, it is found by code inspection. > > > > > > > > > > I am not sure this is the right fix. It is quite true that > > > > > policy_nodemask is a tricky function to use. It pretends to have a > > > > > higher level logic but all existing users are expected to be policy > > > > > aware and they special case allocation for each policy. That would mean > > > > > that hugetlb should do the same. 
> > > > > > > > Yes, when I worked on the MPOL_PREFERRED_MANY patches, I was also > > > > confused about policy_nodemask(), as it is never a 'strict' one as > > > > the old code is: > > > > > > > > if (unlikely(mode == MPOL_BIND) && > > > > apply_policy_zone(policy, gfp_zone(gfp)) && > > > > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > > > return &policy->nodes; > > > > > > > > return NULL > > > > > > > > Even when the MPOL_BIND's nodes is not allowed by cpuset, it will > > > > still return NULL (equals all nodes). > > > > > > > > > > Well, I agree policy_nodemask() is really confusing because of the > > > shortage of comments and the weird logic. > > > > > > > From the semantics of allowed_mems_nr(), I think it does get changed > > > > a little by b27abaccf8e8. And to enforce the 'strict' semantic for > > > > 'allowed', we may need a more strict nodemask API for it. > > > > > > > > > > Maybe this is a good idea to fix this, e.g. introducing a new helper > > > to return the strict allowed nodemask. > > > > Yep. > > > > I had another thought to add one global all-zero nodemask, for API like > > policy_nodemask(), it has 2 types of return value: > > * a nodemask with some bits set > > * NULL (means all nodes) > > > > Here a new type of zero nodemask (a gloabl variable)can be created to > > indicate no qualified node. > > > > I know why you want to introduce a gloable zero nidemask. Since we already > have a glable nodemask array, namely node_states, instead of returning NULL > for the case of all nodes, how about returing node_states[N_ONLINE] for it? > And make it return NULL for the case where no nodes are allowed. Any thought? I think returning node_states[N_ONLINE] can simplify the code in allowed_mems_nr(), and the empty zero-nodes mask can simplify it further. 
Here is some draft patch (not tested) to show the idea Thanks, Feng --- include/linux/mempolicy.h | 8 ++++++++ include/linux/nodemask.h | 7 +++++++ mm/hugetlb.c | 7 ++++--- mm/mempolicy.c | 17 +++++++++++++++++ mm/page_alloc.c | 3 +++ 5 files changed, 39 insertions(+), 3 deletions(-) diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 668389b4b53d..b5451fef1620 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -150,6 +150,7 @@ extern bool init_nodemask_of_mempolicy(nodemask_t *mask); extern bool mempolicy_in_oom_domain(struct task_struct *tsk, const nodemask_t *mask); extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy); +extern nodemask_t *allowed_policy_nodemask(gfp_t gfp, struct mempolicy *policy); static inline nodemask_t *policy_nodemask_current(gfp_t gfp) { @@ -158,6 +159,13 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp) return policy_nodemask(gfp, mpol); } +static inline nodemask_t *allowed_policy_nodemask_current(gfp_t gfp) +{ + struct mempolicy *mpol = get_task_policy(current); + + return allowed_policy_nodemask(gfp, mpol); +} + extern unsigned int mempolicy_slab_node(void); extern enum zone_type policy_zone; diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h index 0f233b76c9ce..dc5fab38e810 100644 --- a/include/linux/nodemask.h +++ b/include/linux/nodemask.h @@ -409,6 +409,13 @@ enum node_states { extern nodemask_t node_states[NR_NODE_STATES]; +extern nodemask_t zero_nodes; + +static inline bool is_empty_nodes(nodemask_t *pnodes) +{ + return (pnodes == &zero_nodes || __nodes_empty(pnodes, MAX_NUMNODES)); +} + #if MAX_NUMNODES > 1 static inline int node_state(int node, enum node_states state) { diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a57e1be41401..dc9f4ed32909 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4340,10 +4340,11 @@ static unsigned int allowed_mems_nr(struct hstate *h) mpol_allowed = policy_nodemask_current(gfp_mask); - for_each_node_mask(node, 
cpuset_current_mems_allowed) { - if (!mpol_allowed || node_isset(node, *mpol_allowed)) + if (!mpol_allowed) + mpol_allowed = &cpuset_current_mems_allowed; + else if (is_empty_nodes(mpol_allowed)) + return 0; + + for_each_node_mask(node, *mpol_allowed) nr += array[node]; - } return nr; } diff --git a/mm/mempolicy.c b/mm/mempolicy.c index d39b01fd52fe..3e936b8ca9ea 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1845,6 +1845,23 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) return NULL; } +/* + * Return the allowed nodes mask for a mempolicy and page allocation, + * which is a 'stricter' semantic than policy_nodemask() + */ +nodemask_t *allowed_policy_nodemask(gfp_t gfp, struct mempolicy *policy) +{ + if (unlikely(policy->mode == MPOL_BIND)) { + if (apply_policy_zone(policy, gfp_zone(gfp)) && + cpuset_nodemask_valid_mems_allowed(&policy->nodes)) + return &policy->nodes; + else + return &zero_nodes; + } + + return NULL; +} + /* * Return the preferred node id for 'prefer' mempolicy, and return * the given id for all other policies. diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e008a3df0485..3549ea037588 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -181,6 +181,9 @@ nodemask_t node_states[NR_NODE_STATES] __read_mostly = { }; EXPORT_SYMBOL(node_states); +nodemask_t zero_nodes = NODE_MASK_NONE; +EXPORT_SYMBOL(zero_nodes); + atomic_long_t _totalram_pages __read_mostly; EXPORT_SYMBOL(_totalram_pages); unsigned long totalreserve_pages __read_mostly; -- 2.27.0 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-02 7:39 ` Feng Tang @ 2022-08-02 9:02 ` Michal Hocko 2022-08-03 6:41 ` Feng Tang 0 siblings, 1 reply; 22+ messages in thread From: Michal Hocko @ 2022-08-02 9:02 UTC (permalink / raw) To: Feng Tang Cc: Muchun Song, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel, Mike Kravetz Please make sure to CC Mike on hugetlb related changes. I didn't really grasp your proposed solution, but it feels like it is going sideways. The real issue is that hugetlb uses a dedicated allocation scheme which is not fully MPOL_PREFERRED_MANY aware AFAICS. I do not think we should be tricking that by providing some fake nodemasks and what not. The good news is that allocation from the pool is MPOL_PREFERRED_MANY aware, because it first tries to allocate from the preferred node mask and then falls back to the full nodemask (dequeue_huge_page_vma). If the existing pools cannot really satisfy that allocation then it tries to allocate a new hugetlb page (alloc_fresh_huge_page), which also performs a two-stage allocation, first with the node mask and then without one. But both of them might fail. The bad news is that other allocation functions - including those that allocate to the pool - are not fully MPOL_PREFERRED_MANY aware. E.g. the __nr_hugepages_store_common paths, which use the allocating process's policy to fill up the pool, so the pool could be under-provisioned if that context is using MPOL_PREFERRED_MANY. Wrt. allowed_mems_nr (i.e. hugetlb_acct_memory) this is reservation code and I have to admit I do not really remember details there. This is subtle code and my best guess would be that policy_nodemask_current should be hugetlb specific and only care about MPOL_BIND. 
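[Editor's illustration] The two-stage scheme described above (try the preferred node mask first, then fall back to the full node set) can be sketched in userspace. try_alloc() here is a hypothetical stand-in for the dequeue / alloc-fresh paths, not a real kernel function, and nodes are a plain bitmask:

```c
#include <stddef.h>

/* Hypothetical single-stage allocator: succeeds only on an in-mask node with stock. */
static int try_alloc(const unsigned long *mask, const int *stock, int nr_nodes)
{
	int node;

	for (node = 0; node < nr_nodes; node++) {
		int in_mask = !mask || (*mask & (1UL << node));	/* NULL == all nodes */

		if (in_mask && stock[node] > 0)
			return node;	/* "allocated" from this node */
	}
	return -1;	/* nothing available in this stage */
}

/*
 * Two-stage allocation in the style of dequeue_huge_page_vma() /
 * alloc_fresh_huge_page(): preferred mask first, then no mask at all,
 * so a preference never causes outright failure while memory exists.
 */
static int alloc_two_stage(const unsigned long *preferred, const int *stock, int nr_nodes)
{
	int node = try_alloc(preferred, stock, nr_nodes);

	if (node < 0)
		node = try_alloc(NULL, stock, nr_nodes);	/* fall back to all nodes */
	return node;
}
```

This is exactly why MPOL_PREFERRED_MANY must not be treated as a hard mask by accounting code: the allocator itself will happily leave the preferred set when it has to.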
On Tue 02-08-22 15:39:52, Feng Tang wrote: > On Tue, Aug 02, 2022 at 02:40:11PM +0800, Muchun Song wrote: > > On Tue, Aug 02, 2022 at 01:52:05PM +0800, Feng Tang wrote: > > > On Tue, Aug 02, 2022 at 11:42:52AM +0800, Muchun Song wrote: > > > > On Mon, Aug 01, 2022 at 05:26:23PM +0800, Feng Tang wrote: > > > > > On Mon, Aug 01, 2022 at 05:06:14PM +0800, Michal Hocko wrote: > > > > > > On Mon 01-08-22 16:42:07, Muchun Song wrote: > > > > > > > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy > > > > > > > for filtering nodes for page allocation, which is a hard restriction (see the user > > > > > > > of allowed_mems_nr() in hugetlb.c). However, MPOL_PREFERRED_MANY is a preferred > > > > > > > mode not a hard restriction. Now it breaks the user of HugeTLB. Remove it from > > > > > > > policy_nodemask() to fix it, which will not affect current users of policy_nodemask() > > > > > > > since all of the users already have handled the case of MPOL_PREFERRED_MANY before > > > > > > > calling it. BTW, it is found by code inspection. > > > > > > > > > > > > I am not sure this is the right fix. It is quite true that > > > > > > policy_nodemask is a tricky function to use. It pretends to have a > > > > > > higher level logic but all existing users are expected to be policy > > > > > > aware and they special case allocation for each policy. That would mean > > > > > > that hugetlb should do the same. 
> > > > > > > > > > Yes, when I worked on the MPOL_PREFERRED_MANY patches, I was also > > > > > confused about policy_nodemask(), as it is never a 'strict' one as > > > > > the old code is: > > > > > > > > > > if (unlikely(mode == MPOL_BIND) && > > > > > apply_policy_zone(policy, gfp_zone(gfp)) && > > > > > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > > > > return &policy->nodes; > > > > > > > > > > return NULL > > > > > > > > > > Even when the MPOL_BIND's nodes is not allowed by cpuset, it will > > > > > still return NULL (equals all nodes). > > > > > > > > > > > > > Well, I agree policy_nodemask() is really confusing because of the > > > > shortage of comments and the weird logic. > > > > > > > > > From the semantics of allowed_mems_nr(), I think it does get changed > > > > > a little by b27abaccf8e8. And to enforce the 'strict' semantic for > > > > > 'allowed', we may need a more strict nodemask API for it. > > > > > > > > > > > > > Maybe this is a good idea to fix this, e.g. introducing a new helper > > > > to return the strict allowed nodemask. > > > > > > Yep. > > > > > > I had another thought to add one global all-zero nodemask, for API like > > > policy_nodemask(), it has 2 types of return value: > > > * a nodemask with some bits set > > > * NULL (means all nodes) > > > > > > Here a new type of zero nodemask (a gloabl variable)can be created to > > > indicate no qualified node. > > > > > > > I know why you want to introduce a gloable zero nidemask. Since we already > > have a glable nodemask array, namely node_states, instead of returning NULL > > for the case of all nodes, how about returing node_states[N_ONLINE] for it? > > And make it return NULL for the case where no nodes are allowed. Any thought? > > I think return node_states[N_ONLINE] can simplify the code in allowed_mems_nr(), > the empty zero nodes can simplify further. 
> > Here is some draft patch (not tested) to show the idea > > Thanks, > Feng > > --- > include/linux/mempolicy.h | 8 ++++++++ > include/linux/nodemask.h | 7 +++++++ > mm/hugetlb.c | 7 ++++--- > mm/mempolicy.c | 17 +++++++++++++++++ > mm/page_alloc.c | 3 +++ > 5 files changed, 39 insertions(+), 3 deletions(-) > > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h > index 668389b4b53d..b5451fef1620 100644 > --- a/include/linux/mempolicy.h > +++ b/include/linux/mempolicy.h > @@ -150,6 +150,7 @@ extern bool init_nodemask_of_mempolicy(nodemask_t *mask); > extern bool mempolicy_in_oom_domain(struct task_struct *tsk, > const nodemask_t *mask); > extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy); > +extern nodemask_t *allowed_policy_nodemask(gfp_t gfp, struct mempolicy *policy); > > static inline nodemask_t *policy_nodemask_current(gfp_t gfp) > { > @@ -158,6 +159,13 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp) > return policy_nodemask(gfp, mpol); > } > > +static inline nodemask_t *allowed_policy_nodemask_current(gfp_t gfp) > +{ > + struct mempolicy *mpol = get_task_policy(current); > + > + return allowed_policy_nodemask(gfp, mpol); > +} > + > extern unsigned int mempolicy_slab_node(void); > > extern enum zone_type policy_zone; > diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h > index 0f233b76c9ce..dc5fab38e810 100644 > --- a/include/linux/nodemask.h > +++ b/include/linux/nodemask.h > @@ -409,6 +409,13 @@ enum node_states { > > extern nodemask_t node_states[NR_NODE_STATES]; > > +extern nodemask_t zero_nodes; > + > +static inline bool is_empty_nodes(nodemask_t *pnodes) > +{ > + return (pnodes == &zero_nodes || __nodes_empty(pnodes, MAX_NUMNODES)); > +} > + > #if MAX_NUMNODES > 1 > static inline int node_state(int node, enum node_states state) > { > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index a57e1be41401..dc9f4ed32909 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -4340,10 +4340,11 @@ 
static unsigned int allowed_mems_nr(struct hstate *h) > > mpol_allowed = policy_nodemask_current(gfp_mask); > > - for_each_node_mask(node, cpuset_current_mems_allowed) { > - if (!mpol_allowed || node_isset(node, *mpol_allowed)) > + if (is_empty_nodes(mpol_allowed)) > + return 0; > + > + for_each_node_mask(node, mpol_allowed) > nr += array[node]; > - } > > return nr; > } > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index d39b01fd52fe..3e936b8ca9ea 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -1845,6 +1845,23 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) > return NULL; > } > > +/* > + * Return the allowed nodes mask for a mempolicy and page allocation, > + * which is a 'stricter' semantic than policy_nodemask() > + */ > +nodemask_t *allowed_policy_nodemask(gfp_t gfp, struct mempolicy *policy) > +{ > + if (unlikely(policy->mode == MPOL_BIND)) { > + if (apply_policy_zone(policy, gfp_zone(gfp)) && > + cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > + return &policy->nodes; > + else > + return &zero_nodes; > + } > + > + return NULL; > +} > + > /* > * Return the preferred node id for 'prefer' mempolicy, and return > * the given id for all other policies. > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index e008a3df0485..3549ea037588 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -181,6 +181,9 @@ nodemask_t node_states[NR_NODE_STATES] __read_mostly = { > }; > EXPORT_SYMBOL(node_states); > > +nodemask_t zero_nodes = NODE_MASK_NONE; > +EXPORT_SYMBOL(zero_nodes); > + > atomic_long_t _totalram_pages __read_mostly; > EXPORT_SYMBOL(_totalram_pages); > unsigned long totalreserve_pages __read_mostly; > -- > 2.27.0 > -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-02 9:02 ` Michal Hocko @ 2022-08-03 6:41 ` Feng Tang 2022-08-03 7:36 ` Michal Hocko 0 siblings, 1 reply; 22+ messages in thread From: Feng Tang @ 2022-08-03 6:41 UTC (permalink / raw) To: Michal Hocko Cc: Muchun Song, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel, Mike Kravetz On Tue, Aug 02, 2022 at 05:02:37PM +0800, Michal Hocko wrote: > Please make sure to CC Mike on hugetlb related changes. OK. > I didn't really get to grasp your proposed solution but it feels goind > sideways. The real issue is that hugetlb uses a dedicated allocation > scheme which is not fully MPOL_PREFERRED_MANY aware AFAICS. I do not > think we should be tricking that by providing some fake nodemasks and > what not. > > The good news is that allocation from the pool is MPOL_PREFERRED_MANY > aware because it first tries to allocation from the preffered node mask > and then fall back to the full nodemask (dequeue_huge_page_vma). > If the existing pools cannot really satisfy that allocation then it > tries to allocate a new hugetlb page (alloc_fresh_huge_page) which also > performs 2 stage allocation with the node mask and no node masks. But > both of them might fail. > > The bad news is that other allocation functions - including those that > allocate to the pool are not fully MPOL_PREFERRED_MANY aware. E.g. > __nr_hugepages_store_common paths which use the allocating process > policy to fill up the pool so the pool could be under provisioned if > that context is using MPOL_PREFERRED_MANY. Thanks for the check! So you mean if the preferred nodes don't have enough pages, we should also fall back to all nodes like dequeue_huge_page_vma() does? Or we can use a policy API which returns a nodemask for MPOL_BIND and NULL for all other policies, like allowed_mems_nr() needs. 
--- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -158,6 +158,18 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp) return policy_nodemask(gfp, mpol); } +#ifdef CONFIG_HUGETLB_FS +static inline nodemask_t *strict_policy_nodemask_current(void) +{ + struct mempolicy *mpol = get_task_policy(current); + + if (mpol->mode == MPOL_BIND) + return &mpol->nodes; + + return NULL; +} +#endif + > Wrt. allowed_mems_nr (i.e. hugetlb_acct_memory) this is a reservation > code and I have to admit I do not really remember details there. This is > a subtle code and my best guess would be that policy_nodemask_current > should be hugetlb specific and only care about MPOL_BIND. The API needed by allowed_mems_nr() is a little different as it has gfp flag and cpuset config to consider. Thanks, Feng [snip] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-03 6:41 ` Feng Tang @ 2022-08-03 7:36 ` Michal Hocko 2022-08-03 17:14 ` Feng Tang 0 siblings, 1 reply; 22+ messages in thread From: Michal Hocko @ 2022-08-03 7:36 UTC (permalink / raw) To: Feng Tang Cc: Muchun Song, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel, Mike Kravetz On Wed 03-08-22 14:41:20, Feng Tang wrote: > On Tue, Aug 02, 2022 at 05:02:37PM +0800, Michal Hocko wrote: > > Please make sure to CC Mike on hugetlb related changes. > > OK. > > > I didn't really get to grasp your proposed solution but it feels goind > > sideways. The real issue is that hugetlb uses a dedicated allocation > > scheme which is not fully MPOL_PREFERRED_MANY aware AFAICS. I do not > > think we should be tricking that by providing some fake nodemasks and > > what not. > > > > The good news is that allocation from the pool is MPOL_PREFERRED_MANY > > aware because it first tries to allocation from the preffered node mask > > and then fall back to the full nodemask (dequeue_huge_page_vma). > > If the existing pools cannot really satisfy that allocation then it > > tries to allocate a new hugetlb page (alloc_fresh_huge_page) which also > > performs 2 stage allocation with the node mask and no node masks. But > > both of them might fail. > > > > The bad news is that other allocation functions - including those that > > allocate to the pool are not fully MPOL_PREFERRED_MANY aware. E.g. > > __nr_hugepages_store_common paths which use the allocating process > > policy to fill up the pool so the pool could be under provisioned if > > that context is using MPOL_PREFERRED_MANY. > > Thanks for the check! > > So you mean if the prferred nodes don't have enough pages, we should > also fallback to all like dequeue_huge_page_vma() does? > > Or we can user a policy API which return nodemask for MPOL_BIND and > NULL for all other policies, like allowed_mems_nr() needs. 
> > --- a/include/linux/mempolicy.h > +++ b/include/linux/mempolicy.h > @@ -158,6 +158,18 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp) > return policy_nodemask(gfp, mpol); > } > > +#ifdef CONFIG_HUGETLB_FS > +static inline nodemask_t *strict_policy_nodemask_current(void) > +{ > + struct mempolicy *mpol = get_task_policy(current); > + > + if (mpol->mode == MPOL_BIND) > + return &mpol->nodes; > + > + return NULL; > +} > +#endif Yes something like this, except that I would also move this into hugetlb proper because this doesn't seem generally useful. > > Wrt. allowed_mems_nr (i.e. hugetlb_acct_memory) this is a reservation > > code and I have to admit I do not really remember details there. This is > > a subtle code and my best guess would be that policy_nodemask_current > > should be hugetlb specific and only care about MPOL_BIND. > > The API needed by allowed_mem_nr() is a little different as it has gfp > flag and cpuset config to consider. Why would gfp mask matter? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-03 7:36 ` Michal Hocko @ 2022-08-03 17:14 ` Feng Tang 2022-08-03 11:28 ` Michal Hocko 0 siblings, 1 reply; 22+ messages in thread From: Feng Tang @ 2022-08-03 17:14 UTC (permalink / raw) To: Michal Hocko Cc: Muchun Song, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel, Mike Kravetz On Wed, Aug 03, 2022 at 03:36:41PM +0800, Michal Hocko wrote: > On Wed 03-08-22 14:41:20, Feng Tang wrote: > > On Tue, Aug 02, 2022 at 05:02:37PM +0800, Michal Hocko wrote: > > > Please make sure to CC Mike on hugetlb related changes. > > > > OK. > > > > > I didn't really get to grasp your proposed solution but it feels goind > > > sideways. The real issue is that hugetlb uses a dedicated allocation > > > scheme which is not fully MPOL_PREFERRED_MANY aware AFAICS. I do not > > > think we should be tricking that by providing some fake nodemasks and > > > what not. > > > > > > The good news is that allocation from the pool is MPOL_PREFERRED_MANY > > > aware because it first tries to allocation from the preffered node mask > > > and then fall back to the full nodemask (dequeue_huge_page_vma). > > > If the existing pools cannot really satisfy that allocation then it > > > tries to allocate a new hugetlb page (alloc_fresh_huge_page) which also > > > performs 2 stage allocation with the node mask and no node masks. But > > > both of them might fail. > > > > > > The bad news is that other allocation functions - including those that > > > allocate to the pool are not fully MPOL_PREFERRED_MANY aware. E.g. > > > __nr_hugepages_store_common paths which use the allocating process > > > policy to fill up the pool so the pool could be under provisioned if > > > that context is using MPOL_PREFERRED_MANY. > > > > Thanks for the check! > > > > So you mean if the prferred nodes don't have enough pages, we should > > also fallback to all like dequeue_huge_page_vma() does? 
> > > > Or we can user a policy API which return nodemask for MPOL_BIND and > > NULL for all other policies, like allowed_mems_nr() needs. > > > > --- a/include/linux/mempolicy.h > > +++ b/include/linux/mempolicy.h > > @@ -158,6 +158,18 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp) > > return policy_nodemask(gfp, mpol); > > } > > > > +#ifdef CONFIG_HUGETLB_FS > > +static inline nodemask_t *strict_policy_nodemask_current(void) > > +{ > > + struct mempolicy *mpol = get_task_policy(current); > > + > > + if (mpol->mode == MPOL_BIND) > > + return &mpol->nodes; > > + > > + return NULL; > > +} > > +#endif > > Yes something like this, except that I would also move this into hugetlb > proper because this doesn't seem generally useful. Ok, I change it as below: --- mm/hugetlb.c | 28 +++++++++++++++++++++++----- 1 file changed, 23 insertions(+), 5 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 14be38822cf8..ef1d4ffa733f 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -91,6 +91,24 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp; /* Forward declaration */ static int hugetlb_acct_memory(struct hstate *h, long delta); +/* + * Return nodemask of what is allowed by current process' memory + * policy, as MPOL_BIND is the only 'strict' policy, return NULL + * for all other policies + */ +static inline nodemask_t *allowed_policy_nodemask_current(void) +{ +#ifdef CONFIG_NUMA + struct mempolicy *mpol = get_task_policy(current); + + if (mpol->mode == MPOL_BIND) + return &mpol->nodes; + return NULL; +#else + return NULL; +#endif +} + static inline bool subpool_is_free(struct hugepage_subpool *spool) { if (spool->count) @@ -3556,7 +3574,7 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy, unsigned long count, size_t len) { int err; - nodemask_t nodes_allowed, *n_mask; + nodemask_t nodes_allowed, *n_mask = NULL; if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported()) return -EINVAL; @@ -3565,11 +3583,11 @@ 
static ssize_t __nr_hugepages_store_common(bool obey_mempolicy, /* * global hstate attribute */ - if (!(obey_mempolicy && - init_nodemask_of_mempolicy(&nodes_allowed))) + if (obey_mempolicy) + n_mask = allowed_policy_nodemask_current(); + + if (!n_mask) n_mask = &node_states[N_MEMORY]; - else - n_mask = &nodes_allowed; } else { /* * Node specific request. count adjustment happens in -- 2.27.0 > > > Wrt. allowed_mems_nr (i.e. hugetlb_acct_memory) this is a reservation > > > code and I have to admit I do not really remember details there. This is > > > a subtle code and my best guess would be that policy_nodemask_current > > > should be hugetlb specific and only care about MPOL_BIND. > > > > The API needed by allowed_mem_nr() is a little different as it has gfp > > flag and cpuset config to consider. > > Why would gfp mask matter? I'm not very familiar with the old semantics (will check more); from the current code, it checks both the gfp flags and the cpuset limit. Thanks, Feng > -- > Michal Hocko > SUSE Labs ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-03 17:14 ` Feng Tang @ 2022-08-03 11:28 ` Michal Hocko 2022-08-03 20:43 ` Feng Tang 0 siblings, 1 reply; 22+ messages in thread From: Michal Hocko @ 2022-08-03 11:28 UTC (permalink / raw) To: Feng Tang Cc: Muchun Song, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel, Mike Kravetz On Thu 04-08-22 01:14:32, Feng Tang wrote: [...] > Ok, I change it as below: Wouldn't it be better to make this allowed_mems_nr specific to be explicit about the intention? Not that I feel strongly about that. > --- > mm/hugetlb.c | 28 +++++++++++++++++++++++----- > 1 file changed, 23 insertions(+), 5 deletions(-) Not even compile tested include/linux/mempolicy.h | 12 ------------ mm/hugetlb.c | 24 ++++++++++++++++++++---- 2 files changed, 20 insertions(+), 16 deletions(-) diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 668389b4b53d..e38b0ef20b8b 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -151,13 +151,6 @@ extern bool mempolicy_in_oom_domain(struct task_struct *tsk, const nodemask_t *mask); extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy); -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) -{ - struct mempolicy *mpol = get_task_policy(current); - - return policy_nodemask(gfp, mpol); -} - extern unsigned int mempolicy_slab_node(void); extern enum zone_type policy_zone; @@ -294,11 +287,6 @@ static inline void mpol_put_task_policy(struct task_struct *task) { } -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) -{ - return NULL; -} - static inline bool mpol_is_preferred_many(struct mempolicy *pol) { return false; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a18c071c294e..6cacbc9b15a1 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4330,18 +4330,34 @@ static int __init default_hugepagesz_setup(char *s) } __setup("default_hugepagesz=", default_hugepagesz_setup); +struct mempolicy 
*policy_mbind_nodemask(gfp_t gfp) +{ +#ifdef CONFIG_MEMPOLICY + struct mempolicy *mpol = get_task_policy(current); + + /* + * only enforce MBIND which overlaps with cpuset policy (from policy_nodemask) + * specifically for hugetlb case + */ + if (mpol->mode == MPOL_BIND && + (apply_policy_zone(mpol, gfp_zone(gfp)) && + cpuset_nodemask_valid_mems_allowed(&policy->nodes)) + return &mpol->nodes; +#endif + return NULL; +} + static unsigned int allowed_mems_nr(struct hstate *h) { int node; unsigned int nr = 0; - nodemask_t *mpol_allowed; + nodemask_t *mbind_nodemask; unsigned int *array = h->free_huge_pages_node; gfp_t gfp_mask = htlb_alloc_mask(h); - mpol_allowed = policy_nodemask_current(gfp_mask); - + mbind_nodemask = policy_mbind_nodemask(gfp_mask); for_each_node_mask(node, cpuset_current_mems_allowed) { - if (!mpol_allowed || node_isset(node, *mpol_allowed)) + if (!mbind_nodemask || node_isset(node, *mbind_nodemask)) nr += array[node]; } -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-03 11:28 ` Michal Hocko @ 2022-08-03 20:43 ` Feng Tang 2022-08-03 12:56 ` Michal Hocko 0 siblings, 1 reply; 22+ messages in thread From: Feng Tang @ 2022-08-03 20:43 UTC (permalink / raw) To: Michal Hocko Cc: Muchun Song, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel, Mike Kravetz On Wed, Aug 03, 2022 at 07:28:59PM +0800, Michal Hocko wrote: > On Thu 04-08-22 01:14:32, Feng Tang wrote: > [...] > > Ok, I change it as below: > > Wouldn't it be better to make this allowed_mems_nr specific to be > explicit about the intention? Yes, it is. > Not that I feel strongly about that. > > > --- > > mm/hugetlb.c | 28 +++++++++++++++++++++++----- > > 1 file changed, 23 insertions(+), 5 deletions(-) > > Not even compile tested > include/linux/mempolicy.h | 12 ------------ > mm/hugetlb.c | 24 ++++++++++++++++++++---- > 2 files changed, 20 insertions(+), 16 deletions(-) > > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h > index 668389b4b53d..e38b0ef20b8b 100644 > --- a/include/linux/mempolicy.h > +++ b/include/linux/mempolicy.h > @@ -151,13 +151,6 @@ extern bool mempolicy_in_oom_domain(struct task_struct *tsk, > const nodemask_t *mask); > extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy); > > -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) > -{ > - struct mempolicy *mpol = get_task_policy(current); > - > - return policy_nodemask(gfp, mpol); > -} > - > extern unsigned int mempolicy_slab_node(void); > > extern enum zone_type policy_zone; > @@ -294,11 +287,6 @@ static inline void mpol_put_task_policy(struct task_struct *task) > { > } > > -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) > -{ > - return NULL; > -} > - > static inline bool mpol_is_preferred_many(struct mempolicy *pol) > { > return false; > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index a18c071c294e..6cacbc9b15a1 100644 > --- a/mm/hugetlb.c > +++ 
b/mm/hugetlb.c > @@ -4330,18 +4330,34 @@ static int __init default_hugepagesz_setup(char *s) > } > __setup("default_hugepagesz=", default_hugepagesz_setup); > > +struct mempolicy *policy_mbind_nodemask(gfp_t gfp) > +{ > +#ifdef CONFIG_MEMPOLICY > + struct mempolicy *mpol = get_task_policy(current); > + > + /* > + * only enforce MBIND which overlaps with cpuset policy (from policy_nodemask) > + * specifically for hugetlb case > + */ > + if (mpol->mode == MPOL_BIND && > + (apply_policy_zone(mpol, gfp_zone(gfp)) && > + cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > + return &mpol->nodes; > +#endif > + return NULL; I saw the logic is not changed, and it confused me that if there is no qualified node, it will still return NULL, which effectively equals node_states[N_MEMORY], while I think it should return an all-zero nodemask. Thanks, Feng > +} > + > static unsigned int allowed_mems_nr(struct hstate *h) > { > int node; > unsigned int nr = 0; > - nodemask_t *mpol_allowed; > + nodemask_t *mbind_nodemask; > unsigned int *array = h->free_huge_pages_node; > gfp_t gfp_mask = htlb_alloc_mask(h); > > - mpol_allowed = policy_nodemask_current(gfp_mask); > - > + mbind_nodemask = policy_mbind_nodemask(gfp_mask); > for_each_node_mask(node, cpuset_current_mems_allowed) { > - if (!mpol_allowed || node_isset(node, *mpol_allowed)) > + if (!mbind_nodemask || node_isset(node, *mbind_nodemask)) > nr += array[node]; > } > > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-03 20:43 ` Feng Tang @ 2022-08-03 12:56 ` Michal Hocko 2022-08-03 21:08 ` Feng Tang 0 siblings, 1 reply; 22+ messages in thread From: Michal Hocko @ 2022-08-03 12:56 UTC (permalink / raw) To: Feng Tang Cc: Muchun Song, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel, Mike Kravetz On Thu 04-08-22 04:43:06, Feng Tang wrote: > On Wed, Aug 03, 2022 at 07:28:59PM +0800, Michal Hocko wrote: [...] > > +struct mempolicy *policy_mbind_nodemask(gfp_t gfp) > > +{ > > +#ifdef CONFIG_MEMPOLICY > > + struct mempolicy *mpol = get_task_policy(current); > > + > > + /* > > + * only enforce MBIND which overlaps with cpuset policy (from policy_nodemask) > > + * specifically for hugetlb case > > + */ > > + if (mpol->mode == MPOL_BIND && > > + (apply_policy_zone(mpol, gfp_zone(gfp)) && > > + cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > + return &mpol->nodes; > > +#endif > > + return NULL; > > I saw the logic is not changed, and it confused me that if there is > no qualified node, it will still return NULL which effectively equals > node_states[N_MEMORY], while I think it should return a all zero > nodemasks. This is a separate thing and I have to admit that the existing code is rather non-intuitive or even broken. I guess we do not care all that much because MBIND with completely non-overlapping cpusets is just a broken configuration. I am not sure this case is interesting or even supported. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-03 12:56 ` Michal Hocko @ 2022-08-03 21:08 ` Feng Tang 2022-08-03 13:21 ` Michal Hocko 0 siblings, 1 reply; 22+ messages in thread From: Feng Tang @ 2022-08-03 21:08 UTC (permalink / raw) To: Michal Hocko Cc: Muchun Song, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel, Mike Kravetz On Wed, Aug 03, 2022 at 08:56:44PM +0800, Michal Hocko wrote: > On Thu 04-08-22 04:43:06, Feng Tang wrote: > > On Wed, Aug 03, 2022 at 07:28:59PM +0800, Michal Hocko wrote: > [...] > > > +struct mempolicy *policy_mbind_nodemask(gfp_t gfp) > > > +{ > > > +#ifdef CONFIG_MEMPOLICY > > > + struct mempolicy *mpol = get_task_policy(current); > > > + > > > + /* > > > + * only enforce MBIND which overlaps with cpuset policy (from policy_nodemask) > > > + * specifically for hugetlb case > > > + */ > > > + if (mpol->mode == MPOL_BIND && > > > + (apply_policy_zone(mpol, gfp_zone(gfp)) && > > > + cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > > + return &mpol->nodes; > > > +#endif > > > + return NULL; > > > > I saw the logic is not changed, and it confused me that if there is > > no qualified node, it will still return NULL which effectively equals > > node_states[N_MEMORY], while I think it should return an all-zero > > nodemask. > > This is a separate thing and I have to admit that the existing code is > rather non-intuitive or even broken. I guess we do not care all that much because MBIND with completely non-overlapping cpusets is just a broken configuration. I am not sure this case is interesting or even supported. Fair enough, and moving the policy_mbind_nodemask() into hugetlb.c for one single caller makes it much less severe. Do we still need the other nodemask API I proposed earlier which has no parameter of gfp_flag, and used for __nr_hugepages_store_common? Thanks, Feng > -- > Michal Hocko > SUSE Labs > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-03 21:08 ` Feng Tang @ 2022-08-03 13:21 ` Michal Hocko 2022-08-04 8:27 ` Feng Tang 0 siblings, 1 reply; 22+ messages in thread From: Michal Hocko @ 2022-08-03 13:21 UTC (permalink / raw) To: Feng Tang Cc: Muchun Song, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel, Mike Kravetz On Thu 04-08-22 05:08:34, Feng Tang wrote: [...] > Do we still need the other nodemask API I proposed earlier which has > no parameter of gfp_flag, and used for __nr_hugepages_store_common? I would touch as little code as possible. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-03 13:21 ` Michal Hocko @ 2022-08-04 8:27 ` Feng Tang 2022-08-04 10:43 ` Michal Hocko 0 siblings, 1 reply; 22+ messages in thread From: Feng Tang @ 2022-08-04 8:27 UTC (permalink / raw) To: Michal Hocko, Muchun Song, Mike Kravetz, Andrew Morton Cc: bwidawsk, dave.hansen, linux-mm, linux-kernel On Wed, Aug 03, 2022 at 09:21:22PM +0800, Michal Hocko wrote: > On Thu 04-08-22 05:08:34, Feng Tang wrote: > [...] > > Do we still need the other nodemask API I proposed earlier which has > > no parameter of gfp_flag, and used for __nr_hugepages_store_common? > > I would touch as little code as possible. OK. Please review the following patch, thanks! - Feng --- From a2db9a57da616bb3ea21e48a4a9ceb5c2cf4f7a2 Mon Sep 17 00:00:00 2001 From: Feng Tang <feng.tang@intel.com> Date: Thu, 4 Aug 2022 09:39:24 +0800 Subject: [PATCH RFC] mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process Muchun Song found that after the MPOL_PREFERRED_MANY policy was introduced in commit b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") [1], the semantics of policy_nodemask_current() for this new policy have been changed: it returns 'preferred' nodes instead of 'allowed' nodes, which could hurt the usage of its caller in hugetlb: allowed_mems_nr(). Michal found that policy_nodemask_current() is only used by hugetlb, and suggested moving it to hugetlb code with a more explicit name to enforce the 'allowed' semantics, for which only the MPOL_BIND policy matters. One note on the new policy_mbind_nodemask(): the cross check between MPOL_BIND, the gfp flags and the cpuset configuration can lead to a case with no available node, which is considered a broken configuration, and 'NULL' (equal to all nodes) is returned. [1]. 
https://lore.kernel.org/lkml/20220801084207.39086-1-songmuchun@bytedance.com/t/ Reported-by: Muchun Song <songmuchun@bytedance.com> Suggested-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Feng Tang <feng.tang@intel.com> --- include/linux/mempolicy.h | 32 ++++++++++++++++++++------------ mm/hugetlb.c | 24 ++++++++++++++++++++---- mm/mempolicy.c | 20 -------------------- 3 files changed, 40 insertions(+), 36 deletions(-) diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 668389b4b53d..ea0168fffb4c 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -151,13 +151,6 @@ extern bool mempolicy_in_oom_domain(struct task_struct *tsk, const nodemask_t *mask); extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy); -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) -{ - struct mempolicy *mpol = get_task_policy(current); - - return policy_nodemask(gfp, mpol); -} - extern unsigned int mempolicy_slab_node(void); extern enum zone_type policy_zone; @@ -189,6 +182,26 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol) return (pol->mode == MPOL_PREFERRED_MANY); } +static inline int apply_policy_zone(struct mempolicy *policy, enum zone_type zone) +{ + enum zone_type dynamic_policy_zone = policy_zone; + + BUG_ON(dynamic_policy_zone == ZONE_MOVABLE); + + /* + * if policy->nodes has movable memory only, + * we apply policy when gfp_zone(gfp) = ZONE_MOVABLE only. + * + * policy->nodes is intersect with node_states[N_MEMORY]. + * so if the following test fails, it implies + * policy->nodes has movable memory only. 
+ */ + if (!nodes_intersects(policy->nodes, node_states[N_HIGH_MEMORY])) + dynamic_policy_zone = ZONE_MOVABLE; + + return zone >= dynamic_policy_zone; +} + #else @@ -294,11 +307,6 @@ static inline void mpol_put_task_policy(struct task_struct *task) { } -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) -{ - return NULL; -} - static inline bool mpol_is_preferred_many(struct mempolicy *pol) { return false; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a18c071c294e..ad84bb85b6de 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4330,18 +4330,34 @@ static int __init default_hugepagesz_setup(char *s) } __setup("default_hugepagesz=", default_hugepagesz_setup); +static nodemask_t *policy_mbind_nodemask(gfp_t gfp) +{ +#ifdef CONFIG_NUMA + struct mempolicy *mpol = get_task_policy(current); + + /* + * Only enforce MPOL_BIND policy which overlaps with cpuset policy + * (from policy_nodemask) specifically for hugetlb case + */ + if (mpol->mode == MPOL_BIND && + (apply_policy_zone(mpol, gfp_zone(gfp)) && + cpuset_nodemask_valid_mems_allowed(&mpol->nodes))) + return &mpol->nodes; +#endif + return NULL; +} + static unsigned int allowed_mems_nr(struct hstate *h) { int node; unsigned int nr = 0; - nodemask_t *mpol_allowed; + nodemask_t *mbind_nodemask; unsigned int *array = h->free_huge_pages_node; gfp_t gfp_mask = htlb_alloc_mask(h); - mpol_allowed = policy_nodemask_current(gfp_mask); - + mbind_nodemask = policy_mbind_nodemask(gfp_mask); for_each_node_mask(node, cpuset_current_mems_allowed) { - if (!mpol_allowed || node_isset(node, *mpol_allowed)) + if (!mbind_nodemask || node_isset(node, *mbind_nodemask)) nr += array[node]; } diff --git a/mm/mempolicy.c b/mm/mempolicy.c index d39b01fd52fe..5553bd53927f 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1805,26 +1805,6 @@ bool vma_policy_mof(struct vm_area_struct *vma) return pol->flags & MPOL_F_MOF; } -static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone) -{ - enum zone_type dynamic_policy_zone 
= policy_zone; - - BUG_ON(dynamic_policy_zone == ZONE_MOVABLE); - - /* - * if policy->nodes has movable memory only, - * we apply policy when gfp_zone(gfp) = ZONE_MOVABLE only. - * - * policy->nodes is intersect with node_states[N_MEMORY]. - * so if the following test fails, it implies - * policy->nodes has movable memory only. - */ - if (!nodes_intersects(policy->nodes, node_states[N_HIGH_MEMORY])) - dynamic_policy_zone = ZONE_MOVABLE; - - return zone >= dynamic_policy_zone; -} - /* * Return a nodemask representing a mempolicy for filtering nodes for * page allocation -- 2.27.0 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-04 8:27 ` Feng Tang @ 2022-08-04 10:43 ` Michal Hocko 2022-08-04 13:03 ` [PATCH] mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process Feng Tang 0 siblings, 1 reply; 22+ messages in thread From: Michal Hocko @ 2022-08-04 10:43 UTC (permalink / raw) To: Feng Tang Cc: Muchun Song, Mike Kravetz, Andrew Morton, bwidawsk, dave.hansen, linux-mm, linux-kernel On Thu 04-08-22 16:27:24, Feng Tang wrote: [...] > >From a2db9a57da616bb3ea21e48a4a9ceb5c2cf4f7a2 Mon Sep 17 00:00:00 2001 > From: Feng Tang <feng.tang@intel.com> > Date: Thu, 4 Aug 2022 09:39:24 +0800 > Subject: [PATCH RFC] mm/hugetlb: add dedicated func to get 'allowed' nodemask for > current process > > Muchun Song found that after MPOL_PREFERRED_MANY policy was introduced > in commit b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") > [1], the policy_nodemask_current()'s semantics for this new policy > has been changed, which returns 'preferred' nodes instead of 'allowed' > nodes, and could hurt the usage of its caller in hugetlb: > allowed_mems_nr(). > > Michal found the policy_nodemask_current() is only used by hugetlb, > and suggested to move it to hugetlb code with more explicit name to > enforce the 'allowed' semantics for which only MPOL_BIND policy > matters. > > One note for the new policy_mbind_nodemask() is, the cross check > from MPOL_BIND, gfp flags and cpuset configuration can lead to > a no available node case, which is considered to be broken > configuration and 'NULL' (equals all nodes) is returned. > > [1]. 
https://lore.kernel.org/lkml/20220801084207.39086-1-songmuchun@bytedance.com/t/ > Reported-by: Muchun Song <songmuchun@bytedance.com> > Suggested-by: Michal Hocko <mhocko@suse.com> > Signed-off-by: Feng Tang <feng.tang@intel.com> LGTM I would just make apply_policy_zone extern rather than making it static inline in a header which can turn out to cause other header dependencies. Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH] mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process 2022-08-04 10:43 ` Michal Hocko @ 2022-08-04 13:03 ` Feng Tang 2022-08-04 13:36 ` Michal Hocko 0 siblings, 1 reply; 22+ messages in thread From: Feng Tang @ 2022-08-04 13:03 UTC (permalink / raw) To: Michal Hocko, Muchun Song, Mike Kravetz, Andrew Morton Cc: Dave Hansen, Ben Widawsky, linux-mm, linux-kernel, Feng Tang Muchun Song found that after the MPOL_PREFERRED_MANY policy was introduced in commit b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") [1], the semantics of policy_nodemask_current() for this new policy have been changed: it returns 'preferred' nodes instead of 'allowed' nodes, which could hurt the usage of its caller in hugetlb: allowed_mems_nr(). Michal found that policy_nodemask_current() is only used by hugetlb, and suggested moving it to hugetlb code with a more explicit name to enforce the 'allowed' semantics, for which only the MPOL_BIND policy matters. One note on the new policy_mbind_nodemask(): the cross check between MPOL_BIND, the gfp flags and the cpuset configuration can lead to a case with no available node, which is considered a broken configuration, and 'NULL' (equal to all nodes) will be returned. apply_policy_zone() is made extern so that it can be called from hugetlb code, and its return value is changed to bool. [1]. 
https://lore.kernel.org/lkml/20220801084207.39086-1-songmuchun@bytedance.com/t/ Reported-by: Muchun Song <songmuchun@bytedance.com> Suggested-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Feng Tang <feng.tang@intel.com> --- include/linux/mempolicy.h | 13 +------------ mm/hugetlb.c | 24 ++++++++++++++++++++---- mm/mempolicy.c | 2 +- 3 files changed, 22 insertions(+), 17 deletions(-) diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 668389b4b53d..d232de7cdc56 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -151,13 +151,6 @@ extern bool mempolicy_in_oom_domain(struct task_struct *tsk, const nodemask_t *mask); extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy); -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) -{ - struct mempolicy *mpol = get_task_policy(current); - - return policy_nodemask(gfp, mpol); -} - extern unsigned int mempolicy_slab_node(void); extern enum zone_type policy_zone; @@ -189,6 +182,7 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol) return (pol->mode == MPOL_PREFERRED_MANY); } +extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone); #else @@ -294,11 +288,6 @@ static inline void mpol_put_task_policy(struct task_struct *task) { } -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) -{ - return NULL; -} - static inline bool mpol_is_preferred_many(struct mempolicy *pol) { return false; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a18c071c294e..ad84bb85b6de 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4330,18 +4330,34 @@ static int __init default_hugepagesz_setup(char *s) } __setup("default_hugepagesz=", default_hugepagesz_setup); +static nodemask_t *policy_mbind_nodemask(gfp_t gfp) +{ +#ifdef CONFIG_NUMA + struct mempolicy *mpol = get_task_policy(current); + + /* + * Only enforce MPOL_BIND policy which overlaps with cpuset policy + * (from policy_nodemask) specifically for hugetlb case + */ + if (mpol->mode == 
MPOL_BIND && + (apply_policy_zone(mpol, gfp_zone(gfp)) && + cpuset_nodemask_valid_mems_allowed(&mpol->nodes))) + return &mpol->nodes; +#endif + return NULL; +} + static unsigned int allowed_mems_nr(struct hstate *h) { int node; unsigned int nr = 0; - nodemask_t *mpol_allowed; + nodemask_t *mbind_nodemask; unsigned int *array = h->free_huge_pages_node; gfp_t gfp_mask = htlb_alloc_mask(h); - mpol_allowed = policy_nodemask_current(gfp_mask); - + mbind_nodemask = policy_mbind_nodemask(gfp_mask); for_each_node_mask(node, cpuset_current_mems_allowed) { - if (!mpol_allowed || node_isset(node, *mpol_allowed)) + if (!mbind_nodemask || node_isset(node, *mbind_nodemask)) nr += array[node]; } diff --git a/mm/mempolicy.c b/mm/mempolicy.c index d39b01fd52fe..9f15bc533601 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1805,7 +1805,7 @@ bool vma_policy_mof(struct vm_area_struct *vma) return pol->flags & MPOL_F_MOF; } -static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone) +bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone) { enum zone_type dynamic_policy_zone = policy_zone; -- 2.27.0 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH] mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process 2022-08-04 13:03 ` [PATCH] mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process Feng Tang @ 2022-08-04 13:36 ` Michal Hocko 2022-08-04 22:37 ` Andrew Morton 0 siblings, 1 reply; 22+ messages in thread From: Michal Hocko @ 2022-08-04 13:36 UTC (permalink / raw) To: Feng Tang Cc: Muchun Song, Mike Kravetz, Andrew Morton, Dave Hansen, Ben Widawsky, linux-mm, linux-kernel On Thu 04-08-22 21:03:42, Feng Tang wrote: > Muchun Song found that after MPOL_PREFERRED_MANY policy was introduced > in commit b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") > [1], the policy_nodemask_current()'s semantics for this new policy > has been changed, which returns 'preferred' nodes instead of 'allowed' > nodes, and could hurt the usage of its caller in hugetlb: > allowed_mems_nr(). The actual user-visible effect description is missing here. AFAIU it would be this. With the changed semantics of policy_nodemask_current, a task with MPOL_PREFERRED_MANY policy could fail to get its reservation even though it can fall back to other nodes (either defined by cpusets or all online nodes) for that reservation, failing mmap calls unnecessarily early. The fix is to not consider MPOL_PREFERRED_MANY for reservations at all because they, unlike MPOL_BIND, do not pose any actual hard constraint. You can keep the rest. > Michal found the policy_nodemask_current() is only used by hugetlb, > and suggested to move it to hugetlb code with more explicit name to > enforce the 'allowed' semantics for which only MPOL_BIND policy > matters. > > One note for the new policy_mbind_nodemask() is, the cross check > from MPOL_BIND, gfp flags and cpuset configuration can lead to > a no available node case, which is considered to be broken > configuration, and 'NULL' (equals all nodes) will be returned. This is neither important nor useful for this particular patch.
> apply_policy_zone() is made extern to be called in hugetlb code > and its return value is changed to bool. > > [1]. https://lore.kernel.org/lkml/20220801084207.39086-1-songmuchun@bytedance.com/t/ Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") I do not think stable is really required. > Reported-by: Muchun Song <songmuchun@bytedance.com> > Suggested-by: Michal Hocko <mhocko@suse.com> > Signed-off-by: Feng Tang <feng.tang@intel.com> with that Acked-by: Michal Hocko <mhocko@suse.com> thanks! > --- > include/linux/mempolicy.h | 13 +------------ > mm/hugetlb.c | 24 ++++++++++++++++++++---- > mm/mempolicy.c | 2 +- > 3 files changed, 22 insertions(+), 17 deletions(-) > > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h > index 668389b4b53d..d232de7cdc56 100644 > --- a/include/linux/mempolicy.h > +++ b/include/linux/mempolicy.h > @@ -151,13 +151,6 @@ extern bool mempolicy_in_oom_domain(struct task_struct *tsk, > const nodemask_t *mask); > extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy); > > -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) > -{ > - struct mempolicy *mpol = get_task_policy(current); > - > - return policy_nodemask(gfp, mpol); > -} > - > extern unsigned int mempolicy_slab_node(void); > > extern enum zone_type policy_zone; > @@ -189,6 +182,7 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol) > return (pol->mode == MPOL_PREFERRED_MANY); > } > > +extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone); > > #else > > @@ -294,11 +288,6 @@ static inline void mpol_put_task_policy(struct task_struct *task) > { > } > > -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) > -{ > - return NULL; > -} > - > static inline bool mpol_is_preferred_many(struct mempolicy *pol) > { > return false; > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index a18c071c294e..ad84bb85b6de 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ 
-4330,18 +4330,34 @@ static int __init default_hugepagesz_setup(char *s) > } > __setup("default_hugepagesz=", default_hugepagesz_setup); > > +static nodemask_t *policy_mbind_nodemask(gfp_t gfp) > +{ > +#ifdef CONFIG_NUMA > + struct mempolicy *mpol = get_task_policy(current); > + > + /* > + * Only enforce MPOL_BIND policy which overlaps with cpuset policy > + * (from policy_nodemask) specifically for hugetlb case > + */ > + if (mpol->mode == MPOL_BIND && > + (apply_policy_zone(mpol, gfp_zone(gfp)) && > + cpuset_nodemask_valid_mems_allowed(&mpol->nodes))) > + return &mpol->nodes; > +#endif > + return NULL; > +} > + > static unsigned int allowed_mems_nr(struct hstate *h) > { > int node; > unsigned int nr = 0; > - nodemask_t *mpol_allowed; > + nodemask_t *mbind_nodemask; > unsigned int *array = h->free_huge_pages_node; > gfp_t gfp_mask = htlb_alloc_mask(h); > > - mpol_allowed = policy_nodemask_current(gfp_mask); > - > + mbind_nodemask = policy_mbind_nodemask(gfp_mask); > for_each_node_mask(node, cpuset_current_mems_allowed) { > - if (!mpol_allowed || node_isset(node, *mpol_allowed)) > + if (!mbind_nodemask || node_isset(node, *mbind_nodemask)) > nr += array[node]; > } > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index d39b01fd52fe..9f15bc533601 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -1805,7 +1805,7 @@ bool vma_policy_mof(struct vm_area_struct *vma) > return pol->flags & MPOL_F_MOF; > } > > -static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone) > +bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone) > { > enum zone_type dynamic_policy_zone = policy_zone; > > -- > 2.27.0 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process 2022-08-04 13:36 ` Michal Hocko @ 2022-08-04 22:37 ` Andrew Morton 2022-08-05 0:06 ` Feng Tang 0 siblings, 1 reply; 22+ messages in thread From: Andrew Morton @ 2022-08-04 22:37 UTC (permalink / raw) To: Michal Hocko Cc: Feng Tang, Muchun Song, Mike Kravetz, Dave Hansen, Ben Widawsky, linux-mm, linux-kernel On Thu, 4 Aug 2022 15:36:48 +0200 Michal Hocko <mhocko@suse.com> wrote: > On Thu 04-08-22 21:03:42, Feng Tang wrote: > > Muchun Song found that after MPOL_PREFERRED_MANY policy was introduced > > in commit b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") > > [1], the policy_nodemask_current()'s semantics for this new policy > > has been changed, which returns 'preferred' nodes instead of 'allowed' > > nodes, and could hurt the usage of its caller in hugetlb: > > allowed_mems_nr(). > > The actual user-visible effect description is missing here. AFAIU it > would be this. > > With the changed semantics of policy_nodemask_current, a task with > MPOL_PREFERRED_MANY policy could fail to get its reservation even though > it can fall back to other nodes (either defined by cpusets or all online > nodes) for that reservation, failing mmap calls unnecessarily early. > > The fix is to not consider MPOL_PREFERRED_MANY for reservations at all > because they, unlike MPOL_BIND, do not pose any actual hard constraint. And is this Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes")? ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process 2022-08-04 22:37 ` Andrew Morton @ 2022-08-05 0:06 ` Feng Tang 0 siblings, 0 replies; 22+ messages in thread From: Feng Tang @ 2022-08-05 0:06 UTC (permalink / raw) To: Andrew Morton Cc: Hocko, Michal, Muchun Song, Mike Kravetz, Dave Hansen, Ben Widawsky, linux-mm, linux-kernel On Fri, Aug 05, 2022 at 06:37:17AM +0800, Andrew Morton wrote: > On Thu, 4 Aug 2022 15:36:48 +0200 Michal Hocko <mhocko@suse.com> wrote: > > > On Thu 04-08-22 21:03:42, Feng Tang wrote: > > > Muchun Song found that after MPOL_PREFERRED_MANY policy was introduced > > > in commit b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") > > > [1], the policy_nodemask_current()'s semantics for this new policy > > > has been changed, which returns 'preferred' nodes instead of 'allowed' > > > nodes, and could hurt the usage of its caller in hugetlb: > > > allowed_mems_nr(). > > > > The actual user-visible effect description is missing here. AFAIU it > > would be this. > > > > With the changed semantics of policy_nodemask_current, a task with > > MPOL_PREFERRED_MANY policy could fail to get its reservation even though > > it can fall back to other nodes (either defined by cpusets or all online > > nodes) for that reservation, failing mmap calls unnecessarily early. > > > > The fix is to not consider MPOL_PREFERRED_MANY for reservations at all > > because they, unlike MPOL_BIND, do not pose any actual hard constraint. > > And is this Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY > for multiple preferred nodes")? Yes. Will add it in the next version, thanks. - Feng ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread. Thread overview: 22+ messages -- 2022-08-01 8:42 [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case Muchun Song 2022-08-01 9:06 ` Michal Hocko 2022-08-01 9:26 ` Feng Tang 2022-08-02 3:42 ` Muchun Song 2022-08-02 5:52 ` Feng Tang 2022-08-02 6:40 ` Muchun Song 2022-08-02 7:39 ` Feng Tang 2022-08-02 9:02 ` Michal Hocko 2022-08-03 6:41 ` Feng Tang 2022-08-03 7:36 ` Michal Hocko 2022-08-03 17:14 ` Feng Tang 2022-08-03 11:28 ` Michal Hocko 2022-08-03 20:43 ` Feng Tang 2022-08-03 12:56 ` Michal Hocko 2022-08-03 21:08 ` Feng Tang 2022-08-03 13:21 ` Michal Hocko 2022-08-04 8:27 ` Feng Tang 2022-08-04 10:43 ` Michal Hocko 2022-08-04 13:03 ` [PATCH] mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process Feng Tang 2022-08-04 13:36 ` Michal Hocko 2022-08-04 22:37 ` Andrew Morton 2022-08-05 0:06 ` Feng Tang