* [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case @ 2022-08-01 8:42 Muchun Song 2022-08-01 9:06 ` Michal Hocko 0 siblings, 1 reply; 22+ messages in thread From: Muchun Song @ 2022-08-01 8:42 UTC (permalink / raw) To: akpm, bwidawsk, mhocko, dave.hansen, feng.tang Cc: linux-mm, linux-kernel, Muchun Song policy_nodemask() is supposed to return a nodemask representing a mempolicy for filtering nodes for page allocation, which is a hard restriction (see the user of allowed_mems_nr() in hugetlb.c). However, MPOL_PREFERRED_MANY is a preferred mode, not a hard restriction, so returning its nodes here breaks the HugeTLB user. Remove the MPOL_PREFERRED_MANY case from policy_nodemask() to fix this. The change does not affect current users of policy_nodemask(), since all of them already handle the MPOL_PREFERRED_MANY case before calling it. Found by code inspection. Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") Signed-off-by: Muchun Song <songmuchun@bytedance.com> --- mm/mempolicy.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 6c27acb6cd63..4deec7e598c6 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1845,9 +1845,6 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) cpuset_nodemask_valid_mems_allowed(&policy->nodes)) return &policy->nodes; - if (mode == MPOL_PREFERRED_MANY) - return &policy->nodes; - return NULL; } -- 2.11.0 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-01 8:42 [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case Muchun Song @ 2022-08-01 9:06 ` Michal Hocko 2022-08-01 9:26 ` Feng Tang 0 siblings, 1 reply; 22+ messages in thread From: Michal Hocko @ 2022-08-01 9:06 UTC (permalink / raw) To: Muchun Song Cc: akpm, bwidawsk, dave.hansen, feng.tang, linux-mm, linux-kernel On Mon 01-08-22 16:42:07, Muchun Song wrote: > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy > for filtering nodes for page allocation, which is a hard restriction (see the user > of allowed_mems_nr() in hugetlb.c). However, MPOL_PREFERRED_MANY is a preferred > mode not a hard restriction. Now it breaks the user of HugeTLB. Remove it from > policy_nodemask() to fix it, which will not affect current users of policy_nodemask() > since all of the users already have handled the case of MPOL_PREFERRED_MANY before > calling it. BTW, it is found by code inspection. I am not sure this is the right fix. It is quite true that policy_nodemask is a tricky function to use. It pretends to have a higher level logic but all existing users are expected to be policy aware and they special case allocation for each policy. That would mean that hugetlb should do the same. I haven't checked the actual behavior implications for hugetlb here. Is MPOL_PREFERRED_MANY even supported for hugetlb? Does this change make it work? From a quick look this just ignores MPOL_PREFERRED_MANY completely. 
> Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") > Signed-off-by: Muchun Song <songmuchun@bytedance.com> > --- > mm/mempolicy.c | 3 --- > 1 file changed, 3 deletions(-) > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 6c27acb6cd63..4deec7e598c6 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -1845,9 +1845,6 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > return &policy->nodes; > > - if (mode == MPOL_PREFERRED_MANY) > - return &policy->nodes; > - > return NULL; > } > > -- > 2.11.0 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-01 9:06 ` Michal Hocko @ 2022-08-01 9:26 ` Feng Tang 2022-08-02 3:42 ` Muchun Song 0 siblings, 1 reply; 22+ messages in thread From: Feng Tang @ 2022-08-01 9:26 UTC (permalink / raw) To: Michal Hocko Cc: Muchun Song, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel On Mon, Aug 01, 2022 at 05:06:14PM +0800, Michal Hocko wrote: > On Mon 01-08-22 16:42:07, Muchun Song wrote: > > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy > > for filtering nodes for page allocation, which is a hard restriction (see the user > > of allowed_mems_nr() in hugetlb.c). However, MPOL_PREFERRED_MANY is a preferred > > mode not a hard restriction. Now it breaks the user of HugeTLB. Remove it from > > policy_nodemask() to fix it, which will not affect current users of policy_nodemask() > > since all of the users already have handled the case of MPOL_PREFERRED_MANY before > > calling it. BTW, it is found by code inspection. > > I am not sure this is the right fix. It is quite true that > policy_nodemask is a tricky function to use. It pretends to have a > higher level logic but all existing users are expected to be policy > aware and they special case allocation for each policy. That would mean > that hugetlb should do the same. Yes, when I worked on the MPOL_PREFERRED_MANY patches, I was also confused about policy_nodemask(), as it has never been a 'strict' one; the old code was: if (unlikely(mode == MPOL_BIND) && apply_policy_zone(policy, gfp_zone(gfp)) && cpuset_nodemask_valid_mems_allowed(&policy->nodes)) return &policy->nodes; return NULL Even when MPOL_BIND's nodes are not allowed by cpuset, it will still return NULL (which means all nodes). From the semantics of allowed_mems_nr(), I think it does get changed a little by b27abaccf8e8. And to enforce the 'strict' semantics of 'allowed', we may need a stricter nodemask API for it. 
> I haven't checked the actual behavior implications for hugetlb here. Is > MPOL_PREFERRED_MANY even supported for hugetlb? Does this change make it > work? From a quick look this just ignores MPOL_PREFERRED_MANY > completely. IIRC, hugetlb will honor MPOL_PREFERRED_MANY. I can double-check and report back if otherwise. > > Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") > > Signed-off-by: Muchun Song <songmuchun@bytedance.com> > > --- > > mm/mempolicy.c | 3 --- > > 1 file changed, 3 deletions(-) > > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > > index 6c27acb6cd63..4deec7e598c6 100644 > > --- a/mm/mempolicy.c > > +++ b/mm/mempolicy.c > > @@ -1845,9 +1845,6 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) > > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > return &policy->nodes; > > > > - if (mode == MPOL_PREFERRED_MANY) > > - return &policy->nodes; I think it will make MPOL_PREFERRED_MANY not usable. Thanks, Feng > > - > > return NULL; > > } > > > > -- > > 2.11.0 > > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
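[Editor's illustration] To make the 'hard restriction' vs. 'preference' distinction discussed above concrete, here is a minimal userspace sketch. It is not kernel code: the types, the plain-bitmask nodemask, and all function names are simplified stand-ins. It models policy_nodemask() both with and without the MPOL_PREFERRED_MANY case that the patch removes, and shows how treating the preference as a hard mask shrinks the count that a model of allowed_mems_nr() reports:

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel types under discussion. */
enum mpol_mode { MPOL_DEFAULT, MPOL_BIND, MPOL_PREFERRED_MANY };

struct mempolicy {
	enum mpol_mode mode;
	unsigned long nodes;	/* bit n set => node n is in the policy */
};

/*
 * Model of policy_nodemask(): NULL means "no restriction" (all nodes).
 * With pref_is_hard set, MPOL_PREFERRED_MANY behaves as it did after
 * b27abaccf8e8 (preference treated as a hard mask); with it clear, it
 * behaves as the patch at the top of the thread proposes.
 */
static unsigned long *policy_nodemask_model(struct mempolicy *pol, int pref_is_hard)
{
	if (pol->mode == MPOL_BIND)
		return &pol->nodes;
	if (pref_is_hard && pol->mode == MPOL_PREFERRED_MANY)
		return &pol->nodes;
	return NULL;
}

/* Model of allowed_mems_nr(): sum free pages over the allowed nodes. */
static unsigned int allowed_mems_nr_model(struct mempolicy *pol, int pref_is_hard,
					  const unsigned int *free_pages, int nr_nodes)
{
	unsigned long *mask = policy_nodemask_model(pol, pref_is_hard);
	unsigned int nr = 0;
	int node;

	for (node = 0; node < nr_nodes; node++)
		if (!mask || (*mask & (1UL << node)))
			nr += free_pages[node];
	return nr;
}
```

For example, with two nodes holding 10 and 20 free pages and a policy naming only node 0, MPOL_BIND correctly restricts the count to 10, while MPOL_PREFERRED_MANY should still see all 30; the preference-as-hard-mask variant wrongly reports 10, which is the over-restriction being fixed.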
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-01 9:26 ` Feng Tang @ 2022-08-02 3:42 ` Muchun Song 2022-08-02 5:52 ` Feng Tang 0 siblings, 1 reply; 22+ messages in thread From: Muchun Song @ 2022-08-02 3:42 UTC (permalink / raw) To: Feng Tang Cc: Michal Hocko, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel On Mon, Aug 01, 2022 at 05:26:23PM +0800, Feng Tang wrote: > On Mon, Aug 01, 2022 at 05:06:14PM +0800, Michal Hocko wrote: > > On Mon 01-08-22 16:42:07, Muchun Song wrote: > > > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy > > > for filtering nodes for page allocation, which is a hard restriction (see the user > > > of allowed_mems_nr() in hugetlb.c). However, MPOL_PREFERRED_MANY is a preferred > > > mode not a hard restriction. Now it breaks the user of HugeTLB. Remove it from > > > policy_nodemask() to fix it, which will not affect current users of policy_nodemask() > > > since all of the users already have handled the case of MPOL_PREFERRED_MANY before > > > calling it. BTW, it is found by code inspection. > > > > I am not sure this is the right fix. It is quite true that > > policy_nodemask is a tricky function to use. It pretends to have a > > higher level logic but all existing users are expected to be policy > > aware and they special case allocation for each policy. That would mean > > that hugetlb should do the same. > > Yes, when I worked on the MPOL_PREFERRED_MANY patches, I was also > confused about policy_nodemask(), as it is never a 'strict' one as > the old code is: > > if (unlikely(mode == MPOL_BIND) && > apply_policy_zone(policy, gfp_zone(gfp)) && > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > return &policy->nodes; > > return NULL > > Even when the MPOL_BIND's nodes is not allowed by cpuset, it will > still return NULL (equals all nodes). > Well, I agree policy_nodemask() is really confusing because of the shortage of comments and the weird logic. 
> From the semantics of allowed_mems_nr(), I think it does get changed > a little by b27abaccf8e8. And to enforce the 'strict' semantic for > 'allowed', we may need a more strict nodemask API for it. > Maybe that is a good way to fix this, e.g. introducing a new helper that returns the strictly allowed nodemask. > > I haven't checked the actual behavior implications for hugetlb here. Is > > MPOL_PREFERRED_MANY even supported for hugetlb? Does this change make it > > work? From a quick look this just ignores MPOL_PREFERRED_MANY > > completely. > > IIRC, the hugetlb will hornor MPOL_PREFERRED_MANY. And I can double > check and report back if otherwise. > > > > Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com> > > > --- > > > mm/mempolicy.c | 3 --- > > > 1 file changed, 3 deletions(-) > > > > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > > > index 6c27acb6cd63..4deec7e598c6 100644 > > > --- a/mm/mempolicy.c > > > +++ b/mm/mempolicy.c > > > @@ -1845,9 +1845,6 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) > > > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > > return &policy->nodes; > > > > > > - if (mode == MPOL_PREFERRED_MANY) > > > - return &policy->nodes; > > I think it will make MPOL_PREFERRED_MANY not usable. > Sorry, I didn't get what you mean here. Could you explain in more detail why it is not usable? Thanks. > Thanks, > Feng > > > > - > > > return NULL; > > > } > > > > > > -- > > > 2.11.0 > > > > -- > > Michal Hocko > > SUSE Labs > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-02 3:42 ` Muchun Song @ 2022-08-02 5:52 ` Feng Tang 2022-08-02 6:40 ` Muchun Song 0 siblings, 1 reply; 22+ messages in thread From: Feng Tang @ 2022-08-02 5:52 UTC (permalink / raw) To: Muchun Song Cc: Hocko, Michal, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel On Tue, Aug 02, 2022 at 11:42:52AM +0800, Muchun Song wrote: > On Mon, Aug 01, 2022 at 05:26:23PM +0800, Feng Tang wrote: > > On Mon, Aug 01, 2022 at 05:06:14PM +0800, Michal Hocko wrote: > > > On Mon 01-08-22 16:42:07, Muchun Song wrote: > > > > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy > > > > for filtering nodes for page allocation, which is a hard restriction (see the user > > > > of allowed_mems_nr() in hugetlb.c). However, MPOL_PREFERRED_MANY is a preferred > > > > mode not a hard restriction. Now it breaks the user of HugeTLB. Remove it from > > > > policy_nodemask() to fix it, which will not affect current users of policy_nodemask() > > > > since all of the users already have handled the case of MPOL_PREFERRED_MANY before > > > > calling it. BTW, it is found by code inspection. > > > > > > I am not sure this is the right fix. It is quite true that > > > policy_nodemask is a tricky function to use. It pretends to have a > > > higher level logic but all existing users are expected to be policy > > > aware and they special case allocation for each policy. That would mean > > > that hugetlb should do the same. 
> > > > Yes, when I worked on the MPOL_PREFERRED_MANY patches, I was also > > confused about policy_nodemask(), as it is never a 'strict' one as > > the old code is: > > > > if (unlikely(mode == MPOL_BIND) && > > apply_policy_zone(policy, gfp_zone(gfp)) && > > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > return &policy->nodes; > > > > return NULL > > > > Even when the MPOL_BIND's nodes is not allowed by cpuset, it will > > still return NULL (equals all nodes). > > > > Well, I agree policy_nodemask() is really confusing because of the > shortage of comments and the weird logic. > > > From the semantics of allowed_mems_nr(), I think it does get changed > > a little by b27abaccf8e8. And to enforce the 'strict' semantic for > > 'allowed', we may need a more strict nodemask API for it. > > > > Maybe this is a good idea to fix this, e.g. introducing a new helper > to return the strict allowed nodemask. Yep. I had another thought: add one global all-zero nodemask. For an API like policy_nodemask(), there are two types of return value: * a nodemask with some bits set * NULL (means all nodes) Here a new type, a zero nodemask (a global variable), can be created to indicate that no node qualifies. > > > I haven't checked the actual behavior implications for hugetlb here. Is > > > MPOL_PREFERRED_MANY even supported for hugetlb? Does this change make it > > > work? From a quick look this just ignores MPOL_PREFERRED_MANY > > > completely. > > > > IIRC, the hugetlb will hornor MPOL_PREFERRED_MANY. And I can double > > check and report back if otherwise. 
> > > > > > Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") > > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com> > > > > --- > > > > mm/mempolicy.c | 3 --- > > > > 1 file changed, 3 deletions(-) > > > > > > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > > > > index 6c27acb6cd63..4deec7e598c6 100644 > > > > --- a/mm/mempolicy.c > > > > +++ b/mm/mempolicy.c > > > > @@ -1845,9 +1845,6 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) > > > > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > > > return &policy->nodes; > > > > > > > > - if (mode == MPOL_PREFERRED_MANY) > > > > - return &policy->nodes; > > > > I think it will make MPOL_PREFERRED_MANY not usable. > > > > Sorry, I didn't got what you mean here. Could you explain more details > about why it is not usable? I thought alloc_pages() would rely on policy_nodemask(), which was wrong, as I forgot that MPOL_PREFERRED_MANY has a dedicated function, alloc_pages_preferred_many(), to handle it. Sorry for the confusion. Thanks, Feng > Thanks. > > > Thanks, > > Feng > > > > > > - > > > > return NULL; > > > > } > > > > > > > > -- > > > > 2.11.0 > > > > > > -- > > > Michal Hocko > > > SUSE Labs > > > ^ permalink raw reply [flat|nested] 22+ messages in thread
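[Editor's illustration] The all-zero sentinel idea Feng floats above can be modeled in a short userspace sketch. All names here are hypothetical simplifications (the real kernel interface would deal in nodemask_t and cpuset state): the lookup returns a real mask, NULL for "all nodes", or the address of a dedicated zero mask for "no qualified node":

```c
#include <stddef.h>

enum mpol_mode { MPOL_DEFAULT, MPOL_BIND, MPOL_PREFERRED_MANY };

struct mempolicy {
	enum mpol_mode mode;
	unsigned long nodes;	/* policy's node bitmask */
};

/* Global sentinel meaning "no qualified node" (mirrors the proposed zero_nodes). */
static unsigned long zero_nodes;

/*
 * Strict variant of the lookup: MPOL_BIND with a cpuset-valid mask
 * restricts allocation to that mask, MPOL_BIND with an invalid mask
 * allows nothing at all, and every other policy places no restriction
 * (NULL == all nodes). Callers can now tell "all" and "none" apart.
 */
static unsigned long *allowed_nodemask_model(struct mempolicy *pol, int cpuset_valid)
{
	if (pol->mode == MPOL_BIND)
		return cpuset_valid ? &pol->nodes : &zero_nodes;
	return NULL;
}
```

The design point is that a single pointer return value now carries three distinguishable cases without callers having to scan the mask for emptiness first.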
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-02 5:52 ` Feng Tang @ 2022-08-02 6:40 ` Muchun Song 2022-08-02 7:39 ` Feng Tang 0 siblings, 1 reply; 22+ messages in thread From: Muchun Song @ 2022-08-02 6:40 UTC (permalink / raw) To: Feng Tang Cc: Hocko, Michal, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel On Tue, Aug 02, 2022 at 01:52:05PM +0800, Feng Tang wrote: > On Tue, Aug 02, 2022 at 11:42:52AM +0800, Muchun Song wrote: > > On Mon, Aug 01, 2022 at 05:26:23PM +0800, Feng Tang wrote: > > > On Mon, Aug 01, 2022 at 05:06:14PM +0800, Michal Hocko wrote: > > > > On Mon 01-08-22 16:42:07, Muchun Song wrote: > > > > > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy > > > > > for filtering nodes for page allocation, which is a hard restriction (see the user > > > > > of allowed_mems_nr() in hugetlb.c). However, MPOL_PREFERRED_MANY is a preferred > > > > > mode not a hard restriction. Now it breaks the user of HugeTLB. Remove it from > > > > > policy_nodemask() to fix it, which will not affect current users of policy_nodemask() > > > > > since all of the users already have handled the case of MPOL_PREFERRED_MANY before > > > > > calling it. BTW, it is found by code inspection. > > > > > > > > I am not sure this is the right fix. It is quite true that > > > > policy_nodemask is a tricky function to use. It pretends to have a > > > > higher level logic but all existing users are expected to be policy > > > > aware and they special case allocation for each policy. That would mean > > > > that hugetlb should do the same. 
> > > > > > Yes, when I worked on the MPOL_PREFERRED_MANY patches, I was also > > > confused about policy_nodemask(), as it is never a 'strict' one as > > > the old code is: > > > > > > if (unlikely(mode == MPOL_BIND) && > > > apply_policy_zone(policy, gfp_zone(gfp)) && > > > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > > return &policy->nodes; > > > > > > return NULL > > > > > > Even when the MPOL_BIND's nodes is not allowed by cpuset, it will > > > still return NULL (equals all nodes). > > > > > > > Well, I agree policy_nodemask() is really confusing because of the > > shortage of comments and the weird logic. > > > > > From the semantics of allowed_mems_nr(), I think it does get changed > > > a little by b27abaccf8e8. And to enforce the 'strict' semantic for > > > 'allowed', we may need a more strict nodemask API for it. > > > > > > > Maybe this is a good idea to fix this, e.g. introducing a new helper > > to return the strict allowed nodemask. > > Yep. > > I had another thought to add one global all-zero nodemask, for API like > policy_nodemask(), it has 2 types of return value: > * a nodemask with some bits set > * NULL (means all nodes) > > Here a new type of zero nodemask (a gloabl variable)can be created to > indicate no qualified node. > I see why you want to introduce a global zero nodemask. Since we already have a global nodemask array, namely node_states, instead of returning NULL for the all-nodes case, how about returning node_states[N_ONLINE] for it? And make it return NULL for the case where no nodes are allowed. Any thoughts? > > > > I haven't checked the actual behavior implications for hugetlb here. Is > > > > MPOL_PREFERRED_MANY even supported for hugetlb? Does this change make it > > > > work? From a quick look this just ignores MPOL_PREFERRED_MANY > > > > completely. > > > > > > IIRC, the hugetlb will hornor MPOL_PREFERRED_MANY. And I can double > > > check and report back if otherwise. 
> > > > > > > > Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") > > > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com> > > > > > --- > > > > > mm/mempolicy.c | 3 --- > > > > > 1 file changed, 3 deletions(-) > > > > > > > > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > > > > > index 6c27acb6cd63..4deec7e598c6 100644 > > > > > --- a/mm/mempolicy.c > > > > > +++ b/mm/mempolicy.c > > > > > @@ -1845,9 +1845,6 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) > > > > > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > > > > return &policy->nodes; > > > > > > > > > > - if (mode == MPOL_PREFERRED_MANY) > > > > > - return &policy->nodes; > > > > > > I think it will make MPOL_PREFERRED_MANY not usable. > > > > > > > Sorry, I didn't got what you mean here. Could you explain more details > > about why it is not usable? > > I thought alloc_pages() will rely on policy_nodemask(), which was wrong > as I forgot the MPOL_PREFERRED_MANY has a dedicated function > alloc_pages_preferred_many() to handle it. Sorry for the confusion. > > Thanks, > Feng > > > Thanks. > > > > > Thanks, > > > Feng > > > > > > > > - > > > > > return NULL; > > > > > } > > > > > > > > > > -- > > > > > 2.11.0 > > > > > > > > -- > > > > Michal Hocko > > > > SUSE Labs > > > > > > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-02 6:40 ` Muchun Song @ 2022-08-02 7:39 ` Feng Tang 2022-08-02 9:02 ` Michal Hocko 0 siblings, 1 reply; 22+ messages in thread From: Feng Tang @ 2022-08-02 7:39 UTC (permalink / raw) To: Muchun Song Cc: Hocko, Michal, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel On Tue, Aug 02, 2022 at 02:40:11PM +0800, Muchun Song wrote: > On Tue, Aug 02, 2022 at 01:52:05PM +0800, Feng Tang wrote: > > On Tue, Aug 02, 2022 at 11:42:52AM +0800, Muchun Song wrote: > > > On Mon, Aug 01, 2022 at 05:26:23PM +0800, Feng Tang wrote: > > > > On Mon, Aug 01, 2022 at 05:06:14PM +0800, Michal Hocko wrote: > > > > > On Mon 01-08-22 16:42:07, Muchun Song wrote: > > > > > > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy > > > > > > for filtering nodes for page allocation, which is a hard restriction (see the user > > > > > > of allowed_mems_nr() in hugetlb.c). However, MPOL_PREFERRED_MANY is a preferred > > > > > > mode not a hard restriction. Now it breaks the user of HugeTLB. Remove it from > > > > > > policy_nodemask() to fix it, which will not affect current users of policy_nodemask() > > > > > > since all of the users already have handled the case of MPOL_PREFERRED_MANY before > > > > > > calling it. BTW, it is found by code inspection. > > > > > > > > > > I am not sure this is the right fix. It is quite true that > > > > > policy_nodemask is a tricky function to use. It pretends to have a > > > > > higher level logic but all existing users are expected to be policy > > > > > aware and they special case allocation for each policy. That would mean > > > > > that hugetlb should do the same. 
> > > > > > > > Yes, when I worked on the MPOL_PREFERRED_MANY patches, I was also > > > > confused about policy_nodemask(), as it is never a 'strict' one as > > > > the old code is: > > > > > > > > if (unlikely(mode == MPOL_BIND) && > > > > apply_policy_zone(policy, gfp_zone(gfp)) && > > > > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > > > return &policy->nodes; > > > > > > > > return NULL > > > > > > > > Even when the MPOL_BIND's nodes is not allowed by cpuset, it will > > > > still return NULL (equals all nodes). > > > > > > > > > > Well, I agree policy_nodemask() is really confusing because of the > > > shortage of comments and the weird logic. > > > > > > > From the semantics of allowed_mems_nr(), I think it does get changed > > > > a little by b27abaccf8e8. And to enforce the 'strict' semantic for > > > > 'allowed', we may need a more strict nodemask API for it. > > > > > > > > > > Maybe this is a good idea to fix this, e.g. introducing a new helper > > > to return the strict allowed nodemask. > > > > Yep. > > > > I had another thought to add one global all-zero nodemask, for API like > > policy_nodemask(), it has 2 types of return value: > > * a nodemask with some bits set > > * NULL (means all nodes) > > > > Here a new type of zero nodemask (a gloabl variable)can be created to > > indicate no qualified node. > > > > I know why you want to introduce a gloable zero nidemask. Since we already > have a glable nodemask array, namely node_states, instead of returning NULL > for the case of all nodes, how about returing node_states[N_ONLINE] for it? > And make it return NULL for the case where no nodes are allowed. Any thought? I think returning node_states[N_ONLINE] can simplify the code in allowed_mems_nr(), and the empty zero-nodes mask can simplify it further. 
Here is some draft patch (not tested) to show the idea Thanks, Feng --- include/linux/mempolicy.h | 8 ++++++++ include/linux/nodemask.h | 7 +++++++ mm/hugetlb.c | 7 ++++--- mm/mempolicy.c | 17 +++++++++++++++++ mm/page_alloc.c | 3 +++ 5 files changed, 39 insertions(+), 3 deletions(-) diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 668389b4b53d..b5451fef1620 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -150,6 +150,7 @@ extern bool init_nodemask_of_mempolicy(nodemask_t *mask); extern bool mempolicy_in_oom_domain(struct task_struct *tsk, const nodemask_t *mask); extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy); +extern nodemask_t *allowed_policy_nodemask(gfp_t gfp, struct mempolicy *policy); static inline nodemask_t *policy_nodemask_current(gfp_t gfp) { @@ -158,6 +159,13 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp) return policy_nodemask(gfp, mpol); } +static inline nodemask_t *allowed_policy_nodemask_current(gfp_t gfp) +{ + struct mempolicy *mpol = get_task_policy(current); + + return allowed_policy_nodemask(gfp, mpol); +} + extern unsigned int mempolicy_slab_node(void); extern enum zone_type policy_zone; diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h index 0f233b76c9ce..dc5fab38e810 100644 --- a/include/linux/nodemask.h +++ b/include/linux/nodemask.h @@ -409,6 +409,13 @@ enum node_states { extern nodemask_t node_states[NR_NODE_STATES]; +extern nodemask_t zero_nodes; + +static inline bool is_empty_nodes(nodemask_t *pnodes) +{ + return (pnodes == &zero_nodes || __nodes_empty(pnodes, MAX_NUMNODES)); +} + #if MAX_NUMNODES > 1 static inline int node_state(int node, enum node_states state) { diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a57e1be41401..dc9f4ed32909 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4340,10 +4340,11 @@ static unsigned int allowed_mems_nr(struct hstate *h) mpol_allowed = policy_nodemask_current(gfp_mask); - for_each_node_mask(node, 
cpuset_current_mems_allowed) { - if (!mpol_allowed || node_isset(node, *mpol_allowed)) + if (!mpol_allowed) + mpol_allowed = &cpuset_current_mems_allowed; + else if (is_empty_nodes(mpol_allowed)) + return 0; + + for_each_node_mask(node, *mpol_allowed) nr += array[node]; - } return nr; } diff --git a/mm/mempolicy.c b/mm/mempolicy.c index d39b01fd52fe..3e936b8ca9ea 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1845,6 +1845,23 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) return NULL; } +/* + * Return the allowed nodes mask for a mempolicy and page allocation, + * which is a 'stricter' semantic than policy_nodemask() + */ +nodemask_t *allowed_policy_nodemask(gfp_t gfp, struct mempolicy *policy) +{ + if (unlikely(policy->mode == MPOL_BIND)) { + if (apply_policy_zone(policy, gfp_zone(gfp)) && + cpuset_nodemask_valid_mems_allowed(&policy->nodes)) + return &policy->nodes; + else + return &zero_nodes; + } + + return NULL; +} + /* * Return the preferred node id for 'prefer' mempolicy, and return * the given id for all other policies. diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e008a3df0485..3549ea037588 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -181,6 +181,9 @@ nodemask_t node_states[NR_NODE_STATES] __read_mostly = { }; EXPORT_SYMBOL(node_states); +nodemask_t zero_nodes = NODE_MASK_NONE; +EXPORT_SYMBOL(zero_nodes); + atomic_long_t _totalram_pages __read_mostly; EXPORT_SYMBOL(_totalram_pages); unsigned long totalreserve_pages __read_mostly; -- 2.27.0 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-02 7:39 ` Feng Tang @ 2022-08-02 9:02 ` Michal Hocko 2022-08-03 6:41 ` Feng Tang 0 siblings, 1 reply; 22+ messages in thread From: Michal Hocko @ 2022-08-02 9:02 UTC (permalink / raw) To: Feng Tang Cc: Muchun Song, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel, Mike Kravetz Please make sure to CC Mike on hugetlb related changes. I didn't really grasp your proposed solution, but it feels like it is going sideways. The real issue is that hugetlb uses a dedicated allocation scheme which is not fully MPOL_PREFERRED_MANY aware AFAICS. I do not think we should be tricking that by providing some fake nodemasks and what not. The good news is that allocation from the pool is MPOL_PREFERRED_MANY aware, because it first tries to allocate from the preferred node mask and then falls back to the full nodemask (dequeue_huge_page_vma). If the existing pools cannot really satisfy that allocation then it tries to allocate a new hugetlb page (alloc_fresh_huge_page), which also performs a two-stage allocation, first with the node mask and then without one. But both of them might fail. The bad news is that other allocation functions - including those that allocate to the pool - are not fully MPOL_PREFERRED_MANY aware. E.g. the __nr_hugepages_store_common paths, which use the allocating process's policy to fill up the pool, so the pool could be under-provisioned if that context is using MPOL_PREFERRED_MANY. Wrt. allowed_mems_nr (i.e. hugetlb_acct_memory) this is reservation code and I have to admit I do not really remember details there. This is subtle code and my best guess would be that policy_nodemask_current should be hugetlb specific and only care about MPOL_BIND. 
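[Editor's illustration] The two-stage scheme described above (try the preferred node mask first, then fall back to the full node set) can be sketched in userspace. try_alloc() here is a hypothetical stand-in for the dequeue / alloc-fresh paths, not a real kernel function, and nodes are a plain bitmask:

```c
#include <stddef.h>

/* Hypothetical single-stage allocator: succeeds only on an in-mask node with stock. */
static int try_alloc(const unsigned long *mask, const int *stock, int nr_nodes)
{
	int node;

	for (node = 0; node < nr_nodes; node++) {
		int in_mask = !mask || (*mask & (1UL << node));	/* NULL == all nodes */

		if (in_mask && stock[node] > 0)
			return node;	/* "allocated" from this node */
	}
	return -1;	/* nothing available in this stage */
}

/*
 * Two-stage allocation in the style of dequeue_huge_page_vma() /
 * alloc_fresh_huge_page(): preferred mask first, then no mask at all,
 * so a preference never causes outright failure while memory exists.
 */
static int alloc_two_stage(const unsigned long *preferred, const int *stock, int nr_nodes)
{
	int node = try_alloc(preferred, stock, nr_nodes);

	if (node < 0)
		node = try_alloc(NULL, stock, nr_nodes);	/* fall back to all nodes */
	return node;
}
```

This is exactly why MPOL_PREFERRED_MANY must not be treated as a hard mask by accounting code: the allocator itself will happily leave the preferred set when it has to.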
On Tue 02-08-22 15:39:52, Feng Tang wrote: > On Tue, Aug 02, 2022 at 02:40:11PM +0800, Muchun Song wrote: > > On Tue, Aug 02, 2022 at 01:52:05PM +0800, Feng Tang wrote: > > > On Tue, Aug 02, 2022 at 11:42:52AM +0800, Muchun Song wrote: > > > > On Mon, Aug 01, 2022 at 05:26:23PM +0800, Feng Tang wrote: > > > > > On Mon, Aug 01, 2022 at 05:06:14PM +0800, Michal Hocko wrote: > > > > > > On Mon 01-08-22 16:42:07, Muchun Song wrote: > > > > > > > policy_nodemask() is supposed to be returned a nodemask representing a mempolicy > > > > > > > for filtering nodes for page allocation, which is a hard restriction (see the user > > > > > > > of allowed_mems_nr() in hugetlb.c). However, MPOL_PREFERRED_MANY is a preferred > > > > > > > mode not a hard restriction. Now it breaks the user of HugeTLB. Remove it from > > > > > > > policy_nodemask() to fix it, which will not affect current users of policy_nodemask() > > > > > > > since all of the users already have handled the case of MPOL_PREFERRED_MANY before > > > > > > > calling it. BTW, it is found by code inspection. > > > > > > > > > > > > I am not sure this is the right fix. It is quite true that > > > > > > policy_nodemask is a tricky function to use. It pretends to have a > > > > > > higher level logic but all existing users are expected to be policy > > > > > > aware and they special case allocation for each policy. That would mean > > > > > > that hugetlb should do the same. 
> > > > > > > > > > Yes, when I worked on the MPOL_PREFERRED_MANY patches, I was also > > > > > confused about policy_nodemask(), as it is never a 'strict' one as > > > > > the old code is: > > > > > > > > > > if (unlikely(mode == MPOL_BIND) && > > > > > apply_policy_zone(policy, gfp_zone(gfp)) && > > > > > cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > > > > return &policy->nodes; > > > > > > > > > > return NULL > > > > > > > > > > Even when the MPOL_BIND's nodes is not allowed by cpuset, it will > > > > > still return NULL (equals all nodes). > > > > > > > > > > > > > Well, I agree policy_nodemask() is really confusing because of the > > > > shortage of comments and the weird logic. > > > > > > > > > From the semantics of allowed_mems_nr(), I think it does get changed > > > > > a little by b27abaccf8e8. And to enforce the 'strict' semantic for > > > > > 'allowed', we may need a more strict nodemask API for it. > > > > > > > > > > > > > Maybe this is a good idea to fix this, e.g. introducing a new helper > > > > to return the strict allowed nodemask. > > > > > > Yep. > > > > > > I had another thought to add one global all-zero nodemask, for API like > > > policy_nodemask(), it has 2 types of return value: > > > * a nodemask with some bits set > > > * NULL (means all nodes) > > > > > > Here a new type of zero nodemask (a gloabl variable)can be created to > > > indicate no qualified node. > > > > > > > I know why you want to introduce a gloable zero nidemask. Since we already > > have a glable nodemask array, namely node_states, instead of returning NULL > > for the case of all nodes, how about returing node_states[N_ONLINE] for it? > > And make it return NULL for the case where no nodes are allowed. Any thought? > > I think return node_states[N_ONLINE] can simplify the code in allowed_mems_nr(), > the empty zero nodes can simplify further. 
> > Here is some draft patch (not tested) to show the idea > > Thanks, > Feng > > --- > include/linux/mempolicy.h | 8 ++++++++ > include/linux/nodemask.h | 7 +++++++ > mm/hugetlb.c | 7 ++++--- > mm/mempolicy.c | 17 +++++++++++++++++ > mm/page_alloc.c | 3 +++ > 5 files changed, 39 insertions(+), 3 deletions(-) > > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h > index 668389b4b53d..b5451fef1620 100644 > --- a/include/linux/mempolicy.h > +++ b/include/linux/mempolicy.h > @@ -150,6 +150,7 @@ extern bool init_nodemask_of_mempolicy(nodemask_t *mask); > extern bool mempolicy_in_oom_domain(struct task_struct *tsk, > const nodemask_t *mask); > extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy); > +extern nodemask_t *allowed_policy_nodemask(gfp_t gfp, struct mempolicy *policy); > > static inline nodemask_t *policy_nodemask_current(gfp_t gfp) > { > @@ -158,6 +159,13 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp) > return policy_nodemask(gfp, mpol); > } > > +static inline nodemask_t *allowed_policy_nodemask_current(gfp_t gfp) > +{ > + struct mempolicy *mpol = get_task_policy(current); > + > + return allowed_policy_nodemask(gfp, mpol); > +} > + > extern unsigned int mempolicy_slab_node(void); > > extern enum zone_type policy_zone; > diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h > index 0f233b76c9ce..dc5fab38e810 100644 > --- a/include/linux/nodemask.h > +++ b/include/linux/nodemask.h > @@ -409,6 +409,13 @@ enum node_states { > > extern nodemask_t node_states[NR_NODE_STATES]; > > +extern nodemask_t zero_nodes; > + > +static inline bool is_empty_nodes(nodemask_t *pnodes) > +{ > + return (pnodes == &zero_nodes || __nodes_empty(pnodes, MAX_NUMNODES)); > +} > + > #if MAX_NUMNODES > 1 > static inline int node_state(int node, enum node_states state) > { > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index a57e1be41401..dc9f4ed32909 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -4340,10 +4340,11 @@ 
static unsigned int allowed_mems_nr(struct hstate *h) > > mpol_allowed = policy_nodemask_current(gfp_mask); > > - for_each_node_mask(node, cpuset_current_mems_allowed) { > - if (!mpol_allowed || node_isset(node, *mpol_allowed)) > + if (is_empty_nodes(mpol_allowed)) > + return 0; > + > + for_each_node_mask(node, mpol_allowed) > nr += array[node]; > - } > > return nr; > } > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index d39b01fd52fe..3e936b8ca9ea 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -1845,6 +1845,23 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) > return NULL; > } > > +/* > + * Return the allowed nodes mask for a mempolicy and page allocation, > + * which is a 'stricter' semantic than policy_nodemask() > + */ > +nodemask_t *allowed_policy_nodemask(gfp_t gfp, struct mempolicy *policy) > +{ > + if (unlikely(policy->mode == MPOL_BIND)) { > + if (apply_policy_zone(policy, gfp_zone(gfp)) && > + cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > + return &policy->nodes; > + else > + return &zero_nodes; > + } > + > + return NULL; > +} > + > /* > * Return the preferred node id for 'prefer' mempolicy, and return > * the given id for all other policies. > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index e008a3df0485..3549ea037588 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -181,6 +181,9 @@ nodemask_t node_states[NR_NODE_STATES] __read_mostly = { > }; > EXPORT_SYMBOL(node_states); > > +nodemask_t zero_nodes = NODE_MASK_NONE; > +EXPORT_SYMBOL(zero_nodes); > + > atomic_long_t _totalram_pages __read_mostly; > EXPORT_SYMBOL(_totalram_pages); > unsigned long totalreserve_pages __read_mostly; > -- > 2.27.0 > -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-02 9:02 ` Michal Hocko @ 2022-08-03 6:41 ` Feng Tang 2022-08-03 7:36 ` Michal Hocko 0 siblings, 1 reply; 22+ messages in thread From: Feng Tang @ 2022-08-03 6:41 UTC (permalink / raw) To: Michal Hocko Cc: Muchun Song, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel, Mike Kravetz On Tue, Aug 02, 2022 at 05:02:37PM +0800, Michal Hocko wrote: > Please make sure to CC Mike on hugetlb related changes. OK. > I didn't really get to grasp your proposed solution but it feels goind > sideways. The real issue is that hugetlb uses a dedicated allocation > scheme which is not fully MPOL_PREFERRED_MANY aware AFAICS. I do not > think we should be tricking that by providing some fake nodemasks and > what not. > > The good news is that allocation from the pool is MPOL_PREFERRED_MANY > aware because it first tries to allocation from the preffered node mask > and then fall back to the full nodemask (dequeue_huge_page_vma). > If the existing pools cannot really satisfy that allocation then it > tries to allocate a new hugetlb page (alloc_fresh_huge_page) which also > performs 2 stage allocation with the node mask and no node masks. But > both of them might fail. > > The bad news is that other allocation functions - including those that > allocate to the pool are not fully MPOL_PREFERRED_MANY aware. E.g. > __nr_hugepages_store_common paths which use the allocating process > policy to fill up the pool so the pool could be under provisioned if > that context is using MPOL_PREFERRED_MANY. Thanks for the check! So you mean if the preferred nodes don't have enough pages, we should also fall back to all nodes like dequeue_huge_page_vma() does? Or we can use a policy API which returns a nodemask for MPOL_BIND and NULL for all other policies, like allowed_mems_nr() needs. 
--- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -158,6 +158,18 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp) return policy_nodemask(gfp, mpol); } +#ifdef CONFIG_HUGETLB_FS +static inline nodemask_t *strict_policy_nodemask_current(void) +{ + struct mempolicy *mpol = get_task_policy(current); + + if (mpol->mode == MPOL_BIND) + return &mpol->nodes; + + return NULL; +} +#endif + > Wrt. allowed_mems_nr (i.e. hugetlb_acct_memory) this is a reservation > code and I have to admit I do not really remember details there. This is > a subtle code and my best guess would be that policy_nodemask_current > should be hugetlb specific and only care about MPOL_BIND. The API needed by allowed_mems_nr() is a little different as it has gfp flag and cpuset config to consider. Thanks, Feng [snip] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-03 6:41 ` Feng Tang @ 2022-08-03 7:36 ` Michal Hocko 2022-08-03 17:14 ` Feng Tang 0 siblings, 1 reply; 22+ messages in thread From: Michal Hocko @ 2022-08-03 7:36 UTC (permalink / raw) To: Feng Tang Cc: Muchun Song, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel, Mike Kravetz On Wed 03-08-22 14:41:20, Feng Tang wrote: > On Tue, Aug 02, 2022 at 05:02:37PM +0800, Michal Hocko wrote: > > Please make sure to CC Mike on hugetlb related changes. > > OK. > > > I didn't really get to grasp your proposed solution but it feels goind > > sideways. The real issue is that hugetlb uses a dedicated allocation > > scheme which is not fully MPOL_PREFERRED_MANY aware AFAICS. I do not > > think we should be tricking that by providing some fake nodemasks and > > what not. > > > > The good news is that allocation from the pool is MPOL_PREFERRED_MANY > > aware because it first tries to allocation from the preffered node mask > > and then fall back to the full nodemask (dequeue_huge_page_vma). > > If the existing pools cannot really satisfy that allocation then it > > tries to allocate a new hugetlb page (alloc_fresh_huge_page) which also > > performs 2 stage allocation with the node mask and no node masks. But > > both of them might fail. > > > > The bad news is that other allocation functions - including those that > > allocate to the pool are not fully MPOL_PREFERRED_MANY aware. E.g. > > __nr_hugepages_store_common paths which use the allocating process > > policy to fill up the pool so the pool could be under provisioned if > > that context is using MPOL_PREFERRED_MANY. > > Thanks for the check! > > So you mean if the prferred nodes don't have enough pages, we should > also fallback to all like dequeue_huge_page_vma() does? > > Or we can user a policy API which return nodemask for MPOL_BIND and > NULL for all other policies, like allowed_mems_nr() needs. 
> > --- a/include/linux/mempolicy.h > +++ b/include/linux/mempolicy.h > @@ -158,6 +158,18 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp) > return policy_nodemask(gfp, mpol); > } > > +#ifdef CONFIG_HUGETLB_FS > +static inline nodemask_t *strict_policy_nodemask_current(void) > +{ > + struct mempolicy *mpol = get_task_policy(current); > + > + if (mpol->mode == MPOL_BIND) > + return &mpol->nodes; > + > + return NULL; > +} > +#endif Yes something like this, except that I would also move this into hugetlb proper because this doesn't seem generally useful. > > Wrt. allowed_mems_nr (i.e. hugetlb_acct_memory) this is a reservation > > code and I have to admit I do not really remember details there. This is > > a subtle code and my best guess would be that policy_nodemask_current > > should be hugetlb specific and only care about MPOL_BIND. > > The API needed by allowed_mem_nr() is a little different as it has gfp > flag and cpuset config to consider. Why would gfp mask matter? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-03 7:36 ` Michal Hocko @ 2022-08-03 17:14 ` Feng Tang 2022-08-03 11:28 ` Michal Hocko 0 siblings, 1 reply; 22+ messages in thread From: Feng Tang @ 2022-08-03 17:14 UTC (permalink / raw) To: Michal Hocko Cc: Muchun Song, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel, Mike Kravetz On Wed, Aug 03, 2022 at 03:36:41PM +0800, Michal Hocko wrote: > On Wed 03-08-22 14:41:20, Feng Tang wrote: > > On Tue, Aug 02, 2022 at 05:02:37PM +0800, Michal Hocko wrote: > > > Please make sure to CC Mike on hugetlb related changes. > > > > OK. > > > > > I didn't really get to grasp your proposed solution but it feels goind > > > sideways. The real issue is that hugetlb uses a dedicated allocation > > > scheme which is not fully MPOL_PREFERRED_MANY aware AFAICS. I do not > > > think we should be tricking that by providing some fake nodemasks and > > > what not. > > > > > > The good news is that allocation from the pool is MPOL_PREFERRED_MANY > > > aware because it first tries to allocation from the preffered node mask > > > and then fall back to the full nodemask (dequeue_huge_page_vma). > > > If the existing pools cannot really satisfy that allocation then it > > > tries to allocate a new hugetlb page (alloc_fresh_huge_page) which also > > > performs 2 stage allocation with the node mask and no node masks. But > > > both of them might fail. > > > > > > The bad news is that other allocation functions - including those that > > > allocate to the pool are not fully MPOL_PREFERRED_MANY aware. E.g. > > > __nr_hugepages_store_common paths which use the allocating process > > > policy to fill up the pool so the pool could be under provisioned if > > > that context is using MPOL_PREFERRED_MANY. > > > > Thanks for the check! > > > > So you mean if the prferred nodes don't have enough pages, we should > > also fallback to all like dequeue_huge_page_vma() does? 
> > > > Or we can user a policy API which return nodemask for MPOL_BIND and > > NULL for all other policies, like allowed_mems_nr() needs. > > > > --- a/include/linux/mempolicy.h > > +++ b/include/linux/mempolicy.h > > @@ -158,6 +158,18 @@ static inline nodemask_t *policy_nodemask_current(gfp_t gfp) > > return policy_nodemask(gfp, mpol); > > } > > > > +#ifdef CONFIG_HUGETLB_FS > > +static inline nodemask_t *strict_policy_nodemask_current(void) > > +{ > > + struct mempolicy *mpol = get_task_policy(current); > > + > > + if (mpol->mode == MPOL_BIND) > > + return &mpol->nodes; > > + > > + return NULL; > > +} > > +#endif > > Yes something like this, except that I would also move this into hugetlb > proper because this doesn't seem generally useful. Ok, I change it as below: --- mm/hugetlb.c | 28 +++++++++++++++++++++++----- 1 file changed, 23 insertions(+), 5 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 14be38822cf8..ef1d4ffa733f 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -91,6 +91,24 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp; /* Forward declaration */ static int hugetlb_acct_memory(struct hstate *h, long delta); +/* + * Return nodemask of what is allowed by current process' memory + * policy, as MPOL_BIND is the only 'strict' policy, return NULL + * for all other policies + */ +static inline nodemask_t *allowed_policy_nodemask_current(void) +{ +#ifdef CONFIG_NUMA + struct mempolicy *mpol = get_task_policy(current); + + if (mpol->mode == MPOL_BIND) + return &mpol->nodes; + return NULL; +#else + return NULL; +#endif +} + static inline bool subpool_is_free(struct hugepage_subpool *spool) { if (spool->count) @@ -3556,7 +3574,7 @@ static ssize_t __nr_hugepages_store_common(bool obey_mempolicy, unsigned long count, size_t len) { int err; - nodemask_t nodes_allowed, *n_mask; + nodemask_t nodes_allowed, *n_mask = NULL; if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported()) return -EINVAL; @@ -3565,11 +3583,11 @@ 
static ssize_t __nr_hugepages_store_common(bool obey_mempolicy, /* * global hstate attribute */ - if (!(obey_mempolicy && - init_nodemask_of_mempolicy(&nodes_allowed))) + if (obey_mempolicy) + n_mask = allowed_policy_nodemask_current(); + + if (!n_mask) n_mask = &node_states[N_MEMORY]; - else - n_mask = &nodes_allowed; } else { /* * Node specific request. count adjustment happens in -- 2.27.0 > > > Wrt. allowed_mems_nr (i.e. hugetlb_acct_memory) this is a reservation > > > code and I have to admit I do not really remember details there. This is > > > a subtle code and my best guess would be that policy_nodemask_current > > > should be hugetlb specific and only care about MPOL_BIND. > > > > The API needed by allowed_mem_nr() is a little different as it has gfp > > flag and cpuset config to consider. > > Why would gfp mask matter? I'm not very familiar with the old semantics (will check more); from the current code, it checks both the gfp flags and the cpuset limit. Thanks, Feng > -- > Michal Hocko > SUSE Labs ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-03 17:14 ` Feng Tang @ 2022-08-03 11:28 ` Michal Hocko 2022-08-03 20:43 ` Feng Tang 0 siblings, 1 reply; 22+ messages in thread From: Michal Hocko @ 2022-08-03 11:28 UTC (permalink / raw) To: Feng Tang Cc: Muchun Song, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel, Mike Kravetz On Thu 04-08-22 01:14:32, Feng Tang wrote: [...] > Ok, I change it as below: Wouldn't it be better to make this allowed_mems_nr specific to be explicit about the intention? Not that I feel strongly about that. > --- > mm/hugetlb.c | 28 +++++++++++++++++++++++----- > 1 file changed, 23 insertions(+), 5 deletions(-) Not even compile tested include/linux/mempolicy.h | 12 ------------ mm/hugetlb.c | 24 ++++++++++++++++++++---- 2 files changed, 20 insertions(+), 16 deletions(-) diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 668389b4b53d..e38b0ef20b8b 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -151,13 +151,6 @@ extern bool mempolicy_in_oom_domain(struct task_struct *tsk, const nodemask_t *mask); extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy); -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) -{ - struct mempolicy *mpol = get_task_policy(current); - - return policy_nodemask(gfp, mpol); -} - extern unsigned int mempolicy_slab_node(void); extern enum zone_type policy_zone; @@ -294,11 +287,6 @@ static inline void mpol_put_task_policy(struct task_struct *task) { } -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) -{ - return NULL; -} - static inline bool mpol_is_preferred_many(struct mempolicy *pol) { return false; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a18c071c294e..6cacbc9b15a1 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4330,18 +4330,34 @@ static int __init default_hugepagesz_setup(char *s) } __setup("default_hugepagesz=", default_hugepagesz_setup); +struct mempolicy 
*policy_mbind_nodemask(gfp_t gfp) +{ +#ifdef CONFIG_MEMPOLICY + struct mempolicy *mpol = get_task_policy(current); + + /* + * only enforce MBIND which overlaps with cpuset policy (from policy_nodemask) + * specifically for hugetlb case + */ + if (mpol->mode == MPOL_BIND && + (apply_policy_zone(mpol, gfp_zone(gfp)) && + cpuset_nodemask_valid_mems_allowed(&policy->nodes)) + return &mpol->nodes; +#endif + return NULL; +} + static unsigned int allowed_mems_nr(struct hstate *h) { int node; unsigned int nr = 0; - nodemask_t *mpol_allowed; + nodemask_t *mbind_nodemask; unsigned int *array = h->free_huge_pages_node; gfp_t gfp_mask = htlb_alloc_mask(h); - mpol_allowed = policy_nodemask_current(gfp_mask); - + mbind_nodemask = policy_mbind_nodemask(gfp_mask); for_each_node_mask(node, cpuset_current_mems_allowed) { - if (!mpol_allowed || node_isset(node, *mpol_allowed)) + if (!mbind_nodemask || node_isset(node, *mbind_nodemask)) nr += array[node]; } -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-03 11:28 ` Michal Hocko @ 2022-08-03 20:43 ` Feng Tang 2022-08-03 12:56 ` Michal Hocko 0 siblings, 1 reply; 22+ messages in thread From: Feng Tang @ 2022-08-03 20:43 UTC (permalink / raw) To: Michal Hocko Cc: Muchun Song, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel, Mike Kravetz On Wed, Aug 03, 2022 at 07:28:59PM +0800, Michal Hocko wrote: > On Thu 04-08-22 01:14:32, Feng Tang wrote: > [...] > > Ok, I change it as below: > > Wouldn't it be better to make this allowed_mems_nr specific to be > explicit about the intention? Yes, it is. > Not that I feel strongly about that. > > > --- > > mm/hugetlb.c | 28 +++++++++++++++++++++++----- > > 1 file changed, 23 insertions(+), 5 deletions(-) > > Not even compile tested > include/linux/mempolicy.h | 12 ------------ > mm/hugetlb.c | 24 ++++++++++++++++++++---- > 2 files changed, 20 insertions(+), 16 deletions(-) > > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h > index 668389b4b53d..e38b0ef20b8b 100644 > --- a/include/linux/mempolicy.h > +++ b/include/linux/mempolicy.h > @@ -151,13 +151,6 @@ extern bool mempolicy_in_oom_domain(struct task_struct *tsk, > const nodemask_t *mask); > extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy); > > -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) > -{ > - struct mempolicy *mpol = get_task_policy(current); > - > - return policy_nodemask(gfp, mpol); > -} > - > extern unsigned int mempolicy_slab_node(void); > > extern enum zone_type policy_zone; > @@ -294,11 +287,6 @@ static inline void mpol_put_task_policy(struct task_struct *task) > { > } > > -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) > -{ > - return NULL; > -} > - > static inline bool mpol_is_preferred_many(struct mempolicy *pol) > { > return false; > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index a18c071c294e..6cacbc9b15a1 100644 > --- a/mm/hugetlb.c > +++ 
b/mm/hugetlb.c > @@ -4330,18 +4330,34 @@ static int __init default_hugepagesz_setup(char *s) > } > __setup("default_hugepagesz=", default_hugepagesz_setup); > > +struct mempolicy *policy_mbind_nodemask(gfp_t gfp) > +{ > +#ifdef CONFIG_MEMPOLICY > + struct mempolicy *mpol = get_task_policy(current); > + > + /* > + * only enforce MBIND which overlaps with cpuset policy (from policy_nodemask) > + * specifically for hugetlb case > + */ > + if (mpol->mode == MPOL_BIND && > + (apply_policy_zone(mpol, gfp_zone(gfp)) && > + cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > + return &mpol->nodes; > +#endif > + return NULL; I saw the logic is not changed, and it confused me that if there is no qualified node, it will still return NULL, which effectively equals node_states[N_MEMORY], while I think it should return an all-zero nodemask. Thanks, Feng > +} > + > static unsigned int allowed_mems_nr(struct hstate *h) > { > int node; > unsigned int nr = 0; > - nodemask_t *mpol_allowed; > + nodemask_t *mbind_nodemask; > unsigned int *array = h->free_huge_pages_node; > gfp_t gfp_mask = htlb_alloc_mask(h); > > - mpol_allowed = policy_nodemask_current(gfp_mask); > - > + mbind_nodemask = policy_mbind_nodemask(gfp_mask); > for_each_node_mask(node, cpuset_current_mems_allowed) { > - if (!mpol_allowed || node_isset(node, *mpol_allowed)) > + if (!mbind_nodemask || node_isset(node, *mbind_nodemask)) > nr += array[node]; > } > > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-03 20:43 ` Feng Tang @ 2022-08-03 12:56 ` Michal Hocko 2022-08-03 21:08 ` Feng Tang 0 siblings, 1 reply; 22+ messages in thread From: Michal Hocko @ 2022-08-03 12:56 UTC (permalink / raw) To: Feng Tang Cc: Muchun Song, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel, Mike Kravetz On Thu 04-08-22 04:43:06, Feng Tang wrote: > On Wed, Aug 03, 2022 at 07:28:59PM +0800, Michal Hocko wrote: [...] > > +struct mempolicy *policy_mbind_nodemask(gfp_t gfp) > > +{ > > +#ifdef CONFIG_MEMPOLICY > > + struct mempolicy *mpol = get_task_policy(current); > > + > > + /* > > + * only enforce MBIND which overlaps with cpuset policy (from policy_nodemask) > > + * specifically for hugetlb case > > + */ > > + if (mpol->mode == MPOL_BIND && > > + (apply_policy_zone(mpol, gfp_zone(gfp)) && > > + cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > + return &mpol->nodes; > > +#endif > > + return NULL; > > I saw the logic is not changed, and it confused me that if there is > no qualified node, it will still return NULL which effectively equals > node_states[N_MEMORY], while I think it should return a all zero > nodemasks. This is a separate thing and I have to admit that the existing code is rather non-intuitive or even broken. I guess we do not care all that much because MBIND with completely non-overlapping cpusets is just a broken configuration. I am not sure this case is interesting or even supported. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-03 12:56 ` Michal Hocko @ 2022-08-03 21:08 ` Feng Tang 2022-08-03 13:21 ` Michal Hocko 0 siblings, 1 reply; 22+ messages in thread From: Feng Tang @ 2022-08-03 21:08 UTC (permalink / raw) To: Michal Hocko Cc: Muchun Song, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel, Mike Kravetz On Wed, Aug 03, 2022 at 08:56:44PM +0800, Michal Hocko wrote: > On Thu 04-08-22 04:43:06, Feng Tang wrote: > > On Wed, Aug 03, 2022 at 07:28:59PM +0800, Michal Hocko wrote: > [...] > > > +struct mempolicy *policy_mbind_nodemask(gfp_t gfp) > > > +{ > > > +#ifdef CONFIG_MEMPOLICY > > > + struct mempolicy *mpol = get_task_policy(current); > > > + > > > + /* > > > + * only enforce MBIND which overlaps with cpuset policy (from policy_nodemask) > > > + * specifically for hugetlb case > > > + */ > > > + if (mpol->mode == MPOL_BIND && > > > + (apply_policy_zone(mpol, gfp_zone(gfp)) && > > > + cpuset_nodemask_valid_mems_allowed(&policy->nodes)) > > > + return &mpol->nodes; > > > +#endif > > > + return NULL; > > > > I saw the logic is not changed, and it confused me that if there is > > no qualified node, it will still return NULL which effectively equals > > node_states[N_MEMORY], while I think it should return an all-zero > > nodemask. > > This is a separate thing and I have to admit that the existing code is > rather non-intuitive or even broken. I guess we do not care all that much because MBIND with completely non-overlapping cpusets is just a broken configuration. I am not sure this case is interesting or even supported. Fair enough, and moving the policy_mbind_nodemask() into hugetlb.c for one single caller makes it much less severe. Do we still need the other nodemask API I proposed earlier which has no parameter of gfp_flag, and used for __nr_hugepages_store_common? Thanks, Feng > -- > Michal Hocko > SUSE Labs > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-03 21:08 ` Feng Tang @ 2022-08-03 13:21 ` Michal Hocko 2022-08-04 8:27 ` Feng Tang 0 siblings, 1 reply; 22+ messages in thread From: Michal Hocko @ 2022-08-03 13:21 UTC (permalink / raw) To: Feng Tang Cc: Muchun Song, akpm, bwidawsk, dave.hansen, linux-mm, linux-kernel, Mike Kravetz On Thu 04-08-22 05:08:34, Feng Tang wrote: [...] > Do we still need the other nodemask API I proposed earlier which has > no parameter of gfp_flag, and used for __nr_hugepages_store_common? I would touch as little code as possible. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-03 13:21 ` Michal Hocko @ 2022-08-04 8:27 ` Feng Tang 2022-08-04 10:43 ` Michal Hocko 0 siblings, 1 reply; 22+ messages in thread From: Feng Tang @ 2022-08-04 8:27 UTC (permalink / raw) To: Michal Hocko, Muchun Song, Mike Kravetz, Andrew Morton Cc: bwidawsk, dave.hansen, linux-mm, linux-kernel On Wed, Aug 03, 2022 at 09:21:22PM +0800, Michal Hocko wrote: > On Thu 04-08-22 05:08:34, Feng Tang wrote: > [...] > > Do we still need the other nodemask API I proposed earlier which has > > no parameter of gfp_flag, and used for __nr_hugepages_store_common? > > I would touch as little code as possible. OK. Please review the following patch, thanks! - Feng --- From a2db9a57da616bb3ea21e48a4a9ceb5c2cf4f7a2 Mon Sep 17 00:00:00 2001 From: Feng Tang <feng.tang@intel.com> Date: Thu, 4 Aug 2022 09:39:24 +0800 Subject: [PATCH RFC] mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process Muchun Song found that after the MPOL_PREFERRED_MANY policy was introduced in commit b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") [1], the semantics of policy_nodemask_current() for this new policy have been changed: it returns 'preferred' nodes instead of 'allowed' nodes, which could hurt the usage of its caller in hugetlb: allowed_mems_nr(). Michal found that policy_nodemask_current() is only used by hugetlb, and suggested moving it to hugetlb code with a more explicit name to enforce the 'allowed' semantics, for which only the MPOL_BIND policy matters. One note on the new policy_mbind_nodemask(): the cross check between MPOL_BIND, the gfp flags and the cpuset configuration can lead to a case with no available node, which is considered a broken configuration, and 'NULL' (equal to all nodes) is returned. [1]. 
https://lore.kernel.org/lkml/20220801084207.39086-1-songmuchun@bytedance.com/t/ Reported-by: Muchun Song <songmuchun@bytedance.com> Suggested-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Feng Tang <feng.tang@intel.com> --- include/linux/mempolicy.h | 32 ++++++++++++++++++++------------ mm/hugetlb.c | 24 ++++++++++++++++++++---- mm/mempolicy.c | 20 -------------------- 3 files changed, 40 insertions(+), 36 deletions(-) diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 668389b4b53d..ea0168fffb4c 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -151,13 +151,6 @@ extern bool mempolicy_in_oom_domain(struct task_struct *tsk, const nodemask_t *mask); extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy); -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) -{ - struct mempolicy *mpol = get_task_policy(current); - - return policy_nodemask(gfp, mpol); -} - extern unsigned int mempolicy_slab_node(void); extern enum zone_type policy_zone; @@ -189,6 +182,26 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol) return (pol->mode == MPOL_PREFERRED_MANY); } +static inline int apply_policy_zone(struct mempolicy *policy, enum zone_type zone) +{ + enum zone_type dynamic_policy_zone = policy_zone; + + BUG_ON(dynamic_policy_zone == ZONE_MOVABLE); + + /* + * if policy->nodes has movable memory only, + * we apply policy when gfp_zone(gfp) = ZONE_MOVABLE only. + * + * policy->nodes is intersect with node_states[N_MEMORY]. + * so if the following test fails, it implies + * policy->nodes has movable memory only. 
+ */ + if (!nodes_intersects(policy->nodes, node_states[N_HIGH_MEMORY])) + dynamic_policy_zone = ZONE_MOVABLE; + + return zone >= dynamic_policy_zone; +} + #else @@ -294,11 +307,6 @@ static inline void mpol_put_task_policy(struct task_struct *task) { } -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) -{ - return NULL; -} - static inline bool mpol_is_preferred_many(struct mempolicy *pol) { return false; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a18c071c294e..ad84bb85b6de 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4330,18 +4330,34 @@ static int __init default_hugepagesz_setup(char *s) } __setup("default_hugepagesz=", default_hugepagesz_setup); +static nodemask_t *policy_mbind_nodemask(gfp_t gfp) +{ +#ifdef CONFIG_NUMA + struct mempolicy *mpol = get_task_policy(current); + + /* + * Only enforce MPOL_BIND policy which overlaps with cpuset policy + * (from policy_nodemask) specifically for hugetlb case + */ + if (mpol->mode == MPOL_BIND && + (apply_policy_zone(mpol, gfp_zone(gfp)) && + cpuset_nodemask_valid_mems_allowed(&mpol->nodes))) + return &mpol->nodes; +#endif + return NULL; +} + static unsigned int allowed_mems_nr(struct hstate *h) { int node; unsigned int nr = 0; - nodemask_t *mpol_allowed; + nodemask_t *mbind_nodemask; unsigned int *array = h->free_huge_pages_node; gfp_t gfp_mask = htlb_alloc_mask(h); - mpol_allowed = policy_nodemask_current(gfp_mask); - + mbind_nodemask = policy_mbind_nodemask(gfp_mask); for_each_node_mask(node, cpuset_current_mems_allowed) { - if (!mpol_allowed || node_isset(node, *mpol_allowed)) + if (!mbind_nodemask || node_isset(node, *mbind_nodemask)) nr += array[node]; } diff --git a/mm/mempolicy.c b/mm/mempolicy.c index d39b01fd52fe..5553bd53927f 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1805,26 +1805,6 @@ bool vma_policy_mof(struct vm_area_struct *vma) return pol->flags & MPOL_F_MOF; } -static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone) -{ - enum zone_type dynamic_policy_zone 
= policy_zone; - - BUG_ON(dynamic_policy_zone == ZONE_MOVABLE); - - /* - * if policy->nodes has movable memory only, - * we apply policy when gfp_zone(gfp) = ZONE_MOVABLE only. - * - * policy->nodes is intersect with node_states[N_MEMORY]. - * so if the following test fails, it implies - * policy->nodes has movable memory only. - */ - if (!nodes_intersects(policy->nodes, node_states[N_HIGH_MEMORY])) - dynamic_policy_zone = ZONE_MOVABLE; - - return zone >= dynamic_policy_zone; -} - /* * Return a nodemask representing a mempolicy for filtering nodes for * page allocation -- 2.27.0 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case 2022-08-04 8:27 ` Feng Tang @ 2022-08-04 10:43 ` Michal Hocko 2022-08-04 13:03 ` [PATCH] mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process Feng Tang 0 siblings, 1 reply; 22+ messages in thread From: Michal Hocko @ 2022-08-04 10:43 UTC (permalink / raw) To: Feng Tang Cc: Muchun Song, Mike Kravetz, Andrew Morton, bwidawsk, dave.hansen, linux-mm, linux-kernel On Thu 04-08-22 16:27:24, Feng Tang wrote: [...] > >From a2db9a57da616bb3ea21e48a4a9ceb5c2cf4f7a2 Mon Sep 17 00:00:00 2001 > From: Feng Tang <feng.tang@intel.com> > Date: Thu, 4 Aug 2022 09:39:24 +0800 > Subject: [PATCH RFC] mm/hugetlb: add dedicated func to get 'allowed' nodemask for > current process > > Muchun Song found that after MPOL_PREFERRED_MANY policy was introduced > in commit b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") > [1], the policy_nodemask_current()'s semantics for this new policy > has been changed, which returns 'preferred' nodes instead of 'allowed' > nodes, and could hurt the usage of its caller in hugetlb: > allowed_mems_nr(). > > Michal found the policy_nodemask_current() is only used by hugetlb, > and suggested to move it to hugetlb code with more explicit name to > enforce the 'allowed' semantics for which only MPOL_BIND policy > matters. > > One note for the new policy_mbind_nodemask() is, the cross check > from MPOL_BIND, gfp flags and cpuset configuration can lead to > a no available node case, which is considered to be broken > configuration and 'NULL' (equals all nodes) is returned. > > [1]. 
https://lore.kernel.org/lkml/20220801084207.39086-1-songmuchun@bytedance.com/t/ > Reported-by: Muchun Song <songmuchun@bytedance.com> > Suggested-by: Michal Hocko <mhocko@suse.com> > Signed-off-by: Feng Tang <feng.tang@intel.com> LGTM I would just make apply_policy_zone extern rather than making it static inline in a header which can turn out to cause other header dependencies. Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH] mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process 2022-08-04 10:43 ` Michal Hocko @ 2022-08-04 13:03 ` Feng Tang 2022-08-04 13:36 ` Michal Hocko 0 siblings, 1 reply; 22+ messages in thread From: Feng Tang @ 2022-08-04 13:03 UTC (permalink / raw) To: Michal Hocko, Muchun Song, Mike Kravetz, Andrew Morton Cc: Dave Hansen, Ben Widawsky, linux-mm, linux-kernel, Feng Tang Muchun Song found that after the MPOL_PREFERRED_MANY policy was introduced in commit b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") [1], the semantics of policy_nodemask_current() for this new policy have been changed: it returns 'preferred' nodes instead of 'allowed' nodes, which could hurt the usage of its caller in hugetlb: allowed_mems_nr(). Michal found that policy_nodemask_current() is only used by hugetlb, and suggested moving it to hugetlb code with a more explicit name to enforce the 'allowed' semantics, for which only the MPOL_BIND policy matters. One note on the new policy_mbind_nodemask(): the cross check between MPOL_BIND, the gfp flags and the cpuset configuration can lead to a case with no available node, which is considered a broken configuration, and 'NULL' (equal to all nodes) will be returned. apply_policy_zone() is made extern so that it can be called from hugetlb code, and its return value is changed to bool. [1]. 
https://lore.kernel.org/lkml/20220801084207.39086-1-songmuchun@bytedance.com/t/ Reported-by: Muchun Song <songmuchun@bytedance.com> Suggested-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Feng Tang <feng.tang@intel.com> --- include/linux/mempolicy.h | 13 +------------ mm/hugetlb.c | 24 ++++++++++++++++++++---- mm/mempolicy.c | 2 +- 3 files changed, 22 insertions(+), 17 deletions(-) diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 668389b4b53d..d232de7cdc56 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -151,13 +151,6 @@ extern bool mempolicy_in_oom_domain(struct task_struct *tsk, const nodemask_t *mask); extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy); -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) -{ - struct mempolicy *mpol = get_task_policy(current); - - return policy_nodemask(gfp, mpol); -} - extern unsigned int mempolicy_slab_node(void); extern enum zone_type policy_zone; @@ -189,6 +182,7 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol) return (pol->mode == MPOL_PREFERRED_MANY); } +extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone); #else @@ -294,11 +288,6 @@ static inline void mpol_put_task_policy(struct task_struct *task) { } -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) -{ - return NULL; -} - static inline bool mpol_is_preferred_many(struct mempolicy *pol) { return false; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a18c071c294e..ad84bb85b6de 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4330,18 +4330,34 @@ static int __init default_hugepagesz_setup(char *s) } __setup("default_hugepagesz=", default_hugepagesz_setup); +static nodemask_t *policy_mbind_nodemask(gfp_t gfp) +{ +#ifdef CONFIG_NUMA + struct mempolicy *mpol = get_task_policy(current); + + /* + * Only enforce MPOL_BIND policy which overlaps with cpuset policy + * (from policy_nodemask) specifically for hugetlb case + */ + if (mpol->mode == 
MPOL_BIND && + (apply_policy_zone(mpol, gfp_zone(gfp)) && + cpuset_nodemask_valid_mems_allowed(&mpol->nodes))) + return &mpol->nodes; +#endif + return NULL; +} + static unsigned int allowed_mems_nr(struct hstate *h) { int node; unsigned int nr = 0; - nodemask_t *mpol_allowed; + nodemask_t *mbind_nodemask; unsigned int *array = h->free_huge_pages_node; gfp_t gfp_mask = htlb_alloc_mask(h); - mpol_allowed = policy_nodemask_current(gfp_mask); - + mbind_nodemask = policy_mbind_nodemask(gfp_mask); for_each_node_mask(node, cpuset_current_mems_allowed) { - if (!mpol_allowed || node_isset(node, *mpol_allowed)) + if (!mbind_nodemask || node_isset(node, *mbind_nodemask)) nr += array[node]; } diff --git a/mm/mempolicy.c b/mm/mempolicy.c index d39b01fd52fe..9f15bc533601 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1805,7 +1805,7 @@ bool vma_policy_mof(struct vm_area_struct *vma) return pol->flags & MPOL_F_MOF; } -static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone) +bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone) { enum zone_type dynamic_policy_zone = policy_zone; -- 2.27.0 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH] mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process 2022-08-04 13:03 ` [PATCH] mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process Feng Tang @ 2022-08-04 13:36 ` Michal Hocko 2022-08-04 22:37 ` Andrew Morton 0 siblings, 1 reply; 22+ messages in thread From: Michal Hocko @ 2022-08-04 13:36 UTC (permalink / raw) To: Feng Tang Cc: Muchun Song, Mike Kravetz, Andrew Morton, Dave Hansen, Ben Widawsky, linux-mm, linux-kernel On Thu 04-08-22 21:03:42, Feng Tang wrote: > Muchun Song found that after MPOL_PREFERRED_MANY policy was introduced > in commit b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") > [1], the policy_nodemask_current()'s semantics for this new policy > has been changed, which returns 'preferred' nodes instead of 'allowed' > nodes, and could hurt the usage of its caller in hugetlb: > allowed_mems_nr(). The actual user-visible effect description is missing here. AFAIU it would be this. With the changed semantics of policy_nodemask_current, a task with MPOL_PREFERRED_MANY policy could fail to get its reservation even though it can fall back to other nodes (either defined by cpusets or all online nodes) for that reservation, failing mmap calls unnecessarily early. The fix is to not consider MPOL_PREFERRED_MANY for reservations at all because they, unlike MPOL_BIND, do not pose any actual hard constraint. You can keep the rest. > Michal found the policy_nodemask_current() is only used by hugetlb, > and suggested to move it to hugetlb code with more explicit name to > enforce the 'allowed' semantics for which only MPOL_BIND policy > matters. > > One note for the new policy_mbind_nodemask() is, the cross check > from MPOL_BIND, gfp flags and cpuset configuration can lead to > a no available node case, which is considered to be broken > configuration, and 'NULL' (equals all nodes) will be returned. This is neither important nor useful for this particular patch.
> apply_policy_zone() is made extern to be called in hugetlb code > and its return value is changed to bool. > > [1]. https://lore.kernel.org/lkml/20220801084207.39086-1-songmuchun@bytedance.com/t/ Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") I do not think stable is really required. > Reported-by: Muchun Song <songmuchun@bytedance.com> > Suggested-by: Michal Hocko <mhocko@suse.com> > Signed-off-by: Feng Tang <feng.tang@intel.com> with that Acked-by: Michal Hocko <mhocko@suse.com> thanks! > --- > include/linux/mempolicy.h | 13 +------------ > mm/hugetlb.c | 24 ++++++++++++++++++++---- > mm/mempolicy.c | 2 +- > 3 files changed, 22 insertions(+), 17 deletions(-) > > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h > index 668389b4b53d..d232de7cdc56 100644 > --- a/include/linux/mempolicy.h > +++ b/include/linux/mempolicy.h > @@ -151,13 +151,6 @@ extern bool mempolicy_in_oom_domain(struct task_struct *tsk, > const nodemask_t *mask); > extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy); > > -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) > -{ > - struct mempolicy *mpol = get_task_policy(current); > - > - return policy_nodemask(gfp, mpol); > -} > - > extern unsigned int mempolicy_slab_node(void); > > extern enum zone_type policy_zone; > @@ -189,6 +182,7 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol) > return (pol->mode == MPOL_PREFERRED_MANY); > } > > +extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone); > > #else > > @@ -294,11 +288,6 @@ static inline void mpol_put_task_policy(struct task_struct *task) > { > } > > -static inline nodemask_t *policy_nodemask_current(gfp_t gfp) > -{ > - return NULL; > -} > - > static inline bool mpol_is_preferred_many(struct mempolicy *pol) > { > return false; > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index a18c071c294e..ad84bb85b6de 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ 
-4330,18 +4330,34 @@ static int __init default_hugepagesz_setup(char *s) > } > __setup("default_hugepagesz=", default_hugepagesz_setup); > > +static nodemask_t *policy_mbind_nodemask(gfp_t gfp) > +{ > +#ifdef CONFIG_NUMA > + struct mempolicy *mpol = get_task_policy(current); > + > + /* > + * Only enforce MPOL_BIND policy which overlaps with cpuset policy > + * (from policy_nodemask) specifically for hugetlb case > + */ > + if (mpol->mode == MPOL_BIND && > + (apply_policy_zone(mpol, gfp_zone(gfp)) && > + cpuset_nodemask_valid_mems_allowed(&mpol->nodes))) > + return &mpol->nodes; > +#endif > + return NULL; > +} > + > static unsigned int allowed_mems_nr(struct hstate *h) > { > int node; > unsigned int nr = 0; > - nodemask_t *mpol_allowed; > + nodemask_t *mbind_nodemask; > unsigned int *array = h->free_huge_pages_node; > gfp_t gfp_mask = htlb_alloc_mask(h); > > - mpol_allowed = policy_nodemask_current(gfp_mask); > - > + mbind_nodemask = policy_mbind_nodemask(gfp_mask); > for_each_node_mask(node, cpuset_current_mems_allowed) { > - if (!mpol_allowed || node_isset(node, *mpol_allowed)) > + if (!mbind_nodemask || node_isset(node, *mbind_nodemask)) > nr += array[node]; > } > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index d39b01fd52fe..9f15bc533601 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -1805,7 +1805,7 @@ bool vma_policy_mof(struct vm_area_struct *vma) > return pol->flags & MPOL_F_MOF; > } > > -static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone) > +bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone) > { > enum zone_type dynamic_policy_zone = policy_zone; > > -- > 2.27.0 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process 2022-08-04 13:36 ` Michal Hocko @ 2022-08-04 22:37 ` Andrew Morton 2022-08-05 0:06 ` Feng Tang 0 siblings, 1 reply; 22+ messages in thread From: Andrew Morton @ 2022-08-04 22:37 UTC (permalink / raw) To: Michal Hocko Cc: Feng Tang, Muchun Song, Mike Kravetz, Dave Hansen, Ben Widawsky, linux-mm, linux-kernel On Thu, 4 Aug 2022 15:36:48 +0200 Michal Hocko <mhocko@suse.com> wrote: > On Thu 04-08-22 21:03:42, Feng Tang wrote: > > Muchun Song found that after MPOL_PREFERRED_MANY policy was introduced > > in commit b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") > > [1], the policy_nodemask_current()'s semantics for this new policy > > has been changed, which returns 'preferred' nodes instead of 'allowed' > > nodes, and could hurt the usage of its caller in hugetlb: > > allowed_mems_nr(). > > The actual user-visible effect description is missing here. AFAIU it > would be this. > > With the changed semantics of policy_nodemask_current, a task with > MPOL_PREFERRED_MANY policy could fail to get its reservation even though > it can fall back to other nodes (either defined by cpusets or all online > nodes) for that reservation, failing mmap calls unnecessarily early. > > The fix is to not consider MPOL_PREFERRED_MANY for reservations at all > because they, unlike MPOL_BIND, do not pose any actual hard constraint. And is this Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes")? ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process 2022-08-04 22:37 ` Andrew Morton @ 2022-08-05 0:06 ` Feng Tang 0 siblings, 0 replies; 22+ messages in thread From: Feng Tang @ 2022-08-05 0:06 UTC (permalink / raw) To: Andrew Morton Cc: Hocko, Michal, Muchun Song, Mike Kravetz, Dave Hansen, Ben Widawsky, linux-mm, linux-kernel On Fri, Aug 05, 2022 at 06:37:17AM +0800, Andrew Morton wrote: > On Thu, 4 Aug 2022 15:36:48 +0200 Michal Hocko <mhocko@suse.com> wrote: > > > On Thu 04-08-22 21:03:42, Feng Tang wrote: > > > Muchun Song found that after MPOL_PREFERRED_MANY policy was introduced > > > in commit b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") > > > [1], the policy_nodemask_current()'s semantics for this new policy > > > has been changed, which returns 'preferred' nodes instead of 'allowed' > > > nodes, and could hurt the usage of its caller in hugetlb: > > > allowed_mems_nr(). > > > > The actual user-visible effect description is missing here. AFAIU it > > would be this. > > > > With the changed semantics of policy_nodemask_current, a task with > > MPOL_PREFERRED_MANY policy could fail to get its reservation even though > > it can fall back to other nodes (either defined by cpusets or all online > > nodes) for that reservation, failing mmap calls unnecessarily early. > > > > The fix is to not consider MPOL_PREFERRED_MANY for reservations at all > > because they, unlike MPOL_BIND, do not pose any actual hard constraint. > > And is this Fixes: b27abaccf8e8 ("mm/mempolicy: add MPOL_PREFERRED_MANY > for multiple preferred nodes")? Yes. Will add it in the next version, thanks. - Feng ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread. Thread overview: 22+ messages -- 2022-08-01 8:42 [PATCH] mm: mempolicy: fix policy_nodemask() for MPOL_PREFERRED_MANY case Muchun Song 2022-08-01 9:06 ` Michal Hocko 2022-08-01 9:26 ` Feng Tang 2022-08-02 3:42 ` Muchun Song 2022-08-02 5:52 ` Feng Tang 2022-08-02 6:40 ` Muchun Song 2022-08-02 7:39 ` Feng Tang 2022-08-02 9:02 ` Michal Hocko 2022-08-03 6:41 ` Feng Tang 2022-08-03 7:36 ` Michal Hocko 2022-08-03 17:14 ` Feng Tang 2022-08-03 11:28 ` Michal Hocko 2022-08-03 20:43 ` Feng Tang 2022-08-03 12:56 ` Michal Hocko 2022-08-03 21:08 ` Feng Tang 2022-08-03 13:21 ` Michal Hocko 2022-08-04 8:27 ` Feng Tang 2022-08-04 10:43 ` Michal Hocko 2022-08-04 13:03 ` [PATCH] mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process Feng Tang 2022-08-04 13:36 ` Michal Hocko 2022-08-04 22:37 ` Andrew Morton 2022-08-05 0:06 ` Feng Tang