* [PATCH] mm, mempolicy: clean up __GFP_THISNODE confusion in policy_zonelist
@ 2016-10-13 12:59 ` Michal Hocko
From: Michal Hocko @ 2016-10-13 12:59 UTC
  To: Andrew Morton
  Cc: Mel Gorman, David Rientjes, Anshuman Khandual, linux-mm, LKML,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__GFP_THISNODE is documented to enforce the allocation to be satisfied
from the requested node with no fallbacks or placement policy
enforcements. policy_zonelist seemingly breaks this semantic if the
current policy is MPOL_BIND and instead of taking the node it will
fall back to the first node in the mask if the requested one is not in
the mask. This is confusing to say the least because in fact we
shouldn't ever go down that path. First, tasks shouldn't be scheduled on
CPUs with nodes outside of their mempolicy binding. And secondly,
policy_zonelist is called only from 3 places:
- huge_zonelist - never should do __GFP_THISNODE when going this path
- alloc_pages_vma - which shouldn't depend on __GFP_THISNODE either
- alloc_pages_current - which uses default_policy id __GFP_THISNODE is
  used

So we shouldn't even need to care about this possibility and can drop
the confusing code. Let's keep a WARN_ON_ONCE in place to catch
potential users and fix them up properly (aka use a different allocation
function which ignores mempolicy).

Signed-off-by: Michal Hocko <mhocko@suse.com>
---

Hi,
I have noticed this while discussing this code [1]. The code as it stands
is quite confusing and I think it is worth cleaning up. I decided to be
conservative and keep at least WARN_ON_ONCE if we have some caller which
relies on __GFP_THISNODE in a mempolicy context so that we can fix it up.

[1] http://lkml.kernel.org/r/57FE0184.6030008@linux.vnet.ibm.com

 mm/mempolicy.c | 24 ++++++++----------------
 1 file changed, 8 insertions(+), 16 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index ad1c96ac313c..33a305397bd4 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1679,25 +1679,17 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
 	int nd)
 {
-	switch (policy->mode) {
-	case MPOL_PREFERRED:
-		if (!(policy->flags & MPOL_F_LOCAL))
-			nd = policy->v.preferred_node;
-		break;
-	case MPOL_BIND:
+	if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
+		nd = policy->v.preferred_node;
+	else {
 		/*
-		 * Normally, MPOL_BIND allocations are node-local within the
-		 * allowed nodemask.  However, if __GFP_THISNODE is set and the
-		 * current node isn't part of the mask, we use the zonelist for
-		 * the first node in the mask instead.
+		 * __GFP_THISNODE shouldn't even be used with the bind policy because
+		 * we might easily break the expectation to stay on the requested node
+		 * and not break the policy.
 		 */
-		if (unlikely(gfp & __GFP_THISNODE) &&
-				unlikely(!node_isset(nd, policy->v.nodes)))
-			nd = first_node(policy->v.nodes);
-		break;
-	default:
-		BUG();
+		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
 	}
+
 	return node_zonelist(nd, gfp);
 }
 
-- 
2.9.3
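
The behavioural change made by the patch above can be modelled in userspace. What follows is a minimal sketch, not kernel code: `struct mempolicy_model`, `GFP_THISNODE_MODEL`, `MPOL_F_LOCAL_MODEL`, the 64-bit node mask, and both `zonelist_node_*` helpers are hypothetical stand-ins invented for illustration, and the functions return a node id where the kernel would return `node_zonelist(nd, gfp)`. The pre-patch model redirects a `__GFP_THISNODE` request under MPOL_BIND to the first node in the mask; the post-patch model leaves the node untouched and only records that a warning would fire.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; names and values are NOT the kernel's. */
enum mpol_mode_model { MODEL_PREFERRED, MODEL_BIND, MODEL_INTERLEAVE };
#define MPOL_F_LOCAL_MODEL 0x1u
#define GFP_THISNODE_MODEL 0x1u
#define MAX_NODES_MODEL    64

struct mempolicy_model {
	enum mpol_mode_model mode;
	unsigned int flags;
	int preferred_node;
	uint64_t nodes;		/* bit n set => node n is in the policy mask */
};

static bool node_in_mask(int nd, uint64_t mask)
{
	return (mask >> nd) & 1u;
}

static int first_node_in_mask(uint64_t mask)
{
	assert(mask != 0);	/* an empty mask has no first node */
	for (int n = 0; n < MAX_NODES_MODEL; n++)
		if (node_in_mask(n, mask))
			return n;
	return -1;		/* unreachable for a non-empty mask */
}

/* Pre-patch model: THISNODE + BIND silently moves to the first mask node. */
static int zonelist_node_old(unsigned int gfp,
			     const struct mempolicy_model *pol, int nd)
{
	switch (pol->mode) {
	case MODEL_PREFERRED:
		if (!(pol->flags & MPOL_F_LOCAL_MODEL))
			nd = pol->preferred_node;
		break;
	case MODEL_BIND:
		if ((gfp & GFP_THISNODE_MODEL) && !node_in_mask(nd, pol->nodes))
			nd = first_node_in_mask(pol->nodes);
		break;
	default:
		break;		/* the kernel code calls BUG() here */
	}
	return nd;
}

/* Post-patch model: the node is kept; *warned mimics WARN_ON_ONCE firing. */
static int zonelist_node_new(unsigned int gfp,
			     const struct mempolicy_model *pol, int nd,
			     bool *warned)
{
	if (pol->mode == MODEL_PREFERRED && !(pol->flags & MPOL_F_LOCAL_MODEL))
		nd = pol->preferred_node;
	else if (pol->mode == MODEL_BIND && (gfp & GFP_THISNODE_MODEL))
		*warned = true;
	return nd;
}
```

With a bind mask covering nodes 1 and 2 and a `GFP_THISNODE_MODEL` request for node 0, the old model answers node 1 while the new model answers node 0 and sets `*warned`. That difference is the case the WARN_ON_ONCE is meant to surface.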


* Re: [PATCH] mm, mempolicy: clean up __GFP_THISNODE confusion in policy_zonelist
  2016-10-13 12:59 ` Michal Hocko
@ 2016-10-18  9:44   ` Vlastimil Babka
From: Vlastimil Babka @ 2016-10-18  9:44 UTC
  To: Michal Hocko, Andrew Morton
  Cc: Mel Gorman, David Rientjes, Anshuman Khandual, linux-mm, LKML,
	Michal Hocko

On 10/13/2016 02:59 PM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> __GFP_THISNODE is documented to enforce the allocation to be satisfied
> from the requested node with no fallbacks or placement policy
> enforcements. policy_zonelist seemingly breaks this semantic if the
> current policy is MPOL_BIND and instead of taking the node it will
> fall back to the first node in the mask if the requested one is not in
> the mask. This is confusing to say the least because in fact we
> shouldn't ever go down that path. First, tasks shouldn't be scheduled on CPUs
> with nodes outside of their mempolicy binding. And secondly
> policy_zonelist is called only from 3 places:
> - huge_zonelist - never should do __GFP_THISNODE when going this path
> - alloc_pages_vma - which shouldn't depend on __GFP_THISNODE either
> - alloc_pages_current - which uses default_policy id __GFP_THISNODE is

						    ^ if

>   used
>
> So we shouldn't even need to care about this possibility and can drop
> the confusing code. Let's keep a WARN_ON_ONCE in place to catch
> potential users and fix them up properly (aka use a different allocation
> function which ignores mempolicy).
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Looks good, and a BUG_ON() removed as a bonus :)
Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>
> Hi,
> I have noticed this while discussing this code [1]. The code as it stands
> is quite confusing and I think it is worth cleaning up. I decided to be
> conservative and keep at least WARN_ON_ONCE if we have some caller which
> relies on __GFP_THISNODE in a mempolicy context so that we can fix it up.
>
> [1] http://lkml.kernel.org/r/57FE0184.6030008@linux.vnet.ibm.com
>
>  mm/mempolicy.c | 24 ++++++++----------------
>  1 file changed, 8 insertions(+), 16 deletions(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index ad1c96ac313c..33a305397bd4 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1679,25 +1679,17 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
>  static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
>  	int nd)
>  {
> -	switch (policy->mode) {
> -	case MPOL_PREFERRED:
> -		if (!(policy->flags & MPOL_F_LOCAL))
> -			nd = policy->v.preferred_node;
> -		break;
> -	case MPOL_BIND:
> +	if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
> +		nd = policy->v.preferred_node;
> +	else {
>  		/*
> -		 * Normally, MPOL_BIND allocations are node-local within the
> -		 * allowed nodemask.  However, if __GFP_THISNODE is set and the
> -		 * current node isn't part of the mask, we use the zonelist for
> -		 * the first node in the mask instead.
> +		 * __GFP_THISNODE shouldn't even be used with the bind policy because
> +		 * we might easily break the expectation to stay on the requested node
> +		 * and not break the policy.
>  		 */
> -		if (unlikely(gfp & __GFP_THISNODE) &&
> -				unlikely(!node_isset(nd, policy->v.nodes)))
> -			nd = first_node(policy->v.nodes);
> -		break;
> -	default:
> -		BUG();
> +		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
>  	}
> +
>  	return node_zonelist(nd, gfp);
>  }
>
>


* Re: [PATCH] mm, mempolicy: clean up __GFP_THISNODE confusion in policy_zonelist
  2016-10-13 12:59 ` Michal Hocko
@ 2016-10-21 11:34   ` Aneesh Kumar K.V
From: Aneesh Kumar K.V @ 2016-10-21 11:34 UTC
  To: Michal Hocko, Andrew Morton
  Cc: Mel Gorman, David Rientjes, Anshuman Khandual, linux-mm, LKML,
	Michal Hocko

Michal Hocko <mhocko@kernel.org> writes:

> From: Michal Hocko <mhocko@suse.com>
>
> __GFP_THISNODE is documented to enforce the allocation to be satisfied
> from the requested node with no fallbacks or placement policy
> enforcements. policy_zonelist seemingly breaks this semantic if the
> current policy is MPOL_BIND and instead of taking the node it will
> fall back to the first node in the mask if the requested one is not in
> the mask. This is confusing to say the least because in fact we
> shouldn't ever go down that path. First, tasks shouldn't be scheduled on CPUs
> with nodes outside of their mempolicy binding. And secondly
> policy_zonelist is called only from 3 places:
> - huge_zonelist - never should do __GFP_THISNODE when going this path
> - alloc_pages_vma - which shouldn't depend on __GFP_THISNODE either
> - alloc_pages_current - which uses default_policy id __GFP_THISNODE is
>   used
>
> So we shouldn't even need to care about this possibility and can drop
> the confusing code. Let's keep a WARN_ON_ONCE in place to catch
> potential users and fix them up properly (aka use a different allocation
> function which ignores mempolicy).
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>
> Hi,
> I have noticed this while discussing this code [1]. The code as it stands
> is quite confusing and I think it is worth cleaning up. I decided to be
> conservative and keep at least WARN_ON_ONCE if we have some caller which
> relies on __GFP_THISNODE in a mempolicy context so that we can fix it up.
>
> [1] http://lkml.kernel.org/r/57FE0184.6030008@linux.vnet.ibm.com
>
>  mm/mempolicy.c | 24 ++++++++----------------
>  1 file changed, 8 insertions(+), 16 deletions(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index ad1c96ac313c..33a305397bd4 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1679,25 +1679,17 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
>  static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
>  	int nd)
>  {
> -	switch (policy->mode) {
> -	case MPOL_PREFERRED:
> -		if (!(policy->flags & MPOL_F_LOCAL))
> -			nd = policy->v.preferred_node;
> -		break;
> -	case MPOL_BIND:
> +	if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
> +		nd = policy->v.preferred_node;
> +	else {
>  		/*
> -		 * Normally, MPOL_BIND allocations are node-local within the
> -		 * allowed nodemask.  However, if __GFP_THISNODE is set and the
> -		 * current node isn't part of the mask, we use the zonelist for
> -		 * the first node in the mask instead.
> +		 * __GFP_THISNODE shouldn't even be used with the bind policy because
> +		 * we might easily break the expectation to stay on the requested node
> +		 * and not break the policy.
>  		 */
> -		if (unlikely(gfp & __GFP_THISNODE) &&
> -				unlikely(!node_isset(nd, policy->v.nodes)))
> -			nd = first_node(policy->v.nodes);
> -		break;
> -	default:
> -		BUG();
> +		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
>  	}
> +
>  	return node_zonelist(nd, gfp);
>  }
>  

For both MPOL_PREFERRED and MPOL_INTERLEAVE we pick the zonelist from
a node other than the currently running node. Why don't we do that for
MPOL_BIND? I.e., if the current node is not part of the policy node mask,
why are we not picking the first node from the policy node mask for
MPOL_BIND?

-aneesh


* Re: [PATCH] mm, mempolicy: clean up __GFP_THISNODE confusion in policy_zonelist
  2016-10-21 11:34   ` Aneesh Kumar K.V
@ 2016-10-21 11:52     ` Michal Hocko
From: Michal Hocko @ 2016-10-21 11:52 UTC
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, Mel Gorman, David Rientjes, Anshuman Khandual,
	linux-mm, LKML

On Fri 21-10-16 17:04:50, Aneesh Kumar K.V wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > From: Michal Hocko <mhocko@suse.com>
> >
> > __GFP_THISNODE is documented to enforce the allocation to be satisfied
> > from the requested node with no fallbacks or placement policy
> > enforcements. policy_zonelist seemingly breaks this semantic if the
> > current policy is MPOL_BIND and instead of taking the node it will
> > fall back to the first node in the mask if the requested one is not in
> > the mask. This is confusing to say the least because in fact we
> > shouldn't ever go down that path. First, tasks shouldn't be scheduled on CPUs
> > with nodes outside of their mempolicy binding. And secondly
> > policy_zonelist is called only from 3 places:
> > - huge_zonelist - never should do __GFP_THISNODE when going this path
> > - alloc_pages_vma - which shouldn't depend on __GFP_THISNODE either
> > - alloc_pages_current - which uses default_policy id __GFP_THISNODE is
> >   used
> >
> > So we shouldn't even need to care about this possibility and can drop
> > the confusing code. Let's keep a WARN_ON_ONCE in place to catch
> > potential users and fix them up properly (aka use a different allocation
> > function which ignores mempolicy).
> >
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >
> > Hi,
> > I have noticed this while discussing this code [1]. The code as it stands
> > is quite confusing and I think it is worth cleaning up. I decided to be
> > conservative and keep at least WARN_ON_ONCE if we have some caller which
> > relies on __GFP_THISNODE in a mempolicy context so that we can fix it up.
> >
> > [1] http://lkml.kernel.org/r/57FE0184.6030008@linux.vnet.ibm.com
> >
> >  mm/mempolicy.c | 24 ++++++++----------------
> >  1 file changed, 8 insertions(+), 16 deletions(-)
> >
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index ad1c96ac313c..33a305397bd4 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1679,25 +1679,17 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> >  static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
> >  	int nd)
> >  {
> > -	switch (policy->mode) {
> > -	case MPOL_PREFERRED:
> > -		if (!(policy->flags & MPOL_F_LOCAL))
> > -			nd = policy->v.preferred_node;
> > -		break;
> > -	case MPOL_BIND:
> > +	if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
> > +		nd = policy->v.preferred_node;
> > +	else {
> >  		/*
> > -		 * Normally, MPOL_BIND allocations are node-local within the
> > -		 * allowed nodemask.  However, if __GFP_THISNODE is set and the
> > -		 * current node isn't part of the mask, we use the zonelist for
> > -		 * the first node in the mask instead.
> > +		 * __GFP_THISNODE shouldn't even be used with the bind policy because
> > +		 * we might easily break the expectation to stay on the requested node
> > +		 * and not break the policy.
> >  		 */
> > -		if (unlikely(gfp & __GFP_THISNODE) &&
> > -				unlikely(!node_isset(nd, policy->v.nodes)))
> > -			nd = first_node(policy->v.nodes);
> > -		break;
> > -	default:
> > -		BUG();
> > +		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
> >  	}
> > +
> >  	return node_zonelist(nd, gfp);
> >  }
> >  
> 
> For both MPOL_PREFERRED and MPOL_INTERLEAVE we pick the zonelist from
> a node other than the currently running node. Why don't we do that for
> MPOL_BIND? I.e., if the current node is not part of the policy node mask,
> why are we not picking the first node from the policy node mask for
> MPOL_BIND?

I am not sure I understand your question here. There is no
__GFP_THISNODE-specific code for those policies.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm, mempolicy: clean up __GFP_THISNODE confusion in policy_zonelist
  2016-10-21 11:34   ` Aneesh Kumar K.V
@ 2016-10-21 12:08     ` Vlastimil Babka
From: Vlastimil Babka @ 2016-10-21 12:08 UTC
  To: Aneesh Kumar K.V, Michal Hocko, Andrew Morton
  Cc: Mel Gorman, David Rientjes, Anshuman Khandual, linux-mm, LKML,
	Michal Hocko

On 10/21/2016 01:34 PM, Aneesh Kumar K.V wrote:
> Michal Hocko <mhocko@kernel.org> writes:
>>
>
> For both MPOL_PREFERRED and MPOL_INTERLEAVE we pick the zonelist from
> a node other than the currently running node. Why don't we do that for
> MPOL_BIND? I.e., if the current node is not part of the policy node mask,
> why are we not picking the first node from the policy node mask for
> MPOL_BIND?

For MPOL_PREFERRED and MPOL_INTERLEAVE we have an explicit preference of nodes,
so it makes sense that the nodes in the zonelist we pick are ordered by the
distance from that node, regardless of the current node.

For MPOL_BIND, we don't have preferences but restrictions. If the current CPU is
on a node within the restriction, then great. If it's not, finding a node
according to distance from the current CPU is probably less arbitrary than by
distance from the node that happens to have the lowest id in the node mask?

> -aneesh
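
The distance argument above can be illustrated with a small userspace sketch. Everything in it is an assumption made for illustration: the four-node `node_distance_model` table is invented (loosely in the style of a SLIT table), and `first_allowed_node()` and `nearest_allowed_node()` are hypothetical helpers, not kernel functions. The sketch contrasts picking the lowest-numbered node in a bind mask with picking the allowed node closest to the node the task is currently running on.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_NODES_MODEL 4

/* Invented distance table: 10 = local, larger = farther away. */
static const int node_distance_model[NR_NODES_MODEL][NR_NODES_MODEL] = {
	{ 10, 20, 30, 40 },
	{ 20, 10, 20, 30 },
	{ 30, 20, 10, 20 },
	{ 40, 30, 20, 10 },
};

static bool node_allowed(int nd, uint64_t mask)
{
	return (mask >> nd) & 1u;
}

/* Lowest-numbered node in the mask, regardless of where the task runs. */
static int first_allowed_node(uint64_t mask)
{
	for (int n = 0; n < NR_NODES_MODEL; n++)
		if (node_allowed(n, mask))
			return n;
	return -1;	/* empty mask */
}

/* Allowed node with the smallest distance from 'from'; ties go to the lower id. */
static int nearest_allowed_node(int from, uint64_t mask)
{
	int best = -1;
	int best_dist = INT_MAX;

	assert(from >= 0 && from < NR_NODES_MODEL);
	for (int n = 0; n < NR_NODES_MODEL; n++) {
		if (!node_allowed(n, mask))
			continue;
		if (node_distance_model[from][n] < best_dist) {
			best_dist = node_distance_model[from][n];
			best = n;
		}
	}
	return best;
}
```

For a task running on node 3 with a bind mask of nodes 0 and 2, `first_allowed_node()` yields node 0 (distance 40 in this table) while `nearest_allowed_node()` yields node 2 (distance 20). In this model, starting from the current node and filtering by the mask gives the less arbitrary choice described above.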

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm, mempolicy: clean up __GFP_THISNODE confusion in policy_zonelist
  2016-10-21 12:08     ` Vlastimil Babka
@ 2016-10-21 12:25       ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 12+ messages in thread
From: Aneesh Kumar K.V @ 2016-10-21 12:25 UTC (permalink / raw)
  To: Vlastimil Babka, Michal Hocko, Andrew Morton
  Cc: Mel Gorman, David Rientjes, Anshuman Khandual, linux-mm, LKML,
	Michal Hocko

Vlastimil Babka <vbabka@suse.cz> writes:

> On 10/21/2016 01:34 PM, Aneesh Kumar K.V wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>>>
>>
>> For both MPOL_PREFERRED and MPOL_INTERLEAVE we pick the zonelist from
>> a node other than the currently running node. Why don't we do that for
>> MPOL_BIND? I.e., if the current node is not part of the policy node mask,
>> why are we not picking the first node from the policy node mask for
>> MPOL_BIND?
>
> For MPOL_PREFERRED and MPOL_INTERLEAVE we have an explicit preference of nodes,
> so it makes sense that the nodes in the zonelist we pick are ordered by
> distance from that node, regardless of the current node.
>
> For MPOL_BIND, we don't have preferences but restrictions. If the current CPU is
> on a node within the restriction, then great. If it's not, finding a node
> according to distance from the current CPU is probably less arbitrary than by
> distance from the node that happens to have the lowest id in the node mask?

I agree. This is related to the changes we are working on in this part of
the kernel. We are looking at adding support for coherent devices. By
default we don't want to allocate memory from the coherent device node,
but we are also looking at a user space interface that can be used to
force allocation from it.

For now, to avoid allocations hitting the coherent device, we build the
zonelists of the nodes such that zones from the coherent device are not
present in any other node's zonelist. We looked at using MPOL_BIND as
the user space interface to force allocation from the coherent device node.
That MPOL_BIND usage breaks because of the detail you mentioned above: the
zonelist is taken from the current node, and it does not contain the
coherent device's zones.

From what you are suggesting above, I guess the right approach is to add
the coherent node's zones to every node's zonelist and make sure the default
node mask used for allocation (N_MEMORY) doesn't include the coherent device
node?
-aneesh

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2016-10-21 12:26 UTC | newest]

Thread overview: 12+ messages
2016-10-13 12:59 [PATCH] mm, mempolicy: clean up __GFP_THISNODE confusion in policy_zonelist Michal Hocko
2016-10-13 12:59 ` Michal Hocko
2016-10-18  9:44 ` Vlastimil Babka
2016-10-18  9:44   ` Vlastimil Babka
2016-10-21 11:34 ` Aneesh Kumar K.V
2016-10-21 11:34   ` Aneesh Kumar K.V
2016-10-21 11:52   ` Michal Hocko
2016-10-21 11:52     ` Michal Hocko
2016-10-21 12:08   ` Vlastimil Babka
2016-10-21 12:08     ` Vlastimil Babka
2016-10-21 12:25     ` Aneesh Kumar K.V
2016-10-21 12:25       ` Aneesh Kumar K.V
