Re: [RFC] autonuma: Migrate on fault among multiple bound nodes

* Re: [RFC] autonuma: Migrate on fault among multiple bound nodes
       [not found] <20200916005936.232788-1-ying.huang@intel.com>
@ 2020-09-16  8:10 ` peterz
  2020-09-16  8:46   ` Huang, Ying
  2020-09-17  2:18   ` Huang, Ying
  2020-09-16 13:39 ` Qian Cai
  1 sibling, 2 replies; 7+ messages in thread
From: peterz @ 2020-09-16  8:10 UTC
  To: Huang Ying
  Cc: linux-mm, linux-kernel, Andrew Morton, Ingo Molnar, Mel Gorman,
	Rik van Riel, Johannes Weiner, Matthew Wilcox (Oracle),
	Dave Hansen, Andi Kleen, Michal Hocko, David Rientjes

On Wed, Sep 16, 2020 at 08:59:36AM +0800, Huang Ying wrote:

> So in this patch, if MPOL_BIND is used to bind the memory of the
> application to multiple nodes, and in the hint page fault handler both
> the faulting page node and the accessing node are in the policy
> nodemask, the kernel will try to migrate the page to the accessing
> node to reduce cross-node accesses.

Seems fair enough..

> Questions:
> 
> Sysctl knob kernel.numa_balancing enables/disables AutoNUMA
> optimization globally.  It now appears that specifying an explicit
> NUMA memory policy (e.g. via numactl, mbind(), etc.) acts as an
> implicit per-thread/VMA knob that enables/disables AutoNUMA
> optimization for that thread/VMA.  Although this looks like a side
> effect rather than an API, since commit fc3147245d19 ("mm: numa: Limit
> NUMA scanning to migrate-on-fault VMAs"), do some users rely on it?
> So the question is: do we need an explicit per-thread/VMA knob to
> enable/disable AutoNUMA optimization for the thread/VMA?  Or should we
> just use the global knob and either optimize all threads/VMAs as long
> as the explicitly specified memory policies are respected, or not
> optimize at all?

I don't understand the question; that commit is not about disabling NUMA
balancing, it's about avoiding pointless work and overhead. What's the
point of scanning memory if you're not going to be allowed to move it
anyway?
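
To illustrate the point about pointless scanning, here is a minimal
user-space sketch (not kernel code) of a scanner that skips any region it
would not be allowed to migrate. The toy_* names and the two boolean
fields are assumptions made for this sketch; they only loosely stand in
for the kernel's per-VMA checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a VMA; not the kernel's struct. */
struct toy_vma {
	bool migratable;	/* pages in this region can be moved at all */
	bool policy_mof;	/* policy permits migrate-on-fault */
};

/*
 * Count how many regions the scanner would actually visit.  Regions
 * whose pages could not be moved anyway are skipped up front, so no
 * scanning work or hinting faults are spent on them.
 */
static size_t toy_count_scanned(const struct toy_vma *vmas, size_t n)
{
	size_t scanned = 0;

	assert(vmas != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!vmas[i].migratable || !vmas[i].policy_mof)
			continue;	/* nothing could be moved; skip */
		scanned++;
	}
	return scanned;
}
```

In this model the skip is purely an optimization: it changes how much
work is done, not which pages are eligible to move.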

> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: David Rientjes <rientjes@google.com>
> ---
>  mm/mempolicy.c | 43 +++++++++++++++++++++++++++++++------------
>  1 file changed, 31 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index eddbe4e56c73..a941eab2de24 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1827,6 +1827,13 @@ static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  	return pol;
>  }
>  
> +static bool mpol_may_mof(struct mempolicy *pol)
> +{
> +	/* May migrate among bound nodes for MPOL_BIND */
> +	return pol->flags & MPOL_F_MOF ||
> +		(pol->mode == MPOL_BIND && nodes_weight(pol->v.nodes) > 1);
> +}

This is weird, why not just set F_MOF on the policy?

In fact, why wouldn't something like:

  mbind(.mode=MPOL_BIND, .flags=MPOL_MF_LAZY);

work today? Afaict MF_LAZY will unconditionally result in F_MOF.

> @@ -2494,20 +2503,30 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
>  		break;
>  
>  	case MPOL_BIND:
>  		/*
> +		 * Allows binding to multiple nodes.  If both current and
> +		 * accessing nodes are in policy nodemask, migrate to
> +		 * accessing node to optimize page placement. Otherwise,
> +		 * use current page if in policy nodemask or MPOL_F_MOF not
> +		 * set, else select nearest allowed node, if any.  If no
> +		 * allowed nodes, use current [!misplaced].
>  		 */
> +		if (node_isset(curnid, pol->v.nodes)) {
> +			if (node_isset(thisnid, pol->v.nodes)) {
> +				moron = true;
> +				polnid = thisnid;
> +			} else {
> +				goto out;
> +			}
> +		} else if (!(pol->flags & MPOL_F_MOF)) {
>  			goto out;
> +		} else {
> +			z = first_zones_zonelist(
>  				node_zonelist(numa_node_id(), GFP_HIGHUSER),
>  				gfp_zone(GFP_HIGHUSER),
>  				&pol->v.nodes);
> +			polnid = zone_to_nid(z->zone);
> +		}
>  		break;
>  
>  	default:

Did that want to be this instead? I don't think I follow the other
changes.

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index eddbe4e56c73..2a64913f9ac6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2501,8 +2501,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		 * else select nearest allowed node, if any.
 		 * If no allowed nodes, use current [!misplaced].
 		 */
-		if (node_isset(curnid, pol->v.nodes))
+		if (node_isset(curnid, pol->v.nodes)) {
+			if (node_isset(thisnid, pol->v.nodes))
+				goto moron;
 			goto out;
+		}
 		z = first_zones_zonelist(
 				node_zonelist(numa_node_id(), GFP_HIGHUSER),
 				gfp_zone(GFP_HIGHUSER),
@@ -2516,6 +2519,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 
 	/* Migrate the page towards the node whose CPU is referencing it */
 	if (pol->flags & MPOL_F_MORON) {
+moron:
 		polnid = thisnid;
 
 		if (!should_numa_migrate_memory(current, page, curnid, thiscpu))
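
For reference, the MPOL_BIND decision this diff is aiming for can be
modelled in user space roughly as below. This is only a sketch under
stated assumptions: the sketch_* names are made up, the nodemask is a
plain 64-bit mask, nearest_allowed stands in for the result of the
first_zones_zonelist() lookup, and the should_numa_migrate_memory()
filter is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NO_MIGRATE (-1)	/* hypothetical "not misplaced" result */

/* Stand-in for node_isset() on a 64-bit mask. */
static bool sketch_node_isset(int nid, uint64_t nodes)
{
	assert(nid >= 0 && nid < 64);
	return (nodes >> nid) & 1u;
}

/*
 * curnid:  node the page currently sits on
 * thisnid: node of the CPU taking the hinting fault
 * Returns the node to migrate to, or SKETCH_NO_MIGRATE.
 */
static int sketch_bind_target(uint64_t nodes, int curnid, int thisnid,
			      int nearest_allowed)
{
	int polnid;

	if (sketch_node_isset(curnid, nodes)) {
		/* Page is on an allowed node; only move within the mask. */
		if (!sketch_node_isset(thisnid, nodes))
			return SKETCH_NO_MIGRATE;	/* "goto out" */
		polnid = thisnid;			/* "goto moron" */
	} else {
		/* Page is outside the mask; pull it to an allowed node. */
		polnid = nearest_allowed;
	}
	return curnid != polnid ? polnid : SKETCH_NO_MIGRATE;
}
```

So with nodes {0,1} bound, a page on node 0 touched from node 1 moves to
node 1, while the same page touched from node 2 stays where it is.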

