linux-kernel.vger.kernel.org archive mirror
* Re: [PATCH -V2 1/2] mempolicy: Rename MPOL_F_MORON to MPOL_F_MOPRON
       [not found] ` <20201028023411.15045-2-ying.huang@intel.com>
@ 2020-10-29  9:04   ` Michal Hocko
  2020-10-30  7:27     ` Huang, Ying
  0 siblings, 1 reply; 10+ messages in thread
From: Michal Hocko @ 2020-10-29  9:04 UTC (permalink / raw)
  To: Huang Ying
  Cc: Peter Zijlstra, linux-mm, linux-kernel, Matthew Wilcox (Oracle),
	Rafael Aquini, Andrew Morton, Ingo Molnar, Mel Gorman,
	Rik van Riel, Johannes Weiner, Dave Hansen, Andi Kleen,
	David Rientjes

On Wed 28-10-20 10:34:10, Huang Ying wrote:
> To follow code-of-conduct better.

This is changing a user visible interface and any userspace which refers
to the existing name will fail to compile unless I am missing something.

Have you checked how many applications would be affected?

Btw I find "follow CoC better" a very weak argument without further
explanation.

> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Suggested-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> Acked-by: Rafael Aquini <aquini@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Rafael Aquini <aquini@redhat.com>
> ---
>  include/uapi/linux/mempolicy.h | 2 +-
>  kernel/sched/debug.c           | 2 +-
>  mm/mempolicy.c                 | 6 +++---
>  3 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> index 3354774af61e..3c3666d017e6 100644
> --- a/include/uapi/linux/mempolicy.h
> +++ b/include/uapi/linux/mempolicy.h
> @@ -60,7 +60,7 @@ enum {
>  #define MPOL_F_SHARED  (1 << 0)	/* identify shared policies */
>  #define MPOL_F_LOCAL   (1 << 1)	/* preferred local allocation */
>  #define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
> -#define MPOL_F_MORON	(1 << 4) /* Migrate On protnone Reference On Node */
> +#define MPOL_F_MOPRON	(1 << 4) /* Migrate On Protnone Reference On Node */
>  
>  
>  #endif /* _UAPI_LINUX_MEMPOLICY_H */
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 0655524700d2..8bfb6adb3f31 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -898,7 +898,7 @@ static void sched_show_numa(struct task_struct *p, struct seq_file *m)
>  
>  	task_lock(p);
>  	pol = p->mempolicy;
> -	if (pol && !(pol->flags & MPOL_F_MORON))
> +	if (pol && !(pol->flags & MPOL_F_MOPRON))
>  		pol = NULL;
>  	mpol_get(pol);
>  	task_unlock(p);
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 3fde772ef5ef..f6948b659643 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2511,7 +2511,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
>  	}
>  
>  	/* Migrate the page towards the node whose CPU is referencing it */
> -	if (pol->flags & MPOL_F_MORON) {
> +	if (pol->flags & MPOL_F_MOPRON) {
>  		polnid = thisnid;
>  
>  		if (!should_numa_migrate_memory(current, page, curnid, thiscpu))
> @@ -2802,7 +2802,7 @@ void __init numa_policy_init(void)
>  		preferred_node_policy[nid] = (struct mempolicy) {
>  			.refcnt = ATOMIC_INIT(1),
>  			.mode = MPOL_PREFERRED,
> -			.flags = MPOL_F_MOF | MPOL_F_MORON,
> +			.flags = MPOL_F_MOF | MPOL_F_MOPRON,
>  			.v = { .preferred_node = nid, },
>  		};
>  	}
> @@ -3010,7 +3010,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
>  	unsigned short mode = MPOL_DEFAULT;
>  	unsigned short flags = 0;
>  
> -	if (pol && pol != &default_policy && !(pol->flags & MPOL_F_MORON)) {
> +	if (pol && pol != &default_policy && !(pol->flags & MPOL_F_MOPRON)) {
>  		mode = pol->mode;
>  		flags = pol->flags;
>  	}
> -- 
> 2.28.0
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH -V2 1/2] mempolicy: Rename MPOL_F_MORON to MPOL_F_MOPRON
  2020-10-29  9:04   ` [PATCH -V2 1/2] mempolicy: Rename MPOL_F_MORON to MPOL_F_MOPRON Michal Hocko
@ 2020-10-30  7:27     ` Huang, Ying
  2020-10-30  8:25       ` Michal Hocko
  0 siblings, 1 reply; 10+ messages in thread
From: Huang, Ying @ 2020-10-30  7:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Peter Zijlstra, linux-mm, linux-kernel, Matthew Wilcox (Oracle),
	Rafael Aquini, Andrew Morton, Ingo Molnar, Mel Gorman,
	Rik van Riel, Johannes Weiner, Dave Hansen, Andi Kleen,
	David Rientjes

Michal Hocko <mhocko@suse.com> writes:

> On Wed 28-10-20 10:34:10, Huang Ying wrote:
>> To follow code-of-conduct better.
>
> This is changing a user visible interface and any userspace which refers
> to the existing name will fail to compile unless I am missing something.

Although these flags are placed in the uapi header, they are actually
internal flags used in the "flags" field of struct mempolicy; they are
never used as flags for any user-space API.  I guess they were put in
the uapi header file to guarantee that they don't conflict with
MPOL_MODE_FLAGS.

> Have you checked how many applications would be affected?

Based on the above analysis, I think no application will be affected.

> Btw I find "follow CoC better" a very weak argument without further
> explanation.

That is the only reason for the patch.  If nobody thinks the change is
necessary, I can just drop the patch.

Best Regards,
Huang, Ying



* Re: [PATCH -V2 1/2] mempolicy: Rename MPOL_F_MORON to MPOL_F_MOPRON
  2020-10-30  7:27     ` Huang, Ying
@ 2020-10-30  8:25       ` Michal Hocko
  2020-11-02  3:12         ` Huang, Ying
  0 siblings, 1 reply; 10+ messages in thread
From: Michal Hocko @ 2020-10-30  8:25 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Peter Zijlstra, linux-mm, linux-kernel, Matthew Wilcox (Oracle),
	Rafael Aquini, Andrew Morton, Ingo Molnar, Mel Gorman,
	Rik van Riel, Johannes Weiner, Dave Hansen, Andi Kleen,
	David Rientjes

On Fri 30-10-20 15:27:51, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Wed 28-10-20 10:34:10, Huang Ying wrote:
> >> To follow code-of-conduct better.
> >
> > This is changing a user visible interface and any userspace which refers
> > to the existing name will fail to compile unless I am missing something.
> 
> Although these flags are placed in the uapi header, they are actually
> internal flags used in the "flags" field of struct mempolicy; they are
> never used as flags for any user-space API.  I guess they were put in
> the uapi header file to guarantee that they don't conflict with
> MPOL_MODE_FLAGS.

You are right. I have missed that. The comment in the header even explains
that. Anyway the placement is rather unusual and I think that those
flags do not belong there.
 
> > Have you checked how many applications would be affected?
> 
> Based on the above analysis, I think no application will be affected.
> 
> > Btw I find "follow CoC better" a very weak argument without further
> > explanation.
> 
> That is the only reason for the patch.  If nobody thinks the change is
> necessary, I can just drop the patch.

Well, to be honest I do not see any problem with the naming.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH -V2 1/2] mempolicy: Rename MPOL_F_MORON to MPOL_F_MOPRON
  2020-10-30  8:25       ` Michal Hocko
@ 2020-11-02  3:12         ` Huang, Ying
  0 siblings, 0 replies; 10+ messages in thread
From: Huang, Ying @ 2020-11-02  3:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Peter Zijlstra, linux-mm, linux-kernel, Matthew Wilcox (Oracle),
	Rafael Aquini, Andrew Morton, Ingo Molnar, Mel Gorman,
	Rik van Riel, Johannes Weiner, Dave Hansen, Andi Kleen,
	David Rientjes

Michal Hocko <mhocko@suse.com> writes:

> On Fri 30-10-20 15:27:51, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>> 
>> > On Wed 28-10-20 10:34:10, Huang Ying wrote:
>> >> To follow code-of-conduct better.
>> >
>> > This is changing a user visible interface and any userspace which refers
>> > to the existing name will fail to compile unless I am missing something.
>> 
>> Although these flags are placed in the uapi header, they are actually
>> internal flags used in the "flags" field of struct mempolicy; they are
>> never used as flags for any user-space API.  I guess they were put in
>> the uapi header file to guarantee that they don't conflict with
>> MPOL_MODE_FLAGS.
>
> You are right. I have missed that. The comment in the header even explains
> that. Anyway the placement is rather unusual and I think that those
> flags do not belong there.
>  
>> > Have you checked how many applications would be affected?
>> 
>> Based on the above analysis, I think no application will be affected.
>> 
>> > Btw I find "follow CoC better" a very weak argument without further
>> > explanation.
>> 
>> That is the only reason for the patch.  If nobody thinks the change is
>> necessary, I can just drop the patch.
>
> Well, to be honest I do not see any problem with the naming.

The constant is capitalized and prefixed, so most people think it's OK.
But in PATCH 2/2, there's a newly added label,

mopron:

which without this rename would be

moron:

Some people think we'd better change that.  And to keep the wording
consistent, the constant is renamed too.

Best Regards,
Huang, Ying


* Re: [PATCH -V2 2/2] autonuma: Migrate on fault among multiple bound nodes
       [not found] ` <20201028023411.15045-3-ying.huang@intel.com>
@ 2020-11-02 11:17   ` Mel Gorman
  2020-11-04  5:36     ` Huang, Ying
  0 siblings, 1 reply; 10+ messages in thread
From: Mel Gorman @ 2020-11-02 11:17 UTC (permalink / raw)
  To: Huang Ying
  Cc: Peter Zijlstra, linux-mm, linux-kernel, Andrew Morton,
	Ingo Molnar, Rik van Riel, Johannes Weiner,
	Matthew Wilcox (Oracle),
	Dave Hansen, Andi Kleen, Michal Hocko, David Rientjes

On Wed, Oct 28, 2020 at 10:34:11AM +0800, Huang Ying wrote:
> Now, AutoNUMA can only optimize page placement among the NUMA nodes if
> the default memory policy is used, because an explicitly specified
> memory policy should take precedence.  But this seems too strict in
> some situations.  For example, on a system with 4 NUMA nodes, if the
> memory of an application is bound to nodes 0 and 1, AutoNUMA can
> potentially migrate pages between node 0 and node 1 to reduce
> cross-node accesses without breaking the explicit memory binding policy.
> 
> So in this patch, if mbind(.mode=MPOL_BIND, .flags=MPOL_MF_LAZY) is
> used to bind the memory of the application to multiple nodes, and in
> the hint page fault handler both the faulting page node and the
> accessing node are in the policy nodemask, we try to migrate the page
> to the accessing node to reduce cross-node accesses.
> 
> [Peter Zijlstra: provided the simplified implementation method.]
> 
> Questions:
> 
> Sysctl knob kernel.numa_balancing can enable/disable AutoNUMA optimizing
> globally.  But for the memory areas that are bound to multiple NUMA nodes, even
> if the AutoNUMA is enabled globally via the sysctl knob, we still need to enable
> AutoNUMA again with a special flag.  Why not just optimize the page placement if
> possible as long as AutoNUMA is enabled globally?  The interface would look
> simpler with that.
> 
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>

I've no specific objection to the patch or the name change. I can't
remember exactly why I picked the name, it was 8 years ago but I think it
was because the policy represented the most basic possible approach that
could be done without any attempt at being intelligent and established
a baseline. The intent was that anything built on top had to be better
than the most basic policy imaginable. The name reflected the dictionary
definition at the time and happened to match the acronym closely enough
and I wanted to make it absolutely clear to reviewers that the policy
was not good enough (ruling out MPOL_BASIC or variants thereof) even if
it happened to work for some workload and there was no intent to report
it to the userspace API.

The only hazard with the patch is that applications that use MPOL_BIND
on multiple nodes may now incur some NUMA balancing overhead due to
trapping faults and migrations. It might still end up being better but
I was not aware of a *realistic* workload that binds to multiple nodes
deliberately. Generally I expect if an application is binding, it's
binding to one local node.

If it shows up in regressions, it'll be interesting to get a detailed
description of the workload. Pay particular attention to if THP is
disabled as I learned relatively recently that NUMA balancing with THP
disabled has higher overhead (which is hardly surprising).

Lacking data or a specific objection

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH -V2 2/2] autonuma: Migrate on fault among multiple bound nodes
  2020-11-02 11:17   ` [PATCH -V2 2/2] autonuma: Migrate on fault among multiple bound nodes Mel Gorman
@ 2020-11-04  5:36     ` Huang, Ying
  2020-11-05 11:25       ` Mel Gorman
  0 siblings, 1 reply; 10+ messages in thread
From: Huang, Ying @ 2020-11-04  5:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, linux-mm, linux-kernel, Andrew Morton,
	Ingo Molnar, Rik van Riel, Johannes Weiner,
	Matthew Wilcox (Oracle),
	Dave Hansen, Andi Kleen, Michal Hocko, David Rientjes

Hi, Mel,

Thanks for comments!

Mel Gorman <mgorman@suse.de> writes:

> On Wed, Oct 28, 2020 at 10:34:11AM +0800, Huang Ying wrote:
>> Now, AutoNUMA can only optimize page placement among the NUMA nodes if
>> the default memory policy is used, because an explicitly specified
>> memory policy should take precedence.  But this seems too strict in
>> some situations.  For example, on a system with 4 NUMA nodes, if the
>> memory of an application is bound to nodes 0 and 1, AutoNUMA can
>> potentially migrate pages between node 0 and node 1 to reduce
>> cross-node accesses without breaking the explicit memory binding policy.
>> 
>> So in this patch, if mbind(.mode=MPOL_BIND, .flags=MPOL_MF_LAZY) is
>> used to bind the memory of the application to multiple nodes, and in
>> the hint page fault handler both the faulting page node and the
>> accessing node are in the policy nodemask, we try to migrate the page
>> to the accessing node to reduce cross-node accesses.
>> 
>> [Peter Zijlstra: provided the simplified implementation method.]
>> 
>> Questions:
>> 
>> Sysctl knob kernel.numa_balancing can enable/disable AutoNUMA optimizing
>> globally.  But for the memory areas that are bound to multiple NUMA nodes, even
>> if the AutoNUMA is enabled globally via the sysctl knob, we still need to enable
>> AutoNUMA again with a special flag.  Why not just optimize the page placement if
>> possible as long as AutoNUMA is enabled globally?  The interface would look
>> simpler with that.
>> 
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>
> I've no specific objection to the patch or the name change. I can't
> remember exactly why I picked the name, it was 8 years ago but I think it
> was because the policy represented the most basic possible approach that
> could be done without any attempt at being intelligent and established
> a baseline. The intent was that anything built on top had to be better
> than the most basic policy imaginable. The name reflected the dictionary
> definition at the time and happened to match the acronym closely enough
> and I wanted to make it absolutely clear to reviewers that the policy
> was not good enough (ruling out MPOL_BASIC or variants thereof) even if
> it happened to work for some workload and there was no intent to report
> it to the userspace API.
>
> The only hazard with the patch is that applications that use MPOL_BIND
> on multiple nodes may now incur some NUMA balancing overhead due to
> trapping faults and migrations.

For this specific version of the patch, I don't think this will happen.
Now, MPOL_F_MOF needs to be set in struct mempolicy, and for MPOL_BIND
that is only the case if the mbind() syscall is called with
MPOL_MF_LAZY.  So I think most workloads will not be affected by this
patch.  The feature is opt-in.

But from another point of view, I suggest removing the MPOL_F_MOF
constraint in the future.  If the overhead of AutoNUMA isn't
acceptable, why not just disable AutoNUMA globally via the sysctl knob?

> It might still end up being better but I was not aware of a
> *realistic* workload that binds to multiple nodes
> deliberately. Generally I expect if an application is binding, it's
> binding to one local node.

Yes.  It's not a popular configuration for now.  But on a memory
tiering system with both DRAM and PMEM, the DRAM and the PMEM in one
socket become 2 NUMA nodes.  To avoid too much cross-socket memory
access while taking advantage of both the DRAM and the PMEM, the
workload can be bound to those 2 NUMA nodes (DRAM and PMEM).

> If it shows up in regressions, it'll be interesting to get a detailed
> description of the workload. Pay particular attention to if THP is
> disabled as I learned relatively recently that NUMA balancing with THP
> disabled has higher overhead (which is hardly surprising).

Got it!

> Lacking data or a specific objection
>
> Acked-by: Mel Gorman <mgorman@suse.de>

Thanks!

Best Regards,
Huang, Ying


* Re: [PATCH -V2 2/2] autonuma: Migrate on fault among multiple bound nodes
  2020-11-04  5:36     ` Huang, Ying
@ 2020-11-05 11:25       ` Mel Gorman
  2020-11-06  7:28         ` Huang, Ying
  2020-11-11  6:50         ` Huang, Ying
  0 siblings, 2 replies; 10+ messages in thread
From: Mel Gorman @ 2020-11-05 11:25 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Peter Zijlstra, linux-mm, linux-kernel, Andrew Morton,
	Ingo Molnar, Rik van Riel, Johannes Weiner,
	Matthew Wilcox (Oracle),
	Dave Hansen, Andi Kleen, Michal Hocko, David Rientjes

On Wed, Nov 04, 2020 at 01:36:58PM +0800, Huang, Ying wrote:
> > I've no specific objection to the patch or the name change. I can't
> > remember exactly why I picked the name, it was 8 years ago but I think it
> > was because the policy represented the most basic possible approach that
> > could be done without any attempt at being intelligent and established
> > a baseline. The intent was that anything built on top had to be better
> > than the most basic policy imaginable. The name reflected the dictionary
> > definition at the time and happened to match the acronym closely enough
> > and I wanted to make it absolutely clear to reviewers that the policy
> > was not good enough (ruling out MPOL_BASIC or variants thereof) even if
> > it happened to work for some workload and there was no intent to report
> > it to the userspace API.
> >
> > The only hazard with the patch is that applications that use MPOL_BIND
> > on multiple nodes may now incur some NUMA balancing overhead due to
> > trapping faults and migrations.
> 
> For this specific version of the patch, I don't think this will happen.
> Now, MPOL_F_MOF needs to be set in struct mempolicy, and for MPOL_BIND
> that is only the case if the mbind() syscall is called with
> MPOL_MF_LAZY.  So I think most workloads will not be affected by this
> patch.  The feature is opt-in.
> 

Ok.

> But from another point of view, I suggest removing the MPOL_F_MOF
> constraint in the future.  If the overhead of AutoNUMA isn't
> acceptable, why not just disable AutoNUMA globally via the sysctl knob?
> 

Because it's a double edged sword. NUMA Balancing can make a workload
faster while still incurring more overhead than it should -- particularly
when threads are involved rescanning the same or unrelated regions.
Global disabling only really should happen when an application is running
that is the only application on the machine and has full NUMA awareness.

> > It might still end up being better but I was not aware of a
> > *realistic* workload that binds to multiple nodes
> > deliberately. Generally I expect if an application is binding, it's
> > binding to one local node.
> 
> Yes.  It's not a popular configuration for now.  But on a memory
> tiering system with both DRAM and PMEM, the DRAM and the PMEM in one
> socket become 2 NUMA nodes.  To avoid too much cross-socket memory
> access while taking advantage of both the DRAM and the PMEM, the
> workload can be bound to those 2 NUMA nodes (DRAM and PMEM).
> 

Ok, that may lead to unpredictable performance as it'll have variable
performance with limited control of the "important" applications that
should use DRAM over PMEM. That's a long road but the step is not
incompatible with the long-term goal.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH -V2 2/2] autonuma: Migrate on fault among multiple bound nodes
  2020-11-05 11:25       ` Mel Gorman
@ 2020-11-06  7:28         ` Huang, Ying
  2020-11-06 15:55           ` Ben Widawsky
  2020-11-11  6:50         ` Huang, Ying
  1 sibling, 1 reply; 10+ messages in thread
From: Huang, Ying @ 2020-11-06  7:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, linux-mm, linux-kernel, Andrew Morton,
	Ingo Molnar, Rik van Riel, Johannes Weiner,
	Matthew Wilcox (Oracle),
	Dave Hansen, Andi Kleen, Michal Hocko, David Rientjes,
	ben.widawsky

Mel Gorman <mgorman@suse.de> writes:

> On Wed, Nov 04, 2020 at 01:36:58PM +0800, Huang, Ying wrote:
>> But from another point of view, I suggest removing the MPOL_F_MOF
>> constraint in the future.  If the overhead of AutoNUMA isn't
>> acceptable, why not just disable AutoNUMA globally via the sysctl knob?
>> 
>
> Because it's a double edged sword. NUMA Balancing can make a workload
> faster while still incurring more overhead than it should -- particularly
> when threads are involved rescanning the same or unrelated regions.
> Global disabling only really should happen when an application is running
> that is the only application on the machine and has full NUMA awareness.

Got it.  So NUMA balancing may in general benefit some workloads but
hurt other workloads on the same machine, so we need a way to
enable/disable NUMA balancing per workload.  Previously, this was done
via the explicit NUMA policy: if an explicit NUMA policy is specified,
NUMA balancing is disabled for the memory region or the thread, and
this can be reverted again for a memory region via MPOL_MF_LAZY.  It
appears that we lack an MPOL_MF_LAZY equivalent for the thread yet.

>> > It might still end up being better but I was not aware of a
>> > *realistic* workload that binds to multiple nodes
>> > deliberately. Generally I expect if an application is binding, it's
>> > binding to one local node.
>> 
>> Yes.  It's not a popular configuration for now.  But on a memory
>> tiering system with both DRAM and PMEM, the DRAM and the PMEM in one
>> socket become 2 NUMA nodes.  To avoid too much cross-socket memory
>> access while taking advantage of both the DRAM and the PMEM, the
>> workload can be bound to those 2 NUMA nodes (DRAM and PMEM).
>> 
>
> Ok, that may lead to unpredictable performance as it'll have variable
> performance with limited control of the "important" applications that
> should use DRAM over PMEM. That's a long road but the step is not
> incompatible with the long-term goal.

Yes.  Ben Widawsky is working on a patchset to make it possible to
prefer the remote DRAM instead of the local PMEM as follows,

https://lore.kernel.org/linux-mm/20200630212517.308045-1-ben.widawsky@intel.com/

Best Regards,
Huang, Ying


* Re: [PATCH -V2 2/2] autonuma: Migrate on fault among multiple bound nodes
  2020-11-06  7:28         ` Huang, Ying
@ 2020-11-06 15:55           ` Ben Widawsky
  0 siblings, 0 replies; 10+ messages in thread
From: Ben Widawsky @ 2020-11-06 15:55 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Mel Gorman, Peter Zijlstra, linux-mm, linux-kernel,
	Andrew Morton, Ingo Molnar, Rik van Riel, Johannes Weiner,
	Matthew Wilcox (Oracle),
	Dave Hansen, Andi Kleen, Michal Hocko, David Rientjes

On 20-11-06 15:28:59, Huang, Ying wrote:
> Mel Gorman <mgorman@suse.de> writes:
> 
> > On Wed, Nov 04, 2020 at 01:36:58PM +0800, Huang, Ying wrote:
> >> But from another point of view, I suggest removing the MPOL_F_MOF
> >> constraint in the future.  If the overhead of AutoNUMA isn't
> >> acceptable, why not just disable AutoNUMA globally via the sysctl knob?
> >> 
> >
> > Because it's a double edged sword. NUMA Balancing can make a workload
> > faster while still incurring more overhead than it should -- particularly
> > when threads are involved rescanning the same or unrelated regions.
> > Global disabling only really should happen when an application is running
> > that is the only application on the machine and has full NUMA awareness.
> 
> Got it.  So NUMA balancing may in general benefit some workloads but
> hurt other workloads on the same machine, so we need a way to
> enable/disable NUMA balancing per workload.  Previously, this was done
> via the explicit NUMA policy: if an explicit NUMA policy is specified,
> NUMA balancing is disabled for the memory region or the thread, and
> this can be reverted again for a memory region via MPOL_MF_LAZY.  It
> appears that we lack an MPOL_MF_LAZY equivalent for the thread yet.
> 
> >> > It might still end up being better but I was not aware of a
> >> > *realistic* workload that binds to multiple nodes
> >> > deliberately. Generally I expect if an application is binding, it's
> >> > binding to one local node.
> >> 
> >> Yes.  It's not a popular configuration for now.  But on a memory
> >> tiering system with both DRAM and PMEM, the DRAM and the PMEM in one
> >> socket become 2 NUMA nodes.  To avoid too much cross-socket memory
> >> access while taking advantage of both the DRAM and the PMEM, the
> >> workload can be bound to those 2 NUMA nodes (DRAM and PMEM).
> >> 
> >
> > Ok, that may lead to unpredictable performance as it'll have variable
> > performance with limited control of the "important" applications that
> > should use DRAM over PMEM. That's a long road but the step is not
> > incompatible with the long-term goal.
> 
> Yes.  Ben Widawsky is working on a patchset to make it possible to
> prefer the remote DRAM instead of the local PMEM as follows,
> 
> https://lore.kernel.org/linux-mm/20200630212517.308045-1-ben.widawsky@intel.com/
> 
> Best Regards,
> Huang, Ying
> 

Rebased version was posted here:
https://lore.kernel.org/linux-mm/20201030190238.306764-1-ben.widawsky@intel.com/

Thanks.
Ben


* Re: [PATCH -V2 2/2] autonuma: Migrate on fault among multiple bound nodes
  2020-11-05 11:25       ` Mel Gorman
  2020-11-06  7:28         ` Huang, Ying
@ 2020-11-11  6:50         ` Huang, Ying
  1 sibling, 0 replies; 10+ messages in thread
From: Huang, Ying @ 2020-11-11  6:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, linux-mm, linux-kernel, Andrew Morton,
	Ingo Molnar, Rik van Riel, Johannes Weiner,
	Matthew Wilcox (Oracle),
	Dave Hansen, Andi Kleen, Michal Hocko, David Rientjes

Hi, Mel,

Mel Gorman <mgorman@suse.de> writes:

> On Wed, Nov 04, 2020 at 01:36:58PM +0800, Huang, Ying wrote:
>> > I've no specific objection to the patch or the name change. I can't
>> > remember exactly why I picked the name, it was 8 years ago but I think it
>> > was because the policy represented the most basic possible approach that
>> > could be done without any attempt at being intelligent and established
>> > a baseline. The intent was that anything built on top had to be better
>> > than the most basic policy imaginable. The name reflected the dictionary
>> > definition at the time and happened to match the acronym closely enough
>> > and I wanted to make it absolutely clear to reviewers that the policy
>> > was not good enough (ruling out MPOL_BASIC or variants thereof) even if
>> > it happened to work for some workload and there was no intent to report
>> > it to the userspace API.
>> >
>> > The only hazard with the patch is that applications that use MPOL_BIND
>> > on multiple nodes may now incur some NUMA balancing overhead due to
>> > trapping faults and migrations.
>> 
>> For this specific version of the patch, I don't think this will happen.
>> Now, MPOL_F_MOF needs to be set in struct mempolicy, and for MPOL_BIND
>> that is only the case if the mbind() syscall is called with
>> MPOL_MF_LAZY.  So I think most workloads will not be affected by this
>> patch.  The feature is opt-in.
>> 
>
> Ok.

I just found that MPOL_MF_LAZY is disabled now.  And as in commit
a720094ded8c ("mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from
userspace for now"), the ABI needs to be revisited before it is
exported to user space formally.  Sorry about that, I should have
found it earlier.

Thinking about it, MPOL_MF_LAZY is tied to MPOL_MF_MOVE, so its
semantics aren't a good fit for the purpose of this patch.  So I have
rewritten the patch and its description and sent it out as follows;
could you help review it?

https://lore.kernel.org/lkml/20201111063717.186589-1-ying.huang@intel.com/

Best Regards,
Huang, Ying

