linux-mm.kvack.org archive mirror
* [PATCH RESEND] autonuma: Fix scan period updating
@ 2019-07-25  8:01 Huang, Ying
  2019-07-25 17:35 ` Srikar Dronamraju
  0 siblings, 1 reply; 9+ messages in thread
From: Huang, Ying @ 2019-07-25  8:01 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-mm, linux-kernel, Huang Ying, Rik van Riel, Mel Gorman,
	jhladky, lvenanci, Andrew Morton

From: Huang Ying <ying.huang@intel.com>

From the commit log and comments of commit 37ec97deb3a8 ("sched/numa:
Slow down scan rate if shared faults dominate"), the autonuma scan
period should be increased (scanning is slowed down) if the majority
of the page accesses are shared with other processes.  But in the
current code, the scan period is decreased (scanning is sped up) in
that situation.

The commit log and comments make more sense, so this patch fixes the
code to match them.  This has been verified by tracing the scan
period changes and the /proc/vmstat numa_pte_updates counter while
running a multi-threaded memory-access program (most memory areas
are accessed by multiple threads).

Fixes: 37ec97deb3a8 ("sched/numa: Slow down scan rate if shared faults dominate")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: jhladky@redhat.com
Cc: lvenanci@redhat.com
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 kernel/sched/fair.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 036be95a87e9..468a1c5038b2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1940,7 +1940,7 @@ static void update_task_scan_period(struct task_struct *p,
 			unsigned long shared, unsigned long private)
 {
 	unsigned int period_slot;
-	int lr_ratio, ps_ratio;
+	int lr_ratio, sp_ratio;
 	int diff;
 
 	unsigned long remote = p->numa_faults_locality[0];
@@ -1971,22 +1971,22 @@ static void update_task_scan_period(struct task_struct *p,
 	 */
 	period_slot = DIV_ROUND_UP(p->numa_scan_period, NUMA_PERIOD_SLOTS);
 	lr_ratio = (local * NUMA_PERIOD_SLOTS) / (local + remote);
-	ps_ratio = (private * NUMA_PERIOD_SLOTS) / (private + shared);
+	sp_ratio = (shared * NUMA_PERIOD_SLOTS) / (private + shared);
 
-	if (ps_ratio >= NUMA_PERIOD_THRESHOLD) {
+	if (sp_ratio >= NUMA_PERIOD_THRESHOLD) {
 		/*
-		 * Most memory accesses are local. There is no need to
-		 * do fast NUMA scanning, since memory is already local.
+		 * Most memory accesses are shared with other tasks.
+		 * There is no point in continuing fast NUMA scanning,
+		 * since other tasks may just move the memory elsewhere.
 		 */
-		int slot = ps_ratio - NUMA_PERIOD_THRESHOLD;
+		int slot = sp_ratio - NUMA_PERIOD_THRESHOLD;
 		if (!slot)
 			slot = 1;
 		diff = slot * period_slot;
 	} else if (lr_ratio >= NUMA_PERIOD_THRESHOLD) {
 		/*
-		 * Most memory accesses are shared with other tasks.
-		 * There is no point in continuing fast NUMA scanning,
-		 * since other tasks may just move the memory elsewhere.
+		 * Most memory accesses are local. There is no need to
+		 * do fast NUMA scanning, since memory is already local.
 		 */
 		int slot = lr_ratio - NUMA_PERIOD_THRESHOLD;
 		if (!slot)
@@ -1998,7 +1998,7 @@ static void update_task_scan_period(struct task_struct *p,
 		 * yet they are not on the local NUMA node. Speed up
 		 * NUMA scanning to get the memory moved over.
 		 */
-		int ratio = max(lr_ratio, ps_ratio);
+		int ratio = max(lr_ratio, sp_ratio);
 		diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
 	}
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH RESEND] autonuma: Fix scan period updating
  2019-07-25  8:01 [PATCH RESEND] autonuma: Fix scan period updating Huang, Ying
@ 2019-07-25 17:35 ` Srikar Dronamraju
  2019-07-26  7:45   ` Huang, Ying
  0 siblings, 1 reply; 9+ messages in thread
From: Srikar Dronamraju @ 2019-07-25 17:35 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Peter Zijlstra, Ingo Molnar, linux-mm, linux-kernel,
	Rik van Riel, Mel Gorman, jhladky, lvenanci, Andrew Morton

* Huang, Ying <ying.huang@intel.com> [2019-07-25 16:01:24]:

> From: Huang Ying <ying.huang@intel.com>
> 
> From the commit log and comments of commit 37ec97deb3a8 ("sched/numa:
> Slow down scan rate if shared faults dominate"), the autonuma scan
> period should be increased (scanning is slowed down) if the majority
> of the page accesses are shared with other processes.  But in current
> code, the scan period will be decreased (scanning is speeded up) in
> that situation.
> 
> The commit log and comments make more sense.  So this patch fixes the
> code to make it match the commit log and comments.  And this has been
> verified via tracing the scan period changing and /proc/vmstat
> numa_pte_updates counter when running a multi-threaded memory
> accessing program (most memory areas are accessed by multiple
> threads).
> 

Let's split this into 4 modes.
More Local and Private Page Accesses:
We definitely want to scan slowly i.e increase the scan window.

More Local and Shared Page Accesses:
We still want to scan slowly because we have consolidated and there is no
point in scanning faster. So scan slowly + increase the scan window.
(Do remember that an access on any active node counts as local!)

More Remote + Private page Accesses:
Most likely the Private accesses are going to be local accesses.

In the unlikely event of the private accesses not being local, we should
scan faster so that the memory and task consolidates.

More Remote + Shared page Accesses: This means the workload has not
consolidated, so we need to scan faster.

So I would think we should go back to before 37ec97deb3a8.

i.e.

	int slot = lr_ratio - NUMA_PERIOD_THRESHOLD;

	if (!slot)
		slot = 1;
	diff = slot * period_slot;


No?

> Fixes: 37ec97deb3a8 ("sched/numa: Slow down scan rate if shared faults dominate")
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: jhladky@redhat.com
> Cc: lvenanci@redhat.com
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>  kernel/sched/fair.c | 20 ++++++++++----------
>  1 file changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 036be95a87e9..468a1c5038b2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1940,7 +1940,7 @@ static void update_task_scan_period(struct task_struct *p,
>  			unsigned long shared, unsigned long private)
>  {
>  	unsigned int period_slot;
> -	int lr_ratio, ps_ratio;
> +	int lr_ratio, sp_ratio;
>  	int diff;
>  
>  	unsigned long remote = p->numa_faults_locality[0];
> @@ -1971,22 +1971,22 @@ static void update_task_scan_period(struct task_struct *p,
>  	 */
>  	period_slot = DIV_ROUND_UP(p->numa_scan_period, NUMA_PERIOD_SLOTS);
>  	lr_ratio = (local * NUMA_PERIOD_SLOTS) / (local + remote);
> -	ps_ratio = (private * NUMA_PERIOD_SLOTS) / (private + shared);
> +	sp_ratio = (shared * NUMA_PERIOD_SLOTS) / (private + shared);
>  
> -	if (ps_ratio >= NUMA_PERIOD_THRESHOLD) {
> +	if (sp_ratio >= NUMA_PERIOD_THRESHOLD) {
>  		/*
> -		 * Most memory accesses are local. There is no need to
> -		 * do fast NUMA scanning, since memory is already local.
> +		 * Most memory accesses are shared with other tasks.
> +		 * There is no point in continuing fast NUMA scanning,
> +		 * since other tasks may just move the memory elsewhere.

With this change, I would expect consolidation to take a hit for
shared page accesses.

>  		 */
> -		int slot = ps_ratio - NUMA_PERIOD_THRESHOLD;
> +		int slot = sp_ratio - NUMA_PERIOD_THRESHOLD;
>  		if (!slot)
>  			slot = 1;
>  		diff = slot * period_slot;
>  	} else if (lr_ratio >= NUMA_PERIOD_THRESHOLD) {
>  		/*
> -		 * Most memory accesses are shared with other tasks.
> -		 * There is no point in continuing fast NUMA scanning,
> -		 * since other tasks may just move the memory elsewhere.
> +		 * Most memory accesses are local. There is no need to
> +		 * do fast NUMA scanning, since memory is already local.

Comment-wise this makes sense.

>  		 */
>  		int slot = lr_ratio - NUMA_PERIOD_THRESHOLD;
>  		if (!slot)
> @@ -1998,7 +1998,7 @@ static void update_task_scan_period(struct task_struct *p,
>  		 * yet they are not on the local NUMA node. Speed up
>  		 * NUMA scanning to get the memory moved over.
>  		 */
> -		int ratio = max(lr_ratio, ps_ratio);
> +		int ratio = max(lr_ratio, sp_ratio);
>  		diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
>  	}
>  
> -- 
> 2.20.1
> 

-- 
Thanks and Regards
Srikar Dronamraju



* Re: [PATCH RESEND] autonuma: Fix scan period updating
  2019-07-25 17:35 ` Srikar Dronamraju
@ 2019-07-26  7:45   ` Huang, Ying
  2019-07-26  9:20     ` Srikar Dronamraju
  0 siblings, 1 reply; 9+ messages in thread
From: Huang, Ying @ 2019-07-26  7:45 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, linux-mm, linux-kernel,
	Rik van Riel, Mel Gorman, jhladky, lvenanci, Andrew Morton

Hi, Srikar,

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:

> * Huang, Ying <ying.huang@intel.com> [2019-07-25 16:01:24]:
>
>> From: Huang Ying <ying.huang@intel.com>
>> 
>> From the commit log and comments of commit 37ec97deb3a8 ("sched/numa:
>> Slow down scan rate if shared faults dominate"), the autonuma scan
>> period should be increased (scanning is slowed down) if the majority
>> of the page accesses are shared with other processes.  But in current
>> code, the scan period will be decreased (scanning is speeded up) in
>> that situation.
>> 
>> The commit log and comments make more sense.  So this patch fixes the
>> code to make it match the commit log and comments.  And this has been
>> verified via tracing the scan period changing and /proc/vmstat
>> numa_pte_updates counter when running a multi-threaded memory
>> accessing program (most memory areas are accessed by multiple
>> threads).
>> 
>
> Lets split into 4 modes.
> More Local and Private Page Accesses:
> We definitely want to scan slowly i.e increase the scan window.
>
> More Local and Shared Page Accesses:
> We still want to scan slowly because we have consolidated and there is no
> point in scanning faster. So scan slowly + increase the scan window.
> (Do remember access on any active node counts as local!!!)
>
> More Remote + Private page Accesses:
> Most likely the Private accesses are going to be local accesses.
>
> In the unlikely event of the private accesses not being local, we should
> scan faster so that the memory and task consolidates.
>
> More Remote + Shared page Accesses: This means the workload has not
> consolidated and needs to scan faster. So we need to scan faster.

This sounds reasonable.  But

lr_ratio < NUMA_PERIOD_THRESHOLD

doesn't necessarily indicate more remote accesses; it also holds when
local equals remote.  If shared accesses also dominate, we should slow
down the scanning.  So the logic could be:

if (lr_ratio >= NUMA_PERIOD_THRESHOLD)
    slow down scanning
else if (sp_ratio >= NUMA_PERIOD_THRESHOLD) {
    if (NUMA_PERIOD_SLOTS - lr_ratio >= NUMA_PERIOD_THRESHOLD)
        speed up scanning
    else
        slow down scanning
} else
   speed up scanning

Does this follow your idea better?

Best Regards,
Huang, Ying

> So I would think we should go back to before 37ec97deb3a8.
>
> i.e 
>
> 	int slot = lr_ratio - NUMA_PERIOD_THRESHOLD;
>
> 	if (!slot)
> 		slot = 1;
> 	diff = slot * period_slot;
>
>
> No?
>
>> Fixes: 37ec97deb3a8 ("sched/numa: Slow down scan rate if shared faults dominate")
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Cc: Rik van Riel <riel@redhat.com>
>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: jhladky@redhat.com
>> Cc: lvenanci@redhat.com
>> Cc: Ingo Molnar <mingo@kernel.org>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> ---
>>  kernel/sched/fair.c | 20 ++++++++++----------
>>  1 file changed, 10 insertions(+), 10 deletions(-)
>> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 036be95a87e9..468a1c5038b2 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1940,7 +1940,7 @@ static void update_task_scan_period(struct task_struct *p,
>>  			unsigned long shared, unsigned long private)
>>  {
>>  	unsigned int period_slot;
>> -	int lr_ratio, ps_ratio;
>> +	int lr_ratio, sp_ratio;
>>  	int diff;
>>  
>>  	unsigned long remote = p->numa_faults_locality[0];
>> @@ -1971,22 +1971,22 @@ static void update_task_scan_period(struct task_struct *p,
>>  	 */
>>  	period_slot = DIV_ROUND_UP(p->numa_scan_period, NUMA_PERIOD_SLOTS);
>>  	lr_ratio = (local * NUMA_PERIOD_SLOTS) / (local + remote);
>> -	ps_ratio = (private * NUMA_PERIOD_SLOTS) / (private + shared);
>> +	sp_ratio = (shared * NUMA_PERIOD_SLOTS) / (private + shared);
>>  
>> -	if (ps_ratio >= NUMA_PERIOD_THRESHOLD) {
>> +	if (sp_ratio >= NUMA_PERIOD_THRESHOLD) {
>>  		/*
>> -		 * Most memory accesses are local. There is no need to
>> -		 * do fast NUMA scanning, since memory is already local.
>> +		 * Most memory accesses are shared with other tasks.
>> +		 * There is no point in continuing fast NUMA scanning,
>> +		 * since other tasks may just move the memory elsewhere.
>
> With this change, I would expect that with Shared page accesses,
> consolidation to take a hit.
>
>>  		 */
>> -		int slot = ps_ratio - NUMA_PERIOD_THRESHOLD;
>> +		int slot = sp_ratio - NUMA_PERIOD_THRESHOLD;
>>  		if (!slot)
>>  			slot = 1;
>>  		diff = slot * period_slot;
>>  	} else if (lr_ratio >= NUMA_PERIOD_THRESHOLD) {
>>  		/*
>> -		 * Most memory accesses are shared with other tasks.
>> -		 * There is no point in continuing fast NUMA scanning,
>> -		 * since other tasks may just move the memory elsewhere.
>> +		 * Most memory accesses are local. There is no need to
>> +		 * do fast NUMA scanning, since memory is already local.
>
> Comment wise this make sense.
>
>>  		 */
>>  		int slot = lr_ratio - NUMA_PERIOD_THRESHOLD;
>>  		if (!slot)
>> @@ -1998,7 +1998,7 @@ static void update_task_scan_period(struct task_struct *p,
>>  		 * yet they are not on the local NUMA node. Speed up
>>  		 * NUMA scanning to get the memory moved over.
>>  		 */
>> -		int ratio = max(lr_ratio, ps_ratio);
>> +		int ratio = max(lr_ratio, sp_ratio);
>>  		diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
>>  	}
>>  
>> -- 
>> 2.20.1
>> 



* Re: [PATCH RESEND] autonuma: Fix scan period updating
  2019-07-26  7:45   ` Huang, Ying
@ 2019-07-26  9:20     ` Srikar Dronamraju
  2019-07-29  3:04       ` Huang, Ying
  0 siblings, 1 reply; 9+ messages in thread
From: Srikar Dronamraju @ 2019-07-26  9:20 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Peter Zijlstra, Ingo Molnar, linux-mm, linux-kernel,
	Rik van Riel, Mel Gorman, jhladky, lvenanci, Andrew Morton

* Huang, Ying <ying.huang@intel.com> [2019-07-26 15:45:39]:

> Hi, Srikar,
> 
> >
> > More Remote + Private page Accesses:
> > Most likely the Private accesses are going to be local accesses.
> >
> > In the unlikely event of the private accesses not being local, we should
> > scan faster so that the memory and task consolidates.
> >
> > More Remote + Shared page Accesses: This means the workload has not
> > consolidated and needs to scan faster. So we need to scan faster.
> 
> This sounds reasonable.  But
> 
> lr_ratio < NUMA_PERIOD_THRESHOLD
> 
> doesn't indicate More Remote.  If Local = Remote, it is also true.  If

A lower lr_ratio means more remote accesses.

> there are also more Shared, we should slow down the scanning.  So, the

Why should we slow down if there are more remote shared accesses?

> logic could be
> 
> if (lr_ratio >= NUMA_PERIOD_THRESHOLD)
>     slow down scanning
> else if (sp_ratio >= NUMA_PERIOD_THRESHOLD) {
>     if (NUMA_PERIOD_SLOTS - lr_ratio >= NUMA_PERIOD_THRESHOLD)
>         speed up scanning
>     else
>         slow down scanning
> } else
>    speed up scanning
> 
> This follows your idea better?
> 
> Best Regards,
> Huang, Ying

-- 
Thanks and Regards
Srikar Dronamraju



* Re: [PATCH RESEND] autonuma: Fix scan period updating
  2019-07-26  9:20     ` Srikar Dronamraju
@ 2019-07-29  3:04       ` Huang, Ying
  2019-07-29  7:28         ` Srikar Dronamraju
  0 siblings, 1 reply; 9+ messages in thread
From: Huang, Ying @ 2019-07-29  3:04 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, linux-mm, linux-kernel,
	Rik van Riel, Mel Gorman, jhladky, lvenanci, Andrew Morton

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:

> * Huang, Ying <ying.huang@intel.com> [2019-07-26 15:45:39]:
>
>> Hi, Srikar,
>> 
>> >
>> > More Remote + Private page Accesses:
>> > Most likely the Private accesses are going to be local accesses.
>> >
>> > In the unlikely event of the private accesses not being local, we should
>> > scan faster so that the memory and task consolidates.
>> >
>> > More Remote + Shared page Accesses: This means the workload has not
>> > consolidated and needs to scan faster. So we need to scan faster.
>> 
>> This sounds reasonable.  But
>> 
>> lr_ratio < NUMA_PERIOD_THRESHOLD
>> 
>> doesn't indicate More Remote.  If Local = Remote, it is also true.  If
>
> less lr_ratio means more remote.
>
>> there are also more Shared, we should slow down the scanning.  So, the
>
> Why should we slowing down if there are more remote shared accesses?
>
>> logic could be
>> 
>> if (lr_ratio >= NUMA_PERIOD_THRESHOLD)
>>     slow down scanning
>> else if (sp_ratio >= NUMA_PERIOD_THRESHOLD) {
>>     if (NUMA_PERIOD_SLOTS - lr_ratio >= NUMA_PERIOD_THRESHOLD)
>>         speed up scanning

Thought about this again.  For example, a multi-threaded workload runs
on a 4-socket machine, and most memory accesses are shared.  The
optimal situation will be pseudo-interleaving, that is, spreading
memory accesses evenly among the 4 NUMA nodes.  There, "shared" >>
"private" and "remote" > "local", and we should slow down scanning to
reduce the overhead.

What do you think about this?

Best Regards,
Huang, Ying

>>     else
>>         slow down scanning
>> } else
>>    speed up scanning
>> 
>> This follows your idea better?
>> 
>> Best Regards,
>> Huang, Ying



* Re: [PATCH RESEND] autonuma: Fix scan period updating
  2019-07-29  3:04       ` Huang, Ying
@ 2019-07-29  7:28         ` Srikar Dronamraju
  2019-07-29  8:16           ` Huang, Ying
  0 siblings, 1 reply; 9+ messages in thread
From: Srikar Dronamraju @ 2019-07-29  7:28 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Peter Zijlstra, Ingo Molnar, linux-mm, linux-kernel,
	Rik van Riel, Mel Gorman, jhladky, lvenanci, Andrew Morton

> >> 
> >> if (lr_ratio >= NUMA_PERIOD_THRESHOLD)
> >>     slow down scanning
> >> else if (sp_ratio >= NUMA_PERIOD_THRESHOLD) {
> >>     if (NUMA_PERIOD_SLOTS - lr_ratio >= NUMA_PERIOD_THRESHOLD)
> >>         speed up scanning
> 
> Thought about this again.  For example, a multi-threads workload runs on
> a 4-sockets machine, and most memory accesses are shared.  The optimal
> situation will be pseudo-interleaving, that is, spreading memory
> accesses evenly among 4 NUMA nodes.  Where "share" >> "private", and
> "remote" > "local".  And we should slow down scanning to reduce the
> overhead.
> 
> What do you think about this?

If all 4 nodes have equal access, then all 4 nodes will be active nodes.

From task_numa_fault()

	if (!priv && !local && ng && ng->active_nodes > 1 &&
				numa_is_active_node(cpu_node, ng) &&
				numa_is_active_node(mem_node, ng))
		local = 1;

Hence all accesses will be accounted as local, and scanning will slow
down.

-- 
Thanks and Regards
Srikar Dronamraju



* Re: [PATCH RESEND] autonuma: Fix scan period updating
  2019-07-29  7:28         ` Srikar Dronamraju
@ 2019-07-29  8:16           ` Huang, Ying
  2019-07-29  8:56             ` Mel Gorman
  0 siblings, 1 reply; 9+ messages in thread
From: Huang, Ying @ 2019-07-29  8:16 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, linux-mm, linux-kernel,
	Rik van Riel, Mel Gorman, jhladky, lvenanci, Andrew Morton

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:

>> >> 
>> >> if (lr_ratio >= NUMA_PERIOD_THRESHOLD)
>> >>     slow down scanning
>> >> else if (sp_ratio >= NUMA_PERIOD_THRESHOLD) {
>> >>     if (NUMA_PERIOD_SLOTS - lr_ratio >= NUMA_PERIOD_THRESHOLD)
>> >>         speed up scanning
>> 
>> Thought about this again.  For example, a multi-threads workload runs on
>> a 4-sockets machine, and most memory accesses are shared.  The optimal
>> situation will be pseudo-interleaving, that is, spreading memory
>> accesses evenly among 4 NUMA nodes.  Where "share" >> "private", and
>> "remote" > "local".  And we should slow down scanning to reduce the
>> overhead.
>> 
>> What do you think about this?
>
> If all 4 nodes have equal access, then all 4 nodes will be active nodes.
>
> From task_numa_fault()
>
> 	if (!priv && !local && ng && ng->active_nodes > 1 &&
> 				numa_is_active_node(cpu_node, ng) &&
> 				numa_is_active_node(mem_node, ng))
> 		local = 1;
>
> Hence all accesses will be accounted as local. Hence scanning would slow
> down.

Yes.  You are right!  Thanks a lot!

There may be another case.  For example, a workload with 9 threads
runs on a 2-socket machine, and most memory accesses are shared.
Based on CPU load balancing, 7 threads run on node 0 and 2 threads
run on node 1.  Then the 2 threads on node 1 will have "shared" >>
"private" and "remote" >> "local", but speeding up scanning doesn't
help.

Best Regards,
Huang, Ying



* Re: [PATCH RESEND] autonuma: Fix scan period updating
  2019-07-29  8:16           ` Huang, Ying
@ 2019-07-29  8:56             ` Mel Gorman
  2019-07-30  1:38               ` Huang, Ying
  0 siblings, 1 reply; 9+ messages in thread
From: Mel Gorman @ 2019-07-29  8:56 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Srikar Dronamraju, Peter Zijlstra, Ingo Molnar, linux-mm,
	linux-kernel, Rik van Riel, jhladky, lvenanci, Andrew Morton

On Mon, Jul 29, 2019 at 04:16:28PM +0800, Huang, Ying wrote:
> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> 
> >> >> 
> >> >> if (lr_ratio >= NUMA_PERIOD_THRESHOLD)
> >> >>     slow down scanning
> >> >> else if (sp_ratio >= NUMA_PERIOD_THRESHOLD) {
> >> >>     if (NUMA_PERIOD_SLOTS - lr_ratio >= NUMA_PERIOD_THRESHOLD)
> >> >>         speed up scanning
> >> 
> >> Thought about this again.  For example, a multi-threads workload runs on
> >> a 4-sockets machine, and most memory accesses are shared.  The optimal
> >> situation will be pseudo-interleaving, that is, spreading memory
> >> accesses evenly among 4 NUMA nodes.  Where "share" >> "private", and
> >> "remote" > "local".  And we should slow down scanning to reduce the
> >> overhead.
> >> 
> >> What do you think about this?
> >
> > If all 4 nodes have equal access, then all 4 nodes will be active nodes.
> >
> > From task_numa_fault()
> >
> > 	if (!priv && !local && ng && ng->active_nodes > 1 &&
> > 				numa_is_active_node(cpu_node, ng) &&
> > 				numa_is_active_node(mem_node, ng))
> > 		local = 1;
> >
> > Hence all accesses will be accounted as local. Hence scanning would slow
> > down.
> 
> Yes.  You are right!  Thanks a lot!
> 
> There may be another case.  For example, a workload with 9 threads runs
> on a 2-sockets machine, and most memory accesses are shared.  7 threads
> runs on the node 0 and 2 threads runs on the node 1 based on CPU load
> balancing.  Then the 2 threads on the node 1 will have "share" >>
> "private" and "remote" >> "local".  But it doesn't help to speed up
> scanning.
> 

Ok, so the results from the patch are mostly neutral. There are some
small differences in scan rates depending on the workload, but they are
not universal and the headline performance is sometimes worse. I couldn't
find something that would justify the change on its own. I think in the
short term -- just fix the comments.

For the shared access consideration, the scan rate is important but so too
is the decision on when pseudo interleaving should be used. Both should
probably be taken into account when making changes in this area. The
current code may not be optimal but it also has not generated bug reports,
high CPU usage, or obviously bad locality decisions in the field.  Hence,
for this patch or a similar series, it is critical that some workloads are
selected that really care about the locality of shared access and evaluate
based on that. Initially it was done with a large battery of tests run
by different people but some of those people have changed role since and
would not be in a position to rerun the tests. There also was the issue
that when those were done, NUMA balancing was new so its comparative
baseline was "do nothing at all".

-- 
Mel Gorman
SUSE Labs



* Re: [PATCH RESEND] autonuma: Fix scan period updating
  2019-07-29  8:56             ` Mel Gorman
@ 2019-07-30  1:38               ` Huang, Ying
  0 siblings, 0 replies; 9+ messages in thread
From: Huang, Ying @ 2019-07-30  1:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Peter Zijlstra, Ingo Molnar, linux-mm,
	linux-kernel, Rik van Riel, jhladky, lvenanci, Andrew Morton

Mel Gorman <mgorman@suse.de> writes:

> On Mon, Jul 29, 2019 at 04:16:28PM +0800, Huang, Ying wrote:
>> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
>> 
>> >> >> 
>> >> >> if (lr_ratio >= NUMA_PERIOD_THRESHOLD)
>> >> >>     slow down scanning
>> >> >> else if (sp_ratio >= NUMA_PERIOD_THRESHOLD) {
>> >> >>     if (NUMA_PERIOD_SLOTS - lr_ratio >= NUMA_PERIOD_THRESHOLD)
>> >> >>         speed up scanning
>> >> 
>> >> Thought about this again.  For example, a multi-threads workload runs on
>> >> a 4-sockets machine, and most memory accesses are shared.  The optimal
>> >> situation will be pseudo-interleaving, that is, spreading memory
>> >> accesses evenly among 4 NUMA nodes.  Where "share" >> "private", and
>> >> "remote" > "local".  And we should slow down scanning to reduce the
>> >> overhead.
>> >> 
>> >> What do you think about this?
>> >
>> > If all 4 nodes have equal access, then all 4 nodes will be active nodes.
>> >
>> > From task_numa_fault()
>> >
>> > 	if (!priv && !local && ng && ng->active_nodes > 1 &&
>> > 				numa_is_active_node(cpu_node, ng) &&
>> > 				numa_is_active_node(mem_node, ng))
>> > 		local = 1;
>> >
>> > Hence all accesses will be accounted as local. Hence scanning would slow
>> > down.
>> 
>> Yes.  You are right!  Thanks a lot!
>> 
>> There may be another case.  For example, a workload with 9 threads runs
>> on a 2-sockets machine, and most memory accesses are shared.  7 threads
>> runs on the node 0 and 2 threads runs on the node 1 based on CPU load
>> balancing.  Then the 2 threads on the node 1 will have "share" >>
>> "private" and "remote" >> "local".  But it doesn't help to speed up
>> scanning.
>> 
>
> Ok, so the results from the patch are mostly neutral. There are some
> small differences in scan rates depending on the workload but it's not
> universal and the headline performance is sometimes worse. I couldn't
> find something that would justify the change on its own.

Thanks a lot for your help!

> I think in the short term -- just fix the comments.

Then we would change the comments to something like,

"Slow down scanning if most memory accesses are private."

That is hard to understand.  Maybe we should just keep the code and
comments as they were until we have a better understanding.

> For the shared access consideration, the scan rate is important but so too
> is the decision on when pseudo interleaving should be used. Both should
> probably be taken into account when making changes in this area. The
> current code may not be optimal but it also has not generated bug reports,
> high CPU usage or obviously bad locality decision in the field.  Hence,
> for this patch or a similar series, it is critical that some workloads are
> selected that really care about the locality of shared access and evaluate
> based on that. Initially it was done with a large battery of tests run
> by different people but some of those people have changed role since and
> would not be in a position to rerun the tests. There also was the issue
> that when those were done, NUMA balancing was new so it's comparative
> baseline was "do nothing at all".

Yes.  I totally agree that we should change the behavior based on
testing.

Best Regards,
Huang, Ying

