* [PATCH] sched/fair: Care divide error in update_task_scan_period()
@ 2014-10-08  6:43 Yasuaki Ishimatsu
  2014-10-08  8:31 ` Peter Zijlstra
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Yasuaki Ishimatsu @ 2014-10-08  6:43 UTC (permalink / raw)
  To: mingo, peterz; +Cc: linux-kernel, riel, tkhai

While offlining a node by hot removing memory, the following divide error
occurs:

  divide error: 0000 [#1] SMP
  [...]
  Call Trace:
   [...] handle_mm_fault
   [...] ? try_to_wake_up
   [...] ? wake_up_state
   [...] __do_page_fault
   [...] ? do_futex
   [...] ? put_prev_entity
   [...] ? __switch_to
   [...] do_page_fault
   [...] page_fault
  [...]
  RIP  [<ffffffff810a7081>] task_numa_fault
   RSP <ffff88084eb2bcb0>

The issue occurs as follows:
  1. When a page fault occurs and the page is allocated from node 1,
     task_struct->numa_faults_buffer_memory[] for node 1 is
     incremented and p->numa_faults_locality[] is also incremented
     as follows:

     o numa_faults_buffer_memory[]       o numa_faults_locality[]
              NR_NUMA_HINT_FAULT_TYPES
             |      0     |     1     |
     ----------------------------------  ----------------------
      node 0 |      0     |     0     |   remote |      0     |
      node 1 |      0     |     1     |   local  |      1     |
     ----------------------------------  ----------------------

  2. node 1 is offlined by hot removing memory.

  3. When a page fault occurs, fault_types[] is calculated in
     task_numa_placement() from p->numa_faults_buffer_memory[] of all
     online nodes. But node 1 was taken offline in step 2, so
     fault_types[] is calculated from p->numa_faults_buffer_memory[]
     of node 0 only, and both fault_types[] entries end up 0.

  4. The fault_types[] values (both 0) are passed to
     update_task_scan_period().

  5. numa_faults_locality[1] is set to 1, so the following division is
     performed:

        static void update_task_scan_period(struct task_struct *p,
                                unsigned long shared, unsigned long private){
        ...
                ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
        }

  6. But both private and shared are 0, so the divide error occurs
     here.

  The divide error is a rare case because the trigger is node offline.
  With this patch, when both private and shared are 0, diff is simply
  set to 0 without calculating the division.
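
  For illustration, the failing arithmetic in step 6 can be reproduced
  with a small user-space sketch (illustration only, not kernel code;
  DIV_ROUND_UP mirrors the kernel macro and NUMA_PERIOD_SLOTS the value
  used in kernel/sched/fair.c):

        #include <stdio.h>

        #define DIV_ROUND_UP(n, d)   (((n) + (d) - 1) / (d))
        #define NUMA_PERIOD_SLOTS    10

        int main(void)
        {
                /* step 6: both fault type counters are 0 */
                volatile unsigned long shared = 0, private = 0;

                /* expands to (0 + 0 - 1) / 0: integer divide by zero (#DE) */
                unsigned long ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS,
                                                   (private + shared));

                printf("ratio = %lu\n", ratio);
                return 0;
        }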

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
---
 kernel/sched/fair.c | 30 +++++++++++++++++++-----------
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bfa3c86..fb7dc3f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1496,18 +1496,26 @@ static void update_task_scan_period(struct task_struct *p,
 			slot = 1;
 		diff = slot * period_slot;
 	} else {
-		diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
+		if (unlikely((private + shared) == 0))
+			/*
+			 * This is a rare case. The trigger is node offline.
+			 */
+			diff = 0;
+		else {
+			diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;

-		/*
-		 * Scale scan rate increases based on sharing. There is an
-		 * inverse relationship between the degree of sharing and
-		 * the adjustment made to the scanning period. Broadly
-		 * speaking the intent is that there is little point
-		 * scanning faster if shared accesses dominate as it may
-		 * simply bounce migrations uselessly
-		 */
-		ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
-		diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
+			/*
+			 * Scale scan rate increases based on sharing. There is
+			 * an inverse relationship between the degree of sharing
+			 * and the adjustment made to the scanning period.
+			 * Broadly speaking the intent is that there is little
+			 * point scanning faster if shared accesses dominate as
+			 * it may simply bounce migrations uselessly
+			 */
+			ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS,
+							(private + shared));
+			diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
+		}
 	}

 	p->numa_scan_period = clamp(p->numa_scan_period + diff,
-- 
1.8.3.1




* Re: [PATCH] sched/fair: Care divide error in update_task_scan_period()
  2014-10-08  6:43 [PATCH] sched/fair: Care divide error in update_task_scan_period() Yasuaki Ishimatsu
@ 2014-10-08  8:31 ` Peter Zijlstra
  2014-10-08 11:51 ` Wanpeng Li
  2014-10-08 16:42 ` Rik van Riel
  2 siblings, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2014-10-08  8:31 UTC (permalink / raw)
  To: Yasuaki Ishimatsu; +Cc: mingo, linux-kernel, riel, tkhai, mgorman

On Wed, Oct 08, 2014 at 03:43:11PM +0900, Yasuaki Ishimatsu wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bfa3c86..fb7dc3f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1496,18 +1496,26 @@ static void update_task_scan_period(struct task_struct *p,
>  			slot = 1;
>  		diff = slot * period_slot;
>  	} else {
> -		diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
> +		if (unlikely((private + shared) == 0))
> +			/*
> +			 * This is a rare case. The trigger is node offline.
> +			 */
> +			diff = 0;
> +		else {
> +			diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
> 
> -		/*
> -		 * Scale scan rate increases based on sharing. There is an
> -		 * inverse relationship between the degree of sharing and
> -		 * the adjustment made to the scanning period. Broadly
> -		 * speaking the intent is that there is little point
> -		 * scanning faster if shared accesses dominate as it may
> -		 * simply bounce migrations uselessly
> -		 */
> -		ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
> -		diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
> +			/*
> +			 * Scale scan rate increases based on sharing. There is
> +			 * an inverse relationship between the degree of sharing
> +			 * and the adjustment made to the scanning period.
> +			 * Broadly speaking the intent is that there is little
> +			 * point scanning faster if shared accesses dominate as
> +			 * it may simply bounce migrations uselessly
> +			 */
> +			ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS,
> +							(private + shared));
> +			diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
> +		}
>  	}
> 
>  	p->numa_scan_period = clamp(p->numa_scan_period + diff,

Yeah, so I don't like the patch nor do I really like the function as it
stands -- which I suppose is part of why I don't like the patch.

The problem I have with the function is that it's very inconsistent in
behaviour. In the early return path it sets numa_scan_period and
numa_next_scan, in the later return path it sets numa_scan_period and
numa_faults_locality.

I feel both return paths should affect the same set of variables, esp.
the non-clearing of numa_faults_locality in the early path seems weird.

The thing I suppose I don't like about the patch is its added
indentation and the fact that the simple +1 thing wasn't considered.
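
For reference, the two return paths currently look roughly like this
(a condensed outline of update_task_scan_period(), not a verbatim copy):

        static void update_task_scan_period(struct task_struct *p,
                        unsigned long shared, unsigned long private)
        {
                unsigned long remote = p->numa_faults_locality[0];
                unsigned long local = p->numa_faults_locality[1];

                if (local + shared == 0) {
                        /* early path: touches numa_scan_period, numa_next_scan */
                        p->numa_scan_period = min(p->numa_scan_period_max,
                                                  p->numa_scan_period << 1);
                        p->mm->numa_next_scan = jiffies +
                                msecs_to_jiffies(p->numa_scan_period);
                        return;
                }

                /* ... ratio/diff computation, including the failing division ... */

                /* late path: touches numa_scan_period, clears numa_faults_locality */
                p->numa_scan_period = clamp(p->numa_scan_period + diff,
                                task_scan_min(p), task_scan_max(p));
                memset(p->numa_faults_locality, 0, sizeof(p->numa_faults_locality));
        }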


* Re: [PATCH] sched/fair: Care divide error in update_task_scan_period()
  2014-10-08  6:43 [PATCH] sched/fair: Care divide error in update_task_scan_period() Yasuaki Ishimatsu
  2014-10-08  8:31 ` Peter Zijlstra
@ 2014-10-08 11:51 ` Wanpeng Li
  2014-10-09  5:34   ` Yasuaki Ishimatsu
  2014-10-08 16:42 ` Rik van Riel
  2 siblings, 1 reply; 7+ messages in thread
From: Wanpeng Li @ 2014-10-08 11:51 UTC (permalink / raw)
  To: Yasuaki Ishimatsu, mingo, peterz; +Cc: linux-kernel, riel, tkhai


On 10/8/14, 2:43 PM, Yasuaki Ishimatsu wrote:
> While offlining a node by hot removing memory, the following divide error
> occurs:
>
>   divide error: 0000 [#1] SMP
>   [...]
>   Call Trace:
>    [...] handle_mm_fault
>    [...] ? try_to_wake_up
>    [...] ? wake_up_state
>    [...] __do_page_fault
>    [...] ? do_futex
>    [...] ? put_prev_entity
>    [...] ? __switch_to
>    [...] do_page_fault
>    [...] page_fault
>   [...]
>   RIP  [<ffffffff810a7081>] task_numa_fault
>    RSP <ffff88084eb2bcb0>
>
> The issue occurs as follows:
>   1. When a page fault occurs and the page is allocated from node 1,
>      task_struct->numa_faults_buffer_memory[] for node 1 is
>      incremented and p->numa_faults_locality[] is also incremented
>      as follows:
>
>      o numa_faults_buffer_memory[]       o numa_faults_locality[]
>               NR_NUMA_HINT_FAULT_TYPES
>              |      0     |     1     |
>      ----------------------------------  ----------------------
>       node 0 |      0     |     0     |   remote |      0     |
>       node 1 |      0     |     1     |   local  |      1     |
>      ----------------------------------  ----------------------
>
>   2. node 1 is offlined by hot removing memory.
>
>   3. When a page fault occurs, fault_types[] is calculated in
>      task_numa_placement() from p->numa_faults_buffer_memory[] of all
>      online nodes. But node 1 was taken offline in step 2, so
>      fault_types[] is calculated from p->numa_faults_buffer_memory[]
>      of node 0 only, and both fault_types[] entries end up 0.
>
>   4. The fault_types[] values (both 0) are passed to
>      update_task_scan_period().
>
>   5. numa_faults_locality[1] is set to 1, so the following division is
>      performed:
>
>         static void update_task_scan_period(struct task_struct *p,
>                                 unsigned long shared, unsigned long private){
>         ...
>                 ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
>         }
>
>   6. But both private and shared are 0, so the divide error occurs
>      here.
>
>   The divide error is a rare case because the trigger is node offline.
>   With this patch, when both private and shared are 0, diff is simply
>   set to 0 without calculating the division.
>
> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> ---
>  kernel/sched/fair.c | 30 +++++++++++++++++++-----------
>  1 file changed, 19 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bfa3c86..fb7dc3f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1496,18 +1496,26 @@ static void update_task_scan_period(struct task_struct *p,
>  			slot = 1;
>  		diff = slot * period_slot;
>  	} else {
> -		diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
> +		if (unlikely((private + shared) == 0))
> +			/*
> +			 * This is a rare case. The trigger is node offline.
> +			 */
> +			diff = 0;
> +		else {
> +			diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
>
> -		/*
> -		 * Scale scan rate increases based on sharing. There is an
> -		 * inverse relationship between the degree of sharing and
> -		 * the adjustment made to the scanning period. Broadly
> -		 * speaking the intent is that there is little point
> -		 * scanning faster if shared accesses dominate as it may
> -		 * simply bounce migrations uselessly
> -		 */
> -		ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
> -		diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
> +			/*
> +			 * Scale scan rate increases based on sharing. There is
> +			 * an inverse relationship between the degree of sharing
> +			 * and the adjustment made to the scanning period.
> +			 * Broadly speaking the intent is that there is little
> +			 * point scanning faster if shared accesses dominate as
> +			 * it may simply bounce migrations uselessly
> +			 */
> +			ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS,
> +							(private + shared));
> +			diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
> +		}
>  	}

How about just

ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared + 1));


Regards,
Wanpeng Li

>  	p->numa_scan_period = clamp(p->numa_scan_period + diff,



* Re: [PATCH] sched/fair: Care divide error in update_task_scan_period()
  2014-10-08  6:43 [PATCH] sched/fair: Care divide error in update_task_scan_period() Yasuaki Ishimatsu
  2014-10-08  8:31 ` Peter Zijlstra
  2014-10-08 11:51 ` Wanpeng Li
@ 2014-10-08 16:42 ` Rik van Riel
  2014-10-08 16:54   ` Peter Zijlstra
  2 siblings, 1 reply; 7+ messages in thread
From: Rik van Riel @ 2014-10-08 16:42 UTC (permalink / raw)
  To: Yasuaki Ishimatsu, mingo, peterz; +Cc: linux-kernel, tkhai

On 10/08/2014 02:43 AM, Yasuaki Ishimatsu wrote:

>   The divide error is a rare case because the trigger is node offline.
>   With this patch, when both private and shared are 0, diff is simply
>   set to 0 without calculating the division.

How about a simple

    if ((private + shared) == 0)
          return;

higher up in the function, to avoid adding an extra
layer of indentation and confusion to the main part
of the function?
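
Roughly like this (a sketch of the placement only, ahead of the
ratio/diff computation):

        static void update_task_scan_period(struct task_struct *p,
                        unsigned long shared, unsigned long private)
        {
                ...
                /* fault_types[] came back empty (e.g. node went offline) */
                if ((private + shared) == 0)
                        return;
                ...
                ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
                ...
        }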


* Re: [PATCH] sched/fair: Care divide error in update_task_scan_period()
  2014-10-08 16:42 ` Rik van Riel
@ 2014-10-08 16:54   ` Peter Zijlstra
  2014-10-09  5:19     ` Yasuaki Ishimatsu
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Zijlstra @ 2014-10-08 16:54 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Yasuaki Ishimatsu, mingo, linux-kernel, tkhai

On Wed, Oct 08, 2014 at 12:42:24PM -0400, Rik van Riel wrote:
> On 10/08/2014 02:43 AM, Yasuaki Ishimatsu wrote:
> 
> >   The divide error is a rare case because the trigger is node offline.
> >   With this patch, when both private and shared are 0, diff is simply
> >   set to 0 without calculating the division.
> 
> How about a simple
> 
>     if ((private + shared) == 0)
>           return;
> 
> higher up in the function, to avoid adding an extra
> layer of indentation and confusion to the main part
> of the function?

At which point we'll have 3 different return semantics. Should we not
clear numa_faults_locality[], even in this case?
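
In code, that question reads something like this (a sketch only, inside
update_task_scan_period(), not something that has been merged):

        if (unlikely(private + shared == 0)) {
                /* keep the late path's behaviour of starting a fresh window */
                memset(p->numa_faults_locality, 0,
                       sizeof(p->numa_faults_locality));
                return;
        }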


* Re: [PATCH] sched/fair: Care divide error in update_task_scan_period()
  2014-10-08 16:54   ` Peter Zijlstra
@ 2014-10-09  5:19     ` Yasuaki Ishimatsu
  0 siblings, 0 replies; 7+ messages in thread
From: Yasuaki Ishimatsu @ 2014-10-09  5:19 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Peter Zijlstra, mingo, linux-kernel, tkhai

(2014/10/09 1:54), Peter Zijlstra wrote:
> On Wed, Oct 08, 2014 at 12:42:24PM -0400, Rik van Riel wrote:
>> On 10/08/2014 02:43 AM, Yasuaki Ishimatsu wrote:
>>
>>>    The divide error is a rare case because the trigger is node offline.
>>>    With this patch, when both private and shared are 0, diff is simply
>>>    set to 0 without calculating the division.
>>
>> How about a simple
>>
>>      if ((private + shared) == 0)
>>            return;
>>
>> higher up in the function, to avoid adding an extra
>> layer of indentation and confusion to the main part
>> of the function?
>
> At which point we'll have 3 different return semantics. Should we not
> clear numa_faults_localityp[], even in this case?
>

I'm not familiar with the NUMA balancing feature, so I would like to know
this as well. If it's not necessary to clear numa_faults_locality[], I'll
apply the idea.

Thanks,
Yasuaki Ishimatsu



* Re: [PATCH] sched/fair: Care divide error in update_task_scan_period()
  2014-10-08 11:51 ` Wanpeng Li
@ 2014-10-09  5:34   ` Yasuaki Ishimatsu
  0 siblings, 0 replies; 7+ messages in thread
From: Yasuaki Ishimatsu @ 2014-10-09  5:34 UTC (permalink / raw)
  To: Wanpeng Li, mingo, peterz; +Cc: linux-kernel, riel, tkhai

(2014/10/08 20:51), Wanpeng Li wrote:
> 
> On 10/8/14, 2:43 PM, Yasuaki Ishimatsu wrote:
>> While offlining a node by hot removing memory, the following divide error
>> occurs:
>>
>>    divide error: 0000 [#1] SMP
>>    [...]
>>    Call Trace:
>>     [...] handle_mm_fault
>>     [...] ? try_to_wake_up
>>     [...] ? wake_up_state
>>     [...] __do_page_fault
>>     [...] ? do_futex
>>     [...] ? put_prev_entity
>>     [...] ? __switch_to
>>     [...] do_page_fault
>>     [...] page_fault
>>    [...]
>>    RIP  [<ffffffff810a7081>] task_numa_fault
>>     RSP <ffff88084eb2bcb0>
>>
>> The issue occurs as follows:
>>    1. When a page fault occurs and the page is allocated from node 1,
>>       task_struct->numa_faults_buffer_memory[] for node 1 is
>>       incremented and p->numa_faults_locality[] is also incremented
>>       as follows:
>>
>>       o numa_faults_buffer_memory[]       o numa_faults_locality[]
>>                NR_NUMA_HINT_FAULT_TYPES
>>               |      0     |     1     |
>>       ----------------------------------  ----------------------
>>        node 0 |      0     |     0     |   remote |      0     |
>>        node 1 |      0     |     1     |   local  |      1     |
>>       ----------------------------------  ----------------------
>>
>>    2. node 1 is offlined by hot removing memory.
>>
>>    3. When a page fault occurs, fault_types[] is calculated in
>>       task_numa_placement() from p->numa_faults_buffer_memory[] of all
>>       online nodes. But node 1 was taken offline in step 2, so
>>       fault_types[] is calculated from p->numa_faults_buffer_memory[]
>>       of node 0 only, and both fault_types[] entries end up 0.
>>
>>    4. The fault_types[] values (both 0) are passed to
>>       update_task_scan_period().
>>
>>    5. numa_faults_locality[1] is set to 1, so the following division is
>>       performed:
>>
>>          static void update_task_scan_period(struct task_struct *p,
>>                                  unsigned long shared, unsigned long private){
>>          ...
>>                  ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
>>          }
>>
>>    6. But both private and shared are 0, so the divide error occurs
>>       here.
>>
>>    The divide error is a rare case because the trigger is node offline.
>>    With this patch, when both private and shared are 0, diff is simply
>>    set to 0 without calculating the division.
>>
>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>> ---
>>   kernel/sched/fair.c | 30 +++++++++++++++++++-----------
>>   1 file changed, 19 insertions(+), 11 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index bfa3c86..fb7dc3f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1496,18 +1496,26 @@ static void update_task_scan_period(struct task_struct *p,
>>   			slot = 1;
>>   		diff = slot * period_slot;
>>   	} else {
>> -		diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
>> +		if (unlikely((private + shared) == 0))
>> +			/*
>> +			 * This is a rare case. The trigger is node offline.
>> +			 */
>> +			diff = 0;
>> +		else {
>> +			diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
>>
>> -		/*
>> -		 * Scale scan rate increases based on sharing. There is an
>> -		 * inverse relationship between the degree of sharing and
>> -		 * the adjustment made to the scanning period. Broadly
>> -		 * speaking the intent is that there is little point
>> -		 * scanning faster if shared accesses dominate as it may
>> -		 * simply bounce migrations uselessly
>> -		 */
>> -		ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
>> -		diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
>> +			/*
>> +			 * Scale scan rate increases based on sharing. There is
>> +			 * an inverse relationship between the degree of sharing
>> +			 * and the adjustment made to the scanning period.
>> +			 * Broadly speaking the intent is that there is little
>> +			 * point scanning faster if shared accesses dominate as
>> +			 * it may simply bounce migrations uselessly
>> +			 */
>> +			ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS,
>> +							(private + shared));
>> +			diff = (diff * ratio) / NUMA_PERIOD_SLOTS;
>> +		}
>>   	}
> 

> How about just
> 
> ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared + 1));

Thank you for providing the sample code. Rik also provided another idea,
so I am checking which one is better.

Thanks,
Yasuaki Ishimatsu

> 
> 
> Regards,
> Wanpeng Li
> 
>>   	p->numa_scan_period = clamp(p->numa_scan_period + diff,
> 



