* [PATCH v02] powerpc/pseries: Check for ceded CPU's during LPAR migration
@ 2019-01-30 21:23 Michael Bringmann
  2019-01-31  5:38 ` Michael Ellerman
  0 siblings, 1 reply; 9+ messages in thread
From: Michael Bringmann @ 2019-01-30 21:23 UTC (permalink / raw)
  To: linuxppc-dev, Juliet Kim, Tyrel Datwyler, Thomas Falcon,
	Nathan Lynch, Gustavo Walbon, mwb

This patch checks for ceded CPUs during LPM.  Some extreme
tests encountered a problem where Linux had put some threads to
sleep (possibly to save energy), LPM was attempted, and the
Linux kernel didn't awaken the sleeping threads but issued
the H_JOIN for the active threads.  Since the sleeping threads
are not awake, they cannot issue the expected H_JOIN, and the
partition would never suspend.  This patch wakes the sleeping
threads back up.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
---
Changes in v02:
   -- Rebase to latest powerpc kernel source.
---
 arch/powerpc/include/asm/plpar_wrappers.h |    6 ++----
 arch/powerpc/kernel/rtas.c                |    8 +++++++-
 arch/powerpc/platforms/pseries/setup.c    |   18 ++++++++++++++++++
 3 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/plpar_wrappers.h b/arch/powerpc/include/asm/plpar_wrappers.h
index cff5a41..8292eff 100644
--- a/arch/powerpc/include/asm/plpar_wrappers.h
+++ b/arch/powerpc/include/asm/plpar_wrappers.h
@@ -26,10 +26,8 @@ static inline void set_cede_latency_hint(u8 latency_hint)
 	get_lppaca()->cede_latency_hint = latency_hint;
 }
 
-static inline long cede_processor(void)
-{
-	return plpar_hcall_norets(H_CEDE);
-}
+int cpu_is_ceded(int cpu);
+long cede_processor(void);
 
 static inline long extended_cede_processor(unsigned long latency_hint)
 {
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index de35bd8f..fea3d21 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -44,6 +44,7 @@
 #include <asm/time.h>
 #include <asm/mmu.h>
 #include <asm/topology.h>
+#include <asm/plpar_wrappers.h>
 
 /* This is here deliberately so it's only used in this file */
 void enter_rtas(unsigned long);
@@ -942,7 +943,7 @@ int rtas_ibm_suspend_me(u64 handle)
 	struct rtas_suspend_me_data data;
 	DECLARE_COMPLETION_ONSTACK(done);
 	cpumask_var_t offline_mask;
-	int cpuret;
+	int cpuret, cpu;
 
 	if (!rtas_service_present("ibm,suspend-me"))
 		return -ENOSYS;
@@ -991,6 +992,11 @@ int rtas_ibm_suspend_me(u64 handle)
 		goto out_hotplug_enable;
 	}
 
+	for_each_present_cpu(cpu) {
+		if (cpu_is_ceded(cpu))
+			plpar_hcall_norets(H_PROD, get_hard_smp_processor_id(cpu));
+	}
+
 	/* Call function on all CPUs.  One of us will make the
 	 * rtas call
 	 */
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index 41f62ca2..48ae6d4 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -331,6 +331,24 @@ static int alloc_dispatch_log_kmem_cache(void)
 }
 machine_early_initcall(pseries, alloc_dispatch_log_kmem_cache);
 
+static DEFINE_PER_CPU(int, cpu_ceded);
+
+int cpu_is_ceded(int cpu)
+{
+	return per_cpu(cpu_ceded, cpu);
+}
+
+long cede_processor(void)
+{
+	long rc;
+
+	per_cpu(cpu_ceded, raw_smp_processor_id()) = 1;
+	rc = plpar_hcall_norets(H_CEDE);
+	per_cpu(cpu_ceded, raw_smp_processor_id()) = 0;
+
+	return rc;
+}
+
 static void pseries_lpar_idle(void)
 {
 	/*



* Re: [PATCH v02] powerpc/pseries: Check for ceded CPU's during LPAR migration
  2019-01-30 21:23 [PATCH v02] powerpc/pseries: Check for ceded CPU's during LPAR migration Michael Bringmann
@ 2019-01-31  5:38 ` Michael Ellerman
  2019-01-31 21:53   ` Michael Bringmann
  0 siblings, 1 reply; 9+ messages in thread
From: Michael Ellerman @ 2019-01-31  5:38 UTC (permalink / raw)
  To: Michael Bringmann, linuxppc-dev, Juliet Kim, Tyrel Datwyler,
	Thomas Falcon, Nathan Lynch, Gustavo Walbon, mwb

Michael Bringmann <mwb@linux.vnet.ibm.com> writes:
> This patch checks for ceded CPUs during LPM.  Some extreme
> tests encountered a problem where Linux had put some threads to
> sleep (possibly to save energy), LPM was attempted, and the
> Linux kernel didn't awaken the sleeping threads but issued
> the H_JOIN for the active threads.  Since the sleeping threads
> are not awake, they cannot issue the expected H_JOIN, and the
> partition would never suspend.  This patch wakes the sleeping
> threads back up.

I don't think this is the right solution.

Just after your for loop we do an on_each_cpu() call, which sends an IPI
to every CPU, and that should wake all CPUs up from CEDE.

If that's not happening then there is a bug somewhere, and we need to
work out where.
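
For reference, the call I'm referring to sits a few lines below your new
loop; it looks roughly like this (paraphrased from rtas_ibm_suspend_me(),
not quoted verbatim):

	/* Call function on all CPUs.  One of us will make the
	 * rtas call
	 */
	on_each_cpu(rtas_percpu_suspend_me, &data, 0);

	wait_for_completion(&done);

on_each_cpu() runs rtas_percpu_suspend_me() on every online CPU (via IPI
for the remote ones), and that handler is what ends up issuing the
H_JOIN.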


> diff --git a/arch/powerpc/include/asm/plpar_wrappers.h b/arch/powerpc/include/asm/plpar_wrappers.h
> index cff5a41..8292eff 100644
> --- a/arch/powerpc/include/asm/plpar_wrappers.h
> +++ b/arch/powerpc/include/asm/plpar_wrappers.h
> @@ -26,10 +26,8 @@ static inline void set_cede_latency_hint(u8 latency_hint)
>  	get_lppaca()->cede_latency_hint = latency_hint;
>  }
>  
> -static inline long cede_processor(void)
> -{
> -	return plpar_hcall_norets(H_CEDE);
> -}
> +int cpu_is_ceded(int cpu);
> +long cede_processor(void);
>  
>  static inline long extended_cede_processor(unsigned long latency_hint)
>  {
> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> index de35bd8f..fea3d21 100644
> --- a/arch/powerpc/kernel/rtas.c
> +++ b/arch/powerpc/kernel/rtas.c
> @@ -44,6 +44,7 @@
>  #include <asm/time.h>
>  #include <asm/mmu.h>
>  #include <asm/topology.h>
> +#include <asm/plpar_wrappers.h>
>  
>  /* This is here deliberately so it's only used in this file */
>  void enter_rtas(unsigned long);
> @@ -942,7 +943,7 @@ int rtas_ibm_suspend_me(u64 handle)
>  	struct rtas_suspend_me_data data;
>  	DECLARE_COMPLETION_ONSTACK(done);
>  	cpumask_var_t offline_mask;
> -	int cpuret;
> +	int cpuret, cpu;
>  
>  	if (!rtas_service_present("ibm,suspend-me"))
>  		return -ENOSYS;
> @@ -991,6 +992,11 @@ int rtas_ibm_suspend_me(u64 handle)
>  		goto out_hotplug_enable;
>  	}
>  
> +	for_each_present_cpu(cpu) {
> +		if (cpu_is_ceded(cpu))
> +			plpar_hcall_norets(H_PROD, get_hard_smp_processor_id(cpu));
> +	}

There's a race condition here, there's nothing to prevent the CPUs you
just PROD'ed from going back into CEDE before you do the on_each_cpu()
call below.

>  	/* Call function on all CPUs.  One of us will make the
>  	 * rtas call
>  	 */
> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
> index 41f62ca2..48ae6d4 100644
> --- a/arch/powerpc/platforms/pseries/setup.c
> +++ b/arch/powerpc/platforms/pseries/setup.c
> @@ -331,6 +331,24 @@ static int alloc_dispatch_log_kmem_cache(void)
>  }
>  machine_early_initcall(pseries, alloc_dispatch_log_kmem_cache);
>  
> +static DEFINE_PER_CPU(int, cpu_ceded);
> +
> +int cpu_is_ceded(int cpu)
> +{
> +	return per_cpu(cpu_ceded, cpu);
> +}
> +
> +long cede_processor(void)
> +{
> +	long rc;
> +
> +	per_cpu(cpu_ceded, raw_smp_processor_id()) = 1;

And there's also a race condition here. From the other CPU's perspective
the store to cpu_ceded is not necessarily ordered vs the hcall below.
Which means the other CPU can see cpu_ceded = 0, and therefore not prod
us, but this CPU has already called H_CEDE.
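
If the flag approach were kept, the minimal fix for that particular
ordering problem would be the usual barrier pairing, something like the
sketch below (not something this patch does; and it only addresses the
store ordering, it does not close the wider window where the CPU simply
cedes again after being prodded):

	per_cpu(cpu_ceded, raw_smp_processor_id()) = 1;
	/* Make the flag store visible before we actually cede; must pair
	 * with a barrier on the side that reads cpu_ceded before deciding
	 * whether to H_PROD us.
	 */
	smp_mb();
	rc = plpar_hcall_norets(H_CEDE);
	per_cpu(cpu_ceded, raw_smp_processor_id()) = 0;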

> +	rc = plpar_hcall_norets(H_CEDE);
> +	per_cpu(cpu_ceded, raw_smp_processor_id()) = 0;
> +
> +	return rc;
> +}
> +
>  static void pseries_lpar_idle(void)
>  {
>  	/*

cheers


* Re: [PATCH v02] powerpc/pseries: Check for ceded CPU's during LPAR migration
  2019-01-31  5:38 ` Michael Ellerman
@ 2019-01-31 21:53   ` Michael Bringmann
  2019-01-31 22:21     ` Tyrel Datwyler
  2019-02-01 14:14     ` Michael Bringmann
  0 siblings, 2 replies; 9+ messages in thread
From: Michael Bringmann @ 2019-01-31 21:53 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev, Juliet Kim, Tyrel Datwyler,
	Thomas Falcon, Nathan Lynch, Gustavo Walbon, Pete Heyrman

On 1/30/19 11:38 PM, Michael Ellerman wrote:
> Michael Bringmann <mwb@linux.vnet.ibm.com> writes:
>> This patch checks for ceded CPUs during LPM.  Some extreme
>> tests encountered a problem where Linux had put some threads to
>> sleep (possibly to save energy), LPM was attempted, and the
>> Linux kernel didn't awaken the sleeping threads but issued
>> the H_JOIN for the active threads.  Since the sleeping threads
>> are not awake, they cannot issue the expected H_JOIN, and the
>> partition would never suspend.  This patch wakes the sleeping
>> threads back up.
> 
> I don't think this is the right solution.
> 
> Just after your for loop we do an on_each_cpu() call, which sends an IPI
> to every CPU, and that should wake all CPUs up from CEDE.
> 
> If that's not happening then there is a bug somewhere, and we need to
> work out where.

Let me explain the scenario of the LPM case that Pete Heyrman found, and
that Nathan F. was working upon, previously.

In the scenario, the partition has 5 dedicated processors each with 8 threads
running.

From the PHYP data we can see that on VP 0, threads 3, 4, 5, 6 and 7 issued
a H_CEDE requesting to save energy by putting the requesting thread into
sleep mode.  In this state, the thread will only be awakened by H_PROD from
another running thread or from an external user action (power off, reboot
and such).  Timers and external interrupts are disabled in this mode.

About 3 seconds later, as part of the LPM operation, the other 35 threads
have all issued a H_JOIN request.  Join is part of the LPM process where
the threads suspend themselves as part of the LPM operation so the partition
can be migrated to the target server.

So, the current state is that the OS has suspended the execution of all the
threads in the partition without successfully suspending all threads as part
of LPM.

Net, OS has an issue where they suspended every processor thread so nothing
can run.

This appears to be slightly different than the previous LPM stalls we have
seen where the migration stalls because of cpus being taken offline and not
making the H_JOIN call.

In this scenario we appear to have CPUs that have done an H_CEDE prior to
the LPM. For these CPUs we would need to do a H_PROD to wake them back up
so they can do a H_JOIN and allow the LPM to continue.

The problem is that Linux has some threads that they put to sleep (probably
to save energy or something), LPM was attempted, Linux didn't awaken the
sleeping threads but issued the H_JOIN for the active threads.  Since the
sleeping threads don't issue the H_JOIN the partition will never suspend.

I am checking again with Pete regarding your concerns.

Thanks.

> 
> 
>> diff --git a/arch/powerpc/include/asm/plpar_wrappers.h b/arch/powerpc/include/asm/plpar_wrappers.h
>> index cff5a41..8292eff 100644
>> --- a/arch/powerpc/include/asm/plpar_wrappers.h
>> +++ b/arch/powerpc/include/asm/plpar_wrappers.h
>> @@ -26,10 +26,8 @@ static inline void set_cede_latency_hint(u8 latency_hint)
>>  	get_lppaca()->cede_latency_hint = latency_hint;
>>  }
>>  
>> -static inline long cede_processor(void)
>> -{
>> -	return plpar_hcall_norets(H_CEDE);
>> -}
>> +int cpu_is_ceded(int cpu);
>> +long cede_processor(void);
>>  
>>  static inline long extended_cede_processor(unsigned long latency_hint)
>>  {
>> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
>> index de35bd8f..fea3d21 100644
>> --- a/arch/powerpc/kernel/rtas.c
>> +++ b/arch/powerpc/kernel/rtas.c
>> @@ -44,6 +44,7 @@
>>  #include <asm/time.h>
>>  #include <asm/mmu.h>
>>  #include <asm/topology.h>
>> +#include <asm/plpar_wrappers.h>
>>  
>>  /* This is here deliberately so it's only used in this file */
>>  void enter_rtas(unsigned long);
>> @@ -942,7 +943,7 @@ int rtas_ibm_suspend_me(u64 handle)
>>  	struct rtas_suspend_me_data data;
>>  	DECLARE_COMPLETION_ONSTACK(done);
>>  	cpumask_var_t offline_mask;
>> -	int cpuret;
>> +	int cpuret, cpu;
>>  
>>  	if (!rtas_service_present("ibm,suspend-me"))
>>  		return -ENOSYS;
>> @@ -991,6 +992,11 @@ int rtas_ibm_suspend_me(u64 handle)
>>  		goto out_hotplug_enable;
>>  	}
>>  
>> +	for_each_present_cpu(cpu) {
>> +		if (cpu_is_ceded(cpu))
>> +			plpar_hcall_norets(H_PROD, get_hard_smp_processor_id(cpu));
>> +	}
> 
> There's a race condition here, there's nothing to prevent the CPUs you
> just PROD'ed from going back into CEDE before you do the on_each_cpu()
>> call below.
>>  	/* Call function on all CPUs.  One of us will make the
>>  	 * rtas call
>>  	 */
>> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
>> index 41f62ca2..48ae6d4 100644
>> --- a/arch/powerpc/platforms/pseries/setup.c
>> +++ b/arch/powerpc/platforms/pseries/setup.c
>> @@ -331,6 +331,24 @@ static int alloc_dispatch_log_kmem_cache(void)
>>  }
>>  machine_early_initcall(pseries, alloc_dispatch_log_kmem_cache);
>>  
>> +static DEFINE_PER_CPU(int, cpu_ceded);
>> +
>> +int cpu_is_ceded(int cpu)
>> +{
>> +	return per_cpu(cpu_ceded, cpu);
>> +}
>> +
>> +long cede_processor(void)
>> +{
>> +	long rc;
>> +
>> +	per_cpu(cpu_ceded, raw_smp_processor_id()) = 1;
> 
> And there's also a race condition here. From the other CPU's perspective
> the store to cpu_ceded is not necessarily ordered vs the hcall below.
> Which means the other CPU can see cpu_ceded = 0, and therefore not prod
> us, but this CPU has already called H_CEDE.
> 
>> +	rc = plpar_hcall_norets(H_CEDE);
>> +	per_cpu(cpu_ceded, raw_smp_processor_id()) = 0;
>> +
>> +	return rc;
>> +}
>> +
>>  static void pseries_lpar_idle(void)
>>  {
>>  	/*
> 
> cheers
> 
> 

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:       (512) 466-0650
mwb@linux.vnet.ibm.com



* Re: [PATCH v02] powerpc/pseries: Check for ceded CPU's during LPAR migration
  2019-01-31 21:53   ` Michael Bringmann
@ 2019-01-31 22:21     ` Tyrel Datwyler
  2019-01-31 22:30       ` Michael Bringmann
  2019-01-31 22:34       ` Tyrel Datwyler
  2019-02-01 14:14     ` Michael Bringmann
  1 sibling, 2 replies; 9+ messages in thread
From: Tyrel Datwyler @ 2019-01-31 22:21 UTC (permalink / raw)
  To: Michael Bringmann, Michael Ellerman, linuxppc-dev, Juliet Kim,
	Thomas Falcon, Nathan Lynch, Gustavo Walbon, Pete Heyrman

On 01/31/2019 01:53 PM, Michael Bringmann wrote:
> On 1/30/19 11:38 PM, Michael Ellerman wrote:
>> Michael Bringmann <mwb@linux.vnet.ibm.com> writes:
>>> This patch checks for ceded CPUs during LPM.  Some extreme
>>> tests encountered a problem where Linux had put some threads to
>>> sleep (possibly to save energy), LPM was attempted, and the
>>> Linux kernel didn't awaken the sleeping threads but issued
>>> the H_JOIN for the active threads.  Since the sleeping threads
>>> are not awake, they cannot issue the expected H_JOIN, and the
>>> partition would never suspend.  This patch wakes the sleeping
>>> threads back up.
>>
>> I don't think this is the right solution.
>>
>> Just after your for loop we do an on_each_cpu() call, which sends an IPI
>> to every CPU, and that should wake all CPUs up from CEDE.
>>
>> If that's not happening then there is a bug somewhere, and we need to
>> work out where.
> 
> Let me explain the scenario of the LPM case that Pete Heyrman found, and
> that Nathan F. was working upon, previously.
> 
> In the scenario, the partition has 5 dedicated processors each with 8 threads
> running.

Do we CEDE processors when running dedicated? I thought H_CEDE was part of the
Shared Processor LPAR option.

> 
> From the PHYP data we can see that on VP 0, threads 3, 4, 5, 6 and 7 issued
> a H_CEDE requesting to save energy by putting the requesting thread into
> sleep mode.  In this state, the thread will only be awakened by H_PROD from
> another running thread or from an external user action (power off, reboot
> and such).  Timers and external interrupts are disabled in this mode.

Not according to PAPR. A CEDE'd processor should awaken if signaled by external
interrupt such as decrementer or IPI as well.

-Tyrel

> 
> About 3 seconds later, as part of the LPM operation, the other 35 threads
> have all issued a H_JOIN request.  Join is part of the LPM process where
> the threads suspend themselves as part of the LPM operation so the partition
> can be migrated to the target server.
> 
> So, the current state is that the OS has suspended the execution of all the
> threads in the partition without successfully suspending all threads as part
> of LPM.
> 
> Net, OS has an issue where they suspended every processor thread so nothing
> can run.
> 
> This appears to be slightly different than the previous LPM stalls we have
> seen where the migration stalls because of cpus being taken offline and not
> making the H_JOIN call.
> 
> In this scenario we appear to have CPUs that have done an H_CEDE prior to
> the LPM. For these CPUs we would need to do a H_PROD to wake them back up
> so they can do a H_JOIN and allow the LPM to continue.
> 
> The problem is that Linux has some threads that they put to sleep (probably
> to save energy or something), LPM was attempted, Linux didn't awaken the
> sleeping threads but issued the H_JOIN for the active threads.  Since the
> sleeping threads don't issue the H_JOIN the partition will never suspend.
> 
> I am checking again with Pete regarding your concerns.
> 
> Thanks.
> 
>>
>>
>>> diff --git a/arch/powerpc/include/asm/plpar_wrappers.h b/arch/powerpc/include/asm/plpar_wrappers.h
>>> index cff5a41..8292eff 100644
>>> --- a/arch/powerpc/include/asm/plpar_wrappers.h
>>> +++ b/arch/powerpc/include/asm/plpar_wrappers.h
>>> @@ -26,10 +26,8 @@ static inline void set_cede_latency_hint(u8 latency_hint)
>>>  	get_lppaca()->cede_latency_hint = latency_hint;
>>>  }
>>>  
>>> -static inline long cede_processor(void)
>>> -{
>>> -	return plpar_hcall_norets(H_CEDE);
>>> -}
>>> +int cpu_is_ceded(int cpu);
>>> +long cede_processor(void);
>>>  
>>>  static inline long extended_cede_processor(unsigned long latency_hint)
>>>  {
>>> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
>>> index de35bd8f..fea3d21 100644
>>> --- a/arch/powerpc/kernel/rtas.c
>>> +++ b/arch/powerpc/kernel/rtas.c
>>> @@ -44,6 +44,7 @@
>>>  #include <asm/time.h>
>>>  #include <asm/mmu.h>
>>>  #include <asm/topology.h>
>>> +#include <asm/plpar_wrappers.h>
>>>  
>>>  /* This is here deliberately so it's only used in this file */
>>>  void enter_rtas(unsigned long);
>>> @@ -942,7 +943,7 @@ int rtas_ibm_suspend_me(u64 handle)
>>>  	struct rtas_suspend_me_data data;
>>>  	DECLARE_COMPLETION_ONSTACK(done);
>>>  	cpumask_var_t offline_mask;
>>> -	int cpuret;
>>> +	int cpuret, cpu;
>>>  
>>>  	if (!rtas_service_present("ibm,suspend-me"))
>>>  		return -ENOSYS;
>>> @@ -991,6 +992,11 @@ int rtas_ibm_suspend_me(u64 handle)
>>>  		goto out_hotplug_enable;
>>>  	}
>>>  
>>> +	for_each_present_cpu(cpu) {
>>> +		if (cpu_is_ceded(cpu))
>>> +			plpar_hcall_norets(H_PROD, get_hard_smp_processor_id(cpu));
>>> +	}
>>
>> There's a race condition here, there's nothing to prevent the CPUs you
>> just PROD'ed from going back into CEDE before you do the on_each_cpu()
>>> call below.
>>>  	/* Call function on all CPUs.  One of us will make the
>>>  	 * rtas call
>>>  	 */
>>> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
>>> index 41f62ca2..48ae6d4 100644
>>> --- a/arch/powerpc/platforms/pseries/setup.c
>>> +++ b/arch/powerpc/platforms/pseries/setup.c
>>> @@ -331,6 +331,24 @@ static int alloc_dispatch_log_kmem_cache(void)
>>>  }
>>>  machine_early_initcall(pseries, alloc_dispatch_log_kmem_cache);
>>>  
>>> +static DEFINE_PER_CPU(int, cpu_ceded);
>>> +
>>> +int cpu_is_ceded(int cpu)
>>> +{
>>> +	return per_cpu(cpu_ceded, cpu);
>>> +}
>>> +
>>> +long cede_processor(void)
>>> +{
>>> +	long rc;
>>> +
>>> +	per_cpu(cpu_ceded, raw_smp_processor_id()) = 1;
>>
>> And there's also a race condition here. From the other CPU's perspective
>> the store to cpu_ceded is not necessarily ordered vs the hcall below.
>> Which means the other CPU can see cpu_ceded = 0, and therefore not prod
>> us, but this CPU has already called H_CEDE.
>>
>>> +	rc = plpar_hcall_norets(H_CEDE);
>>> +	per_cpu(cpu_ceded, raw_smp_processor_id()) = 0;
>>> +
>>> +	return rc;
>>> +}
>>> +
>>>  static void pseries_lpar_idle(void)
>>>  {
>>>  	/*
>>
>> cheers
>>
>>
> 



* Re: [PATCH v02] powerpc/pseries: Check for ceded CPU's during LPAR migration
  2019-01-31 22:21     ` Tyrel Datwyler
@ 2019-01-31 22:30       ` Michael Bringmann
  2019-01-31 22:34       ` Tyrel Datwyler
  1 sibling, 0 replies; 9+ messages in thread
From: Michael Bringmann @ 2019-01-31 22:30 UTC (permalink / raw)
  To: Tyrel Datwyler, Michael Ellerman, linuxppc-dev, Juliet Kim,
	Thomas Falcon, Nathan Lynch, Gustavo Walbon, Pete Heyrman

On 1/31/19 4:21 PM, Tyrel Datwyler wrote:
> On 01/31/2019 01:53 PM, Michael Bringmann wrote:
>> On 1/30/19 11:38 PM, Michael Ellerman wrote:
>>> Michael Bringmann <mwb@linux.vnet.ibm.com> writes:
>>>> This patch checks for ceded CPUs during LPM.  Some extreme
>>>> tests encountered a problem where Linux had put some threads to
>>>> sleep (possibly to save energy), LPM was attempted, and the
>>>> Linux kernel didn't awaken the sleeping threads but issued
>>>> the H_JOIN for the active threads.  Since the sleeping threads
>>>> are not awake, they cannot issue the expected H_JOIN, and the
>>>> partition would never suspend.  This patch wakes the sleeping
>>>> threads back up.
>>>
>>> I don't think this is the right solution.
>>>
>>> Just after your for loop we do an on_each_cpu() call, which sends an IPI
>>> to every CPU, and that should wake all CPUs up from CEDE.
>>>
>>> If that's not happening then there is a bug somewhere, and we need to
>>> work out where.
>>
>> Let me explain the scenario of the LPM case that Pete Heyrman found, and
>> that Nathan F. was working upon, previously.
>>
>> In the scenario, the partition has 5 dedicated processors each with 8 threads
>> running.
> 
> Do we CEDE processors when running dedicated? I thought H_CEDE was part of the
> Shared Processor LPAR option.
> 
>>
>> From the PHYP data we can see that on VP 0, threads 3, 4, 5, 6 and 7 issued
>> a H_CEDE requesting to save energy by putting the requesting thread into
>> sleep mode.  In this state, the thread will only be awakened by H_PROD from
>> another running thread or from an external user action (power off, reboot
>> and such).  Timers and external interrupts are disabled in this mode.
> 
> Not according to PAPR. A CEDE'd processor should awaken if signaled by external
> interrupt such as decrementer or IPI as well.

Checking these points with Pete H.
Thanks.

> 
> -Tyrel
> 
>>
>> About 3 seconds later, as part of the LPM operation, the other 35 threads
>> have all issued a H_JOIN request.  Join is part of the LPM process where
>> the threads suspend themselves as part of the LPM operation so the partition
>> can be migrated to the target server.
>>
>> So, the current state is that the OS has suspended the execution of all the
>> threads in the partition without successfully suspending all threads as part
>> of LPM.
>>
>> Net, OS has an issue where they suspended every processor thread so nothing
>> can run.
>>
>> This appears to be slightly different than the previous LPM stalls we have
>> seen where the migration stalls because of cpus being taken offline and not
>> making the H_JOIN call.
>>
>> In this scenario we appear to have CPUs that have done an H_CEDE prior to
>> the LPM. For these CPUs we would need to do a H_PROD to wake them back up
>> so they can do a H_JOIN and allow the LPM to continue.
>>
>> The problem is that Linux has some threads that they put to sleep (probably
>> to save energy or something), LPM was attempted, Linux didn't awaken the
>> sleeping threads but issued the H_JOIN for the active threads.  Since the
>> sleeping threads don't issue the H_JOIN the partition will never suspend.
>>
>> I am checking again with Pete regarding your concerns.
>>
>> Thanks.
>>
>>>
>>>
>>>> diff --git a/arch/powerpc/include/asm/plpar_wrappers.h b/arch/powerpc/include/asm/plpar_wrappers.h
>>>> index cff5a41..8292eff 100644
>>>> --- a/arch/powerpc/include/asm/plpar_wrappers.h
>>>> +++ b/arch/powerpc/include/asm/plpar_wrappers.h
>>>> @@ -26,10 +26,8 @@ static inline void set_cede_latency_hint(u8 latency_hint)
>>>>  	get_lppaca()->cede_latency_hint = latency_hint;
>>>>  }
>>>>  
>>>> -static inline long cede_processor(void)
>>>> -{
>>>> -	return plpar_hcall_norets(H_CEDE);
>>>> -}
>>>> +int cpu_is_ceded(int cpu);
>>>> +long cede_processor(void);
>>>>  
>>>>  static inline long extended_cede_processor(unsigned long latency_hint)
>>>>  {
>>>> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
>>>> index de35bd8f..fea3d21 100644
>>>> --- a/arch/powerpc/kernel/rtas.c
>>>> +++ b/arch/powerpc/kernel/rtas.c
>>>> @@ -44,6 +44,7 @@
>>>>  #include <asm/time.h>
>>>>  #include <asm/mmu.h>
>>>>  #include <asm/topology.h>
>>>> +#include <asm/plpar_wrappers.h>
>>>>  
>>>>  /* This is here deliberately so it's only used in this file */
>>>>  void enter_rtas(unsigned long);
>>>> @@ -942,7 +943,7 @@ int rtas_ibm_suspend_me(u64 handle)
>>>>  	struct rtas_suspend_me_data data;
>>>>  	DECLARE_COMPLETION_ONSTACK(done);
>>>>  	cpumask_var_t offline_mask;
>>>> -	int cpuret;
>>>> +	int cpuret, cpu;
>>>>  
>>>>  	if (!rtas_service_present("ibm,suspend-me"))
>>>>  		return -ENOSYS;
>>>> @@ -991,6 +992,11 @@ int rtas_ibm_suspend_me(u64 handle)
>>>>  		goto out_hotplug_enable;
>>>>  	}
>>>>  
>>>> +	for_each_present_cpu(cpu) {
>>>> +		if (cpu_is_ceded(cpu))
>>>> +			plpar_hcall_norets(H_PROD, get_hard_smp_processor_id(cpu));
>>>> +	}
>>>
>>> There's a race condition here, there's nothing to prevent the CPUs you
>>> just PROD'ed from going back into CEDE before you do the on_each_cpu()
>>> call below.
>>>>  	/* Call function on all CPUs.  One of us will make the
>>>>  	 * rtas call
>>>>  	 */
>>>> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
>>>> index 41f62ca2..48ae6d4 100644
>>>> --- a/arch/powerpc/platforms/pseries/setup.c
>>>> +++ b/arch/powerpc/platforms/pseries/setup.c
>>>> @@ -331,6 +331,24 @@ static int alloc_dispatch_log_kmem_cache(void)
>>>>  }
>>>>  machine_early_initcall(pseries, alloc_dispatch_log_kmem_cache);
>>>>  
>>>> +static DEFINE_PER_CPU(int, cpu_ceded);
>>>> +
>>>> +int cpu_is_ceded(int cpu)
>>>> +{
>>>> +	return per_cpu(cpu_ceded, cpu);
>>>> +}
>>>> +
>>>> +long cede_processor(void)
>>>> +{
>>>> +	long rc;
>>>> +
>>>> +	per_cpu(cpu_ceded, raw_smp_processor_id()) = 1;
>>>
>>> And there's also a race condition here. From the other CPU's perspective
>>> the store to cpu_ceded is not necessarily ordered vs the hcall below.
>>> Which means the other CPU can see cpu_ceded = 0, and therefore not prod
>>> us, but this CPU has already called H_CEDE.
>>>
>>>> +	rc = plpar_hcall_norets(H_CEDE);
>>>> +	per_cpu(cpu_ceded, raw_smp_processor_id()) = 0;
>>>> +
>>>> +	return rc;
>>>> +}
>>>> +
>>>>  static void pseries_lpar_idle(void)
>>>>  {
>>>>  	/*
>>>
>>> cheers
>>>
>>>
>>
> 
> 

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:       (512) 466-0650
mwb@linux.vnet.ibm.com



* Re: [PATCH v02] powerpc/pseries: Check for ceded CPU's during LPAR migration
  2019-01-31 22:21     ` Tyrel Datwyler
  2019-01-31 22:30       ` Michael Bringmann
@ 2019-01-31 22:34       ` Tyrel Datwyler
  2019-02-05 10:05         ` Michael Ellerman
  1 sibling, 1 reply; 9+ messages in thread
From: Tyrel Datwyler @ 2019-01-31 22:34 UTC (permalink / raw)
  To: Michael Bringmann, Michael Ellerman, linuxppc-dev, Juliet Kim,
	Thomas Falcon, Nathan Lynch, Gustavo Walbon, Pete Heyrman

On 01/31/2019 02:21 PM, Tyrel Datwyler wrote:
> On 01/31/2019 01:53 PM, Michael Bringmann wrote:
>> On 1/30/19 11:38 PM, Michael Ellerman wrote:
>>> Michael Bringmann <mwb@linux.vnet.ibm.com> writes:
>>>> This patch checks for ceded CPUs during LPM.  Some extreme
>>>> tests encountered a problem where Linux had put some threads to
>>>> sleep (possibly to save energy), LPM was attempted, and the
>>>> Linux kernel didn't awaken the sleeping threads but issued
>>>> the H_JOIN for the active threads.  Since the sleeping threads
>>>> are not awake, they cannot issue the expected H_JOIN, and the
>>>> partition would never suspend.  This patch wakes the sleeping
>>>> threads back up.
>>>
>>> I don't think this is the right solution.
>>>
>>> Just after your for loop we do an on_each_cpu() call, which sends an IPI
>>> to every CPU, and that should wake all CPUs up from CEDE.
>>>
>>> If that's not happening then there is a bug somewhere, and we need to
>>> work out where.
>>
>> Let me explain the scenario of the LPM case that Pete Heyrman found, and
>> that Nathan F. was working upon, previously.
>>
>> In the scenario, the partition has 5 dedicated processors each with 8 threads
>> running.
> 
> Do we CEDE processors when running dedicated? I thought H_CEDE was part of the
> Shared Processor LPAR option.

Looks like the cpuidle-pseries driver uses CEDE with dedicated processors as
long as firmware supports SPLPAR option.
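
For reference, the dedicated-processor state in
drivers/cpuidle/cpuidle-pseries.c boils down to roughly this (simplified
and from memory, so treat the surrounding details as approximate):

	get_lppaca()->donate_dedicated_cpu = 1;
	HMT_medium();
	check_and_cede_processor();	/* wraps cede_processor() -> H_CEDE */
	local_irq_disable();
	get_lppaca()->donate_dedicated_cpu = 0;

and the driver only selects its CEDE-based state table when
firmware_has_feature(FW_FEATURE_SPLPAR) is true.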

> 
>>
>> From the PHYP data we can see that on VP 0, threads 3, 4, 5, 6 and 7 issued
>> a H_CEDE requesting to save energy by putting the requesting thread into
>> sleep mode.  In this state, the thread will only be awakened by H_PROD from
>> another running thread or from an external user action (power off, reboot
>> and such).  Timers and external interrupts are disabled in this mode.
> 
> Not according to PAPR. A CEDE'd processor should awaken if signaled by external
> interrupt such as decrementer or IPI as well.

This statement should still apply though. From PAPR:

14.11.3.3 H_CEDE
The architectural intent of this hcall() is to have the virtual processor, which
has no useful work to do, enter a wait state ceding its processor capacity to
other virtual processors until some useful work appears, signaled either through
an interrupt or a prod hcall(). To help the caller reduce race conditions, this
call may be made with interrupts disabled but the semantics of the hcall()
enable the virtual processor’s interrupts so that it may always receive wake up
interrupt signals.

-Tyrel

> 
> -Tyrel
> 
>>
>> About 3 seconds later, as part of the LPM operation, the other 35 threads
>> have all issued a H_JOIN request.  Join is part of the LPM process where
>> the threads suspend themselves as part of the LPM operation so the partition
>> can be migrated to the target server.
>>
>> So, the current state is that the OS has suspended the execution of all the
>> threads in the partition without successfully suspending all threads as part
>> of LPM.
>>
>> Net, OS has an issue where they suspended every processor thread so nothing
>> can run.
>>
>> This appears to be slightly different than the previous LPM stalls we have
>> seen where the migration stalls because of cpus being taken offline and not
>> making the H_JOIN call.
>>
>> In this scenario we appear to have CPUs that have done an H_CEDE prior to
>> the LPM. For these CPUs we would need to do a H_PROD to wake them back up
>> so they can do a H_JOIN and allow the LPM to continue.
>>
>> The problem is that Linux has some threads that they put to sleep (probably
>> to save energy or something), LPM was attempted, Linux didn't awaken the
>> sleeping threads but issued the H_JOIN for the active threads.  Since the
>> sleeping threads don't issue the H_JOIN the partition will never suspend.
>>
>> I am checking again with Pete regarding your concerns.
>>
>> Thanks.
>>
>>>
>>>
>>>> diff --git a/arch/powerpc/include/asm/plpar_wrappers.h b/arch/powerpc/include/asm/plpar_wrappers.h
>>>> index cff5a41..8292eff 100644
>>>> --- a/arch/powerpc/include/asm/plpar_wrappers.h
>>>> +++ b/arch/powerpc/include/asm/plpar_wrappers.h
>>>> @@ -26,10 +26,8 @@ static inline void set_cede_latency_hint(u8 latency_hint)
>>>>  	get_lppaca()->cede_latency_hint = latency_hint;
>>>>  }
>>>>  
>>>> -static inline long cede_processor(void)
>>>> -{
>>>> -	return plpar_hcall_norets(H_CEDE);
>>>> -}
>>>> +int cpu_is_ceded(int cpu);
>>>> +long cede_processor(void);
>>>>  
>>>>  static inline long extended_cede_processor(unsigned long latency_hint)
>>>>  {
>>>> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
>>>> index de35bd8f..fea3d21 100644
>>>> --- a/arch/powerpc/kernel/rtas.c
>>>> +++ b/arch/powerpc/kernel/rtas.c
>>>> @@ -44,6 +44,7 @@
>>>>  #include <asm/time.h>
>>>>  #include <asm/mmu.h>
>>>>  #include <asm/topology.h>
>>>> +#include <asm/plpar_wrappers.h>
>>>>  
>>>>  /* This is here deliberately so it's only used in this file */
>>>>  void enter_rtas(unsigned long);
>>>> @@ -942,7 +943,7 @@ int rtas_ibm_suspend_me(u64 handle)
>>>>  	struct rtas_suspend_me_data data;
>>>>  	DECLARE_COMPLETION_ONSTACK(done);
>>>>  	cpumask_var_t offline_mask;
>>>> -	int cpuret;
>>>> +	int cpuret, cpu;
>>>>  
>>>>  	if (!rtas_service_present("ibm,suspend-me"))
>>>>  		return -ENOSYS;
>>>> @@ -991,6 +992,11 @@ int rtas_ibm_suspend_me(u64 handle)
>>>>  		goto out_hotplug_enable;
>>>>  	}
>>>>  
>>>> +	for_each_present_cpu(cpu) {
>>>> +		if (cpu_is_ceded(cpu))
>>>> +			plpar_hcall_norets(H_PROD, get_hard_smp_processor_id(cpu));
>>>> +	}
>>>
>>> There's a race condition here, there's nothing to prevent the CPUs you
>>> just PROD'ed from going back into CEDE before you do the on_each_cpu()
>>> call below.
>>>>  	/* Call function on all CPUs.  One of us will make the
>>>>  	 * rtas call
>>>>  	 */
>>>> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
>>>> index 41f62ca2..48ae6d4 100644
>>>> --- a/arch/powerpc/platforms/pseries/setup.c
>>>> +++ b/arch/powerpc/platforms/pseries/setup.c
>>>> @@ -331,6 +331,24 @@ static int alloc_dispatch_log_kmem_cache(void)
>>>>  }
>>>>  machine_early_initcall(pseries, alloc_dispatch_log_kmem_cache);
>>>>  
>>>> +static DEFINE_PER_CPU(int, cpu_ceded);
>>>> +
>>>> +int cpu_is_ceded(int cpu)
>>>> +{
>>>> +	return per_cpu(cpu_ceded, cpu);
>>>> +}
>>>> +
>>>> +long cede_processor(void)
>>>> +{
>>>> +	long rc;
>>>> +
>>>> +	per_cpu(cpu_ceded, raw_smp_processor_id()) = 1;
>>>
>>> And there's also a race condition here. From the other CPU's perspective
>>> the store to cpu_ceded is not necessarily ordered vs the hcall below.
>>> Which means the other CPU can see cpu_ceded = 0, and therefore not prod
>>> us, but this CPU has already called H_CEDE.
>>>
>>>> +	rc = plpar_hcall_norets(H_CEDE);
>>>> +	per_cpu(cpu_ceded, raw_smp_processor_id()) = 0;
>>>> +
>>>> +	return rc;
>>>> +}
>>>> +
>>>>  static void pseries_lpar_idle(void)
>>>>  {
>>>>  	/*
>>>
>>> cheers
>>>
>>>
>>
> 



* Re: [PATCH v02] powerpc/pseries: Check for ceded CPU's during LPAR migration
  2019-01-31 21:53   ` Michael Bringmann
  2019-01-31 22:21     ` Tyrel Datwyler
@ 2019-02-01 14:14     ` Michael Bringmann
  2019-02-05 10:24       ` Michael Ellerman
  1 sibling, 1 reply; 9+ messages in thread
From: Michael Bringmann @ 2019-02-01 14:14 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev, Juliet Kim, Tyrel Datwyler,
	Thomas Falcon, Nathan Lynch, Gustavo Walbon, Pete Heyrman

See below.

On 1/31/19 3:53 PM, Michael Bringmann wrote:
> On 1/30/19 11:38 PM, Michael Ellerman wrote:
>> Michael Bringmann <mwb@linux.vnet.ibm.com> writes:
>>> This patch checks for ceded CPUs during LPM.  Some extreme
>>> tests encountered a problem where Linux had put some threads to
>>> sleep (possibly to save energy), LPM was attempted, and the
>>> Linux kernel didn't awaken the sleeping threads but issued
>>> the H_JOIN for the active threads.  Since the sleeping threads
>>> are not awake, they cannot issue the expected H_JOIN, and the
>>> partition would never suspend.  This patch wakes the sleeping
>>> threads back up.
>>
>> I don't think this is the right solution.
>>
>> Just after your for loop we do an on_each_cpu() call, which sends an IPI
>> to every CPU, and that should wake all CPUs up from CEDE.
>>
>> If that's not happening then there is a bug somewhere, and we need to
>> work out where.

From Pete Heyrman:
    Both sending an IPI and H_PROD will awaken a logical processor that has ceded.
    When you have one logical proc doing cede and one logical proc doing prod or IPI,
    you have a race condition where the prod/IPI can precede the cede request.
    If you use prod, the hypervisor takes care of the synchronization by ignoring
    a cede request if it was preceded by a prod.  With IPI the interrupt is
    delivered, which could then be followed by a cede, so the OS would need to provide
    the synchronization.

Shouldn't this answer your concerns about race conditions and the suitability
of using H_PROD?

Michael

> 
> Let me explain the scenario of the LPM case that Pete Heyrman found, and
> that Nathan F. was working upon, previously.
> 
> In the scenario, the partition has 5 dedicated processors each with 8 threads
> running.
> 
> From the PHYP data we can see that on VP 0, threads 3, 4, 5, 6 and 7 issued
> a H_CEDE requesting to save energy by putting the requesting thread into
> sleep mode.  In this state, the thread will only be awakened by H_PROD from
> another running thread or from an external user action (power off, reboot
> and such).  Timers and external interrupts are disabled in this mode.
> 
> About 3 seconds later, as part of the LPM operation, the other 35 threads
> have all issued a H_JOIN request.  Join is part of the LPM process where
> the threads suspend themselves as part of the LPM operation so the partition
> can be migrated to the target server.
> 
> So, the current state is that the OS has suspended the execution of all the
> threads in the partition without successfully suspending all threads as part
> of LPM.
> 
> Net, OS has an issue where they suspended every processor thread so nothing
> can run.
> 
> This appears to be slightly different than the previous LPM stalls we have
> seen where the migration stalls because of cpus being taken offline and not
> making the H_JOIN call.
> 
> In this scenario we appear to have CPUs that have done an H_CEDE prior to
> the LPM. For these CPUs we would need to do a H_PROD to wake them back up
> so they can do a H_JOIN and allow the LPM to continue.
> 
> The problem is that Linux has some threads that they put to sleep (probably
> to save energy or something), LPM was attempted, Linux didn't awaken the
> sleeping threads but issued the H_JOIN for the active threads.  Since the
> sleeping threads don't issue the H_JOIN the partition will never suspend.
> 
> I am checking again with Pete regarding your concerns.
> 
> Thanks.
> 
>>
>>
>>> diff --git a/arch/powerpc/include/asm/plpar_wrappers.h b/arch/powerpc/include/asm/plpar_wrappers.h
>>> index cff5a41..8292eff 100644
>>> --- a/arch/powerpc/include/asm/plpar_wrappers.h
>>> +++ b/arch/powerpc/include/asm/plpar_wrappers.h
>>> @@ -26,10 +26,8 @@ static inline void set_cede_latency_hint(u8 latency_hint)
>>>  	get_lppaca()->cede_latency_hint = latency_hint;
>>>  }
>>>  
>>> -static inline long cede_processor(void)
>>> -{
>>> -	return plpar_hcall_norets(H_CEDE);
>>> -}
>>> +int cpu_is_ceded(int cpu);
>>> +long cede_processor(void);
>>>  
>>>  static inline long extended_cede_processor(unsigned long latency_hint)
>>>  {
>>> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
>>> index de35bd8f..fea3d21 100644
>>> --- a/arch/powerpc/kernel/rtas.c
>>> +++ b/arch/powerpc/kernel/rtas.c
>>> @@ -44,6 +44,7 @@
>>>  #include <asm/time.h>
>>>  #include <asm/mmu.h>
>>>  #include <asm/topology.h>
>>> +#include <asm/plpar_wrappers.h>
>>>  
>>>  /* This is here deliberately so it's only used in this file */
>>>  void enter_rtas(unsigned long);
>>> @@ -942,7 +943,7 @@ int rtas_ibm_suspend_me(u64 handle)
>>>  	struct rtas_suspend_me_data data;
>>>  	DECLARE_COMPLETION_ONSTACK(done);
>>>  	cpumask_var_t offline_mask;
>>> -	int cpuret;
>>> +	int cpuret, cpu;
>>>  
>>>  	if (!rtas_service_present("ibm,suspend-me"))
>>>  		return -ENOSYS;
>>> @@ -991,6 +992,11 @@ int rtas_ibm_suspend_me(u64 handle)
>>>  		goto out_hotplug_enable;
>>>  	}
>>>  
>>> +	for_each_present_cpu(cpu) {
>>> +		if (cpu_is_ceded(cpu))
>>> +			plpar_hcall_norets(H_PROD, get_hard_smp_processor_id(cpu));
>>> +	}
>>
>> There's a race condition here, there's nothing to prevent the CPUs you
>> just PROD'ed from going back into CEDE before you do the on_each_cpu()
>> call below.
>>>  	/* Call function on all CPUs.  One of us will make the
>>>  	 * rtas call
>>>  	 */
>>> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
>>> index 41f62ca2..48ae6d4 100644
>>> --- a/arch/powerpc/platforms/pseries/setup.c
>>> +++ b/arch/powerpc/platforms/pseries/setup.c
>>> @@ -331,6 +331,24 @@ static int alloc_dispatch_log_kmem_cache(void)
>>>  }
>>>  machine_early_initcall(pseries, alloc_dispatch_log_kmem_cache);
>>>  
>>> +static DEFINE_PER_CPU(int, cpu_ceded);
>>> +
>>> +int cpu_is_ceded(int cpu)
>>> +{
>>> +	return per_cpu(cpu_ceded, cpu);
>>> +}
>>> +
>>> +long cede_processor(void)
>>> +{
>>> +	long rc;
>>> +
>>> +	per_cpu(cpu_ceded, raw_smp_processor_id()) = 1;
>>
>> And there's also a race condition here. From the other CPU's perspective
>> the store to cpu_ceded is not necessarily ordered vs the hcall below.
>> Which means the other CPU can see cpu_ceded = 0, and therefore not prod
>> us, but this CPU has already called H_CEDE.
>>
>>> +	rc = plpar_hcall_norets(H_CEDE);
>>> +	per_cpu(cpu_ceded, raw_smp_processor_id()) = 0;
>>> +
>>> +	return rc;
>>> +}
>>> +
>>>  static void pseries_lpar_idle(void)
>>>  {
>>>  	/*
>>
>> cheers
>>
>>
> 

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:       (512) 466-0650
mwb@linux.vnet.ibm.com



* Re: [PATCH v02] powerpc/pseries: Check for ceded CPU's during LPAR migration
  2019-01-31 22:34       ` Tyrel Datwyler
@ 2019-02-05 10:05         ` Michael Ellerman
  0 siblings, 0 replies; 9+ messages in thread
From: Michael Ellerman @ 2019-02-05 10:05 UTC (permalink / raw)
  To: Tyrel Datwyler, Michael Bringmann, linuxppc-dev, Juliet Kim,
	Thomas Falcon, Nathan Lynch, Gustavo Walbon, Pete Heyrman

Tyrel Datwyler <tyreld@linux.vnet.ibm.com> writes:
> On 01/31/2019 02:21 PM, Tyrel Datwyler wrote:
>> On 01/31/2019 01:53 PM, Michael Bringmann wrote:
>>> On 1/30/19 11:38 PM, Michael Ellerman wrote:
>>>> Michael Bringmann <mwb@linux.vnet.ibm.com> writes:
>>>>> This patch checks for ceded CPUs during LPM.  Some extreme
>>>>> tests encountered a problem where Linux had put some threads to
>>>>> sleep (possibly to save energy), LPM was attempted, and the
>>>>> Linux kernel didn't awaken the sleeping threads but issued
>>>>> the H_JOIN for the active threads.  Since the sleeping threads
>>>>> are not awake, they cannot issue the expected H_JOIN, and the
>>>>> partition would never suspend.  This patch wakes the sleeping
>>>>> threads back up.
>>>>
>>>> I don't think this is the right solution.
>>>>
>>>> Just after your for loop we do an on_each_cpu() call, which sends an IPI
>>>> to every CPU, and that should wake all CPUs up from CEDE.
>>>>
>>>> If that's not happening then there is a bug somewhere, and we need to
>>>> work out where.
>>>
>>> Let me explain the scenario of the LPM case that Pete Heyrman found, and
>>> that Nathan F. was working upon, previously.
>>>
>>> In the scenario, the partition has 5 dedicated processors each with 8 threads
>>> running.
>> 
>> Do we CEDE processors when running dedicated? I thought H_CEDE was part of the
>> Shared Processor LPAR option.
>
> Looks like the cpuidle-pseries driver uses CEDE with dedicated processors as
> long as firmware supports SPLPAR option.
>
>> 
>>>
>>> From the PHYP data we can see that on VP 0, threads 3, 4, 5, 6 and 7 issued
>>> a H_CEDE requesting to save energy by putting the requesting thread into
>>> sleep mode.  In this state, the thread will only be awakened by H_PROD from
>>> another running thread or from an external user action (power off, reboot
>>> and such).  Timers and external interrupts are disabled in this mode.
>> 
>> Not according to PAPR. A CEDE'd processor should awaken if signaled by external
>> interrupt such as decrementer or IPI as well.
>
> This statement should still apply though. From PAPR:
>
> 14.11.3.3 H_CEDE
> The architectural intent of this hcall() is to have the virtual processor, which
> has no useful work to do, enter a wait state ceding its processor capacity to
> other virtual processors until some useful work appears, signaled either through
> an interrupt or a prod hcall(). To help the caller reduce race conditions, this
> call may be made with interrupts disabled but the semantics of the hcall()
> enable the virtual processor’s interrupts so that it may always receive wake up
> interrupt signals.

Thanks for digging that out of PAPR.

H_CEDE must respond to IPIs; we have no logic to H_PROD CPUs that are
idle in order to wake them up.

There must be something else going on here.

cheers


* Re: [PATCH v02] powerpc/pseries: Check for ceded CPU's during LPAR migration
  2019-02-01 14:14     ` Michael Bringmann
@ 2019-02-05 10:24       ` Michael Ellerman
  0 siblings, 0 replies; 9+ messages in thread
From: Michael Ellerman @ 2019-02-05 10:24 UTC (permalink / raw)
  To: Michael Bringmann, linuxppc-dev, Juliet Kim, Tyrel Datwyler,
	Thomas Falcon, Nathan Lynch, Gustavo Walbon, Pete Heyrman

Michael Bringmann <mwb@linux.vnet.ibm.com> writes:
> See below.
>
> On 1/31/19 3:53 PM, Michael Bringmann wrote:
>> On 1/30/19 11:38 PM, Michael Ellerman wrote:
>>> Michael Bringmann <mwb@linux.vnet.ibm.com> writes:
>>>> This patch checks for ceded CPUs during LPM.  Some extreme
>>>> tests encountered a problem where Linux had put some threads to
>>>> sleep (possibly to save energy), LPM was attempted, and the
>>>> Linux kernel didn't awaken the sleeping threads but issued
>>>> the H_JOIN for the active threads.  Since the sleeping threads
>>>> are not awake, they cannot issue the expected H_JOIN, and the
>>>> partition would never suspend.  This patch wakes the sleeping
>>>> threads back up.
>>>
>>> I don't think this is the right solution.
>>>
>>> Just after your for loop we do an on_each_cpu() call, which sends an IPI
>>> to every CPU, and that should wake all CPUs up from CEDE.
>>>
>>> If that's not happening then there is a bug somewhere, and we need to
>>> work out where.
>
> From Pete Heyrman:
>     Both sending an IPI and H_PROD will awaken a logical processor that has ceded.
>     When you have one logical proc doing cede and one logical proc doing prod or IPI,
>     you have a race condition where the prod/IPI can precede the cede request.
>     If you use prod, the hypervisor takes care of the synchronization by ignoring
>     a cede request if it was preceded by a prod.  With IPI the interrupt is
>     delivered, which could then be followed by a cede, so the OS would need to provide
>     the synchronization.
>
> Shouldn't this answer your concerns about race conditions and the suitability
> of using H_PROD?

No sorry it doesn't.

Assuming the other CPU is idle, it will just continually do CEDE in a
loop; sending it a PROD will just wake it up once, and then it will CEDE
again. That first CEDE might return immediately on seeing the PROD, but
then the kernel will just CEDE again because it has nothing to do.

In contrast the IPI we send wakes up the other CPU and tells it to run a
function, rtas_percpu_suspend_me(), which does the H_JOIN directly.
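
For context, that path boils down to roughly the following (simplified
from rtas.c, not verbatim):

	static void rtas_percpu_suspend_me(void *info)
	{
		__rtas_suspend_cpu((struct rtas_suspend_me_data *)info, 0);
	}

	/* ... and __rtas_suspend_cpu(), with MSR[EE] cleared, loops on the join: */
	while (rc == H_SUCCESS && !atomic_read(&data->done) &&
	       !atomic_read(&data->error))
		rc = plpar_hcall_norets(H_JOIN);

So every CPU that takes the IPI parks itself in H_JOIN until the suspend
completes, or until it is the one that ends up making the actual
ibm,suspend-me RTAS call.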

I still don't understand how the original bug ever even happened. That's
what I want to know.

The way we do the joining and suspend seems like it could be simpler;
there's a bunch of atomic flags, and __rtas_suspend_last_cpu() seems to
duplicate much of __rtas_suspend_cpu(). It seems more likely we have a
bug in there somewhere.

cheers

