All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] powerpc/pseries: Disable CPU hotplug across migrations
@ 2018-09-17 19:14 Nathan Fontenot
  2018-09-17 20:41 ` Tyrel Datwyler
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Nathan Fontenot @ 2018-09-17 19:14 UTC (permalink / raw)
  To: linuxppc-dev

When performing partition migrations all present CPUs must be online
as all present CPUs must make the H_JOIN call as part of the migration
process. Once all present CPUs make the H_JOIN call, one CPU is returned
to make the rtas call to perform the migration to the destination system.

During testing of migration and changing the SMT state we have found
instances where CPUs are offlined, as part of the SMT state change,
before they make the H_JOIN call. This results in a hung system where
every CPU is either in H_JOIN or offline.

To prevent this this patch disables CPU hotplug during the migration
process.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/rtas.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index 8afd146bc9c7..2c7ed31c736e 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -981,6 +981,7 @@ int rtas_ibm_suspend_me(u64 handle)
 		goto out;
 	}
 
+	cpu_hotplug_disable();
 	stop_topology_update();
 
 	/* Call function on all CPUs.  One of us will make the
@@ -995,6 +996,7 @@ int rtas_ibm_suspend_me(u64 handle)
 		printk(KERN_ERR "Error doing global join\n");
 
 	start_topology_update();
+	cpu_hotplug_enable();
 
 	/* Take down CPUs not online prior to suspend */
 	cpuret = rtas_offline_cpus_mask(offline_mask);

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] powerpc/pseries: Disable CPU hotplug across migrations
  2018-09-17 19:14 [PATCH] powerpc/pseries: Disable CPU hotplug across migrations Nathan Fontenot
@ 2018-09-17 20:41 ` Tyrel Datwyler
  2018-09-18 10:32 ` Gautham R Shenoy
  2018-09-20  4:21 ` Michael Ellerman
  2 siblings, 0 replies; 12+ messages in thread
From: Tyrel Datwyler @ 2018-09-17 20:41 UTC (permalink / raw)
  To: Nathan Fontenot, linuxppc-dev

On 09/17/2018 12:14 PM, Nathan Fontenot wrote:
> When performing partition migrations all present CPUs must be online
> as all present CPUs must make the H_JOIN call as part of the migration
> process. Once all present CPUs make the H_JOIN call, one CPU is returned
> to make the rtas call to perform the migration to the destination system.
> 
> During testing of migration and changing the SMT state we have found
> instances where CPUs are offlined, as part of the SMT state change,
> before they make the H_JOIN call. This results in a hung system where
> every CPU is either in H_JOIN or offline.
> 
> To prevent this this patch disables CPU hotplug during the migration
> process.
> 
> Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>

Reviewed-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] powerpc/pseries: Disable CPU hotplug across migrations
  2018-09-17 19:14 [PATCH] powerpc/pseries: Disable CPU hotplug across migrations Nathan Fontenot
  2018-09-17 20:41 ` Tyrel Datwyler
@ 2018-09-18 10:32 ` Gautham R Shenoy
  2018-09-20 15:03   ` Nathan Fontenot
  2018-09-20  4:21 ` Michael Ellerman
  2 siblings, 1 reply; 12+ messages in thread
From: Gautham R Shenoy @ 2018-09-18 10:32 UTC (permalink / raw)
  To: nfont, ego, tyreld; +Cc: linuxppc-dev

Hi Nathan,
On Tue, Sep 18, 2018 at 1:05 AM Nathan Fontenot
<nfont@linux.vnet.ibm.com> wrote:
>
> When performing partition migrations all present CPUs must be online
> as all present CPUs must make the H_JOIN call as part of the migration
> process. Once all present CPUs make the H_JOIN call, one CPU is returned
> to make the rtas call to perform the migration to the destination system.
>
> During testing of migration and changing the SMT state we have found
> instances where CPUs are offlined, as part of the SMT state change,
> before they make the H_JOIN call. This results in a hung system where
> every CPU is either in H_JOIN or offline.
>
> To prevent this this patch disables CPU hotplug during the migration
> process.
>
> Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
> ---
>  arch/powerpc/kernel/rtas.c |    2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> index 8afd146bc9c7..2c7ed31c736e 100644
> --- a/arch/powerpc/kernel/rtas.c
> +++ b/arch/powerpc/kernel/rtas.c
> @@ -981,6 +981,7 @@ int rtas_ibm_suspend_me(u64 handle)
>                 goto out;
>         }
>
> +       cpu_hotplug_disable();

So, some of the onlined CPUs ( via
rtas_online_cpus_mask(offline_mask);) can go still offline,
if the userspace issues an offline command, just before we execute
cpu_hotplug_disable().

So we are narrowing down the race, but it still exists. Am I missing something ?

>         stop_topology_update();
>
>         /* Call function on all CPUs.  One of us will make the
> @@ -995,6 +996,7 @@ int rtas_ibm_suspend_me(u64 handle)
>                 printk(KERN_ERR "Error doing global join\n");
>
>         start_topology_update();
> +       cpu_hotplug_enable();
>
>         /* Take down CPUs not online prior to suspend */
>         cpuret = rtas_offline_cpus_mask(offline_mask);
>


-- 
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: powerpc/pseries: Disable CPU hotplug across migrations
  2018-09-17 19:14 [PATCH] powerpc/pseries: Disable CPU hotplug across migrations Nathan Fontenot
  2018-09-17 20:41 ` Tyrel Datwyler
  2018-09-18 10:32 ` Gautham R Shenoy
@ 2018-09-20  4:21 ` Michael Ellerman
  2 siblings, 0 replies; 12+ messages in thread
From: Michael Ellerman @ 2018-09-20  4:21 UTC (permalink / raw)
  To: Nathan Fontenot, linuxppc-dev

On Mon, 2018-09-17 at 19:14:02 UTC, Nathan Fontenot wrote:
> When performing partition migrations all present CPUs must be online
> as all present CPUs must make the H_JOIN call as part of the migration
> process. Once all present CPUs make the H_JOIN call, one CPU is returned
> to make the rtas call to perform the migration to the destination system.
> 
> During testing of migration and changing the SMT state we have found
> instances where CPUs are offlined, as part of the SMT state change,
> before they make the H_JOIN call. This results in a hung system where
> every CPU is either in H_JOIN or offline.
> 
> To prevent this this patch disables CPU hotplug during the migration
> process.
> 
> Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
> Reviewed-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/85a88cabad57d26d826dd94ea34d3a

cheers

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] powerpc/pseries: Disable CPU hotplug across migrations
  2018-09-18 10:32 ` Gautham R Shenoy
@ 2018-09-20 15:03   ` Nathan Fontenot
  2018-09-24  7:00     ` Michael Ellerman
  0 siblings, 1 reply; 12+ messages in thread
From: Nathan Fontenot @ 2018-09-20 15:03 UTC (permalink / raw)
  To: ego, tyreld; +Cc: linuxppc-dev

On 09/18/2018 05:32 AM, Gautham R Shenoy wrote:
> Hi Nathan,
> On Tue, Sep 18, 2018 at 1:05 AM Nathan Fontenot
> <nfont@linux.vnet.ibm.com> wrote:
>>
>> When performing partition migrations all present CPUs must be online
>> as all present CPUs must make the H_JOIN call as part of the migration
>> process. Once all present CPUs make the H_JOIN call, one CPU is returned
>> to make the rtas call to perform the migration to the destination system.
>>
>> During testing of migration and changing the SMT state we have found
>> instances where CPUs are offlined, as part of the SMT state change,
>> before they make the H_JOIN call. This results in a hung system where
>> every CPU is either in H_JOIN or offline.
>>
>> To prevent this this patch disables CPU hotplug during the migration
>> process.
>>
>> Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/kernel/rtas.c |    2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
>> index 8afd146bc9c7..2c7ed31c736e 100644
>> --- a/arch/powerpc/kernel/rtas.c
>> +++ b/arch/powerpc/kernel/rtas.c
>> @@ -981,6 +981,7 @@ int rtas_ibm_suspend_me(u64 handle)
>>                 goto out;
>>         }
>>
>> +       cpu_hotplug_disable();
> 
> So, some of the onlined CPUs ( via
> rtas_online_cpus_mask(offline_mask);) can go still offline,
> if the userspace issues an offline command, just before we execute
> cpu_hotplug_disable().
> 
> So we are narrowing down the race, but it still exists. Am I missing something ?

You're correct, this narrows the window in which a CPU can go offline.

In testing with this patch we have not been able to re-create the failure but
there is still a small window.

-Nathan

> 
>>         stop_topology_update();
>>
>>         /* Call function on all CPUs.  One of us will make the
>> @@ -995,6 +996,7 @@ int rtas_ibm_suspend_me(u64 handle)
>>                 printk(KERN_ERR "Error doing global join\n");
>>
>>         start_topology_update();
>> +       cpu_hotplug_enable();
>>
>>         /* Take down CPUs not online prior to suspend */
>>         cpuret = rtas_offline_cpus_mask(offline_mask);
>>
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] powerpc/pseries: Disable CPU hotplug across migrations
  2018-09-20 15:03   ` Nathan Fontenot
@ 2018-09-24  7:00     ` Michael Ellerman
  2018-09-24  8:56       ` Gautham R Shenoy
  0 siblings, 1 reply; 12+ messages in thread
From: Michael Ellerman @ 2018-09-24  7:00 UTC (permalink / raw)
  To: Nathan Fontenot, ego, tyreld; +Cc: linuxppc-dev

Nathan Fontenot <nfont@linux.vnet.ibm.com> writes:
> On 09/18/2018 05:32 AM, Gautham R Shenoy wrote:
>> Hi Nathan,
>> On Tue, Sep 18, 2018 at 1:05 AM Nathan Fontenot
>> <nfont@linux.vnet.ibm.com> wrote:
>>>
>>> When performing partition migrations all present CPUs must be online
>>> as all present CPUs must make the H_JOIN call as part of the migration
>>> process. Once all present CPUs make the H_JOIN call, one CPU is returned
>>> to make the rtas call to perform the migration to the destination system.
>>>
>>> During testing of migration and changing the SMT state we have found
>>> instances where CPUs are offlined, as part of the SMT state change,
>>> before they make the H_JOIN call. This results in a hung system where
>>> every CPU is either in H_JOIN or offline.
>>>
>>> To prevent this this patch disables CPU hotplug during the migration
>>> process.
>>>
>>> Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
>>> ---
>>>  arch/powerpc/kernel/rtas.c |    2 ++
>>>  1 file changed, 2 insertions(+)
>>>
>>> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
>>> index 8afd146bc9c7..2c7ed31c736e 100644
>>> --- a/arch/powerpc/kernel/rtas.c
>>> +++ b/arch/powerpc/kernel/rtas.c
>>> @@ -981,6 +981,7 @@ int rtas_ibm_suspend_me(u64 handle)
>>>                 goto out;
>>>         }
>>>
>>> +       cpu_hotplug_disable();
>> 
>> So, some of the onlined CPUs ( via
>> rtas_online_cpus_mask(offline_mask);) can go still offline,
>> if the userspace issues an offline command, just before we execute
>> cpu_hotplug_disable().
>> 
>> So we are narrowing down the race, but it still exists. Am I missing something ?
>
> You're correct, this narrows the window in which a CPU can go offline.
>
> In testing with this patch we have not been able to re-create the failure but
> there is still a small window.

Well let's close it.

We just need to check that all present CPUs are online after we've
called cpu_hotplug_disable() don't we?

cheers

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] powerpc/pseries: Disable CPU hotplug across migrations
  2018-09-24  7:00     ` Michael Ellerman
@ 2018-09-24  8:56       ` Gautham R Shenoy
  2018-09-24 14:30         ` Nathan Fontenot
  2018-09-25  0:42         ` Michael Ellerman
  0 siblings, 2 replies; 12+ messages in thread
From: Gautham R Shenoy @ 2018-09-24  8:56 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: Nathan Fontenot, ego, tyreld, linuxppc-dev

Hi Michael,

On Mon, Sep 24, 2018 at 05:00:42PM +1000, Michael Ellerman wrote:
> Nathan Fontenot <nfont@linux.vnet.ibm.com> writes:
> > On 09/18/2018 05:32 AM, Gautham R Shenoy wrote:
> >> Hi Nathan,
> >> On Tue, Sep 18, 2018 at 1:05 AM Nathan Fontenot
> >> <nfont@linux.vnet.ibm.com> wrote:
> >>>
> >>> When performing partition migrations all present CPUs must be online
> >>> as all present CPUs must make the H_JOIN call as part of the migration
> >>> process. Once all present CPUs make the H_JOIN call, one CPU is returned
> >>> to make the rtas call to perform the migration to the destination system.
> >>>
> >>> During testing of migration and changing the SMT state we have found
> >>> instances where CPUs are offlined, as part of the SMT state change,
> >>> before they make the H_JOIN call. This results in a hung system where
> >>> every CPU is either in H_JOIN or offline.
> >>>
> >>> To prevent this this patch disables CPU hotplug during the migration
> >>> process.
> >>>
> >>> Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
> >>> ---
> >>>  arch/powerpc/kernel/rtas.c |    2 ++
> >>>  1 file changed, 2 insertions(+)
> >>>
> >>> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> >>> index 8afd146bc9c7..2c7ed31c736e 100644
> >>> --- a/arch/powerpc/kernel/rtas.c
> >>> +++ b/arch/powerpc/kernel/rtas.c
> >>> @@ -981,6 +981,7 @@ int rtas_ibm_suspend_me(u64 handle)
> >>>                 goto out;
> >>>         }
> >>>
> >>> +       cpu_hotplug_disable();
> >> 
> >> So, some of the onlined CPUs ( via
> >> rtas_online_cpus_mask(offline_mask);) can go still offline,
> >> if the userspace issues an offline command, just before we execute
> >> cpu_hotplug_disable().
> >> 
> >> So we are narrowing down the race, but it still exists. Am I missing something ?
> >
> > You're correct, this narrows the window in which a CPU can go offline.
> >
> > In testing with this patch we have not been able to re-create the failure but
> > there is still a small window.
> 
> Well let's close it.
> 
> We just need to check that all present CPUs are online after we've
> called cpu_hotplug_disable() don't we?

Yes. However, we cannot use the cpu_up() API to bring the offline CPUs
online, since will return with an -EBUSY if CPU-Hotplug has been
disabled. _cpu_up() works, but it is (understandably) a static
function in kernel/cpu.c

So, we might need a new APIs along the lines of
disable_nonboot_cpus()/enable_nonboot_cpus() 
that is currently being used by the suspend subsystem, only that we
would need the APIs to
      - Disable hotplug and online all the CPUs in an atomic
      fashion. Would be good if the API returns the cpumask of CPUs
      which were offline, which were brought online by this API.

      - Restore the state of the machine by offlining the CPUs which
      we brought online, and enable hotplug again. 
      
> 
> cheers
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] powerpc/pseries: Disable CPU hotplug across migrations
  2018-09-24  8:56       ` Gautham R Shenoy
@ 2018-09-24 14:30         ` Nathan Fontenot
  2018-09-24 20:49           ` Tyrel Datwyler
  2018-09-25  0:42         ` Michael Ellerman
  1 sibling, 1 reply; 12+ messages in thread
From: Nathan Fontenot @ 2018-09-24 14:30 UTC (permalink / raw)
  To: ego, Michael Ellerman; +Cc: tyreld, linuxppc-dev

On 09/24/2018 03:56 AM, Gautham R Shenoy wrote:
> Hi Michael,
> 
> On Mon, Sep 24, 2018 at 05:00:42PM +1000, Michael Ellerman wrote:
>> Nathan Fontenot <nfont@linux.vnet.ibm.com> writes:
>>> On 09/18/2018 05:32 AM, Gautham R Shenoy wrote:
>>>> Hi Nathan,
>>>> On Tue, Sep 18, 2018 at 1:05 AM Nathan Fontenot
>>>> <nfont@linux.vnet.ibm.com> wrote:
>>>>>
>>>>> When performing partition migrations all present CPUs must be online
>>>>> as all present CPUs must make the H_JOIN call as part of the migration
>>>>> process. Once all present CPUs make the H_JOIN call, one CPU is returned
>>>>> to make the rtas call to perform the migration to the destination system.
>>>>>
>>>>> During testing of migration and changing the SMT state we have found
>>>>> instances where CPUs are offlined, as part of the SMT state change,
>>>>> before they make the H_JOIN call. This results in a hung system where
>>>>> every CPU is either in H_JOIN or offline.
>>>>>
>>>>> To prevent this this patch disables CPU hotplug during the migration
>>>>> process.
>>>>>
>>>>> Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
>>>>> ---
>>>>>  arch/powerpc/kernel/rtas.c |    2 ++
>>>>>  1 file changed, 2 insertions(+)
>>>>>
>>>>> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
>>>>> index 8afd146bc9c7..2c7ed31c736e 100644
>>>>> --- a/arch/powerpc/kernel/rtas.c
>>>>> +++ b/arch/powerpc/kernel/rtas.c
>>>>> @@ -981,6 +981,7 @@ int rtas_ibm_suspend_me(u64 handle)
>>>>>                 goto out;
>>>>>         }
>>>>>
>>>>> +       cpu_hotplug_disable();
>>>>
>>>> So, some of the onlined CPUs ( via
>>>> rtas_online_cpus_mask(offline_mask);) can go still offline,
>>>> if the userspace issues an offline command, just before we execute
>>>> cpu_hotplug_disable().
>>>>
>>>> So we are narrowing down the race, but it still exists. Am I missing something ?
>>>
>>> You're correct, this narrows the window in which a CPU can go offline.
>>>
>>> In testing with this patch we have not been able to re-create the failure but
>>> there is still a small window.
>>
>> Well let's close it.
>>
>> We just need to check that all present CPUs are online after we've
>> called cpu_hotplug_disable() don't we?
> 
> Yes. However, we cannot use the cpu_up() API to bring the offline CPUs
> online, since will return with an -EBUSY if CPU-Hotplug has been
> disabled. _cpu_up() works, but it is (understandably) a static
> function in kernel/cpu.c
> 
> So, we might need a new APIs along the lines of
> disable_nonboot_cpus()/enable_nonboot_cpus() 
> that is currently being used by the suspend subsystem, only that we
> would need the APIs to
>       - Disable hotplug and online all the CPUs in an atomic
>       fashion. Would be good if the API returns the cpumask of CPUs
>       which were offline, which were brought online by this API.
> 
>       - Restore the state of the machine by offlining the CPUs which
>       we brought online, and enable hotplug again. 
>       

There is already code in the LPM path that saves a cpu mask of the offline
cpus prior to bringing them all online so we can offline them again after
the migration.

The missing piece to fully close the window is an API that will allow us to
online cpus while cpu hotplug is disabled.

Since we have not been able to re-create the failure with this patch would
it be ok to pull in this patch while other options are explored?

-Nathan

>>
>> cheers
>>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] powerpc/pseries: Disable CPU hotplug across migrations
  2018-09-24 14:30         ` Nathan Fontenot
@ 2018-09-24 20:49           ` Tyrel Datwyler
  2018-09-25  0:38             ` Michael Ellerman
  0 siblings, 1 reply; 12+ messages in thread
From: Tyrel Datwyler @ 2018-09-24 20:49 UTC (permalink / raw)
  To: Nathan Fontenot, ego, Michael Ellerman; +Cc: linuxppc-dev

On 09/24/2018 07:30 AM, Nathan Fontenot wrote:
> On 09/24/2018 03:56 AM, Gautham R Shenoy wrote:
>> Hi Michael,
>>
>> On Mon, Sep 24, 2018 at 05:00:42PM +1000, Michael Ellerman wrote:
>>> Nathan Fontenot <nfont@linux.vnet.ibm.com> writes:
>>>> On 09/18/2018 05:32 AM, Gautham R Shenoy wrote:
>>>>> Hi Nathan,
>>>>> On Tue, Sep 18, 2018 at 1:05 AM Nathan Fontenot
>>>>> <nfont@linux.vnet.ibm.com> wrote:
>>>>>>
>>>>>> When performing partition migrations all present CPUs must be online
>>>>>> as all present CPUs must make the H_JOIN call as part of the migration
>>>>>> process. Once all present CPUs make the H_JOIN call, one CPU is returned
>>>>>> to make the rtas call to perform the migration to the destination system.
>>>>>>
>>>>>> During testing of migration and changing the SMT state we have found
>>>>>> instances where CPUs are offlined, as part of the SMT state change,
>>>>>> before they make the H_JOIN call. This results in a hung system where
>>>>>> every CPU is either in H_JOIN or offline.
>>>>>>
>>>>>> To prevent this this patch disables CPU hotplug during the migration
>>>>>> process.
>>>>>>
>>>>>> Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
>>>>>> ---
>>>>>>  arch/powerpc/kernel/rtas.c |    2 ++
>>>>>>  1 file changed, 2 insertions(+)
>>>>>>
>>>>>> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
>>>>>> index 8afd146bc9c7..2c7ed31c736e 100644
>>>>>> --- a/arch/powerpc/kernel/rtas.c
>>>>>> +++ b/arch/powerpc/kernel/rtas.c
>>>>>> @@ -981,6 +981,7 @@ int rtas_ibm_suspend_me(u64 handle)
>>>>>>                 goto out;
>>>>>>         }
>>>>>>
>>>>>> +       cpu_hotplug_disable();
>>>>>
>>>>> So, some of the onlined CPUs ( via
>>>>> rtas_online_cpus_mask(offline_mask);) can go still offline,
>>>>> if the userspace issues an offline command, just before we execute
>>>>> cpu_hotplug_disable().
>>>>>
>>>>> So we are narrowing down the race, but it still exists. Am I missing something ?
>>>>
>>>> You're correct, this narrows the window in which a CPU can go offline.
>>>>
>>>> In testing with this patch we have not been able to re-create the failure but
>>>> there is still a small window.
>>>
>>> Well let's close it.
>>>
>>> We just need to check that all present CPUs are online after we've
>>> called cpu_hotplug_disable() don't we?
>>
>> Yes. However, we cannot use the cpu_up() API to bring the offline CPUs
>> online, since will return with an -EBUSY if CPU-Hotplug has been
>> disabled. _cpu_up() works, but it is (understandably) a static
>> function in kernel/cpu.c
>>
>> So, we might need a new APIs along the lines of
>> disable_nonboot_cpus()/enable_nonboot_cpus() 
>> that is currently being used by the suspend subsystem, only that we
>> would need the APIs to
>>       - Disable hotplug and online all the CPUs in an atomic
>>       fashion. Would be good if the API returns the cpumask of CPUs
>>       which were offline, which were brought online by this API.
>>
>>       - Restore the state of the machine by offlining the CPUs which
>>       we brought online, and enable hotplug again. 
>>       
> 
> There is already code in the LPM path that saves a cpu mask of the offline
> cpus prior to bringing them all online so we can offline them again after
> the migration.
> 
> The missing piece to fully close the window is an API that will allow us to
> online cpus while cpu hotplug is disabled.
> 
> Since we have not been able to re-create the failure with this patch would
> it be ok to pull in this patch while other options are explored?

I think mpe initially applied this to -next. Not sure if he dropped it, but I would definitely give a +1 to carrying this workaround for now until we can put together an API that fully closes the gap. We are hot with LPM blocked tests at the moment.

-Tyrel

> 
> -Nathan
> 
>>>
>>> cheers
>>>
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] powerpc/pseries: Disable CPU hotplug across migrations
  2018-09-24 20:49           ` Tyrel Datwyler
@ 2018-09-25  0:38             ` Michael Ellerman
  0 siblings, 0 replies; 12+ messages in thread
From: Michael Ellerman @ 2018-09-25  0:38 UTC (permalink / raw)
  To: Tyrel Datwyler, Nathan Fontenot, ego; +Cc: linuxppc-dev

Tyrel Datwyler <tyreld@linux.vnet.ibm.com> writes:
> On 09/24/2018 07:30 AM, Nathan Fontenot wrote:
...
>> 
>> Since we have not been able to re-create the failure with this patch would
>> it be ok to pull in this patch while other options are explored?
>
> I think mpe initially applied this to -next. Not sure if he dropped
> it, but I would definitely give a +1 to carrying this workaround for
> now until we can put together an API that fully closes the gap. We are
> hot with LPM blocked tests at the moment.

Yeah it's in next, I'm not going to drop it. Any fix would be an
incremental fix on top.

cheers

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] powerpc/pseries: Disable CPU hotplug across migrations
  2018-09-24  8:56       ` Gautham R Shenoy
  2018-09-24 14:30         ` Nathan Fontenot
@ 2018-09-25  0:42         ` Michael Ellerman
  2018-09-25  6:19           ` Gautham R Shenoy
  1 sibling, 1 reply; 12+ messages in thread
From: Michael Ellerman @ 2018-09-25  0:42 UTC (permalink / raw)
  To: Gautham R Shenoy; +Cc: Nathan Fontenot, ego, tyreld, linuxppc-dev

Gautham R Shenoy <ego@linux.vnet.ibm.com> writes:
> On Mon, Sep 24, 2018 at 05:00:42PM +1000, Michael Ellerman wrote:
>> Nathan Fontenot <nfont@linux.vnet.ibm.com> writes:
>> > On 09/18/2018 05:32 AM, Gautham R Shenoy wrote:
>> >> On Tue, Sep 18, 2018 at 1:05 AM Nathan Fontenot
>> >>> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
>> >>> index 8afd146bc9c7..2c7ed31c736e 100644
>> >>> --- a/arch/powerpc/kernel/rtas.c
>> >>> +++ b/arch/powerpc/kernel/rtas.c
>> >>> @@ -981,6 +981,7 @@ int rtas_ibm_suspend_me(u64 handle)
>> >>>                 goto out;
>> >>>         }
>> >>>
>> >>> +       cpu_hotplug_disable();
>> >> 
>> >> So, some of the onlined CPUs ( via
>> >> rtas_online_cpus_mask(offline_mask);) can go still offline,
>> >> if the userspace issues an offline command, just before we execute
>> >> cpu_hotplug_disable().
>> >> 
>> >> So we are narrowing down the race, but it still exists. Am I missing something ?
>> >
>> > You're correct, this narrows the window in which a CPU can go offline.
>> >
>> > In testing with this patch we have not been able to re-create the failure but
>> > there is still a small window.
>> 
>> Well let's close it.
>> 
>> We just need to check that all present CPUs are online after we've
>> called cpu_hotplug_disable() don't we?
>
> Yes. However, we cannot use the cpu_up() API to bring the offline CPUs
> online, since will return with an -EBUSY if CPU-Hotplug has been
> disabled.

I'm not suggesting we try to bring them online after we've disabled CPU
hotplug, if we detect that race we can just fail the migration.

Can't we do:
 - save mask of offline CPUs
 - bring all offline CPUs online
 - disable CPU hotplug
 - check if any CPUs are offline
   - if so, we've raced with an offline
   - bail out of the migration with an error


Instead of bailing out we could go back to the start and try again for
some number of retries, but that's probably overkill anyway.

What am I missing?

cheers

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] powerpc/pseries: Disable CPU hotplug across migrations
  2018-09-25  0:42         ` Michael Ellerman
@ 2018-09-25  6:19           ` Gautham R Shenoy
  0 siblings, 0 replies; 12+ messages in thread
From: Gautham R Shenoy @ 2018-09-25  6:19 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: Gautham R Shenoy, Nathan Fontenot, tyreld, linuxppc-dev

On Tue, Sep 25, 2018 at 10:42:05AM +1000, Michael Ellerman wrote:
[..snip..]
> I'm not suggesting we try to bring them online after we've disabled CPU
> hotplug, if we detect that race we can just fail the migration.
> 
> Can't we do:
>  - save mask of offline CPUs
>  - bring all offline CPUs online
>  - disable CPU hotplug
>  - check if any CPUs are offline
>    - if so, we've raced with an offline
>    - bail out of the migration with an error
> 
> 
> Instead of bailing out we could go back to the start and try again for
> some number of retries, but that's probably overkill anyway.
> 
> What am I missing?

I guess that will work. The race is unlikely anyway, so I doubt
CPU-Hotplug can DDOS the partition migration.

Does the following implementation of the same look ok ? (Build tested) 

------------------------------------ X8------------------------------------- 
>From acb9eb9f8bb14cf3121aeb0589255cbc31292be7 Mon Sep 17 00:00:00 2001
From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
Date: Tue, 25 Sep 2018 11:01:18 +0530
Subject: [PATCH] powerpc/rtas: Fix a potential race between CPU-Offline & Migration

commit 85a88cabad57 ("powerpc/pseries: Disable CPU hotplug across
migrations") disables any CPU-hotplug operations when Live Partition
Migration is in progress. However, there is a minor race-window
between the time all the CPUs are onlined by rtas_ibm_suspend_me() and
the CPU-Hotplugs are disabled via cpu_hotplug_disable() when some CPUs
could be offlined by the userspace, thus nullifying the assumption
that all the CPUs are online at this point.

This patch fixes this by checking if all the present CPUs are brought
online after disabling CPU-Hotplug. Otherwise, it retries to bring the
CPUs online again for a finite number of times failing which
rtas_ibm_suspend_me() returns -EBUSY.

Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/rtas.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index 2c7ed31..e6a6425 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -934,6 +934,8 @@ int rtas_offline_cpus_mask(cpumask_var_t cpus)
 }
 EXPORT_SYMBOL(rtas_offline_cpus_mask);
 
+#define MAX_SUSPEND_HOTPLUG_RETRIES    5
+
 int rtas_ibm_suspend_me(u64 handle)
 {
 	long state;
@@ -943,6 +945,7 @@ int rtas_ibm_suspend_me(u64 handle)
 	DECLARE_COMPLETION_ONSTACK(done);
 	cpumask_var_t offline_mask;
 	int cpuret;
+	int retries = MAX_SUSPEND_HOTPLUG_RETRIES;
 
 	if (!rtas_service_present("ibm,suspend-me"))
 		return -ENOSYS;
@@ -972,6 +975,7 @@ int rtas_ibm_suspend_me(u64 handle)
 	data.token = rtas_token("ibm,suspend-me");
 	data.complete = &done;
 
+again:
 	/* All present CPUs must be online */
 	cpumask_andnot(offline_mask, cpu_present_mask, cpu_online_mask);
 	cpuret = rtas_online_cpus_mask(offline_mask);
@@ -982,6 +986,19 @@ int rtas_ibm_suspend_me(u64 handle)
 	}
 
 	cpu_hotplug_disable();
+
+	/* Check if we raced with a CPU-Offline Operation */
+	if (unlikely(!cpumask_equal(cpu_present_mask, cpu_online_mask))) {
+		cpu_hotplug_enable();
+		if (retries-- > 0)
+			goto again;
+
+		pr_err("%s: Too many concurrent CPU-Offline operation in progress\n",
+		       __func__);
+		atomic_set(&data.error, -EBUSY);
+		goto out;
+	}
+
 	stop_topology_update();
 
 	/* Call function on all CPUs.  One of us will make the
-- 
1.9.4

^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2018-09-25  6:19 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-17 19:14 [PATCH] powerpc/pseries: Disable CPU hotplug across migrations Nathan Fontenot
2018-09-17 20:41 ` Tyrel Datwyler
2018-09-18 10:32 ` Gautham R Shenoy
2018-09-20 15:03   ` Nathan Fontenot
2018-09-24  7:00     ` Michael Ellerman
2018-09-24  8:56       ` Gautham R Shenoy
2018-09-24 14:30         ` Nathan Fontenot
2018-09-24 20:49           ` Tyrel Datwyler
2018-09-25  0:38             ` Michael Ellerman
2018-09-25  0:42         ` Michael Ellerman
2018-09-25  6:19           ` Gautham R Shenoy
2018-09-20  4:21 ` Michael Ellerman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.