linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] pseries: UNISOLATE DRCs to signal device removal error
@ 2021-04-16 21:02 Daniel Henrique Barboza
  2021-04-16 21:02 ` [PATCH 1/2] dlpar.c: introduce dlpar_unisolate_drc() Daniel Henrique Barboza
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Daniel Henrique Barboza @ 2021-04-16 21:02 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Daniel Henrique Barboza, david

At this moment, PAPR [1] does not have a way to report errors during a device
removal operation. This puts a strain in the hypervisor, which needs extra
mechanisms to try to fallback and recover from an error that might have
happened during the removal. The QEMU community has dealt with it during these
years by either trying to preempt the error before sending the HP event or, in
case of a guest side failure, reboot the guest to complete the removal process.

This started to change with QEMU commit fe1831eff8a4 ("spapr_drc.c: use DRC
reconfiguration to cleanup DIMM unplug state"), where a way to fallback from a
memory removal error was introduced. In this case, when QEMU detects that the
kernel is reconfiguring LMBs DRCs that were marked as pending removal, the
entire process is reverted from the QEMU side as well. Around the same time,
other discussions in the QEMU mailing discussed an alternative for other device
as well.

In [2] the idea of using RTAS set-indicator for this role was first introduced.
The RTAS set-indicator call, when attempting to UNISOLATE a DRC that is already
UNISOLATED or CONFIGURED, returns RTAS_OK and does nothing else for both QEMU
and phyp. This gives us an opportunity to use this behavior to signal the
hypervisor layer when a device removal happens, allowing it to do a proper
error handling knowing for sure that the removal failed in the kernel. Using
set-indicator to report HP errors isn't strange to PAPR, as per R1-13.5.3.4-4.
of table 13.7 of [1]:

"For all DR options: If this is a DR operation that involves the user insert-
ing a DR entity, then if the firmware can determine that the inserted entity
would cause a system disturbance, then the set-indicator RTAS call must not
unisolate the entity and must return an error status which is unique to the
particular error."

PAPR does not make any restrictions or considerations about setting an already
Unisolated/Configured DRC to 'unisolate', meaning we have a chance to use it
for this purpose - signal an OS side error when attempting to remove a DR
entity.  To validate the design, this is being implemented only for CPUs.

QEMU will use this mechanism to rollback the device removal (hotunplug) state,
allowing for a better error handling mechanism. A implementation of how QEMU
can do it is in [3]. When using a kernel with this series applied, together
with this QEMU build, this is what happens in a common CPU removal/hotunplug
error scenario (trying to remove the last online CPU):

( QEMU command line: qemu-system-ppc64 -machine pseries,accel=kvm,usb=off
-smp 1,maxcpus=2,threads=1,cores=2,sockets=1 ... )

[root@localhost ~]# QEMU 5.2.92 monitor - type 'help' for more information
(qemu) device_add host-spapr-cpu-core,core-id=1,id=core1
(qemu) 

[root@localhost ~]# echo 0 > /sys/devices/system/cpu/cpu0/online
[   77.548442][   T13] IRQ 19: no longer affine to CPU0
[   77.548452][   T13] IRQ 20: no longer affine to CPU0
[   77.548458][   T13] IRQ 256: no longer affine to CPU0
[   77.548465][   T13] IRQ 258: no longer affine to CPU0
[   77.548472][   T13] IRQ 259: no longer affine to CPU0
[   77.548479][   T13] IRQ 260: no longer affine to CPU0
[   77.548485][   T13] IRQ 261: no longer affine to CPU0
[   77.548590][    T0] cpu 0 (hwid 0) Ready to die...
[root@localhost ~]# (qemu) 
(qemu) device_del core1
(qemu) [   83.214073][  T100] pseries-hotplug-cpu: Failed to offline CPU PowerPC,POWER9, rc: -16
qemu-system-ppc64: Device hotunplug rejected by the guest for device core1

(qemu) 

As soon as the CPU removal fails in dlpar_cpu(), QEMU becames aware of
it and is able to do error recovery.

If this solution is well received, I'll push for an architecture change
request internally at IBM to make this mechanism PAPR official.


[1] https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf
[2] https://lists.gnu.org/archive/html/qemu-devel/2021-02/msg06395.html
[3] https://github.com/danielhb/qemu/tree/unisolate_drc_callback_v1

Daniel Henrique Barboza (2):
  dlpar.c: introduce dlpar_unisolate_drc()
  hotplug-cpu.c: set UNISOLATE on dlpar_cpu_remove() failure

 arch/powerpc/platforms/pseries/dlpar.c       | 14 ++++++++++++++
 arch/powerpc/platforms/pseries/hotplug-cpu.c |  9 ++++++++-
 arch/powerpc/platforms/pseries/pseries.h     |  1 +
 3 files changed, 23 insertions(+), 1 deletion(-)

-- 
2.30.2


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/2] dlpar.c: introduce dlpar_unisolate_drc()
  2021-04-16 21:02 [PATCH 0/2] pseries: UNISOLATE DRCs to signal device removal error Daniel Henrique Barboza
@ 2021-04-16 21:02 ` Daniel Henrique Barboza
  2021-04-19  4:01   ` David Gibson
  2021-04-16 21:02 ` [PATCH 2/2] hotplug-cpu.c: set UNISOLATE on dlpar_cpu_remove() failure Daniel Henrique Barboza
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Daniel Henrique Barboza @ 2021-04-16 21:02 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Daniel Henrique Barboza, david

Next patch will execute a set-indicator call in hotplug-cpu.c.

Create a dlpar_unisolate_drc() helper to avoid spreading more
rtas_set_indicator() calls outside of dlpar.c.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 arch/powerpc/platforms/pseries/dlpar.c   | 14 ++++++++++++++
 arch/powerpc/platforms/pseries/pseries.h |  1 +
 2 files changed, 15 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c
index 233503fcf8f0..3ac70790ec7a 100644
--- a/arch/powerpc/platforms/pseries/dlpar.c
+++ b/arch/powerpc/platforms/pseries/dlpar.c
@@ -329,6 +329,20 @@ int dlpar_release_drc(u32 drc_index)
 	return 0;
 }
 
+int dlpar_unisolate_drc(u32 drc_index)
+{
+	int dr_status, rc;
+
+	rc = rtas_call(rtas_token("get-sensor-state"), 2, 2, &dr_status,
+				DR_ENTITY_SENSE, drc_index);
+	if (rc || dr_status != DR_ENTITY_PRESENT)
+		return -1;
+
+	rtas_set_indicator(ISOLATION_STATE, drc_index, UNISOLATE);
+
+	return 0;
+}
+
 int handle_dlpar_errorlog(struct pseries_hp_errorlog *hp_elog)
 {
 	int rc;
diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
index 4fe48c04c6c2..4ea12037c920 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -55,6 +55,7 @@ extern int dlpar_attach_node(struct device_node *, struct device_node *);
 extern int dlpar_detach_node(struct device_node *);
 extern int dlpar_acquire_drc(u32 drc_index);
 extern int dlpar_release_drc(u32 drc_index);
+extern int dlpar_unisolate_drc(u32 drc_index);
 
 void queue_hotplug_event(struct pseries_hp_errorlog *hp_errlog);
 int handle_dlpar_errorlog(struct pseries_hp_errorlog *hp_errlog);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/2] hotplug-cpu.c: set UNISOLATE on dlpar_cpu_remove() failure
  2021-04-16 21:02 [PATCH 0/2] pseries: UNISOLATE DRCs to signal device removal error Daniel Henrique Barboza
  2021-04-16 21:02 ` [PATCH 1/2] dlpar.c: introduce dlpar_unisolate_drc() Daniel Henrique Barboza
@ 2021-04-16 21:02 ` Daniel Henrique Barboza
  2021-04-19  4:02   ` David Gibson
  2021-04-19 12:48   ` Michael Ellerman
  2021-04-19  3:59 ` [PATCH 0/2] pseries: UNISOLATE DRCs to signal device removal error David Gibson
  2021-04-21 13:08 ` Michael Ellerman
  3 siblings, 2 replies; 10+ messages in thread
From: Daniel Henrique Barboza @ 2021-04-16 21:02 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Daniel Henrique Barboza, david

The RTAS set-indicator call, when attempting to UNISOLATE a DRC that is
already UNISOLATED or CONFIGURED, returns RTAS_OK and does nothing else
for both QEMU and phyp. This gives us an opportunity to use this
behavior to signal the hypervisor layer when an error during device
removal happens, allowing it to do a proper error handling, while not
breaking QEMU/phyp implementations that don't have this support.

This patch introduces this idea by unisolating all CPU DRCs that failed
to be removed by dlpar_cpu_remove_by_index(), when handling the
PSERIES_HP_ELOG_ID_DRC_INDEX event. This is being done for this event
only because its the only CPU removal event QEMU uses, and there's no
need at this moment to add this mechanism for phyp only code.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 12cbffd3c2e3..ed66895c2f51 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -802,8 +802,15 @@ int dlpar_cpu(struct pseries_hp_errorlog *hp_elog)
 	case PSERIES_HP_ELOG_ACTION_REMOVE:
 		if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_COUNT)
 			rc = dlpar_cpu_remove_by_count(count);
-		else if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_INDEX)
+		else if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_INDEX) {
 			rc = dlpar_cpu_remove_by_index(drc_index);
+			/* Setting the isolation state of an UNISOLATED/CONFIGURED
+			 * device to UNISOLATE is a no-op, but the hypervison can
+			 * use it as a hint that the cpu removal failed.
+			 */
+			if (rc)
+				dlpar_unisolate_drc(drc_index);
+		}
 		else
 			rc = -EINVAL;
 		break;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/2] pseries: UNISOLATE DRCs to signal device removal error
  2021-04-16 21:02 [PATCH 0/2] pseries: UNISOLATE DRCs to signal device removal error Daniel Henrique Barboza
  2021-04-16 21:02 ` [PATCH 1/2] dlpar.c: introduce dlpar_unisolate_drc() Daniel Henrique Barboza
  2021-04-16 21:02 ` [PATCH 2/2] hotplug-cpu.c: set UNISOLATE on dlpar_cpu_remove() failure Daniel Henrique Barboza
@ 2021-04-19  3:59 ` David Gibson
  2021-04-21 13:08 ` Michael Ellerman
  3 siblings, 0 replies; 10+ messages in thread
From: David Gibson @ 2021-04-19  3:59 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 4958 bytes --]

On Fri, Apr 16, 2021 at 06:02:14PM -0300, Daniel Henrique Barboza wrote:
> At this moment, PAPR [1] does not have a way to report errors during a device
> removal operation. This puts a strain in the hypervisor, which needs extra
> mechanisms to try to fallback and recover from an error that might have
> happened during the removal. The QEMU community has dealt with it during these
> years by either trying to preempt the error before sending the HP event or, in
> case of a guest side failure, reboot the guest to complete the removal process.
> 
> This started to change with QEMU commit fe1831eff8a4 ("spapr_drc.c: use DRC
> reconfiguration to cleanup DIMM unplug state"), where a way to fallback from a
> memory removal error was introduced. In this case, when QEMU detects that the
> kernel is reconfiguring LMBs DRCs that were marked as pending removal, the
> entire process is reverted from the QEMU side as well. Around the same time,
> other discussions in the QEMU mailing discussed an alternative for other device
> as well.
> 
> In [2] the idea of using RTAS set-indicator for this role was first introduced.
> The RTAS set-indicator call, when attempting to UNISOLATE a DRC that is already
> UNISOLATED or CONFIGURED, returns RTAS_OK and does nothing else for both QEMU
> and phyp. This gives us an opportunity to use this behavior to signal the
> hypervisor layer when a device removal happens, allowing it to do a
> proper

Nit: it's not when a device removal happens, but when it *fails* to happen.

> error handling knowing for sure that the removal failed in the kernel. Using
> set-indicator to report HP errors isn't strange to PAPR, as per R1-13.5.3.4-4.
> of table 13.7 of [1]:

> "For all DR options: If this is a DR operation that involves the user insert-
> ing a DR entity, then if the firmware can determine that the inserted entity
> would cause a system disturbance, then the set-indicator RTAS call must not
> unisolate the entity and must return an error status which is unique to the
> particular error."
> 
> PAPR does not make any restrictions or considerations about setting an already
> Unisolated/Configured DRC to 'unisolate', meaning we have a chance to use it
> for this purpose - signal an OS side error when attempting to remove a DR
> entity.  To validate the design, this is being implemented only for CPUs.
> 
> QEMU will use this mechanism to rollback the device removal (hotunplug) state,
> allowing for a better error handling mechanism. A implementation of how QEMU
> can do it is in [3]. When using a kernel with this series applied, together
> with this QEMU build, this is what happens in a common CPU removal/hotunplug
> error scenario (trying to remove the last online CPU):
> 
> ( QEMU command line: qemu-system-ppc64 -machine pseries,accel=kvm,usb=off
> -smp 1,maxcpus=2,threads=1,cores=2,sockets=1 ... )
> 
> [root@localhost ~]# QEMU 5.2.92 monitor - type 'help' for more information
> (qemu) device_add host-spapr-cpu-core,core-id=1,id=core1
> (qemu) 
> 
> [root@localhost ~]# echo 0 > /sys/devices/system/cpu/cpu0/online
> [   77.548442][   T13] IRQ 19: no longer affine to CPU0
> [   77.548452][   T13] IRQ 20: no longer affine to CPU0
> [   77.548458][   T13] IRQ 256: no longer affine to CPU0
> [   77.548465][   T13] IRQ 258: no longer affine to CPU0
> [   77.548472][   T13] IRQ 259: no longer affine to CPU0
> [   77.548479][   T13] IRQ 260: no longer affine to CPU0
> [   77.548485][   T13] IRQ 261: no longer affine to CPU0
> [   77.548590][    T0] cpu 0 (hwid 0) Ready to die...
> [root@localhost ~]# (qemu) 
> (qemu) device_del core1
> (qemu) [   83.214073][  T100] pseries-hotplug-cpu: Failed to offline CPU PowerPC,POWER9, rc: -16
> qemu-system-ppc64: Device hotunplug rejected by the guest for device core1
> 
> (qemu) 
> 
> As soon as the CPU removal fails in dlpar_cpu(), QEMU becames aware of
> it and is able to do error recovery.
> 
> If this solution is well received, I'll push for an architecture change
> request internally at IBM to make this mechanism PAPR official.
> 
> 
> [1] https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf
> [2] https://lists.gnu.org/archive/html/qemu-devel/2021-02/msg06395.html
> [3] https://github.com/danielhb/qemu/tree/unisolate_drc_callback_v1
> 
> Daniel Henrique Barboza (2):
>   dlpar.c: introduce dlpar_unisolate_drc()
>   hotplug-cpu.c: set UNISOLATE on dlpar_cpu_remove() failure
> 
>  arch/powerpc/platforms/pseries/dlpar.c       | 14 ++++++++++++++
>  arch/powerpc/platforms/pseries/hotplug-cpu.c |  9 ++++++++-
>  arch/powerpc/platforms/pseries/pseries.h     |  1 +
>  3 files changed, 23 insertions(+), 1 deletion(-)
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] dlpar.c: introduce dlpar_unisolate_drc()
  2021-04-16 21:02 ` [PATCH 1/2] dlpar.c: introduce dlpar_unisolate_drc() Daniel Henrique Barboza
@ 2021-04-19  4:01   ` David Gibson
  0 siblings, 0 replies; 10+ messages in thread
From: David Gibson @ 2021-04-19  4:01 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 2196 bytes --]

On Fri, Apr 16, 2021 at 06:02:15PM -0300, Daniel Henrique Barboza wrote:
> Next patch will execute a set-indicator call in hotplug-cpu.c.
> 
> Create a dlpar_unisolate_drc() helper to avoid spreading more
> rtas_set_indicator() calls outside of dlpar.c.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  arch/powerpc/platforms/pseries/dlpar.c   | 14 ++++++++++++++
>  arch/powerpc/platforms/pseries/pseries.h |  1 +
>  2 files changed, 15 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c
> index 233503fcf8f0..3ac70790ec7a 100644
> --- a/arch/powerpc/platforms/pseries/dlpar.c
> +++ b/arch/powerpc/platforms/pseries/dlpar.c
> @@ -329,6 +329,20 @@ int dlpar_release_drc(u32 drc_index)
>  	return 0;
>  }
>  
> +int dlpar_unisolate_drc(u32 drc_index)
> +{
> +	int dr_status, rc;
> +
> +	rc = rtas_call(rtas_token("get-sensor-state"), 2, 2, &dr_status,
> +				DR_ENTITY_SENSE, drc_index);
> +	if (rc || dr_status != DR_ENTITY_PRESENT)
> +		return -1;
> +
> +	rtas_set_indicator(ISOLATION_STATE, drc_index, UNISOLATE);
> +
> +	return 0;
> +}
> +
>  int handle_dlpar_errorlog(struct pseries_hp_errorlog *hp_elog)
>  {
>  	int rc;
> diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
> index 4fe48c04c6c2..4ea12037c920 100644
> --- a/arch/powerpc/platforms/pseries/pseries.h
> +++ b/arch/powerpc/platforms/pseries/pseries.h
> @@ -55,6 +55,7 @@ extern int dlpar_attach_node(struct device_node *, struct device_node *);
>  extern int dlpar_detach_node(struct device_node *);
>  extern int dlpar_acquire_drc(u32 drc_index);
>  extern int dlpar_release_drc(u32 drc_index);
> +extern int dlpar_unisolate_drc(u32 drc_index);
>  
>  void queue_hotplug_event(struct pseries_hp_errorlog *hp_errlog);
>  int handle_dlpar_errorlog(struct pseries_hp_errorlog *hp_errlog);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] hotplug-cpu.c: set UNISOLATE on dlpar_cpu_remove() failure
  2021-04-16 21:02 ` [PATCH 2/2] hotplug-cpu.c: set UNISOLATE on dlpar_cpu_remove() failure Daniel Henrique Barboza
@ 2021-04-19  4:02   ` David Gibson
  2021-04-19 12:48   ` Michael Ellerman
  1 sibling, 0 replies; 10+ messages in thread
From: David Gibson @ 2021-04-19  4:02 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 2335 bytes --]

On Fri, Apr 16, 2021 at 06:02:16PM -0300, Daniel Henrique Barboza wrote:
> The RTAS set-indicator call, when attempting to UNISOLATE a DRC that is
> already UNISOLATED or CONFIGURED, returns RTAS_OK and does nothing else
> for both QEMU and phyp. This gives us an opportunity to use this
> behavior to signal the hypervisor layer when an error during device
> removal happens, allowing it to do a proper error handling, while not
> breaking QEMU/phyp implementations that don't have this support.
> 
> This patch introduces this idea by unisolating all CPU DRCs that failed
> to be removed by dlpar_cpu_remove_by_index(), when handling the
> PSERIES_HP_ELOG_ID_DRC_INDEX event. This is being done for this event
> only because its the only CPU removal event QEMU uses, and there's no
> need at this moment to add this mechanism for phyp only code.
> 
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

except...

> ---
>  arch/powerpc/platforms/pseries/hotplug-cpu.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index 12cbffd3c2e3..ed66895c2f51 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -802,8 +802,15 @@ int dlpar_cpu(struct pseries_hp_errorlog *hp_elog)
>  	case PSERIES_HP_ELOG_ACTION_REMOVE:
>  		if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_COUNT)
>  			rc = dlpar_cpu_remove_by_count(count);
> -		else if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_INDEX)
> +		else if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_INDEX) {
>  			rc = dlpar_cpu_remove_by_index(drc_index);
> +			/* Setting the isolation state of an UNISOLATED/CONFIGURED
> +			 * device to UNISOLATE is a no-op, but the hypervison can

typo here s/hypervison/hypervisor/

> +			 * use it as a hint that the cpu removal failed.
> +			 */
> +			if (rc)
> +				dlpar_unisolate_drc(drc_index);
> +		}
>  		else
>  			rc = -EINVAL;
>  		break;

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] hotplug-cpu.c: set UNISOLATE on dlpar_cpu_remove() failure
  2021-04-16 21:02 ` [PATCH 2/2] hotplug-cpu.c: set UNISOLATE on dlpar_cpu_remove() failure Daniel Henrique Barboza
  2021-04-19  4:02   ` David Gibson
@ 2021-04-19 12:48   ` Michael Ellerman
  2021-04-19 13:13     ` Daniel Henrique Barboza
  1 sibling, 1 reply; 10+ messages in thread
From: Michael Ellerman @ 2021-04-19 12:48 UTC (permalink / raw)
  To: Daniel Henrique Barboza, linuxppc-dev; +Cc: Daniel Henrique Barboza, david

Daniel Henrique Barboza <danielhb413@gmail.com> writes:
> The RTAS set-indicator call, when attempting to UNISOLATE a DRC that is
> already UNISOLATED or CONFIGURED, returns RTAS_OK and does nothing else
> for both QEMU and phyp. This gives us an opportunity to use this
> behavior to signal the hypervisor layer when an error during device
> removal happens, allowing it to do a proper error handling, while not
> breaking QEMU/phyp implementations that don't have this support.
>
> This patch introduces this idea by unisolating all CPU DRCs that failed
> to be removed by dlpar_cpu_remove_by_index(), when handling the
> PSERIES_HP_ELOG_ID_DRC_INDEX event. This is being done for this event
> only because its the only CPU removal event QEMU uses, and there's no
> need at this moment to add this mechanism for phyp only code.

Have you also confirmed that phyp is not bothered by it? ie. everything
seems to continue working when you trigger this path on phyp.

cheers

> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index 12cbffd3c2e3..ed66895c2f51 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -802,8 +802,15 @@ int dlpar_cpu(struct pseries_hp_errorlog *hp_elog)
>  	case PSERIES_HP_ELOG_ACTION_REMOVE:
>  		if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_COUNT)
>  			rc = dlpar_cpu_remove_by_count(count);
> -		else if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_INDEX)
> +		else if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_INDEX) {
>  			rc = dlpar_cpu_remove_by_index(drc_index);
> +			/* Setting the isolation state of an UNISOLATED/CONFIGURED
> +			 * device to UNISOLATE is a no-op, but the hypervison can
> +			 * use it as a hint that the cpu removal failed.
> +			 */
> +			if (rc)
> +				dlpar_unisolate_drc(drc_index);
> +		}
>  		else
>  			rc = -EINVAL;
>  		break;
> -- 
> 2.30.2

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] hotplug-cpu.c: set UNISOLATE on dlpar_cpu_remove() failure
  2021-04-19 12:48   ` Michael Ellerman
@ 2021-04-19 13:13     ` Daniel Henrique Barboza
  2021-04-20  3:57       ` Michael Ellerman
  0 siblings, 1 reply; 10+ messages in thread
From: Daniel Henrique Barboza @ 2021-04-19 13:13 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev; +Cc: david



On 4/19/21 9:48 AM, Michael Ellerman wrote:
> Daniel Henrique Barboza <danielhb413@gmail.com> writes:
>> The RTAS set-indicator call, when attempting to UNISOLATE a DRC that is
>> already UNISOLATED or CONFIGURED, returns RTAS_OK and does nothing else
>> for both QEMU and phyp. This gives us an opportunity to use this
>> behavior to signal the hypervisor layer when an error during device
>> removal happens, allowing it to do a proper error handling, while not
>> breaking QEMU/phyp implementations that don't have this support.
>>
>> This patch introduces this idea by unisolating all CPU DRCs that failed
>> to be removed by dlpar_cpu_remove_by_index(), when handling the
>> PSERIES_HP_ELOG_ID_DRC_INDEX event. This is being done for this event
>> only because its the only CPU removal event QEMU uses, and there's no
>> need at this moment to add this mechanism for phyp only code.
> 
> Have you also confirmed that phyp is not bothered by it? ie. everything
> seems to continue working when you trigger this path on phyp.

Yes. Daniel Bueso (dbuesom@us.ibm.com) from the partition firmware team
helped me with that. We confirmed that phyp returns RTAS_OK under these
conditions (Unisolating an unisolated/configured DRC).


Thanks,


DHB

> 
> cheers
> 
>> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
>> index 12cbffd3c2e3..ed66895c2f51 100644
>> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
>> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
>> @@ -802,8 +802,15 @@ int dlpar_cpu(struct pseries_hp_errorlog *hp_elog)
>>   	case PSERIES_HP_ELOG_ACTION_REMOVE:
>>   		if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_COUNT)
>>   			rc = dlpar_cpu_remove_by_count(count);
>> -		else if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_INDEX)
>> +		else if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_INDEX) {
>>   			rc = dlpar_cpu_remove_by_index(drc_index);
>> +			/* Setting the isolation state of an UNISOLATED/CONFIGURED
>> +			 * device to UNISOLATE is a no-op, but the hypervison can
>> +			 * use it as a hint that the cpu removal failed.
>> +			 */
>> +			if (rc)
>> +				dlpar_unisolate_drc(drc_index);
>> +		}
>>   		else
>>   			rc = -EINVAL;
>>   		break;
>> -- 
>> 2.30.2

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] hotplug-cpu.c: set UNISOLATE on dlpar_cpu_remove() failure
  2021-04-19 13:13     ` Daniel Henrique Barboza
@ 2021-04-20  3:57       ` Michael Ellerman
  0 siblings, 0 replies; 10+ messages in thread
From: Michael Ellerman @ 2021-04-20  3:57 UTC (permalink / raw)
  To: Daniel Henrique Barboza, linuxppc-dev; +Cc: david

Daniel Henrique Barboza <danielhb413@gmail.com> writes:
> On 4/19/21 9:48 AM, Michael Ellerman wrote:
>> Daniel Henrique Barboza <danielhb413@gmail.com> writes:
>>> The RTAS set-indicator call, when attempting to UNISOLATE a DRC that is
>>> already UNISOLATED or CONFIGURED, returns RTAS_OK and does nothing else
>>> for both QEMU and phyp. This gives us an opportunity to use this
>>> behavior to signal the hypervisor layer when an error during device
>>> removal happens, allowing it to do a proper error handling, while not
>>> breaking QEMU/phyp implementations that don't have this support.
>>>
>>> This patch introduces this idea by unisolating all CPU DRCs that failed
>>> to be removed by dlpar_cpu_remove_by_index(), when handling the
>>> PSERIES_HP_ELOG_ID_DRC_INDEX event. This is being done for this event
>>> only because its the only CPU removal event QEMU uses, and there's no
>>> need at this moment to add this mechanism for phyp only code.
>> 
>> Have you also confirmed that phyp is not bothered by it? ie. everything
>> seems to continue working when you trigger this path on phyp.
>
> Yes. Daniel Bueso (dbuesom@us.ibm.com) from the partition firmware team
> helped me with that. We confirmed that phyp returns RTAS_OK under these
> conditions (Unisolating an unisolated/configured DRC).

Thanks.

cheers

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/2] pseries: UNISOLATE DRCs to signal device removal error
  2021-04-16 21:02 [PATCH 0/2] pseries: UNISOLATE DRCs to signal device removal error Daniel Henrique Barboza
                   ` (2 preceding siblings ...)
  2021-04-19  3:59 ` [PATCH 0/2] pseries: UNISOLATE DRCs to signal device removal error David Gibson
@ 2021-04-21 13:08 ` Michael Ellerman
  3 siblings, 0 replies; 10+ messages in thread
From: Michael Ellerman @ 2021-04-21 13:08 UTC (permalink / raw)
  To: Daniel Henrique Barboza, linuxppc-dev; +Cc: david

On Fri, 16 Apr 2021 18:02:14 -0300, Daniel Henrique Barboza wrote:
> At this moment, PAPR [1] does not have a way to report errors during a device
> removal operation. This puts a strain in the hypervisor, which needs extra
> mechanisms to try to fallback and recover from an error that might have
> happened during the removal. The QEMU community has dealt with it during these
> years by either trying to preempt the error before sending the HP event or, in
> case of a guest side failure, reboot the guest to complete the removal process.
> 
> [...]

Applied to powerpc/next.

[1/2] dlpar.c: introduce dlpar_unisolate_drc()
      https://git.kernel.org/powerpc/c/0e3b3ff83ce24a7a01e467ca42e3e33e87195c0d
[2/2] hotplug-cpu.c: set UNISOLATE on dlpar_cpu_remove() failure
      https://git.kernel.org/powerpc/c/29c9a2699e71a7866a98ebdf6ea38135d31b4e1f

cheers

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2021-04-21 13:11 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-16 21:02 [PATCH 0/2] pseries: UNISOLATE DRCs to signal device removal error Daniel Henrique Barboza
2021-04-16 21:02 ` [PATCH 1/2] dlpar.c: introduce dlpar_unisolate_drc() Daniel Henrique Barboza
2021-04-19  4:01   ` David Gibson
2021-04-16 21:02 ` [PATCH 2/2] hotplug-cpu.c: set UNISOLATE on dlpar_cpu_remove() failure Daniel Henrique Barboza
2021-04-19  4:02   ` David Gibson
2021-04-19 12:48   ` Michael Ellerman
2021-04-19 13:13     ` Daniel Henrique Barboza
2021-04-20  3:57       ` Michael Ellerman
2021-04-19  3:59 ` [PATCH 0/2] pseries: UNISOLATE DRCs to signal device removal error David Gibson
2021-04-21 13:08 ` Michael Ellerman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).