QEMU-Devel Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v2 0/1] pSeries: handle hotunplug errors in drc_unisolate_logical()
@ 2021-04-20 16:50 Daniel Henrique Barboza
  2021-04-20 16:51 ` [PATCH v2 1/1] spapr_drc.c: " Daniel Henrique Barboza
  0 siblings, 1 reply; 3+ messages in thread
From: Daniel Henrique Barboza @ 2021-04-20 16:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, groug, david

Changes from v1:
- added more context in the commit message
- added David's R-b
v1 link: https://lists.gnu.org/archive/html/qemu-devel/2021-04/msg03145.html 

Hi,

This is the QEMU side of a kernel change being proposed in [1],
where an attempt to implement a CPU hotunplug error report
mechanism was proposed.

The idea was discussed first in this mailing list [2], where the
RTAS set-indicator call would be used to signal QEMU when a kernel
side error happens during the unplug process.

Using the modified kernel and this patch, this is the result of a
failed CPU hotunplug attempt when trying to unplug the last online
CPU of the guest:

( QEMU command line: qemu-system-ppc64 -machine pseries,accel=kvm,usb=off
-smp 1,maxcpus=2,threads=1,cores=2,sockets=1 ... )

[root@localhost ~]# QEMU 5.2.92 monitor - type 'help' for more information
(qemu) device_add host-spapr-cpu-core,core-id=1,id=core1
(qemu) 

[root@localhost ~]# echo 0 > /sys/devices/system/cpu/cpu0/online
[   77.548442][   T13] IRQ 19: no longer affine to CPU0
[   77.548452][   T13] IRQ 20: no longer affine to CPU0
[   77.548458][   T13] IRQ 256: no longer affine to CPU0
[   77.548465][   T13] IRQ 258: no longer affine to CPU0
[   77.548472][   T13] IRQ 259: no longer affine to CPU0
[   77.548479][   T13] IRQ 260: no longer affine to CPU0
[   77.548485][   T13] IRQ 261: no longer affine to CPU0
[   77.548590][    T0] cpu 0 (hwid 0) Ready to die...
[root@localhost ~]# (qemu) 
(qemu) device_del core1
(qemu) [   83.214073][  T100] pseries-hotplug-cpu: Failed to offline CPU PowerPC,POWER9, rc: -16
qemu-system-ppc64: Device hotunplug rejected by the guest for device core1

(qemu) 


As mentioned in the kernel change, if this is accepted I'll push
for a PAPR change to make this an official device removal error
report mechanism.


[1] https://lore.kernel.org/linuxppc-dev/20210416210216.380291-3-danielhb413@gmail.com/
[2] https://lists.gnu.org/archive/html/qemu-devel/2021-02/msg06395.html

Daniel Henrique Barboza (1):
  spapr_drc.c: handle hotunplug errors in drc_unisolate_logical()

 hw/ppc/spapr_drc.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

-- 
2.30.2



^ permalink raw reply	[flat|nested] 3+ messages in thread

* [PATCH v2 1/1] spapr_drc.c: handle hotunplug errors in drc_unisolate_logical()
  2021-04-20 16:50 [PATCH v2 0/1] pSeries: handle hotunplug errors in drc_unisolate_logical() Daniel Henrique Barboza
@ 2021-04-20 16:51 ` Daniel Henrique Barboza
  2021-04-21  4:09   ` David Gibson
  0 siblings, 1 reply; 3+ messages in thread
From: Daniel Henrique Barboza @ 2021-04-20 16:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel Henrique Barboza, qemu-ppc, groug, david

At this moment, PAPR does not provide a way to report errors during a
device removal operation. This led the pSeries machine to implement
extra mechanisms to try to fallback and recover from an error that might
have happened during the hotunplug in the guest side. This started to
change a bit with commit fe1831eff8a4 ("spapr_drc.c: use DRC
reconfiguration to cleanup DIMM unplug state"), where one way to
fallback from a memory removal error was introduced.

Around the same time, in [1], the idea of using RTAS set-indicator for
this role was first introduced. The RTAS set-indicator call, when
attempting to UNISOLATE a DRC that is already UNISOLATED or CONFIGURED,
returns RTAS_OK and does nothing else for both QEMU and phyp. This gives
us an opportunity to use this behavior to signal the hypervisor layer
when a device removal errir happens, allowing QEMU/phyp to do a proper
error handling. Using set-indicator to report HP errors isn't strange to
PAPR, as per R1-13.5.3.4-4. of table 13.7 of current PAPR [2]:

"For all DR options: If this is a DR operation that involves the user
insert- ing a DR entity, then if the firmware can determine that the
inserted entity would cause a system disturbance, then the set-indicator
RTAS call must not unisolate the entity and must return an error status
which is unique to the particular error."

A change was proposed to the pSeries Linux kernel to call set-indicator
to move a DRC to 'unisolate' in the case of a hotunplug error in the
guest side [3]. Setting a DRC that is already unisolated or configured to
'unisolate' is a no-op (returns RTAS_OK) for QEMU and also for phyp.
Being a benign change for hypervisors that doesn't care about handling
such errors, we expect the kernel to accept this change at some point.

This patch prepares the pSeries machine for this new kernel feature by
changing drc_unisolate_logical() to handle guest side hotunplug errors.
For CPUs it's a simple matter of setting drc->unplug_requested to 'false',
while for LMBs the process is similar to the rollback that is done in
rtas_ibm_configure_connector().

[1] https://lists.gnu.org/archive/html/qemu-devel/2021-02/msg06395.html
[2] https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf
[3] https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210416210216.380291-3-danielhb413@gmail.com/

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr_drc.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/hw/ppc/spapr_drc.c b/hw/ppc/spapr_drc.c
index 9e16505fa1..6918e0c9d1 100644
--- a/hw/ppc/spapr_drc.c
+++ b/hw/ppc/spapr_drc.c
@@ -151,9 +151,32 @@ static uint32_t drc_isolate_logical(SpaprDrc *drc)
 
 static uint32_t drc_unisolate_logical(SpaprDrc *drc)
 {
+    SpaprMachineState *spapr = NULL;
+
     switch (drc->state) {
     case SPAPR_DRC_STATE_LOGICAL_UNISOLATE:
     case SPAPR_DRC_STATE_LOGICAL_CONFIGURED:
+        /*
+         * Unisolating a logical DRC that was marked for unplug
+         * means that the kernel is refusing the removal.
+         */
+        if (drc->unplug_requested && drc->dev) {
+            if (spapr_drc_type(drc) == SPAPR_DR_CONNECTOR_TYPE_LMB) {
+                spapr = SPAPR_MACHINE(qdev_get_machine());
+
+                spapr_memory_unplug_rollback(spapr, drc->dev);
+            }
+
+            drc->unplug_requested = false;
+            error_report("Device hotunplug rejected by the guest "
+                         "for device %s", drc->dev->id);
+
+            /*
+             * TODO: send a QAPI DEVICE_UNPLUG_ERROR event when
+             * it is implemented.
+             */
+        }
+
         return RTAS_OUT_SUCCESS; /* Nothing to do */
     case SPAPR_DRC_STATE_LOGICAL_AVAILABLE:
         break; /* see below */
-- 
2.30.2



^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH v2 1/1] spapr_drc.c: handle hotunplug errors in drc_unisolate_logical()
  2021-04-20 16:51 ` [PATCH v2 1/1] spapr_drc.c: " Daniel Henrique Barboza
@ 2021-04-21  4:09   ` David Gibson
  0 siblings, 0 replies; 3+ messages in thread
From: David Gibson @ 2021-04-21  4:09 UTC (permalink / raw)
  To: Daniel Henrique Barboza; +Cc: qemu-ppc, qemu-devel, groug


[-- Attachment #1: Type: text/plain, Size: 4411 bytes --]

On Tue, Apr 20, 2021 at 01:51:00PM -0300, Daniel Henrique Barboza wrote:
> At this moment, PAPR does not provide a way to report errors during a
> device removal operation. This led the pSeries machine to implement
> extra mechanisms to try to fallback and recover from an error that might
> have happened during the hotunplug in the guest side. This started to
> change a bit with commit fe1831eff8a4 ("spapr_drc.c: use DRC
> reconfiguration to cleanup DIMM unplug state"), where one way to
> fallback from a memory removal error was introduced.
> 
> Around the same time, in [1], the idea of using RTAS set-indicator for
> this role was first introduced. The RTAS set-indicator call, when
> attempting to UNISOLATE a DRC that is already UNISOLATED or CONFIGURED,
> returns RTAS_OK and does nothing else for both QEMU and phyp. This gives
> us an opportunity to use this behavior to signal the hypervisor layer
> when a device removal errir happens, allowing QEMU/phyp to do a proper
> error handling. Using set-indicator to report HP errors isn't strange to
> PAPR, as per R1-13.5.3.4-4. of table 13.7 of current PAPR [2]:
> 
> "For all DR options: If this is a DR operation that involves the user
> insert- ing a DR entity, then if the firmware can determine that the
> inserted entity would cause a system disturbance, then the set-indicator
> RTAS call must not unisolate the entity and must return an error status
> which is unique to the particular error."
> 
> A change was proposed to the pSeries Linux kernel to call set-indicator
> to move a DRC to 'unisolate' in the case of a hotunplug error in the
> guest side [3]. Setting a DRC that is already unisolated or configured to
> 'unisolate' is a no-op (returns RTAS_OK) for QEMU and also for phyp.
> Being a benign change for hypervisors that doesn't care about handling
> such errors, we expect the kernel to accept this change at some point.
> 
> This patch prepares the pSeries machine for this new kernel feature by
> changing drc_unisolate_logical() to handle guest side hotunplug errors.
> For CPUs it's a simple matter of setting drc->unplug_requested to 'false',
> while for LMBs the process is similar to the rollback that is done in
> rtas_ibm_configure_connector().
> 
> [1] https://lists.gnu.org/archive/html/qemu-devel/2021-02/msg06395.html
> [2] https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf
> [3] https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210416210216.380291-3-danielhb413@gmail.com/
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>

Applied to ppc-for-6.1, thanks.

> ---
>  hw/ppc/spapr_drc.c | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/hw/ppc/spapr_drc.c b/hw/ppc/spapr_drc.c
> index 9e16505fa1..6918e0c9d1 100644
> --- a/hw/ppc/spapr_drc.c
> +++ b/hw/ppc/spapr_drc.c
> @@ -151,9 +151,32 @@ static uint32_t drc_isolate_logical(SpaprDrc *drc)
>  
>  static uint32_t drc_unisolate_logical(SpaprDrc *drc)
>  {
> +    SpaprMachineState *spapr = NULL;
> +
>      switch (drc->state) {
>      case SPAPR_DRC_STATE_LOGICAL_UNISOLATE:
>      case SPAPR_DRC_STATE_LOGICAL_CONFIGURED:
> +        /*
> +         * Unisolating a logical DRC that was marked for unplug
> +         * means that the kernel is refusing the removal.
> +         */
> +        if (drc->unplug_requested && drc->dev) {
> +            if (spapr_drc_type(drc) == SPAPR_DR_CONNECTOR_TYPE_LMB) {
> +                spapr = SPAPR_MACHINE(qdev_get_machine());
> +
> +                spapr_memory_unplug_rollback(spapr, drc->dev);
> +            }
> +
> +            drc->unplug_requested = false;
> +            error_report("Device hotunplug rejected by the guest "
> +                         "for device %s", drc->dev->id);
> +
> +            /*
> +             * TODO: send a QAPI DEVICE_UNPLUG_ERROR event when
> +             * it is implemented.
> +             */
> +        }
> +
>          return RTAS_OUT_SUCCESS; /* Nothing to do */
>      case SPAPR_DRC_STATE_LOGICAL_AVAILABLE:
>          break; /* see below */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, back to index

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-20 16:50 [PATCH v2 0/1] pSeries: handle hotunplug errors in drc_unisolate_logical() Daniel Henrique Barboza
2021-04-20 16:51 ` [PATCH v2 1/1] spapr_drc.c: " Daniel Henrique Barboza
2021-04-21  4:09   ` David Gibson

QEMU-Devel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/qemu-devel/0 qemu-devel/git/0.git
	git clone --mirror https://lore.kernel.org/qemu-devel/1 qemu-devel/git/1.git
	git clone --mirror https://lore.kernel.org/qemu-devel/2 qemu-devel/git/2.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 qemu-devel qemu-devel/ https://lore.kernel.org/qemu-devel \
		qemu-devel@nongnu.org
	public-inbox-index qemu-devel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.nongnu.qemu-devel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git